
GPTQ Math Derivation

Posted on: September 9, 2023 at 12:00 AM

GPTQ is a one-shot weight quantization method that leverages approximate second-order information to achieve both high accuracy and efficiency.

In this blog post, we will begin with Optimal Brain Damage (OBD) [1], the foundational concept behind GPTQ, and progressively derive the mathematical theory of GPTQ.

For convenience, let's define $\mathbf{w}$ as a column vector:

$$
\mathbf{w} = (w_1, w_2, \cdots, w_n)^{\mathrm{T}}
$$

Next, $\mathbf{g}$ represents the gradient vector, with elements defined as:

$$
\mathbf{g}_i = \frac{\partial E}{\partial w_i}
$$

Lastly, $\mathbf{H}$ is the Hessian matrix, given by:

$$
\mathbf{H}_{i,j} = \frac{\partial^2 E}{\partial w_i \partial w_j}
$$


Optimal Brain Damage (OBD) [1]

This method employs second-order derivatives to prune weights while minimizing the impact on the network's output. The underlying idea is to build a local second-order approximation of the objective function around the current weights and use it to estimate how small disturbances to the weights change the loss.

To analyze a small disturbance $\Delta \mathbf{w}$ of the weights, we apply a Taylor expansion:

$$
E(\mathbf{w}+\Delta \mathbf{w})=E(\mathbf{w})+\mathbf{g}^{\mathrm{T}} \Delta \mathbf{w}+\frac{1}{2} \Delta \mathbf{w}^{\mathrm{T}} \mathbf{H} \Delta \mathbf{w}+O\left(\|\Delta \mathbf{w}\|^3\right)
$$

The change in loss due to the disturbance in $\mathbf{w}$ is:

$$
\Delta E=E(\mathbf{w}+\Delta \mathbf{w})-E(\mathbf{w})=\mathbf{g}^{\mathrm{T}} \Delta \mathbf{w}+\frac{1}{2} \Delta \mathbf{w}^{\mathrm{T}} \mathbf{H} \Delta \mathbf{w}+O\left(\|\Delta \mathbf{w}\|^3\right)
$$
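As a quick sanity check, here is a minimal NumPy sketch that compares the true change in a toy loss with its second-order approximation. Everything in it (the quartic loss, `A`, `b`, and the step size) is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
A = A @ A.T + np.eye(4)            # symmetric positive-definite quadratic term
b = rng.normal(size=4)

# A toy loss with a small quartic term, so the expansion is approximate, not exact.
def loss(w):
    return 0.5 * w @ A @ w + b @ w + 0.01 * np.sum(w**4)

def grad(w):                       # g_i = dE/dw_i
    return A @ w + b + 0.04 * w**3

def hessian(w):                    # H_ij = d^2 E / (dw_i dw_j)
    return A + np.diag(0.12 * w**2)

w = rng.normal(size=4)
dw = 1e-2 * rng.normal(size=4)     # small disturbance Delta w

true_delta = loss(w + dw) - loss(w)
approx_delta = grad(w) @ dw + 0.5 * dw @ hessian(w) @ dw
print(true_delta, approx_delta)    # the two values agree up to O(||dw||^3)
```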

In the OBD paper, this change in loss is written element-wise as:

$$
\delta E=\sum_i g_i \delta u_i+\frac{1}{2} \sum_i h_{i i} \delta u_i^2+\frac{1}{2} \sum_{i \neq j} h_{i j} \delta u_i \delta u_j+O\left(\|\delta U\|^3\right)
$$

To simplify, OBD assumes that each parameter's contribution to the loss $E$ is independent, effectively ignoring the mutual influence between parameters. This assumption is not strictly accurate but serves as an approximation.

Under this diagonal assumption, all off-diagonal elements of the Hessian matrix are zero, allowing us to simplify the equation to:

$$
\delta E=\sum_i g_i \delta u_i+\frac{1}{2} \sum_i h_{i i} \delta u_i^2+O\left(\|\delta U\|^3\right)
$$

When the model has converged, the gradients are essentially zero ($g_i \approx 0$). Ignoring the third-order terms as well, the equation further simplifies to:

$$
\delta E=\frac{1}{2} \sum_i h_{i i} \delta u_i^2
$$

This lets us use second-order derivatives to assess the impact of weight disturbances on the loss. At a converged local minimum, the loss is locally convex and the second derivatives are positive, so any weight disturbance can only introduce additional error.

The algorithm proceeds as follows:

  1. Build the neural network.
  2. Train the model until convergence.
  3. Compute the Hessian matrix for each weight.
  4. Calculate the saliency of each weight: pruning weight $w_i$ corresponds to $\delta u_i = -w_i$, so its saliency is $s_i = \frac{1}{2} h_{ii} w_i^2$.
  5. Sort the weights by saliency, set those with low saliency to zero, and freeze them.
  6. Repeat from step 2.
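For illustration, here is a minimal sketch of steps 4 and 5 under the diagonal assumption. It assumes the Hessian diagonal `h_diag` has already been estimated (the OBD paper does this with a backpropagation-like procedure); the function name and pruning ratio are arbitrary:

```python
import numpy as np

def obd_prune(w, h_diag, prune_ratio=0.1):
    """One OBD pruning pass over a flat weight vector.

    w      : 1-D array of weights
    h_diag : diagonal Hessian entries, h_ii = d^2 E / d w_i^2
    Returns the pruned weights and a boolean mask of frozen positions.
    """
    saliency = 0.5 * h_diag * w**2            # s_i = 1/2 * h_ii * w_i^2
    k = int(len(w) * prune_ratio)
    prune_idx = np.argsort(saliency)[:k]      # lowest-saliency weights
    w_pruned = w.copy()
    w_pruned[prune_idx] = 0.0                 # delete (set to zero)
    frozen = np.zeros(len(w), dtype=bool)
    frozen[prune_idx] = True                  # freeze so they stay zero
    return w_pruned, frozen
```

After each such pass, the surviving weights would be retrained before pruning again, as in step 6.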

Optimal Brain Surgeon (OBS) [2]

While the OBD paper focuses solely on the diagonal elements of the Hessian matrix, the OBS method considers both the diagonal elements and the mutual influence between parameters. The change in loss $\delta E$ can be expressed as:

$$
\delta E=\underbrace{\left(\frac{\partial E}{\partial \mathbf{w}}\right)^{\mathrm{T}} \delta \mathbf{w}}_{\approx 0}+\frac{1}{2} \delta \mathbf{w}^{\mathrm{T}} \mathbf{H} \delta \mathbf{w}+\underbrace{O\left(\|\delta \mathbf{w}\|^3\right)}_{\approx 0}
$$

When the model has converged, we can ignore the gradient and third-order terms, simplifying the equation as follows:

$$
\delta E \approx \frac{1}{2} \delta \mathbf{w}^{\mathrm{T}} \mathbf{H} \delta \mathbf{w} \tag{1}
$$

Pruning the weight $w_q$ can be expressed as:

$$
\delta w_q + w_q = 0 \quad \text{or} \quad e_q^{\mathrm{T}} \delta \mathbf{w} + w_q = 0
$$

Here, $e_q$ is a unit vector whose $q$-th element is one and all other elements are zero. Setting the $q$-th component of $\delta \mathbf{w}$ to $-w_q$ prunes the weight, while the other components of $\delta \mathbf{w}$ can be adjusted to compensate for the error introduced by pruning the $q$-th component of $\mathbf{w}$.

This is a constrained optimization problem that can be solved using the Lagrange multiplier method:

$$
L = \frac{1}{2} \delta \mathbf{w}^{\mathrm{T}} \mathbf{H} \delta \mathbf{w} + \lambda \left(e_q^{\mathrm{T}} \delta \mathbf{w} + w_q\right)
$$

Taking the derivative of $L$ with respect to $\delta \mathbf{w}$ and setting it to zero, we get:

$$
\frac{\partial L}{\partial (\delta \mathbf{w})} = \mathbf{H} \delta \mathbf{w} + \lambda e_q = 0 \tag{2}
$$

From this, $\delta \mathbf{w} = -\lambda \mathbf{H}^{-1} e_q$. Substituting it into the constraint equation, we get:

$$
\lambda = \frac{w_q}{[\mathbf{H}^{-1}]_{qq}}
$$

Here, $[\mathbf{H}^{-1}]_{qq}$ denotes the $q$-th diagonal element of the inverse Hessian matrix, since the following identity holds:

$$
e_q^{\mathrm{T}} \mathbf{H}^{-1} e_q = [\mathbf{H}^{-1}]_{qq}
$$

Substituting $\lambda$ back into Equation (2), we get:

$$
\delta \mathbf{w} = - \frac{w_q}{[\mathbf{H}^{-1}]_{qq}} \mathbf{H}^{-1} e_q
$$

Substituting $\delta \mathbf{w}$ into Equation (1), the minimum error introduced by pruning $w_q$ is:

$$
\delta E \approx \frac{1}{2} \frac{w_q^2}{[\mathbf{H}^{-1}]_{qq}}
$$

This allows us to compute $\delta \mathbf{w}$ to adjust the remaining weights and minimize the error introduced by pruning $w_q$.

The algorithm proceeds as follows:

  1. Train the model until convergence.
  2. Compute the inverse Hessian matrix $\mathbf{H}^{-1}$.
  3. Iterate over all parameters and find the $w_q$ that minimizes $\delta E$.
  4. Prune $w_q$ and adjust the remaining weights using $\delta \mathbf{w}$.
  5. Repeat from step 2.
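A minimal NumPy sketch of one OBS pruning step (steps 3 and 4 above), assuming the inverse Hessian `H_inv` has already been computed; the names are mine, not from the paper:

```python
import numpy as np

def obs_prune_step(w, H_inv):
    """Prune the single weight that introduces the least error, OBS-style.

    w     : 1-D array of weights
    H_inv : inverse Hessian, shape (d, d)
    Returns the adjusted weights and the pruned index q.
    """
    scores = 0.5 * w**2 / np.diag(H_inv)             # delta E for each candidate
    q = int(np.argmin(scores))                       # weight with least added error
    delta_w = -(w[q] / H_inv[q, q]) * H_inv[:, q]    # compensating update
    w_new = w + delta_w                              # w_new[q] becomes ~0
    w_new[q] = 0.0                                   # enforce exact zero
    return w_new, q
```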

The following example highlights the advantage of OBS over OBD and simple magnitude pruning:

For the weights $\mathbf{w}^*$ at the loss minimum:

  1. If we prune weights based on magnitude, $w_2$ would be pruned and set to zero, making it impossible to approach the minimum by adjusting $w_1$.
  2. Using OBS/OBD, we can compute $\delta E$ for both $w_1$ and $w_2$ and choose the one that minimizes $\delta E$, allowing us to correctly prune $w_1$.
  3. Furthermore, OBS allows us to adjust $w_2$ to reduce the error introduced.

Optimal Brain Compression (OBC) [3]

While OBS offers a more mathematically complete approach, its engineering efficiency poses a challenge.

Storing the Hessian matrix for OBS requires $\text{d} \times \text{d}$ space, where $\text{d} = \text{d}_{\text{row}} \cdot \text{d}_{\text{col}}$.

Moreover, each update step has a time complexity of $O(\text{d}^3)$, leading to a total time complexity of $O(\text{d}^4)$. This is prohibitively slow for modern neural networks.

To address this, the OBC paper reformulates the objective function as:

$$
E = \sum_{i=1}^{\text{d}_{\text{row}}} \left\| \mathbf{W}_{i,:} \mathbf{X} - \hat{\mathbf{W}}_{i,:} \mathbf{X} \right\|^2
$$

With this formulation, pruning can be carried out independently for each row. If $\text{k}$ parameters are pruned in a row of length $\text{d}_{\text{col}}$, the time complexity per row is reduced to $O(\text{k} \cdot \text{d}_{\text{col}}^2)$.
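Concretely, for this row-wise objective the Hessian of every row is the same matrix, $\mathbf{H} = 2\mathbf{X}\mathbf{X}^{\mathrm{T}}$, which is only $\text{d}_{\text{col}} \times \text{d}_{\text{col}}$. A minimal sketch, assuming calibration inputs `X` of shape `(d_col, n_samples)` are available; the dampening fraction is an arbitrary choice for numerical stability:

```python
import numpy as np

def layerwise_hessian(X, damp=0.01):
    """Hessian of the row-wise squared error, shared by all rows.

    X : calibration inputs for this layer, shape (d_col, n_samples)
    """
    H = 2.0 * X @ X.T                                     # (d_col, d_col)
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])  # dampen the diagonal
    return H
```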

Optimal Brain Quantization (OBQ) [3]

Besides pruning, the OBC paper also proposes a quantization method named OBQ, as pruning can be considered a special form of quantization (quantizing parameters to zero).

The OBQ method can be derived by modifying the OBC constraints as follows:

$$
e_q^{\mathrm{T}} \delta \mathbf{w} + w_q = \text{quant}(w_q)
$$

By substituting $w_q - \text{quant}(w_q)$ for $w_q$, the quantization equations can be computed as:

$$
\delta \mathbf{w} = - \frac{w_q - \text{quant}(w_q)}{[\mathbf{H}^{-1}]_{qq}} \mathbf{H}^{-1} e_q
$$

$$
\delta E \approx \frac{1}{2} \frac{(w_q - \text{quant}(w_q))^2}{[\mathbf{H}^{-1}]_{qq}}
$$

As in OBC, quantization can be carried out independently for each row. The overall time complexity is $O(\text{d}_{\text{row}} \cdot \text{d}_{\text{col}}^3)$, since all parameters need to be quantized.
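A rough sketch of this greedy per-row loop. The round-to-nearest `quant` below is a placeholder (a real implementation would use a calibrated scale and zero-point per row or group), and removing a quantized weight from $\mathbf{H}^{-1}$ uses the usual Gaussian-elimination update:

```python
import numpy as np

def quant(x, scale=0.1):
    # Placeholder round-to-nearest quantizer.
    return np.round(x / scale) * scale

def obq_quantize_row(w, H_inv):
    """Greedily quantize one row of the weight matrix, OBQ-style.

    w     : 1-D array, one row of W
    H_inv : inverse Hessian for this row, shape (d_col, d_col)
    """
    w = w.astype(float)
    H_inv = H_inv.copy()
    remaining = list(range(len(w)))
    w_q = np.zeros_like(w)
    while remaining:
        # Pick the remaining weight whose quantization adds the least error.
        diag = np.array([H_inv[i, i] for i in remaining])
        errs = (w[remaining] - quant(w[remaining])) ** 2 / diag
        q = remaining[int(np.argmin(errs))]
        w_q[q] = quant(w[q])
        # Compensate the other weights for the rounding error on w[q].
        w += -((w[q] - w_q[q]) / H_inv[q, q]) * H_inv[:, q]
        # Remove q from the problem (Gaussian elimination on H^{-1}).
        H_inv -= np.outer(H_inv[:, q], H_inv[q, :]) / H_inv[q, q]
        remaining.remove(q)
    return w_q
```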

GPTQ

Although OBQ has improved computational efficiency, it’s still not suitable for Large Language Models (LLMs). GPTQ comes to the rescue, further boosting efficiency with the following main optimizations:

Use an arbitrary order instead of a greedy one

The OBQ method always picks the weight that currently incurs the least additional error. However, the GPTQ paper finds that the improvement of this greedy order over quantizing weights in an arbitrary order is small.
The original OBQ method quantizes each row of $\mathbf{W}$ independently, in a row-specific order determined by the corresponding errors. In GPTQ, all rows are instead quantized in the same order. As a consequence, an entire column can be quantized at a time.

In this way, the runtime is reduced from $O(\text{d}_{\text{row}} \cdot \text{d}_{\text{col}}^3)$ to $O(\max(\text{d}_{\text{row}} \cdot \text{d}_{\text{col}}^2, \text{d}_{\text{col}}^3))$.

Lazy Batch-Updates

In OBQ, quantizing a single element requires updating all remaining elements using the following equation:

$$
\mathbf{w} \leftarrow \mathbf{w} - \frac{w_q - \text{quant}(w_q)}{[\mathbf{H}^{-1}]_{qq}} \mathbf{H}^{-1}_{:, q}
$$

This leads to a relatively low compute-to-memory-access ratio. For example, if the weight matrix is $\text{k} \times \text{k}$ and quantization is done column by column, the total memory-access volume is $\text{k}^3$.

Fortunately, this problem can be resolved by the following observation: the final rounding decisions for column $i$ are only affected by updates performed on this very column, and so updates to later columns are irrelevant at this point in the process.

This makes it possible to "lazily batch" the updates to later columns and apply them together, achieving much better GPU utilization.

Cholesky Reformulation

Computing the inverse Hessian $\mathbf{H}^{-1}_F$ repeatedly suffers from numerical stability problems when scaling GPTQ to existing large models.

Besides applying dampening, i.e., adding a small constant $\epsilon$ to the diagonal elements, GPTQ leverages state-of-the-art Cholesky kernels to compute all the information it needs from $\mathbf{H}^{-1}$ upfront.

GPTQ proceeds as follows:

  1. Compute $\mathbf{H}^{-1}$ upfront (via its Cholesky decomposition).
  2. Split the columns into multiple blocks.
  3. Quantize the current block column by column, updating the remaining columns inside the block after each step.
  4. After finishing the block, apply one batched update to the not-yet-quantized elements outside the block.
  5. Repeat from step 3 if there are more blocks.
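Putting the pieces together, here is a simplified NumPy sketch that follows this structure. It is illustrative rather than a faithful reproduction of the reference implementation; the `quant` placeholder, block size, and dampening fraction are arbitrary choices:

```python
import numpy as np

def quant(x, scale=0.1):
    # Placeholder round-to-nearest quantizer.
    return np.round(x / scale) * scale

def gptq_quantize(W, X, blocksize=128, damp=0.01):
    """Quantize W (d_row x d_col) column by column, GPTQ-style sketch.

    X : calibration inputs for this layer, shape (d_col, n_samples)
    """
    W = W.astype(float)
    d_row, d_col = W.shape
    Q = np.zeros_like(W)

    # Step 1: dampened Hessian, then the upper Cholesky factor of H^{-1}.
    H = 2.0 * X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(d_col)
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T     # upper triangular

    # Steps 2-5: process columns in blocks, lazily batching cross-block updates.
    for i1 in range(0, d_col, blocksize):
        i2 = min(i1 + blocksize, d_col)
        W1 = W[:, i1:i2].copy()
        Err = np.zeros_like(W1)
        Hinv1 = Hinv[i1:i2, i1:i2]

        for i in range(i2 - i1):                      # quantize within the block
            w = W1[:, i]
            q = quant(w)
            Q[:, i1 + i] = q
            err = (w - q) / Hinv1[i, i]
            W1[:, i:] -= np.outer(err, Hinv1[i, i:])  # update rest of the block
            Err[:, i] = err

        W[:, i1:i2] = W1
        W[:, i2:] -= Err @ Hinv[i1:i2, i2:]           # one batched update
    return Q
```

A real implementation also needs per-row (or per-group) quantization scales and further numerical safeguards; the sketch above only mirrors the block-and-column structure described in this section.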

Summary

This blog post traces the development of GPTQ, starting from its roots in OBD, through OBS and OBC/OBQ, and finally to GPTQ itself. Unlike quantization methods based on purely statistical heuristics, GPTQ is grounded in a rigorous theoretical derivation, making it a more robust and reliable solution.

What’s Next?

This blog post has covered the theoretical underpinnings of GPTQ. In the upcoming post, we will get our hands dirty with the code implementation.

Stay tuned for a deep dive into the code that powers this fascinating approach to weight quantization!



Footnotes

  1. Optimal Brain Damage: https://proceedings.neurips.cc/paper/1989/hash/6c9882bbac1c7093bd25041881277658-Abstract.html

  2. Optimal Brain Surgeon: https://ieeexplore.ieee.org/document/298572

  3. Optimal Brain Compression: https://arxiv.org/abs/2208.11580

