
GPTQ Math Derivation

Posted on: September 9, 2023 at 12:00 AM

GPTQ is a one-shot weight quantization method that leverages approximate second-order information to achieve both high accuracy and efficiency.

In this blog post, we will begin with Optimal Brain Damage (OBD) [1], the foundational concept behind GPTQ, and progressively derive the mathematical theory of GPTQ.

For convenience, let's define $\mathbf{w}$ as a column vector:

$$
\mathbf{w} = (w_1, w_2, \cdots, w_n)^{\mathrm{T}}
$$

Next, $\mathbf{g}$ represents the gradient vector, with elements defined as:

$$
\mathbf{g}_i = \frac{\partial E}{\partial w_i}
$$

Lastly, $\mathbf{H}$ is the Hessian matrix, given by:

$$
\mathbf{H}_{i,j} = \frac{\partial^2 E}{\partial w_i \partial w_j}
$$


Optimal Brain Damage (OBD) [1]

This method employs second-order derivatives to prune weights while minimizing the impact on the network's output. The underlying idea is to build a local second-order approximation of the objective function around the current weights and use it to estimate how small disturbances to the weights change the loss.

To analyze a small disturbance $\Delta \mathbf{w}$ of the weights, we apply a Taylor expansion:

$$
E(\mathbf{w}+\Delta \mathbf{w})=E(\mathbf{w})+\mathbf{g}^{\mathrm{T}} \Delta \mathbf{w}+\frac{1}{2} \Delta \mathbf{w}^{\mathrm{T}} \mathbf{H} \Delta \mathbf{w}+O\left(\|\Delta \mathbf{w}\|^3\right)
$$

The change in loss due to the disturbance in $\mathbf{w}$ is:

$$
\Delta E=E(\mathbf{w}+\Delta \mathbf{w})-E(\mathbf{w})=\mathbf{g}^{\mathrm{T}} \Delta \mathbf{w}+\frac{1}{2} \Delta \mathbf{w}^{\mathrm{T}} \mathbf{H} \Delta \mathbf{w}+O\left(\|\Delta \mathbf{w}\|^3\right)
$$
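As a quick sanity check, here is a minimal NumPy sketch that compares the true change in a toy loss with its second-order approximation. Everything in it (the quartic loss, `A`, `b`, and the step size) is invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
A = A @ A.T + np.eye(4)            # symmetric positive-definite quadratic term
b = rng.normal(size=4)

# A toy loss with a small quartic term, so the expansion is approximate, not exact.
def loss(w):
    return 0.5 * w @ A @ w + b @ w + 0.01 * np.sum(w**4)

def grad(w):                       # g_i = dE/dw_i
    return A @ w + b + 0.04 * w**3

def hessian(w):                    # H_ij = d^2 E / (dw_i dw_j)
    return A + np.diag(0.12 * w**2)

w = rng.normal(size=4)
dw = 1e-2 * rng.normal(size=4)     # small disturbance Delta w

true_delta = loss(w + dw) - loss(w)
approx_delta = grad(w) @ dw + 0.5 * dw @ hessian(w) @ dw
print(true_delta, approx_delta)    # the two values agree up to O(||dw||^3)
```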

In the OBD paper, this change in loss is written element-wise as:

$$
\delta E=\sum_i g_i \delta u_i+\frac{1}{2} \sum_i h_{i i} \delta u_i^2+\frac{1}{2} \sum_{i \neq j} h_{i j} \delta u_i \delta u_j+O\left(\|\delta U\|^3\right)
$$

To simplify, OBD assumes that each parameter's contribution to the loss $E$ is independent, effectively ignoring the mutual influence between parameters. This assumption is not strictly accurate but serves as an approximation.

Under this diagonal assumption, all off-diagonal elements of the Hessian matrix are zero, allowing us to simplify the equation to:

$$
\delta E=\sum_i g_i \delta u_i+\frac{1}{2} \sum_i h_{i i} \delta u_i^2+O\left(\|\delta U\|^3\right)
$$

When the model has converged, the gradients are essentially zero ($g_i \approx 0$). Ignoring the third-order terms as well, the equation further simplifies to:

$$
\delta E=\frac{1}{2} \sum_i h_{i i} \delta u_i^2
$$

This lets us use second-order derivatives to assess the impact of weight disturbances on the loss. At a converged local minimum, the loss is locally convex and the second derivatives are positive, so any weight disturbance can only introduce additional error.

The algorithm proceeds as follows:

  1. Build the neural network.
  2. Train the model until convergence.
  3. Compute the Hessian matrix for each weight.
  4. Calculate the saliency of each weight: pruning weight $w_i$ corresponds to $\delta u_i = -w_i$, so its saliency is $s_i = \frac{1}{2} h_{ii} w_i^2$.
  5. Sort the weights by saliency, set those with low saliency to zero, and freeze them.
  6. Repeat from step 2.
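For illustration, here is a minimal sketch of steps 4 and 5 under the diagonal assumption. It assumes the Hessian diagonal `h_diag` has already been estimated (the OBD paper does this with a backpropagation-like procedure); the function name and pruning ratio are arbitrary:

```python
import numpy as np

def obd_prune(w, h_diag, prune_ratio=0.1):
    """One OBD pruning pass over a flat weight vector.

    w      : 1-D array of weights
    h_diag : diagonal Hessian entries, h_ii = d^2 E / d w_i^2
    Returns the pruned weights and a boolean mask of frozen positions.
    """
    saliency = 0.5 * h_diag * w**2            # s_i = 1/2 * h_ii * w_i^2
    k = int(len(w) * prune_ratio)
    prune_idx = np.argsort(saliency)[:k]      # lowest-saliency weights
    w_pruned = w.copy()
    w_pruned[prune_idx] = 0.0                 # delete (set to zero)
    frozen = np.zeros(len(w), dtype=bool)
    frozen[prune_idx] = True                  # freeze so they stay zero
    return w_pruned, frozen
```

After each such pass, the surviving weights would be retrained before pruning again, as in step 6.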

Optimal Brain Surgeon (OBS) [2]

While the OBD paper focuses solely on the diagonal elements of the Hessian matrix, the OBS method considers both the diagonal elements and the mutual influence between parameters. The change in loss $\delta E$ can be expressed as:

$$
\delta E=\underbrace{\left(\frac{\partial E}{\partial \mathbf{w}}\right)^{\mathrm{T}} \delta \mathbf{w}}_{\approx 0}+\frac{1}{2} \delta \mathbf{w}^{\mathrm{T}} \mathbf{H} \delta \mathbf{w}+\underbrace{O\left(\|\delta \mathbf{w}\|^3\right)}_{\approx 0}
$$

When the model has converged, we can ignore the gradient and third-order terms, simplifying the equation as follows:

$$
\delta E \approx \frac{1}{2} \delta \mathbf{w}^{\mathrm{T}} \mathbf{H} \delta \mathbf{w} \tag{1}
$$

Pruning the weight $w_q$ can be expressed as:

$$
\delta w_q + w_q = 0 \quad \text{or} \quad e_q^{\mathrm{T}} \delta \mathbf{w} + w_q = 0
$$

Here, $e_q$ is a unit vector whose $q$-th element is one and all other elements are zero. Setting the $q$-th component of $\delta \mathbf{w}$ to $-w_q$ prunes the weight, while the other components of $\delta \mathbf{w}$ can be adjusted to compensate for the error introduced by pruning the $q$-th component of $\mathbf{w}$.

This is a constrained optimization problem that can be solved using the Lagrange multiplier method:

$$
L = \frac{1}{2} \delta \mathbf{w}^{\mathrm{T}} \mathbf{H} \delta \mathbf{w} + \lambda \left(e_q^{\mathrm{T}} \delta \mathbf{w} + w_q\right)
$$

Taking the derivative of $L$ with respect to $\delta \mathbf{w}$ and setting it to zero, we get:

$$
\frac{\partial L}{\partial (\delta \mathbf{w})} = \mathbf{H} \delta \mathbf{w} + \lambda e_q = 0 \tag{2}
$$

From this, $\delta \mathbf{w} = -\lambda \mathbf{H}^{-1} e_q$. Substituting it into the constraint equation, we get:

$$
\lambda = \frac{w_q}{[\mathbf{H}^{-1}]_{qq}}
$$

Here, $[\mathbf{H}^{-1}]_{qq}$ denotes the $q$-th diagonal element of the inverse Hessian matrix, since the following identity holds:

$$
e_q^{\mathrm{T}} \mathbf{H}^{-1} e_q = [\mathbf{H}^{-1}]_{qq}
$$

Substituting $\lambda$ back into Equation (2), we get:

$$
\delta \mathbf{w} = - \frac{w_q}{[\mathbf{H}^{-1}]_{qq}} \mathbf{H}^{-1} e_q
$$

Substituting $\delta \mathbf{w}$ into Equation (1), the minimum error introduced by pruning $w_q$ is:

$$
\delta E \approx \frac{1}{2} \frac{w_q^2}{[\mathbf{H}^{-1}]_{qq}}
$$

This allows us to compute $\delta \mathbf{w}$ to adjust the remaining weights and minimize the error introduced by pruning $w_q$.

The algorithm proceeds as follows:

  1. Train the model until convergence.
  2. Compute the inverse Hessian matrix $\mathbf{H}^{-1}$.
  3. Iterate over all parameters and find the $w_q$ that minimizes $\delta E$.
  4. Prune $w_q$ and adjust the remaining weights using $\delta \mathbf{w}$.
  5. Repeat from step 2.
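A minimal NumPy sketch of one OBS pruning step (steps 3 and 4 above), assuming the inverse Hessian `H_inv` has already been computed; the names are mine, not from the paper:

```python
import numpy as np

def obs_prune_step(w, H_inv):
    """Prune the single weight that introduces the least error, OBS-style.

    w     : 1-D array of weights
    H_inv : inverse Hessian, shape (d, d)
    Returns the adjusted weights and the pruned index q.
    """
    scores = 0.5 * w**2 / np.diag(H_inv)             # delta E for each candidate
    q = int(np.argmin(scores))                       # weight with least added error
    delta_w = -(w[q] / H_inv[q, q]) * H_inv[:, q]    # compensating update
    w_new = w + delta_w                              # w_new[q] becomes ~0
    w_new[q] = 0.0                                   # enforce exact zero
    return w_new, q
```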

The following example highlights the advantage of OBS over OBD and simple magnitude pruning:

For the weights $\mathbf{w}^*$ at the loss minimum:

  1. If we prune weights based on magnitude, $w_2$ would be pruned and set to zero, making it impossible to approach the minimum by adjusting $w_1$.
  2. Using OBS/OBD, we can compute $\delta E$ for both $w_1$ and $w_2$ and choose the one that minimizes $\delta E$, allowing us to correctly prune $w_1$.
  3. Furthermore, OBS allows us to adjust $w_2$ to reduce the error introduced.

Optimal Brain Compression (OBC) [3]

While OBS offers a more mathematically complete approach, its engineering efficiency poses a challenge.

Storing the Hessian matrix for OBS requires $\text{d} \times \text{d}$ space, where $\text{d} = \text{d}_{\text{row}} \cdot \text{d}_{\text{col}}$.

Moreover, each update step has a time complexity of $O(\text{d}^3)$, leading to a total time complexity of $O(\text{d}^4)$. This is prohibitively slow for modern neural networks.

To address this, the OBC paper reformulates the objective function as:

$$
E = \sum_{i=1}^{\text{d}_{\text{row}}} \left\| \mathbf{W}_{i,:} \mathbf{X} - \hat{\mathbf{W}}_{i,:} \mathbf{X} \right\|^2
$$

With this formulation, pruning can be carried out independently for each row. If $\text{k}$ parameters are pruned in a row of length $\text{d}_{\text{col}}$, the time complexity per row is reduced to $O(\text{k} \cdot \text{d}_{\text{col}}^2)$.
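Concretely, for this row-wise objective the Hessian of every row is the same matrix, $\mathbf{H} = 2\mathbf{X}\mathbf{X}^{\mathrm{T}}$, which is only $\text{d}_{\text{col}} \times \text{d}_{\text{col}}$. A minimal sketch, assuming calibration inputs `X` of shape `(d_col, n_samples)` are available; the dampening fraction is an arbitrary choice for numerical stability:

```python
import numpy as np

def layerwise_hessian(X, damp=0.01):
    """Hessian of the row-wise squared error, shared by all rows.

    X : calibration inputs for this layer, shape (d_col, n_samples)
    """
    H = 2.0 * X @ X.T                                     # (d_col, d_col)
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])  # dampen the diagonal
    return H
```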

Optimal Brain Quantization (OBQ) [3]

Besides pruning, the OBC paper also proposes a quantization method named OBQ, as pruning can be considered a special form of quantization (quantizing parameters to zero).

The OBQ method can be derived by modifying the OBC constraints as follows:

$$
e_q^{\mathrm{T}} \delta \mathbf{w} + w_q = \text{quant}(w_q)
$$

By substituting $w_q - \text{quant}(w_q)$ for $w_q$, the quantization equations can be computed as:

$$
\delta \mathbf{w} = - \frac{w_q - \text{quant}(w_q)}{[\mathbf{H}^{-1}]_{qq}} \mathbf{H}^{-1} e_q
$$

$$
\delta E \approx \frac{1}{2} \frac{(w_q - \text{quant}(w_q))^2}{[\mathbf{H}^{-1}]_{qq}}
$$

As in OBC, quantization can be carried out independently for each row. The overall time complexity is $O(\text{d}_{\text{row}} \cdot \text{d}_{\text{col}}^3)$, since all parameters need to be quantized.
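A rough sketch of this greedy per-row loop. The round-to-nearest `quant` below is a placeholder (a real implementation would use a calibrated scale and zero-point per row or group), and removing a quantized weight from $\mathbf{H}^{-1}$ uses the usual Gaussian-elimination update:

```python
import numpy as np

def quant(x, scale=0.1):
    # Placeholder round-to-nearest quantizer.
    return np.round(x / scale) * scale

def obq_quantize_row(w, H_inv):
    """Greedily quantize one row of the weight matrix, OBQ-style.

    w     : 1-D array, one row of W
    H_inv : inverse Hessian for this row, shape (d_col, d_col)
    """
    w = w.astype(float)
    H_inv = H_inv.copy()
    remaining = list(range(len(w)))
    w_q = np.zeros_like(w)
    while remaining:
        # Pick the remaining weight whose quantization adds the least error.
        diag = np.array([H_inv[i, i] for i in remaining])
        errs = (w[remaining] - quant(w[remaining])) ** 2 / diag
        q = remaining[int(np.argmin(errs))]
        w_q[q] = quant(w[q])
        # Compensate the other weights for the rounding error on w[q].
        w += -((w[q] - w_q[q]) / H_inv[q, q]) * H_inv[:, q]
        # Remove q from the problem (Gaussian elimination on H^{-1}).
        H_inv -= np.outer(H_inv[:, q], H_inv[q, :]) / H_inv[q, q]
        remaining.remove(q)
    return w_q
```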

GPTQ

Although OBQ has improved computational efficiency, it’s still not suitable for Large Language Models (LLMs). GPTQ comes to the rescue, further boosting efficiency with the following main optimizations:

Use an arbitrary order instead of a greedy one

The OBQ method always picks the weight that currently incurs the least additional error. However, the GPTQ paper finds that the improvement of this greedy order over quantizing weights in an arbitrary order is small.
The original OBQ method quantizes each row of $\mathbf{W}$ independently, in a row-specific order determined by the corresponding errors. In GPTQ, all rows are instead quantized in the same order. As a consequence, an entire column can be quantized at a time.

In this way, the runtime is reduced from $O(\text{d}_{\text{row}} \cdot \text{d}_{\text{col}}^3)$ to $O(\max(\text{d}_{\text{row}} \cdot \text{d}_{\text{col}}^2, \text{d}_{\text{col}}^3))$.

Lazy Batch-Updates

In OBQ, quantizing a single element requires updating all remaining elements using the following equation:

$$
\mathbf{w} \leftarrow \mathbf{w} - \frac{w_q - \text{quant}(w_q)}{[\mathbf{H}^{-1}]_{qq}} \mathbf{H}^{-1}_{:, q}
$$

This leads to a relatively low compute-to-memory-access ratio. For example, if the weight matrix is $\text{k} \times \text{k}$ and quantization is done column by column, the total memory-access volume is $\text{k}^3$.

Fortunately, this problem can be resolved by the following observation: the final rounding decisions for column $i$ are only affected by updates performed on this very column, and so updates to later columns are irrelevant at this point in the process.

This makes it possible to "lazily batch" the updates to later columns and apply them together, achieving much better GPU utilization.

Cholesky Reformulation

Computing the inverse Hessian $\mathbf{H}^{-1}_F$ repeatedly suffers from numerical stability problems when scaling GPTQ to existing large models.

Besides applying dampening, i.e., adding a small constant $\epsilon$ to the diagonal elements, GPTQ leverages state-of-the-art Cholesky kernels to compute all the information it needs from $\mathbf{H}^{-1}$ upfront.

GPTQ proceeds as follows:

  1. Compute $\mathbf{H}^{-1}$ upfront (via its Cholesky decomposition).
  2. Split the columns into multiple blocks.
  3. Quantize the current block column by column, updating the remaining columns inside the block after each step.
  4. After finishing the block, apply one batched update to the not-yet-quantized elements outside the block.
  5. Repeat from step 3 if there are more blocks.
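Putting the pieces together, here is a simplified NumPy sketch that follows this structure. It is illustrative rather than a faithful reproduction of the reference implementation; the `quant` placeholder, block size, and dampening fraction are arbitrary choices:

```python
import numpy as np

def quant(x, scale=0.1):
    # Placeholder round-to-nearest quantizer.
    return np.round(x / scale) * scale

def gptq_quantize(W, X, blocksize=128, damp=0.01):
    """Quantize W (d_row x d_col) column by column, GPTQ-style sketch.

    X : calibration inputs for this layer, shape (d_col, n_samples)
    """
    W = W.astype(float)
    d_row, d_col = W.shape
    Q = np.zeros_like(W)

    # Step 1: dampened Hessian, then the upper Cholesky factor of H^{-1}.
    H = 2.0 * X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(d_col)
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T     # upper triangular

    # Steps 2-5: process columns in blocks, lazily batching cross-block updates.
    for i1 in range(0, d_col, blocksize):
        i2 = min(i1 + blocksize, d_col)
        W1 = W[:, i1:i2].copy()
        Err = np.zeros_like(W1)
        Hinv1 = Hinv[i1:i2, i1:i2]

        for i in range(i2 - i1):                      # quantize within the block
            w = W1[:, i]
            q = quant(w)
            Q[:, i1 + i] = q
            err = (w - q) / Hinv1[i, i]
            W1[:, i:] -= np.outer(err, Hinv1[i, i:])  # update rest of the block
            Err[:, i] = err

        W[:, i1:i2] = W1
        W[:, i2:] -= Err @ Hinv[i1:i2, i2:]           # one batched update
    return Q
```

A real implementation also needs per-row (or per-group) quantization scales and further numerical safeguards; the sketch above only mirrors the block-and-column structure described in this section.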

Summary

This blog post traces the development of GPTQ, starting from its roots in OBD, through OBS and OBC/OBQ, and finally to GPTQ itself. Unlike quantization methods based on purely statistical heuristics, GPTQ is grounded in a rigorous theoretical derivation, making it a more robust and reliable solution.

What’s Next?

This blog post has covered the theoretical underpinnings of GPTQ. In the upcoming post, we will get our hands dirty with the code implementation.

Stay tuned for a deep dive into the code that powers this fascinating approach to weight quantization!



Footnotes

  1. Optimal Brain Damage: https://proceedings.neurips.cc/paper/1989/hash/6c9882bbac1c7093bd25041881277658-Abstract.html

  2. Optimal Brain Surgeon: https://ieeexplore.ieee.org/document/298572

  3. Optimal Brain Compression: https://arxiv.org/abs/2208.11580

