A simple guide to building intuition about quantization, with simple mathematical derivations and coding in PyTorch.

Before I explain the diagram above, let me start with the highlights you'll be learning in this post.

- First, you'll learn about the **what** and **why** of **quantization**.
- Next, you'll dive deeper into the **how** of **quantization** with some simple mathematical derivations.
- And finally, we'll write some **code together in PyTorch** to perform quantization and de-quantization of LLM weight parameters.

Let's unpack these together, one after the other.

## 1. What is quantization and why do you need it?

**Quantization** is a technique for compressing a larger model (an LLM, or any deep learning model) into a smaller one. In quantization, you primarily quantize the model's weight parameters and activations. Let's do a simple model-size calculation to validate this claim.

In the figure above, **the size of the base model Llama 3 8B is 32 GB. After INT8 quantization, the size is reduced to 8 GB (75% less). With INT4 quantization, the size is further reduced to 4 GB (~90% less)**. That's a huge reduction in model size. And that's really great, isn't it? Huge credit goes to the authors of the quantization papers, and my deep appreciation for the power of mathematics.
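As a sanity check, we can reproduce these numbers with plain arithmetic (assuming 8 billion parameters and ignoring any metadata overhead):

```python
# Back-of-the-envelope model size for Llama 3 8B (8e9 parameters).
params = 8_000_000_000

def size_gb(bits_per_param: int) -> float:
    """Model size in gigabytes (using 1 GB = 1e9 bytes for round numbers)."""
    return params * bits_per_param / 8 / 1e9

fp32 = size_gb(32)   # 32.0 GB
int8 = size_gb(8)    #  8.0 GB -> 75% smaller than FP32
int4 = size_gb(4)    #  4.0 GB -> 87.5% smaller than FP32

print(fp32, int8, int4)
```

Note that INT4 is mathematically an 87.5% reduction from FP32; the "~90%" in the text is a rounded figure.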

**Now that you understand what quantization is, let's move on to the why part.** Look at image 1: as an aspiring AI researcher, developer, or architect, if you want to fine-tune a model on your own datasets or run inference, most likely you won't be able to do so on your machine or mobile device, due to memory and processor constraints. Probably, like me, you'll end up with an angry face like in option 1b. This brings us to option 1a, where a cloud provider gives you all the resources you need, and you can easily run any task with any model you want. But it will cost you a lot of money. If you can afford it, great. But if you have a limited budget, the good news is that you still have option 2 available. **This is where you can apply quantization techniques to reduce the model's size** and conveniently use it for your use cases. If you have done your quantization correctly, you will get roughly the same accuracy as the original model.

Note: Once you've finished fine-tuning or other tasks on your model on your local machine and you want to ship it to production, I would advise hosting the model in the cloud to provide reliable, scalable, and secure services to your customers.

## 2. How does quantization work? A simple mathematical derivation.

Technically, quantization maps the model's weight values from higher precision (e.g. FP32) to lower precision (e.g. FP16, BF16, or INT8). Although many quantization techniques are available, in this post we'll study one of the most widely used, called linear quantization. Linear quantization has two modes: **A. Asymmetric quantization** and **B. Symmetric quantization**. We'll learn both methods one after the other.

**A. Asymmetric Linear Quantization:** The asymmetric quantization method maps the values from the original tensor range (Wmin, Wmax) to the values in the quantized tensor range (Qmin, Qmax).

- **Wmin, Wmax:** Min and max values of the original tensor (data type: FP32, 32-bit floating point). The default data type of weight tensors in most modern LLMs is FP32.
- **Qmin, Qmax:** Min and max values of the quantized tensor (data type: INT8, 8-bit integer). We can also choose other data types such as INT4, FP16, and BF16 for quantization. We'll use INT8 in our example.
- **Scale value (S):** During quantization, the scale value scales down the values of the original tensor to produce the quantized tensor. During de-quantization, it scales up the values of the quantized tensor to produce the de-quantized values. The data type of the scale value is the same as the original tensor: FP32.
- **Zero point (Z):** The zero point is the (typically non-zero) value in the quantized tensor range that maps directly to the value **0** in the original tensor range. The data type of the zero point is INT8, because it lives in the quantized tensor range.
- **Quantization:** Part "**A**" of the diagram shows the quantization process, which maps [Wmin, Wmax] -> [Qmin, Qmax].
- **De-quantization:** Part "**B**" of the diagram shows the de-quantization process, which maps [Qmin, Qmax] -> [Wmin, Wmax].

**So, how do we derive the quantized tensor value from the original tensor value?** It's quite simple. If you still remember your high-school mathematics, you can easily follow the derivation below. Let's do it step by step (I recommend referring to the diagram above while deriving the equations, for a clearer understanding).

I know many of you might not want to go through the mathematical derivation below. But believe me, it will definitely help make your concepts crystal clear and save you lots of time when coding the quantization at a later stage. I felt the same when I was researching this topic.
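To keep things concrete, here is the asymmetric recipe from the derivation worked through on a tiny made-up range (the numbers are illustrative, not taken from the post's diagram):

```python
# Asymmetric linear quantization worked by hand on an illustrative range.
# Original tensor range (FP32) and quantized tensor range (signed INT8).
w_min, w_max = -1.5, 2.5
q_min, q_max = -128, 127

# Scale: how many FP32 units a single INT8 step represents.
scale = (w_max - w_min) / (q_max - q_min)        # 4.0 / 255 ~ 0.0157

# Zero point: the INT8 value that represents FP32 zero.
zero_point = int(round(q_min - w_min / scale))   # round(-128 + 95.625) = -32

# Quantize one weight, then clamp it to the INT8 range.
w = 0.8
q = round(w / scale + zero_point)                # round(51.0 - 32) = 19
q = max(q_min, min(q_max, q))                    # clamp (a no-op here)

# De-quantize to see how closely we recover the original value.
w_restored = scale * (q - zero_point)            # ~ 0.8
print(q, w_restored)
```

The round-trip recovers 0.8 almost exactly here only because it happens to land on an integer grid point; in general the round-off introduces a small error bounded by half the scale.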

- **Potential issue 1: What do we do if the value of Z falls out of range? Answer:** Use simple if-else logic: set Z to Qmin if it is smaller than Qmin, and to Qmax if it is larger than Qmax. This is described in Figure A of image 4 below.
- **Potential issue 2: What do we do if the value of Q falls out of range? Answer:** In PyTorch, there is a function called **clamp** that adjusts a value to stay within a given range (-128 to 127 in our example). The clamp function sets Q to Qmin if it is below Qmin and to Qmax if it is above Qmax. Problem solved, let's move on.
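As a quick illustration of the clamping just described (a minimal sketch; the tensor values are made up):

```python
import torch

# A tensor whose values overflow the signed INT8 range (-128, 127).
q = torch.tensor([-300.0, -128.0, 0.0, 42.0, 127.0, 500.0])

# clamp pins anything below -128 up to -128 and anything above 127 down to 127.
q_clamped = torch.clamp(q, min=-128, max=127)
print(q_clamped)
```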

Side note: The range of quantized tensor values is (-128, 127) for INT8, a signed integer data type. If the quantized tensor's data type is UINT8, an unsigned integer, the range is (0, 255).
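You don't need to memorize these ranges; PyTorch's `torch.iinfo` reports them for any integer dtype:

```python
import torch

# Query integer type ranges instead of hard-coding them.
int8_info = torch.iinfo(torch.int8)
uint8_info = torch.iinfo(torch.uint8)

print(int8_info.min, int8_info.max)    # -128 127
print(uint8_info.min, uint8_info.max)  # 0 255
```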

**B. Symmetric Linear Quantization:** In the symmetric method, the zero point of the original tensor range maps to the zero point of the quantized tensor range; hence the name symmetric. Since 0 maps to 0 on both sides of the range, there is no Z (zero point) in symmetric quantization. The overall mapping happens between (-Wmax, Wmax) of the original tensor range and (-Qmax, Qmax) of the quantized tensor range. The diagram below shows the symmetric mapping for both the quantized and de-quantized cases.

Since we already defined all the parameters in the asymmetric section, the same definitions apply here as well. Let's get into the mathematical derivation of symmetric quantization.
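Under the same notation, symmetric quantization reduces to a single scale factor, S = Wmax / Qmax, with Z fixed at 0. A tiny worked sketch (illustrative numbers, not from the diagram):

```python
# Symmetric linear quantization: the zero point is fixed at 0.
w_max = 2.5      # max absolute value of the original tensor
q_max = 127      # positive bound of signed INT8

scale = w_max / q_max            # ~ 0.0197; the only parameter we need

w = 0.7
q = round(w / scale)             # quantize: no zero point to add
q = max(-q_max, min(q_max, q))   # clamp to (-127, 127)

w_restored = scale * q           # de-quantize: no zero point to subtract
print(q, w_restored)
```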

**Difference between Asymmetric and Symmetric quantization:**

Now that you have learned the **what, why, and how of linear quantization**, this brings us to the final part of the post, **the coding part**.

## 3. Writing code in PyTorch to perform quantization and de-quantization of LLM weight parameters.

As I mentioned earlier, quantization can be applied to a model's weight parameters as well as its activations. However, for simplicity, we'll quantize only weight parameters in our coding example. Before we start coding, let's take a quick look at how weight parameter values change after quantization in a transformer model. I believe it will make our understanding even clearer.

When we performed quantization on just 16 original weight parameters from FP32 to INT8, the memory footprint was reduced from 512 bits to 128 bits (a 75% reduction). For large models, the absolute savings are even more significant.
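These footprint numbers are easy to verify (a trivial sketch; 16 parameters, FP32 vs INT8):

```python
# Memory footprint of 16 weight parameters before and after quantization.
n_params = 16
fp32_bits = n_params * 32     # 512 bits
int8_bits = n_params * 8      # 128 bits

reduction = 1 - int8_bits / fp32_bits
print(fp32_bits, int8_bits, f"{reduction:.0%}")
```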

Below, you can see how data types such as FP32, signed INT8, and unsigned UINT8 are laid out in actual memory. I've done the actual calculation in 2's complement. Feel free to work through the calculation yourself and verify the results.

Now we've covered everything you need to start coding. I would recommend following along to get comfortable with the derivation.

**A. Asymmetric quantization code:** Let's code it step by step.

**Step 1:** First, we'll assign random values to the original weight tensor (size: 4×4, data type: FP32).

```python
# !pip install torch  # install the torch library first if you haven't already done so

# import torch library
import torch

original_weight = torch.randn((4, 4))
print(original_weight)
```

**Step 2:** We'll define two functions, one for quantization and another for de-quantization.

```python
def asymmetric_quantization(original_weight):
    # Define the data type that we want to quantize to. In our example, it's INT8.
    quantized_data_type = torch.int8

    # Get the Wmax and Wmin values from the original weight, which is in FP32.
    Wmax = original_weight.max().item()
    Wmin = original_weight.min().item()

    # Get the Qmax and Qmin values from the quantized data type.
    Qmax = torch.iinfo(quantized_data_type).max
    Qmin = torch.iinfo(quantized_data_type).min

    # Calculate the scale value using the scale formula. Data type: FP32.
    # Please refer to the math section of this post to see how the formula was derived.
    S = (Wmax - Wmin) / (Qmax - Qmin)

    # Calculate the zero point value using the zero point formula. Data type: INT8.
    # Please refer to the math section of this post to see how the formula was derived.
    Z = Qmin - (Wmin / S)

    # Check if the Z value is out of range.
    if Z < Qmin:
        Z = Qmin
    elif Z > Qmax:
        Z = Qmax
    else:
        # The zero point data type should be INT8, the same as the quantized value.
        Z = int(round(Z))

    # We now have original_weight, scale and zero_point; we can calculate the
    # quantized weight using the formula we derived in the math section.
    quantized_weight = (original_weight / S) + Z

    # Round it, and also use the torch clamp function to ensure the quantized
    # weight doesn't go out of range and stays within Qmin and Qmax.
    quantized_weight = torch.clamp(torch.round(quantized_weight), Qmin, Qmax)

    # Finally, cast the data type to INT8.
    quantized_weight = quantized_weight.to(quantized_data_type)

    # Return the final quantized weight, scale and zero point.
    return quantized_weight, S, Z


def asymmetric_dequantization(quantized_weight, scale, zero_point):
    # Use the de-quantization formula derived in the math section of this post.
    # Also make sure to convert quantized_weight to float, as subtraction between
    # two INT8 values (quantized_weight and zero_point) would give an unwanted result.
    dequantized_weight = scale * (quantized_weight.to(torch.float32) - zero_point)

    return dequantized_weight
```

**Step 3:** We'll calculate the quantized weight, scale, and zero point by invoking the **asymmetric_quantization** function. You can see the output in the screenshot below; note that the data type of quantized_weight is INT8, scale is FP32, and zero_point is INT8.

```python
quantized_weight, scale, zero_point = asymmetric_quantization(original_weight)

print(f"quantized weight: {quantized_weight}")
print("\n")
print(f"scale: {scale}")
print("\n")
print(f"zero point: {zero_point}")
```

**Step 4:** Now that we have all the values of the quantized weight, scale, and zero point, let's get the de-quantized weight value by invoking the asymmetric_dequantization function. Note that the de-quantized weight value is FP32.

```python
dequantized_weight = asymmetric_dequantization(quantized_weight, scale, zero_point)
print(dequantized_weight)
```

**Step 5:** Let's find out how accurate the final de-quantized weight value is compared to the original weight tensor, by calculating the quantization error between them.

```python
quantization_error = (dequantized_weight - original_weight).square().mean()
print(quantization_error)
```

**Output result: The quantization error is very small. Hence, we can say that the asymmetric quantization method has done a good job.**

**B. Symmetric quantization code:** We'll reuse the same code we wrote for the asymmetric method. The only change required for the symmetric method is to always ensure that the value of zero_point is 0, because in symmetric quantization the zero_point always maps to the value 0 in the original weight tensor. So we can simply proceed without having to write additional code.
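For completeness, here is what a symmetric counterpart of the asymmetric functions above could look like (a sketch following the same structure; the function names are my own, not from the original post):

```python
import torch

def symmetric_quantization(original_weight):
    # Symmetric mode maps (-Wmax, Wmax) to (-Qmax, Qmax); the zero point is always 0.
    quantized_data_type = torch.int8

    # Use the maximum absolute value of the tensor as Wmax.
    Wmax = original_weight.abs().max().item()
    Qmax = torch.iinfo(quantized_data_type).max

    # A single scale factor; no zero point is needed.
    S = Wmax / Qmax

    # Scale down, round, clamp to the symmetric range and cast to INT8.
    quantized_weight = torch.clamp(torch.round(original_weight / S), -Qmax, Qmax)
    return quantized_weight.to(quantized_data_type), S

def symmetric_dequantization(quantized_weight, scale):
    # Scale back up; again, there is no zero point to subtract.
    return scale * quantized_weight.to(torch.float32)

# Quick check on a random 4x4 weight tensor.
w = torch.randn((4, 4))
q, s = symmetric_quantization(w)
w_restored = symmetric_dequantization(q, s)
print((w_restored - w).square().mean())
```

Since rounding is the only source of error here, each recovered value should stay within half a scale step of the original.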

**And that's it!** We've come to the end of this post. I hope it has helped you build solid intuition about quantization and a clear understanding of the mathematical derivation.

**My final thoughts…**

- In this post, we've covered all the essentials you need in order to get involved in any LLM or deep learning quantization-related task.
- Although we've successfully performed quantization on a weight tensor and achieved good accuracy, which is sufficient in many cases, if you want to apply quantization to a larger model with higher accuracy, you'll need to perform channel quantization (quantize each row or column of the weight matrix) or group quantization (make smaller groups within a row or column and quantize them separately). These techniques are more complex; I'll cover them in an upcoming post.

**Stay tuned, and thanks so much for reading!**

**References**