A simple guide to build your intuition about quantization, with easy mathematical derivations and hands-on coding in PyTorch.

Before I explain the diagram above, let me start with the highlights of what you'll be learning in this post.

- First, you'll learn about the **what** and **why** of **quantization**.
- Next, you'll dive deeper to learn the **how** of **quantization**, with some simple mathematical derivations.
- And finally, we'll write some **code together in PyTorch** to perform quantization and de-quantization of LLM weight parameters.

Let's unpack these one by one.

## 1. What is quantization and why do you need it?

**Quantization** is a method of compressing a larger model (an LLM or any deep learning model) into a smaller size. In quantization, you primarily quantize the weight parameters and activations of the model. Let's do a simple model-size calculation to validate this claim.

In the figure above, **the size of the base model Llama 3 8B is 32 GB. After INT8 quantization, the size is reduced to 8 GB (75% less). With INT4 quantization, the size is further reduced to 4 GB (~87.5% less)**. This is a huge reduction in model size, and it is indeed remarkable, isn't it? A big credit goes to the authors of the quantization papers, and my huge appreciation for the power of mathematics.
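As a sanity check on those numbers, here is a rough back-of-the-envelope estimate that counts only the 8 billion weight parameters and ignores any overhead (real checkpoints differ slightly):

```python
# Rough model-size estimate: parameter count x bytes per parameter.
# Assumes every parameter is stored in the given precision and ignores
# metadata/overhead, so it is only an approximation of real checkpoint sizes.

def model_size_gb(num_params: int, bits_per_param: int) -> float:
    """Model size in gigabytes (using 1 GB = 10**9 bytes for simplicity)."""
    return num_params * bits_per_param / 8 / 1e9

params = 8_000_000_000  # Llama 3 8B

fp32 = model_size_gb(params, 32)  # original FP32 weights
int8 = model_size_gb(params, 8)   # after INT8 quantization
int4 = model_size_gb(params, 4)   # after INT4 quantization

print(fp32, int8, int4)                      # 32.0 8.0 4.0
print(f"INT8 saves {1 - int8 / fp32:.0%}")   # INT8 saves 75%
print(f"INT4 saves {1 - int4 / fp32:.1%}")   # INT4 saves 87.5%
```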

**Now that you understand what quantization is, let's move on to the why part.** Look at image 1: as an aspiring AI researcher, developer, or architect, if you want to fine-tune a model on your datasets or run inference, most likely you won't be able to do so on your own machine or mobile device due to memory and processor constraints. Probably, like me, you'll even make an angry face like option 1b. This brings us to option 1a, where a cloud provider gives you all the resources you need, and you can easily run any task with any model you want. But it will cost you a lot of money. If you can afford it, great. But if you're on a limited budget, the good news is you still have option 2 available. **This is where you can apply quantization techniques to reduce the model's size** and conveniently use it for your own use cases. If you have done your quantization well, you'll get roughly the same accuracy as the original model.

Note: Once you've finished fine-tuning or other tasks on your model on your local machine, if you want to bring it into production, I'd advise you to host the model in the cloud to provide reliable, scalable, and secure services to your customers.

## 2. How does quantization work? A simple mathematical derivation

Technically, quantization maps the model's weight values from a higher precision (e.g. FP32) to a lower precision (e.g. FP16, BF16, or INT8). Although there are many quantization methods available, in this post we'll study one of the most widely used: the linear quantization method. There are two modes of linear quantization: **A. Asymmetric quantization** and **B. Symmetric quantization**. We'll learn about both methods one by one.

**A. Asymmetric Linear Quantization:** The asymmetric quantization method maps the values from the original tensor range (Wmin, Wmax) to the values in the quantized tensor range (Qmin, Qmax).

- **Wmin, Wmax:** Min and max values of the original tensor (data type: FP32, 32-bit floating point). The default data type of weight tensors in most modern LLMs is FP32.
- **Qmin, Qmax:** Min and max values of the quantized tensor (data type: INT8, 8-bit integer). We can also choose other data types such as INT4, FP16, and BF16 for quantization. We'll use INT8 in our example.
- **Scale value (S):** During quantization, the scale value scales down the values of the original tensor to obtain the quantized tensor. During de-quantization, it scales up the values of the quantized tensor to obtain the de-quantized values. The data type of the scale is the same as that of the original tensor: FP32.
- **Zero point (Z):** The zero point is the (generally non-zero) value in the quantized tensor range that maps directly to the value **0** in the original tensor range. Its data type is INT8, since it lives in the quantized tensor range.
- **Quantization:** Section "A" of the diagram shows the quantization process, which maps [Wmin, Wmax] -> [Qmin, Qmax].
- **De-quantization:** Section "B" of the diagram shows the de-quantization process, which maps [Qmin, Qmax] -> [Wmin, Wmax].

**So, how do we derive the quantized tensor value from the original tensor value?** It's quite simple. If you still remember your high-school mathematics, you can easily follow the derivation below. Let's do it step by step (I suggest you refer to the diagram above while deriving the equations, for a clearer understanding).

I know many of you might not want to go through the mathematical derivation below. But believe me, it will certainly help make your concepts crystal clear and save you a lot of time when coding quantization at a later stage. I felt the same when I was researching this topic.
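For reference, here is a compact summary of where the derivation lands (these are the same scale and zero-point formulas used by the code in section 3). The idea is to require the linear map W = S(Q − Z) to send the endpoints of one range onto the endpoints of the other, i.e. Wmax = S(Qmax − Z) and Wmin = S(Qmin − Z); subtracting the two equations gives the scale, and the second one gives the zero point:

```latex
\begin{aligned}
S &= \frac{W_{max} - W_{min}}{Q_{max} - Q_{min}} &&\text{(scale, FP32)} \\
Z &= Q_{min} - \mathrm{round}\!\left(\frac{W_{min}}{S}\right) &&\text{(zero point, rounded to INT8)} \\
Q &= \mathrm{clamp}\!\left(\mathrm{round}\!\left(\frac{W}{S}\right) + Z,\; Q_{min},\; Q_{max}\right) &&\text{(quantization)} \\
W' &= S \, (Q - Z) &&\text{(de-quantization)}
\end{aligned}
```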

- **Potential issue 1: What to do if the value of Z falls out of range? Solution:** Use simple if-else logic: set Z to Qmin if it is smaller than Qmin, and to Qmax if it is greater than Qmax. This is described well in figure A of image 4 below.
- **Potential issue 2: What to do if the value of Q falls out of range? Solution:** PyTorch has a function called **clamp** which keeps values within a specified range (-128 to 127 in our example). The clamp function sets Q to Qmin if it is below Qmin and to Qmax if it is above Qmax. Problem solved, let's move on.

Side note: The range of quantized tensor values is (-128, 127) for INT8, the signed 8-bit integer data type. If the quantized tensor's data type is UINT8 (unsigned 8-bit integer), the range is (0, 255).

**B. Symmetric Linear Quantization:** In the symmetric method, the value 0 in the original tensor range maps to 0 in the quantized tensor range; hence the name. Since 0 maps to 0 and the range is mirrored on both sides, there is no zero point (Z) in symmetric quantization. The overall mapping happens between (-Wmax, Wmax) of the original tensor range and (-Qmax, Qmax) of the quantized tensor range. The diagram below shows the symmetric mapping for both the quantized and the de-quantized case.

Since we've already defined all the parameters in the asymmetric section, the same definitions apply here as well. Let's get into the mathematical derivation of symmetric quantization.
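Because the zero point is fixed at 0, the symmetric formulas simplify accordingly: the scale is now derived from the maximum absolute weight value (so that the largest-magnitude weight maps to the edge of the quantized range), and there is no Z term to add or subtract:

```latex
\begin{aligned}
S &= \frac{\max(|W|)}{Q_{max}} &&\text{(scale, FP32)} \\
Q &= \mathrm{clamp}\!\left(\mathrm{round}\!\left(\frac{W}{S}\right),\; Q_{min},\; Q_{max}\right) &&\text{(quantization)} \\
W' &= S \, Q &&\text{(de-quantization, no zero point)}
\end{aligned}
```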

**Difference between asymmetric and symmetric quantization:** In short, asymmetric quantization uses both a scale S and a zero point Z, so it can map an arbitrary range [Wmin, Wmax] onto the full quantized range. Symmetric quantization uses only a scale and maps the mirrored range (-Wmax, Wmax), which makes it simpler and cheaper to compute, at the cost of wasting part of the quantized range when the weight distribution is skewed.

Now that you have learned the **what, why, and how of linear quantization**, we arrive at the final part of this post: **the coding part**.

## 3. Writing PyTorch code to perform quantization and de-quantization of LLM weight parameters

As I mentioned earlier, quantization can be applied to both the model's weight parameters and its activations. However, for simplicity, we'll only quantize the weight parameters in our coding example. Before we begin coding, let's take a quick look at how weight parameter values change after quantization in a transformer model. I believe this will make our understanding even clearer.

When we quantized just 16 original weight parameters from FP32 to INT8, the memory footprint was reduced from 512 bits to 128 bits (a 75% reduction). For large models, the reduction will be even more significant.
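That footprint arithmetic is easy to verify yourself:

```python
# 16 weights stored in FP32 (32 bits each) vs. INT8 (8 bits each).
n_weights = 16
fp32_bits = n_weights * 32
int8_bits = n_weights * 8

print(fp32_bits, int8_bits)                 # 512 128
print(f"{1 - int8_bits / fp32_bits:.0%}")   # 75%
```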

Below, you can see how data types such as FP32, signed INT8, and unsigned UINT8 are laid out in actual memory. I've done the actual calculations in two's complement. Feel free to do the calculations yourself and verify the results.
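As a quick refresher (a small illustrative snippet, not part of the original figures), Python can show the two's-complement bit pattern of a signed 8-bit integer by masking with 0xFF:

```python
def int8_twos_complement(value: int) -> str:
    """Return the 8-bit two's-complement bit pattern of an INT8 value."""
    assert -128 <= value <= 127, "value must fit in signed INT8"
    # Masking with 0xFF keeps the low 8 bits, which for negative numbers
    # is exactly the two's-complement representation.
    return format(value & 0xFF, "08b")

print(int8_twos_complement(5))      # 00000101
print(int8_twos_complement(-5))     # 11111011
print(int8_twos_complement(-128))   # 10000000
print(int8_twos_complement(127))    # 01111111
```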

Now we've covered everything you need to begin coding. I'd suggest you follow along to get comfortable with the derivation.

**A. Asymmetric quantization code:** Let's code it step by step.

**Step 1:** We'll first assign random values to the original weight tensor (size: 4×4, data type: FP32).

```python
# !pip install torch  # install the torch library first if you haven't already

# import the torch library
import torch

original_weight = torch.randn((4, 4))
print(original_weight)
```

**Step 2:** We're going to define two functions: one for quantization and another for de-quantization.

```python
def asymmetric_quantization(original_weight):
    # define the data type that we want to quantize to. In our example, it's INT8.
    quantized_data_type = torch.int8

    # get the Wmax and Wmin values from the original weight, which is in FP32.
    Wmax = original_weight.max().item()
    Wmin = original_weight.min().item()

    # get the Qmax and Qmin values from the quantized data type.
    Qmax = torch.iinfo(quantized_data_type).max
    Qmin = torch.iinfo(quantized_data_type).min

    # calculate the scale value using the scale formula. Data type: FP32.
    # please refer to the math section of this post to see how the formula is derived.
    S = (Wmax - Wmin) / (Qmax - Qmin)

    # calculate the zero-point value using the zero-point formula. Data type: INT8.
    # please refer to the math section of this post to see how the formula is derived.
    Z = Qmin - (Wmin / S)

    # check if the Z value is out of range.
    if Z < Qmin:
        Z = Qmin
    elif Z > Qmax:
        Z = Qmax
    else:
        # the zero-point data type should be INT8, the same as the quantized values.
        Z = int(round(Z))

    # we have original_weight, scale and zero_point; now we can calculate the
    # quantized weight using the formula derived in the math section.
    quantized_weight = (original_weight / S) + Z

    # round it, and use the torch clamp function to ensure the quantized weight
    # doesn't go out of range, staying within Qmin and Qmax.
    quantized_weight = torch.clamp(torch.round(quantized_weight), Qmin, Qmax)

    # finally, cast the data type to INT8.
    quantized_weight = quantized_weight.to(quantized_data_type)

    # return the final quantized weight, scale and zero point.
    return quantized_weight, S, Z


def asymmetric_dequantization(quantized_weight, scale, zero_point):
    # use the de-quantization formula derived in the math section of this post.
    # also make sure to convert quantized_weight to float, as subtraction between
    # two INT8 values (quantized_weight and zero_point) can give unwanted results.
    dequantized_weight = scale * (quantized_weight.to(torch.float32) - zero_point)
    return dequantized_weight
```

**Step 3:** We'll calculate the quantized weight, scale, and zero point by invoking the **asymmetric_quantization** function. You can see the output in the screenshot below; note that the data type of quantized_weight is INT8, scale is FP32, and zero_point is INT8.

```python
quantized_weight, scale, zero_point = asymmetric_quantization(original_weight)

print(f"quantized weight: {quantized_weight}")
print("\n")
print(f"scale: {scale}")
print("\n")
print(f"zero point: {zero_point}")
```

**Step 4:** Now that we have all the values of the quantized weight, scale, and zero point, let's get the de-quantized weight values by invoking the asymmetric_dequantization function. Note that the de-quantized weight values are FP32.

```python
dequantized_weight = asymmetric_dequantization(quantized_weight, scale, zero_point)
print(dequantized_weight)
```

**Step 5:** Let's find out how accurate the final de-quantized weight values are compared to the original weight tensor, by calculating the quantization error between them.

```python
quantization_error = (dequantized_weight - original_weight).square().mean()
print(quantization_error)
```

**Output result: The quantization error is very small. Hence, we can say that the asymmetric quantization method has done a great job.**

**B. Symmetric quantization code:** We can reuse the same code we wrote for the asymmetric method, with two changes: the zero point is always fixed at 0 (because in symmetric quantization the value 0 of the quantized range always maps to the value 0 in the original weight tensor), and the scale is computed from the maximum absolute weight value, S = max(|W|) / Qmax, instead of from the (Wmin, Wmax) span. With those changes we can proceed without writing much extra code.
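To make that concrete, here is a minimal sketch of symmetric quantization and de-quantization in plain Python (no torch, just the arithmetic, so you can check it by hand; the scale, rounding, and clamping follow the symmetric formulas from section 2):

```python
def symmetric_quantization(weights, qmax=127):
    """Quantize a list of FP32 weights to INT8 symmetrically (zero point = 0)."""
    # scale is computed from the maximum absolute weight value,
    # so the largest-magnitude weight maps to the edge of the INT8 range.
    scale = max(abs(w) for w in weights) / qmax
    quantized = []
    for w in weights:
        q = round(w / scale)
        # clamp to the signed INT8 range [-128, 127].
        q = max(-qmax - 1, min(qmax, q))
        quantized.append(q)
    return quantized, scale

def symmetric_dequantization(quantized, scale):
    """De-quantize: no zero point to subtract, just scale back up."""
    return [scale * q for q in quantized]

weights = [0.5, -1.2, 2.54, 0.0]
q, s = symmetric_quantization(weights)
w2 = symmetric_dequantization(q, s)
print(q)   # [25, -60, 127, 0]
print(w2)  # close to the original weights, small rounding error
```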

**And that's it!** We've come to the end of this post. I hope it has helped you build a solid intuition about quantization and a clear understanding of the mathematical derivations.

**My final thoughts…**

- In this post, we have covered all the fundamental topics you need in order to get involved in any LLM or deep learning quantization-related task.
- Although we've successfully quantized the weight tensor and achieved good accuracy, which is sufficient in most cases, if you want to apply quantization to a larger model with higher accuracy, you'll need to perform channel quantization (quantize each row or column of the weight matrix) or group quantization (form smaller groups within a row or column and quantize them separately). These techniques are more complicated; I'll cover them in my upcoming post.

**Stay tuned, and thanks a lot for reading!**
