Grad-CAM is a powerful visualization technique originally designed for CNN architectures to highlight which parts of an image influence a neural network's decisions. In this post, I'll show how I adapted Grad-CAM to work with an image-to-text transformer model, specifically the TrOCR model from Hugging Face.
Step 1: Token Generation from the Model
The first step involves generating tokens from our TrOCR model. These tokens are essentially the model's interpretation of the image in textual form, which we'll later use for gradient computation.
import torch
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")
image_path = "your_image_path.jpg"
image = Image.open(image_path).convert("RGB")
def get_generated_tokens(image, model, processor):
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_tokens = model.generate(pixel_values=pixel_values, max_length=50)
    return generated_tokens, pixel_values
The generated tokens will look something like tensor([[ 2, 14200, 2022, 2]]), where the '2' token is a special token marking the start and end of the sequence.
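To sanity-check the output, you can decode the generated ids back into text. This is an optional check, not part of the pipeline itself, and it only uses the processor and function defined above:
generated_tokens, pixel_values = get_generated_tokens(image, model, processor)
print(generated_tokens)
# Decode back to the predicted string, dropping the special start/end tokens
print(processor.batch_decode(generated_tokens, skip_special_tokens=True)[0])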
Step 2: Layer Selection for Grad-CAM
Choosing the right layer is crucial, because the effectiveness of Grad-CAM depends on capturing activations that correlate with the output predictions. In transformers, this is typically one of the final layers.
If you're unsure about the layer names and output shapes, simply print the model, or use the torchsummary library to get detailed output shapes for each layer.
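For example, printing the encoder's named modules is usually enough to find a layer worth hooking. This is just a quick sketch; the module names assume the Hugging Face ViT implementation used by TrOCR:
# List candidate layers: the per-block output submodules of the encoder
for name, module in model.encoder.named_modules():
    if name.endswith("output") and "attention" not in name:
        print(name, "->", type(module).__name__)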
For the model above, I've chosen the final layer of the ViT encoder.
layer_name = model.encoder.encoder.layer[-1].output
Note: I've used .output in layer_name because a Hugging Face layer can return a dictionary or tuple; for a plain torch module, the name of the layer alone is good enough.
Step 3: Attaching Hooks to Capture Outputs and Gradients
We attach a forward hook to the chosen layer to capture its outputs during the forward pass and retain them for computing gradients during the backward pass.
last_layer_output = None
def save_output(module, input, output):
    global last_layer_output
    last_layer_output = output
    output.retain_grad()

last_layer = layer_name
last_layer.register_forward_hook(save_output)
Step 4: Focusing on Specific Tokens
We select specific tokens and compute how much each part of the input image contributed to predicting that token, providing insight into the model's decisions.
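In practice this means running a teacher-forced forward pass with the generated ids and backpropagating from the logit of the chosen token. Here is a minimal sketch of that idea, assuming generated_tokens and pixel_values from Step 1; the helper in the complete code below wraps exactly this:
token_index = 1  # e.g. explain the second generated token
outputs = model(pixel_values=pixel_values,
                decoder_input_ids=generated_tokens[:, :-1],
                return_dict=True)
# Logit of the target token at its position, then backpropagate from it
selected_logit = outputs.logits[0, token_index, generated_tokens[0, token_index]]
selected_logit.backward()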
Step 5: Reshaping Layer Outputs
Transformers output activations in a different format than CNNs. We reshape them to mimic CNN feature maps, which lets us apply Grad-CAM effectively:
Understanding the output shape of the chosen layer:
- (Batch_size, Tokens, Features or Channels) -> (1, 577, 768)
- Remove the first token ([CLS] in the case of a ViT) -> (1, 576, 768)
- Reshape into a square feature map, which holds in this case -> (1, 24, 24, 768)
- Transpose so the feature dimension comes first, as in a CNN -> (1, 768, 24, 24)
def reshape_transform_vit_huggingface(x):
    activations = x[:, 1:, :]  # Remove the first token, used for classification in some architectures
    side_length = int(np.sqrt(activations.shape[1]))  # Assuming the feature map is square
    activations = activations.view(activations.shape[0], side_length, side_length, activations.shape[2])
    activations = activations.transpose(2, 3).transpose(1, 2)
    return activations
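A quick shape check with a dummy tensor shows the transformation end to end (this assumes the 768-dimensional hidden size of the base encoder):
dummy = torch.randn(1, 577, 768)  # (batch, tokens incl. [CLS], hidden size)
print(reshape_transform_vit_huggingface(dummy).shape)  # expected: torch.Size([1, 768, 24, 24])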
Step 6: Applying Grad-CAM
Finally, we apply the Grad-CAM algorithm to highlight the relevant areas of the image for each token. The algorithm uses the gradients of the target token with respect to the activations from our chosen layer, weighted and summed to create a heatmap.
transform_output = reshape_transform_vit_huggingface(layer_output)
transform_grad = reshape_transform_vit_huggingface(grad)
weights = torch.mean(transform_grad, dim=(2, 3), keepdim=True)  # Step 1: Average over the spatial dimensions
# Step 2: Weighted combination of activation maps
grad_cam = torch.sum(weights * transform_output, dim=1, keepdim=True)  # Sum over the feature maps
# Step 3: Apply ReLU
grad_cam = torch.relu(grad_cam)  # Only keep positive contributions
grad_cam = grad_cam.squeeze(0)  # Remove the batch dimension for visualization
# Step 4: Normalize (optional, but helps visualization)
grad_cam = grad_cam / grad_cam.max()
print("Grad-CAM shape:", grad_cam.shape)
heatmap = torch.nn.functional.interpolate(grad_cam.unsqueeze(0), size=(image.size[1], image.size[0]), mode='bilinear', align_corners=False)
heatmap = heatmap.squeeze().detach().numpy()
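At this point heatmap is a NumPy array with the same height and width as the input image, so a quick way to eyeball the result is a matplotlib overlay (optional; the full code below saves a blended PNG instead):
plt.imshow(image)                           # original image
plt.imshow(heatmap, cmap='jet', alpha=0.5)  # Grad-CAM overlay
plt.axis('off')
plt.show()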
Implementation in Python
Here's the complete Python code that accomplishes all of the above steps, using PyTorch, PIL for image handling, and matplotlib for visualization:
import torch
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np

# Load the pre-trained processor and model
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")
# Load and preprocess the image
image_path = "00809.jpg"
image = Image.open(image_path).convert("RGB")
def reshape_transform_vit_huggingface(x):
    activations = x[:, 1:, :]  # Remove the [CLS] token
    side_length = int(np.sqrt(activations.shape[1]))  # Assuming a square feature map
    activations = activations.view(activations.shape[0], side_length, side_length, activations.shape[2])
    activations = activations.transpose(2, 3).transpose(1, 2)
    return activations

def get_generated_tokens(image, model, processor):
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    # Forward pass (generation)
    generated_tokens = model.generate(pixel_values=pixel_values, max_length=50)
    return generated_tokens, pixel_values
last_layer_output = None
def get_activations_and_gradient(pixel_values, model, processor, generated_tokens, layer_name, token_index=0):
    text = processor.decode(generated_tokens[0, token_index], skip_special_tokens=False)
    def save_output(module, input, output):
        global last_layer_output
        last_layer_output = output
        output.retain_grad()
    last_layer = layer_name
    last_layer.register_forward_hook(save_output)
    outputs = model(pixel_values=pixel_values, decoder_input_ids=generated_tokens[:, :-1], return_dict=True)
    # Backward pass on the selected logit
    selected_logit = outputs.logits[0, token_index, generated_tokens[0, token_index]]
    selected_logit.backward()
    return last_layer_output, last_layer_output.grad, text
def apply_gradcam(layer_output, grad, image, index, text):
    transform_output = reshape_transform_vit_huggingface(layer_output)
    transform_grad = reshape_transform_vit_huggingface(grad)
    weights = torch.mean(transform_grad, dim=(2, 3), keepdim=True)  # Step 1: Average over the spatial dimensions
    # Step 2: Weighted combination of activation maps
    grad_cam = torch.sum(weights * transform_output, dim=1, keepdim=True)  # Sum over the feature maps
    # Step 3: Apply ReLU
    grad_cam = torch.relu(grad_cam)  # Only keep positive contributions
    grad_cam = grad_cam.squeeze(0)  # Remove the batch dimension for visualization
    # Step 4: Normalize (optional, but helps visualization)
    grad_cam = grad_cam / grad_cam.max()
    print("Grad-CAM shape:", grad_cam.shape)
    heatmap = torch.nn.functional.interpolate(grad_cam.unsqueeze(0), size=(image.size[1], image.size[0]), mode='bilinear', align_corners=False)
    heatmap = heatmap.squeeze().detach().numpy()
    blended = Image.blend(image.convert('RGBA'), Image.fromarray((plt.cm.jet(heatmap) * 255).astype(np.uint8)).convert('RGBA'), alpha=0.5)
    blended.save(f"blended_image_{index}.png", format='PNG')
    return {f"{text}": f"blended_image_{index}.png"}
layer_name = model.encoder.encoder.layer[-1].output
generated_tokens, pixel_values = get_generated_tokens(image, model, processor)
print(generated_tokens)
for index, tokens in enumerate(generated_tokens[:, :-1].numpy().tolist()[0]):
    layer_output, grad, text = get_activations_and_gradient(pixel_values, model, processor, generated_tokens, layer_name, token_index=index)
    info = apply_gradcam(layer_output, grad, image, index, text)
    print(info)
By adapting Grad-CAM for use with a transformer model, we can gain insight into which parts of the image the model focuses on when generating text. This technique can be extremely useful for debugging and improving model performance, particularly in applications like automated content description and OCR.
I hope you found this guide helpful. For more insights and discussions on technology and innovation, feel free to connect with me on LinkedIn: www.linkedin.com/in/meetvpatel. I look forward to connecting with you!