Grad-CAM is a powerful visualization technique originally designed for CNN architectures to highlight which parts of an image influence a neural network's decisions. In this post, I'll show how I adapted Grad-CAM to work with an image-to-text transformer model, specifically the TrOCR model from Hugging Face.
Step 1: Generating Tokens from the Model
The first step involves generating tokens from our TrOCR model. These tokens are essentially the model's interpretation of the image in textual form, which we'll later use for gradient computation.
import torch
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

image_path = "your_image_path.jpg"
image = Image.open(image_path).convert("RGB")

def get_generated_tokens(image, model, processor):
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_tokens = model.generate(pixel_values=pixel_values, max_length=50)
    return generated_tokens, pixel_values
The generated tokens will look something like tensor([[ 2, 14200, 2022, 2]]), where the '2' token is a special token marking the beginning and end of the sequence.
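To sanity-check the output, you can decode the tokens back to text with the processor; a minimal sketch (the token IDs above come from my example image, so yours will differ):

generated_tokens, pixel_values = get_generated_tokens(image, model, processor)
decoded_text = processor.decode(generated_tokens[0], skip_special_tokens=True)
print(decoded_text)  # the text the model recognized in the image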
Step 2: Layer Selection for Grad-CAM
Choosing the right layer is essential because the effectiveness of Grad-CAM depends on capturing relevant activations that correlate with the output predictions. In transformers, this is typically one of the final layers.
If you're unsure about the layer and its output shape, simply print the model or use the torchsummary library for detailed output shapes of each layer.
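For instance, printing the encoder lists its submodules and makes the layer names easy to spot (a small inspection snippet of my own):

print(model.encoder)  # shows the ViT embeddings, encoder layers, and their submodules
# or, to list just the module names:
for name, _ in model.encoder.named_modules():
    print(name)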
For the above model, I've chosen the final layer of the ViT encoder.
layer_name = model.encoder.encoder.layer[-1].output
Note: Here I've used .output inside layer_name because a Hugging Face layer can return a dictionary or tuple; for a plain torch module, the name of the layer itself is good enough.
Step 3: Attaching Hooks to Capture Outputs and Gradients
We attach a forward hook to the chosen layer to capture its outputs during the forward pass and retain them for computing gradients during the backward pass.
last_layer_output = None

def save_output(module, input, output):
    global last_layer_output
    last_layer_output = output
    output.retain_grad()

last_layer = layer_name
last_layer.register_forward_hook(save_output)
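One small caution (my addition, not part of the original flow): register_forward_hook returns a handle, and if you attach a hook every time you run the explanation, the hooks accumulate. Keeping the handle lets you detach it when you're done:

hook_handle = last_layer.register_forward_hook(save_output)
# ... run the forward and backward passes ...
hook_handle.remove()  # detach the hook so repeated runs don't stack hooks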
Step 4: Targeting Specific Tokens
We select specific tokens to compute how much each part of the input image contributed to predicting that token, providing insight into the model's decisions.
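Concretely, this means running a forward pass with the generated tokens as decoder inputs, picking the logit of the token we care about, and backpropagating from it. A minimal sketch, assuming the hook from Step 3 is already attached (the full version appears in the complete code below):

token_index = 0  # which generated token to explain
outputs = model(pixel_values=pixel_values,
                decoder_input_ids=generated_tokens[:, :-1],
                return_dict=True)
# logit of the token the model actually predicted at this position
selected_logit = outputs.logits[0, token_index, generated_tokens[0, token_index]]
selected_logit.backward()  # gradients now flow back to the hooked encoder layer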
Step 5: Reshaping Layer Outputs
Transformers output activations in a different format than CNNs. We reshape them to mimic CNN feature maps, enabling us to apply Grad-CAM effectively:
Understanding the output shape of the chosen layer:
- (Batch_size, Tokens, Features or Channels) -> (1, 577, 768)
- Remove the first token [CLS] if it is a ViT -> (1, 576, 768)
- Reshape, assuming the feature map is square, which is true in this case -> (1, 24, 24, 768)
- Apply transpose, so the feature dimension comes first as in a CNN -> (1, 768, 24, 24)
def reshape_transform_vit_huggingface(x):
    activations = x[:, 1:, :]  # remove the first token ([CLS]), used for classification in some architectures
    side_length = int(np.sqrt(activations.shape[1]))  # assuming the feature map is square
    activations = activations.view(activations.shape[0], side_length, side_length, activations.shape[2])
    activations = activations.transpose(2, 3).transpose(1, 2)  # (B, H, W, C) -> (B, C, H, W)
    return activations
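A quick way to verify the reshape (a small check of my own, using the base encoder's 577 tokens and 768-dimensional hidden size):

dummy = torch.randn(1, 577, 768)  # (batch, tokens incl. [CLS], hidden size)
print(reshape_transform_vit_huggingface(dummy).shape)  # torch.Size([1, 768, 24, 24])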
Step 6: Applying Grad-CAM
Finally, we apply the Grad-CAM algorithm to highlight the relevant regions of the image for each token. The algorithm uses the gradients of the target token with respect to the activations from our chosen layer, which are averaged into weights and summed to create a heatmap.
transform_output = reshape_transform_vit_huggingface(layer_output)
transform_grad = reshape_transform_vit_huggingface(grad)

# Step 1: Average the gradients across the spatial dimensions
weights = torch.mean(transform_grad, dim=(2, 3), keepdim=True)
# Step 2: Weighted combination of activation maps
grad_cam = torch.sum(weights * transform_output, dim=1, keepdim=True)  # Sum over the feature maps
# Step 3: Apply ReLU
grad_cam = torch.relu(grad_cam)  # Only keep positive contributions
grad_cam = grad_cam.squeeze(0)  # Remove batch dimension for visualization
# Step 4: Normalize (optional but helps visualization)
grad_cam = grad_cam / grad_cam.max()
print("Grad-CAM shape:", grad_cam.shape)

heatmap = torch.nn.functional.interpolate(grad_cam.unsqueeze(0), size=(image.size[1], image.size[0]), mode='bilinear', align_corners=False)
heatmap = heatmap.squeeze().detach().numpy()
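For a quick look before wiring everything together, you can overlay the heatmap on the image with matplotlib (a sketch of my own; the full code below instead saves a blended PNG with PIL):

plt.imshow(image)
plt.imshow(heatmap, cmap='jet', alpha=0.5)  # semi-transparent Grad-CAM overlay
plt.axis('off')
plt.show()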
Implementation in Python
Here's the complete Python code that accomplishes all the above steps using PyTorch, PIL for image handling, and matplotlib for visualization:
import torch
from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np

# Load the pre-trained processor and model
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

# Load and process the image
image_path = "00809.jpg"
image = Image.open(image_path).convert("RGB")

def reshape_transform_vit_huggingface(x):
    activations = x[:, 1:, :]  # drop the [CLS] token
    side_length = int(np.sqrt(activations.shape[1]))  # assuming a square feature map
    activations = activations.view(activations.shape[0], side_length, side_length, activations.shape[2])
    activations = activations.transpose(2, 3).transpose(1, 2)  # (B, H, W, C) -> (B, C, H, W)
    return activations

def get_generated_tokens(image, model, processor):
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    # Forward pass
    generated_tokens = model.generate(pixel_values=pixel_values, max_length=50)
    return generated_tokens, pixel_values

last_layer_output = None

def get_activations_and_gradient(pixel_values, model, processor, generated_tokens, layer_name, token_index=0):
    text = processor.decode(generated_tokens[0, token_index], skip_special_tokens=False)

    def save_output(module, input, output):
        global last_layer_output
        last_layer_output = output
        output.retain_grad()

    last_layer = layer_name
    last_layer.register_forward_hook(save_output)

    outputs = model(pixel_values=pixel_values, decoder_input_ids=generated_tokens[:, :-1], return_dict=True)
    # Backward pass on a specific logit
    selected_logit = outputs.logits[0, token_index, generated_tokens[0, token_index]]
    selected_logit.backward()
    return last_layer_output, last_layer_output.grad, text

def apply_gradcam(layer_output, grad, image, index, text):
    transform_output = reshape_transform_vit_huggingface(layer_output)
    transform_grad = reshape_transform_vit_huggingface(grad)

    # Step 1: Average the gradients across the spatial dimensions
    weights = torch.mean(transform_grad, dim=(2, 3), keepdim=True)
    # Step 2: Weighted combination of activation maps
    grad_cam = torch.sum(weights * transform_output, dim=1, keepdim=True)  # Sum over the feature maps
    # Step 3: Apply ReLU
    grad_cam = torch.relu(grad_cam)  # Only keep positive contributions
    grad_cam = grad_cam.squeeze(0)  # Remove batch dimension for visualization
    # Step 4: Normalize (optional but helps visualization)
    grad_cam = grad_cam / grad_cam.max()
    print("Grad-CAM shape:", grad_cam.shape)

    heatmap = torch.nn.functional.interpolate(grad_cam.unsqueeze(0), size=(image.size[1], image.size[0]), mode='bilinear', align_corners=False)
    heatmap = heatmap.squeeze().detach().numpy()

    blended = Image.blend(image.convert('RGBA'), Image.fromarray((plt.cm.jet(heatmap) * 255).astype(np.uint8)).convert('RGBA'), alpha=0.5)
    blended.save(f"blended_image_{index}.png", format='PNG')
    return {f"{text}": f"blended_image_{index}.png"}

layer_name = model.encoder.encoder.layer[-1].output

generated_tokens, pixel_values = get_generated_tokens(image, model, processor)
print(generated_tokens)

for index, token in enumerate(generated_tokens[:, :-1].numpy().tolist()[0]):
    layer_output, grad, text = get_activations_and_gradient(pixel_values, model, processor, generated_tokens, layer_name, token_index=index)
    data = apply_gradcam(layer_output, grad, image, index, text)
    print(data)
By adapting Grad-CAM for use with a transformer model, we can gain insight into which parts of the image the model focuses on when generating text. This technique can be extremely useful for debugging and improving model performance, especially in applications like automated content description and OCR.
I hope you found this guide useful. For more insights and discussions on technology and innovation, feel free to follow me on LinkedIn: www.linkedin.com/in/meetvpatel. I look forward to connecting with you!