Figure 1 | PaliGemma’s architecture: a SigLIP image encoder feeds into a Gemma decoder LM
This repository contains my clean, modular re-implementation of PaliGemma, a vision–language model architecture that integrates a SigLIP vision encoder with a Gemma-based language model to enable grounded multimodal reasoning.
My primary aim in this project was to study the internal mechanisms of large VLMs by recreating their major components from scratch in PyTorch while keeping the codebase accessible for further experimentation.
Current multimodal models must efficiently align visual representations with language signals while remaining scalable and sample-efficient. PaliGemma offers a compelling design: a powerful transformer-based vision encoder, a lightweight yet expressive decoder-only language model, and a simple projection mechanism for cross-modal fusion.
This implementation allowed me to investigate core research questions such as:
How do contrastive vision models like SigLIP integrate with decoder-only LMs?
What is the minimal architecture required for high-quality VQA and captioning?
How do positional embeddings, attention mechanisms, and KV caching behave in multimodal generation?
What trade-offs arise between architectural simplicity and model expressivity?
This repository reflects my attempt to answer these questions through a modular and transparent codebase.
PaliGemma follows a straightforward yet effective design:
SigLIP Vision Encoder: Produces semantically rich patch embeddings; pretrained with a sigmoid-based contrastive objective.
Multimodal Projection Layer: Maps visual representations into the language embedding space.
Gemma Language Model: A decoder-only transformer with grouped-query attention, RoPE, and SwiGLU feed-forward networks.
Autoregressive Generation: Supports efficient KV caching and multiple decoding strategies.
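As a reference for the generation step above, here is a minimal sketch of the KV-cache idea, assuming a simple per-layer list layout (the class name and shapes are illustrative, not the repository's actual implementation):

```python
import torch

class KVCache:
    """Stores past keys/values per layer so each decoding step only processes the newest token."""

    def __init__(self, num_layers: int):
        self.keys = [None] * num_layers
        self.values = [None] * num_layers

    def update(self, layer: int, k: torch.Tensor, v: torch.Tensor):
        # k, v: (batch, num_kv_heads, new_tokens, head_dim)
        if self.keys[layer] is None:
            self.keys[layer], self.values[layer] = k, v
        else:
            self.keys[layer] = torch.cat([self.keys[layer], k], dim=2)
            self.values[layer] = torch.cat([self.values[layer], v], dim=2)
        # Attention then runs over the full cached sequence instead of recomputing the prefix.
        return self.keys[layer], self.values[layer]
```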
PaliGemma has three components:
SigLIP Vision Encoder (ViT-So400m): contrastively pretrained; produces patch tokens.
Linear Multimodal Projector: maps SigLIP outputs into Gemma’s embedding space via a single zero-initialized linear layer; keeping this component simple proved advantageous.
Gemma Decoder-only LM (Gemma-2B): an autoregressive language model that consumes the image tokens plus the text prompt and produces text outputs.
High-level token layout:
`[ image_tokens ..., BOS, prefix_text_tokens ..., SEP, suffix_tokens ..., EOS, PAD ]`

Image tokens are placed at the front, and the model uses block attention: image and prefix tokens attend to one another with full (bidirectional) attention, while suffix tokens are decoded autoregressively under a causal mask.
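A minimal sketch of how such a block-attention mask can be constructed (the function name and sizes are illustrative, not the repository's actual API):

```python
import torch

def build_block_attention_mask(num_image_prefix_tokens: int, num_suffix_tokens: int) -> torch.Tensor:
    """Prefix-LM style mask: image + prefix tokens attend bidirectionally,
    suffix tokens attend causally to everything that precedes them."""
    total = num_image_prefix_tokens + num_suffix_tokens
    mask = torch.zeros(total, total, dtype=torch.bool)  # True = attention allowed
    # Every token may attend to the full image + prefix block.
    mask[:, :num_image_prefix_tokens] = True
    # Suffix tokens attend causally among themselves.
    causal = torch.ones(num_suffix_tokens, num_suffix_tokens).tril().bool()
    mask[num_image_prefix_tokens:, num_image_prefix_tokens:] = causal
    return mask

# Example: 4 image + prefix tokens followed by 3 suffix tokens.
print(build_block_attention_mask(4, 3).int())
```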
Each module is designed for readability and extensibility, enabling future research on architectural variants, training strategies, and fine-tuning methods.
Implemented from first principles, including:
- Patch embedding layers
- Multi-head attention blocks
- MLP feed-forward sublayers
- Sigmoid-based contrastive alignment (conceptual basis)
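For reference, the sigmoid-based contrastive objective behind SigLIP can be sketched as follows (a simplified version; the temperature and bias defaults stand in for SigLIP's learnable parameters):

```python
import torch
import torch.nn.functional as F

def siglip_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                temperature: float = 10.0, bias: float = -10.0) -> torch.Tensor:
    """Sigmoid contrastive loss over a batch of paired image/text embeddings.
    Matching pairs (the diagonal) get label +1, every other pair gets -1."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() * temperature + bias
    labels = 2.0 * torch.eye(logits.size(0), device=logits.device) - 1.0
    return -F.logsigmoid(labels * logits).mean()
```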
Key design elements:
- Grouped-query attention for memory efficiency (see the sketch after this list)
- Rotary positional embeddings
- RMSNorm normalization
- SwiGLU MLP layers
- Autoregressive LM Head for generation
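A minimal sketch of grouped-query attention, where several query heads share each key/value head (head counts and dimensions below are illustrative and do not reflect Gemma-2B's actual configuration):

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, num_q_heads: int, num_kv_heads: int):
    """q: (batch, num_q_heads, seq, head_dim); k, v: (batch, num_kv_heads, seq, head_dim).
    Each key/value head is shared by num_q_heads // num_kv_heads query heads."""
    group_size = num_q_heads // num_kv_heads
    # Repeat the KV heads so their shapes line up with the query heads.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v

# Illustrative shapes: 8 query heads sharing 2 KV heads.
q, k, v = torch.randn(1, 8, 16, 64), torch.randn(1, 2, 16, 64), torch.randn(1, 2, 16, 64)
print(grouped_query_attention(q, k, v, num_q_heads=8, num_kv_heads=2).shape)  # (1, 8, 16, 64)
```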
A lightweight linear layer that aligns visual patch embeddings with the text embedding space.
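A minimal sketch of such a projector and of how projected image tokens can be merged with the text embeddings (the class name and dimensions are illustrative; 1152 and 2048 correspond to common SigLIP-So400m and Gemma-2B widths, but the repository's actual modules may differ):

```python
import torch
import torch.nn as nn

class MultiModalProjector(nn.Module):
    """Single linear layer mapping vision features into the LM embedding space."""

    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        self.linear = nn.Linear(vision_dim, lm_dim)

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, lm_dim)
        return self.linear(patch_embeddings)

# Illustrative merge: projected image tokens go in front of the text embeddings
# before the combined sequence is fed to the Gemma decoder.
projector = MultiModalProjector(vision_dim=1152, lm_dim=2048)
image_tokens = projector(torch.randn(1, 256, 1152))   # 256 patches for a 224px input
text_embeddings = torch.randn(1, 12, 2048)
inputs_embeds = torch.cat([image_tokens, text_embeddings], dim=1)
print(inputs_embeds.shape)  # torch.Size([1, 268, 2048])
```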
- Image resizing, rescaling, and normalization
- Special token handling (e.g. the `<image>` placeholder tokens)
- Unified tokenizer and processor for multimodal prompts
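A minimal sketch of this preprocessing path, assuming Pillow and NumPy (the normalization constants and prompt format shown are illustrative defaults, not necessarily the repository's exact values):

```python
import numpy as np
import torch
from PIL import Image

IMAGE_MEAN = np.array([0.5, 0.5, 0.5], dtype=np.float32)  # illustrative SigLIP-style constants
IMAGE_STD = np.array([0.5, 0.5, 0.5], dtype=np.float32)

def preprocess_image(path: str, image_size: int = 224) -> torch.Tensor:
    """Resize, rescale to [0, 1], normalize, and return a (1, 3, H, W) tensor."""
    image = Image.open(path).convert("RGB").resize((image_size, image_size), Image.BICUBIC)
    pixels = np.asarray(image, dtype=np.float32) / 255.0   # rescale
    pixels = (pixels - IMAGE_MEAN) / IMAGE_STD              # normalize
    return torch.from_numpy(pixels).permute(2, 0, 1).unsqueeze(0)  # HWC -> NCHW

def build_prompt(prefix: str, num_image_tokens: int, bos_token: str = "<bos>") -> str:
    """Prepend one image placeholder per patch token, then BOS and the text prefix."""
    return "<image>" * num_image_tokens + bos_token + prefix + "\n"
```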
Run the full demo with `python main.py`, or call the inference utilities directly:

```python
import torch

from inference import load_hf_model, test_inference
from processing import PaliGemmaProcessor

model, tokenizer = load_hf_model("./paligemma-3b-pt-224", "cuda")
processor = PaliGemmaProcessor(
    tokenizer,
    model.config.vision_config.num_image_tokens,
    model.config.vision_config.image_size,
)

with torch.no_grad():
    test_inference(
        model=model,
        processor=processor,
        prompt="Describe this image",
        image_file_path="./your_image.jpg",
        max_tokens_to_generate=100,
        temperature=0.8,
        top_p=0.9,
        do_sample=True,
    )
```

- Grouped-query attention significantly reduces memory overhead while maintaining high-quality attention maps.
- Shows how a simple linear projector can fuse modalities effectively without complex fusion layers.
- KV caching enables efficient, scalable token generation.
- Multiple decoding strategies (greedy, temperature, top-p) allow comparative experiments; a sampling sketch is shown below.
- Every component is implemented with clarity to support reproducibility and ablation studies.
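A minimal sketch of temperature plus top-p (nucleus) sampling over a single logits vector (the function name and defaults are illustrative):

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.8, top_p: float = 0.9) -> int:
    """Pick the next token id from a 1D logits tensor via temperature + nucleus sampling."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the smallest set of tokens whose cumulative probability reaches top_p.
    keep = cumulative - sorted_probs < top_p
    keep[0] = True  # always keep the most likely token
    filtered = sorted_probs * keep
    filtered = filtered / filtered.sum()
    choice = torch.multinomial(filtered, num_samples=1)
    return sorted_ids[choice].item()

# Greedy decoding is the special case of taking logits.argmax() instead of sampling.
```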
You can adjust model behavior in `main.py`:

```python
model_path = "./paligemma-3b-pt-224"
prompt = "Describe this image"
image_file_path = "./image.jpg"
max_tokens_to_generate = 500
temperature = 0.2
top_p = 0.7
do_sample = False
```

- Python 3.8+
- PyTorch 2.0+
- Transformers
- PIL
- NumPy
- SafeTensors
- HuggingFace Hub
See requirements.txt for full version details.
