PaliGemma: A Research-Oriented Multimodal Vision–Language Model

Model Architecture

Figure 1 | PaliGemma’s architecture: a SigLIP image encoder feeds into a Gemma decoder LM

This repository contains my clean, modular re-implementation of PaliGemma, a vision–language model architecture that integrates a SigLIP vision encoder with a Gemma-based language model to enable grounded multimodal reasoning. My primary aim in this project was to study the internal mechanisms of large VLMs by recreating their major components from scratch in PyTorch while keeping the codebase accessible for further experimentation.

Table of Contents

1. Research Motivation
2. High-Level Architecture
3. Model Overview
4. Repository Structure
5. Component Overview
6. Running Inference
7. Key Research Features
8. Configuration
9. System Requirements

1. Research Motivation

Current multimodal models must efficiently align visual representations with language signals while remaining scalable and sample-efficient. PaliGemma offers a compelling design: a powerful transformer-based vision encoder, a lightweight yet expressive decoder-only language model, and a simple projection mechanism for cross-modal fusion.

This implementation allowed me to investigate core research questions such as:

  • How do contrastive vision models like SigLIP integrate with decoder-only LMs?
  • What is the minimal architecture required for high-quality VQA and captioning?
  • How do positional embeddings, attention mechanisms, and KV caching behave in multimodal generation?
  • What trade-offs arise between architectural simplicity and model expressivity?

This repository reflects my attempt to answer these questions through a modular and transparent codebase.

2. High-Level Architecture

PaliGemma follows a straightforward yet effective design:

Key Components

  • SigLIP Vision Encoder: Produces semantically rich patch embeddings through a sigmoid-based contrastive training objective.
  • Multimodal Projection Layer: Maps visual representations into the language embedding space.
  • Gemma Language Model: A decoder-only transformer with grouped-query attention, RoPE, and SwiGLU feed-forward networks.
  • Autoregressive Generation: Supports efficient KV caching and multiple decoding strategies.
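
As a rough illustration of this data flow, the sketch below wires the three components together in PyTorch. The class and attribute names (vision_tower, projector, language_model) and the hidden sizes are placeholders chosen for illustration, not the exact identifiers used in this repository.

import torch
import torch.nn as nn

class ToyPaliGemma(nn.Module):
    """Minimal sketch of the PaliGemma data flow (illustrative shapes only)."""

    def __init__(self, vision_tower, language_model, vision_dim=1152, text_dim=2048):
        super().__init__()
        self.vision_tower = vision_tower                   # SigLIP ViT: pixels -> patch embeddings
        self.projector = nn.Linear(vision_dim, text_dim)   # maps vision space into the LM embedding space
        self.language_model = language_model               # Gemma decoder-only LM

    def forward(self, pixel_values, input_ids):
        image_feats = self.vision_tower(pixel_values)      # [B, num_patches, vision_dim]
        image_embeds = self.projector(image_feats)         # [B, num_patches, text_dim]
        text_embeds = self.language_model.embed_tokens(input_ids)
        # image tokens go in front of the text prompt, then the decoder predicts the suffix
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)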

3. Model Overview

PaliGemma has three components:

  • SigLIP Vision Encoder (ViT-So400m): contrastively pretrained; produces patch tokens.
  • Linear Multimodal Projector: projects SigLIP outputs into Gemma’s embedding space (a zero-initialized linear layer); this simplicity was found advantageous.
  • Gemma Decoder-only LM (Gemma-2B): an autoregressive language model that consumes the image tokens plus the text prompt and produces text outputs.

High-level token layout:

[ image_tokens ..., BOS, prefix_text_tokens ..., SEP, suffix_tokens ..., EOS, PAD ]

Image tokens are placed at the front, and the model uses block attention: the image + prefix block attends bidirectionally, while the suffix is generated autoregressively (see the mask sketch below).
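
The block-attention pattern can be made concrete with a small mask builder: every position in the image + prefix block attends to the whole block, while each suffix position attends causally to everything before it. This helper is a minimal illustrative sketch, not code from this repository.

import torch

def build_block_attention_mask(num_prefix: int, num_suffix: int) -> torch.Tensor:
    """Return a [seq, seq] boolean mask where True means attention is allowed.

    num_prefix: image tokens + prompt tokens (fully bidirectional block)
    num_suffix: generated/suffix tokens (causal over the whole sequence)
    """
    seq = num_prefix + num_suffix
    mask = torch.zeros(seq, seq, dtype=torch.bool)
    mask[:num_prefix, :num_prefix] = True        # prefix block: full bidirectional attention
    for i in range(num_prefix, seq):             # suffix rows: causal attention
        mask[i, : i + 1] = True
    return mask

print(build_block_attention_mask(4, 3).int())    # 4 image+prefix tokens, 3 suffix tokens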

4. Repository Structure

Each module is designed for readability and extensibility, enabling future research on architectural variants, training strategies, and fine-tuning methods.

5. Component Overview

Vision Encoder (SigLIP)

Implemented from first principles, including:

  • Patch embedding layers
  • Multi-head attention blocks
  • MLP feed-forward sublayers
  • Sigmoid-based contrastive alignment (conceptual basis)
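
Patch embedding in a ViT-style encoder is typically a single strided convolution, which is equivalent to flattening each patch and applying a shared linear projection. The snippet below is a generic sketch using the commonly cited SigLIP-So400m/14 dimensions; it is not lifted from this codebase.

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and embed each one."""

    def __init__(self, image_size=224, patch_size=14, in_channels=3, embed_dim=1152):
        super().__init__()
        # kernel = stride = patch_size, so each patch is projected independently
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.num_patches = (image_size // patch_size) ** 2  # 16 x 16 = 256 patches

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        x = self.proj(pixel_values)            # [B, embed_dim, H/patch, W/patch]
        return x.flatten(2).transpose(1, 2)    # [B, num_patches, embed_dim]

print(PatchEmbedding()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 256, 1152])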

Language Model (Gemma)

Key design elements:

  • Grouped-query attention for memory efficiency
  • Rotary positional embeddings
  • RMSNorm normalization
  • SwiGLU MLP layers
  • Autoregressive LM Head for generation
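
Two of these blocks are compact enough to sketch directly. The sizes below follow typical Gemma-2B conventions (hidden size 2048, intermediate size 16384) and the (1 + weight) scale parameterization used by Gemma-style RMSNorm; treat this as an illustrative sketch rather than this repository's exact code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization: no mean subtraction, learnable scale."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        norm = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return norm * (1.0 + self.weight)      # Gemma-style (1 + weight) scaling

class GatedMLP(nn.Module):
    """SwiGLU-style gated feed-forward: act(gate(x)) * up(x), then project down."""

    def __init__(self, dim: int = 2048, hidden: int = 16384):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden, bias=False)
        self.up_proj = nn.Linear(dim, hidden, bias=False)
        self.down_proj = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))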

Multimodal Projector

A lightweight linear layer that aligns visual patch embeddings with the text embedding space.

Input Processing

  • Image resizing, rescaling, and normalization
  • Special token handling (<image> placeholder tokens)
  • Unified tokenizer and processor for multimodal prompts
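
To make the prompt layout concrete: a PaliGemma-style processor prepends one placeholder token per image patch, then BOS, the text prefix, and a newline separator before tokenization; the projected SigLIP embeddings later replace the placeholders. The helper below is a simplified sketch, not the PaliGemmaProcessor class from this repo.

def build_multimodal_prompt(prefix_text: str,
                            bos_token: str = "<bos>",
                            image_token: str = "<image>",
                            num_image_tokens: int = 256) -> str:
    """Assemble the raw prompt string handed to the tokenizer."""
    # placeholder image tokens first, then BOS, prefix text, and a newline separator
    return image_token * num_image_tokens + bos_token + prefix_text + "\n"

print(build_multimodal_prompt("Describe this image")[-30:])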

6. Running Inference

Basic Example

python main.py

Custom Inference Script

import torch

from inference import load_hf_model, test_inference
from processing import PaliGemmaProcessor

# Load the pretrained weights and tokenizer, then build the multimodal processor
model, tokenizer = load_hf_model("./paligemma-3b-pt-224", "cuda")
processor = PaliGemmaProcessor(tokenizer, model.config.vision_config.num_image_tokens,
                               model.config.vision_config.image_size)

# Run generation without tracking gradients
with torch.no_grad():
    test_inference(
        model=model,
        processor=processor,
        prompt="Describe this image",
        image_file_path="./your_image.jpg",
        max_tokens_to_generate=100,
        temperature=0.8,
        top_p=0.9,
        do_sample=True
    )

7. Key Research Features

Attention Mechanisms

  • Grouped-query attention significantly reduces memory overhead while maintaining high-quality attention maps.
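
In grouped-query attention, several query heads share a single key/value head, so the KV cache only stores the smaller number of KV heads and they are broadcast at attention time. A common way to implement the broadcast is the repeat pattern below; the head counts shown are Gemma-2B-style values used purely for illustration.

import torch

def repeat_kv(hidden: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand [B, num_kv_heads, T, head_dim] to [B, num_kv_heads * n_rep, T, head_dim]."""
    if n_rep == 1:
        return hidden
    b, kv_heads, t, d = hidden.shape
    hidden = hidden[:, :, None, :, :].expand(b, kv_heads, n_rep, t, d)
    return hidden.reshape(b, kv_heads * n_rep, t, d)

keys = torch.randn(2, 1, 10, 256)                # 1 KV head, head_dim 256
print(repeat_kv(keys, 8).shape)                  # torch.Size([2, 8, 10, 256]) for 8 query heads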

Cross-Modal Alignment

  • Investigates how a simple linear projector can effectively fuse modalities without complex fusion layers.

Autoregressive Behavior

  • KV caching enables efficient, scalable token generation.
  • Multiple decoding strategies (greedy, temperature, top-p) allow comparative experiments.
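
As a reference point for the sampling strategies, the function below sketches temperature plus nucleus (top-p) sampling over a batch of next-token logits; it is a generic illustration rather than the decoding loop used in this repository.

import torch

def sample_top_p(logits: torch.Tensor, temperature: float = 0.8, top_p: float = 0.9) -> torch.Tensor:
    """Sample one token id per row from [B, vocab] logits using temperature + top-p."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # drop tokens once the cumulative probability before them already exceeds top_p
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0
    sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
    next_sorted = torch.multinomial(sorted_probs, num_samples=1)
    return torch.gather(sorted_idx, -1, next_sorted)   # [B, 1] token ids

next_token = sample_top_p(torch.randn(1, 1000))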

Architectural Transparency

  • Every component is implemented with clarity to support reproducibility and ablation studies.

8. Configuration

You can adjust model behavior in main.py:

model_path = "./paligemma-3b-pt-224"   # local path to the pretrained checkpoint
prompt = "Describe this image"
image_file_path = "./image.jpg"
max_tokens_to_generate = 500
temperature = 0.2                      # lower values make sampling more deterministic
top_p = 0.7                            # nucleus sampling threshold
do_sample = False                      # False selects greedy decoding

9. System Requirements

  • Python 3.8+
  • PyTorch 2.0+
  • Transformers
  • PIL
  • NumPy
  • SafeTensors
  • HuggingFace Hub

See requirements.txt for full version details.

About

This project is my PyTorch reproduction of PaliGemma, a compact 3B vision–language model that integrates SigLIP vision features with a Gemma decoder. I implemented the full multimodal pipeline from vision encoding to autoregressive text generation to study modern VLM architectures from a research perspective.
