This project implements a research‑grade image captioning system that generates human‑like descriptions by learning not only what objects appear in an image, but also how they relate to each other.
Unlike conventional CNN‑RNN captioning pipelines, this system explicitly models object relationships using Graph Convolutional Networks (GCNs) and uses Dual Multi‑Head Attention to align visual relations with natural language.
Traditional captioning models treat an image as a flat vector:
Image → CNN → RNN → Caption
This project treats an image as a structured graph of interacting objects:
Image → Objects → Relation Graph → GCN → Dual Attention → LSTM → Caption
This enables reasoning like:
“A man in a red shirt riding a motorcycle”
instead of
“man motorcycle”
```
Input Image
    │
    ▼
VGG19 CNN (global + object features)
    │
    ▼
Faster‑RCNN Object Detection
    │
    ▼
Object‑Relationship Graph (IoU‑based)
    │
    ▼
Graph Convolution Network (2 layers)
    │
    ▼
Edge Readout (2048‑D)
    │
    ├─ Global CNN features (2048‑D)
    ▼
Concatenation → 4096‑D Encoder Embedding
    │
    ▼
Dual Attention (Self + Cross)
    │
    ▼
LSTM Decoder
    │
    ▼
Caption
```
Two pretrained networks are used:
| Model | Purpose |
|---|---|
| VGG19 | Extracts spatial feature maps |
| Faster‑RCNN (ResNet‑50‑FPN) | Detects objects and bounding boxes |
VGG19 outputs spatial feature maps of shape `(batch, 49, 2048)`; the 49 spatial regions correspond to different parts of the image.
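A minimal sketch of this extraction, assuming torchvision's pretrained VGG19 conv stack (which natively yields a 7×7×512 map for a 224×224 input) followed by a hypothetical linear projection up to the 2048‑D width used here:

```python
import torch.nn as nn
from torchvision import models

class GlobalFeatureExtractor(nn.Module):
    """Sketch: VGG19 conv features -> 49 spatial regions -> 2048-D projection (assumed)."""
    def __init__(self, out_dim=2048):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT)
        self.backbone = vgg.features             # conv layers only, no classifier head
        self.project = nn.Linear(512, out_dim)   # 512 -> 2048 projection (assumption)

    def forward(self, images):                    # images: (B, 3, 224, 224)
        fmap = self.backbone(images)              # (B, 512, 7, 7)
        fmap = fmap.flatten(2).transpose(1, 2)    # (B, 49, 512)
        return self.project(fmap)                 # (B, 49, 2048)
```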
Each detected object becomes a graph node.
Edges are created when two objects overlap beyond an IoU threshold.
If no objects are detected, the entire image becomes a single node with a self‑loop.
Each node stores VGG19 features extracted from the cropped object region.
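A minimal sketch of the graph construction, assuming `torchvision.ops.box_iou` for the overlap test and DGL for the graph object; the IoU threshold value is illustrative:

```python
import dgl
import torch
from torchvision.ops import box_iou

def build_relation_graph(boxes, node_feats, iou_threshold=0.2):
    """boxes: (N, 4) xyxy tensor; node_feats: (N, D) VGG19 features of the crops.
    When nothing is detected, the caller passes the whole image as the single
    region, so N == 1 and the graph reduces to one node with a self-loop."""
    iou = box_iou(boxes, boxes)
    iou.fill_diagonal_(0)                     # ignore trivial overlap of a box with itself
    src, dst = torch.nonzero(iou > iou_threshold, as_tuple=True)
    g = dgl.graph((src, dst), num_nodes=boxes.size(0))
    g = dgl.add_self_loop(g)                  # keeps isolated objects (and the 1-node case) connected
    g.ndata["h"] = node_feats
    return g
```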
Two graph convolution layers propagate relational information between objects:
H1 = ReLU(GCN1(G, H))
H2 = ReLU(GCN2(G, H1))
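A sketch of these two layers using DGL's `GraphConv` (keeping the hidden width at the 2048‑D node-feature size is an assumption):

```python
import torch.nn as nn
import torch.nn.functional as F
from dgl.nn import GraphConv

class RelationGCN(nn.Module):
    def __init__(self, in_dim=2048, hidden_dim=2048):
        super().__init__()
        self.gcn1 = GraphConv(in_dim, hidden_dim, allow_zero_in_degree=True)
        self.gcn2 = GraphConv(hidden_dim, hidden_dim, allow_zero_in_degree=True)

    def forward(self, g, h):
        h1 = F.relu(self.gcn1(g, h))   # H1 = ReLU(GCN1(G, H))
        h2 = F.relu(self.gcn2(g, h1))  # H2 = ReLU(GCN2(G, H1))
        return h2
```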
For each edge (u → v):
EdgeFeature = concat(H2[u], H2[v])
All edge features are passed through a linear layer and mean‑pooled to produce a 2048‑D graph embedding.
This is concatenated with 2048‑D CNN features → 4096‑D encoder output.
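A sketch of the edge readout and the final concatenation; treating the global CNN vector as a single pooled 2048‑D feature is an assumption:

```python
import torch
import torch.nn as nn

class EdgeReadout(nn.Module):
    def __init__(self, node_dim=2048, out_dim=2048):
        super().__init__()
        self.edge_proj = nn.Linear(2 * node_dim, out_dim)

    def forward(self, g, h2, global_feat):
        """h2: (N, 2048) GCN outputs; global_feat: (2048,) pooled CNN features."""
        src, dst = g.edges()
        edge_feat = torch.cat([h2[src], h2[dst]], dim=-1)   # concat(H2[u], H2[v]) per edge
        edge_feat = self.edge_proj(edge_feat)                # (E, 2048) after linear layer
        graph_emb = edge_feat.mean(dim=0)                    # mean-pool over edges -> (2048,)
        return torch.cat([graph_emb, global_feat], dim=-1)   # 4096-D encoder embedding
```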
The decoder uses Dual Multi‑Head Attention:
- Self‑attention models word‑to‑word dependencies inside the generated caption.
- Cross‑attention aligns each generated word with image‑graph features.
- Both attention contexts are combined to predict the next word.
(Self‑Attention + Cross‑Attention) → LSTM → Softmax → Next Word
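A sketch of one decoding step built from PyTorch's `nn.MultiheadAttention` and an `LSTMCell`; the dimensions and the exact way the two contexts feed the LSTM are assumptions:

```python
import torch
import torch.nn as nn

class DualAttentionDecoderStep(nn.Module):
    def __init__(self, embed_dim=512, vis_dim=4096, vocab_size=5000, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.vis_proj = nn.Linear(vis_dim, embed_dim)
        self.lstm = nn.LSTMCell(2 * embed_dim, embed_dim)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, word_embs, vis_feats, state):
        """word_embs: (B, t, E) words generated so far; vis_feats: (B, R, 4096)."""
        q = word_embs[:, -1:, :]                                  # current word as query
        self_ctx, _ = self.self_attn(q, word_embs, word_embs)     # word-to-word context
        vis = self.vis_proj(vis_feats)
        cross_ctx, attn_w = self.cross_attn(self_ctx, vis, vis)   # word-to-image context
        lstm_in = torch.cat([self_ctx, cross_ctx], dim=-1).squeeze(1)
        h, c = self.lstm(lstm_in, state)
        return self.out(h), (h, c), attn_w                        # logits, new state, attention map
```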
Attention maps can be visualized to show which image regions influence each word.
Flickr8k
- 8,000 images
- 5 captions per image
(Designed to scale to Flickr30k / MS‑COCO.)
Images
Resize → 224×224 → Normalize
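A sketch of this pipeline, assuming the standard ImageNet normalization statistics used with pretrained VGG19:

```python
from torchvision import transforms

image_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics (assumed)
                         std=[0.229, 0.224, 0.225]),
])
```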
Captions
- Tokenized using spaCy
- Lowercased
- Vocabulary built with frequency threshold
- `<start>` and `<end>` tokens added
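A minimal sketch of the tokenization and vocabulary build; the special‑token names and frequency threshold are assumptions:

```python
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def build_vocab(captions, min_freq=5):
    """captions: list of raw caption strings -> {token: index} mapping."""
    counts = Counter()
    for cap in captions:
        counts.update(tok.text.lower() for tok in nlp.tokenizer(cap))
    itos = ["<pad>", "<start>", "<end>", "<unk>"]            # special tokens first
    itos += [w for w, c in counts.items() if c >= min_freq]  # frequency threshold
    return {w: i for i, w in enumerate(itos)}
```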
For each image (see the detection sketch after this list):
- Faster‑RCNN detects objects
- Object regions are cropped
- VGG19 features are extracted from each crop
- An IoU‑based graph is built
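A sketch of the detection step using torchvision's pretrained Faster‑RCNN; the score threshold is illustrative, and the boxes it returns would then be cropped, encoded with VGG19, and handed to the graph builder sketched earlier:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def detect_objects(image_tensor, score_threshold=0.5):
    """image_tensor: (3, H, W) float tensor in [0, 1]; returns kept xyxy boxes."""
    pred = detector([image_tensor])[0]
    keep = pred["scores"] > score_threshold
    return pred["boxes"][keep]
```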
At timestep t:
Previous words → Self Attention
Image relations → Cross Attention
→ LSTM → Predict next word
- Loss: Cross‑Entropy
- Optimizer: Adam
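A sketch of how the loss and optimizer come together in one teacher‑forced training step; the call signature of `captioner`, the pad index, and the learning rate are assumptions:

```python
import torch.nn as nn
import torch.optim as optim

def train_step(captioner, optimizer, images, graphs, captions, pad_idx=0):
    """captions: (B, T) word indices including <start>/<end>; teacher forcing."""
    criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
    logits = captioner(images, graphs, captions[:, :-1])     # predict the next word at each step
    loss = criterion(logits.reshape(-1, logits.size(-1)),    # (B*(T-1), vocab)
                     captions[:, 1:].reshape(-1))            # shifted ground-truth targets
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```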
| Metric | Score |
|---|---|
| BLEU‑1 | ≈ 0.55 |
| BLEU‑2 | ≈ 0.33 |
The GCN + Dual Attention model significantly outperforms the Bi‑LSTM baseline and produces captions that better capture object interactions, colors, and actions.
- PyTorch
- VGG19
- Faster‑RCNN
- DGL (Graph Learning)
- Multi‑Head Attention
- LSTM
- spaCy
- Larger datasets (MS‑COCO, Flickr30k)
- Scene‑graph supervision
- Transformer‑based decoders
- Custom‑trained object detectors
The complete technical report (methodology, algorithms, experiments, and results) is available here:
👉 Project Report (PDF):
Download / View Full Documentation
This document contains:
- Mathematical formulation of the GCN + Dual Attention model
- Graph construction algorithm
- Training and inference procedures
- BLEU score evaluation
- Attention visualizations and qualitative results
- Limitations and future scope