This project implements a research‑grade image captioning system that generates human‑like descriptions by learning not only what objects appear in an image, but also how they relate to each other.
Unlike conventional CNN‑RNN captioning pipelines, this system explicitly models object relationships using Graph Convolutional Networks (GCNs) and uses Dual Multi‑Head Attention to align visual relations with natural language.
Traditional captioning models treat an image as a flat vector:
Image → CNN → RNN → Caption
This project treats an image as a structured graph of interacting objects:
Image → Objects → Relation Graph → GCN → Dual Attention → LSTM → Caption
This enables reasoning like:
“A man in a red shirt riding a motorcycle”
instead of
“man motorcycle”
```
Input Image
    │
    ▼
VGG19 CNN (global + object features)
    │
    ▼
Faster‑RCNN Object Detection
    │
    ▼
Object‑Relationship Graph (IoU‑based)
    │
    ▼
Graph Convolution Network (2 layers)
    │
    ▼
Edge Readout (2048‑D)
    │
    ├─ Global CNN features (2048‑D)
    ▼
Concatenation → 4096‑D Encoder Embedding
    │
    ▼
Dual Attention (Self + Cross)
    │
    ▼
LSTM Decoder
    │
    ▼
Caption
```
Two pretrained networks are used:
| Model | Purpose |
|---|---|
| VGG19 | Extracts spatial feature maps |
| Faster‑RCNN (ResNet‑50‑FPN) | Detects objects and bounding boxes |
VGG19 outputs spatial feature maps of shape `(batch, 49, 2048)`; the 49 spatial regions correspond to different parts of the image.
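A minimal sketch of this extraction, assuming torchvision's pretrained VGG19 conv stack (which natively yields a 7×7×512 map for a 224×224 input) followed by a hypothetical linear projection up to the 2048‑D width used here:

```python
import torch.nn as nn
from torchvision import models

class GlobalFeatureExtractor(nn.Module):
    """Sketch: VGG19 conv features -> 49 spatial regions -> 2048-D projection (assumed)."""
    def __init__(self, out_dim=2048):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT)
        self.backbone = vgg.features             # conv layers only, no classifier head
        self.project = nn.Linear(512, out_dim)   # 512 -> 2048 projection (assumption)

    def forward(self, images):                    # images: (B, 3, 224, 224)
        fmap = self.backbone(images)              # (B, 512, 7, 7)
        fmap = fmap.flatten(2).transpose(1, 2)    # (B, 49, 512)
        return self.project(fmap)                 # (B, 49, 2048)
```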
Each detected object becomes a graph node.
Edges are created when two objects overlap beyond an IoU threshold.
If no objects are detected, the entire image becomes a single node with a self‑loop.
Each node stores VGG19 features extracted from the cropped object region.
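A minimal sketch of the graph construction, assuming `torchvision.ops.box_iou` for the overlap test and DGL for the graph object; the IoU threshold value is illustrative:

```python
import dgl
import torch
from torchvision.ops import box_iou

def build_relation_graph(boxes, node_feats, iou_threshold=0.2):
    """boxes: (N, 4) xyxy tensor; node_feats: (N, D) VGG19 features of the crops.
    When nothing is detected, the caller passes the whole image as the single
    region, so N == 1 and the graph reduces to one node with a self-loop."""
    iou = box_iou(boxes, boxes)
    iou.fill_diagonal_(0)                     # ignore trivial overlap of a box with itself
    src, dst = torch.nonzero(iou > iou_threshold, as_tuple=True)
    g = dgl.graph((src, dst), num_nodes=boxes.size(0))
    g = dgl.add_self_loop(g)                  # keeps isolated objects (and the 1-node case) connected
    g.ndata["h"] = node_feats
    return g
```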
Two graph convolution layers propagate relational information between objects:
H1 = ReLU(GCN1(G, H))
H2 = ReLU(GCN2(G, H1))
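A sketch of these two layers using DGL's `GraphConv` (keeping the hidden width at the 2048‑D node-feature size is an assumption):

```python
import torch.nn as nn
import torch.nn.functional as F
from dgl.nn import GraphConv

class RelationGCN(nn.Module):
    def __init__(self, in_dim=2048, hidden_dim=2048):
        super().__init__()
        self.gcn1 = GraphConv(in_dim, hidden_dim, allow_zero_in_degree=True)
        self.gcn2 = GraphConv(hidden_dim, hidden_dim, allow_zero_in_degree=True)

    def forward(self, g, h):
        h1 = F.relu(self.gcn1(g, h))   # H1 = ReLU(GCN1(G, H))
        h2 = F.relu(self.gcn2(g, h1))  # H2 = ReLU(GCN2(G, H1))
        return h2
```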
For each edge (u → v):
EdgeFeature = concat(H2[u], H2[v])
All edge features are passed through a linear layer and mean‑pooled to produce a 2048‑D graph embedding.
This is concatenated with 2048‑D CNN features → 4096‑D encoder output.
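A sketch of the edge readout and the final concatenation; treating the global CNN vector as a single pooled 2048‑D feature is an assumption:

```python
import torch
import torch.nn as nn

class EdgeReadout(nn.Module):
    def __init__(self, node_dim=2048, out_dim=2048):
        super().__init__()
        self.edge_proj = nn.Linear(2 * node_dim, out_dim)

    def forward(self, g, h2, global_feat):
        """h2: (N, 2048) GCN outputs; global_feat: (2048,) pooled CNN features."""
        src, dst = g.edges()
        edge_feat = torch.cat([h2[src], h2[dst]], dim=-1)   # concat(H2[u], H2[v]) per edge
        edge_feat = self.edge_proj(edge_feat)                # (E, 2048) after linear layer
        graph_emb = edge_feat.mean(dim=0)                    # mean-pool over edges -> (2048,)
        return torch.cat([graph_emb, global_feat], dim=-1)   # 4096-D encoder embedding
```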
The decoder uses Dual Multi‑Head Attention:
- Self‑attention models word‑to‑word dependencies inside the generated caption.
- Cross‑attention aligns each generated word with image‑graph features.
- Both attention contexts are combined to predict the next word.
(Self‑Attention + Cross‑Attention) → LSTM → Softmax → Next Word
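A sketch of one decoding step built from PyTorch's `nn.MultiheadAttention` and an `LSTMCell`; the dimensions and the exact way the two contexts feed the LSTM are assumptions:

```python
import torch
import torch.nn as nn

class DualAttentionDecoderStep(nn.Module):
    def __init__(self, embed_dim=512, vis_dim=4096, vocab_size=5000, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.vis_proj = nn.Linear(vis_dim, embed_dim)
        self.lstm = nn.LSTMCell(2 * embed_dim, embed_dim)
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, word_embs, vis_feats, state):
        """word_embs: (B, t, E) words generated so far; vis_feats: (B, R, 4096)."""
        q = word_embs[:, -1:, :]                                  # current word as query
        self_ctx, _ = self.self_attn(q, word_embs, word_embs)     # word-to-word context
        vis = self.vis_proj(vis_feats)
        cross_ctx, attn_w = self.cross_attn(self_ctx, vis, vis)   # word-to-image context
        lstm_in = torch.cat([self_ctx, cross_ctx], dim=-1).squeeze(1)
        h, c = self.lstm(lstm_in, state)
        return self.out(h), (h, c), attn_w                        # logits, new state, attention map
```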
Attention maps can be visualized to show which image regions influence each word.
Flickr8k
- 8,000 images
- 5 captions per image
(Designed to scale to Flickr30k / MS‑COCO.)
Images
Resize → 224×224 → Normalize
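A sketch of this pipeline, assuming the standard ImageNet normalization statistics used with pretrained VGG19:

```python
from torchvision import transforms

image_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics (assumed)
                         std=[0.229, 0.224, 0.225]),
])
```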
Captions
- Tokenized using spaCy
- Lowercased
- Vocabulary built with frequency threshold
- `<start>` and `<end>` tokens added
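A minimal sketch of the tokenization and vocabulary build; the special‑token names and frequency threshold are assumptions:

```python
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def build_vocab(captions, min_freq=5):
    """captions: list of raw caption strings -> {token: index} mapping."""
    counts = Counter()
    for cap in captions:
        counts.update(tok.text.lower() for tok in nlp.tokenizer(cap))
    itos = ["<pad>", "<start>", "<end>", "<unk>"]            # special tokens first
    itos += [w for w, c in counts.items() if c >= min_freq]  # frequency threshold
    return {w: i for i, w in enumerate(itos)}
```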
For each image (see the detection sketch after this list):
- Faster‑RCNN detects objects
- Object regions are cropped
- VGG19 features are extracted from each crop
- An IoU‑based graph is built
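A sketch of the detection step using torchvision's pretrained Faster‑RCNN; the score threshold is illustrative, and the boxes it returns would then be cropped, encoded with VGG19, and handed to the graph builder sketched earlier:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def detect_objects(image_tensor, score_threshold=0.5):
    """image_tensor: (3, H, W) float tensor in [0, 1]; returns kept xyxy boxes."""
    pred = detector([image_tensor])[0]
    keep = pred["scores"] > score_threshold
    return pred["boxes"][keep]
```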
At timestep t:
Previous words → Self Attention
Image relations → Cross Attention
→ LSTM → Predict next word
- Loss: Cross‑Entropy
- Optimizer: Adam
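A sketch of how the loss and optimizer come together in one teacher‑forced training step; the call signature of `captioner`, the pad index, and the learning rate are assumptions:

```python
import torch.nn as nn
import torch.optim as optim

def train_step(captioner, optimizer, images, graphs, captions, pad_idx=0):
    """captions: (B, T) word indices including <start>/<end>; teacher forcing."""
    criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
    logits = captioner(images, graphs, captions[:, :-1])     # predict the next word at each step
    loss = criterion(logits.reshape(-1, logits.size(-1)),    # (B*(T-1), vocab)
                     captions[:, 1:].reshape(-1))            # shifted ground-truth targets
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```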
| Metric | Score |
|---|---|
| BLEU‑1 | ≈ 0.55 |
| BLEU‑2 | ≈ 0.33 |
The GCN + Dual Attention model significantly outperforms the Bi‑LSTM baseline and produces captions that better capture object interactions, colors, and actions.
- PyTorch
- VGG19
- Faster‑RCNN
- DGL (Graph Learning)
- Multi‑Head Attention
- LSTM
- spaCy
- Larger datasets (MS‑COCO, Flickr30k)
- Scene‑graph supervision
- Transformer‑based decoders
- Custom‑trained object detectors
The complete technical report (methodology, algorithms, experiments, and results) is available here:
👉 Project Report (PDF):
Download / View Full Documentation
This document contains:
- Mathematical formulation of the GCN + Dual Attention model
- Graph construction algorithm
- Training and inference procedures
- BLEU score evaluation
- Attention visualizations and qualitative results
- Limitations and future scope