This study investigates whether comparative learning can serve as an effective alternative to regression-based learning for image-caption rating.
Instead of training directly on numeric human ratings, the comparative model learns from pairwise judgments indicating which of two image-caption pairs is the better match.
Each image-caption pair is represented using multimodal embeddings:
- Image encoder: ResNet-50 (2048 dimensions)
- Caption encoder: all-MiniLM-L6-v2 (384 dimensions)
- Concatenated representation: 2432 dimensions
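Concretely, the joint representation is the concatenation of the two encoder outputs. A minimal sketch with dummy feature vectors standing in for the real ResNet-50 and MiniLM outputs (the helper name is illustrative, not the repo's API):

```python
import numpy as np

def concat_pair_embedding(img_vec, cap_vec):
    # Joint image-caption representation: 2048-d image features
    # followed by the 384-d caption embedding -> 2432-d vector.
    return np.concatenate([img_vec, cap_vec])

# Dummy stand-ins for the real encoder outputs
img_vec = np.random.rand(2048).astype(np.float32)   # ResNet-50 pooled features
cap_vec = np.random.rand(384).astype(np.float32)    # all-MiniLM-L6-v2 embedding
pair_vec = concat_pair_embedding(img_vec, cap_vec)  # shape: (2432,)
```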
Two learning paradigms are implemented:

**Regression model**
- Trained on normalized human ratings
- Optimized with ranking-penalized MAE
- Evaluated using MSE, MAE, Pearson ρ, and Spearman rs

**Comparative model**
- Trained on pairwise preference judgments
- Uses hinge loss on score differences
- Learns relative utility ordering instead of absolute scores
Both models share the same dual-encoder multimodal architecture.
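The comparative objective can be sketched as a hinge loss on the difference between the two pairs' predicted scores; the margin value here is an assumption, and the repo's `pairwise_model.py` may differ in detail:

```python
import numpy as np

def pairwise_hinge_loss(score_pref, score_other, margin=1.0):
    # Hinge loss on score differences: zero once the preferred pair
    # outscores the other pair by at least `margin`, linear penalty
    # otherwise. The margin value is an assumption.
    diff = np.asarray(score_pref) - np.asarray(score_other)
    return float(np.mean(np.maximum(0.0, margin - diff)))

# Toy usage: preferred pairs should score higher than their counterparts
loss = pairwise_hinge_loss([2.0, 1.5], [0.5, 1.4])  # -> 0.45
```

Because only score differences enter the loss, the model learns a relative ordering of pairs rather than calibrated absolute ratings.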
Experiments are conducted on the Validated Image Caption Rating (VICR) dataset:
- 15,646 image-caption pairs
- 68,217 human ratings
- Ratings from 1 to 5
Ratings are normalized to [0, 1] for training.
An 80/20 train-test split is applied.
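The preprocessing steps above can be sketched as follows; the random seed and the exact split procedure are assumptions, not taken from the repo:

```python
import numpy as np

def normalize_ratings(ratings, lo=1.0, hi=5.0):
    # Map the 1-5 human ratings onto [0, 1] for training.
    r = np.asarray(ratings, dtype=np.float64)
    return (r - lo) / (hi - lo)

def split_indices(n, test_frac=0.2, seed=0):
    # 80/20 train-test split over shuffled example indices
    # (seed and shuffling strategy are assumptions).
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    cut = int(round(n * (1 - test_frac)))
    return idx[:cut], idx[cut:]
```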
Sample data is included in the data/ directory.
comparative_image_caption/
│
├── code/
│ ├── generate_embedding.py
│ ├── regression_model.py
│ ├── pairwise_model.py
│ ├── caption_ranking.py
│ ├── comparative_acc.py
│ ├── comparative_interrater_agreement.py
│ ├── task1_agreement_metrics.py
│ ├── compute_averages.py
│ └── checkpoint/
│
├── data/
│
├── docs/
│ ├── regression_framework.png
│ └── comparative_framework.png
│
├── results/
│
└── README.md
- **generate_embedding.py**: Generates multimodal embeddings using ResNet-50 and MiniLM.
- **regression_model.py**: Trains and evaluates the regression baseline.
- **pairwise_model.py**: Trains the comparative learning model using hinge loss.
- **caption_ranking.py**: Evaluates same-image caption preference modeling.
- **comparative_acc.py**: Computes correlation metrics and observed agreement.
- Human agreement analysis: **task1_agreement_metrics.py**, **comparative_interrater_agreement.py**, **compute_averages.py**
Regression model:
- MSE = 0.0447
- MAE = 0.1593
- Pearson ρ = 0.7609
- Spearman rs = 0.7089

Comparative model:
- Pearson ρ = 0.6758
- Spearman rs = 0.6914
- Observed agreement (po) = 0.8625

Same-image caption ranking:
- Pearson ρ = 0.7074
- Spearman rs = 0.7367
Comparative learning approaches regression performance while relying only on relative preference supervision.
Eight annotators completed direct rating and comparative tasks.
Inter-rater agreement:
| Task | po | κ |
|---|---|---|
| Direct Rating | 0.85 | 0.69 |
| Pairwise (Different Images) | 0.95 | 0.85 |
| Same-Image Comparison | 0.90 | 0.78 |
Comparative judgments consistently demonstrate higher inter-rater reliability than direct numeric ratings.
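For a pair of raters, the two statistics in the table can be computed as observed agreement p_o and Cohen's κ. A sketch of the standard definitions (with eight annotators, the study may instead average pairwise values or use a multi-rater statistic such as Fleiss' κ; the exact aggregation is not stated here):

```python
import numpy as np

def observed_agreement(a, b):
    # p_o: fraction of items on which the two raters agree.
    a, b = np.asarray(a), np.asarray(b)
    return float(np.mean(a == b))

def cohens_kappa(a, b):
    # kappa = (p_o - p_e) / (1 - p_e), where p_e is the chance
    # agreement implied by each rater's marginal label frequencies.
    a, b = np.asarray(a), np.asarray(b)
    p_o = np.mean(a == b)
    p_e = sum(np.mean(a == l) * np.mean(b == l) for l in np.union1d(a, b))
    return float((p_o - p_e) / (1.0 - p_e))
```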

