Image Descriptor with Visual Attention Mechanism Using Long Short-term Memory

This project addresses image captioning. Following most state-of-the-art approaches, we use a combined convolutional and recurrent neural network (CNN-RNN) architecture: a CNN extracts features from the image, and an RNN then generates a caption from those features. To address the problem of objects missing from the predicted text, we append an attention network that forces the visual features to be considered at each time step. We experiment with various CNN configurations and use only LSTM for the RNN. Throughout this project, we evaluate our networks on the MS COCO dataset.
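
At a high level, the pipeline looks like the following PyTorch sketch. The class and parameter names (EncoderCNN, AttentionDecoder, feat_dim, ...) are illustrative assumptions, not the repository's actual API; see src/model.py for the real implementation.

import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """Extract a grid of spatial features with a pretrained CNN."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet101(pretrained=True)
        # Drop the average pool and fc head to keep the 7x7 feature map.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, images):                    # (B, 3, 224, 224)
        feats = self.backbone(images)             # (B, 2048, 7, 7)
        return feats.flatten(2).transpose(1, 2)   # (B, 49, 2048)

class AttentionDecoder(nn.Module):
    """LSTM decoder that re-attends to the visual features at every step."""
    def __init__(self, feat_dim=2048, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = nn.Linear(feat_dim + hidden_dim, 1)   # score per spatial location
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def step(self, word, feats, h, c):
        # Score each of the 49 spatial locations against the hidden state,
        # then mix the features into a single context vector.
        expanded = h.unsqueeze(1).expand(-1, feats.size(1), -1)
        alpha = torch.softmax(self.attn(torch.cat([feats, expanded], dim=2)), dim=1)
        context = (alpha * feats).sum(dim=1)              # (B, feat_dim)
        h, c = self.lstm(torch.cat([self.embed(word), context], dim=1), (h, c))
        return self.fc(h), h, c                           # logits over the vocabulary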

Dataset

  • COCO: COCO is a large-scale object detection, segmentation, and captioning dataset.
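
Captions are read through the official COCO API (installed in step 2 below). A short example, assuming the annotation layout produced by download_dataset.sh:

from pycocotools.coco import COCO

coco = COCO('data/annotations/captions_train2014.json')
img_id = coco.getImgIds()[0]                 # first image id
ann_ids = coco.getAnnIds(imgIds=img_id)      # ids of its caption annotations
for ann in coco.loadAnns(ann_ids):
    print(ann['caption'])                    # typically ~5 reference captions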

Testing Environment

  • PyTorch version: 1.1.0
  • CUDA version: 9.0.176
  • Python version: 3.6.8
  • CPU: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
  • GPU: GeForce GTX 1080 Ti (11172MB GRAM)
  • RAM: 32GB

Code Organization

download_dataset.sh      --  Download the COCO dataset, including images and captions
src/build_vocab.py       --  Build a vocabulary wrapper of COCO dataset captions
src/data.py              --  Module for preprocessing the images
src/demo_training.ipynb  --  Run a demo of training a model
src/demo_testing.ipynb   --  Run a demo of testing a model
src/main.py              --  Main entry point that you can run from the terminal
src/model.py             --  Models of CNN and RNN
src/resize.py            --  Module for resizing the images
src/utils.py             --  Useful functions and our ImageDescriptor model
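
The vocabulary wrapper built by build_vocab.py is essentially a bidirectional word-id mapping. A minimal sketch of the idea (the actual class in src/build_vocab.py may differ in detail):

class Vocabulary:
    """Map tokens to integer ids and back; unknown words fall back to <unk>."""
    def __init__(self):
        self.word2idx, self.idx2word = {}, {}
        for token in ('<pad>', '<start>', '<end>', '<unk>'):
            self.add_word(token)

    def add_word(self, word):
        if word not in self.word2idx:
            self.idx2word[len(self.word2idx)] = word
            self.word2idx[word] = len(self.word2idx)

    def __call__(self, word):
        return self.word2idx.get(word, self.word2idx['<unk>'])

    def __len__(self):
        return len(self.word2idx)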

Usage

1. Install required packages

pip install -r requirements.txt --user  

2. Install COCO API

git clone https://github.com/pdollar/coco.git
cd coco/PythonAPI/
make
python setup.py build
python setup.py install --user

cd ../../
git clone https://github.com/lychengr3x/Image-Descriptor.git
cd Image-Descriptor

3. Download Dataset

If you want to use the preprocessed dataset, you can skip this step.

chmod +x download_dataset.sh
./download_dataset.sh

4. Preprocess the data

You can do it from scratch:

cd src
# training set
python build_vocab.py  
python resize.py

# validation set
python build_vocab.py --caption_path='../data/annotations/captions_val2014.json' --vocab_path='../data/vocab_val.pkl'
python resize.py --image_dir='../data/val2014/'

Alternatively, simply download the preprocessed dataset:

  • annotations: This directory includes two files, captions_train2014.json and captions_val2014.json. (link)

  • vocab: This includes the vocabularies of the training and validation sets, vocab.pkl and vocab_val.pkl. (link)

  • resized2014: This directory includes all resized images (256x256) of the training and validation sets. (link)
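
For reference, the core of the resize step that produces those 256x256 images can be as small as the following Pillow sketch; the function and default paths here are illustrative, not resize.py's actual interface:

import os
from PIL import Image

def resize_images(image_dir, output_dir, size=(256, 256)):
    os.makedirs(output_dir, exist_ok=True)
    for name in os.listdir(image_dir):
        with Image.open(os.path.join(image_dir, name)) as img:
            img = img.resize(size, Image.LANCZOS).convert('RGB')
            img.save(os.path.join(output_dir, name))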

5. Train the model in the background and save a log file

It takes around 30 minutes.

# no attention layer
nohup python main.py --mode='train' > log.txt &  

# with attention layer
nohup python main.py --mode='train' --attention=True > log.txt &  

How to specify a model:

Take resnet152 as an example: assign --encoder=resnet and --encoder_ver=152.
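
One plausible way those two flags map onto a torchvision constructor (an illustration of the flag semantics, not necessarily the repository's code):

import torchvision.models as models

def build_backbone(encoder='resnet', encoder_ver=152):
    name = '{}{}'.format(encoder, encoder_ver)    # e.g. 'resnet152'
    return getattr(models, name)(pretrained=True)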

  • Here are two trained resnet101 models: resnet101-epoch-7.ckpt (link), resnet101-epoch-15.ckpt (link)

  • Here is a demo that shows how to train in a Jupyter notebook: demo_training.ipynb (link)

6. Evaluate the model

To get a caption for a specific image:

# no attention layer
python main.py --mode=test --encoder=resnet --encoder_ver=101 --image_path=../png/example.png --model_dir=../models --checkpoint=resnet101-epoch-7.ckpt

# with attention layer
python main.py --mode=test --encoder=resnet --encoder_ver=101 --attention=True --image_path=../png/example.png --model_dir=../models --checkpoint=resnet101-epoch-7.ckpt
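
Under the hood, test mode generates the caption one word at a time: start from <start>, attend to the image features, and emit the most likely next word until <end>. A greedy-decoding sketch, in terms of the hypothetical AttentionDecoder and Vocabulary classes sketched earlier:

import torch

@torch.no_grad()
def greedy_caption(decoder, feats, vocab, max_len=20):
    # feats: (B, 49, feat_dim) spatial features from the encoder.
    h = feats.new_zeros(feats.size(0), decoder.lstm.hidden_size)
    c = torch.zeros_like(h)
    word = torch.full((feats.size(0),), vocab('<start>'),
                      dtype=torch.long, device=feats.device)
    caption = []
    for _ in range(max_len):
        logits, h, c = decoder.step(word, feats, h, c)
        word = logits.argmax(dim=1)               # greedy: take the top word
        caption.append(vocab.idx2word[word[0].item()])
        if caption[-1] == '<end>':
            break
    return ' '.join(caption)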

To get the validation-set loss at a specific epoch (run in the background; it takes around 20 minutes):

# no attention layer
nohup python main.py --mode=val --encoder=resnet --encoder_ver=101 --model_dir=../models --checkpoint=epoch-7.ckpt > val_loss.txt &

# with attention layer
nohup python main.py --mode=val --encoder=resnet --encoder_ver=101 --attention=True --model_dir=../models --checkpoint=epoch-7.ckpt > val_loss_att.txt &

  • Here is a demo that shows how to test in a Jupyter notebook: demo_testing.ipynb (link)

File arrangement

If you want to re-run demo_testing.ipynb (link) directly, make sure you download the files from the links above and put them in the right places, as shown below. In addition, the COCO API must be installed as shown in step 2.

.
|--- png/
     |--- example.png
     |--- test_01_resize.jpg
     |--- test_02_resize.jpg
     |--- test_03_resize.jpg
     |--- test_04_resize.jpg
|--- src/
     |--- build_vocab.py
     |--- data.py
     |--- demo_training.ipynb
     |--- demo_testing.ipynb
     |--- main.py
     |--- model.py
     |--- resize.py
     |--- utils.py
|--- models/
     |--- config-resnet101.txt
     |--- resnet101-epoch-7.ckpt
     |--- resnet101-epoch-15.ckpt
|--- data/
     |--- resized2014/
     |--- annotations/
          |--- captions_train2014.json
          |--- captions_val2014.json
     |--- vocab.pkl
     |--- vocab_val.pkl
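
A quick sanity check, run from the repository root, that the downloads landed where the notebook expects them (paths taken from the tree above):

import os

for path in ('models/resnet101-epoch-7.ckpt', 'data/vocab.pkl',
             'data/annotations/captions_val2014.json'):
    print(path, 'OK' if os.path.exists(path) else 'MISSING')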
