The NIC Wrapper is a framework that augments the training set of Google NIC, the deep-learning based Neural Image Caption Generator. The wrapper is built to validate the hypothesis that extra training data sourced from Google Images may help NIC learn the MS COCO dataset better. During NIC's training we keep inserting image-caption pairs from Google as extra training samples, expecting them to correct the mistakes NIC makes.
Suppose NIC sees an image of a "cat" and describes it as a "dog". If we use the term "dog" to source extra images from Google and let NIC see the new images, NIC might realize what a dog looks like. The NIC Wrapper automates this process: in each training epoch, it sources extra training samples from Google, using the captions predicted by the latest model weights as textual queries. The model is expected to become more accurate, since it now learns how Google binds images to captions beyond the initial training set.
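In code terms, the feedback loop looks roughly like the sketch below. It is only a conceptual outline: the helper callables (generate_caption, crawl_google, train_one_epoch) are hypothetical placeholders, and the real logic lives in im2txt/train_wrapper.py and im2txt/data/build_google_data.py.

# Conceptual sketch of one NIC Wrapper training epoch. The callables passed in
# are hypothetical placeholders; the actual logic lives in im2txt/train_wrapper.py
# and im2txt/data/build_google_data.py.
from typing import Callable, List, Tuple

Image = bytes                    # raw image data (placeholder type)
Pair = Tuple[Image, str]         # an (image, caption) training pair

def run_wrapper_epoch(
    coco_pairs: List[Pair],
    generate_caption: Callable[[Image], str],         # inference with the latest weights
    crawl_google: Callable[[str, int], List[Image]],  # query Google Images
    train_one_epoch: Callable[[List[Pair]], None],    # one pass of gradient updates
    images_per_query: int = 5,
) -> None:
    # 1. Caption the COCO training images with the latest checkpoint.
    queries = [generate_caption(image) for image, _ in coco_pairs]
    # 2. Use each predicted caption as a textual query to Google Images and
    #    pair every returned image with the query that retrieved it.
    google_pairs: List[Pair] = []
    for query in queries:
        google_pairs.extend((image, query) for image in crawl_google(query, images_per_query))
    # 3. Train on COCO plus the freshly crawled pairs; the extra pairs are
    #    discarded and re-sourced at the start of the next epoch.
    train_one_epoch(coco_pairs + google_pairs)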
The following diagram illustrates the architecture of the NIC Wrapper.
We see an improvement of a few decimal points in BLEU-4 when evaluating the NIC Wrapper on the MS COCO dataset; that is, the model trained with Google images performs slightly better on the COCO validation set. A few pairs of captions generated by the models trained with and without Google images are shown below:
- TensorFlow 1.0 or greater (instructions)
- NumPy (instructions)
- Natural Language Toolkit (NLTK)
- First install NLTK (instructions)
- Then install the NLTK data (instructions)
- icrawler 0.3.6 or greater (instructions)
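Before going further, you can sanity-check the Python dependencies with a short snippet like the one below. It is only a sketch: which NLTK data packages are actually needed depends on the scripts you run, so "punkt" is just an example.

# Quick dependency check (sketch only).
import numpy
import nltk
import icrawler
import tensorflow as tf

print("TensorFlow", tf.__version__)  # should print 1.0 or greater
nltk.download("punkt")               # example NLTK data package; adjust as needed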
Follow the steps at im2txt to get the whole picture of NIC.
Clone the repository:
git clone [email protected]:LEAAN/Source-new-samples-for-NIC.git
Prepare the COCO Data. This may take a few hours.
# Location to save the MSCOCO data.
MSCOCO_DIR="${HOME}/im2txt/data/mscoco"
# Run the preprocessing script.
bash /im2txt/data/download_and_preprocess_mscoco.sh "${MSCOCO_DIR}"
Download the Inception v3 Checkpoint.
# Location to save the Inception v3 checkpoint.
INCEPTION_DIR="${HOME}/im2txt/data"
mkdir -p ${INCEPTION_DIR}
wget "http://download.tensorflow.org/models/inception_v3_2016_08_28.tar.gz"
tar -xvf "inception_v3_2016_08_28.tar.gz" -C ${INCEPTION_DIR}
rm "inception_v3_2016_08_28.tar.gz"
Train from scratch, only on the COCO training set, until the LSTM generates sentences that read like natural language. This takes around 1 million steps, nearly one week on a TITAN X (Pascal) with 12 GB of GPU RAM.
# Directory containing preprocessed MSCOCO data.
MSCOCO_DIR="${HOME}/im2txt/data/mscoco"
# Inception v3 checkpoint file.
INCEPTION_CHECKPOINT="${HOME}/im2txt/data/inception_v3.ckpt"
# Directory to save the model.
MODEL_DIR="${HOME}/im2txt/model"
# Run the training script.
python /im2txt/train.py \
--input_file_pattern="${MSCOCO_DIR}/train-?????-of-00256" \
--inception_checkpoint_file="${INCEPTION_CHECKPOINT}" \
--train_dir="${MODEL_DIR}/train" \
--train_inception=false \
--number_of_steps=1000000
Now that the captions generated by NIC read like natural language, we can feed the predicted captions of the COCO training images to Google. Images suggested by Google, together with the textual queries used to source them, are added to the COCO training set. We renew the image-caption pairs from Google every epoch, so the latest model weights only ever see the extra training data obtained at the beginning of the current epoch.
# Save a backup of the model checkpoint at step 1,000,000.
mkdir ${MODEL_DIR}/train_COCO
mv ${MODEL_DIR}/train/* ${MODEL_DIR}/train_COCO/
# Train NIC with samples from Google.
python /im2txt/train_wrapper.py \
--input_file_pattern="${MSCOCO_DIR}/train-?????-of-?????" \
--train_dir="${MODEL_DIR}/train" \
--train_inception=true \
--number_of_steps=3000000
Run the image crawler in a separate process.
# Ignore GPU devices
export CUDA_VISIBLE_DEVICES=""
# Source image caption pairs from Google.
python /im2txt/data/build_google_data.py
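Under the hood, sourcing images from Google can be done with icrawler, which is listed in the requirements. The snippet below is a minimal, standalone sketch of that idea for a single predicted caption; the query string and output directory are made-up examples, and the real build_google_data.py batches this over all predicted captions and feeds the results back into the training pipeline.

# Minimal sketch: download a handful of Google images for one predicted caption.
# The caption and output directory below are made-up examples.
from icrawler.builtin import GoogleImageCrawler

caption = "a black and white cat sitting on a wooden table"  # example predicted caption
crawler = GoogleImageCrawler(storage={"root_dir": "google_images/query_0000"})
crawler.crawl(keyword=caption, max_num=5)  # fetch up to 5 images for this query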
We compare our performance to that of the model fine-tuned on COCO only.
# Move the checkpoints of the model trained with Google images to another
# directory, keeping only the step-1,000,000 checkpoint files in train/.
mkdir ${MODEL_DIR}/train_Google
shopt -s extglob  # needed for the !(pattern) exclusion below
mv ${MODEL_DIR}/train/!(*1000000*) ${MODEL_DIR}/train_Google/
# Restart the training script with --train_inception=true.
python /im2txt/train.py \
--input_file_pattern="${MSCOCO_DIR}/train-?????-of-00256" \
--train_dir="${MODEL_DIR}/train" \
--train_inception=true \
--number_of_steps=3000000
Calculate perplexity values while train_wrapper.py or train.py is running. We evaluate the model by its perplexity during training. Since perplexity is directly tied to the loss value, we expect the perplexity on the validation set to decrease whether the model is trained with or without extra samples from Google.
# Ignore GPU devices.
export CUDA_VISIBLE_DEVICES=""
# Run the evaluation script. This will run in a loop, periodically loading the
# latest model checkpoint file and computing evaluation metrics.
python /im2txt/evaluate.py \
--input_file_pattern="${MSCOCO_DIR}/val-?????-of-00004" \
--checkpoint_dir="${MODEL_DIR}/train" \
--eval_dir="${MODEL_DIR}/eval"
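For reference, perplexity is simply the exponential of the average per-word cross-entropy loss, which is why the two quantities move together:

# Perplexity is exp(average cross-entropy per word), so a lower loss
# directly implies a lower perplexity.
import math

def perplexity(total_cross_entropy: float, total_words: float) -> float:
    return math.exp(total_cross_entropy / total_words)

# Example: an average loss of 2.0 nats per word gives a perplexity of about 7.39.
print(perplexity(2000.0, 1000.0))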
Evaluation metrics, including BLEU-4, are calculated after training is done. An example of using the COCO caption evaluation API is available at https://github.com/tylin/coco-caption/blob/master/cocoEvalCapDemo.ipynb
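A minimal use of that API looks roughly like the sketch below. It assumes the coco-caption repository (pycocotools and pycocoevalcap) is on your Python path; the two file paths are placeholders for the COCO validation annotations and the model's generated captions in the standard COCO results JSON format.

# Sketch of computing BLEU-4 (and the other COCO caption metrics) with the
# coco-caption evaluation API. The file paths are placeholders.
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

annotation_file = "annotations/captions_val2014.json"  # ground-truth captions
results_file = "captions_val2014_results.json"         # model-generated captions

coco = COCO(annotation_file)
coco_res = coco.loadRes(results_file)

coco_eval = COCOEvalCap(coco, coco_res)
coco_eval.params["image_id"] = coco_res.getImgIds()  # evaluate only captioned images
coco_eval.evaluate()

print("BLEU-4:", coco_eval.eval["Bleu_4"])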
Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge
Microsoft COCO Captions: Data Collection and Evaluation Server