NeuralTalk is a Python+numpy project for learning Multimodal Recurrent Neural Networks that describe images with sentences.

Overview

#NeuralTalk

Warning: Deprecated. Hi there, this code is now quite old and inefficient, and now deprecated. I am leaving it on Github for educational purposes, but if you would like to run or train image captioning I warmly recommend my new code release NeuralTalk2. NeuralTalk2 is written in Torch and is SIGNIFICANTLY (I mean, ~100x+) faster because it is batched and runs on the GPU. It also supports CNN finetuning, which helps a lot with performance.

This project contains Python+numpy source code for learning Multimodal Recurrent Neural Networks that describe images with sentences.

This line of work was recently featured in a New York Times article and has been the subject of multiple academic papers from the research community over the last few months. This code currently implements the models proposed by Vinyals et al. from Google (CNN + LSTM) and by Karpathy and Fei-Fei from Stanford (CNN + RNN). Both models take an image and predict its sentence description with a Recurrent Neural Network (either an LSTM or an RNN).

Overview

The pipeline for the project looks as follows:

  • The input is a dataset of images and 5 sentence descriptions that were collected with Amazon Mechanical Turk. In particular, this code base is set up for Flickr8K, Flickr30K, and MSCOCO datasets.
  • In the training stage, the images are fed as input to RNN and the RNN is asked to predict the words of the sentence, conditioned on the current word and previous context as mediated by the hidden layers of the neural network. In this stage, the parameters of the networks are trained with backpropagation.
  • In the prediction stage, a witheld set of images is passed to RNN and the RNN generates the sentence one word at a time. The results are evaluated with BLEU score. The code also includes utilities for visualizing the results in HTML.

Dependencies

Python 2.7, modern version of numpy/scipy, perl (if you want to do BLEU score evaluation), argparse module. Most of these are okay to install with pip. To install all dependencies at once, run the command pip install -r requirements.txt

I only tested this code with Ubuntu 12.04, but I tried to make it as generic as possible (e.g. use of os module for file system interactions etc. So it might work on Windows and Mac relatively easily.)

Protip: you really want to link your numpy to use a BLAS implementation for its matrix operations. I use virtualenv and link numpy against a system installation of OpenBLAS. Doing this will make this code almost an order of time faster because it relies very heavily on large matrix multiplies.

Getting started

  1. Get the code. $ git clone the repo and install the Python dependencies
  2. Get the data. I don't distribute the data in the Git repo, instead download the data/ folder from here. Also, this download does not include the raw image files, so if you want to visualize the annotations on raw images, you have to obtain the images from Flickr8K / Flickr30K / COCO directly and dump them into the appropriate data folder.
  3. Train the model. Run the training $ python driver.py (see many additional argument settings inside the file) and wait. You'll see that the learning code writes checkpoints into cv/ and periodically reports its status in status/ folder.
  4. Monitor the training. The status can be inspected manually by reading the JSON and printing whatever you wish in a second process. In practice I run cross-validations on a cluster, so my cv/ folder fills up with a lot of checkpoints that I further filter and inspect with other scripts. I am including my cluster training status visualization utility as well if you like. Run a local webserver (e.g. $ python -m SimpleHTTPServer 8123) and then open monitorcv.html in your browser on http://localhost:8123/monitorcv.html, or whatever the web server tells you the path is. You will have to edit the file to setup the paths properly and point it at the right json files.
  5. Evaluate model checkpoints. To evaluate a checkpoint from cv/, run the evaluate_sentence_predctions.py script and pass it the path to a checkpoint.
  6. Visualize the predictions. Use the included html file visualize_result_struct.html to visualize the JSON struct produced by the evaluation code. This will visualize the images and their predictions. Note that you'll have to download the raw images from the individual dataset pages and place them into the corresponding data/ folder.

Lastly, note that this is currently research code, so a lot of the documentation is inside individual Python files. If you wish to work with this code, you'll have to get familiar with it and be comfortable reading Python code.

Pretrained model

Some pretrained models can be found in the NeuralTalk Model Zoo. The slightly hairy part is that if you wish to apply these models to some arbitrary new image (one not from Flickr8k/30k/COCO) you have to first extract the CNN features. I use the 16-layer VGG network from Simonyan and Zisserman, because the model is beautiful, powerful and available with Caffe. There is opportunity for putting the preprocessing and inference into a single nice function that uses the Python wrapper to get the features and then runs the pretrained sentence model. I might add this in the future.

Using the model to predict on new images

The code allows you to easily predict and visualize results of running the model on COCO/Flickr8K/Flick30K images. If you want to run the code on arbitrary image (e.g. on your file system), things get a little more complicated because we need to first need to pipe your image through the VGG CNN to get the 4096-D activations on top.

Have a look inside the folder example_images for instructions on how to do this. Currently, the code for extracting the raw features from each image is in Matlab, so you will need it installed on your system. Caffe also has a wrapper for Python, but I wasn't yet able to use the Python wrapper to exactly reproduce the features I get from Matlab. The example_images will walk you through the process, and you will eventually use predict_on_images.py to run the prediction.

Using your own data

The input to the system is the data folder, which contains the Flickr8K, Flickr30K and MSCOCO datasets. In particular, each folder (e.g. data/flickr8k) contains a dataset.json file that stores the image paths and sentences in the dataset (all images, sentences, raw preprocessed tokens, splits, and the mappings between images and sentences). Each folder additionally contains vgg_feats.mat , which is a .mat file that stores the CNN features from all images, one per row, using the VGG Net from ILSVRC 2014. Finally, there is the imgs/ folder that holds the raw images. I also provide the Matlab script that I used to extract the features, which you may find helpful if you wish to use a different dataset. This is inside the matlab_features_reference/ folder, and see the Readme file in that folder for more information.

License

BSD license.

Owner
Andrej
I like to train Deep Neural Nets on large datasets.
Andrej
Official code for "Maximum Likelihood Training of Score-Based Diffusion Models", NeurIPS 2021 (spotlight)

Maximum Likelihood Training of Score-Based Diffusion Models This repo contains the official implementation for the paper Maximum Likelihood Training o

Yang Song 84 Dec 12, 2022
Hi Guys, here I am providing examples, which will help you in Lerarning Python

LearningPython Hi guys, here I am trying to include as many practice examples of Python Language, as i Myself learn, and hope these will help you in t

4 Feb 03, 2022
To provide 100 JAX exercises over different sections structured as a course or tutorials to teach and learn for beginners, intermediates as well as experts

JaxTon 💯 JAX exercises Mission 🚀 To provide 100 JAX exercises over different sections structured as a course or tutorials to teach and learn for beg

Rohan Rao 512 Jan 01, 2023
A TensorFlow implementation of SOFA, the Simulator for OFfline LeArning and evaluation.

SOFA This repository is the implementation of SOFA, the Simulator for OFfline leArning and evaluation. Keeping Dataset Biases out of the Simulation: A

22 Nov 23, 2022
LBK 26 Dec 28, 2022
HIVE: Evaluating the Human Interpretability of Visual Explanations

HIVE: Evaluating the Human Interpretability of Visual Explanations Project Page | Paper This repo provides the code for HIVE, a human evaluation frame

Princeton Visual AI Lab 16 Dec 13, 2022
Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition

Efficient Conformer: Progressive Downsampling and Grouped Attention for Automatic Speech Recognition Official implementation of the Efficient Conforme

Maxime Burchi 145 Dec 30, 2022
Dataset Condensation with Contrastive Signals

Dataset Condensation with Contrastive Signals This repository is the official implementation of Dataset Condensation with Contrastive Signals (DCC). T

3 May 19, 2022
Neighbor2Seq: Deep Learning on Massive Graphs by Transforming Neighbors to Sequences

Neighbor2Seq: Deep Learning on Massive Graphs by Transforming Neighbors to Sequences This repository is an official PyTorch implementation of Neighbor

DIVE Lab, Texas A&M University 8 Jun 12, 2022
Adversarial-autoencoders - Tensorflow implementation of Adversarial Autoencoders

Adversarial Autoencoders (AAE) Tensorflow implementation of Adversarial Autoencoders (ICLR 2016) Similar to variational autoencoder (VAE), AAE imposes

Qian Ge 236 Nov 13, 2022
This is an easy python software which allows to sort images with faces by gender and after by age.

Gender-age Classifier This is an easy python software which allows to sort images with faces by gender and after by age. Usage First install Deepface

Claudio Ciccarone 6 Sep 17, 2022
Working demo of the Multi-class and Anomaly classification model using the CLIP feature space

👁️ Hindsight AI: Crime Classification With Clip About For Educational Purposes Only This is a recursive neural net trained to classify specific crime

Miles Tweed 2 Jun 05, 2022
Codes and pretrained weights for winning submission of 2021 Brain Tumor Segmentation (BraTS) Challenge

Winning submission to the 2021 Brain Tumor Segmentation Challenge This repo contains the codes and pretrained weights for the winning submission to th

94 Dec 28, 2022
Dense Contrastive Learning (DenseCL) for self-supervised representation learning, CVPR 2021.

Dense Contrastive Learning for Self-Supervised Visual Pre-Training This project hosts the code for implementing the DenseCL algorithm for se

Xinlong Wang 491 Jan 03, 2023
Fast, differentiable sorting and ranking in PyTorch

Torchsort Fast, differentiable sorting and ranking in PyTorch. Pure PyTorch implementation of Fast Differentiable Sorting and Ranking (Blondel et al.)

Teddy Koker 655 Jan 04, 2023
DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.

DeepSpeed+Megatron trained the world's most powerful language model: MT-530B DeepSpeed is hiring, come join us! DeepSpeed is a deep learning optimizat

Microsoft 8.4k Dec 28, 2022
A PyTorch implementation of PointRend: Image Segmentation as Rendering

PointRend A PyTorch implementation of PointRend: Image Segmentation as Rendering [arxiv] [Official Implementation: Detectron2] This repo for Only Sema

AhnDW 336 Dec 26, 2022
Repositório para arquivos sobre o Módulo 1 do curso Top Coders da Let's Code + Safra

850-Safra-DS-ModuloI Repositório para arquivos sobre o Módulo 1 do curso Top Coders da Let's Code + Safra Para aprender mais Git https://learngitbranc

Brian Nunes 7 Dec 10, 2022
The code of paper 'Learning to Aggregate and Personalize 3D Face from In-the-Wild Photo Collection'

Learning to Aggregate and Personalize 3D Face from In-the-Wild Photo Collection Pytorch implemetation of paper 'Learning to Aggregate and Personalize

Tencent YouTu Research 136 Dec 29, 2022
Official implementation of our neural-network-based fast diffuse room impulse response generator (FAST-RIR)

This is the official implementation of our neural-network-based fast diffuse room impulse response generator (FAST-RIR) for generating room impulse responses (RIRs) for a given acoustic environment.

12 Jan 13, 2022