PyTorch code for: Learning to Generate Grounded Visual Captions without Localization Supervision

Last update: Nov 17, 2022

Overview

Learning to Generate Grounded Visual Captions without Localization Supervision

This is the PyTorch implementation of our paper:

Learning to Generate Grounded Visual Captions without Localization Supervision
Chih-Yao Ma, Yannis Kalantidis, Ghassan AlRegib, Peter Vajda, Marcus Rohrbach, Zsolt Kira
European Conference on Computer Vision (ECCV), 2020

[arXiv] [GitHub] [Project]

10-min YouTube Video

How to start

Clone the repo recursively:

git clone --recursive git@github.com:chihyaoma/cyclical-visual-captioning.git

If you didn't clone with the --recursive flag, then you'll need to manually clone the pybind submodule from the top-level directory:

git submodule update --init --recursive
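To check whether this step is still needed, one option is to inspect the output of `git submodule status`, which prefixes each line with `-` for an uninitialized submodule, `+` when the checked-out commit differs from the recorded one, and `U` for merge conflicts. The following is a minimal sketch under those assumptions; the helper names are hypothetical and not part of this repository, and submodule paths containing spaces are not handled.

```python
import subprocess


def parse_submodule_status(output):
    """Map each submodule path to a state, based on the prefix character
    that `git submodule status` prints before the commit hash."""
    states = {"-": "uninitialized", "+": "out of sync", "U": "merge conflict"}
    result = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        # The first character is the status prefix (a space means up to date).
        prefix, fields = line[0], line[1:].split()
        # fields[0] is the commit hash, fields[1] is the submodule path.
        result[fields[1]] = states.get(prefix, "ok")
    return result


def uninitialized_submodules(repo_dir="."):
    """Hypothetical helper: list submodule paths that still need
    `git submodule update --init --recursive`."""
    completed = subprocess.run(
        ["git", "submodule", "status", "--recursive"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    parsed = parse_submodule_status(completed.stdout)
    return [path for path, state in parsed.items() if state == "uninitialized"]
```

Calling `uninitialized_submodules()` from the top-level directory of the clone should return an empty list once all submodules have been fetched.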

Installation

The proposed cyclical method can be applied directly to image and video captioning tasks.

Currently, the installation guide and our code for video captioning on the ActivityNet-Entities dataset are provided in anet-video-captioning.

Acknowledgments

Chih-Yao Ma and Zsolt Kira were partly supported by DARPA’s Lifelong Learning Machines (L2M) program, under Cooperative Agreement HR0011-18-2-0019, as part of their affiliation with Georgia Tech. We thank Chia-Jung Hsu for her valuable and artistic help with the figures.

Citation

If you find this repository useful, please cite our paper:

@inproceedings{ma2020learning,
    title={Learning to Generate Grounded Visual Captions without Localization Supervision},
    author={Ma, Chih-Yao and Kalantidis, Yannis and AlRegib, Ghassan and Vajda, Peter and Rohrbach, Marcus and Kira, Zsolt},
    booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
    year={2020},
    url={https://arxiv.org/abs/1906.00283},
}
