Empirical Study of Transformers for Source Code & A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code

Last update: Nov 15, 2022

Overview

Transformers for variable misuse, function naming and code completion tasks

The official PyTorch implementation of:

Empirical Study of Transformers for Source Code [arxiv] (accepted to ESEC/FSE'21)
A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code [arxiv] (accepted to NAACL'21)

The repository also contains code for resplitting Python150k and JavaScript150k datasets (with splitting by repository, removing duplicates and the redistributable version of Py150k).

Repository structure

data_utils: scripts for downloading Python150k and JavaScript150k datasets and obtaining new train / val / test splits (with splitting by repository, removing duplicates and the redistributable version of Py150k)
vm_fn: code for Variable Misuse (VM) and Function Naming (FN) tasks (additional preprocessing, models, training etc)
cc: code for Code Completion (CC) task (additional preprocessing, models, training etc)

See README in each directory for details.

Run

The code was tested on a system with Linux 3.10.0. Experiments were run using a Tesla V100 GPU. Required libraries are listed in requirments.txt in VM_FN and CC directories. The implementation is based on PyTorch>=1.5.

Running experiments:

Download and resplit data, see data_utils for details;
Preprocess data for a task you are interested in (VM, FN or CC), see vm_fn or cc for details;
Run the experiment you are interested in, see vm_fn or cc for details.

Attribution

Parts of this code are based on the following repositories:

Citation

If you found this code useful, please cite our papers

@misc{chirkova2020empirical,
      title={Empirical Study of Transformers for Source Code}, 
      author={Nadezhda Chirkova and Sergey Troshin},
      year={2020},
      eprint={2010.07987},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

@inproceedings{chirkova2020simple,
      title={A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code}, 
      author={Nadezhda Chirkova and Sergey Troshin},
      booktitle={North American Chapter of the Association for Computational Linguistics}
      year={2021}, 
}

Empirical Study of Transformers for Source Code & A Simple Approach for Handling Out-of-Vocabulary Identifiers in Deep Learning for Source Code

Related tags

Overview

Transformers for variable misuse, function naming and code completion tasks

Repository structure

Run

Attribution

Citation

Owner

Bayesian Methods Research Group

[CVPR'2020] DeepDeform: Learning Non-rigid RGB-D Reconstruction with Semi-supervised Data

Synthesizing Long-Term 3D Human Motion and Interaction in 3D in CVPR2021

TorchGRL is the source code for our paper Graph Convolution-Based Deep Reinforcement Learning for Multi-Agent Decision-Making in Mixed Traffic Environments for IV 2022.

Classify bird species based on their songs using SIamese Networks and 1D dilated convolutions.

Tensorflow 2.x based implementation of EDSR, WDSR and SRGAN for single image super-resolution

MACE is a deep learning inference framework optimized for mobile heterogeneous computing platforms.

The Curious Layperson: Fine-Grained Image Recognition without Expert Labels (BMVC 2021)

Pywonderland - A tour in the wonderland of math with python.

Megaverse is a new 3D simulation platform for reinforcement learning and embodied AI research

Pythonic particle-based (super-droplet) warm-rain/aqueous-chemistry cloud microphysics package with box, parcel & 1D/2D prescribed-flow examples in Python, Julia and Matlab

DeLiGAN - This project is an implementation of the Generative Adversarial Network

Magic tool for managing internet connection in local network by @zalexdev

Styled Handwritten Text Generation with Transformers (ICCV 21)

PyTorch Personal Trainer: My framework for deep learning experiments

catch-22: CAnonical Time-series CHaracteristics

Keras implementation of PersonLab for Multi-Person Pose Estimation and Instance Segmentation.

A curated list of the latest breakthroughs in AI (in 2021) by release date with a clear video explanation, link to a more in-depth article, and code.

1st Solution For NeurIPS 2021 Competition on ML4CO Dual Task

Speeding-Up Back-Propagation in DNN: Approximate Outer Product with Memory

Non-Metric Space Library (NMSLIB): An efficient similarity search library and a toolkit for evaluation of k-NN methods for generic non-metric spaces.