TensorFlow code for the neural network presented in the paper: "Structural Language Models of Code" (ICML'2020)

Overview

SLM: Structural Language Models of Code

This is an official implementation of the model described in:

"Structural Language Models of Code" [PDF]

To appear in ICML'2020.

An online demo is available at https://AnyCodeGen.org.

This repository currently contains the dataset and the data extractor that we used to create the Java dataset in the paper. The TensorFlow code will be released soon.

Feel free to open a new issue for any question. We always respond quickly.

Table of Contents

  • Requirements
  • Download our preprocessed Java-small dataset
  • Creating and preprocessing a new Java dataset
  • Datasets
  • Querying the Trained Model
  • Citation

Requirements

  • python3
  • TensorFlow 1.13 or newer (install). To check TensorFlow version:

python3 -c 'import tensorflow as tf; print(tf.__version__)'

Download our preprocessed Java-small dataset

This dataset contains ~1.3M examples (1.1GB).

mkdir data
cd data
wget https://codegen-slm.s3.us-east-2.amazonaws.com/data/java-small-preprocessed.tar.gz
tar -xvzf java-small-preprocessed.tar.gz

This will create a data/java-small/ sub-directory, containing the files that hold training, test and validation sets, a dict file for various dataset properties and histograms, and a grammar file that is used during beam search to distinguish between terminal and non-terminal nodes.

Creating and preprocessing a new Java dataset

To create and preprocess a new dataset (for example, to compare SLM to a new model on another dataset):

  • Edit the file preprocess.sh using the instructions there, pointing it to the correct training, validation and test directories.
  • Run the preprocess.sh file:

bash preprocess.sh

Datasets

Java

To download the Java-small dataset as raw *.java files, use:

To download the preprocessed dataset, use:

wget https://codegen-slm.s3.us-east-2.amazonaws.com/data/java-small-preprocessed.tar.gz

To download the dataset in a tokenized format that can be used in seq2seq models (for example, with OpenNMT-py), use:

The following JSON files are created by the JavaExtractor; the preprocessed and seq2seq files above are derived from these JSON files:

Every line is a JSON object that contains the following fields: num_targets, num_nodes, targets, is_token, target_child_id, internal_paths, relative_paths, head_paths, head_root_path, head_child_id, linearized_tree, filepath, left_context, right_context, target_seq, line.
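As a minimal sketch of how to inspect these files (the file name below is hypothetical; substitute the actual JSON file produced by the JavaExtractor), each line can be loaded in Python like this:

import json

# Hypothetical path -- replace with the actual JSON file created by the JavaExtractor.
json_path = "data/java-small/train.json"

with open(json_path, "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        example = json.loads(line)
        # Each line is a single example; print a few of the documented fields.
        print(example["num_targets"], example["num_nodes"], example["target_seq"])
        if i == 2:  # look at the first few examples only
            break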

C#

The C# dataset that we used in the paper was created from the raw (*.cs files) dataset of Allamanis et al., 2018, which can be found here: https://aka.ms/iclr18-prog-graphs-dataset.

To extract examples from the C# files, we modified the data extraction code of Brockschmidt et al., 2019: https://github.com/microsoft/graph-based-code-modelling/.

Querying the Trained Model

To query the trained model, use the following API, where MYCODE is the code snippet to complete, with two question marks (??) marking the "hole" that the model should fill:

curl -X POST https://w0w3uc4a63.execute-api.us-east-1.amazonaws.com/prod/predict -d '{"code": "MYCODE"}'

For example:

curl -X POST https://w0w3uc4a63.execute-api.us-east-1.amazonaws.com/prod/predict -d '{"code": "public static Path[] stat2Paths(FileStatus[] stats) {  if (stats == null) return null;  Path[] ret = new Path[stats.length]; for (int i = 0; i < stats.length; ++i) { ret[i] = ??; } return ret; }"}'
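The same request can also be sent from Python. This is only a sketch of the curl call above using the standard library; the response format is not documented here, so the raw body is printed as-is:

import json
import urllib.request

# The code snippet to complete, with ?? marking the hole (same example as the curl call above).
code = ("public static Path[] stat2Paths(FileStatus[] stats) { "
        "if (stats == null) return null; "
        "Path[] ret = new Path[stats.length]; "
        "for (int i = 0; i < stats.length; ++i) { ret[i] = ??; } "
        "return ret; }")

request = urllib.request.Request(
    "https://w0w3uc4a63.execute-api.us-east-1.amazonaws.com/prod/predict",
    data=json.dumps({"code": code}).encode("utf-8"),
    method="POST",
)
with urllib.request.urlopen(request) as response:
    # Print the raw response body; parse it as JSON if the service returns JSON.
    print(response.read().decode("utf-8"))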

Citation

Structural Language Models of Code

@article{alon2019structural,
  title={Structural Language Models of Code},
  author={Alon, Uri and Sadaka, Roy and Levy, Omer and Yahav, Eran},
  journal={arXiv preprint arXiv:1910.00577},
  year={2019}
}