Implementation of Natural Language Code Search in the project CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

Last update: Jul 01, 2022

Overview

CodeBERT-Implementation

This repo replicates the paper CodeBERT: A Pre-Trained Model for Programming and Natural Languages.
We are specifically interested in evaluating CodeBERT on natural language code search. Given a natural language query as input, the objective of code search is to find the most semantically related code in a collection of code snippets.
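As a rough illustration of the task, code search can be viewed as scoring every candidate snippet against the query and returning the candidates in descending score order. The sketch below is not the CodeBERT model: `rank_candidates` and `token_overlap` are hypothetical names, and the toy word-overlap scorer only stands in for the fine-tuned classifier's relevance score.

```python
from typing import Callable, List, Tuple


def rank_candidates(
    query: str,
    candidates: List[str],
    score: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Return (code, score) pairs sorted from most to least relevant."""
    scored = [(code, score(query, code)) for code in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def token_overlap(query: str, code: str) -> float:
    """Toy stand-in scorer: fraction of query words found in the code text."""
    words = query.lower().split()
    code_lower = code.lower()
    return sum(w in code_lower for w in words) / max(len(words), 1)


snippets = [
    "def add(a, b): return a + b",
    "def read_file(path): return open(path).read()",
    "def reverse(items): return items[::-1]",
]
ranking = rank_candidates("read a file from path", snippets, token_overlap)
best_snippet = ranking[0][0]
print(best_snippet)
```

In the actual pipeline the scorer would be the fine-tuned model's output for each (query, code) pair; only the ranking step is shown here.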

This code was run on a 64-bit Windows system with 8 GB of RAM and a GeForce GTX 1650 graphics card with 4 GB of memory.

Due to limited computational power, we trained and evaluated the model on a much smaller subset of the original data.

Language	Training (original)	Training (ours)	Validation (original)	Validation (ours)	Test, batch_0 (original)	Test, batch_0 (ours)
Ruby	97580	500	4417	100	1000000	20000
Go	635653	500	28483	100	1000000	20000
PHP	1047404	500	52029	100	1000000	20000
Python	824342	500	46213	100	1000000	20000
Java	908886	500	30655	100	1000000	20000
JavaScript	247773	500	16505	100	1000000	20000
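One simple way to produce a reduced file of this size is to keep the first N lines of the full file. The sketch below assumes that approach; the repo does not state how its subsets were selected, and `write_subset` and the input name `train.txt` are hypothetical.

```python
import itertools
import os
import tempfile


def write_subset(src_path, dst_path, n_lines):
    """Copy the first n_lines lines of src_path to dst_path; return lines written."""
    written = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in itertools.islice(src, n_lines):
            dst.write(line)
            written += 1
    return written


# Demo on a throwaway file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    full = os.path.join(tmp, "train.txt")          # hypothetical full file
    short = os.path.join(tmp, "train_short.txt")   # reduced file
    with open(full, "w", encoding="utf-8") as f:
        f.writelines(f"example {i}\n" for i in range(2000))
    n_written = write_subset(full, short, 500)

print(n_written)
```

Taking the head of the file is the cheapest option; a random sample would be less biased if the full file is ordered in some way.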

Unlike the code in the original repo, the code in this repo can be run directly on a Windows system. We have already provided a subset of the pre-processed data for batch_0 (the test data sizes listed in the table above) in ./data/codesearch/test/

Fine-tuning the pretrained CodeBERT model on individual languages

lang = go
cd CodeBERT-Implementation
! python run_classifier.py --model_type roberta --task_name codesearch --do_train --do_eval --eval_all_checkpoints --train_file train_short.txt --dev_file valid_short.txt --max_seq_length 50 --per_gpu_train_batch_size 8 --per_gpu_eval_batch_size 8 --learning_rate 1e-5 --num_train_epochs 1 --gradient_accumulation_steps 1 --overwrite_output_dir --data_dir CodeBERT-Implementation/data/codesearch/train_valid/$lang/ --output_dir ./models/$lang/ --model_name_or_path microsoft/codebert-base
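The command above fine-tunes one language at a time. A small helper, sketched below, can generate the same command for every language in the table. The flags are copied from the command above; `LANGUAGES` and `build_train_command` are hypothetical names, and the lower-case directory names other than `go` are an assumption about the data layout.

```python
import shlex

# Assumed directory names, one per language in the data table.
LANGUAGES = ["ruby", "go", "php", "python", "java", "javascript"]


def build_train_command(lang):
    """Build the fine-tuning command for one language as an argument list."""
    return [
        "python", "run_classifier.py",
        "--model_type", "roberta",
        "--task_name", "codesearch",
        "--do_train", "--do_eval", "--eval_all_checkpoints",
        "--train_file", "train_short.txt",
        "--dev_file", "valid_short.txt",
        "--max_seq_length", "50",
        "--per_gpu_train_batch_size", "8",
        "--per_gpu_eval_batch_size", "8",
        "--learning_rate", "1e-5",
        "--num_train_epochs", "1",
        "--gradient_accumulation_steps", "1",
        "--overwrite_output_dir",
        "--data_dir", f"CodeBERT-Implementation/data/codesearch/train_valid/{lang}/",
        "--output_dir", f"./models/{lang}/",
        "--model_name_or_path", "microsoft/codebert-base",
    ]


# Print the commands instead of running them, so they can be reviewed first.
for lang in LANGUAGES:
    print(shlex.join(build_train_command(lang)))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths have been checked against the local checkout.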

Inference and Evaluation

lang = go
idx = 0
! python run_classifier.py --model_type roberta --model_name_or_path microsoft/codebert-base --task_name codesearch --do_predict --output_dir CodeBERT-Implementation/data/models/$lang --data_dir CodeBERT-Implementation/data/codesearch/test/$lang/ --max_seq_length 50 --per_gpu_train_batch_size 8 --per_gpu_eval_batch_size 8 --learning_rate 1e-5 --num_train_epochs 1 --test_file batch_short_${idx}.txt --pred_model_dir ./models/$lang/checkpoint-best/ --test_result_dir ./results/$lang/${idx}_batch_result.txt

! python mrr.py

The Mean Reciprocal Rank (MRR), the evaluation metric used in the paper and computed by mrr.py, is as follows for this subset of the data:

Language	MRR
Ruby	0.0037
Go	0.0034
PHP	0.0044
Python	0.0052
Java	0.0033
JavaScript	0.0054
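For reference, here is a minimal sketch of how a mean reciprocal rank can be computed, assuming that this is the quantity mrr.py reports (it is the metric used in the CodeBERT paper). The functions `rank_of_correct` and `mean_reciprocal_rank` and the example scores are illustrative and are not taken from mrr.py.

```python
from typing import Iterable, Sequence


def rank_of_correct(scores: Sequence[float], correct_index: int) -> int:
    """1-based rank of the correct candidate: 1 + number of higher scores."""
    target = scores[correct_index]
    return 1 + sum(s > target for s in scores)


def mean_reciprocal_rank(ranks: Iterable[int]) -> float:
    """Average of 1/rank over all queries."""
    ranks = list(ranks)
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Each entry: (scores for all candidates, index of the correct candidate).
queries = [
    ([0.9, 0.1, 0.3], 0),       # correct code ranked 1st
    ([0.2, 0.9, 0.5], 2),       # correct code ranked 2nd
    ([0.4, 0.3, 0.2, 0.1], 3),  # correct code ranked 4th
]
ranks = [rank_of_correct(scores, idx) for scores, idx in queries]
mrr = mean_reciprocal_rank(ranks)  # (1 + 1/2 + 1/4) / 3
print(ranks, round(mrr, 4))
```

An MRR near 1 means the correct snippet is usually ranked first, while values close to 0 mean it tends to sit far down the candidate list.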

These scores are far lower than those reported in the paper, most likely because the model was trained on only a small subset of the data. The purpose of this repo, however, is to provide a ready-to-run implementation of CodeBERT code search that requires no heavy downloads. The prediction results for the test data are also included in this repo.

Owner

Tanuj Sur
