Implementation of Natural Language Code Search in the project CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

Overview

CodeBERT-Implementation

In this repo we have replicated the paper CodeBERT: A Pre-Trained Model for Programming and Natural Languages.
We are interested in evaluating CodeBERT specifically in Natural language code search. Given a natural language as the input, the objective of code search is to find the most semantically related code from a collection of codes.

This code was implemented on a 64-bit Windows system with 8 GB ram and GeForce GTX 1650 4GB graphics card.

Due to limited compuational power, we have trained and evaluated the model on a smaller data compared to the original data.

Language Training data size Validation data size Test data size for batch_0
Original Our Original Our Original Our
Ruby 97580 500 4417 100 1000000 20000
Go 635653 500 28483 100 1000000 20000
PHP 1047404 500 52029 100 1000000 20000
Python 824342 500 46213 100 1000000 20000
Java 908886 500 30655 100 1000000 20000
Javascript 247773 500 16505 100 1000000 20000

Compared to the code in original repo, code in this repo can be implemented directly in Windows system without any hindrance. We have already provided a subset of pre-processed data for batch_0 (shown in table under Testing data size) in ./data/codesearch/test/

Fine tuning pretrained model CodeBERT on individual languages

lang = go
cd CodeBERT-Implementation
! python run_classifier.py --model_type roberta --task_name codesearch --do_train --do_eval --eval_all_checkpoints --train_file train_short.txt --dev_file valid_short.txt --max_seq_length 50 --per_gpu_train_batch_size 8 --per_gpu_eval_batch_size 8 --learning_rate 1e-5 --num_train_epochs 1 --gradient_accumulation_steps 1 --overwrite_output_dir --data_dir CodeBERT-Implementation/data/codesearch/train_valid/$lang/ --output_dir ./models/$lang/ --model_name_or_path microsoft/codebert-base

Inference and Evaluation

lang = go
idx = 0
! python run_classifier.py --model_type roberta --model_name_or_path microsoft/codebert-base --task_name codesearch --do_predict --output_dir CodeBERT-Implementation/data/models/$lang --data_dir CodeBERT-Implementation/data/codesearch/test/$lang/ --max_seq_length 50 --per_gpu_train_batch_size 8 --per_gpu_eval_batch_size 8 --learning_rate 1e-5 --num_train_epochs 1 --test_file batch_short_${idx}.txt --pred_model_dir ./models/ruby/checkpoint-best/ --test_result_dir ./results/$lang/${idx}_batch_result.txt
! python mrr.py

The Mean Evaluation Rank (MER), the evaluation mteric, for the subset of data is given as follows:

Language MER
Ruby 0.0037
Go 0.0034
PHP 0.0044
Python 0.0052
Java 0.0033
Java script 0.0054

The accuracy is way less than what is reported in the paper. However, the purpose of this repo is to provide the user, ready to implement data of CodeBERT without any heavy downloads. We have also included the prediction results in this repo corresponding to the test data.

Owner
Tanuj Sur
Student at Chennai Mathematical Institute | Research Intern at TCS Research and Innovation Labs
Tanuj Sur
运小筹公众号是致力于分享运筹优化(LP、MIP、NLP、随机规划、鲁棒优化)、凸优化、强化学习等研究领域的内容以及涉及到的算法的代码实现。

OlittleRer 运小筹公众号是致力于分享运筹优化(LP、MIP、NLP、随机规划、鲁棒优化)、凸优化、强化学习等研究领域的内容以及涉及到的算法的代码实现。编程语言和工具包括Java、Python、Matlab、CPLEX、Gurobi、SCIP 等。 关注我们: 运筹小公众号 有问题可以直接在

运小筹 151 Dec 30, 2022
Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers

beyond masking Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers The code is coming Figure 1: Pipeline of token-based pre-

Yunjie Tian 23 Sep 27, 2022
Modeling cumulative cases of Covid-19 in the US during the Covid 19 Delta wave using Bayesian methods.

Introduction The goal of this analysis is to find a model that fits the observed cumulative cases of COVID-19 in the US, starting in Mid-July 2021 and

Alexander Keeney 1 Jan 05, 2022
NeMo: a toolkit for conversational AI

NVIDIA NeMo Introduction NeMo is a toolkit for creating Conversational AI applications. NeMo product page. Introductory video. The toolkit comes with

NVIDIA Corporation 5.3k Jan 04, 2023
Use Tensorflow2.7.0 Build OpenAI'GPT-2

TF2_GPT-2 Use Tensorflow2.7.0 Build OpenAI'GPT-2 使用最新tensorflow2.7.0构建openai官方的GPT-2 NLP模型 优点 使用无监督技术 拥有大量词汇量 可实现续写(堪比“xx梦续写”) 实现对话后续将应用于FloatTech的Bot

Watermelon 9 Sep 13, 2022
Implementation of the Hybrid Perception Block and Dual-Pruned Self-Attention block from the ITTR paper for Image to Image Translation using Transformers

ITTR - Pytorch Implementation of the Hybrid Perception Block (HPB) and Dual-Pruned Self-Attention (DPSA) block from the ITTR paper for Image to Image

Phil Wang 17 Dec 23, 2022
GPT-2 Model for Leetcode Questions in python

Leetcode using AI 🤖 GPT-2 Model for Leetcode Questions in python New demo here: https://huggingface.co/spaces/gagan3012/project-code-py Note: the Ans

Gagan Bhatia 100 Dec 12, 2022
IMS-Toucan is a toolkit to train state-of-the-art Speech Synthesis models

IMS-Toucan is a toolkit to train state-of-the-art Speech Synthesis models. Everything is pure Python and PyTorch based to keep it as simple and beginner-friendly, yet powerful as possible.

Digital Phonetics at the University of Stuttgart 247 Jan 05, 2023
Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System

Multi-Task Pre-Training for Plug-and-Play Task-Oriented Dialogue System Authors: Yixuan Su, Lei Shu, Elman Mansimov, Arshit Gupta, Deng Cai, Yi-An Lai

Amazon Web Services - Labs 124 Jan 03, 2023
A cross platform OCR Library based on PaddleOCR & OnnxRuntime

A cross platform OCR Library based on PaddleOCR & OnnxRuntime

RapidOCR Team 767 Jan 09, 2023
nlp基础任务

NLP算法 说明 此算法仓库包括文本分类、序列标注、关系抽取、文本匹配、文本相似度匹配这五个主流NLP任务,涉及到22个相关的模型算法。 框架结构 文件结构 all_models ├── Base_line │   ├── __init__.py │   ├── base_data_process.

zuxinqi 23 Sep 22, 2022
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

ELECTRA Introduction ELECTRA is a method for self-supervised language representation learning. It can be used to pre-train transformer networks using

Google Research 2.1k Dec 28, 2022
Binaural Speech Synthesis

Binaural Speech Synthesis This repository contains code to train a mono-to-binaural neural sound renderer. If you use this code or the provided datase

Facebook Research 135 Dec 18, 2022
A Python module made to simplify the usage of Text To Speech and Speech Recognition.

Nav Module The solution for voice related stuff in Python Nav is a Python module which simplifies voice related stuff in Python. Just import the Modul

Snm Logic 1 Dec 20, 2021
This repo stores the codes for topic modeling on palliative care journals.

This repo stores the codes for topic modeling on palliative care journals. Data Preparation You first need to download the journal papers. bash 1_down

3 Dec 20, 2022
CCF BDCI BERT系统调优赛题baseline(Pytorch版本)

CCF BDCI BERT系统调优赛题baseline(Pytorch版本) 此版本基于Pytorch后端的huggingface进行实现。由于此实现使用了Oneflow的dataloader作为数据读入的方式,因此也需要安装Oneflow。其它框架的数据读取可以参考OneflowDataloade

Ziqi Zhou 9 Oct 13, 2022
Implementing SimCSE(paper, official repository) using TensorFlow 2 and KR-BERT.

KR-BERT-SimCSE Implementing SimCSE(paper, official repository) using TensorFlow 2 and KR-BERT. Training Unsupervised python train_unsupervised.py --mi

Jeong Ukjae 27 Dec 12, 2022
Easy to use, state-of-the-art Neural Machine Translation for 100+ languages

EasyNMT - Easy to use, state-of-the-art Neural Machine Translation This package provides easy to use, state-of-the-art machine translation for more th

Ubiquitous Knowledge Processing Lab 748 Jan 06, 2023
SGMC: Spectral Graph Matrix Completion

SGMC: Spectral Graph Matrix Completion Code for AAAI21 paper "Scalable and Explainable 1-Bit Matrix Completion via Graph Signal Learning". Data Format

Chao Chen 8 Dec 12, 2022
HF's ML for Audio study group

Hugging Face Machine Learning for Audio Study Group Welcome to the ML for Audio Study Group. Through a series of presentations, paper reading and disc

Vaibhav Srivastav 110 Jan 01, 2023