Trex is a tool to match semantically similar functions based on transfer learning.

Last update: Dec 28, 2022


Overview

Introduction

Trex is a tool to match semantically similar functions based on transfer learning.

Installation

We recommend using conda to set up the environment and install the required packages.

First, create the conda environment,

conda create -n trex python=3.8 numpy scipy scikit-learn requests

and activate the conda environment:

conda activate trex

Then install PyTorch (this assumes you have a GPU):

conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia
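After the install, it can help to confirm that PyTorch actually sees the GPU before moving on. The following is a minimal sketch, not part of the trex repository; the helper name describe_torch_setup is made up for illustration, and it only uses standard PyTorch calls.

```python
import importlib.util


def describe_torch_setup():
    """Return a short summary of the local PyTorch install (hypothetical helper)."""
    # Avoid an ImportError if the conda install step has not been run yet.
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed in this environment"
    import torch

    if torch.cuda.is_available():
        return (
            f"PyTorch {torch.__version__} with CUDA {torch.version.cuda}, "
            f"GPU: {torch.cuda.get_device_name(0)}"
        )
    return f"PyTorch {torch.__version__} (CPU only; no CUDA device visible)"


print(describe_torch_setup())
```

If the summary reports a CPU-only setup on a GPU machine, the cudatoolkit version in the conda command may not match the installed driver.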

Enter the trex root directory (e.g., path/to/trex) and install trex:

pip install --editable .

For large datasets, install PyArrow:

pip install pyarrow

For faster training, install NVIDIA's apex library:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./

Preparation

Pretrained models:

Create the checkpoints and checkpoints/pretrain subdirectories in path/to/trex:

mkdir checkpoints
mkdir checkpoints/pretrain

Download our pretrained weight parameters and put them in checkpoints/pretrain.
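A quick way to check that the downloaded weights landed in the right place is to list the contents of checkpoints/pretrain. This is a small sketch under stated assumptions: list_pretrained_files is a hypothetical helper, not part of trex, and it makes no assumption about the checkpoint file names.

```python
from pathlib import Path


def list_pretrained_files(trex_root="."):
    """Return the sorted file names found in <trex_root>/checkpoints/pretrain."""
    pretrain_dir = Path(trex_root) / "checkpoints" / "pretrain"
    if not pretrain_dir.is_dir():
        # The directory has not been created yet.
        return []
    # Only regular files are reported; nested directories are ignored.
    return sorted(p.name for p in pretrain_dir.iterdir() if p.is_file())


found = list_pretrained_files(".")
print(found if found else "no files found in checkpoints/pretrain")
```

Run it from the trex root directory; an empty result means the directory is missing or the download has not been moved there yet.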

Sample data for finetuning similarity

We provide sample training/testing files for finetuning in data-src/similarity. If you want to prepare the finetuning data yourself, make sure you follow the format shown in data-src/similarity (coming soon: tokenization script).

The data must be binarized before it can be used for training. To binarize the training data for finetuning, run:

python command/finetune/preprocess.py

The binarized training data ready for finetuning (for detecting similarity) will be stored at data-bin/similarity

Training

To finetune the model, run:

./command/finetune/finetune.sh

The script loads the pretrained weight parameters from checkpoints/pretrain/ and finetunes the model.
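The two finetuning steps (binarize, then finetune) can be chained so that a failure in preprocessing stops the run early. The sketch below is an illustration only: run_finetune_pipeline and FINETUNE_STEPS are hypothetical names, and the commands are the ones listed in this README, assumed to be run from the trex root directory.

```python
import subprocess
import sys
from pathlib import Path

# Commands taken from this README, in the order they must run.
FINETUNE_STEPS = [
    [sys.executable, "command/finetune/preprocess.py"],  # writes data-bin/similarity
    ["./command/finetune/finetune.sh"],  # loads checkpoints/pretrain/
]


def run_finetune_pipeline(trex_root=".", dry_run=True):
    """Run the finetuning commands in order, or only return them if dry_run."""
    planned = [" ".join(cmd) for cmd in FINETUNE_STEPS]
    if dry_run:
        return planned
    for cmd in FINETUNE_STEPS:
        # check=True raises CalledProcessError at the first failing step.
        subprocess.run(cmd, cwd=Path(trex_root), check=True)
    return planned


# Dry run: print the planned commands without executing anything.
for line in run_finetune_pipeline(dry_run=True):
    print(line)
```

Calling run_finetune_pipeline(dry_run=False) from the trex root would execute both steps for real.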

Sample data for pretraining on micro-traces

We also provide 10K samples and scripts to demonstrate how to pretrain the model. To binarize the training data for pretraining, run:

python command/pretrain/preprocess_pretrain_10k.py

The binarized training data ready for pretraining will be stored at data-bin/pretrain_10k

To pretrain the model, run:

./command/pretrain/pretrain_10k.sh

The pretrained model will be checkpointed at checkpoints/pretrain_10k

Dataset

We put our dataset here.

