LTR_CrossEncoder: Legal Text Retrieval Zalo AI Challenge 2021

Last update: Jan 12, 2022

Related tags

Deep Learning ZaloAI2021_LTR

Overview

LTR_CrossEncoder: Legal Text Retrieval Zalo AI Challenge 2021

We propose a cross encoder model (LTR_CrossEncoder) for information retrieval, re-retrieval text relevant base on result of elasticsearch

Model achieved 0.747 F2 score in public test (Legal Text Retrieval Zalo AI Challenge 2021)
If using elasticsearch only, our F2 score is 0.54

Algorithm design

Our algorithm includes two key components:

Elasticsearch
Cross Encoder Model

Elasticsearch

Elasticsearch is used for filtering top-k most relevant articles based on BM25 score.

Cross Encoder Model

Our model accepts query, article text (passage) and article title as inputs and outputs a relevant score of that query and that article. Higher score, more relavant. We use pretrained vinai/phobert-base and CrossEntropyLoss or BCELoss as loss function

Train dataset

Non-relevant samples in dataset are obtained by top-10 result of elasticsearch, the training data (train_data_model.json) has format as follow:

[
    {
        "question_id": "..."
        "question": "..."
        "relevant_articles":[
            {
                "law_id": "..."
                "article_id": "..."
                "title": "..."
                "text": "..."
            },
            ...
        ]
        "non_relevant_articles":[
            {
                "law_id": "..."
                "article_id": "..."
                "title": "..."
                "text": "..."
            },
            ...
        ]
    },
    ...
]

Test dataset

First we use elasticsearch to obtain k relevant candidates (k=top-50 result of elasticsearch), then LTR_CrossEncoder classify which actual relevant article. The test data (test_data_model.json) has format as follow:

[
    {
        "question_id": "..."
        "question": "..."
        "articles":[
            {
                "law_id": "..."
                "article_id": "..."
                "title": "..."
                "text": "..."
            },
            ...
        ]
    },
    ...
]

Training

Run the following bash file to train model:

bash run_phobert.sh

Inference

We also provide model checkpoints. Please download these checkpoints if you want to make inference on a new text file without training the models from scratch. Create new checkpoint folder, unzip model file and push it in checkpoint folder. https://drive.google.com/file/d/1oT8nlDIAatx3XONN1n5eOgYTT6Lx_h_C/view?usp=sharing

Run the following bash file to infer test dataset:

bash run_predict.sh

LTR_CrossEncoder: Legal Text Retrieval Zalo AI Challenge 2021

Related tags

Overview

LTR_CrossEncoder: Legal Text Retrieval Zalo AI Challenge 2021

Algorithm design

Elasticsearch

Cross Encoder Model

Train dataset

Test dataset

Training

Inference

Owner

Xuan Hieu Duong

An end-to-end library for editing and rendering motion of 3D characters with deep learning [SIGGRAPH 2020]

RNN Predict Street Commercial Vitality

Image Restoration Toolbox (PyTorch). Training and testing codes for DPIR, USRNet, DnCNN, FFDNet, SRMD, DPSR, BSRGAN, SwinIR

Source code of the paper PatchGraph: In-hand tactile tracking with learned surface normals.

MultiSiam: Self-supervised Multi-instance Siamese Representation Learning for Autonomous Driving

Compartmental epidemic model to assess undocumented infections: applications to SARS-CoV-2 epidemics in Brazil - Datasets and Codes

PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation

This repository for project that can Automate Number Plate Recognition (ANPR) in Morocco Licensed Vehicles. 💻 + 🚙 + 🇲🇦 = 🤖 🕵🏻‍♂️

Heart Arrhythmia Classification

Softlearning is a reinforcement learning framework for training maximum entropy policies in continuous domains. Includes the official implementation of the Soft Actor-Critic algorithm.

Facilitates implementing deep neural-network backbones, data augmentations

DeepProbLog is an extension of ProbLog that integrates Probabilistic Logic Programming with deep learning by introducing the neural predicate.

A PyTorch implementation of Learning to learn by gradient descent by gradient descent

This is the repository for our paper Ditch the Gold Standard: Re-evaluating Conversational Question Answering

Official PyTorch implementation of Retrieve in Style: Unsupervised Facial Feature Transfer and Retrieval.

Revisiting, benchmarking, and refining Heterogeneous Graph Neural Networks.

Marine debris detection with commercial satellite imagery and deep learning.

PyTorch implementation of Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

An alarm clock coded in Python 3 with Tkinter

Repository for the semantic WMI loss