LexGLUE: A Benchmark Dataset for Legal Language Understanding in English

Overview

LexGLUE: A Benchmark Dataset for Legal Language Understanding in English ⚖️ 🏆 🧑‍🎓 👩‍⚖️

LexGLUE Graphic

Dataset Summary

Inspired by the recent widespread use of the GLUE multi-task benchmark NLP dataset (Wang et al., 2018), the subsequent more difficult SuperGLUE (Wang et al., 2109), other previous multi-task NLP benchmarks (Conneau and Kiela,2018; McCann et al., 2018), and similar initiatives in other domains (Peng et al., 2019), we introduce LexGLUE, a benchmark dataset to evaluate the performance of NLP methods in legal tasks. LexGLUE is based on seven existing legal NLP datasets, selected using criteria largely from SuperGLUE.

We anticipate that more datasets, tasks, and languages will be added in later versions of LexGLUE. As more legal NLP datasets become available, we also plan to favor datasets checked thoroughly for validity (scores reflecting real-life performance), annotation quality, statistical power,and social bias (Bowman and Dahl, 2021).

As in GLUE and SuperGLUE (Wang et al., 2109) one of our goals is to push towards generic (or foundation) models that can cope with multiple NLP tasks, in our case legal NLP tasks,possibly with limited task-specific fine-tuning. An-other goal is to provide a convenient and informative entry point for NLP researchers and practitioners wishing to explore or develop methods for legalNLP. Having these goals in mind, the datasets we include in LexGLUE and the tasks they address have been simplified in several ways, discussed below, to make it easier for newcomers and generic models to address all tasks. We provide PythonAPIs integrated with Hugging Face (Wolf et al.,2020; Lhoest et al., 2021) to easily import all the datasets, experiment with and evaluate their performance.

By unifying and facilitating the access to a set of law-related datasets and tasks, we hope to attract not only more NLP experts, but also more interdisciplinary researchers (e.g., law doctoral students willing to take NLP courses). More broadly, we hope LexGLUE will speed up the adoption and transparent evaluation of new legal NLP methods and approaches in the commercial sector too. Indeed, there have been many commercial press releases in legal-tech industry, but almost no independent evaluation of the veracity of the performance of various machine learning and NLP-based offerings. A standard publicly available benchmark would also allay concerns of undue influence in predictive models, including the use of metadata which the relevant law expressly disregards.

If you participate, use the LexGLUE benchmark, or our experimentation library, please cite:

Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Martin Katz, and Nikolaos Aletras. LexGLUE: A Benchmark Dataset for Legal Language Understanding in English. 2021. arXiv: 2110.00976.

@article{chalkidis-etal-2021-lexglue,
        title={LexGLUE: A Benchmark Dataset for Legal Language Understanding in English}, 
        author={Chalkidis, Ilias and Jana, Abhik and Hartung, Dirk and
        Bommarito, Michael and Androutsopoulos, Ion and Katz, Daniel Martin and
        Aletras, Nikolaos},
        year={2021},
        eprint={2110.00976},
        archivePrefix={arXiv},
        primaryClass={cs.CL},
        note = {arXiv: 2110.00976},
}

Supported Tasks

Dataset Source Sub-domain Task Type Training/Dev/Test Instances Classes
ECtHR (Task A) Chalkidis et al. (2019) ECHR Multi-label classification 9,000/1,000/1,000 10+1
ECtHR (Task B) Chalkidis et al. (2021a) ECHR Multi-label classification 9,000/1,000/1,000 10
SCOTUS Spaeth et al. (2020) US Law Multi-class classification 5,000/1,400/1,400 14
EUR-LEX Chalkidis et al. (2021b) EU Law Multi-label classification 55,000/5,000/5,000 100
LEDGAR Tuggener et al. (2020) Contracts Multi-class classification 60,000/10,000/10,000 100
UNFAIR-ToS Lippi et al. (2019) Contracts Multi-label classification 5,532/2,275/1,607 8
CaseHOLD Zheng et al. (2021) US Law Multiple choice QA 45,000/3,900/3,900 n/a

ECtHR (Task A)

The European Court of Human Rights (ECtHR) hears allegations that a state has breached human rights provisions of the European Convention of Human Rights (ECHR). For each case, the dataset provides a list of factual paragraphs (facts) from the case description. Each case is mapped to articles of the ECHR that were violated (if any).

ECtHR (Task B)

The European Court of Human Rights (ECtHR) hears allegations that a state has breached human rights provisions of the European Convention of Human Rights (ECHR). For each case, the dataset provides a list of factual paragraphs (facts) from the case description. Each case is mapped to articles of ECHR that were allegedly violated (considered by the court).

SCOTUS

The US Supreme Court (SCOTUS) is the highest federal court in the United States of America and generally hears only the most controversial or otherwise complex cases which have not been sufficiently well solved by lower courts. This is a single-label multi-class classification task, where given a document (court opinion), the task is to predict the relevant issue areas. The 14 issue areas cluster 278 issues whose focus is on the subject matter of the controversy (dispute).

EUR-LEX

European Union (EU) legislation is published in EUR-Lex portal. All EU laws are annotated by EU's Publications Office with multiple concepts from the EuroVoc thesaurus, a multilingual thesaurus maintained by the Publications Office. The current version of EuroVoc contains more than 7k concepts referring to various activities of the EU and its Member States (e.g., economics, health-care, trade). Given a document, the task is to predict its EuroVoc labels (concepts).

LEDGAR

LEDGAR dataset aims contract provision (paragraph) classification. The contract provisions come from contracts obtained from the US Securities and Exchange Commission (SEC) filings, which are publicly available from EDGAR. Each label represents the single main topic (theme) of the corresponding contract provision.

UNFAIR-ToS

The UNFAIR-ToS dataset contains 50 Terms of Service (ToS) from on-line platforms (e.g., YouTube, Ebay, Facebook, etc.). The dataset has been annotated on the sentence-level with 8 types of unfair contractual terms (sentences), meaning terms that potentially violate user rights according to the European consumer law.

CaseHOLD

The CaseHOLD (Case Holdings on Legal Decisions) dataset includes multiple choice questions about holdings of US court cases from the Harvard Law Library case law corpus. Holdings are short summaries of legal rulings accompany referenced decisions relevant for the present case. The input consists of an excerpt (or prompt) from a court decision, containing a reference to a particular case, while the holding statement is masked out. The model must identify the correct (masked) holding statement from a selection of five choices.

Leaderboard

Dataset ECtHR Task A ECtHR Task B SCOTUS EUR-LEX LEDGAR UNFAIR-ToS CaseHOLD
Model μ-F1 / m-F1 μ-F1 / m-F1 μ-F1 / m-F1 μ-F1 / m-F1 μ-F1 / m-F1 μ-F1 / m-F1 μ-F1 / m-F1
BERT (Devlin et al., 2018) 71.4 / 64.0 87.6 / 77.8 70.5 / 60.9 71.6 / 55.6 87.7 / 82.2 87.5 / 81.0 70.7
RoBERTa (Liu et al., 2019) 69.5 / 60.7 87.2 / 77.3 70.8 / 61.2 71.8 / 57.5 87.9 / 82.1 87.7 / 81.5 71.7
DeBERTa (He et al., 2021) 69.1 / 61.2 87.4 / 77.3 70.0 / 60.0 72.3 / 57.2 87.9 / 82.0 87.2 / 78.8 72.1
Longformer (Beltagy et al., 2020) 69.6 / 62.4 88.0 / 77.8 72.2 / 62.5 71.9 / 56.7 87.7 / 82.3 87.7 / 80.1 72.0
BigBird (Zaheer et al., 2021) 70.5 / 63.8 88.1 / 76.6 71.7 / 61.4 71.8 / 56.6 87.7 / 82.1 87.7 / 80.2 70.4
Legal-BERT (Chalkidis et al., 2020) 71.2 / 64.6 88.0 / 77.2 76.2 / 65.8 72.2 / 56.2 88.1 / 82.7 88.6 / 82.3 75.1
CaseLaw-BERT (Zheng et al., 2021) 71.2 / 64.2 88.0 / 77.5 76.4 / 66.2 71.0 / 55.9 88.0 / 82.3 88.3 / 81.0 75.6

Frequently Asked Questions (FAQ)

Where are the datasets?

We provide access to LexGLUE on Hugging Face Datasets (Lhoest et al., 2021) at https://huggingface.co/datasets/lex_glue.

For example to load the SCOTUS Spaeth et al. (2020) dataset, you first simply install the datasets python library and then make the following call:

from datasets import load_dataset 
dataset = load_dataset("lex_glue", "scotus")

How to run experiments?

Furthermore, to make reproducing the results for the already examined models or future models even easier, we release our code in this repository. In folder /experiments, there are Python scripts, relying on the Hugging Face Transformers library, to run and evaluate any Transformer-based model (e.g., BERT, RoBERTa, LegalBERT, and their hierarchical variants, as well as, Longforrmer, and BigBird). We also provide bash scripts in folder /scripts to replicate the experiments for each dataset with 5 randoms seeds, as we did for the reported results for the original leaderboard.

For example to replicate the results for RoBERTa (Liu et al., 2019) on UNFAIR-ToS Lippi et al. (2019), you have to configure the relevant bash script (run_unfair_tos.sh):

> nano run_unfair_tos.sh
GPU_NUMBER=1
MODEL_NAME='roberta-base'
LOWER_CASE='False'
BATCH_SIZE=8
ACCUMULATION_STEPS=1
TASK='unfair_tos'

and then run it:

> sh run_unfair_tos.sh

How to participate?

We are currently still lacking some technical infrastructure, e.g., an integrated submission environment comprised of an automated evaluation and an automatically updated leaderboard. We plan to develop the necessary publicly available web infrastructure extend the public infrastructure of LexGLUE in the near future.

In the mean-time, we ask participants to re-use and expand our code to submit new results, if possible, and raise a new issue in our repository (https://github.com/coastalcph/lex-glue/issues/new) presenting their results, providing the auto-generated result logs and the relevant publication (or pre-print), if available, accompanied with a pull request including the code amendments that are needed to reproduce their experiments. Upon reviewing your results, we'll update the public leaderboard accordingly.

I still have open questions...

Please post your question on Discussions section or communicate with the corresponding author via e-mail.

A flexible submap-based framework towards spatio-temporally consistent volumetric mapping and scene understanding.

Panoptic Mapping This package contains panoptic_mapping, a general framework for semantic volumetric mapping. We provide, among other, a submap-based

ETHZ ASL 194 Dec 20, 2022
Implementation of Graph Convolutional Networks in TensorFlow

Graph Convolutional Networks This is a TensorFlow implementation of Graph Convolutional Networks for the task of (semi-supervised) classification of n

Thomas Kipf 6.6k Dec 30, 2022
这是一个yolox-pytorch的源码,可以用于训练自己的模型。

YOLOX:You Only Look Once目标检测模型在Pytorch当中的实现 目录 性能情况 Performance 实现的内容 Achievement 所需环境 Environment 小技巧的设置 TricksSet 文件下载 Download 训练步骤 How2train 预测步骤

Bubbliiiing 613 Jan 05, 2023
Official implementation of NPMs: Neural Parametric Models for 3D Deformable Shapes - ICCV 2021

NPMs: Neural Parametric Models Project Page | Paper | ArXiv | Video NPMs: Neural Parametric Models for 3D Deformable Shapes Pablo Palafox, Aljaz Bozic

PabloPalafox 109 Nov 22, 2022
Continuous Security Group Rule Change Detection & Response at scale

Introduction Get notified of Security Group Changes across all AWS Accounts & Regions in an AWS Organization, with the ability to respond/revert those

Raajhesh Kannaa Chidambaram 3 Aug 13, 2022
Official repository for the NeurIPS 2021 paper Get Fooled for the Right Reason: Improving Adversarial Robustness through a Teacher-guided curriculum Learning Approach

Get Fooled for the Right Reason Official repository for the NeurIPS 2021 paper Get Fooled for the Right Reason: Improving Adversarial Robustness throu

Sowrya Gali 1 Apr 25, 2022
Multi-label Co-regularization for Semi-supervised Facial Action Unit Recognition (NeurIPS 2019)

MLCR This is the source code for paper Multi-label Co-regularization for Semi-supervised Facial Action Unit Recognition. Xuesong Niu, Hu Han, Shiguang

Edson-Niu 60 Nov 29, 2022
Pytorch implementation of "Forward Thinking: Building and Training Neural Networks One Layer at a Time"

forward-thinking-pytorch Pytorch implementation of Forward Thinking: Building and Training Neural Networks One Layer at a Time Requirements Python 2.7

Kim Heecheol 65 Oct 06, 2022
Apache Flink

Apache Flink Apache Flink is an open source stream processing framework with powerful stream- and batch-processing capabilities. Learn more about Flin

The Apache Software Foundation 20.4k Dec 30, 2022
Image restoration with neural networks but without learning.

Warning! The optimization may not converge on some GPUs. We've personally experienced issues on Tesla V100 and P40 GPUs. When running the code, make s

Dmitry Ulyanov 7.4k Jan 01, 2023
Tensorflow 2 implementations of the C-SimCLR and C-BYOL self-supervised visual representation methods from "Compressive Visual Representations" (NeurIPS 2021)

Compressive Visual Representations This repository contains the source code for our paper, Compressive Visual Representations. We developed informatio

Google Research 30 Nov 23, 2022
This repository contains code for the paper "Decoupling Representation and Classifier for Long-Tailed Recognition", published at ICLR 2020

Classifier-Balancing This repository contains code for the paper: Decoupling Representation and Classifier for Long-Tailed Recognition Bingyi Kang, Sa

Facebook Research 820 Dec 26, 2022
PyTorch implementation for the paper Pseudo Numerical Methods for Diffusion Models on Manifolds

Pseudo Numerical Methods for Diffusion Models on Manifolds (PNDM) This repo is the official PyTorch implementation for the paper Pseudo Numerical Meth

Luping Liu (刘路平) 196 Jan 05, 2023
EfficientNetV2 implementation using PyTorch

EfficientNetV2-S implementation using PyTorch Train Steps Configure imagenet path by changing data_dir in train.py python main.py --benchmark for mode

Jahongir Yunusov 86 Dec 29, 2022
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration

ROSITA News & Updates (24/08/2021) Release the demo to perform fine-grained semantic alignments using the pretrained ROSITA model. (15/08/2021) Releas

Vision and Language Group@ MIL 48 Dec 23, 2022
DeepFaceLive - Live Deep Fake in python, Real-time face swap for PC streaming or video calls

DeepFaceLive - Live Deep Fake in python, Real-time face swap for PC streaming or video calls

8.3k Dec 31, 2022
DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference

DeeBERT This is the code base for the paper DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference. Code in this repository is also available

Castorini 132 Nov 14, 2022
Code for `BCD Nets: Scalable Variational Approaches for Bayesian Causal Discovery`, Neurips 2021

This folder contains the code for 'Scalable Variational Approaches for Bayesian Causal Discovery'. Installation To install, use conda with conda env c

14 Sep 21, 2022
Our implementation used for the MICCAI 2021 FLARE Challenge titled 'Efficient Multi-Organ Segmentation Using SpatialConfiguartion-Net with Low GPU Memory Requirements'.

Efficient Multi-Organ Segmentation Using SpatialConfiguartion-Net with Low GPU Memory Requirements Our implementation used for the MICCAI 2021 FLARE C

Franz Thaler 3 Sep 27, 2022
[ICCV 2021] Deep Hough Voting for Robust Global Registration

Deep Hough Voting for Robust Global Registration, ICCV, 2021 Project Page | Paper | Video Deep Hough Voting for Robust Global Registration Junha Lee1,

Junha Lee 10 Dec 02, 2022