Language models for the Spanish legal domain, developed at BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

Last update: Nov 14, 2022

Overview

Spanish legal domain Language Model ⚖️

This repository hosts the page for three main resources for the Spanish legal domain:

A RoBERTa model: https://huggingface.co/PlanTL-GOB-ES/RoBERTalex
FastText embeddings: https://zenodo.org/record/5036147
Legal corpora: https://zenodo.org/record/5495529

The repository and the pre-print will be updated with larger models, evaluations, and other additions.

Why ❓

There are few models trained for the Spanish language, and some of them have been trained on small, unclean corpora. The models derived from the Spanish National Plan for Language Technologies are proficient at solving several tasks and have been trained on large-scale clean corpora. However, Spanish legal language could be thought of as an independent language of its own. We therefore created a Spanish legal model trained from scratch exclusively on legal corpora.

Evaluation ✅

Work in progress.

Corpora 📃

Corpus name	Size (GB)	Tokens (M)
Procesos Penales	0.625	0.119
JRC Acquis	0.345	59.359
Códigos Electrónicos Universitarios	0.077	11.835
Códigos Electrónicos	0.080	12.237
Doctrina de la Fiscalía General del Estado	0.017	2.669
Legislación BOE	3.600	578.685
Abogacía del Estado BOE	0.037	6.123
Consejo de Estado: Dictámenes	0.827	135.348
Spanish EURLEX	0.001	0.072
UN Resolutions	0.023	3.539
Spanish DOGC	0.826	132.569
Spanish MultiUN	2.200	352.653
Consultas Tributarias Generales y Vinculantes	0.466	77.691
Constitución Española	0.002	0.018
COPPA Patents Corpus	0.002	-
Biomedical Patents	0.083	-
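
As a quick sanity check on the table above, the following sketch adds up the reported sizes and token counts. The figures are copied from the table; the names `CORPORA` and `summarize` are made up for this example, and the two patent corpora are excluded from the token total because their counts are not reported.

```python
# Rows copied from the corpora table: (name, size in GB, tokens in millions).
# None marks a token count that the table reports as "-".
CORPORA = [
    ("Procesos Penales", 0.625, 0.119),
    ("JRC Acquis", 0.345, 59.359),
    ("Códigos Electrónicos Universitarios", 0.077, 11.835),
    ("Códigos Electrónicos", 0.080, 12.237),
    ("Doctrina de la Fiscalía General del Estado", 0.017, 2.669),
    ("Legislación BOE", 3.600, 578.685),
    ("Abogacía del Estado BOE", 0.037, 6.123),
    ("Consejo de Estado: Dictámenes", 0.827, 135.348),
    ("Spanish EURLEX", 0.001, 0.072),
    ("UN Resolutions", 0.023, 3.539),
    ("Spanish DOGC", 0.826, 132.569),
    ("Spanish MultiUN", 2.200, 352.653),
    ("Consultas Tributarias Generales y Vinculantes", 0.466, 77.691),
    ("Constitución Española", 0.002, 0.018),
    ("COPPA Patents Corpus", 0.002, None),
    ("Biomedical Patents", 0.083, None),
]


def summarize(corpora):
    """Return (total GB, total reported tokens in millions, largest corpus name)."""
    total_gb = sum(size for _, size, _ in corpora)
    # Skip corpora without a reported token count.
    total_tokens_m = sum(tok for _, _, tok in corpora if tok is not None)
    largest = max(corpora, key=lambda row: row[1])[0]
    return round(total_gb, 3), round(total_tokens_m, 3), largest


total_gb, total_tokens_m, largest = summarize(CORPORA)
print(f"Total size: {total_gb} GB")
print(f"Total reported tokens: {total_tokens_m} M")
print(f"Largest corpus by size: {largest}")
```

With the figures as listed, the corpora add up to about 9.2 GB and roughly 1,373 million reported tokens, with Legislación BOE as the largest single source.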

Usage example ⚗️

You can train your model for different downstream tasks using the scripts that Hugging Face provides (Named Entity Recognition, GLUE tasks, and others). The snippet below loads the model and fills in a masked token:

from pprint import pprint

from transformers import AutoModelForMaskedLM, AutoTokenizer, FillMaskPipeline

# Load the tokenizer and the masked-language model from the Hugging Face Hub.
tokenizer_hf = AutoTokenizer.from_pretrained('PlanTL-GOB-ES/RoBERTalex')
model = AutoModelForMaskedLM.from_pretrained('PlanTL-GOB-ES/RoBERTalex')
model.eval()  # inference mode: disables dropout

# Predict the most likely replacements for the <mask> token.
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = "¡Hola <mask>!"
res_hf = pipeline(text)
pprint([r['token_str'] for r in res_hf])
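
The example above prints only the predicted tokens. Each fill-mask result is a dictionary that also carries a `score`, so a small helper can keep just the confident predictions. The sketch below is one possible way to do that; `best_predictions`, the `min_score` threshold, and the sample data are illustrative assumptions, and the sample scores are hand-written placeholders, not real RoBERTalex output.

```python
def best_predictions(results, min_score=0.05):
    """Return (token, score) pairs with score >= min_score, best first."""
    kept = [
        (r["token_str"].strip(), r["score"])
        for r in results
        if r["score"] >= min_score
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hand-written stand-in with the same keys a fill-mask pipeline returns.
# These values are made up for illustration; they are NOT model predictions.
example_results = [
    {"score": 0.40, "token": 11, "token_str": " mundo", "sequence": "¡Hola mundo!"},
    {"score": 0.02, "token": 12, "token_str": " amigo", "sequence": "¡Hola amigo!"},
    {"score": 0.25, "token": 13, "token_str": " señor", "sequence": "¡Hola señor!"},
]

print(best_predictions(example_results))
```

In practice you would pass `res_hf` from the usage example instead of `example_results`, and pick a threshold that suits your task.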

Cite 📣

If this work is helpful, please cite it:

@misc{gutierrezfandino2021legal,
      title={Spanish Legalese Language Model and Corpora}, 
      author={Asier Gutiérrez-Fandiño and Jordi Armengol-Estapé and Aitor Gonzalez-Agirre and Marta Villegas},
      year={2021},
      eprint={2110.12201},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contact 📧

📋 We are interested in (1) extending our corpora to build larger models and (2) evaluating and training the model on other tasks.

For questions regarding this work, contact Asier Gutiérrez-Fandiño ([email protected])

Owner

Plan de Tecnologías del Lenguaje - Gobierno de España
