Official source for spanish Language Models and resources made @ BSC-TEMU within the "Plan de las Tecnologías del Lenguaje" (Plan-TL).

Overview

Spanish Language Models 💃🏻

Corpora 📃

Corpora Number of documents Size (GB)
BNE 201,080,084 570GB

Models 🤖

Word embeddings 🔤

Word embeddings trained with FastText for 300d:

Evaluation

Dataset Metric RoBERTa-b RoBERTa-l BETO mBERT BERTIN
UD-POS F1 0.9907 0.9901 0.9900 0.9886 0.9904
Conll-NER F1 0.8851 0.8772 0.8759 0.8691 0.8627
Capitel-POS F1 0.9846 0.9851 0.9836 0.9839 0.9826
Capitel-NER F1 0.8959 0.8998 0.8771 0.8810 0.8741
STS Combined 0.8423 0.8420 0.8216 0.8249 0.7822
MLDoc Accuracy 0.9595 0.9600 0.9650 0.9560 0.9673
PAWS-X F1 0.9035 0.9000 0.8915 0.9020 0.8820
XNLI Accuracy 0.8016 WiP 0.8130 0.7876 WiP

Usage example ⚗️

For the RoBERTa-base

from transformers import AutoModelForMaskedLM
from transformers import AutoTokenizer, FillMaskPipeline
from pprint import pprint
tokenizer_hf = AutoTokenizer.from_pretrained('BSC-TeMU/roberta-base-bne')
model = AutoModelForMaskedLM.from_pretrained('BSC-TeMU/roberta-base-bne')
model.eval()
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = f"¡Hola <mask>!"
res_hf = pipeline(text)
pprint([r['token_str'] for r in res_hf])

For the RoBERTa-large

from transformers import AutoModelForMaskedLM
from transformers import AutoTokenizer, FillMaskPipeline
from pprint import pprint
tokenizer_hf = AutoTokenizer.from_pretrained('BSC-TeMU/roberta-large-bne')
model = AutoModelForMaskedLM.from_pretrained('BSC-TeMU/roberta-large-bne')
model.eval()
pipeline = FillMaskPipeline(model, tokenizer_hf)
text = f"¡Hola <mask>!"
res_hf = pipeline(text)
pprint([r['token_str'] for r in res_hf])

Other Spanish Language Models 👩‍👧‍👦

We are developing domain-specific language models:

Cite 📣

@misc{gutierrezfandino2021spanish,
      title={Spanish Language Models}, 
      author={Asier Gutiérrez-Fandiño and Jordi Armengol-Estapé and Marc Pàmies and Joan Llop-Palao and Joaquín Silveira-Ocampo and Casimiro Pio Carrino and Aitor Gonzalez-Agirre and Carme Armentano-Oller and Carlos Rodriguez-Penagos and Marta Villegas},
      year={2021},
      eprint={2107.07253},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contact 📧

📋 We are interested in (1) extending our corpora to make larger models (2) train/evaluate the model in other tasks.

For questions regarding this work, contact Asier Gutiérrez-Fandiño ([email protected])

Owner
PlanTL-SANIDAD
PlanTL-SANIDAD
PyTorch implementation and pretrained models for XCiT models. See XCiT: Cross-Covariance Image Transformer

Cross-Covariance Image Transformer (XCiT) PyTorch implementation and pretrained models for XCiT models. See XCiT: Cross-Covariance Image Transformer L

Facebook Research 605 Jan 02, 2023
🧪 Cutting-edge experimental spaCy components and features

spacy-experimental: Cutting-edge experimental spaCy components and features This package includes experimental components and features for spaCy v3.x,

Explosion 65 Dec 30, 2022
Universal End2End Training Platform, including pre-training, classification tasks, machine translation, and etc.

背景 安装教程 快速上手 (一)预训练模型 (二)机器翻译 (三)文本分类 TenTrans 进阶 1. 多语言机器翻译 2. 跨语言预训练 背景 TrenTrans是一个统一的端到端的多语言多任务预训练平台,支持多种预训练方式,以及序列生成和自然语言理解任务。 安装教程 git clone git

Tencent Minority-Mandarin Translation Team 42 Dec 20, 2022
Applied Natural Language Processing in the Enterprise - An O'Reilly Media Publication

Applied Natural Language Processing in the Enterprise This is the companion repo for Applied Natural Language Processing in the Enterprise, an O'Reill

Applied Natural Language Processing in the Enterprise 95 Jan 05, 2023
Pipeline for fast building text classification TF-IDF + LogReg baselines.

Text Classification Baseline Pipeline for fast building text classification TF-IDF + LogReg baselines. Usage Instead of writing custom code for specif

Dani El-Ayyass 57 Dec 07, 2022
Just Another Telegram Ai Chat Bot Written In Python With Pyrogram.

OkaeriChatBot Just another Telegram AI chat bot written in Python using Pyrogram. Requirements Python 3.7 or higher.

Wahyusaputra 2 Dec 23, 2021
TEACh is a dataset of human-human interactive dialogues to complete tasks in a simulated household environment.

TEACh is a dataset of human-human interactive dialogues to complete tasks in a simulated household environment.

Alexa 98 Dec 09, 2022
🗣️ NALP is a library that covers Natural Adversarial Language Processing.

NALP: Natural Adversarial Language Processing Welcome to NALP. Have you ever wanted to create natural text from raw sources? If yes, NALP is for you!

Gustavo Rosa 21 Aug 12, 2022
NLP codes implemented with Pytorch (w/o library such as huggingface)

NLP_scratch NLP codes implemented with Pytorch (w/o library such as huggingface) scripts ├── models: Neural Network models ├── data: codes for dataloa

3 Dec 28, 2021
Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"

T5: Text-To-Text Transfer Transformer The t5 library serves primarily as code for reproducing the experiments in Exploring the Limits of Transfer Lear

Google Research 4.6k Jan 01, 2023
Repository to hold code for the cap-bot varient that is being presented at the SIIC Defence Hackathon 2021.

capbot-siic Repository to hold code for the cap-bot varient that is being presented at the SIIC Defence Hackathon 2021. Problem Inspiration A plethora

Aryan Kargwal 19 Feb 17, 2022
Transformation spoken text to written text

Transformation spoken text to written text This model is used for formatting raw asr text output from spoken text to written text (Eg. date, number, i

Nguyen Binh 16 Dec 28, 2022
Modeling cumulative cases of Covid-19 in the US during the Covid 19 Delta wave using Bayesian methods.

Introduction The goal of this analysis is to find a model that fits the observed cumulative cases of COVID-19 in the US, starting in Mid-July 2021 and

Alexander Keeney 1 Jan 05, 2022
End-to-end image captioning with EfficientNet-b3 + LSTM with Attention

Image captioning End-to-end image captioning with EfficientNet-b3 + LSTM with Attention Model is seq2seq model. In the encoder pretrained EfficientNet

2 Feb 10, 2022
A Python script which randomly chooses and prints a file from a directory.

___ ____ ____ _ __ ___ / _ \ | _ \ | _ \ ___ _ __ | '__| / _ \ | |_| || | | || | | | / _ \| '__| | | | __/ | _ || |_| || |_| || __

yesmaybenookay 0 Aug 06, 2021
Bot to connect a real Telegram user, simulating responses with OpenAI's davinci GPT-3 model.

AI-BOT Bot to connect a real Telegram user, simulating responses with OpenAI's davinci GPT-3 model.

Thempra 2 Dec 21, 2022
Artificial Conversational Entity for queries in Eulogio "Amang" Rodriguez Institute of Science and Technology (EARIST)

🤖 Coeus - EARIST A.C.E 💬 Coeus is an Artificial Conversational Entity for queries in Eulogio "Amang" Rodriguez Institute of Science and Technology,

Dids Irwyn Reyes 3 Oct 14, 2022
LSTM based Sentiment Classification using Tensorflow - Amazon Reviews Rating

LSTM based Sentiment Classification using Tensorflow - Amazon Reviews Rating (Dataset) The dataset is from Amazon Review Data (2018)

Immanuvel Prathap S 1 Jan 16, 2022
Use Tensorflow2.7.0 Build OpenAI'GPT-2

TF2_GPT-2 Use Tensorflow2.7.0 Build OpenAI'GPT-2 使用最新tensorflow2.7.0构建openai官方的GPT-2 NLP模型 优点 使用无监督技术 拥有大量词汇量 可实现续写(堪比“xx梦续写”) 实现对话后续将应用于FloatTech的Bot

Watermelon 9 Sep 13, 2022
Machine Learning Course Project, IMDB movie review sentiment analysis by lstm, cnn, and transformer

IMDB Sentiment Analysis This is the final project of Machine Learning Courses in Huazhong University of Science and Technology, School of Artificial I

Daniel 0 Dec 27, 2021