A pre-trained language model for social media text in Spanish

Overview

RoBERTuito

A pre-trained language model for social media text in Spanish

READ THE FULL PAPER Github Repository

RoBERTuito is a pre-trained language model for user-generated content in Spanish, trained following RoBERTa guidelines on 500 million tweets. RoBERTuito comes in 3 flavors: cased, uncased, and uncased+deaccented.

We tested RoBERTuito on a benchmark of tasks involving user-generated text in Spanish. It outperforms other pre-trained language models for this language such as BETO, BERTin and RoBERTa-BNE. The 4 tasks selected for evaluation were: Hate Speech Detection (using SemEval 2019 Task 5, HatEval dataset), Sentiment and Emotion Analysis (using TASS 2020 datasets), and Irony detection (using IrosVa 2019 dataset).

model hate speech sentiment analysis emotion analysis irony detection score
robertuito-uncased 0.801 ± 0.010 0.707 ± 0.004 0.551 ± 0.011 0.736 ± 0.008 0.699
robertuito-deacc 0.798 ± 0.008 0.702 ± 0.004 0.543 ± 0.015 0.740 ± 0.006 0.696
robertuito-cased 0.790 ± 0.012 0.701 ± 0.012 0.519 ± 0.032 0.719 ± 0.023 0.682
roberta-bne 0.766 ± 0.015 0.669 ± 0.006 0.533 ± 0.011 0.723 ± 0.017 0.673
bertin 0.767 ± 0.005 0.665 ± 0.003 0.518 ± 0.012 0.716 ± 0.008 0.667
beto-cased 0.768 ± 0.012 0.665 ± 0.004 0.521 ± 0.012 0.706 ± 0.007 0.665
beto-uncased 0.757 ± 0.012 0.649 ± 0.005 0.521 ± 0.006 0.702 ± 0.008 0.657

We release the pre-trained models on huggingface model hub:

Usage

IMPORTANT -- READ THIS FIRST

RoBERTuito is not yet fully-integrated into huggingface/transformers. To use it, first install pysentimiento

pip install pysentimiento

and preprocess text using pysentimiento.preprocessing.preprocess_tweet before feeding it into the tokenizer

','▁Esto','▁es','▁un','▁tweet','▁estoy','▁usando','▁','▁hashtag','▁','▁ro','bert','uito','▁@usuario','▁','▁emoji','▁cara','▁revolviéndose','▁de','▁la','▁risa','▁emoji',''] ">
from transformers import AutoTokenizer
from pysentimiento.preprocessing import preprocess_tweet

tokenizer = AutoTokenizer.from_pretrained('pysentimiento/robertuito-base-cased')

text = "Esto es un tweet estoy usando #Robertuito @pysentimiento 🤣"
preprocessed_text = preprocess_tweet(text, ha)

tokenizer.tokenize(preprocessed_text)
# ['','▁Esto','▁es','▁un','▁tweet','▁estoy','▁usando','▁','▁hashtag','▁','▁ro','bert','uito','▁@usuario','▁','▁emoji','▁cara','▁revolviéndose','▁de','▁la','▁risa','▁emoji','']

We are working on integrating this preprocessing step into a Tokenizer within transformers library

Development

Installing

We use python==3.7 and poetry to manage dependencies.

pip install poetry
poetry install

Benchmarking

To run benchmarks

python bin/run_benchmark.py <model_name> --times 5 --output_path <output_path>

Check RUN_BENCHMARKS for all experiments

Smoke test

Test the benchmark running

./smoke_test.sh

Citation

If you use RoBERTuito, please cite our paper:

@misc{perez2021robertuito,
      title={RoBERTuito: a pre-trained language model for social media text in Spanish},
      author={Juan Manuel Pérez and Damián A. Furman and Laura Alonso Alemany and Franco Luque},
      year={2021},
      eprint={2111.09453},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
Unofficial PyTorch Implementation of UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation

UnivNet UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation This is an unofficial PyTorch

MINDs Lab 170 Jan 04, 2023
Keras Implementation of The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation by (Simon Jégou, Michal Drozdzal, David Vazquez, Adriana Romero, Yoshua Bengio)

The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation: Work In Progress, Results can't be replicated yet with the m

Yad Konrad 196 Aug 30, 2022
Lazy, a tool for running things in idle time

Lazy, a tool for running things in idle time Mostly used to stop CUDA ML model training from making my desktop unusable. Simply monitors keyboard/mous

N Shepperd 46 Nov 06, 2022
SHRIMP: Sparser Random Feature Models via Iterative Magnitude Pruning

SHRIMP: Sparser Random Feature Models via Iterative Magnitude Pruning This repository is the official implementation of "SHRIMP: Sparser Random Featur

Bobby Shi 0 Dec 16, 2021
Contains modeling practice materials and homework for the Computational Neuroscience course at Okinawa Institute of Science and Technology

A310 Computational Neuroscience - Okinawa Institute of Science and Technology, 2022 This repository contains modeling practice materials and homework

Sungho Hong 1 Jan 24, 2022
Data and code for ICCV 2021 paper Distant Supervision for Scene Graph Generation.

Distant Supervision for Scene Graph Generation Data and code for ICCV 2021 paper Distant Supervision for Scene Graph Generation. Introduction The pape

THUNLP 23 Dec 31, 2022
Dense Unsupervised Learning for Video Segmentation (NeurIPS*2021)

Dense Unsupervised Learning for Video Segmentation This repository contains the official implementation of our paper: Dense Unsupervised Learning for

Visual Inference Lab @TU Darmstadt 173 Dec 26, 2022
codes for paper Combining Dynamic Local Context Focus and Dependency Cluster Attention for Aspect-level sentiment classification

DLCF-DCA codes for paper Combining Dynamic Local Context Focus and Dependency Cluster Attention for Aspect-level sentiment classification. submitted t

15 Aug 30, 2022
Boosted neural network for tabular data

XBNet - Xtremely Boosted Network Boosted neural network for tabular data XBNet is an open source project which is built with PyTorch which tries to co

Tushar Sarkar 175 Jan 04, 2023
Stochastic Tensor Optimization for Robot Motion - A GPU Robot Motion Toolkit

STORM Stochastic Tensor Optimization for Robot Motion - A GPU Robot Motion Toolkit [Install Instructions] [Paper] [Website] This package contains code

NVIDIA Research Projects 101 Dec 12, 2022
Object detection GUI based on PaddleDetection

PP-Tracking GUI界面测试版 本项目是基于飞桨开源的实时跟踪系统PP-Tracking开发的可视化界面 在PaddlePaddle中加入pyqt进行GUI页面研发,可使得整个训练过程可视化,并通过GUI界面进行调参,模型预测,视频输出等,通过多种类型的识别,简化整体预测流程。 GUI界面

杨毓栋 68 Jan 02, 2023
Enhancing Aspect-Based Sentiment Analysis with Supervised Contrastive Learning.

Enhancing Aspect-Based Sentiment Analysis with Supervised Contrastive Learning. Enhancing Aspect-Based Sentiment Analysis with Supervised Contrastive

<a href=[email protected](SZ)"> 7 Dec 16, 2021
You Only Look Once for Panopitic Driving Perception

You Only 👀 Once for Panoptic 🚗 Perception You Only Look at Once for Panoptic driving Perception by Dong Wu, Manwen Liao, Weitian Zhang, Xinggang Wan

Hust Visual Learning Team 1.4k Jan 04, 2023
Pytorch implementation of ProjectedGAN

ProjectedGAN-pytorch Pytorch implementation of ProjectedGAN (https://arxiv.org/abs/2111.01007) Note: this repository is still under developement. @InP

Dominic Rampas 17 Dec 14, 2022
Weakly-supervised semantic image segmentation with CNNs using point supervision

Code for our ECCV paper What's the Point: Semantic Segmentation with Point Supervision. Summary This library is a custom build of Caffe for semantic i

27 Sep 14, 2022
Geometry-Free View Synthesis: Transformers and no 3D Priors

Geometry-Free View Synthesis: Transformers and no 3D Priors Geometry-Free View Synthesis: Transformers and no 3D Priors Robin Rombach*, Patrick Esser*

CompVis Heidelberg 293 Dec 22, 2022
The Pytorch implementation for "Video-Text Pre-training with Learned Regions"

Region_Learner The Pytorch implementation for "Video-Text Pre-training with Learned Regions" (arxiv) We are still cleaning up the code further and pre

Rui Yan 0 Mar 20, 2022
Code and Data for NeurIPS2021 Paper "A Dataset for Answering Time-Sensitive Questions"

Time-Sensitive-QA The repo contains the dataset and code for NeurIPS2021 (dataset track) paper Time-Sensitive Question Answering dataset. The dataset

wenhu chen 35 Nov 14, 2022
The first dataset on shadow generation for the foreground object in real-world scenes.

Object-Shadow-Generation-Dataset-DESOBA Object Shadow Generation is to deal with the shadow inconsistency between the foreground object and the backgr

BCMI 105 Dec 30, 2022