TLA - Twitter Linguistic Analysis

Related tags

Text Data & NLPTLA
Overview

TLA - Twitter Linguistic Analysis

Tool for linguistic analysis of communities

TLA is built using PyTorch, Transformers and several other State-of-the-Art machine learning techniques and it aims to expedite and structure the cumbersome process of collecting, labeling, and analyzing data from Twitter for a corpus of languages while providing detailed labeled datasets for all the languages. The analysis provided by TLA will also go a long way in understanding the sentiments of different linguistic communities and come up with new and innovative solutions for their problems based on the analysis. List of languages our library provides support for are listed as follows:

Language Code Language Code
English en Hindi hi
Swedish sv Thai th
Dutch nl Japanese ja
Turkish tr Urdu ur
Indonesian id Portuguese pt
French fr Chinese zn-ch
Spanish es Persian fa
Romainain ro Russian ru

Features

  • Provides 16 labeled Datasets for different languages for analysis.
  • Implements Bert based architecture to identify languages.
  • Provides Functionalities to Extract,process and label tweets from twitter.
  • Provides a Random Forest classifier to implement sentiment analysis on any string.

Installation :

pip install --upgrade https://github.com/tusharsarkar3/TLA.git

Overview

Extract data
from TLA.Data.get_data import store_data
store_data('en',False)

This will extract and store the unlabeled data in a new directory inside data named datasets.

Label data
from TLA.Datasets.get_lang_data import language_data
df = language_data('en')
print(df)

This will print the labeled data that we have already collected.

Classify languages
Training

Training can be done in the following way:

from TLA.Lang_Classify.train import train_lang
train_lang(path_to_dataset,epochs)
Prediction

Inference is done in the following way:

from TLA.Lang_Classify.predict import predict
model = get_model(path_to_weights)
preds = predict(dataframe_to_be_used,model)
Analyse
Training

Training can be done in the following way:

from TLA.Analyse.train_rf import train_rf
train_rf(path_to_dataset)

This will store all the vectorizers and models in a seperate directory named saved_rf and saved_vec and they are present inside Analysis directory. Further instructions for training multiple languages is given in the next section which shows how to run the commands using CLI

Final Analysis

Analysis is done in the following way:

from TLA.Analysis.analyse import analyse_data 
analyse_data(path_to_weights)

This will store the final analysis as .csv inside a new directory named analysis.

Overview with Git

Installation another method
git clone https://github.com/tusharsarkar3/TLA.git
Extract data Navigate to the required directory
cd Data

Run the following command:

python get_data.py --lang en --process True

Lang flag is used to input the language of the dataset that is required and process flag shows where pre-processing should be done before returning the data. Give the following codes in the lang flag wrt the required language:

Loading Dataset

To load a dataset run the following command in python.

df= pd.read_csv("TLA/TLA/Datasets/get_data_en.csv")
 

The command will return a dataframe consisting of the data for the specific language requested.

In the phrase get_data_en, en can be sunstituted by the desired language code to load the dataframe for the specific language.

Pre-Processing

To preprocess a given string run the following command.

In your terminal use code

cd Data

then run the command in python

from TLA.Data import Pre_Process_Tweets

df=Pre_Process_Tweets.pre_process_tweet(df)

Here the function pre_process_tweet takes an input as a dataframe of tweets and returns an output of a dataframe with the list of preprocessed words for a particular tweet next to the tweet in the dataframe.

Analysis Training To train a random forest classifier for the purpose of sentiment analysis run the following command in your terminal.
cd Analysis

then

python train.rf --path "path to your datafile" --train_all_datasets False

here the --path flag represents the path to the required dataset you want to train the Random Forest Classifier on the --train_all_datasets flag is a boolean which can be used to train the model on multiple datasets at once.

The output is a file with the a .pkl file extention saved in the folder at location "TLA\Analysis\saved_rf{}.pkl" The output for vectorization of is stored in a .pkl file in the directory "TLA\Analysis\saved_vec{}.pkl"

Get Sentiment

To get the sentiment of any string use the following code.

In your terminal type

cd Analysis

then in your terminal type

python get_sentiment.py --prediction "Your string for prediction to be made upon" --lang "en"

here the --prediction flag collects the string for which you want to get the sentiment for. the --lang represents the language code representing the language you typed your string in.

The output is a sentiment which is either positive or negative depending on your string.

Statistics

To get a comprehensive statistic on sentiment of datasets run the following command.

In your terminal type

cd Analysis

then

python analyse.py 

This will give you an output of a table1.csv file at the location 'TLA\Analysis\analysis\table1.csv' comprising of statistics relating to the percentage of positive or negative tweets for a given language dataset.

It will also give a table2.csv file at 'TLA\Analysis\analysis\table2.csv' comprising of statistics for all languages combined.

Language Classification Training To train a model for language classfication on a given dataset run the following commands.

In your terminal run

cd Lang_Classify

then run

python train.py --data "path for your dataset" --model "path to weights if pretrained" --epochs 4

The --data flag requires the path to your training dataset.

The --model flag requires the path to the model you want to implement

The --epoch flag represents the epochs you want to train your model for.

The output is a file with a .pt extention named saved_wieghts_full.pt where your trained wieghst are stored.

Prediction To make prediction on any given string Us ethe following code.

In your terminal type

cd Lang_Classify

then run the code

python predict.py --predict "Text/DataFrame for language to predicted" --weights " Path for the stored weights of your model " 

The --predict flag requires the string you want to get the language for.

The --wieghts flag is the path for the stored wieghts you want to run your model on to make predictions.

The outputs is the language your string was typed in.


Results:

img

Performance of TLA ( Loss vs epochs)

Language Total tweets Positive Tweets Percentage Negative Tweets Percentage
English 500 66.8 33.2
Spanish 500 61.4 38.6
Persian 50 52 48
French 500 53 47
Hindi 500 62 38
Indonesian 500 63.4 36.6
Japanese 500 85.6 14.4
Dutch 500 84.2 15.8
Portuguese 500 61.2 38.8
Romainain 457 85.55 14.44
Russian 213 62.91 37.08
Swedish 420 80.23 19.76
Thai 424 71.46 28.53
Turkish 500 67.8 32.2
Urdu 42 69.04 30.95
Chinese 500 80.6 19.4

Reference:

@misc{sarkar2021tla,
     title={TLA: Twitter Linguistic Analysis}, 
     author={Tushar Sarkar and Nishant Rajadhyaksha},
     year={2021},
     eprint={2107.09710},
     archivePrefix={arXiv},
     primaryClass={cs.CL}
}
@misc{640cba8b-35cb-475e-ab04-62d079b74d13,
 title = {TLA: Twitter Linguistic Analysis},
 author = {Tushar Sarkar and Nishant Rajadhyaksha},
  journal = {Software Impacts},
 doi = {10.24433/CO.6464530.v1}, 
 howpublished = {\url{https://www.codeocean.com/}},
 year = 2021,
 month = {6},
 version = {v1}
}

Features to be added :

  • Access to more language
  • Creating GUI based system for better accesibility
  • Improving performance of the baseline model

Developed by Tushar Sarkar and Nishant Rajadhyaksha

Owner
Tushar Sarkar
I love solving problems with data
Tushar Sarkar
PyTorch implementation of "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language" from Meta AI

data2vec-pytorch PyTorch implementation of "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language" from Meta AI (F

Aryan Shekarlaban 105 Jan 04, 2023
OpenAI CLIP text encoders for multiple languages!

Multilingual-CLIP OpenAI CLIP text encoders for any language Colab Notebook · Pre-trained Models · Report Bug Overview OpenAI recently released the pa

Fredrik Carlsson 481 Dec 30, 2022
profile tools for pytorch nn models

nnprof Introduction nnprof is a profile tool for pytorch neural networks. Features multi profile mode: nnprof support 4 profile mode: Layer level, Ope

Feng Wang 42 Jul 09, 2022
NLP tool to extract emotional phrase from tweets 🤩

Emotional phrase extractor Extract phrase in the given text that is used to express the sentiment. Capturing sentiment in language is important in the

Shahul ES 38 Oct 17, 2022
🗣️ NALP is a library that covers Natural Adversarial Language Processing.

NALP: Natural Adversarial Language Processing Welcome to NALP. Have you ever wanted to create natural text from raw sources? If yes, NALP is for you!

Gustavo Rosa 21 Aug 12, 2022
DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (SVS & TTS); AAAI 2022

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism This repository is the official PyTorch implementation of our AAAI-2022 paper, in

Jinglin Liu 829 Jan 07, 2023
Mkdocs + material + cool stuff

Modern-Python-Doc-Example mkdocs + material + cool stuff Doc is live here Features out of the box amazing good looking website thanks to mkdocs.org an

Francesco Saverio Zuppichini 61 Oct 26, 2022
Tools to download and cleanup Common Crawl data

cc_net Tools to download and clean Common Crawl as introduced in our paper CCNet. If you found these resources useful, please consider citing: @inproc

Meta Research 483 Jan 02, 2023
AEC_DeepModel - Deep learning based acoustic echo cancellation baseline code

AEC_DeepModel - Deep learning based acoustic echo cancellation baseline code

凌逆战 75 Dec 05, 2022
This repository will contain the code for the CVPR 2021 paper "GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields"

GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields Project Page | Paper | Supplementary | Video | Slides | Blog | Talk If

1.1k Dec 27, 2022
Lightweight utility tools for the detection of multiple spellings, meanings, and language-specific terminology in British and American English

Breame ( British English and American English) Breame is a lightweight Python package with a number of utility tools to aid in the detection of words

Charles 8 Oct 10, 2022
NLP-based analysis of poor Chinese movie reviews on Douban

douban_embedding 豆瓣中文影评差评分析 1. NLP NLP(Natural Language Processing)是指自然语言处理,他的目的是让计算机可以听懂人话。 下面是我将2万条豆瓣影评训练之后,随意输入一段新影评交给神经网络,最终AI推断出的结果。 "很好,演技不错

3 Apr 15, 2022
A PyTorch-based model pruning toolkit for pre-trained language models

English | 中文说明 TextPruner是一个为预训练语言模型设计的模型裁剪工具包,通过轻量、快速的裁剪方法对模型进行结构化剪枝,从而实现压缩模型体积、提升模型速度。 其他相关资源: 知识蒸馏工具TextBrewer:https://github.com/airaria/TextBrewe

Ziqing Yang 231 Jan 08, 2023
Chinese Named Entity Recognization (BiLSTM with PyTorch)

BiLSTM-CRF for Name Entity Recognition PyTorch version A PyTorch implemention of Bi-LSTM-CRF model for Chinese Named Entity Recognition. 使用 PyTorch 实现

5 Jun 01, 2022
sangha, pronounced "suhng-guh", is a social networking, booking platform where students and teachers can share their practice.

Flask React Project This is the backend for the Flask React project. Getting started Clone this repository (only this branch) git clone https://github

Courtney Newcomer 17 Sep 29, 2021
Transcribing audio files using Hugging Face's implementation of Wav2Vec2 + "chain-linking" NLP tasks to combine speech-to-text with downstream tasks like translation and summarisation.

PART 2: CHAIN LINKING AUDIO-TO-TEXT NLP TASKS 2A: TRANSCRIBE-TRANSLATE-SENTIMENT-ANALYSIS In notebook3.0, I demo a simple workflow to: transcribe a lo

Chua Chin Hon 30 Jul 13, 2022
A telegram bot to translate 100+ Languages

🔥 GOOGLE TRANSLATER 🔥 The owner would not be responsible for any kind of bans due to the bot. • ⚡ INSTALLING ⚡ • • 🔰 Deploy To Railway 🔰 • • ✅ OFF

Aɴᴋɪᴛ Kᴜᴍᴀʀ 5 Dec 20, 2021
Nested Named Entity Recognition for Chinese Biomedical Text

CBio-NAMER CBioNAMER (Nested nAMed Entity Recognition for Chinese Biomedical Text) is our method used in CBLUE (Chinese Biomedical Language Understand

8 Dec 25, 2022
PyTorch Implementation of the paper Single Image Texture Translation for Data Augmentation

SITT The repo contains official PyTorch Implementation of the paper Single Image Texture Translation for Data Augmentation. Authors: Boyi Li Yin Cui T

Boyi Li 52 Jan 05, 2023
Must-read papers on improving efficiency for pre-trained language models.

Must-read papers on improving efficiency for pre-trained language models.

Tobias Lee 89 Jan 03, 2023