TLA - Twitter Linguistic Analysis

Tool for linguistic analysis of communities

TLA is built with PyTorch, Transformers, and several other state-of-the-art machine learning techniques. It aims to expedite and structure the cumbersome process of collecting, labeling, and analyzing Twitter data for a corpus of languages, and it provides detailed labeled datasets for all of them. The resulting analysis can help in understanding the sentiments of different linguistic communities and in devising solutions to their problems. The library supports the following languages:

Language      Code     Language      Code
English       en       Hindi         hi
Swedish       sv       Thai          th
Dutch         nl       Japanese      ja
Turkish       tr       Urdu          ur
Indonesian    id       Portuguese    pt
French        fr       Chinese       zn-ch
Spanish       es       Persian       fa
Romanian      ro       Russian       ru
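
The codes in the table can be kept in a small lookup so that a mistyped code is caught before any tweets are fetched. The sketch below is not part of TLA: SUPPORTED_LANGUAGES and validate_lang are hypothetical names, and the codes are copied verbatim from the table above.

```python
# Hypothetical helper, not part of TLA.
# Language codes are copied verbatim from the table of supported languages.
SUPPORTED_LANGUAGES = {
    "English": "en", "Hindi": "hi",
    "Swedish": "sv", "Thai": "th",
    "Dutch": "nl", "Japanese": "ja",
    "Turkish": "tr", "Urdu": "ur",
    "Indonesian": "id", "Portuguese": "pt",
    "French": "fr", "Chinese": "zn-ch",
    "Spanish": "es", "Persian": "fa",
    "Romanian": "ro", "Russian": "ru",
}


def validate_lang(code: str) -> str:
    """Return the code unchanged if it appears in the table, else raise ValueError."""
    if code not in SUPPORTED_LANGUAGES.values():
        raise ValueError(f"Unsupported language code: {code!r}")
    return code
```

A call such as validate_lang("en") could then guard any function that takes a language code.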

Features

  • Provides 16 labeled datasets, one per language, for analysis.
  • Implements a BERT-based architecture to identify languages.
  • Provides functions to extract, process, and label tweets from Twitter.
  • Provides a random forest classifier for sentiment analysis on any string.

Installation

pip install --upgrade git+https://github.com/tusharsarkar3/TLA.git

Overview

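Extract data
from TLA.Data.get_data import store_data
# 'en' is the language code; the second argument presumably toggles
# pre-processing (compare the --process flag of get_data.py).
store_data('en', False)

This extracts the unlabeled tweets and stores them in a new directory named datasets inside data.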

Label data
from TLA.Datasets.get_lang_data import language_data
df = language_data('en')
print(df)

This will print the labeled data that we have already collected.

Classify languages
Training

Training can be done in the following way:

from TLA.Lang_Classify.train import train_lang
train_lang(path_to_dataset,epochs)
Prediction

Inference is done in the following way:

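from TLA.Lang_Classify.predict import predict
# NOTE: get_model is not imported in the original snippet; import it from the
# TLA module that defines it before running this code.
model = get_model(path_to_weights)
preds = predict(dataframe_to_be_used, model)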
Analyse
Training

Training can be done in the following way:

from TLA.Analysis.train_rf import train_rf
train_rf(path_to_dataset)

This stores the vectorizers and models in separate directories named saved_vec and saved_rf inside the Analysis directory. Instructions for training on multiple languages are given in the next section, which shows how to run the commands from the CLI.

Final Analysis

Analysis is done in the following way:

from TLA.Analysis.analyse import analyse_data 
analyse_data(path_to_weights)

This will store the final analysis as .csv inside a new directory named analysis.

Overview with Git

Installation (alternative method)
git clone https://github.com/tusharsarkar3/TLA.git
Extract data

Navigate to the required directory:

cd Data

Run the following command:

python get_data.py --lang en --process True

The --lang flag specifies the language of the required dataset, and the --process flag specifies whether the data should be pre-processed before it is returned. Pass the code of the required language, as listed in the table of supported languages above, to --lang.

Loading Dataset

To load a dataset, run the following in Python:

import pandas as pd

df = pd.read_csv("TLA/TLA/Datasets/get_data_en.csv")

This returns a DataFrame containing the data for the requested language.

In the file name get_data_en, replace en with the desired language code to load the DataFrame for that language.
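
The substitution described above can be wrapped in a small helper. This is only a sketch: dataset_path and DATASET_DIR are hypothetical names, and it assumes that every dataset follows the get_data_<code>.csv pattern under TLA/TLA/Datasets.

```python
from pathlib import PurePosixPath

# Assumption: all datasets live here and are named get_data_<code>.csv.
DATASET_DIR = PurePosixPath("TLA/TLA/Datasets")


def dataset_path(lang_code: str) -> str:
    """Build the CSV path for a language code, e.g. 'hi' for Hindi."""
    return str(DATASET_DIR / f"get_data_{lang_code}.csv")
```

With pandas installed, pd.read_csv(dataset_path("hi")) would then load the Hindi dataset, provided the file exists at that location.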

Pre-Processing

To pre-process a DataFrame of tweets, first change to the Data directory in your terminal:

cd Data

Then run the following in Python:

from TLA.Data import Pre_Process_Tweets

df = Pre_Process_Tweets.pre_process_tweet(df)

The function pre_process_tweet takes a DataFrame of tweets and returns a DataFrame in which each tweet is accompanied by the list of its pre-processed words.
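
To give a feel for what turning a tweet into a list of words can involve, here is a minimal standalone sketch. It is not TLA's implementation: clean_tweet is a hypothetical function, and the specific steps (dropping URLs and mentions, lowercasing, stripping ASCII punctuation from token edges) are assumptions about typical tweet cleaning.

```python
import re
import string
from typing import List

# Assumed cleaning rules, not taken from TLA's source.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")


def clean_tweet(text: str) -> List[str]:
    """Return a list of lowercase words with URLs and mentions removed."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Strip ASCII punctuation (including '#') from both ends of each token.
    tokens = (tok.strip(string.punctuation) for tok in text.lower().split())
    # Drop tokens that consisted only of punctuation.
    return [tok for tok in tokens if tok]
```

Applied to a DataFrame, something like df["words"] = df["tweet"].apply(clean_tweet) would place the word list next to each tweet; the column names here are hypothetical.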

Analysis

Training

To train a random forest classifier for sentiment analysis, run the following commands in your terminal:

cd Analysis

python train_rf.py --path "path to your datafile" --train_all_datasets False

The --path flag gives the path to the dataset on which to train the random forest classifier. The --train_all_datasets flag is a boolean that trains the model on multiple datasets at once.

The trained model is saved as a .pkl file at "TLA\Analysis\saved_rf{}.pkl", and the fitted vectorizer is saved as a .pkl file at "TLA\Analysis\saved_vec{}.pkl".

Get Sentiment

To get the sentiment of any string, run the following commands in your terminal:

cd Analysis

python get_sentiment.py --prediction "Your string for prediction to be made upon" --lang "en"

The --prediction flag takes the string whose sentiment you want. The --lang flag takes the code of the language the string is written in.

The output is the sentiment of the string, either positive or negative.
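
The same command can be driven from Python, which is convenient when scoring many strings. The wrapper below is a sketch only: build_sentiment_cmd and get_sentiment are hypothetical names, the flags are those shown above, and it assumes the script prints its result to standard output.

```python
import subprocess
import sys
from typing import List


def build_sentiment_cmd(text: str, lang: str) -> List[str]:
    """Assemble the get_sentiment.py command line using the flags shown above."""
    return [sys.executable, "get_sentiment.py", "--prediction", text, "--lang", lang]


def get_sentiment(text: str, lang: str, analysis_dir: str = "Analysis") -> str:
    """Run the script inside analysis_dir and return whatever it prints.

    Assumption: the sentiment is written to standard output.
    """
    result = subprocess.run(
        build_sentiment_cmd(text, lang),
        cwd=analysis_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Passing the arguments as a list avoids shell quoting problems when the tweet itself contains quotes.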

Statistics

To get comprehensive sentiment statistics for the datasets, run the following commands in your terminal:

cd Analysis

python analyse.py

This writes table1.csv to 'TLA\Analysis\analysis\table1.csv', containing the percentage of positive and negative tweets for a given language dataset.

It also writes table2.csv to 'TLA\Analysis\analysis\table2.csv', containing statistics for all languages combined.
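
The per-language percentages are simple proportions of predicted labels. The following sketch shows that arithmetic; it is not the code in analyse.py, sentiment_percentages is a hypothetical name, and the label strings "positive" and "negative" are assumptions.

```python
from collections import Counter
from typing import Iterable, Tuple


def sentiment_percentages(labels: Iterable[str]) -> Tuple[float, float]:
    """Return (positive %, negative %) for a sequence of sentiment labels."""
    counts = Counter(labels)
    total = counts["positive"] + counts["negative"]
    if total == 0:
        raise ValueError("no positive or negative labels found")
    positive = round(100 * counts["positive"] / total, 2)
    # The two percentages are complementary by construction.
    return positive, round(100 - positive, 2)
```

For example, 334 positive and 166 negative labels out of 500 give 66.8 and 33.2 percent.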

Language Classification

Training

To train a language classification model on a given dataset, run the following commands in your terminal:

cd Lang_Classify

python train.py --data "path for your dataset" --model "path to weights if pretrained" --epochs 4

The --data flag takes the path to your training dataset.

The --model flag takes the path to the weights of a pretrained model, if any.

The --epochs flag sets the number of epochs to train the model for.

The output is a .pt file named saved_wieghts_full.pt, in which the trained weights are stored.
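
For readers who want to see how three such flags are typically declared, here is an argparse sketch. It is a reconstruction for illustration, not the parser in train.py: build_parser is a hypothetical name, and the defaults and the required status of --data are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the three flags described above (illustrative reconstruction)."""
    parser = argparse.ArgumentParser(
        description="Train the language classifier (illustrative)."
    )
    parser.add_argument("--data", required=True,
                        help="path to the training dataset")
    parser.add_argument("--model", default=None,
                        help="path to pretrained weights, if any")
    parser.add_argument("--epochs", type=int, default=4,
                        help="number of training epochs")
    return parser
```

Declaring --epochs with type=int means the value arrives as an integer rather than a string, so it can be passed straight to a training loop.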

Prediction

To predict the language of any given string, run the following commands in your terminal:

cd Lang_Classify

python predict.py --predict "Text/DataFrame for language to predicted" --weights "Path for the stored weights of your model"

The --predict flag takes the string whose language you want to identify.

The --weights flag takes the path to the stored weights the model should use for prediction.

The output is the language the string was written in.


Results:

Figure: Performance of TLA (loss vs. epochs)

Language      Total tweets    Positive tweets (%)    Negative tweets (%)
English       500             66.8                   33.2
Spanish       500             61.4                   38.6
Persian       50              52                     48
French        500             53                     47
Hindi         500             62                     38
Indonesian    500             63.4                   36.6
Japanese      500             85.6                   14.4
Dutch         500             84.2                   15.8
Portuguese    500             61.2                   38.8
Romanian      457             85.55                  14.44
Russian       213             62.91                  37.08
Swedish       420             80.23                  19.76
Thai          424             71.46                  28.53
Turkish       500             67.8                   32.2
Urdu          42              69.04                  30.95
Chinese       500             80.6                   19.4
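
Because the datasets differ in size, a combined figure across languages should weight each percentage by its tweet count. The sketch below does that with the rows of the table above. It is independent arithmetic, not a reproduction of table2.csv, and RESULTS and overall_positive_percentage are hypothetical names.

```python
# (language, total tweets, positive %) copied from the results table above.
RESULTS = [
    ("English", 500, 66.8), ("Spanish", 500, 61.4), ("Persian", 50, 52.0),
    ("French", 500, 53.0), ("Hindi", 500, 62.0), ("Indonesian", 500, 63.4),
    ("Japanese", 500, 85.6), ("Dutch", 500, 84.2), ("Portuguese", 500, 61.2),
    ("Romanian", 457, 85.55), ("Russian", 213, 62.91), ("Swedish", 420, 80.23),
    ("Thai", 424, 71.46), ("Turkish", 500, 67.8), ("Urdu", 42, 69.04),
    ("Chinese", 500, 80.6),
]


def overall_positive_percentage(rows) -> float:
    """Weight each language's positive share by its number of tweets."""
    total = sum(n for _, n, _ in rows)
    positive = sum(n * pct / 100 for _, n, pct in rows)
    return round(100 * positive / total, 2)
```

A plain average of the sixteen percentages would instead give Urdu (42 tweets) the same influence as English (500 tweets).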

Reference:

@misc{sarkar2021tla,
     title={TLA: Twitter Linguistic Analysis}, 
     author={Tushar Sarkar and Nishant Rajadhyaksha},
     year={2021},
     eprint={2107.09710},
     archivePrefix={arXiv},
     primaryClass={cs.CL}
}
@misc{640cba8b-35cb-475e-ab04-62d079b74d13,
 title = {TLA: Twitter Linguistic Analysis},
 author = {Tushar Sarkar and Nishant Rajadhyaksha},
  journal = {Software Impacts},
 doi = {10.24433/CO.6464530.v1}, 
 howpublished = {\url{https://www.codeocean.com/}},
 year = 2021,
 month = {6},
 version = {v1}
}

Features to be added:

  • Support for more languages
  • A GUI-based system for better accessibility
  • Improved performance of the baseline model

Developed by Tushar Sarkar and Nishant Rajadhyaksha
