PLStream: A Framework for Fast Polarity Labelling of Massive Data Streams

Last update: Aug 02, 2022

Motivation

When dataset freshness matters, annotating high-speed unlabelled data streams becomes critical, yet it remains an open problem.
We propose PLStream, a novel Apache Flink-based framework for fast polarity labelling of massive data streams, such as tweets or online product reviews.

Environment Requirements

The required Python packages are listed in requirements.txt.

Flink v1.13
Python 3.7
Java 8

DataSource

The datasets can be downloaded from https://course.fast.ai/datasets#nlp

Tweets

1.6 million labelled tweets
Source: Sentiment140

Yelp Reviews

280,000 training and 19,000 test samples in each polarity
Source: Yelp Review Polarity

Amazon Reviews

1,800,000 training and 200,000 testing samples in each polarity
Source: Amazon product review polarity

Quick Start

Quickly try PLStream on the Yelp review dataset.

Data Preparation

cd PLStream
wget https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz
tar zxvf yelp_review_polarity_csv.tgz
mv yelp_review_polarity_csv/train.csv train.csv
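
Before starting the pipeline, it can help to sanity-check the extracted file. The sketch below counts rows per class. It assumes that train.csv is a headerless CSV with two columns, a class index (1 for negative, 2 for positive) followed by the review text; that layout is an assumption here and should be checked against the readme shipped in the archive. The function name count_polarities and the sample rows are illustrative only.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def count_polarities(handle: TextIO) -> Counter:
    """Count rows per class index in a headerless two-column CSV."""
    counts: Counter = Counter()
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        counts[row[0]] += 1  # ASSUMPTION: first column holds the class index
    return counts


# Tiny in-memory stand-in for train.csv (made-up rows, not real data).
sample = io.StringIO(
    '"1","Terrible service."\n'
    '"2","Loved it, will return."\n'
    '"2","Great."\n'
)
counts = count_polarities(sample)
```

To run it on the real file, open it with open("train.csv", newline="", encoding="utf-8") and pass the handle to count_polarities. If the layout assumption holds, both classes should appear with equal counts.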

1. Install the required environment for PLStream

Please make sure the Environment Requirements listed above are met.

pip install -r requirements.txt

2. Start Redis-Server in a terminal

redis-server
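
To confirm that the server is up before running PLStream, a quick check along the following lines may be useful. It is a minimal sketch using only the standard library: it sends the inline PING command and looks for a +PONG reply. The host, port (6379 is the Redis default) and the absence of a password are assumptions, and redis_is_reachable is a hypothetical helper, not part of PLStream.

```python
import socket


def redis_is_reachable(host: str = "127.0.0.1", port: int = 6379,
                       timeout: float = 1.0) -> bool:
    """Return True if a server at host:port answers PING with +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")  # inline command form of the Redis protocol
            reply = conn.recv(16)
    except OSError:
        return False  # refused, timed out, or otherwise unreachable
    # A password-protected server answers with an error, which counts as False.
    return reply.startswith(b"+PONG")
```

Calling redis_is_reachable() with no arguments checks the default local address and returns False if nothing is listening there.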

3. Run PLStream

python PLStream.py

Each output record has the form "original text" + "label" + "@@@@".
Splitting the output with split("@@@@") lets us further reorganize the labelled dataset.
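
The splitting step might look like the sketch below. The "@@@@" terminator comes from the description above; how the label is attached to the text is not specified, so the code assumes the label is the last whitespace-separated token of each record. The helper name parse_labelled_output and the sample string are made up for illustration.

```python
from typing import List, Tuple

DELIMITER = "@@@@"  # record terminator described in the text above


def parse_labelled_output(raw: str) -> List[Tuple[str, str]]:
    """Split raw output into (text, label) pairs.

    ASSUMPTION: the label is the last whitespace-separated token of each
    record; adjust this if the actual output encodes the label differently.
    """
    pairs = []
    for record in raw.split(DELIMITER):
        record = record.strip()
        if not record:
            continue  # skip the empty piece after the final delimiter
        text, _, label = record.rpartition(" ")
        pairs.append((text.strip(), label))
    return pairs


# Made-up sample in the assumed format, not real PLStream output.
sample = "Great food and friendly staff 1@@@@Never coming back 0@@@@"
labelled = parse_labelled_output(sample)
```

The resulting list of pairs can then be written to CSV or loaded into a DataFrame for further processing.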

Optional

To see the labelling accuracy, simply run: python PLStream_acc.py
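
For reference, labelling accuracy is conventionally the fraction of records whose assigned label matches the ground-truth label. The sketch below shows that calculation on made-up labels. It is not taken from PLStream_acc.py, whose internals are not described here, and labelling_accuracy is a hypothetical helper.

```python
from typing import Sequence


def labelling_accuracy(predicted: Sequence[str], truth: Sequence[str]) -> float:
    """Return the fraction of positions where predicted equals truth."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        raise ValueError("no labels to compare")
    correct = sum(p == t for p, t in zip(predicted, truth))
    return correct / len(truth)


# Made-up labels: three of the four predictions match.
acc = labelling_accuracy(["1", "0", "1", "1"], ["1", "0", "0", "1"])
print(f"accuracy = {acc:.2f}")  # prints "accuracy = 0.75"
```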
