PLStream: A Framework for Fast Polarity Labelling of Massive Data Streams

Last update: Aug 02, 2022

Motivation

When dataset freshness matters, annotating high-speed unlabelled data streams becomes essential, yet it remains an open problem.
We propose PLStream, a novel Apache Flink-based framework for fast polarity labelling of massive data streams, such as tweets or online product reviews.

Environment Requirements

The required Python packages are listed in requirements.txt.

Flink v1.13
Python 3.7
Java 8

DataSource

Quick access to the datasets: https://course.fast.ai/datasets#nlp

Tweets

1.6 million labelled tweets
Source: Sentiment140

Yelp Reviews

280,000 training and 19,000 test samples per polarity
Source: Yelp Review Polarity

Amazon Reviews

1,800,000 training and 200,000 test samples per polarity
Source: Amazon Product Review Polarity

Quick Start

Try PLStream quickly on the Yelp review dataset.

Data Preparation

cd PLStream
wget https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz
tar zxvf yelp_review_polarity_csv.tgz
mv yelp_review_polarity_csv/train.csv train.csv
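
As a sanity check after extraction, the following sketch counts the samples per class in train.csv. It assumes the layout of the Yelp Review Polarity release: no header row and two columns, the class index (1 or 2) followed by the review text. The function name count_labels is illustrative and not part of PLStream.

```python
import csv
from collections import Counter


def count_labels(path):
    """Count rows per class index in a header-less, two-column polarity CSV."""
    counts = Counter()
    # newline="" lets the csv module handle quoted fields correctly.
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            if not row:
                continue  # skip blank lines
            counts[row[0]] += 1  # first column is assumed to be the class index
    return counts
```

Calling count_labels("train.csv") from the PLStream directory should report the two class indices with equal counts if the download is complete.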

1. Install the required environment for PLStream

Please make sure the Environment Requirements listed above are met.

pip install -r requirements.txt

2. Start Redis-Server in a terminal

redis-server
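
Before moving on, it can help to confirm from a second terminal that the server is reachable. The sketch below sends a PING over a plain socket and looks for the +PONG reply. It assumes Redis listens on its default address 127.0.0.1:6379 without authentication; whether PLStream connects with these defaults is not stated here, and the helper name redis_is_up is illustrative.

```python
import socket


def redis_is_up(host="127.0.0.1", port=6379, timeout=1.0):
    """Return True if a server at host:port answers PING with +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")  # inline command form of the Redis protocol
            reply = conn.recv(64)
    except OSError:
        return False  # connection refused, host unreachable, or timed out
    # A password-protected server replies with an error instead of +PONG.
    return reply.startswith(b"+PONG")
```

If redis_is_up() returns False, check that redis-server is still running in its terminal.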

3. Run PLStream

python PLStream.py

Each output record has the form "original text" + "label" + "@@@@".
Calling split("@@@@") on the output separates the records, so the labelled dataset can be reorganized further.
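
A minimal sketch of that reorganization step follows. It assumes each record ends with a one-character label (for example 0 or 1) placed directly before the "@@@@" separator; the exact label encoding is not specified here, so adjust the slicing if the labels differ. The function name split_labelled_output is illustrative.

```python
def split_labelled_output(raw):
    """Split PLStream-style output into (text, label) pairs.

    Assumption: every record ends with a one-character label.
    """
    pairs = []
    for record in raw.split("@@@@"):
        record = record.strip()
        if not record:
            continue  # the trailing "@@@@" leaves an empty last piece
        # Everything except the final character is the text; the rest is the label.
        text, label = record[:-1].rstrip(), record[-1]
        pairs.append((text, label))
    return pairs
```

For example, split_labelled_output("great food 1@@@@slow service 0@@@@") yields the pairs ("great food", "1") and ("slow service", "0").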

Optional

To see the labelling accuracy, run: python PLStream_acc.py
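
How PLStream_acc.py computes its score is not described here. As a generic illustration only, labelling accuracy is the fraction of predicted labels that match the ground-truth labels; the function name labelling_accuracy below is hypothetical and not taken from the repository.

```python
def labelling_accuracy(predicted, expected):
    """Return the fraction of predicted labels equal to the expected labels."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("no labels to compare")
    matches = sum(1 for p, e in zip(predicted, expected) if p == e)
    return matches / len(expected)
```

With predictions ["1", "0", "1", "1"] and ground truth ["1", "0", "0", "1"], three of four labels match, giving an accuracy of 0.75.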
