PLStream: A Framework for Fast Polarity Labelling of Massive Data Streams

Related tags

Data AnalysisPLStream
Overview

PLStream: A Framework for Fast Polarity Labelling of Massive Data Streams

Motivation

  • When dataset freshness is critical, the annotating of high speed unlabelled data streams becomes critical but remains an open problem.
  • We propose PLStream, a novel Apache Flink-based framework for fast polarity labelling of massive data streams, like Twitter tweets or online product reviews.

Environment Requirements

relative python packages are summerized in requirements.txt

  1. Flink v1.13
  2. Python 3.7
  3. Java 8

DataSource

Tweets

Yelp Reviews

Amazon Reviews

Quick Start

quick try PLStream on yelp review dataset

Data Prepare

cd PLStream
weget https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz
tar zxvf yelp_review_polarity_csv.tgz
mv yelp_review_polarity_csv/train.csv train.csv

1. Install required environment of PLStream

  • please make sure Environment Requirements mentioned above is ready.
pip install -r requirements.txt

2. Start Redis-Server in a terminal

redis-server

3. Run PLStream

python PLStream.py
  • The outputs' form is "original text" + "label" + "@@@@":
  • With help of a split("@@@@") function we can further reorganize the labelled dataset.

Optional

to see the labelling accuracy, simply run: python PLStream_acc.py

Python ELT Studio, an application for building ELT (and ETL) data flows.

The Python Extract, Load, Transform Studio is an application for performing ELT (and ETL) tasks. Under the hood the application consists of a two parts.

Schlerp 55 Nov 18, 2022
Projects that implement various aspects of Data Engineering.

DATAWAREHOUSE ON AWS The purpose of this project is to build a datawarehouse to accomodate data of active user activity for music streaming applicatio

2 Oct 14, 2021
Desafio proposto pela IGTI em seu bootcamp de Cloud Data Engineer

Desafio Modulo 4 - Cloud Data Engineer Bootcamp - IGTI Objetivos Criar infraestrutura como código Utuilizando um cluster Kubernetes na Azure Ingestão

Otacilio Filho 4 Jan 23, 2022
Multiple Pairwise Comparisons (Post Hoc) Tests in Python

scikit-posthocs is a Python package that provides post hoc tests for pairwise multiple comparisons that are usually performed in statistical data anal

Maksim Terpilowski 264 Dec 30, 2022
CSV database for chihuahua (HUAHUA) blockchain transactions

super-fiesta Shamelessly ripped components from https://github.com/hodgerpodger/staketaxcsv - Thanks for doing all the hard work. This code does only

Arlene Macciaveli 1 Jan 07, 2022
Data Analysis for First Year Laboratory at Imperial College, London.

Data Analysis for First Year Laboratory at Imperial College, London. For personal reference only, and to reference in lab reports and lab books.

Martin He 0 Aug 29, 2022
A distributed block-based data storage and compute engine

Nebula is an extremely-fast end-to-end interactive big data analytics solution. Nebula is designed as a high-performance columnar data storage and tabular OLAP engine.

Columns AI 131 Dec 26, 2022
ETL pipeline on movie data using Python and postgreSQL

Movies-ETL ETL pipeline on movie data using Python and postgreSQL Overview This project consisted on a automated Extraction, Transformation and Load p

Juan Nicolas Serrano 0 Jul 07, 2021
BIGDATA SIMULATION ONE PIECE WORLD CENSUS

ONE PIECE is a Japanese manga of great international success. The story turns inhabited in a fictional world, tells the adventures of a young man whose body gained rubber properties after accidentall

Maycon Cypriano 3 Jun 30, 2022
Display the behaviour of a realtime program with a scope or logic analyser.

1. A monitor for realtime MicroPython code This library provides a means of examining the behaviour of a running system. It was initially designed to

Peter Hinch 17 Dec 05, 2022
An orchestration platform for the development, production, and observation of data assets.

Dagster An orchestration platform for the development, production, and observation of data assets. Dagster lets you define jobs in terms of the data f

Dagster 6.2k Jan 08, 2023
A library to create multi-page Streamlit applications with ease.

A library to create multi-page Streamlit applications with ease.

Jackson Storm 107 Jan 04, 2023
Python package for analyzing sensor-collected human motion data

Python package for analyzing sensor-collected human motion data

Simon Ho 71 Nov 05, 2022
Universal data analysis tools for atmospheric sciences

U_analysis Universal data analysis tools for atmospheric sciences Script written in python 3. This file defines multiple functions that can be used fo

Luis Ackermann 1 Oct 10, 2021
OpenARB is an open source program aiming to emulate a free market while encouraging players to participate in arbitrage in order to increase working capital.

Overview OpenARB is an open source program aiming to emulate a free market while encouraging players to participate in arbitrage in order to increase

Tom 3 Feb 12, 2022
Binance Kline Data With Python

Binance Kline Data by seunghan(gingerthorp) reference https://github.com/binance/binance-public-data/ All intervals are supported: 1m, 3m, 5m, 15m, 30

shquant 5 Jul 13, 2022
Parses data out of your Google Takeout (History, Activity, Youtube, Locations, etc...)

google_takeout_parser parses both the Historical HTML and new JSON format for Google Takeouts caches individual takeout results behind cachew merge mu

Sean Breckenridge 27 Dec 28, 2022
Code for the DH project "Dhimmis & Muslims – Analysing Multireligious Spaces in the Medieval Muslim World"

Damast This repository contains code developed for the digital humanities project "Dhimmis & Muslims – Analysing Multireligious Spaces in the Medieval

University of Stuttgart Visualization Research Center 2 Jul 01, 2022
NFCDS Workshop Beginners Guide Bioinformatics Data Analysis

Genomics Workshop FIXME: overview of workshop Code of Conduct All participants s

Elizabeth Brooks 2 Jun 13, 2022