Open Crawl Vietnamese Text

Last update: Jan 05, 2022


This repo contains crawled Vietnamese text from multiple sources.

This is a list of high-quality, topic-centric public data sources. We collected and cleaned them from multiple sources. All of the datasets listed below are free.

Here are the ways we clean the data:

Removal of emojis
Removal of emoticons
Removal of URLs
Removal of HTML tags
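The four cleaning steps above can be sketched in Python with the standard `re` module. This is only an illustration: the repo does not publish its cleaning rules, so the function name `clean_text`, every regular expression, and the order of the steps are assumptions rather than the pipeline actually used. HTML tags are stripped before URLs here so that a link inside an attribute disappears together with its tag.

```python
import re

# All patterns below are assumed for illustration; they are not taken from the repo.
HTML_TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoji faces, transport, extended symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator letters used for flags
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)
# Western-style emoticons such as :) ;-( =D, matched only as standalone tokens.
EMOTICON_RE = re.compile(r"(?<!\S)[:;=][-o*']?[)(\]\[dDpP/\\|]+(?!\S)")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Remove HTML tags, URLs, emojis and emoticons, then collapse whitespace."""
    text = HTML_TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = EMOTICON_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    sample = "<p>Xin chào 😀 :) xem https://example.com/a?b=1 nhé</p>"
    print(clean_text(sample))  # prints: Xin chào xem nhé
```

The emoji ranges are deliberately narrow so that Vietnamese diacritics, which live in the Latin Extended blocks, pass through untouched. A production pipeline would likely need a fuller emoji list and a real HTML parser for malformed markup.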

1. Binhvq News Corpus:

The Binhvq News Corpus was crawled from online news sites and contains 50 GB of text.

link_raw, link_clean

2. OSCAR corpus Vietnamese crawl:

OSCAR, or Open Super-large Crawled Aggregated coRpus, is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. After duplicates are discarded, OSCAR contains about 32 GB of Vietnamese text.

link_raw, link_clean

3. Vietnamese story dataset:

Short and long stories, 10 GB of text in total, crawled from the internet by QAI.

link_clean

4. Vietnamese poem dataset:

More than 1 million sentences of poetry collected from the internet by QAI.

link_clean


Owner

QAI Research
