Libextract: extract data from websites

Overview

Libextract: extract data from websites

https://travis-ci.org/datalib/libextract.svg?branch=master
    ___ __              __                  __
   / (_) /_  ___  _  __/ /__________ ______/ /_
  / / / __ \/ _ \| |/_/ __/ ___/ __ `/ ___/ __/
 / / / /_/ /  __/>   

Libextract is a statistics-enabled data extraction library that works on HTML and XML documents and written in Python. Originating from eatiht, the extraction algorithm works by making one simple assumption: data appear as collections of repetitive elements. You can read about the reasoning here.

Overview

libextract.api.extract(document, encoding='utf-8', count=5)
Given an html document, and optionally the encoding, return a list of nodes likely containing data (5 by default).

Installation

pip install libextract

Usage

Due to our simple definition of "data", we open up a single interfaceable method. Post-processing is up to you.

from requests import get
from libextract.api import extract

r = get('http://en.wikipedia.org/wiki/Information_extraction')
textnodes = list(extract(r.content))

Using lxml's built-in methods for post-processing:

>> print(textnodes[0].text_content())
Information extraction (IE) is the task of automatically extracting structured information...

The extraction algo is agnostic to article text as it is with tabular data:

height_data = get("http://en.wikipedia.org/wiki/Human_height")
tabs = list(extract(height_data.content))
>> [elem.text_content() for elem in tabs[0].iter('th')]
['Country/Region',
 'Average male height',
 'Average female height',
 ...]

Dependencies

lxml
statscounter

Disclaimer

This project is still in its infancy; and advice and suggestions as to what this library could and should be would be greatly appreciated

:)

Screen scraping and web crawling framework

Pomp Pomp is a screen scraping and web crawling framework. Pomp is inspired by and similar to Scrapy, but has a simpler implementation that lacks the

Evgeniy Tatarkin 61 Jun 21, 2021
Semplice scraper realizzato in Python tramite la libreria BeautifulSoup

Semplice scraper realizzato in Python tramite la libreria BeautifulSoup

2 Nov 22, 2021
A web Scraper for CSrankings.com that scrapes University and Faculty list for a particular country

A look into what we're building Demo.mp4 Prerequisites Python 3 Node v16+ Steps to run Create a virtual environment. Activate the virtual environment.

2 Jun 06, 2022
An introduction to free, automated web scraping with GitHub’s powerful new Actions framework.

An introduction to free, automated web scraping with GitHub’s powerful new Actions framework Published at palewi.re/docs/first-github-scraper/ Contrib

Ben Welsh 15 Nov 24, 2022
Github scraper app is used to scrape data for a specific user profile created using streamlit and BeautifulSoup python packages

Github Scraper Github scraper app is used to scrape data for a specific user profile. Github scraper app gets a github profile name and check whether

Siva Prakash 6 Apr 05, 2022
12306抢票脚本

12306抢票脚本

罐子里的茶 457 Jan 05, 2023
A tool to easily scrape youtube data using the Google API

YouTube data scraper To easily scrape any data from the youtube homepage, a youtube channel/user, search results, playlists, and a single video itself

7 Dec 03, 2022
Bulk download tool for the MyMedia platform

MyMedia Bulk Content Downloader This is a bulk download tool for the MyMedia platform. USE ONLY WHERE ALLOWED BY THE COPYRIGHT OWNER. NOT AFFILIATED W

Ege Feyzioglu 3 Oct 14, 2022
Deep Web Miner Python | Spyder Crawler

Webcrawler written in Python. This crawler does dig in till the 3 level of inside addressed and mine the respective data accordingly

Karan Arora 17 Jan 24, 2022
Scrapes Every Email Address of Every Society in Every University

society-email-scrape Site Live at https://kcsoc.github.io/society-email-scrape/ How to automatically generate new data Go to unis.yml Add your uni Cre

Krishna Consciousness Society 18 Dec 14, 2022
Scrape Twitter for Tweets

Backers Thank you to all our backers! 🙏 [Become a backer] Sponsors Support this project by becoming a sponsor. Your logo will show up here with a lin

Ahmet Taspinar 2.2k Jan 05, 2023
Instagram profile scrapper with python

IG Profile Scrapper Instagram profile Scrapper Just type the username, and boo! :D Instalation clone this repo to your computer git clone https://gith

its Galih 6 Nov 07, 2022
ChromiumJniGenerator - Jni Generator module extracted from Chromium project

ChromiumJniGenerator - Jni Generator module extracted from Chromium project

allenxuan 4 Jun 12, 2022
A Smart, Automatic, Fast and Lightweight Web Scraper for Python

AutoScraper: A Smart, Automatic, Fast and Lightweight Web Scraper for Python This project is made for automatic web scraping to make scraping easy. It

Mika 4.8k Jan 04, 2023
Screenhook is a script that captures an image of a web page and send it to a discord webhook.

screenshot from the web for discord webhooks screenhook is a script that captures an image of a web page and send it to a discord webhook.

Toast Energy 3 Jun 04, 2022
A module for CME that spiders hashes across the domain with a given hash.

hash_spider A module for CME that spiders hashes across the domain with a given hash. Installation Simply copy hash_spider.py to your CME module folde

37 Sep 08, 2022
Binance Smart Chain Contract Scraper + Contract Evaluator

Pulls Binance Smart Chain feed of newly-verified contracts every 30 seconds, then checks their contract code for links to socials.Returns only those with socials information included, and then submit

14 Dec 09, 2022
Web-Scrapper using Python and Flask

Web-Scrapper "[초급]Python으로 웹 스크래퍼 만들기" 코스 -NomadCoders 기초적인 Python 문법강의부터 시작하여 웹사이트의 html파일에서 원하는 내용을 Scrapping해서 출력, csv 파일로 저장, flask를 이용한 간단한 웹페이지

윤성도 1 Nov 10, 2021
茅台抢购最新优化版本,茅台秒杀,优化了抢购协程队列

茅台抢购最新优化版本,茅台秒杀,优化了抢购协程队列

MaoTai 33 Sep 03, 2022
淘宝茅台抢购最新优化版本,淘宝茅台秒杀,优化了茅台抢购线程队列

淘宝茅台抢购最新优化版本,淘宝茅台秒杀,优化了茅台抢购线程队列

MaoTai 118 Dec 16, 2022