Libextract: extract data from websites

Libextract is a statistics-enabled data extraction library, written in Python, that works on HTML and XML documents. Originating from eatiht, the extraction algorithm makes one simple assumption: data appear as collections of repetitive elements. You can read about the reasoning here.
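
To make the "repetitive elements" assumption concrete, here is a minimal sketch of one way such a heuristic could look. It is not libextract's actual algorithm: `rank_by_repetition` is a hypothetical helper, the scoring rule (how often an element's most common child tag repeats) is an assumption, and it uses the standard library's `xml.etree.ElementTree` on a small well-formed snippet rather than lxml.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def rank_by_repetition(root, count=5):
    """Rank elements by how often their most common child tag repeats.

    Hypothetical helper for illustration only; not part of libextract.
    """
    scored = []
    for parent in root.iter():
        child_tags = Counter(child.tag for child in parent)
        if child_tags:
            # Score = frequency of the most repeated direct child tag.
            _, freq = child_tags.most_common(1)[0]
            scored.append((freq, parent))
    # Sort on the score only, highest repetition first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [parent for _, parent in scored[:count]]


# A tiny, well-formed stand-in for a real page.
doc = ET.fromstring(
    "<html><body>"
    "<div><h1>Title</h1><p>Intro</p></div>"
    "<table><tr><td>a</td></tr><tr><td>b</td></tr><tr><td>c</td></tr></table>"
    "</body></html>"
)
best = rank_by_repetition(doc)[0]
print(best.tag)  # table
```

The table wins here because its three `tr` children repeat, while every other element has at most one child of any given tag.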
  

  
   Overview 

  
 
  
   
  libextract.api.extract(document, encoding='utf-8', count=5)

  Given an HTML document and, optionally, its encoding, return a list of the nodes most likely to contain data (5 by default).
 
   

 
  

  
   Installation 
pip install libextract
  

  
   Usage 
Because our definition of "data" is so simple, the library exposes a single method, extract. Post-processing is up to you.

 
 
from requests import get
from libextract.api import extract

r = get('http://en.wikipedia.org/wiki/Information_extraction')
textnodes = list(extract(r.content))

  
Using lxml's built-in methods for post-processing: 

 
 
>>> print(textnodes[0].text_content())
Information extraction (IE) is the task of automatically extracting structured information...
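
Extracted text often carries stray line breaks and runs of spaces, so a small normalization step can help. The sketch below assumes the node supports `itertext()`, which both lxml elements and the standard library's `xml.etree.ElementTree` elements provide. `clean_text` is a hypothetical helper, and the sample node is a hand-written stand-in for an element returned by `extract`.

```python
import xml.etree.ElementTree as ET


def clean_text(node):
    """Join all text under `node` and collapse runs of whitespace.

    Hypothetical helper for illustration; not part of libextract or lxml.
    """
    return " ".join("".join(node.itertext()).split())


# Stand-in for one extracted node; with libextract this would be an lxml element.
node = ET.fromstring(
    "<div><p>Information   extraction\n (IE)</p><p> is a task.</p></div>"
)
print(clean_text(node))  # Information extraction (IE) is a task.
```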

  
The extraction algorithm is agnostic to the kind of content; it handles tabular data the same way it handles article text:

 
 
height_data = get("http://en.wikipedia.org/wiki/Human_height")
tabs = list(extract(height_data.content))

  

 
 
>>> [elem.text_content() for elem in tabs[0].iter('th')]
['Country/Region',
 'Average male height',
 'Average female height',
 ...]
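
Going one step further than listing the headers, an extracted table can be turned into records. This is a minimal sketch, assuming a flat table (no nested tables) whose header cells are `th` and data cells are `td`. `table_to_records` is a hypothetical helper, and the sample rows are made-up placeholder values, not data from the Wikipedia page. It runs on a standard-library `ElementTree` element here, and relies only on `iter()` and `itertext()`, which lxml elements also provide.

```python
import xml.etree.ElementTree as ET


def table_to_records(table):
    """Turn a flat <table> element into a list of dicts keyed by <th> text.

    Hypothetical helper for illustration; not part of libextract.
    """
    headers = ["".join(th.itertext()).strip() for th in table.iter("th")]
    records = []
    for tr in table.iter("tr"):
        cells = ["".join(td.itertext()).strip() for td in tr.iter("td")]
        if cells:  # the header row has no <td> cells, so it is skipped
            records.append(dict(zip(headers, cells)))
    return records


# Placeholder table with invented values, standing in for tabs[0].
table = ET.fromstring(
    "<table>"
    "<tr><th>Country/Region</th><th>Average male height</th></tr>"
    "<tr><td>Exampleland</td><td>175 cm</td></tr>"
    "<tr><td>Samplestan</td><td>170 cm</td></tr>"
    "</table>"
)
records = table_to_records(table)
for record in records:
    print(record)
```

Keying each row by header text keeps the result independent of column order, at the cost of silently dropping cells when a row has more cells than there are headers.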

 
  

  
   Dependencies 
lxml
statscounter
  

  
   Disclaimer 
This project is still in its infancy; advice and suggestions as to what this library could and should be would be greatly appreciated :)
