
demiurge

PyQuery-based scraping micro-framework. Supports Python 2.x and 3.x.

Documentation: http://demiurge.readthedocs.org

Installing demiurge

$ pip install demiurge

Quick start

Define items to be scraped using a declarative (Django-inspired) syntax:

import demiurge

class TorrentDetails(demiurge.Item):
    label = demiurge.TextField(selector='strong')
    value = demiurge.TextField()

    def clean_value(self, value):
        # Strip the "Label:" prefix and surrounding whitespace.
        unlabel = value[value.find(':') + 1:]
        return unlabel.strip()

    class Meta:
        selector = 'div#specifications p'

class Torrent(demiurge.Item):
    url = demiurge.AttributeValueField(
        selector='td:eq(2) a:eq(1)', attr='href')
    name = demiurge.TextField(selector='td:eq(2) a:eq(2)')
    size = demiurge.TextField(selector='td:eq(3)')
    # Follow each torrent's link and scrape TorrentDetails items there.
    details = demiurge.RelatedItem(
        TorrentDetails, selector='td:eq(2) a:eq(2)', attr='href')

    class Meta:
        selector = 'table.maintable:gt(0) tr:gt(0)'
        base_url = 'http://www.mininova.org'


>>> t = Torrent.one('/search/ubuntu/seeds')
>>> t.name
'Ubuntu 7.10 Desktop Live CD'
>>> t.size
u'695.81\xa0MB'
>>> t.url
'/get/1053846'
>>> t.html
u'<td>19\xa0Dec\xa007</td><td><a href="/cat/7">Software</a></td><td>...'

>>> results = Torrent.all('/search/ubuntu/seeds')
>>> len(results)
116
>>> for t in results[:3]:
...     print t.name, t.size
...
Ubuntu 7.10 Desktop Live CD 695.81 MB
Super Ubuntu 2008.09 - VMware image 871.95 MB
Portable Ubuntu 9.10 for Windows 559.78 MB
...

>>> t = Torrent.one('/search/ubuntu/seeds')
>>> for detail in t.details:
...     print detail.label, detail.value
... 
Category: Software > GNU/Linux
Total size: 695.81 megabyte
Added: 2467 days ago by Distribution
Share ratio: 17 seeds, 2 leechers
Last updated: 35 minutes ago
Downloads: 29,085

See documentation for details: http://demiurge.readthedocs.org

Why demiurge?

Plato, as the speaker Timaeus, refers to the Demiurge frequently in the Socratic dialogue Timaeus, c. 360 BC. The main character refers to the Demiurge as the entity who "fashioned and shaped" the material world. Timaeus describes the Demiurge as unreservedly benevolent, and hence desirous of a world as good as possible. The world remains imperfect, however, because the Demiurge created the world out of a chaotic, indeterminate non-being.

http://en.wikipedia.org/wiki/Demiurge

Contributors

  • Martín Gaitán (@mgaitan)

Comments
  • Reusable cleaning functions

    You can now add a "clean" kwarg containing a function to a field.

    This makes it easy to use quick filtering (I want this data to be an int) and to re-use functions such as parsedatetime.

        score = demiurge.TextField(selector=".score .upvoted", clean=int)
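
    For instance, a sketch of reuse under this proposal (assumptions: the `clean` kwarg from this issue, a hypothetical `Post` item with illustrative selectors, and python-dateutil installed):

        import demiurge
        from dateutil import parser as dateparser  # assumed installed

        def parse_date(value):
            # Reusable cleaner: turn scraped text into a datetime.
            return dateparser.parse(value)

        class Post(demiurge.Item):
            # `clean` is the kwarg proposed in this issue.
            score = demiurge.TextField(selector='.score .upvoted', clean=int)
            posted = demiurge.TextField(selector='.date', clean=parse_date)

            class Meta:
                selector = 'div.post'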
    
    opened by traverseda 5
  • proof of concept: subitem field

    short rationale: Sometimes I need to scrape a page to retrieve the actual links where the items are. I would like a way to nest Item classes, analogous (in some way) to a ForeignKey / ManyToManyField in Django.

    This is a first PR as a proof of concept, to discuss the idea and its API.
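
    A minimal sketch of the nesting idea, reusing the RelatedItem field from the quick-start example (class names and selectors here are purely illustrative):

        import demiurge

        class Author(demiurge.Item):
            name = demiurge.TextField(selector='h1')

            class Meta:
                selector = 'div.profile'

        class Article(demiurge.Item):
            title = demiurge.TextField(selector='a.title')
            # Follow each article's byline link and scrape an Author item there.
            author = demiurge.RelatedItem(
                Author, selector='a.byline', attr='href')

            class Meta:
                selector = 'div.article'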

    opened by mgaitan 5
  • RelatedItems only work across urls

    An obvious use of RelatedItems (or a similar construct) is recursively mapping a comment tree. Right now there's no elegant way to do that.

    An example

    http://pastebin.com/WDL4RjkE


    Reading through the actual code, I think I might be wrong about this. I'll try and make the docs clearer.

    opened by traverseda 2
  • Use lib "requests" for downloading

    I'm currently using https://pypi.python.org/pypi/requests-cache, which magically creates a cache of everything downloaded, and it's awesome. So, I would like to be able to take advantage of it when using demiurge.

    I don't know whether it should be just an option or a replacement for the pyquery downloader.

    What do you think?
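
    For reference, enabling requests-cache in a script is a one-liner (this is requests-cache's documented usage; whether demiurge's downloader would pick it up is exactly what this issue is asking):

        import requests_cache

        # Subsequent requests.get/post calls transparently read from and
        # write to an SQLite-backed cache.
        requests_cache.install_cache('demiurge_cache')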

    opened by jmansilla 2
  • docs: fix simple typo, ocurrence -> occurrence

    There is a small typo in docs/index.rst.

    Should read occurrence rather than ocurrence.

    Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

    opened by timgates42 1
  • Fix when no selector defined

    The default selector is the whole page ('html'), but this is applied through PyQuery.find, which traverses down. Example:

    In [2]: PyQuery('<html>hello</html>').find('html')
    Out[2]: []
    
    In [3]: PyQuery('<html>hello</html>')('html')
    Out[3]: [<html>]
    
    opened by mgaitan 1
  • support self reference in RelatedItem

    RelatedItem('self'). Also, the RelatedItem's item class could be given by its name (i.e. `RelatedItem("ItemClass")`). A typical use case is a listing page with a "next page" link.
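
    A sketch of the proposed API (neither RelatedItem('self') nor string references are a released feature; this is only what the issue suggests):

        import demiurge

        class Listing(demiurge.Item):
            title = demiurge.TextField(selector='h1')
            # Proposed: follow the "next page" link into another Listing.
            next_page = demiurge.RelatedItem(
                'self', selector='a.next', attr='href')

            class Meta:
                selector = 'body'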

    opened by mgaitan 0
Releases (v0.2)

Related projects
Crawl BookCorpus

These are scripts to reproduce BookCorpus by yourself.

Sosuke Kobayashi 590 Jan 03, 2023
A collection of crawler examples, including but not limited to Taobao, JD, Tmall, Douban, Douyin, Kuaishou, Weibo, WeChat, Alibaba, Toutiao, pdd, Youku, iQiyi, Ctrip, 12306, 58, Sohu, Baidu Index, Weipu, Wanfang, Zlibraty, Oalib, novels, bidding sites, procurement sites, and Xiaohongshu.

lxSpider: a collection of crawler examples, including but not limited to Taobao, JD, Tmall, Douban, Douyin, Kuaishou, Weibo, WeChat, Alibaba, Toutiao, pdd, Youku, iQiyi, Ctrip, 12306, 58, Sohu, Baidu Index, Weipu, Wanfang, Zlibraty, Oalib, novel sites, and bidding/procurement sites. Introduction: time flies; I've lost count of how many examples I've written.

lx 793 Jan 05, 2023
Bigdata - This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster

Scrapy Cluster This Scrapy project uses Redis and Kafka to create a distributed

Hanh Pham Van 0 Jan 06, 2022
SmartScraper: a simple, automatic, and fast Python web crawler

SmartScraper: a simple, automatic, and fast Python web crawler. Note: the original developer of SmartScraper is Alireza Mika; I only changed a little of AutoScraper's code. SmartScraper

DaDeng 9 Apr 16, 2022
Script used to download data for stocks.

This script is useful for downloading stock market data for a wide range of companies specified by their respective tickers. The script reads in the d

Carmelo Gonzales 71 Oct 04, 2022
Kusonime scraper using python3

Features: Scrap from url, Scrap from recommendation, Search by query. Todo: [+] Search by genre. Example: # Get download url from kusonime import Scrap

MhankBarBar 2 Jan 28, 2022
A Telegram crawler to search groups and channels automatically and collect any type of data from them.

Introduction This is a crawler I wrote in Python using the APIs of Telethon months ago. This tool was not intended to be publicly available for a numb

39 Dec 28, 2022
An m3u8 video stream download script

A Python m3u8 stream-video download script. Introduction: m3u8 video streams are increasingly common, and there are already plenty of good downloaders; I'm sharing a small script I wrote earlier for everyone to use. The goal is to give video-download enthusiasts a working example that can be called directly, rather than reinventing the wheel. Usage: run the program directly in Python or call it externally. import

Nchu 0 Oct 10, 2021
Scraping and visualising India's real-time COVID-19 data from the MOHFW dataset.

COVID19-WEB-SCRAPER Open Source Tech Lab - Project [SEMESTER IV] OSTL Assignments OSTL Assignments - 1 OSTL Assignments - 2 Project COVID19 India Data

AMEY THAKUR 8 Apr 28, 2022
jd_maotai rpa: a Selenium-driven JD flash-purchase RPA bot

jd_maotai rpa: a Selenium-driven JD flash-purchase RPA bot that looks after forgetful people like us so the purchase window is never missed; may everyone get to drink Moutai over the New Year. Special note: the jd_maotai_rpa project published in this repository is defined as an RPA automation project, intended to keep you from forgetting to join JD's Moutai event (since I often forget), not for flash sales and grabbing

35 Nov 18, 2022
A distributed crawler for weibo, built with celery and requests.

A distributed crawler for weibo, built with celery and requests.

SpiderClub 4.8k Jan 03, 2023
Unja is a fast & light tool for fetching known URLs from Wayback Machine

Unja Fetch Known Urls What's Unja? Unja is a fast & light tool for fetching known URLs from Wayback Machine, Common Crawl, Virus Total & AlienVault's

Sheryar 10 Aug 07, 2022
Web Crawlers for Data Labelling of Malicious Domain Detection & IP Reputation Evaluation

Web Crawlers for Data Labelling of Malicious Domain Detection & IP Reputation Evaluation This repository provides two web crawlers to label domain nam

1 Nov 05, 2021
Web and PDF Scraper Refactoring

Web and PDF Scraper Refactoring This repository contains the example code of the Web and PDF scraper code roast. Here are the links to the videos: Par

18 Dec 31, 2022
SearchifyX, predecessor to Searchify, is a fast Quizlet, Quizizz, and Brainly webscraper with various stealth features.

SearchifyX SearchifyX, predecessor to Searchify, is a fast Quizlet, Quizizz, and Brainly webscraper with various stealth features. SearchifyX lets you

28 Dec 20, 2022
A powerful annex BUBT, BUBT Soft, and BUBT website scraping script.

Annex Bubt Scraping Script I think this is the first public repository that provides free annex-BUBT, BUBT-Soft, and BUBT website scraping API script

Md Imam Hossain 4 Dec 03, 2022
A Python Covid-19 cases tracker that scrapes data off the web and presents the number of Cases, Recovered Cases, and Deaths that occurred because of the pandemic.

A Python Covid-19 cases tracker that scrapes data off the web and presents the number of Cases, Recovered Cases, and Deaths that occurred because of the pandemic.

Alex Papadopoulos 1 Nov 13, 2021
A Web Scraping API for MDL (My Drama List) for Python.

PyMDL: an API for MyDramaList (MDL) based on web scraping for Python. Description: an API for MDL to make your life easier in retrieving and working on dat

6 Dec 10, 2022
Get-web-images - Python code that gets images from any site

image retrieval: This is Python code to retrieve an image from the internet, a

CODE 1 Dec 30, 2021
CPF and CNPJ lookup at the Receita Federal via web scraping

Repository containing Python scripts that look up CPF and CNPJ directly on the Receita Federal website.

Josué Campos 5 Nov 29, 2021