A Powerful Spider(Web Crawler) System in Python.

Last update: Jan 04, 2023

Related tags

Overview

pyspider

A Powerful Spider(Web Crawler) System in Python.

Write script in Python
Powerful WebUI with script editor, task monitor, project manager and result viewer
MySQL, MongoDB, Redis, SQLite, Elasticsearch; PostgreSQL with SQLAlchemy as database backend
RabbitMQ, Redis and Kombu as message queue
Task priority, retry, periodical, recrawl by age, etc...
Distributed architecture, Crawl Javascript pages, Python 2.{6,7}, 3.{3,4,5,6} support, etc...

Tutorial: http://docs.pyspider.org/en/latest/tutorial/
Documentation: http://docs.pyspider.org/
Release notes: https://github.com/binux/pyspider/releases

Sample Code

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }

Installation

pip install pyspider
run command pyspider, visit http://localhost:5000/

WARNING: WebUI is open to the public by default, it can be used to execute any command which may harm your system. Please use it in an internal network or enable need-auth for webui.

Quickstart: http://docs.pyspider.org/en/latest/Quickstart/

Contribute

Use It
Open Issue, send PR
User Group
中文问答

TODO

v0.4.0

a visual scraping interface like portia

License

Licensed under the Apache License, Version 2.0

Comments

no web interface

Hi, I try to use pyspider, but I can't see the web interface when I click the web button after crawling a page, so I can't use CSS selector on pyspider. when I click the html button, it can show the page source. but on the demo.pyspider.org. everything is ok do you know what's wrong?

opened by wdfsinap 24
fail when running command pyspider

安装完成后，在command line输入pyspider，结果有：ImportError: dlopen(/usr/local/lib/python2.7/site-packages/pycurl.so, 2): Library not loaded: libssl.1.0.0.dylib Referenced from: /usr/local/lib/python2.7/site-packages/pycurl.so Reason: image not found

大神这是怎么回事啊？

另，我用的是mac os x，谢谢！

opened by ghost 24
Batch job start

When adding a batch job, why do you want to wait until after the completion of the task to grab? For example, I added a URL 5W, I read the log is to wait until the completion of the 5W to start?

opened by kaito-kidd 23

pyspider command disappear

the command will show up only after I install pyspider of python2, can't work in python3

$ python3 /usr/local/bin/pyspider
[I 150114 16:17:57 result_worker:44] result_worker starting...
Exception in thread Thread-1:
Traceback (most recent call last):
  File "<frozen importlib._bootstrap>", line 2195, in _find_and_load_unlocked
AttributeError: '_MovedItems' object has no attribute '__path__'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/pyspider/scheduler/scheduler.py", line 418, in xmlrpc_run
    from six.moves.xmlrpc_server import SimpleXMLRPCServer
ImportError: No module named 'six.moves.xmlrpc_server'; 'six.moves' is not a package

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.4/threading.py", line 920, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.4/threading.py", line 868, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.4/dist-packages/pyspider/scheduler/scheduler.py", line 420, in xmlrpc_run
    from SimpleXMLRPCServer import SimpleXMLRPCServer
ImportError: No module named 'SimpleXMLRPCServer'

[I 150114 16:17:57 scheduler:388] loading projects
[I 150114 16:17:57 processor:157] processor starting...
[I 150114 16:17:57 tornado_fetcher:387] fetcher starting...
Traceback (most recent call last):
  File "/usr/local/bin/pyspider", line 9, in <module>
    load_entry_point('pyspider==0.3.0', 'console_scripts', 'pyspider')()
  File "/usr/local/lib/python3.4/dist-packages/pyspider/run.py", line 532, in main
    cli()
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 610, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 590, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 916, in invoke
    return Command.invoke(self, ctx)
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 782, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 416, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/pyspider/run.py", line 144, in cli
    ctx.invoke(all)
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 416, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/pyspider/run.py", line 409, in all
    ctx.invoke(webui, **webui_config)
  File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 416, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.4/dist-packages/pyspider/run.py", line 277, in webui
    app = load_cls(None, None, webui_instance)
  File "/usr/local/lib/python3.4/dist-packages/pyspider/run.py", line 47, in load_cls
    return utils.load_object(value)
  File "/usr/local/lib/python3.4/dist-packages/pyspider/libs/utils.py", line 348, in load_object
    module = __import__(module_name, globals(), locals(), [object_name])
  File "/usr/local/lib/python3.4/dist-packages/pyspider/webui/__init__.py", line 8, in <module>
    from . import app, index, debug, task, result, login
  File "/usr/local/lib/python3.4/dist-packages/pyspider/webui/app.py", line 79, in <module>
    template_folder=os.path.join(os.path.dirname(__file__), 'templates'))
  File "/usr/local/lib/python3.4/dist-packages/flask/app.py", line 319, in __init__
    template_folder=template_folder)
  File "/usr/local/lib/python3.4/dist-packages/flask/helpers.py", line 741, in __init__
    self.root_path = get_root_path(self.import_name)
  File "/usr/local/lib/python3.4/dist-packages/flask/helpers.py", line 631, in get_root_path
    loader = pkgutil.get_loader(import_name)
  File "/usr/lib/python3.4/pkgutil.py", line 467, in get_loader
    return find_loader(fullname)
  File "/usr/lib/python3.4/pkgutil.py", line 488, in find_loader
    return spec.loader
AttributeError: 'NoneType' object has no attribute 'loader'

opened by zhanglongqi 23

valid json config file
Could you please add sample config.json file for list of valid parameters? e.g

{ "webui": { "host": "127.0.0.1", "port": "5501" }, ... }
opened by mavencode01 20
use Fig to run docker container instead of use docker command line

首先，我改了改wiki，因为不加 :latest 会自动下载所有tags 然后，觉得我们这个项目可以用fig操作docker跑起来，效果更好

首先，新建一个目录pyspider 下载这个命名为fig.yml http://p.esd.cc/paste/wp5ELQ2M (手机打的代码没有测试过抱歉) 然后fig up即可！

安装fig pip install -U fig
enhancement

opened by imlonghao 20
how to use local file as project's script, how to use customized mysql result database.

代码中使用了本地的一个python文件，将抓取的结果写入mysql中，把此文件放在\database\mysql文件夹下运行代码时提示导入出错，找不到文件，小问题不知道怎么解决。看过issues中有说导入project当做模块的feature，但是不知道怎么用，请指导。附带的一个问题是，如果要替换数据库，需要重写on_result(self,result)函数，有没有示例参考下。

opened by ronaldhan 19
How to define a global variable
class Handler(BaseHandler): configuration = {'a' : 'b', 'c' : 'd'}

@every(minutes=12 * 60) def on_start(self): self.configuration = {'a' : 'a', 'c' : 'c'} @config(age= 12 * 60 * 60) def index_page(self, response): print(self.configuration)

I changed configuration in on_start. But in index_page, it still print {'a' : 'b', 'c' : 'd'}. How can I define a global variable?
opened by liu840185317 18
logging gb decode error

Hi，我安装后，启动run.py，可以正常进入web页面，但是一旦报错task任务。就会出错：

Traceback (most recent call last): File "/home/zfz/spider/stack/python-2.7.8/lib/python2.7/logging/init.py", line 859, in emit msg = self.format(record) File "/home/zfz/spider/stack/python-2.7.8/lib/python2.7/logging/init.py", line 732, in format return fmt.format(record) File "/home/zfz/spider/src/pyspider/pyspider/libs/log.py", line 121, in format formatted = formatted.rstrip() + "\n" + _unicode(record.exc_text) File "/home/zfz/spider/src/pyspider/pyspider/libs/log.py", line 27, in _unicode raise e UnicodeDecodeError: 'gb18030' codec can't decode bytes in position 670-671: illegal multibyte sequence Logged from file scheduler.py, line 354 Traceback (most recent call last): File "/home/zfz/spider/stack/python-2.7.8/lib/python2.7/logging/init.py", line 859, in emit msg = self.format(record) File "/home/zfz/spider/stack/python-2.7.8/lib/python2.7/logging/init.py", line 732, in format return fmt.format(record) File "/home/zfz/spider/src/pyspider/pyspider/libs/log.py", line 121, in format formatted = formatted.rstrip() + "\n" + _unicode(record.exc_text) File "/home/zfz/spider/src/pyspider/pyspider/libs/log.py", line 27, in _unicode raise e UnicodeDecodeError: 'gb18030' codec can't decode bytes in position 670-671: illegal multibyte sequence Logged from file scheduler.py, line 354 Traceback (most recent call last): File "/home/zfz/spider/stack/python-2.7.8/lib/python2.7/logging/init.py", line 859, in emit msg = self.format(record) File "/home/zfz/spider/stack/python-2.7.8/lib/python2.7/logging/init.py", line 732, in format return fmt.format(record) File "/home/zfz/spider/src/pyspider/pyspider/libs/log.py", line 121, in format formatted = formatted.rstrip() + "\n" + _unicode(record.exc_text) File "/home/zfz/spider/src/pyspider/pyspider/libs/log.py", line 27, in _unicode raise e UnicodeDecodeError: 'gb18030' codec can't decode bytes in position 670-671: illegal multibyte sequence Logged from file scheduler.py, line 354 Traceback (most recent call last): File "/home/zfz/spider/stack/python-2.7.8/lib/python2.7/logging/init.py", line 859, in emit msg = self.format(record) File "/home/zfz/spider/stack/python-2.7.8/lib/python2.7/logging/init.py", line 732, in format return fmt.format(record) File "/home/zfz/spider/src/pyspider/pyspider/libs/log.py", line 121, in format formatted = formatted.rstrip() + "\n" + _unicode(record.exc_text) File "/home/zfz/spider/src/pyspider/pyspider/libs/log.py", line 27, in _unicode raise e UnicodeDecodeError: 'gb18030' codec can't decode bytes in position 670-671: illegal multibyte sequence Logged from file scheduler.py, line 354

此后所有web请求全是500错误。操作系统版本centos 6.5，python环境2.7.8。

另外发现默认的是sqlite数据库，我想换用mysql，没找到在哪里配置？谢谢。
bug

opened by zfz 16
how to store results to database like mongodb

hi, pyspider is very nice to manage many crawler projects. but i have problem of how to store results into database. I see your tutorial on doc.pyspider.org. here is some tutorial on how to do with results

from pyspider.result import ResultWorker

Class MyResultWorker(ResultWorker): def on_result(self, task, result): assert task['taskid'] assert task['project'] assert task['url'] assert result #save result to database can i use this way in the script to save crawler results into database automatically?

opened by wdfsinap 15
Running pyspider to crawl https://phytozome.jgi.doe.gov/pz/portal.html, why doesn't it work?
code as follow: from pyspider.libs.base_handler import *

class Handler(BaseHandler): crawl_config = {'headers': { 'Content-Type':'application/x-www-form-urlencoded', 'Accept':'/', 'Accept-Encoding':'gzip, deflate', 'Accept-Language':'zh-CN,zh;q=0.8', 'Cache-Control':'max-age=0', 'Connection':'keep-alive', 'Content-Length':'295', 'X-Requested-With': 'XMLHttpRequest', 'Cookie':'__utmt=1; __utma=89664858.1557390068.1472454301.1472628080.1472628080.6; __utmb=89664858.3.10.1472628080; __utmc=89664858; __utmz=89664858.1472628080.5.5.utmcsr=sogou|utmccn=(organic)|utmcmd=organic|utmctr=phytozome', 'Host':'phytozome.jgi.doe.gov', 'Origin':'https://phytozome.jgi.doe.gov', 'Referer':'https://phytozome.jgi.doe.gov/pz/portal.html', 'User-Agent':'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36', 'X-GWT-Module-Base':'https://phytozome.jgi.doe.gov/pz/phytoweb/', 'X-GWT-Permutation':'80DA602CF8FBCB99E9D79278AD2DA616', } }

@every(minutes=24 * 60) def on_start(self): self.crawl('https://phytozome.jgi.doe.gov/pz/portal.html#!results?search=0&crown=1&star=1&method=2296&searchText=AUX/IAA&offset=0',callback=self.detail_page,fetch_type='js') def index_page(self, response): for each in response.doc('*').items(): self.crawl(each.attr.href, callback=self.detail_page,fetch_type='js') @config(priority=2) def detail_page(self, response): self.index_page(response) for each in response.doc('*').items(): self.crawl(each.attr.href, callback=self.detail_page,fetch_type='js') return { "url": response.url, "content":response.doc("*").text() }

只能抓取到css
opened by GenomeW 14
fix(sec): upgrade lxml to 4.9.1
What happened？

There are 1 security vulnerabilities found in lxml 4.3.3

MPS-2022-14974

What did I do？

Upgrade lxml from 4.3.3 to 4.9.1 for vulnerability fix

What did you expect to happen？

Ideally, no insecure libs should be used.

The specification of the pull request

PR Specification from OSCS Signed-off-by:pen4[email protected]
opened by pen4 0
fix(sec): upgrade tornado to 5.1
What happened？

There are 1 security vulnerabilities found in tornado 4.5.3

CVE-2018-1000518

What did I do？

Upgrade tornado from 4.5.3 to 5.1 for vulnerability fix

What did you expect to happen？

Ideally, no insecure libs should be used.

The specification of the pull request

PR Specification from OSCS
opened by chncaption 0

Add support to release Linux aarch64 wheels

Problem

On aarch64, ‘pip install pyspider’ is giving the below error -

ERROR: Command errored out with exit status 1:
     command: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-w68_cof2/pycurl/setup.py'"'"'; __file__='"'"'/tmp/pip-install-w68_cof2/pycurl/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-w68_cof2/pycurl/pip-egg-info
         cwd: /tmp/pip-install-w68_cof2/pycurl/
    Complete output (22 lines):
    Traceback (most recent call last):
      File "/tmp/pip-install-w68_cof2/pycurl/setup.py", line 235, in configure_unix
        p = subprocess.Popen((self.curl_config(), '--version'),
      File "/usr/lib/python3.8/subprocess.py", line 858, in __init__
        self._execute_child(args, executable, preexec_fn, close_fds,
      File "/usr/lib/python3.8/subprocess.py", line 1704, in _execute_child
        raise child_exception_type(errno_num, err_msg, err_filename)
    FileNotFoundError: [Errno 2] No such file or directory: 'curl-config'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-w68_cof2/pycurl/setup.py", line 1017, in <module>
        ext = get_extension(sys.argv, split_extension_source=split_extension_source)
      File "/tmp/pip-install-w68_cof2/pycurl/setup.py", line 673, in get_extension
        ext_config = ExtensionConfiguration(argv)
      File "/tmp/pip-install-w68_cof2/pycurl/setup.py", line 99, in __init__
        self.configure()
      File "/tmp/pip-install-w68_cof2/pycurl/setup.py", line 240, in configure_unix
        raise ConfigurationError(msg)
    __main__.ConfigurationError: Could not run curl-config: [Errno 2] No such file or directory: 'curl-config'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Resolution

On aarch64, ‘pip install pyspider’ should download the wheels from PyPI.

@binux and Team, Please let me know your interest in releasing aarch64 wheels. I can help in this.

opened by odidev 0

got error when starting the webui

[W 220404 15:02:18 run:413] phantomjs not found, continue running without it. [I 220404 15:02:20 result_worker:49] result_worker starting... [I 220404 15:02:20 processor:211] processor starting... [I 220404 15:02:20 tornado_fetcher:638] fetcher starting... [I 220404 15:02:20 scheduler:647] scheduler starting... [I 220404 15:02:20 scheduler:782] scheduler.xmlrpc listening on 127.0.0.1:23333 [I 220404 15:02:20 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0 [I 220404 15:02:20 app:84] webui exiting... Traceback (most recent call last): File "/usr/local/Caskroom/miniconda/base/envs/web/bin/pyspider", line 8, in sys.exit(main()) File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/pyspider/run.py", line 754, in main cli() File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/click/core.py", line 1128, in call return self.main(*args, **kwargs) File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/click/core.py", line 1053, in main rv = self.invoke(ctx) File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/click/core.py", line 1637, in invoke super().invoke(ctx) File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/click/core.py", line 1395, in invoke return ctx.invoke(self.callback, **ctx.params) File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/click/core.py", line 754, in invoke return __callback(*args, **kwargs) File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/click/decorators.py", line 26, in new_func return f(get_current_context(), *args, **kwargs) File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/pyspider/run.py", line 165, in cli ctx.invoke(all) File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/click/core.py", line 754, in invoke return __callback(*args, **kwargs) File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/click/decorators.py", line 26, in new_func return f(get_current_context(), *args, **kwargs) File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/pyspider/run.py", line 497, in all ctx.invoke(webui, **webui_config) File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/click/core.py", line 754, in invoke return __callback(*args, **kwargs) File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/click/decorators.py", line 26, in new_func return f(get_current_context(), *args, **kwargs) File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/pyspider/run.py", line 384, in webui app.run(host=host, port=port) File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/pyspider/webui/app.py", line 59, in run from .webdav import dav_app File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/pyspider/webui/webdav.py", line 207, in '/': ScriptProvider(app) TypeError: Can't instantiate abstract class ScriptProvider with abstract methods get_resource_inst

opened by kroraina-threesteps 6
pyspider version: latest version, built the master branch using docker.

Operating system: macOS Catalina

Start up command: docker-compose -f docker-compose.yaml build && docker-compose up

Expected behavior

Either the response should interpret the base tags or, on setting response.url and response.orig_url, proper hrefs are constructed for relative hrefs.

Actual behavior

Proper href URLs are not formed.

How to reproduce

Try to crawl a website with <base href="...."> tag.
opened by agarwal-nitesh 0

Releases(v0.3.10)

v0.3.10(Apr 18, 2018)

New features:

add phantomjs proxy support #692 @volvofixthis

support redis 3.x in cluster mode for message queue @hackty

Fix several bugs:

Improve the performance of counter.to_dict

Fixed issue of counter changed during read

Fix tornado version dependency in setup.py

Source code(tar.gz)
Source code(zip)
v0.3.9(Mar 18, 2017)
New features:

Support for Python 3.6.

Auto Pause: the project will be paused for scheduler.PAUSE_TIME (default: 5min) when last scheduler.FAIL_PAUSE_NUM (default: 10) task failed, and dispatch scheduler.UNPAUSE_CHECK_NUM (default: 3) tasks after scheduler.PAUSE_TIME. Project will resume if any one of last scheduler.UNPAUSE_CHECK_NUM tasks success.

Each callback now have a default 30s process time limit. (Platform support required) @beader

New Javascript render engine - Splash support: Enabled by fetch argument --splash-endpoint=http://splash:8050/execute

Python3 webdav support.

Python3 from projects import project support.

A link to corresponding task is added to webui debug page when debugging a exists task in webui.

New user_agent parameter in self.crawl, you can set user-agent by headers though.

Fix several bugs:

New webui dashboard frontend framework - vue.js, improved the performance when having large number of tasks (e.g. http://demo.pyspider.org/)

Fix crawl_config doesn't work in webui while debugging a script issue.

Fix CSS Selector Helper doesn't work issue. @ackalker

Fix connection_timeout not working issue.

FIx need_auth option not applied on webdav issue.

Fix "fix can't dump counter to file: scheduler.all" error.

Some other fixes

Source code(tar.gz)
Source code(zip)
v0.3.8(Aug 18, 2016)
New features:

Now you can use cancel to stop an active task of a task with auto_recrawl enabled.

Handler.crawl_config will now be applied to the task when fetching. (It's applied when the task created before, that means proxy/headers can be changed afterward). See http://docs.pyspider.org/en/latest/apis/self.crawl/#handlercrawl_config

Fix several bugs:

* Fixed a global config object thread interference issue, which may cause connect to scheduler rpc error: error(10061, '') error when all --run-in=thread (default in windows platform)

Fix response.save lost when fetch failed issue

Fix potential scheduler failure caused by old version of six

Fix result dump return nothing when using mongodb backend

Source code(tar.gz)
Source code(zip)
v0.3.7(Apr 20, 2016)
ThreadBaseScheduler added to improve the performance of scheduler

robots.txt supported!

elasticsearch database backend supported!

new script callback on_finished, http://docs.pyspider.org/en/latest/About-Projects/#on_finished-callback

you can now set the delay time between retries:

retry_delay is a dict to specify retry intervals. The items in the dict are {retried: seconds}, and a special key: '' (empty string) is used to specify the default retry delay if not specified.

dict parameters in crawl_config, @config will be merged (e.g. headers), thanks to @ihipop

add parameter max_redirects in self.crawl to control maximum redirect numbers when doing the fetch, thanks to @AtaLuZiK

add parameter validate_cert in self.crawl to ignore the error of server’s certificate.

new property etree for Response, etree is a cached lxml.html.HtmlElement object, thanks to @waveyeung

you can now pass arguments to phantomjs from command line or config file.

support for pymongo 3.0

local.projectdb now accept a glob path (e.g. script/*.py) to load multiple projects from local filesystem.

queue size in the dashboard is not working for osx, thanks to @xyb

counters in dashboard will shown for stopped projects

other bug fix

Source code(tar.gz)
Source code(zip)
v0.3.6(Nov 10, 2015)
NEW: webdav mode, now you can use webdav to mount project folder to your local filesystem and edit scripts with your favority editor! (not support python 3, wsgidav required, which is not contained in setup.py)

bug fixes for Python 3 compatibility, Postgresql, flask-Login>=0.3.0, typo and more, thanks for the help of @lushl9301 @hitjackma @exoticknight @d0ugal @qiang.luo @twinmegami @jttoday @machinewu @littlezz @yaokaige

fix Queue.qsize NotImplementedError on Mac OS X, thanks @xyb

Source code(tar.gz)
Source code(zip)
v0.3.5(May 22, 2015)
New parameter: auto_recrawl - auto restart task every age.

New parameter: js_viewport_width/js_viewport_height to set viewport size for phantomjs engine.

New command line option to set different message queue backends with URI scheme.

New task level storage mechanism: self.save

New redis taskdb

New redis message queue.

New high level message queue interface kombu.

Fix bugs related to mongodb (keyword missing if not set).

Fix phantomjs not work in all mode.

Fix a potential deadlock in processor send_message.

Default log level of scheduler is changed to INFO

Source code(tar.gz)
Source code(zip)
v0.3.4(Apr 21, 2015)
Global

New message queue support: beanstalkd by @tiancheng91

New global argument: --logging-config to specify a customization logging config (to disable werkzeug logs for instance). You can get a sample config from pyspider/logging.conf).

Project group info is added to task package now.

Change docker base image to cmfatih/phantomjs, you can use phantomjs with same docker image now.

Auto restart phantomjs if crash, only enabled in all mode by default.

WebUI

Show next exetime of a task in task page.

Show fetch time and process time in tasks page.

Show average fetch time and process time in 5min in dashboard page.

Show message queue status in dashboard page.

limit and offset parameter support in result dump.

Fix frontend bug when crawling pages with dataurl.

Other

Fix support for phantomjs 2.0.

Fix scheduler project update inform not work, and use md5sum of script as another signal.

Scheduler: periodic counter report in log.

Fetcher: fix for legacy version of pycurl

Source code(tar.gz)
Source code(zip)
v0.3.3(Mar 8, 2015)
API

self.crawl will raise TypeError when get unexcepted arguments

self.crawl not accapt cURL command as first argument, see http://docs.pyspider.org/en/latest/apis/self.crawl/#curl-command.

WEBUI

A new css selector tool bar is added, the pre-generated css selected pattern can be modified and added/copy to script.

Benchmarking

The database table for bench test will be cleared before and after bench test.

insert/update/get bench test for database and put/get test for message queue is added.

Other

The default message queue is switched to ampq.

docs fix.

Source code(tar.gz)
Source code(zip)
v0.3.2(Feb 11, 2015)
Scheduler

The size of task queue is more accurate now, you can use it to determine all done status of scheduler.

Fetcher

Fix tornado loss cookies while doing 30x redirects

You can use cookies with cookie header at same time now

Fix proxy not working bug.

Enable proxy by default.

Proxy now support username and password authorization. @soloradish

Etag and Last-Modified header will be disabled while last crawl is failed.

Databases

MySQL default engine changed to InnoDB @laapsaap

MySQL, larger result column size, changed to MEDIUMBLOB(up to 16M) @laapsaap

WebUI

WebUI will use same arguments as the fetcher, fix proxy not word for webui bug.

Results will be sorted in the order of updatetime.

One Mode

Script exception logs would be printed to screen

New Command send_message

You can use the command pyspider send_message [project] [message] to send a message to project via command-line.

Other

Using localhosted test web pages

Remove version specify of lxml, you can use apt-get to install any version of lxml

Source code(tar.gz)
Source code(zip)
v0.3.1(Jan 22, 2015)
One Mode

One mode not only means all-in-one, it runs every thing in one process over tornado.ioloop. One mode is designed for debug purpose. You can test scripts written in local files and using --interactive to choose a task to be tested.

With one mode you can use pyspider.libs.utils.python_console() to open an interactive shell in your script context to test your code.

full documentation: http://docs.pyspider.org/en/latest/Command-Line/#one

bug fix

Source code(tar.gz)
Source code(zip)
v0.3.0(Jan 11, 2015)
A lot of bug fixed.

Make pyspider as a single top-level package. (thanks to zbb, iamtew and fmueller from HN)

Python 3 support!

Use click to create a better command line interface.

Postgresql Supported via SQLAlchemy (with the power of SQLAlchemy, pyspider also support Oracle, SQL Server, etc).

Benchmark test.

Documentation & tutorial: http://docs.pyspider.org/

Flake8 cleanup (thanks to @jtwaleson)

Base

Use messagepack instead of pickle in message queue.

JSON data will encoding as base64 string when content is binary.

Rabbitmq lazy limit for better performance.

Scheduler

Never re-crawl a task with a negative age.

Fetcher

proxy parameter support ip:port format.

increase default fetcher poolsize to 100.

PhantomJS will return JS script result in Response.js_script_result.

Processor

Put multiple new tasks in one package. performance for rabbitmq.

Not store all of the headers when success.

Script

Add an interface to generate taskid with task object. get_taskid

Task would be de-duplicated by project and taskid.

Webui

Project list sortable.

Return 404 page when dump a not exists project.

Web preview support image

Source code(tar.gz)
Source code(zip)
v0.2.0(Nov 12, 2014)
Base

mysql, mongodb backend support, and you can use a database uri to setup them.

rabbitmq as Queue for distributed deployment

docker supported

support for Windows

support for python2.6

a resultdb, result_worker and WEBUI is added.

Scheduler

cronjob task supported

delete project supported

Fetcher

a phantomjs fetcher is added. now you can fetch pages with javascript/ajax technology!

Processor

send_message api to send message to other projects

now you can import other project as module via from projects import xxxx

@config helper for setting configs for a callback

WEBUI

a css selector helper is added to debugger.

a option to switch JS/CSS CDN.

a page of task history/config

a page of recent active tasks

pages of results

a demo mode is added for http://demo.pyspider.org/

Others

bug fixes

more tests, coverage is used.

Source code(tar.gz)
Source code(zip)
v0.1.0(Mar 9, 2014)
finish a basic runnable system with:

sqlite3 task & project database

runnable scheduler & fetcher & processor

basic dashboard and debugger

Source code(tar.gz)
Source code(zip)

Owner

Roy Binux

[NEW] Add a bio

GitHub Repository http://docs.pyspider.org/

LSpider 一个为被动扫描器定制的前端爬虫

LSpider LSpider - 一个为被动扫描器定制的前端爬虫什么是LSpider? 一款为被动扫描器而生的前端爬虫~ 由Chrome Headless、LSpider主控、Mysql数据库、RabbitMQ、被动扫描器5部分组合而成。

321 Dec 12, 2022

Web-Scrapper using Python and Flask

Web-Scrapper "[초급]Python으로 웹 스크래퍼 만들기" 코스 -NomadCoders 기초적인 Python 문법강의부터 시작하여 웹사이트의 html파일에서 원하는 내용을 Scrapping해서 출력, csv 파일로 저장, flask를 이용한 간단한 웹페이지

1 Nov 10, 2021

This tool crawls a list of websites and download all PDF and office documents

This tool crawls a list of websites and download all PDF and office documents. Then it analyses the PDF documents and tries to detect accessibility issues.

7 Sep 30, 2022

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.

1.1k Jan 06, 2023

Ebay Webscraper for Getting Average Product Price

Ebay-Webscraper-for-Getting-Average-Product-Price The code in this repo is used to determine the average price of an item on Ebay given a valid search

17 Jan 05, 2023

A web scraper that exports your entire WhatsApp chat history.

WhatSoup 🍲 A web scraper that exports your entire WhatsApp chat history. Table of Contents Overview Demo Prerequisites Instructions Frequen

87 Jan 06, 2023

A spider for Universal Online Judge(UOJ) system, converting problem pages to PDFs.

Universal Online Judge Spider Introduction This is a spider for Universal Online Judge (UOJ) system (https://uoj.ac/). It also works for all other Onl

1 Dec 07, 2021

ChromiumJniGenerator - Jni Generator module extracted from Chromium project

4 Jun 12, 2022

Simple library for exploring/scraping the web or testing a website you’re developing

Robox is a simple library with a clean interface for exploring/scraping the web or testing a website you’re developing. Robox can fetch a page, click on links and buttons, and fill out and submit for

79 Nov 27, 2022

腾讯课堂，模拟登陆，获取课程信息，视频下载，视频解密。

腾讯课堂脚本要学一些东西，但腾讯课堂不支持自定义变速，播放时有水印，且有些老师的课一遍不够看，于是这个脚本诞生了。时间比较紧张，只会不定时修复重大bug。多线程下载之类的功能更新短期内不会有，如果你想一起完善这个脚本，欢迎pr 2020.5.22测试可用使用方法很简单，三部完成下载代码,

163 Dec 30, 2022

Scrapes proxies and saves them to a text file

Proxy Scraper Scrapes proxies from https://proxyscrape.com and saves them to a file. Also has a customizable theme system Made by nell and Lamp

2 Dec 22, 2021

Transistor, a Python web scraping framework for intelligent use cases.

Web data collection and storage for intelligent use cases. transistor About The web is full of data. Transistor is a web scraping framework for collec

212 Nov 05, 2022

Example of scraping a paginated API endpoint and dumping the data into a DB

Provider API Scraper Example Example of scraping a paginated API endpoint and dumping the data into a DB. Pre-requisits Python = 3.9 Pipenv Setup # i

1 Oct 20, 2021

for those who dont want to pay $10/month for high school game footage with ads

nfhs-scraper Disclaimer: I am in no way responsible for what you choose to do with this script and guide. I do not endorse avoiding paywalls or any il

5 Apr 12, 2022

Meme-videos - Scrapes memes and turn them into a video compilations

Meme Videos Scrapes memes from reddit using praw and request and then converts t

12 Oct 28, 2022

Web Crawlers for Data Labelling of Malicious Domain Detection & IP Reputation Evaluation

Web Crawlers for Data Labelling of Malicious Domain Detection & IP Reputation Evaluation This repository provides two web crawlers to label domain nam

1 Nov 05, 2021

Displays market info for the LUNI token on the Terra Blockchain

LuniBot for Discord Displays market info for the LUNI/LUNA token on the Terra Blockchain (Webscrape method currently scraping CoinMarketCap). Will evo

0 Jan 22, 2022

Screen scraping and web crawling framework

Pomp Pomp is a screen scraping and web crawling framework. Pomp is inspired by and similar to Scrapy, but has a simpler implementation that lacks the

61 Jun 21, 2021

Python script for crawling ResearchGate.net papers✨⭐️📎

ResearchGate Crawler Python script for crawling ResearchGate.net papers About the script This code start crawling process by urls in start.txt and giv

4 Aug 30, 2022

Facebook Group Scraping Using Beautiful Soup & Selenium

Extract Facebook group posts that are related to a specific topic and write them to a .json file.

14 Aug 12, 2022

A Powerful Spider(Web Crawler) System in Python.

Related tags

Overview

pyspider

Sample Code

Installation

Contribute

TODO

v0.4.0

License

Comments

What happened？

What did I do？

What did you expect to happen？

The specification of the pull request

What happened？

What did I do？

What did you expect to happen？

The specification of the pull request

Problem

Resolution

Expected behavior

Actual behavior

How to reproduce

Releases(v0.3.10)

v0.3.10(Apr 18, 2018)

New features:

Fix several bugs:

v0.3.9(Mar 18, 2017)

New features:

Fix several bugs:

v0.3.8(Aug 18, 2016)

New features:

Fix several bugs:

v0.3.7(Apr 20, 2016)

v0.3.6(Nov 10, 2015)

v0.3.5(May 22, 2015)

v0.3.4(Apr 21, 2015)

Global

WebUI

Other

v0.3.3(Mar 8, 2015)

API

WEBUI

Benchmarking

Other

v0.3.2(Feb 11, 2015)

Scheduler

Fetcher

Databases

WebUI

One Mode

New Command send_message

Other

v0.3.1(Jan 22, 2015)

One Mode

v0.3.0(Jan 11, 2015)

Base

Scheduler

Fetcher

Processor

Script

Webui

v0.2.0(Nov 12, 2014)

Base

Scheduler

Fetcher

Processor

WEBUI

Others

v0.1.0(Mar 9, 2014)

Owner

Roy Binux

LSpider 一个为被动扫描器定制的前端爬虫

Web-Scrapper using Python and Flask

This tool crawls a list of websites and download all PDF and office documents

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster

Ebay Webscraper for Getting Average Product Price

A web scraper that exports your entire WhatsApp chat history.

A spider for Universal Online Judge(UOJ) system, converting problem pages to PDFs.

New Command `send_message`