A scalable frontier for web crawlers

Frontera

Overview

Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, allowing you to build large-scale online web crawlers.

Frontera takes care of the logic and policies to follow during the crawl. It stores and prioritises links extracted by the crawler to decide which pages to visit next, and it is capable of doing this in a distributed manner.
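
To make the idea concrete, here is a minimal, framework-agnostic sketch of the cycle a crawl frontier drives; the class and method names below are simplified stand-ins for illustration, not Frontera's actual API.

    # Conceptual sketch of a crawl frontier loop (not Frontera's real API).
    # A backend stores and prioritises links; a fetcher downloads pages.

    class InMemoryBackend:
        """Toy backend: a FIFO queue of URLs plus a 'seen' set."""
        def __init__(self, seeds):
            self.queue = list(seeds)
            self.seen = set(seeds)

        def get_next_requests(self, max_n):
            batch, self.queue = self.queue[:max_n], self.queue[max_n:]
            return batch

        def links_extracted(self, links):
            for url in links:
                if url not in self.seen:
                    self.seen.add(url)
                    self.queue.append(url)

    def crawl(backend, fetch, extract_links, max_batch=10):
        """Online operation: small batches, with parsing done right after fetch."""
        while True:
            batch = backend.get_next_requests(max_batch)
            if not batch:
                break
            for url in batch:
                page = fetch(url)                              # download
                backend.links_extracted(extract_links(page))   # feed links back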

Main features

  • Online operation: small request batches, with parsing done right after fetch.
  • Pluggable backend architecture: low-level backend access logic is separated from crawling strategy.
  • Two run modes: single process and distributed.
  • Built-in SqlAlchemy, Redis and HBase backends.
  • Built-in Apache Kafka and ZeroMQ message buses.
  • Built-in crawling strategies: breadth-first, depth-first, Discovery (with support of robots.txt and sitemaps).
  • Battle tested: our biggest deployment is 60 spiders/strategy workers delivering 50-60M documents daily for 45 days, without downtime.
  • Transparent data flow, allowing custom components to be integrated easily using Kafka.
  • Message bus abstraction, providing a way to implement your own transport (ZeroMQ and Kafka are available out of the box).
  • Optional use of Scrapy for fetching and parsing.
  • 3-clause BSD license, allowing use in any commercial product.
  • Python 3 support.

Installation

$ pip install frontera

Documentation

Community

Join our Google group at https://groups.google.com/a/scrapinghub.com/forum/#!forum/frontera or check GitHub issues and pull requests.

Comments
  • Redesign codecs

    Redesign codecs

    Issue discussed here: https://github.com/scrapinghub/frontera/issues/211#issuecomment-251931413

    Todo list:

    • [x] Fix msgpack codec
    • [x] Fix json codec
    • [x] Integration test with HBase backend (manually)

    This PR fixes #211

    Other things done in this PR besides the todo list:

    • Added two methods, _convert and reconvert, to the JSON codec. These are needed because JSONEncoder accepts strings only as unicode. The _convert method converts objects recursively to unicode and saves their type (a similar idea is sketched after this list).
    • Made the requirement msgpack >= 0.4, as only versions greater than 0.4 support the changes made in this PR.
    • Fixed a buggy test case in test_message_bus_backend which got exposed after fixing the codecs.
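
    For illustration only (not the PR's code), here is one way a JSON codec can recursively convert bytes to unicode while remembering the original type, so it can be restored on decode; the tag value below is a made-up convention:

    # Hypothetical sketch of the convert/reconvert idea (not Frontera's code).
    # JSON can only carry unicode strings, so bytes are tagged on encode and
    # turned back into bytes on decode. latin1 is used because it round-trips
    # every possible byte value.
    _TAG = u'__bytes__:'

    def _convert(obj):
        """Recursively turn bytes into tagged unicode so JSON can encode them."""
        if isinstance(obj, bytes):
            return _TAG + obj.decode('latin1')
        if isinstance(obj, dict):
            return {_convert(k): _convert(v) for k, v in obj.items()}
        if isinstance(obj, (list, tuple)):
            return [_convert(x) for x in obj]
        return obj

    def _reconvert(obj):
        """Inverse of _convert: restore tagged strings back to bytes."""
        if isinstance(obj, str) and obj.startswith(_TAG):
            return obj[len(_TAG):].encode('latin1')
        if isinstance(obj, dict):
            return {_reconvert(k): _reconvert(v) for k, v in obj.items()}
        if isinstance(obj, list):
            return [_reconvert(x) for x in obj]
        return obj
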
    opened by voith 35
  • Distributed example (HBase, Kafka)

    Distributed example (HBase, Kafka)

    The documentation is a little sparse and does not explain how to integrate with Kafka and HBase for a fully distributed architecture. Could you please provide, in the examples folder, an example of a well-configured distributed Frontera config?
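
    Purely as an illustration until an official example lands: a distributed setup typically splits configuration into a shared module plus worker- and spider-specific overrides. The setting names below are recalled from the Frontera docs of that era and should be verified against your version.

    # Hypothetical sketch of a distributed config (verify names against the docs).

    # common.py -- shared by spiders, strategy workers and DB workers
    MESSAGE_BUS = 'frontera.contrib.messagebus.kafkabus.MessageBus'
    KAFKA_LOCATION = 'localhost:9092'
    SPIDER_LOG_PARTITIONS = 2    # one partition per strategy worker instance
    SPIDER_FEED_PARTITIONS = 2   # one partition per spider instance

    # workers.py -- strategy/DB workers talk to HBase directly
    # from common import *
    BACKEND = 'frontera.contrib.backends.hbase.HBaseBackend'
    HBASE_THRIFT_HOST = 'localhost'
    HBASE_THRIFT_PORT = 9090

    # spiders.py -- spiders only talk to the message bus
    # from common import *
    # BACKEND = 'frontera.contrib.backends.remote.messagebus.MessageBusBackend'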

    opened by casertap 33
  • PY3 Syntactic changes.

    PY3 Syntactic changes.

    Most of the changes were produced using the modernize script. Changes include print syntax, error syntax, converting iterators and generators to lists, etc. It also includes some other changes which were missed by the script.
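
    For readers unfamiliar with such a port, these are the kinds of mechanical rewrites python-modernize typically produces (illustrative examples, not actual diffs from this PR):

    # Typical Python 2 -> 2/3-compatible rewrites (illustrative, not from this PR).
    from __future__ import print_function

    # print statement -> print function
    print("crawl finished")            # was: print "crawl finished"

    # exception syntax
    try:
        1 / 0
    except ZeroDivisionError as exc:   # was: except ZeroDivisionError, exc:
        print(exc)

    # dict methods return views in Python 3; wrap in list() where a list is needed
    settings = {'BACKEND': 'memory'}
    keys = list(settings.keys())       # was: keys = settings.keys()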

    opened by Preetwinder 32
  • Redirect loop when using distributed-frontera

    Redirect loop when using distributed-frontera

    I am using the development version of distributed-frontera, frontera and scrapy for crawling. After a while my spider keeps getting stuck in a redirect loop. Restarting the spider helps, but after a while this happens:

    2015-12-21 17:23:22 [scrapy] INFO: Crawled 10354 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2015-12-21 17:23:22 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:22 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    2015-12-21 17:23:22 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:23 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:24 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:25 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:25 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    2015-12-21 17:23:25 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:26 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:27 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:32 [scrapy] INFO: Crawled 10354 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2015-12-21 17:23:32 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:32 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    2015-12-21 17:23:33 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
    2015-12-21 17:23:34 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
    2015-12-21 17:23:34 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
    2015-12-21 17:23:35 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
    2015-12-21 17:23:35 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    2015-12-21 17:23:36 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
    2015-12-21 17:23:37 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
    2015-12-21 17:23:38 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    2015-12-21 17:23:43 [scrapy] INFO: Crawled 10354 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2015-12-21 17:23:43 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    ...
    2015-12-21 17:45:38 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    2015-12-21 17:45:43 [scrapy] INFO: Crawled 10354 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    

    This does not seem to be an issue with distributed-frontera since I could not find any code related to redirecting there.

    opened by lljrsr 25
  • [WIP] Added Cassandra backend

    [WIP] Added Cassandra backend

    This PR is a rebase of #128. Although I have completely changed the design and refactored the code, I have added @wpxgit's commits (but squashed them) because this work was originally initiated by him.

    I have tried to follow the DRY methodology as much as possible, so I had to refactor some existing code.

    I have serialized dicts using pickle; as a result this backend won't have the problems discussed in #211.
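
    As a quick illustration of why pickling sidesteps those codec problems: pickle round-trips bytes keys and arbitrary value types unchanged, while JSON cannot without a conversion layer (sketch is illustrative, not the PR's code).

    import json
    import pickle

    meta = {b'fingerprint': b'af01', b'depth': 2, 'score': 0.5}

    # pickle preserves bytes keys and value types exactly
    assert pickle.loads(pickle.dumps(meta)) == meta

    # JSON cannot even encode bytes keys without a convert step
    try:
        json.dumps(meta)
    except TypeError as exc:
        print("json needs a convert step:", exc)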

    The PR includes unit tests and some integration tests with the backends integration testing framework.

    It's good that Frontera has an integration test framework for testing backends in single-threaded mode. However, a similar framework for the distributed mode is very much needed.

    I am open to all sorts of suggestions :)

    opened by voith 17
  • cluster kafka db worker doesnt recognize partitions

    cluster kafka db worker doesnt recognize partitions

    Hi, I'm trying to use the cluster configuration. I've created the topics in Kafka and have it up and running. I'm running into trouble starting the database worker. I tried python -m frontera.worker.db --config config.dbw --no-incoming --partitions 0,1 and got an error that 0,1 is not recognized; I then tried python -m frontera.worker.db --config config.dbw --no-incoming --partitions 0 and was getting the same issue as in #359, but somehow that stopped happening.

    Now I'm getting an error that the Kafka partitions are not recognized or iterable, see the traceback below. I'm using Python 3.6 and Frontera from the repo (FYI, qzm and cachetools still needed to be installed manually). Any ideas?

    File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main "main", mod_spec) File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code exec(code, run_globals) File "/usr/lib/python3.6/dist-packages/frontera/worker/db.py", line 246, in args.no_scoring, partitions=args.partitions) File "/usr/lib/python3.6/dist-packages/frontera/worker/stats.py", line 22, in init super(StatsExportMixin, self).init(settings, *args, **kwargs) File "/usr/lib/python3.6/dist-packages/frontera/worker/db.py", line 115, in init self.slot = Slot(self, settings, **slot_kwargs) File "/usr/lib/python3.6/dist-packages/frontera/worker/db.py", line 46, in init self.components = self._load_components(worker, settings, **kwargs) File "/usr/lib/python3.6/dist-packages/frontera/worker/db.py", line 55, in _load_components component = cls(worker, settings, stop_event=self.stop_event, **kwargs) File "/usr/lib/python3.6/dist-packages/frontera/worker/components/scoring_consumer.py", line 24, in init self.scoring_log_consumer = scoring_log.consumer() File "/usr/lib/python3.6/dist-packages/frontera/contrib/messagebus/kafkabus.py", line 219, in consumer return Consumer(self._location, self._enable_ssl, self._cert_path, self._topic, self._group, partition_id=None) File "/usr/lib/python3.6/dist-packages/frontera/contrib/messagebus/kafkabus.py", line 60, in init self._partitions = [TopicPartition(self._topic, pid) for pid in self._consumer.partitions_for_topic(self._topic)]

    opened by danmsf 16
  • [WIP] Downloader slot usage optimization

    [WIP] Downloader slot usage optimization

    Imagine we have a queue of 10K URLs from many different domains. Our task is to fetch them as fast as possible. At the same time we have a prioritization which tends to group URLs from the same domain. During downloading we want to be polite and limit per-host RPS. So picking just the top URLs from the queue leads to wasted time, because the connection pool of the Scrapy downloader is underused most of the time.

    In this PR, I'm addressing this issue by propagating information about overused hostnames/IPs in the downloader pool.
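
    To make the mechanism concrete, here is a hedged sketch of a batch picker that respects hints about overused downloader slots; the overused_keys/key_type keyword names are assumptions based on this PR's description, not verified API.

    # Hypothetical sketch: skip hosts whose downloader slots are already busy.
    from collections import namedtuple
    from urllib.parse import urlparse

    Request = namedtuple('Request', 'url')

    def get_next_requests(queue, max_n, overused_keys=(), key_type='domain'):
        """Return up to max_n queued requests, avoiding overused hosts.

        key_type is accepted only for parity with the hint interface
        ('domain' or 'ip'); this sketch always groups by hostname.
        """
        overused = set(overused_keys)
        batch, leftover = [], []
        for request in queue:
            host = urlparse(request.url).hostname or ''
            if host not in overused and len(batch) < max_n:
                batch.append(request)
            else:
                leftover.append(request)   # keep for a later batch
        queue[:] = leftover                # unpicked requests stay queued
        return batch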

    opened by sibiryakov 16
  • Fixed scheduler process_spider_output() to yield requests

    Fixed scheduler process_spider_output() to yield requests

    Fixes #253. A screenshot using the same code discussed there is attached to the PR (screen shot, 2017-02-12).

    Nothing seems to break when testing this change manually. The only test that was failing was wrong IMO because it passed a list of requests and items and was only expecting items in return. I have modified that test to make it compatible with this patch.
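
    For orientation only, the gist of such a fix can be sketched as a spider middleware that passes Request objects through to Scrapy while still recording them for the frontier; this is an illustrative reconstruction, not the actual patch.

    # Illustrative reconstruction (not the actual patch).
    from scrapy import Request

    class SchedulerSpiderMiddlewareSketch:
        def __init__(self, frontier):
            self.frontier = frontier   # hypothetical frontier handle

        def process_spider_output(self, response, result, spider):
            links = []
            for element in result:
                if isinstance(element, Request):
                    links.append(element)   # remember for the frontier
                    yield element           # and still yield it so callbacks run
                else:
                    yield element           # items pass through untouched
            self.frontier.links_extracted(response.request, links)  # hypothetical call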

    I've split this PR into three commits:

    • The first commit adds a test to reproduce the bug.
    • The second commit fixes the bug.
    • The third commit fixes the broken test discussed above.

    A note about the tests added:

    The tests might be a little difficult to understand at first sight. I would recommend reading the following code in order to understand them:
    • https://github.com/scrapy/scrapy/blob/master/scrapy/core/spidermw.py#L34-L73: This is to understand how scrapy processes the different methods of the spider middleware.
    • https://github.com/scrapy/scrapy/blob/master/scrapy/core/scraper.py#L135-L147: This is to understand how the scrapy core executes the spider middleware methods and passes the control to the spider callbacks.

    I have simulated the above discussed code in order to write the test.

    opened by voith 15
  • New DELAY_ON_EMPTY functionality on FronteraScheduler terminates crawl right at start

    New DELAY_ON_EMPTY functionality on FronteraScheduler terminates crawl right at start

    Until this is solved you can use this in your settings as a workaround:

    DELAY_ON_EMPTY=0.0
    

    The problem is in frontera.contrib.scrapy.schedulers.FronteraScheduler, in the _get_next_requests method. If there are no pending requests and the test self._delay_next_call < time() fails, an empty list is returned, which causes the crawl to terminate.
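
    A hedged reconstruction of the failure mode, to make it visible at a glance (simplified, not the actual Frontera source):

    import time

    # Simplified reconstruction of the bug (not the actual Frontera source).
    class FronteraSchedulerSketch:
        def __init__(self, delay_on_empty=5.0):
            # if the delay is armed before the first call, the first batch is empty
            self._delay_next_call = time.time() + delay_on_empty
            self._pending = []

        def _get_next_requests(self):
            if self._pending:
                return [self._pending.pop(0)]
            if self._delay_next_call < time.time():   # fails right after start-up
                return self._ask_frontier_for_more()
            return []   # empty batch -> Scrapy considers the crawl finished
            # DELAY_ON_EMPTY = 0.0 keeps the check above from failing

        def _ask_frontier_for_more(self):
            return []   # placeholder for the real frontier call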

    bug 
    opened by plafl 14
  • Fix SQL integer type for crc32 field

    Fix SQL integer type for crc32 field

    CRC32 is an unsigned 4-byte int, so it does not fit in a signed 4-byte int (Integer). There is no unsigned int type in the SQL standard, so I changed it to BigInteger instead. Without this change, both MySQL and Postgres complain that the host_crc32 field value is out of bounds. Another option (to save space) would be to convert the CRC32 into a signed 4-byte int, but this would complicate things; not sure it's worth it.
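
    For concreteness, a hedged SQLAlchemy sketch of both options; the table and column names here are illustrative, not Frontera's models.

    # Illustrative sketch of the two options (names are hypothetical).
    import struct

    from sqlalchemy import BigInteger, Column, Integer
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class QueueRecordSketch(Base):
        __tablename__ = 'queue_sketch'
        id = Column(Integer, primary_key=True)
        # option taken in this PR: BigInteger holds the full unsigned 32-bit range
        host_crc32 = Column(BigInteger)

    def to_signed32(crc):
        """Alternative: reinterpret an unsigned CRC32 as a signed 32-bit int
        so it fits a plain Integer column (saves space, complicates reads)."""
        return struct.unpack('>i', struct.pack('>I', crc))[0]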

    opened by lopuhin 12
  • Use crawler settings as a fallback when there's no FRONTERA_SETTINGS

    Use crawler settings as a fallback when there's no FRONTERA_SETTINGS

    This is a follow up to https://github.com/scrapinghub/frontera/pull/45.

    It enables the manager to receive the crawler settings and then instantiate the frontera settings accordingly. I added a few tests that should make the new behavior a little clearer.

    Is something along these lines acceptable? How can it be improved?
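
    The intended precedence (also spelled out later in the v0.3.2 release notes) can be sketched generically like this; the helper and default values are illustrative, not Frontera's Settings API.

    # Illustrative precedence sketch: FRONTERA_SETTINGS module first,
    # then the crawler settings, then framework defaults.
    import importlib

    DEFAULTS = {'MAX_NEXT_REQUESTS': 64}

    def build_frontier_settings(crawler_settings):
        merged = dict(DEFAULTS)                             # 3. defaults (lowest)
        merged.update({k: v for k, v in crawler_settings.items()
                       if k.isupper()})                     # 2. crawler settings
        module_path = crawler_settings.get('FRONTERA_SETTINGS')
        if module_path:                                     # 1. FRONTERA_SETTINGS wins
            module = importlib.import_module(module_path)
            merged.update({k: getattr(module, k)
                           for k in dir(module) if k.isupper()})
        return merged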

    opened by josericardo 12
  • how can I know it works when I use it with scrapy?

    how can I know it works when I use it with scrapy?

    I did everything as described in the running-the-crawl document, and started to run:

    scrapy crawl my-spider
    

    I notice items being crawled in the console output, but I don't know whether Frontera is working.

    What I did

    sandwarm/frontera/settings.py

    
    BACKEND = 'frontera.contrib.backends.sqlalchemy.Distributed'
    
    SQLALCHEMYBACKEND_ENGINE="mysql://acme:[email protected]:3306/acme"
    SQLALCHEMYBACKEND_MODELS={
        'MetadataModel': 'frontera.contrib.backends.sqlalchemy.models.MetadataModel',
        'StateModel': 'frontera.contrib.backends.sqlalchemy.models.StateModel',
        'QueueModel': 'frontera.contrib.backends.sqlalchemy.models.QueueModel'
    }
    
    SPIDER_MIDDLEWARES.update({
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
    })
    
    DOWNLOADER_MIDDLEWARES.update({
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
    })
    
    SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'
    
    

    settings.py

    FRONTERA_SETTINGS = 'sandwarm.frontera.settings'
    
    

    Since I enabled the MySQL backend, I would expect to see a connection error, because I haven't started MySQL yet.

    Thanks for all your hard work, but please make the documentation easier for humans, for example with a very basic working example. Currently we need to gather all the documents to get the basic idea, and even worse, it still doesn't work at all. I have already spent a week trying to get a working example.

    opened by vidyli 1
  • Project Status?

    Project Status?

    It's been a year since the last commit on the master branch. Do you have any plan to maintain this? I noticed a lot of issues don't get resolved, and lots of PRs are still pending.

    opened by psdon 8
  • Message Decode Error

    Message Decode Error

    Getting the following error when adding a URL to Kafka for Scrapy to parse:

    2020-09-07 20:12:46 [messagebus-backend] WARNING: Could not decode message: b'http://quotes.toscrape.com/page/1/', error unpack(b) received extra data.
    
    opened by ab-bh 0
  • `KeyError` thrown when running to_fetch in the StatesContext class: b'fingerprint'

    `KeyError` thrown when running to_fetch in the StatesContext class: b'fingerprint'

    https://github.com/scrapinghub/frontera/blob/master/frontera/core/manager.py I use the 0.8.1 code base in LOCAL_MODE. The KeyError is thrown when execution reaches to_fetch in the StatesContext class:

    from line 801:

    class StatesContext(object):
        ...
        def to_fetch(self, requests):
            requests = requests if isinstance(requests, Iterable) else [requests]
            for request in requests:
                fingerprint = request.meta[b'fingerprint']  # error occurred here!!!
    

    I think the reason is that the meta key b'fingerprint' is used before it is set:

    from line 302:

    class LocalFrontierManager(BaseContext, StrategyComponentsPipelineMixin, BaseManager):
        def page_crawled(self, response):
            ...
            self.states_context.to_fetch(response)  # b'fingerprint' is used here
            self.states_context.fetch()
            self.states_context.states.set_states(response)
            super(LocalFrontierManager, self).page_crawled(response)  # but it is only set here!
            self.states_context.states.update_cache(response)
    

    from line 233:

    class BaseManager(object):
        def page_crawled(self, response):
            ...
            self._process_components(method_name='page_crawled',
                                     obj=response,
                                     return_classes=self.response_model)  # b'fingerprint' is set when the pipeline goes through here
    		
    

    My current workaround is to add these lines to the to_fetch method of the StatesContext class:

        def to_fetch(self, requests):
            requests = requests if isinstance(requests, Iterable) else [requests]
            for request in requests:
                if b'fingerprint' not in request.meta:
                    # assuming sha1 here is frontera.utils.fingerprint.sha1
                    request.meta[b'fingerprint'] = sha1(request.url)
                fingerprint = request.meta[b'fingerprint']
                self._fingerprints[fingerprint] = request
    

    What is the correct way to fix this?

    opened by yujiaao 0
  • KeyError [b'frontier'] on Request Creation from Spider

    KeyError [b'frontier'] on Request Creation from Spider

    Issue might be related to #337

    Hi,

    I have already read in discussions here that the scheduling of requests should be done by Frontera, and apparently even their creation should be done by the frontier and not by the spider. However, the documentation of Scrapy and Frontera says that requests shall be yielded in the spider's parse function.

    What should the process look like if requests are to be created by the crawling strategy and not yielded by the spider? How does the spider trigger that?

    In my use case, I am using scrapy-selenium with scrapy and frontera (I use SeleniumRequests to be able to wait for JS loaded elements).

    I have to generate the URLs I want to scrape in two phases: I am yielding them firstly in the start_requests() method of the spider instead of a seeds file and yield requests for extracted links in the first of two parse functions.

    Yielding SeleniumRequests from start_requests works, but yielding SeleniumRequests from the parse function afterwards results in the following error (only pasted an extract, as the iterable error prints the same errors over and over):

    return (_set_referer(r) for r in result or ())
      File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
        for r in iterable:
      File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
        for r in iterable:
      File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
        for r in iterable:
      File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/frontera/contrib/scrapy/schedulers/frontier.py", line 112, in process_spider_output
        frontier_request = response.meta[b'frontier_request']
    KeyError: b'frontier_request'
    

    Very thankful for all hints and examples!
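
    Returning to the question at the top of this issue, here is a hedged sketch of a crawling strategy that creates and schedules requests itself instead of relying on the spider; the base class and method names follow the Frontera 0.8-era docs but should be verified against your version.

    # Hedged sketch of a crawling strategy creating its own requests
    # (verify the base class and method names against your Frontera version).
    from frontera.strategy import BaseCrawlingStrategy

    class DiscoverSketch(BaseCrawlingStrategy):
        def read_seeds(self, stream):
            # seeds come from a stream, not from the spider's start_requests
            for url in stream:
                url = url.strip()
                if url:
                    request = self.create_request(url)   # frontier builds the request
                    self.schedule(request, score=1.0)

        def filter_extracted_links(self, request, links):
            return links

        def links_extracted(self, request, links):
            # links the spider extracted; the strategy decides what is scheduled
            for link in links:
                self.schedule(link, score=0.5)

        def page_crawled(self, response):
            pass

        def request_error(self, request, error):
            pass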

    opened by dkipping 3
Releases (v0.8.1)
  • v0.8.1(Apr 5, 2019)

  • v0.8.0.1(Jul 30, 2018)

  • v0.8.0(Jul 25, 2018)

    This is a major release containing many architectural changes. The goal of these changes is to make development and debugging of the crawling strategy easier. From now on, there is an extensive guide in the documentation on how to write a custom crawling strategy, a single-process mode making it much easier to debug a crawling strategy locally, and the old distributed mode for production systems. Starting from this version there is no requirement to set up Apache Kafka or HBase to experiment with crawling strategies on your local computer.

    We also removed unnecessary, rarely used features (the distributed spiders run mode and the prioritisation logic in backends) to make Frontera easier to use and understand.

    Here is a (somewhat) full change log:

    • PyPy (2.7.*) support,
    • Redis backend (kudos to @khellan),
    • LRU cache and two cache generations for HBaseStates,
    • Discovery crawling strategy, respecting robots.txt and leveraging sitemaps to discover links faster,
    • Breadth-first and depth-first crawling strategies,
    • new mandatory component in backend: DomainMetadata,
    • filter_links_extracted method in the crawling strategy API to optimise calls to backends for state data,
    • create_request in the crawling strategy now uses the FrontierManager middlewares,
    • support for multiple batch generation instances,
    • support of latest kafka-python,
    • statistics are sent to message bus from all parts of Frontera,
    • overall reliability improvements,
    • settings for OverusedBuffer,
    • DBWorker was refactored and divided on components (kudos to @vshlapakov),
    • seeds addition can be done using s3 now,
    • Python 3.7 compatibility.
    Source code(tar.gz)
    Source code(zip)
  • v0.7.1(Feb 9, 2017)

    Thanks to @voith, a problem introduced at the beginning of Python 3 support, when Frontera supported only keys and values stored as bytes in .meta fields, is now solved. Many Scrapy middlewares weren't working, or were working incorrectly. This is still not tested thoroughly, so please report any bugs.

    Other improvements include:

    • batched states refresh in crawling strategy,
    • proper access to redirects in Scrapy converters,
    • more readable and simple OverusedBuffer implementation,
    • examples, tests and docs fixes.

    Thank you all, for your contributions!

    Source code(tar.gz)
    Source code(zip)
  • v0.7.0(Nov 29, 2016)

    Long-awaited support for the kafka-python 1.x.x client. Frontera is now much more resistant to physical connectivity loss and uses the new asynchronous Kafka API. Other improvements:

    • the strategy worker (SW) consumes less CPU (because of less frequent state flushing),
    • the request creation API in BaseCrawlingStrategy has changed and is now batch oriented,
    • a new article in the docs on cluster setup,
    • an option to disable scoring log consumption in the DB worker,
    • a fix for HBase table dropping,
    • improved test coverage.
    Source code(tar.gz)
    Source code(zip)
  • v0.6.0(Aug 18, 2016)

    • Full Python 3 support 👏 👍 🍻 (https://github.com/scrapinghub/frontera/issues/106), all the thanks goes to @Preetwinder.
    • canonicalize_url method removed in favor of w3lib implementation.
    • The whole Request (incl. meta) is propagated to DB Worker, by means of scoring log (fixes https://github.com/scrapinghub/frontera/issues/131)
    • Generating Crc32 from hostname the same way for both platforms: Python 2 and 3.
    • HBaseQueue supports delayed requests now. A 'crawl_at' field in meta with a timestamp makes the request available to spiders only after the moment expressed by that timestamp has passed. An important feature for revisiting (see the short example after this list).
    • The Request object is now persisted in HBaseQueue, allowing requests to be scheduled with specific meta, headers, body and cookies parameters.
    • MESSAGE_BUS_CODEC option, allowing a message bus codec other than the default to be chosen.
    • Strategy worker refactoring to simplify its customization from subclasses.
    • Fixed a bug with extracted links distribution over spider log partitions (https://github.com/scrapinghub/frontera/issues/129).
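
    A minimal illustration of the delayed-request idea mentioned above; the b'crawl_at' meta key is the one named in these notes, while the epoch-seconds unit is an assumption:

    # Sketch: make a request eligible for fetching one hour from now.
    import time

    def schedule_revisit(request, delay_seconds=3600):
        # assumes request.meta is a dict and the timestamp is in epoch seconds
        request.meta[b'crawl_at'] = int(time.time()) + delay_seconds
        return request
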
    Source code(tar.gz)
    Source code(zip)
  • v0.5.3(Jul 22, 2016)

  • v0.5.2.3(Jul 18, 2016)

  • v0.5.2.2(Jun 29, 2016)

    • CONSUMER_BATCH_SIZE is removed and two new options are introduced: SPIDER_LOG_CONSUMER_BATCH_SIZE and SCORING_LOG_CONSUMER_BATCH_SIZE.
    • A traceback is written to the log when SIGUSR1 is received in the DBW or SW.
    • Finishing in the SW is fixed for the case when the crawling strategy reports it has finished.
    Source code(tar.gz)
    Source code(zip)
  • v0.5.2.1(Jun 24, 2016)

    Before that release the default compression codec was Snappy. We found out that Snappy support is broken in certain Kafka versions, so we issued this release. The latest version has no compression codec enabled by default, and allows choosing the compression codec with the KAFKA_CODEC_LEGACY option.

    Source code(tar.gz)
    Source code(zip)
  • v0.5.2(Jun 21, 2016)

  • v0.5.1.1(Jun 2, 2016)

  • v0.5.0(Jun 1, 2016)

    Here is the change log:

    • latest SQLAlchemy unicode-related crashes are fixed,
    • a corporate-website-friendly canonical solver has been added,
    • the crawling strategy concept evolved: added the ability to add an arbitrary URL to the queue (with a transparent state check); FrontierManager is available on construction,
    • strategy worker code was refactored,
    • default state introduced for links generated during crawling strategy operation,
    • got rid of Frontera logging in favor of Python native logging,
    • logging system configuration by means of logging.config using a file,
    • partitions to instances can be assigned from command line now,
    • improved test coverage from @Preetwinder.

    Enjoy!

    Source code(tar.gz)
    Source code(zip)
  • v0.4.2(Apr 22, 2016)

    This release prevents installing kafka-python package versions newer than 0.9.5. The newer version has significant architectural changes and requires Frontera code adaptation and testing. If you are using the Kafka message bus, then you're encouraged to install this update.

    Source code(tar.gz)
    Source code(zip)
  • v0.4.1(Jan 18, 2016)

    • fixed API docs generation on RTD,
    • added body field in Request objects, to support POST-type requests,
    • guidance on how to set MAX_NEXT_REQUESTS and settings docs fixes,
    • fixed colored logging.
    Source code(tar.gz)
    Source code(zip)
  • v0.4.0(Dec 30, 2015)

    A tremendous work was done:

    • distributed-frontera and frontera were merged into a single project, to make it easier to use and understand,
    • Backend was completely redesigned. It now consists of Queue, Metadata and States objects for low-level code, and higher-level Backend implementations for crawling policies,
    • Added a definition of run modes: single process, distributed spiders, distributed spiders and backend.
    • The overall distributed concept is now integrated into Frontera, making the difference between using components in the single-process and distributed spiders/backend run modes clearer.
    • Significantly restructured and augmented documentation, addressing user needs in a more accessible way.
    • Much less configuration footprint.

    Enjoy this new year release and let us know what you think!

    Source code(tar.gz)
    Source code(zip)
  • v0.3.3(Sep 29, 2015)

    • tldextract is no longer a minimum required dependency,
    • SQLAlchemy backend now persists headers, cookies, and method, also _create_page method added to ease customization,
    • Canonical solver code (needs documentation)
    • Other fixes and improvements
    Source code(tar.gz)
    Source code(zip)
  • v0.3.2(Jun 19, 2015)

    Now it's possible to configure Frontera from Scrapy settings. The order of precedence for configuration sources is the following:

    1. Settings defined in the module pointed by FRONTERA_SETTINGS (higher precedence)
    2. settings defined in the Scrapy settings,
    3. default frontier settings.
    Source code(tar.gz)
    Source code(zip)
  • v0.3.1(May 25, 2015)

    The main issue solved in this version is that request callbacks and request.meta contents now serialize and deserialize successfully in the SQLAlchemy-based backend. Therefore, the majority of Scrapy extensions shouldn't suffer from losing meta or callbacks when passing through Frontera anymore. Second, there is a hot fix for the cold start problem, where seeds are added and Scrapy quickly finishes with no further activity. A well-thought-out solution for this will be offered later.

    Source code(tar.gz)
    Source code(zip)
  • v0.3.0(Apr 15, 2015)

    • Frontera is the new name for Crawl Frontier.
    • Signature of get_next_requests method is changed, now it accepts arbitrary key-value arguments.
    • Overused buffer (subject to removal in the future in favor of the downloader's internal queue).
    • Backend internals became more customizable.
    • Scheduler now requests for new requests when there is free space in Scrapy downloader queue, instead of waiting for absolute emptiness.
    • Several Frontera middlewares are disabled by default.
    Source code(tar.gz)
    Source code(zip)
  • v0.2.0(Jan 12, 2015)

    • Added documentation (Scrapy Seed Loaders+Tests+Examples)
    • Refactored backend tests
    • Added requests library example
    • Added requests library manager and object converters
    • Added FrontierManagerWrapper
    • Added frontier object converters
    • Fixed script examples for new changes
    • Optional Color logging (only if available)
    • Changed Scrapy frontier and recorder integration to scheduler+middlewares
    • Changed default frontier backend
    • Added comment support to seeds
    • Added doc requirements for RTD build
    • Removed optional dependencies for setup.py and requirements
    • Changed tests to pytest
    • Updated docstrings and documentation
    • Changed frontier components (Backend and Middleware) to abc
    • Modified Scrapy frontier example to use seed loaders
    • Refactored Scrapy Seed loaders
    • Added new fields to Request and Response frontier objects
    • Added ScrapyFrontierManager (Scrapy wrapper for Frontier Manager)
    • Changed frontier core objects (Page/Link to Request/Response)
    Source code(tar.gz)
    Source code(zip)
Owner

Scrapinghub: Turn web content into useful data