A client interface for Scrapinghub's API

Last update: Sep 28, 2022

Overview

Client interface for Scrapinghub API

https://secure.travis-ci.org/scrapinghub/python-scrapinghub.svg?branch=master

The scrapinghub is a Python library for communicating with the Scrapinghub API.

Requirements

Python 2.7 or above

Installation

The quick way:

pip install scrapinghub

You can also install the library with MessagePack support, it provides better response time and improved bandwidth usage:

pip install scrapinghub[msgpack]

Documentation

Documentation is available online via Read the Docs or in the docs directory.

Comments

msgpack errors when using iter() with intervals between each batch call

Good Day!

I've encountered this peculiar issue when trying to save up memory by processing the items in chunks. Here's a strip down version of the code for reproduction of the issue:

import pandas as pd

from scrapinghub import ScrapinghubClient

def read_job_items_by_chunk(jobkey, chunk=10000):
    """In order to prevent OOM issues, the jobs' data must be read in
    chunks.

    This will return a generator of pandas DataFrames.
    """

    client = ScrapinghubClient("APIKEY123")

    item_generator = client.get_job(jobkey).items.iter()

    while item_generator:
        yield pd.DataFrame(
            [next(item_generator) for _ in range(chunk)]
        )

for df_chunk in read_job_items_by_chunk('123/123/123'):
    # having a small chunk-size like 10000 won't have any problems

for df_chunk in read_job_items_by_chunk('123/123/123', chunk=25000):
    # having a bug chunk-size like 25000 will throw out errors like the one below

Here's the common error it throws:

<omitted stack trace above>

    [next(item_generator) for _ in range(chunk)]
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/client/proxy.py", line 115, in iter
    _path, requests_params, **apiparams
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/hubstorage/serialization.py", line 33, in mpdecode
    for obj in unpacker:
  File "msgpack/_unpacker.pyx", line 459, in msgpack._unpacker.Unpacker.__next__ (msgpack/_unpacker.cpp:459)
  File "msgpack/_unpacker.pyx", line 390, in msgpack._unpacker.Unpacker._unpack (msgpack/_unpacker.cpp:390)
  File "/usr/local/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 67: invalid start byte

Moreover, it throws out a different error when using a much bigger chunk-size, like 50000:

<omitted stack trace above>

    [next(item_generator) for _ in range(chunk)]
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/client/proxy.py", line 115, in iter
    _path, requests_params, **apiparams
  File "/usr/local/lib/python2.7/site-packages/scrapinghub/hubstorage/serialization.py", line 33, in mpdecode
    for obj in unpacker:
  File "msgpack/_unpacker.pyx", line 459, in msgpack._unpacker.Unpacker.__next__ (msgpack/_unpacker.cpp:459)
  File "msgpack/_unpacker.pyx", line 390, in msgpack._unpacker.Unpacker._unpack (msgpack/_unpacker.cpp:390)
TypeError: unhashable type: 'dict'

I find that the workaround/solution for this would be to have a lower value for chunk. So far, 1000 works great.

This uses scrapy:1.5 stack in Scrapy Cloud.

I'm guessing this might have something to do with the long waiting time that happens when processing the pandas DataFrame chunk, and when the next batch of items are being iterated, the server might have deallocated the pointer to it or something.

May I ask if there might be a solution for this? since a much bigger chunk size will help with the speed of our jobs.

I've marked it as bug for now as this is quite an unexpected/undocumented behavior.

Cheers!

bug

opened by BurnzZ 10

basic py3.3 compatibility while keeping py2.7 compatibility

makes the api callable from py2.x and py3.x. Since scrapy itself is not yet python3 compatible this might be still useful if one has a control application/api written in py3 which should be able to control scrapy crawlers.

opened by ms5 9
UnicodeDecodeError while fetching items
It seems like I randomly get errors like this:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xde in position 174: invalid continuation byte at msgpack._cmsgpack.Unpacker._unpack (_unpacker.pyx:443) at msgpack._cmsgpack.Unpacker.__next__ (_unpacker.pyx:518) at mpdecode (/usr/local/lib/python3.7/site-packages/scrapinghub/hubstorage/serialization.py:33) at iter (/usr/local/lib/python3.7/site-packages/scrapinghub/client/proxy.py:115)

This happens while iterating the items through last_job.items.iter() It seems to happen about 50% of the time from what I see. I scrape the same website every day and run that function and sometimes it works fine, sometimes raise that error. I am not sure if this is an issue with this library or with the ScrapingHub API though but it is very problematic.

This happens on the latest (2.3.1) version
opened by mijamo 8
Use SHUB_JOBAUTH environment variable in utils.parse_auth method

Currently, the parse_auth method tries to get the API Key from the SH_APIKEY environment variable, which needs to be manually set either in spider's code or in the [docker] image's code. A regular practice is to create dummy users and associate them with the project so that real contributors don't have to share their API Keys.

Another option is to use the credentials provided by the SHUB_JOBAUTH, defined during runtime when executing jobs in the Scrapy Cloud platform.

Although it's possible to use Collections and Frontera, this is not a regular Dash API Key but a JWT Token generated in runtime by JobQ service, which works only for a part of our API endpoints (JobQ/Hubstorage).

I'd like to contribute with a Pull Request adding support for this ephemeral API Key.

opened by victor-torres 8
Avoid races for hubstorage frontier tests

Looks like there're races in sh.hubstorage.frontier tests, it's relatively easy to reproduce by rerunning the Travis job (I can't reproduce it locally) https://travis-ci.org/scrapinghub/python-scrapinghub/jobs/172664296

After checking internals, my guess is that as Batchuploader works in a separate thread trying to upload next messages batch from queue, and frontier.flush() operation waits for the queue being empty (by doing queue.join()), there's a probability that on context switch the queue is empty, but Batchuploader hasn't called callback yet, and the frontier.newcounter is not updated yet. In this case simple short delay should fix this, at least I wasn't able to reproduce the issue after the fix.

Could you please confirm/disapprove my finding please?

opened by vshlapakov 7
Add truncate method to collections

This makes possible to delete an entire Collection with a single API request, without having to iterate through records and therefore making multiple API requests.

opened by victor-torres 6

How to run a job?

I can't see how to run a job. There's two examples in the docs. In the project section:

For example, to schedule a spider run (it returns a job object):

>>> project.jobs.run('spider1', job_args={'arg1':'val1'})
<scrapinghub.client.Job at 0x106ee12e8>>

and in the spider section:

Like project instance, spider instance has jobs field to work with the spider's jobs.

To schedule a spider run:

>>> spider.jobs.run(job_args={'arg1:'val1'})
<scrapinghub.client.Job at 0x106ee12e8>>

Neither works, both throw AttributeError: 'Jobs' object has no attribute 'run'

opened by ollieglass 6

Some imports from standard lib collections are breaking on python 3.10
Hi everyone,

Based on an issue from another repo (https://github.com/okfn-brasil/querido-diario/issues/502), I noticed that scrapinghub is using some imports from standard lib collections that are deprecated and not working on Python 3.10.

In Python 3.8 I have these results on ipython console:

In [1]: from collections import Iterator <ipython-input-1-4fb967d2a9f8>:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working from collections import Iterator In [2]: from collections import Iterable <ipython-input-2-c0513a1e6784>:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working from collections import Iterable In [3]: from collections import MutableMapping <ipython-input-3-069a7babadbf>:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working from collections import MutableMapping

According to this, it is necessary to change the imports of Iterable, Iterator and MutableMapping to get these items from "collections.abc" instead of just "collections"

Here are the list of imports that I found:

tests/client/test_job.py - from collections import Iterator

tests/client/test_frontiers.py - from collections import Iterable

tests/client/test_projects.py - from collections import defaultdict, Iterator

scrapinghub/hubstorage/resourcetype.py - from collections import MutableMapping
opened by lbmendes 5

Collections key not found with library

I'm curious about the difference between Collection.get() and Collection.iter(key=[KEY])

>>> key = '456/789'
>>> store = project.collections.get_store('trump')
>>> store.set({'_key': key, 'value': 'abc'})
>>> print(store.list(key=[key]))

[{'value': 'abc', '_key': '456/789'}]  # https://storage.scrapinghub.com/collections/9328/s/trump?key=456%2F789&meta=_key

>>> try:
>>>     print(store.get(key))
>>> except scrapinghub.client.exceptions.NotFound as e:
>>>     print(getattr(e, 'http_error', e))

404 Client Error: Not Found for url: https://storage.scrapinghub.com/collections/9328/s/trump/456/789

I assume that Collection.get() is a handy shortcut for the key-filtered .iter() function so I guess the point of my issue is that .get() will raise an exception if given bad input, for example slashes

opened by stav 5

project.jobs close_reason support needed
I would like to get the last "finished" job for a spider.

But if I do:

project.jobs(spider='myspider', state='finished', count=-1)

I will only get jobs with a state of finished but this may include jobs with a close_reason of shutdown or something other than "finished".

I would like to be able to do:

project.jobs(spider='myspider', close_reason='finished', count=-1)

which would of course assume that state is finished as well.
opened by stav 5
Drop versions earlier than Python 3.7 and update requirements
library upgrades

This update nominally drops support for Python 2.7, 3.5, and 3.6, and tests support for 3.10 to avoid libraries being pinned to very old versions, many of them with bugs or with security issues.

It's "nominally" because the code hasn't been changed except for deprecations enforced in Python 3.9 or 3.10.

disabled tests

Tests that required running test servers are disabled:

Running the servers locally is too complicated

There are no changes to the library's logic. Only required library versions were changed

The tests can be re-enabled by someone with access to test servers.
maintenance
opened by apalala 4

Add retry logic to Job Tag Update function

Description

An Internal Server Error pops up whenever a large number of tag updates run parallel or sequentially.

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/project-1.0-py3.10.egg/XX/utils/workflow/__init__.py", line 930, in run
    start_stage, active_stage, ran_stages = self.setup_continuation(
  File "/usr/local/lib/python3.10/site-packages/project-1.0-py3.10.egg/XX/utils/workflow/__init__.py", line 667, in setup_continuation
    self._discard_jobs(start_stage, ran_stages)
  File "/usr/local/lib/python3.10/site-packages/project-1.0-py3.10.egg/XX/utils/workflow/__init__.py", line 705, in _discard_jobs
    self.get_job(jobinfo["key"]).update_tags(
  File "/usr/local/lib/python3.10/site-packages/scrapinghub/client/[jobs.py](http://jobs.py/)", line 503, in update_tags
    self._client._connection._post('jobs_update', 'json', params)
  File "/usr/local/lib/python3.10/site-packages/scrapinghub/[legacy.py](http://legacy.py/)", line 120, in _post
    return self._request(url, params, headers, format, raw, files)
  File "/usr/local/lib/python3.10/site-packages/scrapinghub/client/[exceptions.py](http://exceptions.py/)", line 98, in wrapped
    raise ServerError(http_error=exc)
scrapinghub.client.exceptions.ServerError: Internal server error

This is not a problem if you are doing updates for a couple of jobs, but if you want to mass update this error will pop up eventually.

Adding adaptable retry logic to the update_tags function through that ServerError exception would make it easier to debug and implement large-scale workflows.

opened by ftadao 0

Incorrect information for Samples in Job documentation

At this link - https://python-scrapinghub.readthedocs.io/en/latest/client/overview.html#job-data-1, the description for samples refer to the job stats which is confusing and seems incorrect.

I think it should be runtime samples that the job uploaded;

Please correct me if I have misinterpreted this.

opened by gutsytechster 0
Jobs.iter() is unable to accept has_tag as a list.

From the docs:

jobs_summary = project.jobs.iter( ... has_tag=['new', 'verified'], lacks_tag='obsolete')

has_tag accepts a string but no a list. lacks_tag works perfectly fine with both.

opened by PeteRoyAlex 0

KeyError: 'status' when trying to schedule spider

I am getting this error when trying to schedule a spider. This is happening with version 2.3.1

Traceback (most recent call last):
  File "/home/molveyra/.local/share/virtualenvs/mollie-AtuAN_AE/lib/python3.8/site-packages/scrapinghub/legacy.py", line 157, in _decode_response
    if data['status'] == 'ok':
KeyError: 'status'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/molveyra/.local/share/virtualenvs/mollie-AtuAN_AE/lib/python3.8/site-packages/scrapinghub/client/exceptions.py", line 69, in wrapped
    return method(*args, **kwargs)
  File "/home/molveyra/.local/share/virtualenvs/mollie-AtuAN_AE/lib/python3.8/site-packages/scrapinghub/client/__init__.py", line 19, in _request
    return super(Connection, self)._request(*args, **kwargs)
  File "/home/molveyra/.local/share/virtualenvs/mollie-AtuAN_AE/lib/python3.8/site-packages/scrapinghub/legacy.py", line 143, in _request
    return self._decode_response(response, format, raw)
  File "/home/molveyra/.local/share/virtualenvs/mollie-AtuAN_AE/lib/python3.8/site-packages/scrapinghub/legacy.py", line 169, in _decode_response
    raise APIError("JSON response does not contain status")
scrapinghub.legacy.APIError: JSON response does not contain status

enhancement

opened by kalessin 6

collections.get_store is not working as documented

Upon going through collections doc we see 2. call .get_store(<somename>) to create or access the named collection you want (the collection will be created automatically if it doesn't exist) ; you get a "store" object back, But when you try this

>>> store = collections.get_store('store_which_does_not_exist')
>>> store.get('key_which_does_not_exist')
DEBUG:https://storage.scrapinghub.com:443 "GET /collections/462630/s/store_which_does_not_exist/key_which_does_not_exist HTTP/1.1" 404 46
2021-02-04 13:33:20 [urllib3.connectionpool] DEBUG: https://storage.scrapinghub.com:443 "GET /collections/462630/s/store_which_does_not_exist/key_which_does_not_exist HTTP/1.1" 404 46
DEBUG:<Response [404]>: b'unknown collection store_which_does_not_exist\n'
2021-02-04 13:33:20 [HubstorageClient] DEBUG: <Response [404]>: b'unknown collection store_which_does_not_exist\n'
*** scrapinghub.client.exceptions.NotFound: unknown collection store_which_does_not_exist

When we .set some value to store which doesn’t exist, store is created and then the values are stored.

>>> store.set({'_key': 'some_key', 'value': 'some_value'})
DEBUG:https://storage.scrapinghub.com:443 "POST /collections/462630/s/store_which_does_not_exist HTTP/1.1" 200 0
2021-02-04 13:36:56 [urllib3.connectionpool] DEBUG: https://storage.scrapinghub.com:443 "POST /collections/462630/s/store_which_does_not_exist HTTP/1.1" 200 0
According to docs, shouldn’t the store be created when we call .get_store ?

bug docs

opened by realslimshanky-sh 0

Releases(2.4.0)

2.4.0(Mar 10, 2022)
What's Changed

update iter() for better fallback in getting 'meta' argument by @BurnzZ in https://github.com/scrapinghub/python-scrapinghub/pull/146

switch from Travis to GH actions by @pawelmhm in https://github.com/scrapinghub/python-scrapinghub/pull/162

Python 3.10 compatibility by @elacuesta in https://github.com/scrapinghub/python-scrapinghub/pull/166

New Contributors

@pawelmhm made their first contribution in https://github.com/scrapinghub/python-scrapinghub/pull/162

Full Changelog: https://github.com/scrapinghub/python-scrapinghub/compare/2.3.1...2.4.0
Source code(tar.gz)
Source code(zip)
2.3.1(Mar 13, 2020)
update msgpack dependency

drop official support for Python 3.4

improve documentation, address a few PEP8 complaints

Source code(tar.gz)
Source code(zip)
2.3.0(Dec 17, 2019)
add items.list_iter method to iterate by chunks

fix retrying logic for HTTP errors

improve documentation

Source code(tar.gz)
Source code(zip)
2.1.1(Apr 25, 2019)
add Python 3.7 support

update msgpack dependency

fix iter logic for items/requests/logs

add truncate method to collections

improve documentation

Source code(tar.gz)
Source code(zip)
2.2.1(Aug 7, 2019)
use jobs.cancel command name to maintain consistency

provide basic documentation for the new feature

Source code(tar.gz)
Source code(zip)
2.2.0(Aug 7, 2019)
add a command to cancel multiple jobs per call

normalize and simplify using VCR.py cassettes in tests

Source code(tar.gz)
Source code(zip)
2.1.0(Jan 14, 2019)
add an option to schedule jobs with custom environment variables

fallback to SHUB_JOBAUTH environment variable if SH_APIKEY is not set

provide a unified connection timeout used by both internal clients

increase a chunk size when working with the items stats endpoint

Python 3.3 is considered unmaintained.
Source code(tar.gz)
Source code(zip)
2.0.0(Mar 29, 2017)

We're very happy to finally announce official major release of new Scrapinghub python client. Documentation is available online via Read The Docs http://python-scrapinghub.readthedocs.io/
Source code(tar.gz)
Source code(zip)
1.9.0(Nov 29, 2016)
python-scrapinghub 1.9.0

python-hubstorage merged into python-scrapinghub

all tests are improved and rewritten with py.test

hubstorage tests use vcrpy cassettes, work faster and don't require any external services to run

python-hubstorage is going to be considered deprecated, its next version will contain a deprecation warning and a proposal to use python-scrapinghub >=1.9.0 instead.
Source code(tar.gz)
Source code(zip)

Owner

Scrapinghub

Turn web content into useful data

GitHub Repository https://python-scrapinghub.readthedocs.io/

Proxy server that records responses for UI testing (and other things)

Welcome to playback-proxy 👋 A proxy tool that records communication (requests, websockets) between client and server. This recording can later be use

41 Apr 01, 2022

Python client library for Bigcommerce API

Bigcommerce API Python Client Wrapper over the requests library for communicating with the Bigcommerce v2 API. Install with pip install bigcommerce or

81 Dec 26, 2022

This is a straightforward python implementation to specifically grab basic infos about IPO companies in China from Sina Stock website.

SinaStockBasicInfoCollect This is a straightforward python implementation to specifically grab basic infos about IPO companies in China from Sina Stoc

1 Dec 09, 2021

ro.py is a modern, asynchronous Python 3 wrapper for the Roblox API.

81 Dec 26, 2022

The official Pushy SDK for Python apps.

pushy-python The official Pushy SDK for Python apps. Pushy is the most reliable push notification gateway, perfect for real-time, mission-critical app

1 Dec 21, 2021

Production Ontology Merging (PrOM) Framework

Production Ontology Merging (PrOM) Framework OWL 2 DL ontology merging framework tailored to the production domain Features preprocessing: translation

4 Nov 02, 2022

Proxy-Bot - Python proxy bot for telegram

Proxy-Bot 🤖 Proxy bot between the main chat and a newcomer, allows all particip

3 Apr 01, 2022

This repository provides a set functions to extract paragraphs from AWS Textract responses.

extract-paragraphs-with-aws-textract Since AWS Textract (the AWS OCR service) does not have a native function to extract paragraphs, this repository p

3 Jan 26, 2022

Telegram bot to check availability of vaccination slots in India.

cowincheckbot Telegram bot to check availability of vaccination slots in India. Setup Install requirements using pip3 install -r requirements.txt Crea

10 Jun 11, 2022

SOLSEA-NFT-EXPLORE - Using Streamlit to build a simple UI on top of the Solana API

SOLSEA NFT Explorer Using Streamlit to build a simple UI on top of the Solana AP

3 Mar 19, 2022

Bot Telegram per creare e gestire un Babbo Natale Segreto con amici ecc

Babbo Natale Segreto: Telegram Bot Bot Telegram per creare e gestire un Babbo Natale Segreto con amici ecc. Che cos'è? Il Babbo Natale Segreto è un gi

2 Jul 18, 2022

An advanced crypto trading bot written in Python

Jesse Jesse is an advanced crypto trading framework which aims to simplify researching and defining trading strategies. Why Jesse? In short, Jesse is

4.4k Jan 09, 2023

A Telegram bot that can stream Telegram files to users over HTTP

AK-FILE-TO-LINK-BOT A Telegram bot that can stream Telegram files to users over HTTP. Setup Install dependencies (see requirements.txt), configure env

3 Dec 29, 2021

A Bot to get RealTime Tweets to a Specific Chats from Desired Persons on Twitter to Telegram Chat.

TgTwitterStreamer A Bot to get RealTime Tweets to a Specific Chats from Desired Persons on Twitter to Telegram Chat. For Getting ENV's Refer this Link

69 Dec 20, 2022

Python wrapper for Revolt API

defectio is a direct implementation of the entire Revolt API and provides a way to authenticate and start communicating with Revolt servers. Similar interface to discord.py

26 Sep 18, 2022

Collect links to profiles by username through search engines

Marple Summary Collect links to profiles by username through search engines (currently Google and DuckDuckGo). Quick Start ./marple.py soxoj Results:

125 Dec 19, 2022

Notflix - Notion / Netflix and IMDb to organise your movie dates. Happy Valentine <3 from 0x1za

Welcome to notflix 👋 This is a project to help organise shows to watch with my

3 Feb 15, 2022

Short Program using Transavia's API to notify via email an user waiting for a flight at special dates and with the best price

Flight-Notifier Short Program using Transavia's API to notify via email an user waiting for a flight at special dates and with the best price Algorith

2 Apr 10, 2022

Бот Telegram для Школы в Капотне (ЦО № 1858)

co1858 Telegram Bot Активно разрабатывался в 2015-2016 году как учебный проект, с целью научиться создавать ботов для Telegram. Бот автоматически парс

4 Aug 30, 2022

The official command-line client for spyse.com

Spyse CLI The official command-line client for spyse.com. NOTE: This tool is currently in the early stage beta and shouldn't be used in production. Yo

43 Dec 08, 2022

A client interface for Scrapinghub's API

Related tags

Overview

Client interface for Scrapinghub API

Requirements

Installation

Documentation

Comments

library upgrades

disabled tests

Description

Releases(2.4.0)

2.4.0(Mar 10, 2022)

What's Changed

New Contributors

2.3.1(Mar 13, 2020)

2.3.0(Dec 17, 2019)

2.1.1(Apr 25, 2019)

2.2.1(Aug 7, 2019)

2.2.0(Aug 7, 2019)

2.1.0(Jan 14, 2019)

2.0.0(Mar 29, 2017)

1.9.0(Nov 29, 2016)