Video Games Web Scraper is a project that crawls websites and APIs and extracts video game related data from their pages.

Last update: Jan 12, 2022

Overview

Video Games Web Scraper

This project uses an open-source and collaborative framework named Scrapy.

Sources

VideoGameGeek (vgg)

Installation

I strongly recommend that you install this project in a dedicated virtual environment to avoid conflicting with your system packages.

See Virtual Environments and Packages for how to create and use a virtual environment.

Use the package manager pip to install the requirements of this project.

pip install -r requirements.txt

Usage

You can start crawling a source using a spider.

scrapy crawl <spider>

VideoGameGeek

Spiders

vgg-games
vgg-hotitems
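As a rough sketch of how the two spiders could be launched from a script rather than typed by hand, the helper below builds the `scrapy crawl <spider>` command and runs it with `subprocess`. The function names and the output file name are assumptions made for illustration; only the spider names and the `scrapy crawl` command come from this README. It assumes the script is run from the project root with Scrapy installed.

```python
import subprocess

# Spider names as listed in this README.
SPIDERS = ("vgg-games", "vgg-hotitems")


def build_crawl_command(spider, output=None):
    """Return the argument list for `scrapy crawl <spider>`."""
    if spider not in SPIDERS:
        raise ValueError(f"unknown spider: {spider!r}")
    command = ["scrapy", "crawl", spider]
    if output is not None:
        # -o writes scraped items to a feed file; the format is
        # inferred from the file extension.
        command += ["-o", output]
    return command


def run_spider(spider, output=None):
    """Run one spider and raise CalledProcessError if the crawl fails."""
    subprocess.run(build_crawl_command(spider, output), check=True)
```

For example, `run_spider("vgg-games", "games.jl")` would be equivalent to running `scrapy crawl vgg-games -o games.jl` in a terminal (the file name is hypothetical).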

Developer Resources

Initialize your Development Environment

pip install -r requirements.txt

Create and Run Tests

See Spiders Contracts in the Scrapy documentation for instructions on how to create tests for spiders, then run:

scrapy check

Scrapy Documentation

See the Scrapy Documentation for more instructions on how to create and modify spiders.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

Support

If you enjoy this repository, please star it. Starring a repository shows appreciation to its maintainer for their work. Many of GitHub's repository rankings depend on the number of stars a repository has.

License

MIT


Comments

Bump scrapy from 2.6.1 to 2.6.2
Bumps scrapy from 2.6.1 to 2.6.2.

Release notes

Sourced from scrapy's releases.

2.6.2

Fixes a security issue around HTTP proxy usage, and addresses a few regressions introduced in Scrapy 2.6.0.

See the changelog.

Changelog

Sourced from scrapy's changelog.

Scrapy 2.6.2 (2022-07-25)

Security bug fix:

When HttpProxyMiddleware (scrapy.downloadermiddlewares.httpproxy) processes a request with proxy metadata, and that proxy metadata includes proxy credentials, HttpProxyMiddleware sets the Proxy-Authorization header, but only if that header is not already set.

There are third-party proxy-rotation downloader middlewares that set different proxy metadata every time they process a request.

Because of request retries and redirects, the same request can be processed by downloader middlewares more than once, including both HttpProxyMiddleware and any third-party proxy-rotation downloader middleware.

These third-party proxy-rotation downloader middlewares could change the proxy metadata of a request to a new value, but fail to remove the Proxy-Authorization header set for the previous value of the proxy metadata, causing the credentials of one proxy to be sent to a different proxy.

To prevent the unintended leaking of proxy credentials, the behavior of HttpProxyMiddleware is now as follows when processing a request:

If the request being processed defines proxy metadata that includes credentials, the Proxy-Authorization header is always updated to feature those credentials.

If the request being processed defines proxy metadata without credentials, the Proxy-Authorization header is removed unless it was originally defined for the same proxy URL.

To remove proxy credentials while keeping the same proxy URL, remove the Proxy-Authorization header.

If the request has no proxy metadata, or that metadata is a falsy value (e.g. None), the Proxy-Authorization header is removed.

It is no longer possible to set a proxy URL through the proxy metadata but set the credentials through the Proxy-Authorization header. Set proxy credentials through the proxy metadata instead.
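The last point can be illustrated with a small helper that embeds the credentials in the proxy URL itself, which is the value a spider would pass as the proxy request metadata. This is a sketch under assumptions: `proxy_meta` is a hypothetical helper, not part of Scrapy, and percent-encoding the credentials assumes the consumer decodes them again.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def proxy_meta(proxy_url, username, password):
    """Return a request-meta dict whose proxy URL carries the credentials."""
    parts = urlsplit(proxy_url)
    # Rebuild host[:port]; IPv6 literals are not handled in this sketch.
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    # Percent-encode so characters such as "@" or ":" do not break the URL.
    credentials = f"{quote(username, safe='')}:{quote(password, safe='')}"
    netloc = f"{credentials}@{host}"
    url = urlunsplit(
        (parts.scheme, netloc, parts.path, parts.query, parts.fragment)
    )
    return {"proxy": url}
```

In a spider, the returned dict would be passed as the `meta` argument of a request, so the credentials are set through the proxy metadata rather than through a hand-written Proxy-Authorization header.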

... (truncated)

Commits

aecbccb Bump version: 2.6.1 → 2.6.2

af7dd16 Merge pull request from GHSA-9x8m-2xpf-crp3

4205609 Fixed intersphinx references

e3e69d1 Pin documentation requirements (#5536)

54bfb96 Cover #5525 in the 2.6.2 release notes (#5535)

4ef7182 If TWISTED_REACTOR is None, reuse any pre-installed reactor (#5528)

1c1cd5d Update the 2.6.2 release notes

84c29a2 Unset the release date of still-unreleased 2.6.2 (#5503)

b9b9422 Merge pull request #5482 from alexpdev/parse_help_msg

915c288 edit

Additional commits viewable in compare view

Bump pillow from 9.0.0 to 9.0.1
Bumps pillow from 9.0.0 to 9.0.1.

Release notes

Sourced from pillow's releases.

9.0.1

https://pillow.readthedocs.io/en/stable/releasenotes/9.0.1.html

Changes

In show_file, use os.remove to remove temporary images. CVE-2022-24303 #6010 [@radarhere, @hugovk]

Restrict builtins within lambdas for ImageMath.eval. CVE-2022-22817 #6009 [radarhere]

Changelog

Sourced from pillow's changelog.

9.0.1 (2022-02-03)

In show_file, use os.remove to remove temporary images. CVE-2022-24303 #6010 [radarhere, hugovk]

Restrict builtins within lambdas for ImageMath.eval. CVE-2022-22817 #6009 [radarhere]

Commits

6deac9e 9.0.1 version bump

c04d812 Update CHANGES.rst [ci skip]

4fabec3 Added release notes for 9.0.1

02affaa Added delay after opening image with xdg-open

ca0b585 Updated formatting

427221e In show_file, use os.remove to remove temporary images

c930be0 Restrict builtins within lambdas for ImageMath.eval

75b69dd Dont need to pin for GHA

cd938a7 Autolink CWE numbers with sphinx-issues

2e9c461 Add CVE IDs

See full diff in compare view


Bump scrapy from 2.5.1 to 2.6.0

Bumps scrapy from 2.5.1 to 2.6.0.

Release notes

Sourced from scrapy's releases.

2.6.0

Security fixes for cookie handling (see details below)

Python 3.10 support

asyncio support is no longer considered experimental, and works out-of-the-box on Windows regardless of your Python version

Feed exports now support pathlib.Path output paths and per-feed item filtering and post-processing

See the full changelog

Security bug fixes

When a Request object with cookies defined gets a redirect response causing a new Request object to be scheduled, the cookies defined in the original Request object are no longer copied into the new Request object.

If you manually set the Cookie header on a Request object and the domain name of the redirect URL is not an exact match for the domain of the URL of the original Request object, your Cookie header is now dropped from the new Request object.

The old behavior could be exploited by an attacker to gain access to your cookies. Please, see the cjvr-mfj7-j4j8 security advisory for more information.

Note: It is still possible to enable the sharing of cookies between different domains with a shared domain suffix (e.g. example.com and any subdomain) by defining the shared domain suffix (e.g. example.com) as the cookie domain when defining your cookies. See the documentation of the Request class for more information.

When the domain of a cookie, either received in the Set-Cookie header of a response or defined in a Request object, is set to a public suffix (https://publicsuffix.org/), the cookie is now ignored unless the cookie domain is the same as the request domain.

The old behavior could be exploited by an attacker to inject cookies from a controlled domain into your cookiejar that could be sent to other domains not controlled by the attacker. Please, see the mfjm-vh54-3f96 security advisory for more information.
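The redirect rule for a manually set Cookie header can be modeled in a few lines. This is only an illustrative model of the behavior as the release notes describe it (an exact match on the domain name), not Scrapy's actual implementation, and the function name is hypothetical.

```python
from urllib.parse import urlsplit


def cookie_header_survives_redirect(original_url, redirect_url):
    """Return True if a manually set Cookie header would be kept.

    Models the rule from the release notes: the header is dropped unless
    the redirect URL's domain name exactly matches the original one.
    """
    original_host = urlsplit(original_url).hostname
    redirect_host = urlsplit(redirect_url).hostname
    return original_host is not None and original_host == redirect_host
```

Under this model, a redirect from `https://example.com/a` to `https://www.example.com/b` drops the header, because a subdomain is not an exact match.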

Changelog

Sourced from scrapy's changelog.

Scrapy 2.6.0 (2022-03-01)

Highlights:

Security fixes for cookie handling
Python 3.10 support
asyncio support is no longer considered experimental, and works out-of-the-box on Windows regardless of your Python version
Feed exports now support pathlib.Path output paths and per-feed item filtering and post-processing

Security bug fixes

When a Request object with cookies defined gets a redirect response causing a new Request object to be scheduled, the cookies defined in the original Request object are no longer copied into the new Request object.

If you manually set the Cookie header on a Request object and the domain name of the redirect URL is not an exact match for the domain of the URL of the original Request object, your Cookie header is now dropped from the new Request object.

The old behavior could be exploited by an attacker to gain access to your cookies. Please, see the cjvr-mfj7-j4j8 security advisory (https://github.com/scrapy/scrapy/security/advisories/GHSA-cjvr-mfj7-j4j8) for more information.

Note: It is still possible to enable the sharing of cookies between different domains with a shared domain suffix (e.g. example.com and any subdomain) by defining the shared domain suffix (e.g. example.com) as the cookie domain when defining your cookies. See the documentation of the Request class for more information.

When the domain of a cookie, either received in the Set-Cookie header of a response or defined in a Request object, is set to a public suffix (https://publicsuffix.org/), the cookie is now

... (truncated)

Commits

6b63e7c Bump version: 2.5.0 → 2.6.0
e865c44 Merge pull request from GHSA-mfjm-vh54-3f96
8ce01b3 Merge pull request from GHSA-cjvr-mfj7-j4j8
aa0306a Cover 2.6.0 in the release notes (#5399)
08557e0 Pin old markupsafe when we pin old mitmproxy (#5427)
3b42ccf Add a link to Discord (#5422)
8840403 Merge pull request #5412 from Laerte/master
0b0eea3 Merge pull request #5419 from PendalF89/patch-2
187b5c8 Update the documentation link for robots.txt (#5415)
bbb693d Update downloader-middleware.rst
Additional commits viewable in compare view


Releases (v0.0.3)

v0.0.3 (Mar 19, 2022)

[0.0.3] - 2022-03-19

Security

Remove Pillow from requirements.
Bump Scrapy from 2.5.1 to 2.6.1.

v0.0.2 (Mar 5, 2022)

[0.0.2] - 2022-03-05

Changed

Added versions to web crawler named vgg-games.

v0.0.1 (Jan 12, 2022)

[0.0.1] - 2022-01-12

Added

A web crawler named vgg-games that extracts game details from VideoGameGeek.

A web crawler named vgg-hotitems that extracts hot item details from VideoGameGeek.


Owner

Albert Marrero
