A universal package of scraper scripts for humans


Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Contributing
  5. Sponsors
  6. License
  7. Contact
  8. Acknowledgements

About The Project

Scrapera is a completely Chromedriver-free package that provides a variety of scraper scripts for the most commonly used machine learning and data science domains. Scrapera scrapes directly and asynchronously from public API endpoints, removing the heavy browser overhead and making it extremely fast and robust to DOM changes (see the conceptual sketch below). Currently, Scrapera supports the following crawlers:

  • Images
  • Text
  • Audio
  • Videos
  • Miscellaneous

The main aim of this package is to cluster common scraping tasks so that ML researchers and engineers can focus on their models rather than worrying about data collection.
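To illustrate the "no Chromedriver" claim above, the following is a conceptual sketch of scraping a public JSON endpoint directly and asynchronously. It is not Scrapera's internal code; aiohttp is used here only for illustration and the endpoint URL is a placeholder.

import asyncio
import aiohttp  # assumed for illustration; not necessarily a Scrapera dependency

async def fetch_endpoint(url):
    # Hit the public JSON endpoint directly -- no browser, no DOM parsing
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            response.raise_for_status()
            return await response.json()

# 'https://example.com/api/items' is a placeholder endpoint
data = asyncio.run(fetch_endpoint('https://example.com/api/items'))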

DISCLAIMER: The owner and contributors do not take any responsibility for misuse of data obtained through Scrapera. Contact the owner if copyright terms are violated due to any module provided by Scrapera.

Getting Started

Prerequisites

Prerequisites can be installed separately through the requirements.txt file as shown below

    pip install -r requirements.txt

    Installation

    Scrapera is built with Python 3 and can be pip installed directly

    pip install scrapera

Alternatively, if you wish to install the latest version directly from GitHub, run

    pip install git+https://github.com/DarshanDeshpande/Scrapera.git

    Usage

To use any sub-module, you just need to import it, instantiate the scraper, and execute its scrape call

from scrapera.video.vimeo import VimeoScraper

# Instantiate the scraper and download the given Vimeo video at 540p quality
scraper = VimeoScraper()
scraper.scrape('https://vimeo.com/191955190', '540p')

For more examples, please refer to the individual test folders in the respective modules
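Building on the example above, the same documented call can be wrapped in simple error handling when scraping several videos. This is a minimal sketch, not part of the official docs; the URL list is a placeholder and the '540p' quality string is taken from the example above.

from scrapera.video.vimeo import VimeoScraper

# Placeholder URLs; replace with the videos you actually want to download
urls = ['https://vimeo.com/191955190']

scraper = VimeoScraper()
for url in urls:
    try:
        # Same call as in the example above: URL followed by the desired quality
        scraper.scrape(url, '540p')
    except Exception as exc:
        # Scrapera talks to public endpoints directly, so network errors or
        # endpoint changes surface here rather than as browser failures
        print(f'Could not scrape {url}: {exc}')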

    Contributing

Scrapera welcomes any and all contributions and scraper requests. Please raise an issue if a scraper fails at any point. Feel free to fork the repository and add your own scrapers to help the community!
For more guidelines, refer to CONTRIBUTING

    License

    Distributed under the MIT License. See LICENSE for more information.

    Sponsors

    Logo

    Contact

    Feel free to reach out for any issues or requests related to Scrapera

    Darshan Deshpande (Owner) - Email | LinkedIn

    Acknowledgements
