Crawl the information of a given keyword on Google search engine

Overview

GoogleSpider

GoogleSpider crawls Google search results for a given keyword.

Config

DataBase

Data is currently stored in MongoDB. The database configuration is on lines 15-19 of the setting.py file and can be changed to match your environment.

# MONGODB
MONGO_IP = "localhost"
MONGO_PORT = 27017
MONGO_DB = "Google_spider"
MONGO_USER_NAME = ""
MONGO_USER_PASS = ""
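As a minimal sketch of how these values could be combined, the snippet below builds a standard MongoDB connection URI from them. The helper `build_mongo_uri` is hypothetical and not part of GoogleSpider; how the project actually opens its connection is not shown in this README.

```python
from urllib.parse import quote_plus

# Values copied from the settings shown above.
MONGO_IP = "localhost"
MONGO_PORT = 27017
MONGO_DB = "Google_spider"
MONGO_USER_NAME = ""
MONGO_USER_PASS = ""


def build_mongo_uri(ip, port, db, user="", password=""):
    """Return a MongoDB connection URI (hypothetical helper, not project code)."""
    credentials = ""
    if user:
        # Percent-encode credentials so characters such as '@' or ':' stay valid.
        credentials = f"{quote_plus(user)}:{quote_plus(password)}@"
    return f"mongodb://{credentials}{ip}:{port}/{db}"


uri = build_mongo_uri(MONGO_IP, MONGO_PORT, MONGO_DB, MONGO_USER_NAME, MONGO_USER_PASS)
print(uri)  # mongodb://localhost:27017/Google_spider
```

A URI of this form can be passed to a MongoDB client library such as pymongo. When the user name is empty, the credentials part is left out entirely.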

Log

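import os

LOG_NAME = os.path.basename(os.getcwd())  # log name defaults to the working directory name
LOG_PATH = "log/%s.log" % LOG_NAME  # log file path
LOG_LEVEL = "DEBUG"  # log level
LOG_COLOR = True  # colorize log output
LOG_IS_WRITE_TO_CONSOLE = True  # print logs to the console
LOG_IS_WRITE_TO_FILE = True  # write logs to a file
LOG_MODE = "w"  # file open mode
LOG_MAX_BYTES = 10 * 1024 * 1024  # maximum bytes per log file
LOG_BACKUP_COUNT = 20  # number of rotated log files to keep
LOG_ENCODING = "utf8"  # log file encoding
OTHERS_LOG_LEVAL = "ERROR"  # log level for other libraries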
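To illustrate what settings like these typically control, here is a rough sketch that maps them onto Python's standard `logging` module. The function `build_logger` and its parameter names are assumptions made for illustration; the project's own logging setup may work differently, and `LOG_COLOR` and `LOG_MODE` are not modeled here.

```python
import logging
import os
import sys
from logging.handlers import RotatingFileHandler


def build_logger(name, path, level="DEBUG", to_console=True, to_file=True,
                 max_bytes=10 * 1024 * 1024, backup_count=20, encoding="utf8"):
    """Create a logger from values shaped like the LOG_* settings (hypothetical helper)."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.handlers.clear()  # avoid duplicate handlers when called more than once
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")

    if to_console:
        console = logging.StreamHandler(sys.stdout)
        console.setFormatter(formatter)
        logger.addHandler(console)

    if to_file:
        # Create the log directory (for example "log/") if it does not exist yet.
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        # Rotate once the file reaches max_bytes, keeping backup_count old files.
        file_handler = RotatingFileHandler(
            path, maxBytes=max_bytes, backupCount=backup_count, encoding=encoding
        )
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)

    return logger
```

For example, `build_logger("GoogleSpider", "log/GoogleSpider.log")` would log to both the console and a rotating file under `log/`.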

Spider

  • Download interval

    • SPIDER_SLEEP_TIME = [0, 1]
  • Maximum number of retries (100 by default)

    • SPIDER_MAX_RETRY_TIMES = 100

      Note

      If an illegal interface is encountered during crawling, a 'user agent -- illegal interface' exception is thrown, and the crawl task retries until the data is crawled successfully or the retry count exceeds 100.
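The retry behaviour described in the note can be sketched roughly as follows. This is not the project's implementation: `fetch_with_retry` is a hypothetical helper, and reading `SPIDER_SLEEP_TIME = [0, 1]` as a random delay range in seconds is an assumption.

```python
import random
import time

SPIDER_SLEEP_TIME = [0, 1]      # assumed: random delay range, in seconds
SPIDER_MAX_RETRY_TIMES = 100


def fetch_with_retry(fetch, max_retries=SPIDER_MAX_RETRY_TIMES,
                     sleep_range=SPIDER_SLEEP_TIME, sleep=time.sleep):
    """Call fetch() until it succeeds or max_retries retries are used up."""
    last_error = None
    for attempt in range(max_retries + 1):  # first attempt plus the retries
        try:
            return fetch()
        except Exception as exc:  # e.g. an 'illegal interface' error
            last_error = exc
            if attempt < max_retries:
                # Wait a random interval before the next attempt.
                sleep(random.uniform(*sleep_range))
    raise RuntimeError(f"gave up after {max_retries} retries") from last_error
```

Passing `sleep` as a parameter keeps the sketch easy to test without real delays.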

Data structure

| key | value type | example |
| --- | --- | --- |
| title | str | "Donald Trump - Wikipedia" |
| keyword | str | "Trump" |
| url | str | "https://en.wikipedia.org/wiki/Donald_Trump" |
| text | str | "Donald Trump - Wikipedia 1 hour ago · Donald John Trump (born June 14, 1946) is an American politician, media personality, and businessman who served as the 45th president of the United States ... Vice President: Mike Pence In office January 20, 2017 – January 20, 2021: In office; January 20, 2017 – January 20, 2021 Occupation: Politician; businessman; television presenter Parents: Fred Trump; Mary Anne MacLeod" |
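One way to picture a stored record is as a small typed structure with these four string fields. The `SearchResult` class below is a hypothetical illustration, not a type defined by the project, and the `text` value is shortened for readability.

```python
from dataclasses import asdict, dataclass


@dataclass
class SearchResult:
    """One crawled search result (hypothetical type mirroring the fields above)."""
    title: str
    keyword: str
    url: str
    text: str


record = SearchResult(
    title="Donald Trump - Wikipedia",
    keyword="Trump",
    url="https://en.wikipedia.org/wiki/Donald_Trump",
    text="Donald Trump - Wikipedia 1 hour ago · Donald John Trump (born June 14, 1946) ...",
)

# A plain dict like this is the shape a MongoDB document would take.
document = asdict(record)
print(sorted(document))  # ['keyword', 'text', 'title', 'url']
```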

Quick start

Crawl three pages of results for the keyword 'Trump':

from spiders.google_curl import GoogleCurl

spider = GoogleCurl('Trump', 3)
spider.start()

  • The first parameter is the search keyword, and the second parameter is the number of pages to crawl.