A Telegram crawler to search groups and channels automatically and collect any type of data from them.

Overview

Introduction

This is a crawler I wrote in Python some months ago, using the Telethon API. The tool was not originally intended to be publicly available, for a number of reasons, but I eventually decided to distribute it "as is". Any contribution to the project is more than welcome :)

Installation

Python 3.8.2 and Telethon 1.21.1 are required (along with other common packages; just read the imports). I don't guarantee that it works with newer versions of Telethon.

To install Telethon, just read its documentation. To install this repository, just git clone it and run python3.8 scraper.py, but only after configuring the script properly (see the next section).

Configuration

To use this tool you first have to obtain an API ID and an API HASH from Telegram: you can do this by following this page. Once done, the ID and the HASH can be inserted into the code and the script can be launched. The first time it runs, it will ask you to insert your telephone number.
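As a minimal sketch of what that configuration looks like (the credential values below are placeholders, and the session name "scraper" is illustrative, not necessarily the one used in scraper.py), a Telethon client is initialized like this:

```python
# Minimal Telethon setup sketch: the credential values are placeholders,
# and the session name "scraper" is illustrative.
from telethon.sync import TelegramClient

api_id = 1234567                                # API ID from my.telegram.org
api_hash = "0123456789abcdef0123456789abcdef"   # API HASH from the same page

client = TelegramClient("scraper", api_id, api_hash)
client.start()  # on the first run, this prompts for the telephone number
```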

Usage

In the code, there are two methods to initialize the crawler: init_empty() and init(). The former is used the very first time the script is launched, while the latter is needed only in specific situations (read the code for details). Once the crawler has been launched with init_empty() and has terminated, it will have processed all the groups/channels the account is already a member of, collecting all the links shared in the chats (a rough sketch of this pass follows the list below) along with a number of other data, such as:

  1. Name of the group/channel
  2. Username
  3. List of members (just for groups)
  4. List of the last n messages
  5. Other metadata...
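
At its core, that collection pass boils down to something like the following sketch; it assumes the `client` from the configuration sketch above, and the regex, the message limit and the fields kept are illustrative simplifications, not the exact logic of scraper.py:

```python
# Simplified sketch of the collection pass; assumes an initialized `client`.
# The regex, the message limit and the fields kept are illustrative.
import re

TME_LINK = re.compile(r"(?:https?://)?t\.me/\S+")

groups, links = [], []
for dialog in client.iter_dialogs():
    if not (dialog.is_group or dialog.is_channel):
        continue
    groups.append({
        "name": dialog.name,                                   # 1. name
        "username": getattr(dialog.entity, "username", None),  # 2. username
    })
    for message in client.iter_messages(dialog, limit=1000):   # 4. last n messages
        if message.message:
            links.extend(TME_LINK.findall(message.message))
```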

This information is saved in a pickle file called groups. The other output files are to_be_processed and edges. The former is a list of links that will be processed in the next iteration (see later); the latter is a list of tuples of the form (group id, [group id list]), where the first entry represents the so-called destination vertex and the list represents the origin vertices. This uncommon data structure is in fact the edge list of the search graph produced by the crawler, which is useful for data-mining purposes: for instance, I used it to perform link prediction between groups/channels, exploiting both the graph structure and the messages. It is probably not the best data structure, since you have to reverse it later on if you actually want to use it for other tasks, but it is faster to update than other solutions.
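
Loading these outputs back and reversing the edge list into conventional (origin, destination) pairs can be done along these lines; the file names match the ones above, while the reversal itself is just an illustration, not part of the published code:

```python
# Load the crawler outputs; the reversal below turns the edge list into
# conventional (origin, destination) pairs and is just an illustration.
import pickle

with open("groups", "rb") as f:
    groups = pickle.load(f)
with open("edges", "rb") as f:
    edges = pickle.load(f)  # list of (destination_id, [origin_id, ...])

edge_pairs = [(origin, destination)
              for destination, origins in edges
              for origin in origins]
```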

Once the initialization is complete, you can comment out init_empty() and uncomment start() in main() to process the new links collected before. This will generate three new files: groups2, to_be_processed2 and edges2. Now you have to merge the old files with the new ones (I actually have a script that does this, but it is customized to my environment, so I will publish it as soon as I have time to generalize it). Pay attention to the to_be_processed file, because you don't want to process it entirely in each run: you need to split it, since processing it all at once would take too long. Indeed, there are limitations on processing a lot of groups, so you want to play smarter... next section.
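
Until that merge script is published, the sketch below shows one plausible way to merge the outputs and split to_be_processed into manageable chunks; the chunk size and the de-duplication strategy are just examples, not the script I use:

```python
# One plausible merge-and-split step; the chunk size and the
# de-duplication strategy are just examples.
import pickle

def load(name):
    with open(name, "rb") as f:
        return pickle.load(f)

def dump(obj, name):
    with open(name, "wb") as f:
        pickle.dump(obj, f)

dump(load("groups") + load("groups2"), "groups")  # assumes list-like contents
dump(load("edges") + load("edges2"), "edges")

# De-duplicate while preserving order, then keep only a manageable
# slice for the next run (see the Limitations section for why).
merged = list(dict.fromkeys(load("to_be_processed") + load("to_be_processed2")))
CHUNK = 25
dump(merged[:CHUNK], "to_be_processed")
dump(merged[CHUNK:], "to_be_processed_backlog")
```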

Limitations

Telegram doesn't want you to play with users' data, so for each account, if I recall correctly, you can join 25 groups per hour before Telegram stops you. The script handles this, so if your usage is not too intensive it will not make your task unfeasible. But if you need hundreds or thousands of groups, well, you have to parallelize the script. I won't publish the code for doing this, but it is not that difficult to put the idea into practice.
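
For reference, Telethon signals this limit with a FloodWaitError, so a join loop that respects it looks roughly like the sketch below (the target list and the sleep-and-retry policy are placeholders):

```python
# Illustrative join loop that backs off on Telegram's rate limit;
# the target list and the retry policy are placeholders.
import time
from telethon.errors import FloodWaitError
from telethon.tl.functions.channels import JoinChannelRequest

for username in ["some_group", "some_channel"]:  # hypothetical targets
    try:
        client(JoinChannelRequest(username))
    except FloodWaitError as e:
        time.sleep(e.seconds)  # Telegram says how long to wait
        client(JoinChannelRequest(username))
```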

Owner
Computer Science student at "La Sapienza"
Discord webhook spammer with proxy support and proxy scraper

Discord webhook spammer with proxy support and proxy scraper

3 Feb 27, 2022
Linkedin webscraping - Linkedin web scraping with python

linkedin_webscraping This is the first step of a full project called "LinkedIn J

Pedro Dib 4 Apr 24, 2022
TarkovScrappy - A nifty little bot that lets you know if a queried item might be required for a quest at some point in the land of Tarkov!

TarkovScrappy A nifty little bot that lets you know if a queried item might be required for a quest at some point in the land of Tarkov! Hideout items

Joshua Smeda 2 Apr 11, 2022
This was supposed to be a web scraping project, but somehow I've turned it into a spamming project

Introduction This was supposed to be a web scraping project, but somehow I've turned it into a spamming project.

Boss Perry (Pez) 1 Jan 23, 2022
⏬ Dumb downloader that scrapes the web

You-Get NOTICE: Read this if you are looking for the conventional "Issues" tab. You-Get is a tiny command-line utility to download media contents (vid

Mort Yao 46.4k Jan 03, 2023
Haphazard scripts for scraping bitcoin/bitcoin data from GitHub

This is a quick-and-dirty tool used to scrape bitcoin/bitcoin pull request and commentary data. Each output/pr number folder contains comments.json:

James O'Beirne 8 Oct 12, 2022
Simply scrape / download all the media from a fansly account.

Simply scrape / download all the media from a fansly account. Providing updates as long as it's continuously gaining popularity, so hit the ⭐ button!

Mika C. 334 Jan 01, 2023
OSTA web scraper, for checking the status of school buses in Ottawa

OSTA-La-Vista OSTA web scraper, for checking the status of school buses in Ottawa. Getting Started Using a Raspberry Pi, download Python 3, and option

1 Jan 28, 2022
Automatically download and crop key information from the arxiv daily paper.

Arxiv daily overview. Features: filter the latest arXiv papers each day by keyword, automatically fetch the abstracts, and automatically crop the tables and figures from the text. 1 Test environment: Ubuntu 16+, Python 3.7, torch 1.9, Colab GPU. 2 Usage demo: first download the weights from baiduyun (extraction code: il87) and place them in code/Pars

HeoLis 20 Jul 30, 2022
PS5 bot to find a console in France for Christmas 🎄🎅🏻 NOT FOR SCALPERS

A PS5 for Christmas: Python + Chrome --headless = a PS5 for Christmas. MacOS: install Chrome, tweak the .yaml for the list of sites to scrape and the criteria

Olivier Giniaux 3 Feb 13, 2022
This scraper scrapes the email IDs of faculty members from a given link/page and stores them in a CSV file

This scraper scrapes the email IDs of faculty members from a given link/page and stores them in a CSV file

Devansh Singh 1 Feb 10, 2022
A scalable frontier for web crawlers

Frontera Overview: Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, allowing you to build a large sc

Scrapinghub 1.2k Jan 02, 2023
Python framework to scrape Pastebin pastes and analyze them

pastepwn - Paste-Scraping Python Framework. Pastebin is a very helpful tool to store or rather share ASCII-encoded data online. In the world of OSINT,

Rico 105 Dec 29, 2022
Crawler in Python 3.7, 3.8, 3.9, PyPy3

Description: Python crawler written in Python 3. (Supports major Python releases: Python 3.6, Python 3.7 and Python 3.8.) Installation and use: set up a VirtualEn

Vinit Kumar 2 Mar 12, 2022
This is a Python script; it works by using selenium to locate the relevant elements and then triggering click events to automate the browser.

This is a Python script; it works by using selenium to locate the relevant elements and then triggering click events to automate the browser.

N0el4kLs 5 Nov 19, 2021
A scheduled HITsz epidemic-reporting script based on GitHub Actions, ready to use out of the box

HITsz Daily Report: a scheduled automatic reporting script for the "HITsz epidemic system" portal, based on GitHub Actions and ready to use out of the box. Thanks to @JellyBeanXiewh for the original script and idea. Thanks to @bugstop for refactoring the script and adding Easy Connect on-campus proxy access.

Ter 56 Nov 27, 2022
TikTok Username Swapper/Claimer/etc

TikTok-Turbo TikTok Username Swapper/Claimer/etc I wanted to create it as fast as possible but I eventually gave up and recoded it many many many many

Kevin 12 Dec 19, 2022
Telegram group scraper tool

Telegram Group Scrapper

Wahyusaputra 2 Jan 11, 2022
Library to scrape and clean web pages to create massive datasets.

lazynlp A straightforward library that allows you to crawl, clean up, and deduplicate webpages to create massive monolingual datasets. Using this libr

Chip Huyen 2.1k Jan 06, 2023
Google Maps crawler using Selenium

Google Maps Crawler using Selenium Built as part of the Antifragile Dev Project Selenium crawler that browses Google Maps as a regular user and stores

Guilherme Latrova 46 Dec 16, 2022