PaperRobot: a paper crawler that can quickly download numerous papers, facilitating paper studying and management

Overview

PaperRobot

PaperRobot 是一个论文抓取工具,可以快速批量下载大量论文,方便后期进行持续的论文管理与学习。

PaperRobot通过多个接口抓取论文,目前抓取成功率维持在90%以上。通过配置Config文件,可以抓取任意计算机领域相关会议的论文。

example

Installation

  • Download this tool
git clone https://github.com/mo-xiaoxi/PaperRobot.git
  • Install dependencies
sudo pip3 install -r requirements.txt

Python version: Python 3 (>=3.7).

Why build this tool?

  1. 通过这个工具可以构建自己的论文数据库。具体参考:如何建立独属于你自己的论文数据库
  2. 一个方便的论文调研工具: Secpaper. 论文调研必备!
  3. 提取论文的摘要,自动翻译推送整理一些会议的研究简报,可以快速地过一下每个会议论文的内容,感兴趣的再阅读对应的pdf。
  4. 对会议研究热点、作者变化等等进行归类与整理。 如Computer Science Rankings.

Usage

$ python run.py --help
usage: run.py [-h] [-m {d,s}] [-c {ccs,uss,sp,ndss,dsn,raid,imc,asiaccs,acsac,sigcomm}] [-s YEAR_START] [-e YEAR_END] [-b BIBTEX] [-t TITLE] [-u URL] [--all {bibtex,pdf}]

OPTIONS:
  -h, --help            show this help message and exit
  -m {d,s}, --mode {d,s}
                        s:show info, d: download
  -c {ccs,uss,sp,ndss,dsn,raid,imc,asiaccs,acsac,sigcomm}, --conference {ccs,uss,sp,ndss,dsn,raid,imc,asiaccs,acsac,sigcomm}
                        The target conference.
  -s YEAR_START, --year_start YEAR_START
                        The start year of paper.
  -e YEAR_END, --year_end YEAR_END
                        The end year of paper.
  -b BIBTEX, --bibtex BIBTEX
                        Download with bibtex file.
  -t TITLE, --title TITLE
                        Download with Google search.
  -u URL, --url URL     Dowanload with url.
  --all {bibtex,pdf}    Download all bibbex or papers,2001-2022 by default

Example

  • 基于Title下载论文 python run.py -t "A Large-scale Analysis of Email Sender Spoofing Attacks"
  • 基于URL下载论文 python run.py -u "https://www.usenix.org/conference/usenixsecurity21/presentation/shen-kaiwen"
  • 基于bib下载论文 python run.py -b bibtex/example.bib
  • 获取NDSS 2021会议论文 python run.py -c ndss -s 2021 -e 2022
  • 获取NDSS 2001-2021会议论文 python run.py -c ndss -s 2001 -e 2022
  • 获取所有会议的bibtex文件 python run.py --all bibtex
  • 获取所有会议的pdf文件 python run.py --all bibtex

其他说明:

  • PaperRobot通过dblp抓取对应会议的bibtex,以保证通用性,理论上支持任意DBLP上收录的会议。

    通过配置下列数据,可以增加新的会议支持。

    LIB = {
        "ccs": "CCS",
        "uss": "Usenix_Security",
        "sp": "S&P",
        "ndss": "NDSS",
        "dsn": "DSN",
        "raid": "RAID",
        "imc": "IMC",
        "asiaccs": "ASIACCS",
        "acsac": "ACSAC",
        "sigcomm": "SIGCOMM",
    }
  • 多个PDF辅助抓取接口:

    • 通过doi序列号在SCI-HUB抓取论文(zotera适用方法)
    • 论文官方网站抓取论文
    • 通过google搜索抓取论文
    • 通过crossRef网站抓取论文(这个接口效果不是特别好)
  • keep_cookies.py 用于维护某些站点的登陆状态,需要单独运行。

    • 维护登陆状态的原因是某些网站(如dl.acm)需要登陆才能下载pdf。

      用户需要单独配置config中的账号密码,账号密码为学校账号与密码。

    • 若在教育网IP内访问, 则不需要维护Cookie信息,教育网IP直接可以下载PDF。

    • 用户也可以手动维护cookie信息,利用burpsuite等一系列工具导出cookie,写入data/cookie.json文件即可。

TODO

  • 更好的文档说明,中英文文档分开。
  • 修改日志信息到英文版本
  • 多进程+多协程并发处理
  • 代理池构建
  • 使用重试修饰器重写需重试的函数
Owner
moxiaoxi
CTF Player of Tea-Deliverers, Blue-Lotus. Ph.D. Student at Tsinghua University. Research on Protocol Security.
moxiaoxi
Goblyn is a Python tool focused to enumeration and capture of website files metadata.

Goblyn Metadata Enumeration What's Goblyn? Goblyn is a tool focused to enumeration and capture of website files metadata. How it works? Goblyn will se

Gustavo 46 Nov 22, 2022
IGLS - Instagram Like Scraper CLI tool

IGLS - Instagram Like Scraper It's a web scraping command line tool based on python and selenium. Description This is a trial tool for learning purpos

Shreshth Goyal 5 Oct 29, 2021
Newsscraper - A simple Python 3 module to get crypto or news articles and their content from various RSS feeds.

NewsScraper A simple Python 3 module to get crypto or news articles and their content from various RSS feeds. 🔧 Installation Clone the repo locally.

Rokas 3 Jan 02, 2022
Twitter Eye is a Twitter Information Gathering Tool With Twitter Eye

Twitter Eye is a Twitter Information Gathering Tool With Twitter Eye, you can search with various keywords and usernames on Twitter.

Jolanda de Koff 19 Dec 12, 2022
LSpider 一个为被动扫描器定制的前端爬虫

LSpider LSpider - 一个为被动扫描器定制的前端爬虫 什么是LSpider? 一款为被动扫描器而生的前端爬虫~ 由Chrome Headless、LSpider主控、Mysql数据库、RabbitMQ、被动扫描器5部分组合而成。

Knownsec, Inc. 321 Dec 12, 2022
A web service for scanning media hosted by a Matrix media repository

Matrix Content Scanner A web service for scanning media hosted by a Matrix media repository Installation TODO Development In a virtual environment wit

Brendan Abolivier 5 Dec 01, 2022
Semplice scraper realizzato in Python tramite la libreria BeautifulSoup

Semplice scraper realizzato in Python tramite la libreria BeautifulSoup

2 Nov 22, 2021
用python爬取江苏几大高校的就业网站,并提供3种方式通知给用户,分别是通过微信发送、命令行直接输出、windows气泡通知。

crawler_for_university 用python爬取江苏几大高校的就业网站,并提供3种方式通知给用户,分别是通过微信发送、命令行直接输出、windows气泡通知。 环境依赖 wxpy,requests,bs4等库 功能描述 该项目基于python,通过爬虫爬各高校的就业信息网,爬取招聘信

8 Aug 16, 2021
Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.

Pattern Pattern is a web mining module for Python. It has tools for: Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM par

Computational Linguistics Research Group 8.4k Jan 08, 2023
Simple Web scrapper Bot to scrap webpages using Requests, html5lib and Beautifulsoup.

WebScrapperRoBot Simple Web scrapper Bot to scrap webpages using Requests, html5lib and Beautifulsoup. Mark your Star ⭐ ⭐ What is Web Scraping ? Web s

Nuhman Pk 53 Dec 21, 2022
PS5 bot to find a console in france for chrismas 🎄🎅🏻 NOT FOR SCALPERS

Une PS5 pour Noël Python + Chrome --headless = une PS5 pour noël MacOS Installer chrome Tweaker le .yaml pour la listes sites a scrap et les criteres

Olivier Giniaux 3 Feb 13, 2022
A Very simple free proxy list scraper.

Scrappp A Very simple free proxy list scraper, made in python The tool scrape proxy from diffrent sites and api's. Screenshots About the script !!! RE

Joji aka Moncef 12 Oct 27, 2022
Raspi-scraper is a configurable python webscraper that checks raspberry pi stocks from verified sellers

Raspi-scraper is a configurable python webscraper that checks raspberry pi stocks from verified sellers.

Louie Cai 13 Oct 15, 2022
This tool can be used to extract information from any website

WEB-INFO- This tool can be used to extract information from any website Install Termux and run the command --- $ apt-get update $ apt-get upgrade $ pk

1 Oct 24, 2021
Automatically scrapes all menu items from the Taco Bell website

Automatically scrapes all menu items from the Taco Bell website. Returns as PANDAS dataframe.

Sasha 2 Jan 15, 2022
Amazon web scraping using Scrapy Framework

Amazon-web-scraping-using-Scrapy-Framework Scrapy Scrapy is an application framework for crawling web sites and extracting structured data which can b

Sejal Rajput 1 Jan 25, 2022
🤖 Threaded Scraper to get discord servers from disboard.org written in python3

Disboard-Scraper Threaded Scraper to get discord servers from disboard.org written in python3. Setup. One thread / tag If you whant to look for multip

Ѵιcнч 11 Nov 01, 2022
An application that on a given url, crowls a web page and gets all words, sorts and counts them.

Web-Scrapping-1 An application that on a given url, crowls a web page and gets all words, sorts and counts them. Installation Using the package manage

adriano atambo 1 Jan 16, 2022
A web Scraper for CSrankings.com that scrapes University and Faculty list for a particular country

A look into what we're building Demo.mp4 Prerequisites Python 3 Node v16+ Steps to run Create a virtual environment. Activate the virtual environment.

2 Jun 06, 2022
TarkovScrappy - A nifty little bot that lets you know if a queried item might be required for a quest at some point in the land of Tarkov!

TarkovScrappy A nifty little bot that lets you know if a queried item might be required for a quest at some point in the land of Tarkov! Hideout items

Joshua Smeda 2 Apr 11, 2022