Visual scraping for Scrapy

Related tags

Web Crawling, portia
Overview

Portia

Portia is a tool that allows you to visually scrape websites without any programming knowledge required. With Portia you can annotate a web page to identify the data you wish to extract, and Portia will understand based on these annotations how to scrape data from similar pages.

Running Portia

The easiest way to run Portia is with Docker, using the official Portia image:

docker run -v ~/portia_projects:/app/data/projects:rw -p 9001:9001 scrapinghub/portia

You can also set up a local instance with Docker Compose by cloning this repo and running the following from the root of the folder:

docker-compose up

For more detailed instructions, and alternatives to using Docker, see the Installation docs.
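Once Portia is running and you have annotated a site, the resulting spider can also be run from the command line with portiacrawl, writing the scraped items to a feed file. A minimal sketch, where the project path and spider name are placeholders:

portiacrawl ~/portia_projects/PROJECT_NAME SPIDER_NAME -o output.json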

Documentation

Documentation can be found on Read the Docs. Source files can be found in the docs directory.

Comments
  • unable to deploy with scrapyd-deploy

    Hello,

Could you please help me figure out what I'm doing wrong? Here are the steps: I followed the Portia install manual - all OK. I created a new project, entered a URL, tagged an item - all OK. I clicked "continue browsing", browsed through the site, and items were being extracted as expected - all OK.

Next I wanted to deploy my spider.

    1st try: I ran, as the docs specified, scrapyd-deploy your_scrapyd_target -p project_name - got an error - scrapyd wasn't installed. Fix: pip install scrapyd.

    2nd try: I launched the scrapyd server (a step also missing from the docs) and accessed http://localhost:6800/ - all OK. After a brief reading of the scrapyd docs I found out I had to edit the file scrapy.cfg in my project (slyd/data/projects/new_project/scrapy.cfg) and add the following:

        [deploy:local]
        url = http://localhost:6800/
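    For reference, a complete scrapy.cfg for a scrapyd deploy target typically looks like the sketch below; the settings module and project name here are assumptions, not taken from the original report:

        [settings]
        default = slybot.settings

        [deploy:local]
        url = http://localhost:6800/
        project = new_project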

I went back to the console and checked that all is OK:

        $:> scrapyd-deploy -l
        local    http://localhost:6800/

    $:> scrapyd-deploy -L local
        default

Seemed OK, so I gave it another try:

        $ scrapyd-deploy local -p default
        Packing version 1418722113
        Deploying to project "default" in http://localhost:6800/addversion.json
        Server response (200): {"status": "error", "message": "IOError: [Errno 21] Is a directory: '/Users/Mihai/Work/www/4ideas/MarketWatcher/portia_tryout/portia/slyd/data/projects/new_project'"}

What am I missing?

    opened by MihaiCraciun 40
  • ImportError: No module named jsonschema.exceptions

    After correct installation, when I try to run

    twistd -n slyd

    I get

Traceback (most recent call last):
      File "/usr/local/bin/twistd", line 14, in <module>
        run()
      File "/usr/local/lib/python2.7/dist-packages/twisted/scripts/twistd.py", line 27, in run
        app.run(runApp, ServerOptions)
      File "/usr/local/lib/python2.7/dist-packages/twisted/application/app.py", line 642, in run
        runApp(config)
      File "/usr/local/lib/python2.7/dist-packages/twisted/scripts/twistd.py", line 23, in runApp
        _SomeApplicationRunner(config).run()
      File "/usr/local/lib/python2.7/dist-packages/twisted/application/app.py", line 376, in run
        self.application = self.createOrGetApplication()
      File "/usr/local/lib/python2.7/dist-packages/twisted/application/app.py", line 436, in createOrGetApplication
        ser = plg.makeService(self.config.subOptions)
      File "/home/euphorbium/Projects/mtg/scraper/portia-master/slyd/slyd/tap.py", line 55, in makeService
        root = create_root(config)
      File "/home/euphorbium/Projects/mtg/scraper/portia-master/slyd/slyd/tap.py", line 27, in create_root
        from slyd.crawlerspec import (CrawlerSpecManager,
      File "/home/euphorbium/Projects/mtg/scraper/portia-master/slyd/slyd/crawlerspec.py", line 12, in <module>
        from jsonschema.exceptions import ValidationError
    ImportError: No module named jsonschema.exceptions
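    The final ImportError points at the jsonschema dependency. Assuming it is simply missing or too old (the jsonschema.exceptions module only exists in newer releases), installing or upgrading it is the obvious first step:

        pip install --upgrade jsonschema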

    opened by Euphorbium 28
  • How to install portia2.0 correctly by docker?

When I installed Portia with Docker I had a problem. When I run ember build -w, the result is:

Could not start watchman; falling back to NodeWatcher for file system events.
    Visit http://ember-cli.com/user-guide/#watchman for more info.

    components/browser-iframe.js: line 175, col 39, 'TreeMirror' is not defined.
    components/browser-iframe.js: line 212, col 9, '$' is not defined.
    components/browser-iframe.js: line 227, col 23, '$' is not defined.
    components/browser-iframe.js: line 281, col 20, '$' is not defined.
    4 errors
    components/inspector-panel.js: line 92, col 16, '$' is not defined.
    components/inspector-panel.js: line 92, col 33, '$' is not defined.
    components/inspector-panel.js: line 102, col 16, '$' is not defined.
    3 errors
    components/save-status.js: line 48, col 16, 'moment' is not defined.
    1 error
    controllers/projects/project/conflicts/conflict.js: line 105, col 13, '$' is not defined.
    1 error
    controllers/projects/project/spider.js: line 10, col 25, 'URI' is not defined.
    1 error
    routes/projects/project/conflicts.js: line 5, col 16, '$' is not defined.
    1 error
    services/web-socket.js: line 10, col 15, 'URI' is not defined.
    services/web-socket.js: line 15, col 12, 'URI' is not defined.
    2 errors
    utils/browser-features.js: line 14, col 13, 'Modernizr' is not defined.
    1 error
    utils/tree-mirror-delegate.js: line 30, col 20, '$' is not defined.
    utils/tree-mirror-delegate.js: line 66, col 17, '$' is not defined.
    2 errors
    utils/utils.js: line 15, col 20, 'URI' is not defined.
    utils/utils.js: line 50, col 9, 'Raven' is not defined.
    utils/utils.js: line 56, col 9, 'Raven' is not defined.
    3 errors
    ===== 10 JSHint Errors

    Build successful - 1641ms.

    Slowest Trees                                 | Total
    ----------------------------------------------+---------------------
    broccoli-persistent-filter:Babel > [Babel:... | 275ms
    JSHint app                                    | 149ms
    SourceMapConcat: Concat: Vendor /assets/ve... | 89ms
    broccoli-persistent-filter:Babel              | 85ms
    broccoli-persistent-filter:Babel              | 82ms

    Slowest Trees (cumulative)                    | Total (avg)
    ----------------------------------------------+---------------------
    broccoli-persistent-filter:Babel > [Ba... (2) | 294ms (147 ms)
    broccoli-persistent-filter:Babel (5)          | 257ms (51 ms)
    JSHint app (1)                                | 149ms
    SourceMapConcat: Concat: Vendor /asset... (1) | 89ms
    broccoli-persistent-filter:TemplateCom... (2) | 88ms (44 ms)

I want to know what is wrong. By the way, for this step I followed some issues; if I just follow the documentation to install Portia with Docker (docker build -t portia .), then when I run Portia, http://localhost:9001/static/index.html shows me nothing. So, how do I install Portia 2.0 correctly with Docker? Could someone help me?

    opened by ChoungJX 19
  • Support for JS

Add support for JS-based sites. It would be nice to have UI support for configuring this instead of having to do it manually at the Scrapy level.

    Perhaps we can allow users to enable or disable sending requests via Splash.

    feature 
    opened by shaneaevans 19
  • Crawls duplicate/identical URLs

If there are duplicate URLs on a page, they will be crawled and exported as many times as the link appears.

It is only in very unusual circumstances that you would need to crawl the same URL more than once.

    I have two proposals:

1. Have a checkbox so you can tick "Avoid visiting duplicate links" (see the sketch after this list).
    2. Alternatively, add filtering options to the link crawler so it can filter by HTML markup too - that way, only links with certain classes would be followed.
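    In the meantime, duplicates can be dropped at the Scrapy level with a small spider middleware. This is a minimal sketch - the class name and wiring are hypothetical, not part of Portia:

        # dedupe.py - hypothetical spider middleware that drops already-seen links
        from scrapy import Request

        class DedupeLinksMiddleware:
            def __init__(self):
                self.seen = set()

            def process_spider_output(self, response, result, spider):
                for request_or_item in result:
                    if isinstance(request_or_item, Request):
                        if request_or_item.url in self.seen:
                            continue  # this link was already scheduled once
                        self.seen.add(request_or_item.url)
                    yield request_or_item

    It would be enabled via SPIDER_MIDDLEWARES in settings.py; note that Scrapy's default request dupefilter already avoids re-crawling a scheduled URL, so this mainly shows where per-link filtering could hook in.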
    opened by jpswade 18
  • Question: how to store data in json

Hi, recently I installed Portia and it's quite good. But since I'm not very familiar with it, I have run into a problem. In my template I used variants to scrape the website, and it worked well when I clicked "continue browsing" or navigated to a similar page. However, when I ran the Portia spider from the command line, it successfully scraped data yet failed to store all of it in the JSON file; only the data from the page I annotated was stored. I suspect the cause is the warning "WARNING: Dropped: Duplicate product scraped at http://bbs.chinadaily.com.cn/forum-83-206.html, first one was scraped at http://bbs.chinadaily.com.cn/forum-83-2.html" (it appeared while the spider was running), but I don't know how to solve the problem. I hope someone can help me as soon as possible. Thanks very much!
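    For reference, a Portia spider is normally run with an explicit output feed, along these lines (the project path and spider name are placeholders):

        portiacrawl ~/portia_projects/PROJECT_NAME SPIDER_NAME -o items.json

    The "Dropped: Duplicate product" warning appears to come from slybot's duplicate-item filtering rather than from the JSON export itself, so the dropped items never reach the output file.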

    opened by SophiaCY 17
  • Next page navigation

Does the new Portia scrape next-page data? I built a spider in Portia with the next-page option recorded, running in Docker, but when I deployed it to scrapyd no data was scraped. I got output like:

2016-03-29 15:57:55 [scrapy] INFO: Spider opened
    2016-03-29 15:57:55 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-03-29 15:57:55 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
    2016-03-29 15:58:32 [scrapy] DEBUG: Retrying <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (failed 1 times): 504 Gateway Time-out
    2016-03-29 15:58:55 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2016-03-29 15:59:07 [scrapy] DEBUG: Retrying <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (failed 2 times): 504 Gateway Time-out
    2016-03-29 15:59:38 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
    2016-03-29 15:59:39 [scrapy] DEBUG: Filtered offsite request to 'www.successfactors.com': <GET http://www.successfactors.com/>
    2016-03-29 15:59:51 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
    2016-03-29 15:59:51 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
    2016-03-29 15:59:52 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
    2016-03-29 15:59:52 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
    2016-03-29 15:59:52 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
    2016-03-29 15:59:52 [scrapy] DEBUG: Filtered offsite request to 'www.cpchem.com': <GET http://www.cpchem.com/>
    2016-03-29 15:59:52 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
    2016-03-29 15:59:55 [scrapy] INFO: Crawled 7 pages (at 7 pages/min), scraped 0 items (at 0 items/min)
    2016-03-29 15:59:57 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
    2016-03-29 15:59:58 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
    2016-03-29 15:59:59 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
    2016-03-29 16:00:00 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
    2016-03-29 16:00:00 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
    2016-03-29 16:00:03 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
    2016-03-29 16:00:04 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)
    2016-03-29 16:00:04 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)

scrapyd and splash are both running locally - does anyone have an idea about this issue?

    thanks in advance
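    For context, the 504 Gateway Time-out retries above come from the Splash rendering service. The scrapy-splash wiring that a Splash-enabled project relies on looks roughly like the following sketch (example values, not the reporter's actual configuration):

        SPLASH_URL = 'http://localhost:8050'
        DOWNLOADER_MIDDLEWARES = {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
        }
        DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

    The "Filtered offsite request" lines are a separate symptom: they mean the linked domains are outside the spider's allowed domains.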

    opened by agramesh 16
  • Running Portia with Docker

On a Windows machine I installed Docker, and the command "docker build -t portia ." also ran successfully. After that I received an error while entering this section:

docker run -i -t --rm -v <PROJECT_FOLDER>/data:/app/slyd/data:rw \
        -p 9001:9001 \
        --name portia \
        portia

How do I configure this command, and what is <PROJECT_FOLDER>? Could you please help me test this?
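    <PROJECT_FOLDER> is simply a directory on the host machine where Portia stores project data. A filled-in version of the command might look like this (the host path is an assumption - adjust it to your machine, and note that on Windows the path may need a form such as //c/Users/you/portia_projects):

        docker run -i -t --rm -v ~/portia_projects/data:/app/slyd/data:rw -p 9001:9001 --name portia portia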

    opened by agramesh 15
  • Can't visit http://localhost:8000/

After installing Vagrant and VirtualBox on my Win7 amd64 PC, I successfully launched the Ubuntu virtual machine. (screenshot)

However, when I visited http://localhost:8000/, I got this. (screenshot)

What am I missing?

    opened by fakegit 14
  • Try to resolve the "No module named slyd.tap" error?

When I start bin/slyd I receive the following error on CentOS:

slyd]$ bin/slyd
    2015-08-27 17:18:29+0530 [-] Log opened.
    2015-08-27 17:18:29.318689 [-] Splash version: 1.6
    2015-08-27 17:18:29.319315 [-] Qt 4.8.5, PyQt 4.10.1, WebKit 537.21, sip 4.14.6, Twisted 15.2.1, Lua 5.1
    2015-08-27 17:18:29.319412 [-] Open files limit: 1024
    2015-08-27 17:18:29.319473 [-] Open files limit increased from 1024 to 4096
    2015-08-27 17:18:29.534741 [-] Xvfb is started: ['Xvfb', ':2431', '-screen', '0', '1024x768x24']
    Xlib: extension "RANDR" missing on display ":2431".
    2015-08-27 17:18:29.637987 [-] Traceback (most recent call last):
    2015-08-27 17:18:29.638202 [-] File "bin/slyd", line 41, in <module>
    2015-08-27 17:18:29.638312 [-]   splash.server.main()
    2015-08-27 17:18:29.638408 [-] File "/usr/lib/python2.7/site-packages/splash/server.py", line 373, in main
    2015-08-27 17:18:29.638572 [-]   max_timeout=opts.max_timeout
    2015-08-27 17:18:29.638665 [-] File "/usr/lib/python2.7/site-packages/splash/server.py", line 279, in default_splash_server
    2015-08-27 17:18:29.638801 [-]   max_timeout=max_timeout
    2015-08-27 17:18:29.638887 [-] File "bin/slyd", line 15, in make_server
    2015-08-27 17:18:29.638981 [-]   from slyd.tap import makeService
    2015-08-27 17:18:29.639099 [-] ImportError: No module named slyd.tap

How can I resolve it?
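    Assuming the failure is simply that the slyd package is not importable from where bin/slyd runs, putting the slyd directory on the Python path is a reasonable first experiment (the path below is a placeholder):

        export PYTHONPATH=/path/to/portia/slyd:$PYTHONPATH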

    opened by agramesh 14
  • error unexpected error

Hi all,

    I don't know why this happens or how to solve it. I got an error when I clicked "new sample": the GUI for entering a new sample does not appear, but this error appears instead:

    (screenshot: portia error)

    Can anyone help me solve this?

    opened by jemjov 13
  • How to get fields exported in correct order?

I've got this working on Docker, and can run a spider with something along the lines of docker exec -t portialatest portiacrawl "/app/data/projects/nproject" "nproject.com" -o "/mnt/dc.csv". I've got spiders set up for a few sites which all scrape the same data. My problem is that the exported fields are in different orders; I've tried inserting FIELDS_TO_EXPORT and also CSV_EXPORT_FIELDS into settings.py, but they seem to have no effect. I want the fields output in the same order for every site - the names of the fields (annotations) are Ref, Beds, Baths, Price, Location. Could some kind soul point me to the correct solution please? Many thanks.
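    For what it's worth, the standard Scrapy setting for fixing feed column order is FEED_EXPORT_FIELDS (assuming a Scrapy version new enough to support it). A sketch of the settings.py entry, using the field names from the question:

        FEED_EXPORT_FIELDS = ['Ref', 'Beds', 'Baths', 'Price', 'Location']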

    opened by buttonsbond 0
  • fix(sec): upgrade scrapy-splash to 0.8.0

    What happened?

There is 1 security vulnerability found in scrapy-splash 0.7.2.

    What did I do?

Upgraded scrapy-splash from 0.7.2 to 0.8.0 to fix the vulnerability.
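    Assuming the version is pinned in a requirements file, the change reduces to a one-line bump:

        - scrapy-splash==0.7.2
        + scrapy-splash==0.8.0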

    What did you expect to happen?

    Ideally, no insecure libs should be used.

    The specification of the pull request

    PR Specification from OSCS

    opened by chncaption 0
  • Fix portia_server

This PR is abandoned in favor of Gerapy.

    fix #883 #920 #913 #907 #903 #902 #895 #877 #842 #812 #811 #790 #760 #742 ...

    todo

    • [x] fix portia_server
      • [x] bump version from 2.0.8 to 2.1.0
    • [x] fix slybot (at least the build is working, may produce runtime errors)
      • [ ] fix some snapshot tests
    • [x] fix portiaui
    • [x] fix slyd
    • [ ] support frames #494

    related https://github.com/scrapinghub/portia2code/pull/12 https://github.com/scrapy/scrapely/pull/122

    opened by milahu 6
  • Correct typo in docker setup docs

Minor typo in the docs - the name of the environment variable used in the documentation info box (PROJECT_FOLDER) didn't match the actual name used in the example shell commands (PROJECTS_FOLDER).

    opened by mz8i 0
  • Browser version error message

I am running Portia using Docker on Windows. Even though I just installed and updated Chrome today to Version 95.0.4638.54 (Official Build) (64-bit), I get an error saying I need to use an up-to-date browser as soon as I open Portia (localhost:9001).

    opened by fptmark 2
Owner
Scrapinghub
Turn web content into useful data
Demonstration on how to use async python to control multiple playwright browsers for web-scraping

Playwright Browser Pool This example illustrates how it's possible to use a pool of browsers to retrieve page urls in a single asynchronous process. i

Bernardas Ališauskas 8 Oct 27, 2022
A Web Scraping Program.

Web Scraping. AUTHOR: Saurabh G., MTech Information Security, IIT Jammu. If you find this repository useful, I would appreciate it if you Star it and Fork

Saurabh G. 2 Dec 14, 2022
A simple reddit scraper to get memes (only images) from r/ProgrammerHumor.

memey A simple reddit scraper to get memes (only images) from r/ProgrammerHumor. Note Only works if you have firefox installed (yet). Instructions foo

2 Nov 16, 2021
Speed up git downloads from GitHub by 1000x for users in China!

Preface: There are many good projects on GitHub, but connecting to GitHub from within China is very slow, so plugins or other tools are needed every time. This small tool automatically converts an original GitHub address into a proxy address once you enter it, making downloads faster for everyone. Installation: pip install cit. Main features and usage: the main feature "change" converts the target address into

35 Aug 29, 2022
The first public repository that provides free BUBT website scraping API script on Github.

BUBT WEBSITE SCRAPPING SCRIPT I think this is the first public repository that provides free BUBT website scraping API script on github. When I was do

Md Imam Hossain 3 Feb 10, 2022
A training task for web scraping using python multithreading and a real-time-updated list of available proxy servers.

Parallel web scraping The project is a training task for web scraping using python multithreading and a real-time-updated list of available proxy serv

Kushal Shingote 1 Feb 10, 2022
UdemyBot - A Simple Udemy Free Courses Scrapper

UdemyBot - A Simple Udemy Free Courses Scrapper

Gautam Kumar 112 Nov 12, 2022
TikTok Username Swapper/Claimer/etc

TikTok-Turbo TikTok Username Swapper/Claimer/etc I wanted to create it as fast as possible but i eventually gave up and recoded it many many many many

Kevin 12 Dec 19, 2022
🐞 Douban Movie / Douban Book Scrapy

Python3-based Douban Movie/Douban Book Scrapy crawler for cover downloading + data crawling + review entry.

Xingbo Jia 1 Dec 03, 2022
Web Content Retrieval for Humans™

Lassie Lassie is a Python library for retrieving basic content from websites. Usage import lassie lassie.fetch('http://www.youtube.com/watch?v

Mike Helmick 570 Dec 19, 2022
This program scrapes information and images for movies and TV shows.

Media-WebScraper This program scrapes information and images for movies and TV shows. Summary For more information on the program, read the WebScrape_

1 Dec 05, 2021
Console application for downloading images from Reddit in Python

RedditImageScraper Console application for downloading images from Reddit in Python Introduction This short Python script was created for the mass-dow

James 0 Jul 04, 2021
Scrapy-based cyber security news finder

Cyber-Security-News-Scraper Scrapy-based cyber security news finder Goal To keep up to date on the constant barrage of information within the field of

2 Nov 01, 2021
API to parse tibia.com content into python objects.

Tibia.py An API to parse Tibia.com content into object oriented data. No fetching is done by this module, you must provide the html content. Features:

Allan Galarza 25 Oct 31, 2022
News, full-text, and article metadata extraction in Python 3. Advanced docs:

Newspaper3k: Article scraping & curation Inspired by requests for its simplicity and powered by lxml for its speed: "Newspaper is an amazing python li

Lucas Ou-Yang 12.3k Jan 07, 2023
robobrowser - A simple, Pythonic library for browsing the web without a standalone web browser.

RoboBrowser: Your friendly neighborhood web scraper Homepage: http://robobrowser.readthedocs.org/ RoboBrowser is a simple, Pythonic library for browsi

Joshua Carp 3.7k Dec 27, 2022
Kusonime scraper using python3

Features Scrap from url Scrap from recommendation Search by query Todo [+] Search by genre Example # Get download url from kusonime import Scrap

MhankBarBar 2 Jan 28, 2022
A tool to easily scrape youtube data using the Google API

YouTube data scraper To easily scrape any data from the youtube homepage, a youtube channel/user, search results, playlists, and a single video itself

7 Dec 03, 2022
Trending-search leaderboards - Python crawler + regex (re) + BeautifulSoup + XPath

Repository overview: Weibo trending list (parameter wb); Baidu trending list (parameter bd); 360 trending list (parameter 360); CSDN trending API (see below); other trending lists to be added. How to use: register on Vercel, fork this repo, then click deploy at the top right (one-click deployment). Request parameters: your configured Vercel address + api?tit= + parameter (parameter details are in the repository overview

Harry 3 Jul 08, 2022
A python module to parse the Open Graph Protocol

OpenGraph is a module of python for parsing the Open Graph Protocol, you can read more about the specification at http://ogp.me/ Installation $ pip in

Erik Rivera 213 Nov 12, 2022