Overview

Snowflake Stage Exporter

Snowflake database loading utility with Scrapy integration.
Meant for streaming ingestion of JSON serializable objects into Snowflake stages and tables.

Installation

> pip install git+https://github.com/hermit-crab/snowflake-stage-exporter.git

Basic example

from snowflake_stage_exporter import SnowflakeStageExporter

with SnowflakeStageExporter(
    user='...',
    password='...',
    account='...',
    table_path='MY_DATABASE.PUBLIC.{item_type_name}',
) as exporter:
    exporter.export_item({'name': 'Jack', 'salary': 100}, item_type_name='employee')
    exporter.export_item({'name': 'Sal', 'salary': 90, 'extra_info': {'age': 20}}, item_type_name='employee')
    exporter.export_item({'title': 'Steel Grill', 'price': 5.5}, item_type_name='product')
    exporter.finish_export()  # flushes all stage buffers, creates tables and populates them with data inside stages

After you call finish_export(), two tables will be created: EMPLOYEE (2 rows, 3 columns) and PRODUCT (1 row, 2 columns), located inside the database MY_DATABASE and the schema PUBLIC (Snowflake's default schema).
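As a quick check of the outcome described above, the two tables can be queried with plain snowflake-connector-python. This is only an illustration of the result, not part of the library's API.

import snowflake.connector

# Illustration only: inspect the tables produced by the example above.
conn = snowflake.connector.connect(user='...', password='...', account='...')
try:
    cur = conn.cursor()
    cur.execute('SELECT COUNT(*) FROM MY_DATABASE.PUBLIC.EMPLOYEE')
    print('EMPLOYEE rows:', cur.fetchone()[0])  # expected: 2
    cur.execute('SELECT COUNT(*) FROM MY_DATABASE.PUBLIC.PRODUCT')
    print('PRODUCT rows:', cur.fetchone()[0])   # expected: 1
finally:
    conn.close()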

The same thing can be achieved via the Scrapy integration; a minimal sketch of the Scrapy side follows.
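The snippet below assumes the exporter is enabled from a Scrapy project's settings.py. The SNOWFLAKE_* names follow the SNOWFLAKE_ prefix convention described under "Configuration (Scrapy)" below; the ITEM_PIPELINES path shown is a hypothetical placeholder, not the library's confirmed entry point.

# settings.py -- illustrative sketch, not the library's confirmed wiring.
ITEM_PIPELINES = {
    # Hypothetical pipeline path; check the package for the actual one.
    'snowflake_stage_exporter.scrapy.SnowflakeStagePipeline': 300,
}
SNOWFLAKE_USER = '...'
SNOWFLAKE_PASSWORD = '...'
SNOWFLAKE_ACCOUNT = '...'
SNOWFLAKE_TABLE_PATH = 'MY_DATABASE.PUBLIC.{item_type_name}'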

How this works

For each object that you feed into the exporter, it is written into a local buffer (a temporary JSON file). Once a configurable maximum buffer size is reached, the file is uploaded to a Snowflake internal stage via a PUT statement. At the end of the execution the exporter creates all specified tables and then instructs Snowflake to populate each table from every staged JSON file via COPY INTO statements. A rough sketch of this flow follows the list below.

  • If you output to multiple tables, a separate buffer is maintained for each.
  • Alternatively, you can create / populate tables as soon as the buffers are flushed via the *_on parameters described below.
  • The *_on parameters also allow you to disable table creation / population altogether and just deal with the stages yourself.
  • For table creation the exporter tries to figure out the column types dynamically during execution; alternatively you can pass them explicitly via the predefined_column_types parameter.
Why "Stages"?

The use of local buffers and stages, as opposed to typical SQL INSERT statements, is motivated largely by Snowflake performance implications and its billing model (see https://community.snowflake.com/s/question/0D50Z00008JpBymSAF/implications-of-multiple-insert-statements-vs-copy-into).

An illustrative example is a long-running Scrapy / ScrapyCloud job that constantly outputs data. If the job were to keep the connection constantly executing INSERTs, Snowflake would keep the warehouse running and consuming credits for the entire duration of the job.

A secondary consideration was allowing the user to work with purely just the stages, like one would work with S3 or a similar blob file storage. This covers cases where a user needs their own sophisticated table management approach and simply wants a convenient place to store raw data.

IMPORTANT NOTE: it won't make much sense to use this library if you're already working with S3 or similar storage (including just your local machine) where your data is in one of the Snowflake-supported file formats. Snowflake has built-in support for ingesting from several third-party blob storages, and local files can be uploaded directly via PUT statements.

Configuration

All configuration is done via arguments to the main exporter class SnowflakeStageExporter.

  • user/password/account - Snowflake account credentials, passed as is to snowflake.connector.connect.
  • connection_kwargs - any additional parameters to snowflake.connector.connect.
  • table_path - table path to use.
    • If you specify database / database schema in connection_kwargs you won't need to specify them in the table path.
    • The path can include template variables which are expanded when you feed an item to the exporter. By default only the item variable is passed (e.g. 'MY_DB.PUBLIC.TABLE_{item[entity_type]}'; here it's assumed all of your items have an "entity_type" field).
    • You can pass any additional variables yourself as keyword arguments when calling exporter.export_item().
    • Additionally in Scrapy integration the following fields are passed:
      • spider - spider instance.
      • item_type_name - type(item).__name__. In the basic example above you passed this explicitly yourself.
  • stage - which internal stage to use. By default the user stage ("@~") is used.
  • stage_path - naming for the files being uploaded to the stage.
    • By default it's "{table_path}/{instance_ms}_{batch_n}.jl", where table_path is the table path with all variables resolved, instance_ms is the epoch milliseconds at which the exporter was instantiated, and batch_n is the sequential number of the buffer.
    • In the Scrapy integration the default is "{table_path}/{job}/{instance_ms}_{batch_n}.jl", where job is the key of the ScrapyCloud job, or "local" if the spider ran locally.
  • max_file_size - maximum buffer size in bytes. 1GiB by default.
  • predefined_column_types - dictionary of table_path to Snowflake columns types for table creation.
    • e.g. {"MY_DB.PUBLIC.PRODUCT": {"title": "STRING", "price": "NUMBER"}, "MY_DB.PUBLIC.EMPLOYEE": {"name": "STRING", "salary": "NUMBER", "extra_info": "OBJECT"}}.
  • ignore_unexpected_fields - ignore fields not passed in predefined_column_types during table creation / population.
    • True by default, but only takes effect when the table does have predefined column types.
    • The data is still exported in full to the staged files.
  • allow_varying_value_types - if False, columns that had multiple value types are skipped during table creation / population.
    • True by default; such columns are assigned the VARIANT type.
    • When False, an error is logged each time such a column is encountered.
    • Takes effect only when the exporter needs to figure out the column type itself.
    • The data is still exported in full to the staged files.
  • create_tables_on - one of "finish" / "flush" / "never". "finish" by default; "flush" means each time a file is staged.
  • populate_tables_on - ditto.
  • clear_stage_on - same as above, but "never" is the default. When enabled, each file is removed from the stage individually.
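Putting several of these parameters together, a more explicit setup could look like the sketch below; the values are illustrative, the parameter names are the ones documented above.

from snowflake_stage_exporter import SnowflakeStageExporter

exporter = SnowflakeStageExporter(
    user='...',
    password='...',
    account='...',
    connection_kwargs={'role': 'LOADER'},  # extra args forwarded to snowflake.connector.connect
    table_path='MY_DATABASE.PUBLIC.{item_type_name}',
    stage='@~',  # default: the internal user stage
    stage_path='{table_path}/{instance_ms}_{batch_n}.jl',  # the documented default
    max_file_size=256 * 1024 ** 2,  # flush buffers at 256 MiB instead of the 1 GiB default
    predefined_column_types={
        'MY_DATABASE.PUBLIC.PRODUCT': {'title': 'STRING', 'price': 'NUMBER'},
    },
    create_tables_on='flush',     # create each table as soon as one of its files is staged
    populate_tables_on='finish',  # but only COPY INTO the tables at the very end
    clear_stage_on='never',       # keep the staged files (the default)
)
exporter.export_item({'title': 'Steel Grill', 'price': 5.5}, item_type_name='product')
exporter.finish_export()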

Configuration (Scrapy)

All of the exporter instance parameters are exposed as Scrapy settings with a SNOWFLAKE_ prefix (e.g. SNOWFLAKE_MAX_FILE_SIZE).

Once a Scrapy job ends, all remaining buffers are flushed. If the job outcome is not "finished" (i.e. something went wrong), then no table creation / table population / stage clearing takes place.

TODO

  • Unit tests >_>.
  • Test on Windows?