Python virtual filesystem for SQLite to read from and write to S3

Overview

sqlite-s3vfs CircleCI Test Coverage

Python virtual filesystem for SQLite to read from and write to S3.

No locking is performed, so client code must ensure that writes do not overlap with other writes or reads. If multiple writes happen at the same time, the database will probably become corrupt and data be lost.

Inspired by phiresky's sql.js-httpvfs, dacort's Stack Overflow answer, and sqlite-s3-query.

How does it work?

sqlite-s3vfs stores the SQLite database in fixed-sized blocks, and each is stored as a separate object in S3. SQLite stores its data in fixed-size pages, and always writes exactly a page at a time. This virtual filesystem translates page reads and writes to block reads and writes. In the case of SQLite pages being the same size as blocks, which is the case by default, each page write results in exactly one block write.

Separate objects are required since S3 does not support the partial replace of an object; to change even 1 byte, it must be re-uploaded in full.

Installation

sqlite-s3vfs depends on APSW, which is not officially available on PyPI, but can be installed directly from GitHub.

pip install sqlite-s3vfs
pip install https://github.com/rogerbinns/apsw/releases/download/3.36.0-r1/apsw-3.36.0-r1.zip --global-option=fetch --global-option=--version --global-option=3.36.0 --global-option=--sqlite --global-option=build --global-option=--enable-all-extensions

Usage

sqlite-s3vfs is an APSW virtual filesystem that requires boto3 for its communication with S3.

import apsw
import boto3
import sqlite_s3vfs

# A boto3 bucket resource
bucket = boto3.Session().resource('s3').Bucket('my-bucket')

# An S3VFS for that bucket
s3vfs = sqlite_s3vfs.S3VFS(bucket=bucket)

# sqlite-s3vfs stores many objects under this prefix
# Note that it's not typical to start a key prefix with '/'
key_prefix = 'my/path/cool.sqlite'

# Connect, insert data, and query
with apsw.Connection(key_prefix, vfs=s3vfs.name) as db:
    cursor = db.cursor()
    cursor.execute(f'''
        CREATE TABLE foo(x,y);
        INSERT INTO foo VALUES(1,2);
    ''')
    cursor.execute('SELECT * FROM foo;')
    print(cursor.fetchall())

See the APSW documentation for more examples.

Serializing (getting a regular SQLite file out of the VFS)

The bytes corresponding to a regular SQLite file can be extracted with the serialize_iter function, which returns an iterable,

for chunk in s3vfs.serialize_iter(key_prefix=key_prefix):
    print(chunk)

or with serialize_fileobj, which returns a non-seekable file-like object. This can be passed to Boto3's upload_fileobj method to upload a regular SQLite file to S3.

target_obj = boto3.Session().resource('s3').Bucket('my-target-bucket').Object('target/cool.sqlite')
target_obj.upload_fileobj(s3vfs.serialize_fileobj(key_prefix=key_prefix))

Deserializing (getting a regular SQLite file into the VFS)

# Any iterable that yields bytes can be used. In this example, bytes come from
# a regular SQLite file already in S3
source_obj = boto3.Session().resource('s3').Bucket('my-source-bucket').Object('source/cool.sqlite')
bytes_iter = source_obj.get()['Body'].iter_chunks()

s3vfs.deserialize_iter(key_prefix='my/path/cool.sqlite', bytes_iter=bytes_iter)

Tests

The tests require the dev dependencies and APSW to installed, and MinIO started

pip install -r requirements-dev.txt
pip install https://github.com/rogerbinns/apsw/releases/download/3.36.0-r1/apsw-3.36.0-r1.zip --global-option=fetch --global-option=--version --global-option=3.36.0 --global-option=--all --global-option=build --global-option=--enable-all-extensions
./start-minio.sh

can be run with pytest

pytest

and finally Minio stopped

./stop-minio.sh
Owner
Department for International Trade
Department for International Trade
PaddingZip - a tool that you can craft a zip file that contains the padding characters between the file content.

PaddingZip - a tool that you can craft a zip file that contains the padding characters between the file content.

phithon 53 Nov 07, 2022
RMfuse provides access to your reMarkable Cloud files in the form of a FUSE filesystem

RMfuse provides access to your reMarkable Cloud files in the form of a FUSE filesystem. These files are exposed either in their original format, or as PDF files that contain your annotations. This le

Robert Schroll 82 Nov 24, 2022
Find potentially sensitive files

find_files Find potentially sensitive files This script searchs for potentially sensitive files based off of file name or string contained in the file

4 Aug 20, 2022
CredSweeper is a tool to detect credentials in any directories or files.

CredSweeper is a tool to detect credentials in any directories or files. CredSweeper could help users to detect unwanted exposure of credentials (such as personal information, token, passwords, api k

Samsung 54 Dec 13, 2022
Singer is an open source standard for moving data between databases, web APIs, files, queues, and just about anything else you can think of.

Singer is an open source standard for moving data between databases, web APIs, files, queues, and just about anything else you can think of. Th

Singer 1.1k Jan 05, 2023
Vericopy - This Python script provides various usage modes for secure local file copying and hashing.

Vericopy This Python script provides various usage modes for secure local file copying and hashing. Hash data is captured and logged for paths before

15 Nov 05, 2022
A tool written in python to generate basic repo files from github

A tool written in python to generate basic repo files from github

Riley 7 Dec 02, 2021
A platform independent file lock for Python

py-filelock This package contains a single module, which implements a platform independent file lock in Python, which provides a simple way of inter-p

Benedikt Schmitt 497 Jan 05, 2023
Transforme rapidamente seu arquivo CSV (de qualquer tamanho) para SQL de forma rápida.

Transformador de CSV para SQL Transforme rapidamente seu arquivo CSV (de qualquer tamanho) para SQL de forma rápida, e com isso insira seus dados usan

William Rodrigues 4 Oct 17, 2022
A python script to pull the transactions of an Algorand wallet and put them into a CSV file.

AlgoCSV A python script to pull the transactions of an Algorand wallet and put them into a CSV file. Dependancies: Requests Main features: Groups: Com

21 Jun 25, 2022
csv2ir is a script to convert ir .csv files to .ir files for the flipper.

csv2ir csv2ir is a script to convert ir .csv files to .ir files for the flipper. For a repo of .ir files, please see https://github.com/logickworkshop

Alex 38 Dec 31, 2022
Extract longest transcript or longest CDS transcript from GTF annotation file or gencode transcripts fasta file.

Extract longest transcript or longest CDS transcript from GTF annotation file or gencode transcripts fasta file.

laojunjun 13 Nov 23, 2022
Measure file similarity in a many-to-many fashion

Mesi Mesi is a tool to measure the similarity in a many-to-many fashion of long-form documents like Python source code or technical writing. The outpu

GatorEducator 3 Feb 02, 2022
QSynthesis is a Python3 API to perform I/O based program synthesis of bitvector expressions.

QSynthesis is a Python3 API to perform I/O based program synthesis of bitvector expressions. It aims at facilitating code deobfuscation. The algorithm is greybox approach combining both a blackbox I/

Quarkslab 103 Dec 30, 2022
Convert All TXT Files To One File.

AllToOne Convert All TXT Files To One File. Hi 👋 , I'm Alireza A Python Developer Boy 🔭 I’m currently working on my C# projects 🌱 I’m currently Lea

4 Jun 07, 2022
ZipFly is a zip archive generator based on zipfile.py

ZipFly is a zip archive generator based on zipfile.py. It was created by Buzon.io to generate very large ZIP archives for immediate sending out to clients, or for writing large ZIP archives without m

Buzon 506 Jan 04, 2023
File support for asyncio

aiofiles: file support for asyncio aiofiles is an Apache2 licensed library, written in Python, for handling local disk files in asyncio applications.

Tin Tvrtković 2.1k Jan 01, 2023
Convert CSV files into a SQLite database

csvs-to-sqlite Convert CSV files into a SQLite database. Browse and publish that SQLite database with Datasette. Basic usage: csvs-to-sqlite myfile.cs

Simon Willison 731 Dec 27, 2022
Small Python script to generate a calendar (.ics) file from SIMASTER courses schedule.

simaster.ics Small Python script to generate a calendar (.ics) file from SIMASTER courses schedule. Usage Getting the events.json file from SIMASTER O

Faiz Jazadi 8 Nov 02, 2022
This is just a GUI that detects your file's real extension using the filetype module.

Real-file.extnsn This is just a GUI that detects your file's real extension using the filetype module. Requirements Python 3.4 and above filetype modu

1 Aug 08, 2021