dbd is a database prototyping tool that enables data analysts and engineers to quickly load and transform data in SQL databases.

Overview

dbd: database prototyping tool

dbd is a database prototyping tool that enables data analysts and engineers to quickly load and transform data in SQL databases.

dbd helps you with following tasks:

  • Loading CSV, JSON, Excel, and Parquet data to database. It supports both local and online files (HTTP URLs). Data can be loaded incrementally or in full.
  • Transforming data in existing database tables using insert-from-sql statements.
  • Executing DDL (Data Definition Language) SQL scripts (statements like CREATE SCHEMA, etc.).

How dbd works

dbd processes a model directory that contains directories and files:

  • Directories create new database schemas.
  • Files create new database table or view. The new table's or view's name is the same as the data file name.
    • .csv, .json, .xlsx, and .parquet data files are introspected and loaded to database as tables.
    • .sql files that contain SQL SELECT statements are executed and the result is loaded to database as table or view.
    • .ref files contain one or more local paths or URLs pointing to supported data files. The referenced files are loaded to database as tables.
    • .yaml files contain metadata for the files above. The .yaml file has the same name as a data, .sql, or .ref file and specifies details of target table's columns (data types, constraints, indexes, etc.). .yaml files are optional. If not specified, dbd uses defaults (e.g. TEXT data types for CSV columns)
    • .ddl files contain multiple SQL statements separated by semicolon that are executed against the database.

dbd knows the correct order in which to process files in the model directory to respect mutual dependencies between created objects.

How dbd works

dbd currently supports Postgres, MySQL/MariaDB, SQLite, Snowflake, BigQuery, and Redshift databases.

Getting started and Examples

A short 5-minute getting started tutorial is available here.

You can also check out dbd's examples here. The easiest way how to execute them is to either clone or download dbd's github repository and start with the SQLite examples.

python3 -m venv dbd-env
source dbd-env/bin/activate
pip3 install dbd
git clone https://github.com/zsvoboda/dbd.git
cd dbd/examples/sqlite/basic
dbd run . 

These commands should create a new basic.db SQLite database with area, population, and state tables that are created and loaded from the corresponding files in the model directory.

Installing dbd

dbd requires Python 3.8 or higher.

Prerequisites

Check that you have a recent version of Python 3.8 or higher.

python3 -V

if not use a package manager to install the latest python:

On Fedora run:

sudo yum install python3

On Ubuntu run:

sudo apt install python3

Install Python virtual environment:

On Fedora run:

sudo yum install python3-virtualenv

On Ubuntu run:

sudo apt install python3-venv

On Windows just install Python 3.8 or higher from the Store.

Then activate the virtual environment:

On Linux run:

python3 -m venv dbd-env
source dbd-env/bin/activate

On Windows run:

python3 -m venv dbd-env
call dbd-env\Scripts\activate.bat

PyPI

pip3 install dbd

OR

git clone https://github.com/zsvoboda/dbd.git
cd dbd
pip3 install .

Running dbd

dbd installs a command line executable that must reside on your path. Sometimes Python places the executable (called dbd) outside of your PATH. Try to execute dbd after the installation. If the command cannot be found, try to execute

export PATH=~/.local/bin:$PATH

and run dbd again. pip3 usually complains about the fact that the directory where it is placing the executable is not in PATH. You need to take the scripts directory that it suggests and put it on your PATH.

Once you can execute the dbd command, clone the dbd repository and start with the SQLite examples:

git clone https://github.com/zsvoboda/dbd.git
cd dbd/examples/sqlite/basic
dbd run . 

You can also start with this tutorial.

Starting a new dbd project

You can generate dbd project initial layout by executing init command:

dbd init <new-project-name>

The init command generates a new dbd project directory with the following content:

  • model directory that contains the content files.
  • dbd.profile configuration file that defines database connections. The profile file is usually shared by more dbd projects.
  • dbd.project project configuration file references one of the connections from the profile file and define the model directory location.

dbd profile configuration file

dbd stores database connections in the dbd.profile configuration file. dbd searches for it in the current directory or in your home directory. You can use --profile option to point it to a profile file in different location.

The profile file is YAML file with the following structure:

databases:
  db1:
    db.url: 
   
  db2:
    db.url: 
   
  db3:
    db.url: 
   

Read this document for more details about specific SQLAlchemy database URL formats.

dbd project configuration file

dbd stores project configuration in project configuration file that is usually stored in your dbd project directory. dbd searches for dbd.project file in your project's directory root. You can also use the --project option of the dbd command to specify a custom project configuration file.

The project configuration file also uses YAML format and references dbd model directory and databse connection from a profile config file. All paths in project file are either absolute or relative to the directory where the profile file is located.

For example:

model: ./model
database: db2

Model directory

Model directory contains directories and files. Directories represent database schemas. Files, in most cases, represent database tables.

For example, this model directory layout

dbd-model-directory
+- schema1
 +-- us_states.csv
+- schema2
 +-- us_counties.csv

creates two database schemas: schema1 and schema2 and two database tables: us_states in schema1 and us_counties in schema2. Both tables are populated with the data from the CSV files.

dbd supports following files located in the model directory:

  • DATA files: .csv, .json, .xls, .xlsx, .parquet files are loaded to database as tables
  • REF files: .ref files contain one or more absolute or relative paths to local files or URLs of online data files that are loaded to database as tables. All referenced files must have the same structure as they are loaded to the same table.
  • SQL files: .sql with SQL SELECT statements are executed using insert-from-select SQL construct. The INSERT command is generated (the SQL file only contains a SQL SELECT statement)
  • DDL files: contain a sequence of SQL statements separated by semicolon. The DDL files can be named prolog.ddl and epilog.ddl. The prolog.ddl is executed before all other files in a specific schema. The epilog.ddl is executed last. The prolog.ddl and epilog.ddl in the top-level model directory are executed as the very first or the very last files in the model.
  • YAML files: specify additional configuration for the DATA, SQL, and REF files.

REF files

.ref file contains one or more references to files that dbd loads to the database as tables. The references can be URLs, absolute file paths or paths relative to the .ref file. All referenced data files must have the same structure as they are loaded to the same database table.

Here is an example of a .ref file:

https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-03-2022.csv
https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-04-2022.csv
../data/01-05-2022.csv
../data/01-06-2022.csv

The paths and URLs can point to data files with different formats (e.g. CSV or JSON) as long as the files have the same structure (number of columns and column types).

SQL files

.sql file performs SQL data transformation in the target database. It contains a SQL SELECT statement that dbd wraps in insert-from-select statement, executes it, and stores the result into a table or view that inherits its name from the SQL file name.

Here is an example of us_states.sql file that creates a new us_states database table:

SELECT
        state.abbrev AS state_code,
        state.state AS state_name,
        population.population AS state_population,
        area.area_sq_mi  AS state_area_sq_mi
    FROM state
        JOIN population ON population.state = state.abbrev
        JOIN area on area.state_name = state.state

YAML files

.yaml file specifies additional configuration for a corresponding DATA, REF or SQL file with the same base file name. Here is a YAML configuration example for the us_states.sql file above:

table:
  columns:
    state_code:
      nullable: false
      primary_key: true
      type: CHAR(2)
    state_name:
      nullable: false
      index: true
      type: VARCHAR(50)
    state_population:
      nullable: false
      type: INTEGER
    state_area_sq_mi:
      nullable: false
      type: INTEGER
process:
  materialization: table
  mode: drop

This .yaml file re-types the state_population and the state_area_sq_mi columns to INTEGER, disallows NULL values in all columns, and makes the state_code column table's primary key.

You don't have to describe all table's columns. The columns that you leave out will have their types set to the default TEXT datatype in case of DATA files and is defined by the insert-from-select in case of SQL files.

The us_states.sql table is dropped and data are re-loaded in full everytime the dbd executes this model.

Table section

.yaml file's columns are mapped to a columns of the table that dbd creates from a corresponding DATA, REF or SQL file. For example, a CSV header columns or SQL SELECT column AS column clauses.

dbd supports following column's parameters:

  • type: column's SQL type.
  • primary_key: is the column part of table's primary key (true|false)?
  • foreign_keys: all other database table columns that are referenced from a column in table (in format foreign-table.referenced-column).
  • nullable: does column allow null values (true|false)?
  • index: is column indexed (true|false)?
  • unique: does column store unique values (true|false)?

Process section

The process section defines following processing options:

  • materialization: specifies whether dbd creates a physical table or a view when processing SQL file. The REF and DATA files always yield physical table.
  • mode: specifies what dbd does with table's data. You can specify values drop, truncate, or keep. The mode option is ignored for views.

Jinja templates

Most of model files support Jinja2 templates. For example, this REF file loads 6 CSV files to database (4 online files from a URL and 2 from a local filesystem):

{% for n in range(4) %}
https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-0{{ n+1 }}-2022.csv
{% endfor %}
../data/01-05-2022.csv
../data/01-06-2022.csv

Profile an project configuration files also us Jinja2 templates. You can expend any environment variable with the {{ environment-variable-name }} syntax. For example, you can define your database connection parameters like username or password in environment variables and use this profile configuration file:

databases:
  states_snowflake:
    db.url: "snowflake://{{ SNOWFLAKE_USER }}:{{ SNOWFLAKE_PASSWORD }}@{{ SNOWFLAKE_ACCOUNT_IDENTIFIER }}/{{ SNOWFLAKE_DB }}/{{ SNOWFLAKE_SCHEMA }}?warehouse={{SNOWFLAKE_WAREHOUSE }}"
  covid:
    db.url: "snowflake://{{ SNOWFLAKE_USER }}:{{ SNOWFLAKE_PASSWORD }}@{{ SNOWFLAKE_ACCOUNT_IDENTIFIER }}/{{ SNOWFLAKE_DB }}/{{ SNOWFLAKE_SCHEMA }}?warehouse={{SNOWFLAKE_WAREHOUSE }}"

Fast data loading mode

All supported database engines except SQLite support fast data loading mode. In this mode, data are loaded to a database table using bulk load (SQL COPY) command instead of individual INSERT statements.

MySQL and Redshift require additional configuration to enable fast data loading mode. Without this extra configuration dbd reverts to slow inserting mode via INSERT statements.

MySQL

To enable fast loading mode, you need specify local_infile=1 query parameter in the MySQL connection url. You also must enable the LOCAL INFILE mode on your MySQL server. You can for example do this by executing this SQL statement:

SET GLOBAL local_infile = true;

Redshift

To enable fast loading mode, you need specify copy_stage parameter in the dbd.project configuration file. The copy_stage parameter must reference a storage definition in your dbd.profile configuration file. Check the example configuration files in the examples/redshift/covid_cz directory. Here are the example definitions of the environment variables that these configuration files use:

export AWS_COVID_STAGE_S3_URL="s3://covid/stage"
export AWS_COVID_STAGE_S3_ACCESS_KEY="AKIA43SWERQGXMUYFIGMA"
export AWS_COVID_STAGE_S3_S3_SECRET_KEY="iujI78eDuFFGJF6PSjY/4CIhEJdMNkuS3g4t0BRwX"

License

dbd code is open-sourced under BSD 3-clause license.

Resources and References

Owner
Zdenek Svoboda
Zdenek Svoboda
Example Python codes that works with MySQL and Excel files (.xlsx)

Python x MySQL x Excel by Zinglecode Example Python codes that do the processes between MySQL database and Excel spreadsheet files. YouTube videos MyS

Potchara Puttawanchai 1 Feb 07, 2022
Estoult - a Python toolkit for data mapping with an integrated query builder for SQL databases

Estoult Estoult is a Python toolkit for data mapping with an integrated query builder for SQL databases. It currently supports MySQL, PostgreSQL, and

halcyon[nouveau] 15 Dec 29, 2022
a small, expressive orm -- supports postgresql, mysql and sqlite

peewee Peewee is a simple and small ORM. It has few (but expressive) concepts, making it easy to learn and intuitive to use. a small, expressive ORM p

Charles Leifer 9.7k Dec 30, 2022
Database connection pooler for Python

Nimue Strange women lying in ponds distributing swords is no basis for a system of government! --Dennis, Peasant Nimue is a database connection pool f

1 Nov 09, 2021
A fast unobtrusive MongoDB ODM for Python.

MongoFrames MongoFrames is a fast unobtrusive MongoDB ODM for Python designed to fit into a workflow not dictate one. Documentation is available at Mo

getme 45 Jun 01, 2022
asyncio compatible driver for elasticsearch

asyncio client library for elasticsearch aioes is a asyncio compatible library for working with Elasticsearch The project is abandoned aioes is not su

97 Sep 05, 2022
Creating a python package to convert /transfer excelsheet data to a mysql Database Table

Creating a python package to convert /transfer excelsheet data to a mysql Database Table

Odiwuor Lameck 1 Jan 07, 2022
google-cloud-bigtable Apache-2google-cloud-bigtable (🥈31 · ⭐ 3.5K) - Google Cloud Bigtable API client library. Apache-2

Python Client for Google Cloud Bigtable Google Cloud Bigtable is Google's NoSQL Big Data database service. It's the same database that powers many cor

Google APIs 39 Dec 03, 2022
Async database support for Python. 🗄

Databases Databases gives you simple asyncio support for a range of databases. It allows you to make queries using the powerful SQLAlchemy Core expres

Encode 3.2k Dec 30, 2022
Databank is an easy-to-use Python library for making raw SQL queries in a multi-threaded environment.

Databank Databank is an easy-to-use Python library for making raw SQL queries in a multi-threaded environment. No ORM, no frills. Thread-safe. Only ra

snapADDY GmbH 4 Apr 04, 2022
edaSQL is a library to link SQL to Exploratory Data Analysis and further more in the Data Engineering.

edaSQL is a python library to bridge the SQL with Exploratory Data Analysis where you can connect to the Database and insert the queries. The query results can be passed to the EDA tool which can giv

Tamil Selvan 8 Dec 12, 2022
SAP HANA Connector in pure Python

SAP HANA Database Client for Python A pure Python client for the SAP HANA Database based on the SAP HANA Database SQL Command Network Protocol. pyhdb

SAP 299 Nov 20, 2022
python-bigquery Apache-2python-bigquery (🥈34 · ⭐ 3.5K · 📈) - Google BigQuery API client library. Apache-2

Python Client for Google BigQuery Querying massive datasets can be time consuming and expensive without the right hardware and infrastructure. Google

Google APIs 550 Jan 01, 2023
Implementing basic MySQL CRUD (Create, Read, Update, Delete) queries, using Python.

MySQL with Python Implementing basic MySQL CRUD (Create, Read, Update, Delete) queries, using Python. We can connect to a MySQL database hosted locall

MousamSingh 5 Dec 01, 2021
Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.

Redash is designed to enable anyone, regardless of the level of technical sophistication, to harness the power of data big and small. SQL users levera

Redash 22.4k Dec 30, 2022
A collection of awesome sqlite tools, scripts, books, etc

Awesome Series @ Planet Open Data World (Countries, Cities, Codes, ...) • Football (Clubs, Players, Stadiums, ...) • SQLite (Tools, Books, Schemas, ..

Planet Open Data 205 Dec 16, 2022
GINO Is Not ORM - a Python asyncio ORM on SQLAlchemy core.

GINO - GINO Is Not ORM - is a lightweight asynchronous ORM built on top of SQLAlchemy core for Python asyncio. GINO 1.0 supports only PostgreSQL with

GINO Community 2.5k Dec 27, 2022
SQL for Humans™

Records: SQL for Humans™ Records is a very simple, but powerful, library for making raw SQL queries to most relational databases. Just write SQL. No b

Kenneth Reitz 6.9k Jan 07, 2023
A supercharged SQLite library for Python

SuperSQLite: a supercharged SQLite library for Python A feature-packed Python package and for utilizing SQLite in Python by Plasticity. It is intended

Plasticity 703 Dec 30, 2022
Kafka Connect JDBC Docker Image.

kafka-connect-jdbc This is a dockerized version of the Confluent JDBC database connector. Usage This image is running the connect-standalone command w

Marc Horlacher 1 Jan 05, 2022