Python script for transferring data between three drives in two separate stages

Last update: Nov 10, 2021

Overview

Waterlock

Waterlock is a Python script meant for incrementally transferring data between three folder locations in two separate stages. It performs hash verification and persistently tracks data transfer progress using SQLite.

I am not responsible for any lost data. This was an evening coding project. Use at your own discretion.

Use Case & Features

Waterlock was designed for moving files from one computer (e.g. your home server) to an intermediary drive (e.g. a portable hard drive), and then from that drive to another computer (e.g. an offsite backup server).

It will fill the intermediary drive with as many files as it can, aside from a user-configurable amount of reserve-space.
It computes a blake2 checksum on every file copy and compares it to the initial hash value stored in the SQLite database to ensure that the data is not corrupted.
It uses a SQLite database to track what data has been moved. As a result, you can incrementally move data from one location to another with minimal user input.
Every time Waterlock is run on the source location, it checks for files that have been modified since the last run (based on timestamp, not hash). Modified files have their hash and modification timestamp updated in the database and are marked as unmoved so that they are transferred again. Note that Waterlock does not version files. Because changes are detected by timestamp, a silently corrupted file should in theory not be transferred again unless its modification timestamp has also changed.
Every time Waterlock is run on the source location, it also checks for files that were previously moved to the intermediary drive but did not reach the destination. If these files are no longer on the intermediary drive (due to accidental deletion, for instance), Waterlock moves them to the intermediary drive again.
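The copy behaviour described above (respect the reserve space, then verify the hash) can be sketched roughly as follows. This is a minimal illustration, not Waterlock's actual code: the function names blake2_hash, has_room and copy_verified are hypothetical, and BLAKE2b is assumed since the README only says "blake2".

```python
import hashlib
import shutil
from pathlib import Path


def blake2_hash(path, chunk_size=1024 * 1024):
    """Return the BLAKE2b hex digest of a file, read in chunks."""
    digest = hashlib.blake2b()  # assumption: the b variant of blake2
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_room(file_size, free_bytes, reserved_bytes):
    """True if copying file_size bytes still leaves reserved_bytes free."""
    return free_bytes - file_size >= reserved_bytes


def copy_verified(src, dst, expected_hash, reserved_bytes):
    """Copy src to dst if space allows, then confirm the hash of the copy.

    Returns True on a verified copy and False if the copy was skipped to
    protect the reserve space. Raises ValueError on a hash mismatch.
    """
    src, dst = Path(src), Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    free_bytes = shutil.disk_usage(dst.parent).free
    if not has_room(src.stat().st_size, free_bytes, reserved_bytes):
        return False
    shutil.copy2(src, dst)  # copy2 also preserves the modification time
    if blake2_hash(dst) != expected_hash:
        dst.unlink()  # do not leave a corrupted copy behind
        raise ValueError(f"hash mismatch after copying {src}")
    return True
```

In this sketch, expected_hash would be the value recorded in the database when the file was first seen on the source, so a mismatch indicates corruption during the copy.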

Example Use Case: I use Waterlock to transfer files that are too large to send over the network to an offsite backup location at a relative's house. Each time I visit, I run the script on my home server to load the external drive, then run it again on the offsite backup server.

Usage

Change the settings at the top of the script, using absolute file paths. While relative paths may work, they are more error-prone due to string formatting issues. Store the script on the intermediary drive itself and run it from there. It will automatically create waterlock.db and a cargo folder where the data will be stored. Note that after the final transfer to the destination, Waterlock does not delete the data on the intermediary drive.

python waterlock.py

If you are familiar with Python, you can also fully verify all the files on the middle or destination drives to ensure that their hashes match what is stored in the database. This is done using two additional methods of the Waterlock class, verify_middle() and verify_destination(). The code to verify files on the destination would be as follows:

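if __name__ == "__main__":
    # The three variables below are the settings defined at the top of the script.
    wl = Waterlock(
        source_directory=source_directory,
        end_directory=end_directory,
        reserved_space=reserved_space,
    )
    wl.start()
    # Check every file on the destination against the hashes in the database.
    wl.verify_destination()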
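
For readers curious what a full verification pass involves, here is a rough, self-contained sketch. It is not the implementation of verify_destination(): the table name files, its columns rel_path and hash, and the function find_mismatches are all hypothetical, since the README does not describe the layout of waterlock.db.

```python
import hashlib
import sqlite3
from pathlib import Path


def blake2_hash(path):
    """Return the BLAKE2b hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.blake2b()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(db_path, root):
    """Return relative paths that are missing under root or fail the hash check.

    Assumes a hypothetical table files(rel_path TEXT, hash TEXT).
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT rel_path, hash FROM files ORDER BY rel_path"
        ).fetchall()
    finally:
        conn.close()

    failed = []
    for rel_path, stored_hash in rows:
        target = Path(root) / rel_path
        if not target.is_file() or blake2_hash(target) != stored_hash:
            failed.append(rel_path)
    return failed
```

An empty list from find_mismatches would mean every tracked file is present and matches its stored hash.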

Why 'Waterlock'?

It is named Waterlock after marine locks used to move ships through waterways of different water levels in multiple stages.


Owner

David Swanlund
