MongoDB utility to inflate the contents of small collection to a new larger collection

Overview

MongoDB Data Inflater ("data-inflater")

The data-inflater tool is a MongoDB utility to automate the creation of a new large database collection using data sourced from an existing smaller database collection.

By default, the utility will use the Atlas 'sample data set' database collection sample_mflix.movies as the source collection. However, most users will provide parameters to the utility to specify the use of their own database and source collection. If you do want to use the Atlas sample data set, see the sample data manual page for more information.

The data-inflater utility issues multiple concurrent aggregation processes, each copying batches of records in parallel for increased performance. The resulting collection will contain documents with duplicated data but with new unique _id field values. The variance ratio of data in the new collection will approximately reflect the variance ratio of the source collection. Therefore, you should ensure you have supplied at least a few different documents (if not a few hundred or thousand) in the source collection.

If you are running a sharded cluster, the utility will ensure the target collection is sharded with a shard key, and where it can, it will pre-split the chunks to avoid subsequent needless balancer overhead. For example, if you specify the --shardkey parameter for this utility to reference a field (e.g. product_name) as the range based shard key, before creating the target collection, the utility will introspect the spread of values for the shard key field (e.g. product_name). The utility will then create pre-split chunks in the new empty target collection before any data is copied to it, to maximise performance.

How To Run

In a running MongoDB cluster (self-managed or running in Atlas), ensure you have created and populated a source collection with at least one sample record in it (ideally more with varying values for the fields across the different documents to reflect the shape and variance you desire).

Ensure Python3 (version 3.8 or greater) and the MongoDB Python Driver (PyMongo) are already installed on your workstation. Example to install PyMongo:

pip3 install --user pymongo

Ensure the .py script is executable and then execute the following to view the utility's help instructions and the full list of parameters that you can provide:

./data-inflater.py -h

Execute the following to connect to a locally running single server database (default port) to copy and expand the data from an existing source collection, mydb.mySrcColl, to an a new collection, mydb.myDestColl, which will contain 1 million records:

./data-inflater.py --url 'mongodb://localhost:27017' -d 'mydb' -c 'mySrcColl' -t 'myDestColl' -s 1000000

Execute the following to connect to an Atlas cluster (ensure you've already loaded the Atlas sample data set), to inflate the data from the source movies collection to the new movies_big collection, which will contain 100 million records (note, first change the URL username, password and hostname shown, to match the URL of your Atlas cluster):

./data-inflater.py --url 'mongodb+srv://usr:[email protected]/'
Owner
Paul Done
Paul Done
Conveniently measures the time of your loops, contexts and functions.

Conveniently measures the time of your loops, contexts and functions.

Maciej J Mikulski 79 Nov 15, 2022
A fast Python implementation of Ac Auto Mechine

A fast Python implementation of Ac Auto Mechine

Jin Zitian 1 Dec 07, 2021
cssOrganizer - organize a css file by grouping them into categories

This python project was created to scan through a CSS file and produce a more organized CSS file by grouping related CSS Properties within selectors. Created in my spare time for fun and my own utili

Andrew Espindola 0 Aug 31, 2022
Just some scripts to export vector tiles to geojson.

Vector tiles to GeoJSON Nowadays modern web maps are usually based on vector tiles. The great thing about vector tiles is, that they are not just imag

Lilith Wittmann 77 Jul 26, 2022
Entropy-controlled contexts in Python

Python module ordered ordered module is the opposite to random - it maintains order in the program. import random x = 5 def increase(): global x

HyperC 36 Nov 03, 2022
Group imports from Windows binaries

importsort This is a tool that I use to group imports from Windows binaries. Sometimes, you have a gigantic folder full of executables, and you want t

【☆ ゆう ☆ 】 15 Aug 27, 2022
Finds price floor for every single attribute in a given collection

Solana Solanart Scanner Enjoy the Free Code Steps to run Download VS Code

Dalton Nisbett 19 Oct 20, 2022
A thing to simplify listening for PG notifications with asyncpg

A thing to simplify listening for PG notifications with asyncpg

ANNA 18 Dec 23, 2022
Allows you to canibalize methods from classes effectively implementing trait-oriented programming

About This package enables code reuse in non-inheritance way from existing classes, effectively implementing traits-oriented programming pattern. Stor

1 Dec 13, 2021
Python 3 script unpacking statically x86 CryptOne packer.

Python 3 script unpacking statically x86 CryptOne packer. CryptOne versions: 2021/08 until now (2021/12)

5 Feb 23, 2022
A time table app to notify the user about their class timings

kivyTimeTable A time table app to notify the user about their class timings Features This project incorporates some features i wanted to see in a time

2 Dec 15, 2021
Modeling Category-Selective Cortical Regions with Topographic Variational Autoencoders

Modeling Category-Selective Cortical Regions with Topographic Variational Autoencoders Getting Started Install requirements with Anaconda: conda env c

T. Andy Keller 4 Aug 22, 2022
A fancy and practical functional tools

Funcy A collection of fancy functional tools focused on practicality. Inspired by clojure, underscore and my own abstractions. Keep reading to get an

Alexander Schepanovski 2.9k Jan 07, 2023
🌲 A simple BST (Binary Search Tree) generator written in python

Tree-Traversals (BST) 🌲 A simple BST (Binary Search Tree) generator written in python Installation Use the package manager pip to install BST. Usage

Jan Kupczyk 1 Dec 12, 2021
Cleaning-utils - a collection of small Python functions and classes which make cleaning pipelines shorter and easier

cleaning-utils [] [] [] cleaning-utils is a collection of small Python functions

4 Aug 31, 2022
Rabbito is a mini tool to find serialized objects in input values

Rabbito-ObjectFinder Rabbito is a mini tool to find serialized objects in input values What does Rabbito do Rabbito has the main object finding Serial

7 Dec 13, 2021
.bvh to .mcfunction file converter.

bvh-to-mcf .bvh file to .mcfunction converter

Hanmin Kim 28 Nov 21, 2022
Daiho Tool is a Script Gathering for Windows/Linux systems written in Python.

Daiho is a Script Developed with Python3. It gathers a total of 22 Discord tools (including a RAT, a Raid Tool, a Nuker Tool, a Token Grabberr, etc). It has a pleasant and intuitive interface to faci

AstraaDev 32 Jan 05, 2023
Give you a better view of your Docker registry disk usage.

registry-du Give you a better view of your Docker registry disk usage. This small tool will analysis your Docker registry(vanilla or Harbor both work)

Nova Kwok 16 Jan 07, 2023
Aggregating gridded data (xarray) to polygons

A package to aggregate gridded data in xarray to polygons in geopandas using area-weighting from the relative area overlaps between pixels and polygons.

Kevin Schwarzwald 42 Nov 09, 2022