Improving your data science workflows with

Last update: Dec 23, 2022


Overview

Make Better Defaults

Author: Kjell Wooding

This is the git repo for "Makefiles: One great trick for making your conda environments more manageable", a PyData Global 2021 talk given on October 28, 2021 by Kjell Wooding.

Getting Started

To get started, type "make".

To follow along, watch the video once it's posted.

To learn more about Easydata, the framework that generated this repo, see the Getting Started Guide.

The Tips

Use git and virtual environments. Always.
Good workflow trumps good tooling.
Good workflow means not having to remember things.
Use one virtual environment per git repo. Give them both the same name.
Maintain virtual environments as code.
Use lockfiles: separate "what you want" from "what you need".
Auto-document your workflow.
Don't be afraid to "nuke it from orbit".
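
Two of these tips can be sketched in a few lines of Python. This is only an illustration, not the code used in this repo: the function names and the file names in the usage note are hypothetical. The first helper names the environment after the repo directory. The second applies the timestamp rule that make uses for a target and its prerequisite: the lockfile ("what you need") is stale when it is missing or older than the environment specification ("what you want").

```python
from pathlib import Path


def env_name_for_repo(repo_dir):
    """Return the environment name for a repo: the repo directory's own name.

    Hypothetical helper illustrating "one environment per repo, same name".
    """
    return Path(repo_dir).resolve().name


def lockfile_is_stale(spec_path, lock_path):
    """Return True when the lockfile is missing or older than the spec.

    Hypothetical helper: it compares modification times, which is the same
    rule make applies when deciding whether a target must be rebuilt.
    """
    spec, lock = Path(spec_path), Path(lock_path)
    if not lock.exists():
        return True
    return spec.stat().st_mtime > lock.stat().st_mtime
```

For example, assuming a spec file called environment.yml and a lockfile called environment.lock.yml (both names are assumptions here), lockfile_is_stale("environment.yml", "environment.lock.yml") tells you whether the environment should be re-resolved. Expressing the same dependency as a Makefile rule means you never have to remember to run the check yourself.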

The Implementation

See https://github.com/hackalog/make_better_defaults

Directory Structure

See Project Organization for details on how this project is organized on disk.

This project was built using Easydata, a Python framework aimed at making your data science workflow reproducible.

Owner

Kjell Wooding

Computer Engineer. Mathematician. Current Obsession: Reproducible Data Science

