Research using python - Guide for development of research code (using Anaconda Python)

TL;DR:

One time setup

  1. Install git and go through its one time setup, bare minimum:
    git config --global user.name "First Last"
    git config --global user.email "first[email protected]"
    git config --global core.editor editor_of_choice
    
    Editor option for the few folks on Windows (I haven't tried it myself):
    git config --global core.editor "'input/path/to/notepad++.exe' -multiInst -notabbar -nosession -noPlugin"
    
  2. Install git-lfs and run git lfs install.
  3. Install miniconda.
  4. Sign up for a GitHub account.
  5. Generate an SSH key and add it to your GitHub account.
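
To confirm that the tools from the steps above ended up on the PATH, a small check can be sketched in Python. This is a minimal sketch, not part of the original guide: the function name missing_tools is hypothetical, and the executable names (git, git-lfs, conda) are assumptions about a typical installation.

```python
import shutil


def missing_tools(tools, which=shutil.which):
    """Return the names in `tools` that cannot be found on the PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    # Assumed executable names; adjust them if your installation differs.
    required = ["git", "git-lfs", "conda"]
    missing = missing_tools(required)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools found.")
```

Run it once after the setup. It only checks that the executables can be found, not that they are configured.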

Once per repository setup

  1. Create an empty repository on GitHub, let's call it my_project.
  2. Make the initial commit into a local repository and push it to the remote:
    1. Create the local repository (this also creates a new directory): git init my_project
    2. Create a markdown file, README.md, describing the project.
    3. Create an environment_dev.yml file based on this example. Change the environment name to an appropriate one and add relevant packages.
    4. Copy this pre-commit configuration file.
    5. Copy this .gitignore file and add file types you want git to ignore.
    6. Add file types to be tracked by git-lfs based on file extension; this creates the .gitattributes file (e.g. git lfs track "*.pth").
    7. Copy this .flake8 file to customize the tool settings.
    git add README.md environment_dev.yml .pre-commit-config.yaml .gitattributes .gitignore .flake8
    git commit
    git branch -M main
    git remote add origin [email protected]:user_name/my_project.git
    git push -u origin main
  3. Create the virtual environment, activate it, and set up pre-commit:
    conda env create -f environment_dev.yml
    conda activate my_project_dev
    pre-commit install
    

Start working

  1. Activate the virtual environment: conda activate my_project_dev
  2. Create a new branch off of main:
    git checkout main
    git checkout -b my_new_branch
  3. Work.
  4. Commit locally and push to the remote (origin can be either a fork, if using a triangular workflow, or the original repository, if using a centralized workflow):
    git add file1 file2 file3
    git commit
    git push origin my_new_branch
  5. Create a pull request on GitHub and, after the tests pass, merge into the main branch.

If code is not in the remote repository, consider it lost.

Long version

Why should you care?

Most scientists need to write code as part of their research. The code is a "physical" embodiment of the underlying algorithmic and mathematical theory. Traditionally, the software engineering standards applied to research code have been rather low (rampant code duplication...). In the past decade we have seen this change, primarily because it is now much more common for researchers to share their code in all its glory (often due to the "encouragement" of funding agencies).

When sharing code, we expect it to comply with some minimal software engineering standards including design, readability, and testing.

I strive to follow the guidance below, but don't always. Still, it's important to have a goal to strive towards. To paraphrase Lewis Carroll: if you don't know where you're going, any road will take you there. From Alice's Adventures in Wonderland:

“Would you tell me, please, which way I ought to go from here?” “That depends a good deal on where you want to get to,” said the Cat. “I don’t much care where-” said Alice. “Then it doesn’t matter which way you go,” said the Cat. “-so long as I get somewhere,” Alice added as an explanation. “Oh, you’re sure to do that,” said the Cat, “if you only walk long enough.”

Personal pet peeves, in no particular order:

  • A single commit of all the code in the GitHub repository. Yes, you're sharing code, but it did not magically materialize in its final form. Be transparent so that we can trust the code and see how it developed over time. We can learn from paths that did not pan out almost as much as from the path that did. The full history shows which algorithmic paths were attempted and abandoned, and helps others avoid going down the same dead ends.
  • Repository contains .DS_Store files. Yes, we know you are proud of your Mac. I like OSX too, but seriously, you should have added this file type to the .gitignore file when setting up the repository.
  • Deep learning code sans-data, sans-weight files. This is completely useless in terms of reproducibility. Don't "share" like this.
  • Code duplication with minor, hard to detect, differences between copies.

Version control

  1. Use a version control system. Currently Git is the VCS of the day. Learn how to use it (introduction to git slide deck).
  2. Use a remote repository as your cloud backup. Keep it private during development and then make it public upon publication acceptance. Free services include GitHub and Bitbucket.
  3. Do not commit binary or large files directly into the repository; use git-lfs for them. Beware the Jupyter notebook. Do not commit notebooks with output, as this will cause the repository size to blow up, particularly if the output includes images. Clear the output before committing.
  4. Use the pre-commit framework to (1) improve compliance with the code style and (2) avoid committing large/binary files, AWS credentials and private keys. We all need a little help complying with our self-imposed constraints (example configuration file). Note that git pre-commit hooks do not prevent non-compliant commits, as a determined user can bypass the hooks with git commit --no-verify.
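
Clearing notebook output before a commit can be done from the Jupyter interface, but it is easy to forget. The following is a minimal sketch of doing it from a script, assuming the notebook is stored in the nbformat 4 JSON layout (a top-level "cells" list whose code cells carry "outputs" and "execution_count"). The function names are illustrative and not taken from any library.

```python
import json
import sys


def clear_outputs(notebook):
    """Remove outputs and execution counts from a notebook dict, in place."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def clear_file(path):
    """Rewrite the notebook at `path` with all code-cell outputs cleared."""
    with open(path, encoding="utf-8") as fp:
        notebook = json.load(fp)
    clear_outputs(notebook)
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(notebook, fp, indent=1, ensure_ascii=False)
        fp.write("\n")


if __name__ == "__main__":
    # Every command-line argument is treated as a path to a notebook.
    for notebook_path in sys.argv[1:]:
        clear_file(notebook_path)
```

The script overwrites the files it is given, so run it only on notebooks whose output you are willing to regenerate.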

Writing code (Python as a use case)

Many languages have style, testing and documentation tools and conventions. Here we focus on Python, but the concepts are similar for all languages.

  1. Style - Use consistent style and enforce it. Other human beings need to read the code and readily understand it. Write code that is compliant with PEP8 (the Python style guide):
    • Use flake8 to enforce PEP8.
    • Use the Black code formatter. It works for scripts and Jupyter notebooks (for Jupyter notebook support, pip install black[jupyter] instead of the regular pip install black). Its defaults do not completely agree with those of flake8 (line length, for example), so if you use both, configure flake8 to match.
    • Some folks don't like the Black formatting; it isn't all roses. An alternative is autopep8.
  2. Testing - Write nominal regression tests at the same time you implement the functionality. Non-rigorous regression testing is acceptable in a research setting as we explore various solutions. The more rigorous the testing, the easier it will be for a development team to get the code into production. Use pytest for this task.
  3. Documentation - Write the documentation while you are implementing. Start by adding a README file to your repository (use markdown or reStructuredText). It should include a general description of the repository contents, how to run the programs and possibly instructions on how to build them from source code. Generally, when we postpone writing documentation we will likely never do it. That's fine too, as long as you are willing to admit to yourself that you are consciously choosing not to document your code. In Python, use a consistent docstring format. Two popular ones are Google style and NumPy style.
  4. Reproducible environment - Include instructions or configuration files to reproduce the environment in which the code is expected to work. In Python you provide files listing all package dependencies, enabling the creation of the appropriate virtual environment in which to run the program: a requirements.txt for plain Python, or an environment.yml for the Anaconda Python distribution. For development we often rely on additional packages not required for usage (e.g. pytest). Consequently we include a requirements_dev.txt (environment_dev.yml) in addition to the requirements.txt (environment.yml) file. Sample requirements.txt, requirements_dev.txt and environment.yml, environment_dev.yml files.
  5. Parameters - Your code is a mathematical multi-parametric function that depends on many parameters beyond the input. These parameters are either:
    • Hard coded - best avoided if they need to be changed for different inputs.
    • Given as arguments on the command-line - appropriate when you have only a few, fewer than five. Several popular Python modules/packages support parsing command-line arguments: argparse, click and docopt. Personally I use argparse (example usage available here).
    • Specified in a configuration file - these usually use the XML or JSON formats. I use JSON (example configuration file and short script that reads it). The parameters file is given on the command-line, so we also get to use argparse.
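
The configuration-file approach can be sketched as follows. This is a minimal sketch and not the example script the guide links to: the names load_parameters, parse_args, main and the --verbose flag are assumptions made for illustration, and the docstring uses the NumPy style mentioned above.

```python
import argparse
import json


def load_parameters(path):
    """Read algorithm parameters from a JSON configuration file.

    Parameters
    ----------
    path : str
        Path to a JSON file whose top-level value is an object.

    Returns
    -------
    dict
        Parameter names mapped to their values.
    """
    with open(path, encoding="utf-8") as fp:
        parameters = json.load(fp)
    if not isinstance(parameters, dict):
        raise ValueError(f"{path}: expected a JSON object at the top level")
    return parameters


def parse_args(argv=None):
    """Parse the command line; argv=None means use sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="Run an analysis.")
    parser.add_argument("config", help="path to the JSON parameters file")
    parser.add_argument(
        "--verbose", action="store_true", help="print the parameters"
    )
    return parser.parse_args(argv)


def main(argv=None):
    """Load the parameters named on the command line and return them."""
    args = parse_args(argv)
    parameters = load_parameters(args.config)
    if args.verbose:
        for name, value in sorted(parameters.items()):
            print(f"{name} = {value}")
    return parameters
```

In a script, call main() from an if __name__ == "__main__": guard. Accepting argv as an argument keeps the functions callable from tests without touching the real command line.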

Continuous integration

Automate testing and possibly delivery using continuous integration. There are many CI services that readily integrate with remote hosted git services. In the past I've used TravisCI and CircleCI. Currently I am using GitHub Actions. All of these rely on YAML-based configuration files to define workflows.

An example GitHub Actions workflow, which runs the same checks as the pre-commit configuration described above, is available here.
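
Whichever CI service runs them, the regression tests themselves are ordinary Python. Below is a minimal sketch of a test file that pytest would collect if saved under a name such as test_geometry.py. The function circle_area and the file name are hypothetical, used only for illustration, and plain assert statements are used so the file has no dependency beyond the standard library.

```python
import math


def circle_area(radius):
    """Return the area of a circle; hypothetical function under test."""
    if radius < 0:
        raise ValueError("radius must be non-negative")
    return math.pi * radius**2


def test_circle_area_nominal():
    # Nominal regression check against a value computed by hand.
    assert math.isclose(circle_area(2.0), 4.0 * math.pi)


def test_circle_area_zero():
    assert circle_area(0.0) == 0.0


def test_circle_area_rejects_negative_radius():
    # Invalid input should fail loudly rather than return a number.
    try:
        circle_area(-1.0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for a negative radius")
```

In a real project the function would live in its own module and the test file would import it. Running pytest from the repository root, locally or in a CI workflow, executes every function whose name starts with test_.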

Owner
Ziv Yaniv