ONT Analysis Toolkit (OAT)

Overview

Code style: black

                                       ,d
                                       88
              ,adPPYba,  ,adPPYYba, MM88MMM
              a8"     "8a ""     `Y8   88   
              8b       d8 ,adPPPPP88   88   
              "8a,   ,a8" 88,    ,88   88,  
              `"YbbdP"'  `"8bbdP"Y8   "Y888

ONT Analysis Toolkit (OAT)

A pipeline to facilitate sequencing of viral genomes (amplified with a tiled amplicon scheme) and assembly into consensus genomes. Supported viruses currently include Human betaherpesvirus 5 (CMV) and SARS-CoV-2. ONT sequencing data from a MinION can be monitored in real time with the rampart module. All analysis steps are handled by the analysis module, whereby sequencing data are analysed using a pipeline written in snakemake, with the choice of tools heavily influenced by the artic minion pipeline. Both steps can be run in order with a single command using the all module.

Author: Dr Charles Foster

Starting out

To begin with, clone this github repository:

git clone https://github.com/charlesfoster/ont-analysis-toolkit.git

cd ont-analysis-toolkit

Next, install most dependencies using conda:

conda env create -f environment.yml

Pro tip: if you install mamba, you can create the environment with that command instead of conda. A lot of conda headaches go away: it's a much faster drop-in replacement for conda.

conda install mamba
mamba env create -f environment.yml

Install the pipeline correctly by activating the conda environment and using the setup.py script with:

conda activate oat
pip install .

Other dependencies:

  • Demultiplexing is done using guppy_barcoder. The program will need to be installed and in your path.
  • If analysing SARS-CoV-2 data, lineages are typed using pangolin. Accordingly, pangolin needs to be installed according to instructions at https://github.com/cov-lineages/pangolin. The pipeline will fail if the pangolin environment can't be activated.
  • Variants are called using medaka and longshot. Ideally we could install these via mamba in the main environment.yml file, but there are sadly some necessary libraries for medaka that are incompatible with our main oat environment. Consequently, I have written the analysis pipeline so that snakemake automatically installs medaka and its dependencies into an isolated environment during execution of the oat pipeline. The environment is only created the first time you run the pipeline, but if you run the pipeline from a different working directory in the future, the environment will be created again. Solution: always run oat from the same working directory

tl;dr: you don't need to do anything for variant calling to work; just don't get confused during the initial pipeline run when the terminal indicates creation of a new conda environment. You should run oat from the same directory each time, otherwise a new conda environment will be created each time.

Usage

The environment with all necessary tools is installed as 'oat' for brevity. The environment should first be activated:

conda activate oat

Then, to run the pipeline, it's as simple as:

oat 
   

   

where should be replaced with the full path to a spreadsheet with minimal metadata for the ONT sequencing run (see example spreadsheet: run_data_example.csv). The most important thing to remember is that the 'run_name' in the spreadsheet must exactly match the name of the sequencing run, as set up in MinKNOW.

Note that there are many additional options/settings to take advantage of:

A pipeline for sequencing and analysis of viral genomes using an ONT MinION positional arguments: samples_file Path to file with sample metadata (.csv format). See example spreadsheet for minimum necessary information. optional arguments: -h, --help show this help message and exit -b , --barcode_kit Barcode kit that you used: '12' (SQK-RBK004) or '96' (SQK-RBK110-96) (default: 12) -c , --consensus_freq Variant allele frequency threshold for a variant to be incorporated into consensus genome. Variants below this frequency will be incorporated with an IUPAC ambiguity. Default: 0.8 -d, --demultiplex Demultiplex reads using guppy_barcoder. By default, assumes reads were already demultiplexed by MinKNOW. Reads are demultiplexed into the output directory. -f, --force Force overwriting of completed files in snakemake analysis (Default: files not overwritten) -n, --dry_run Dry run only -m rampart | analysis | all, --module rampart | analysis | all Pipeline module to run: 'rampart', 'analysis' or 'all' (rampart followed by analysis). Default: 'all' -o OUTDIR, --outdir OUTDIR Output directory. Default: /path/to/ont- analysis-toolkit/analysis_results + 'run_name' from samples spreadsheet --rampart_outdir RAMPART_OUTDIR Output directory. Default: /path/to/ont- analysis-toolkit/rampart_files -p, --print_dag Save directed acyclic graph (DAG) of workflow to outdir -r REFERENCE, --reference REFERENCE Reference genome to use: 'MN908947.3' (SARS-CoV-2), 'NC_006273.2' (CMV Merlin). Other references can be used, but the corresponding assembly (fasta) and annotation (gff3 from Ensembl) must be added to /home/cfos/miniconda3/envs/oat/lib/python3.9/site- packages/oat/references (Default: MN908947.3) -t , --threads Number of threads to use -v VARIANT_CALLER, --variant_caller VARIANT_CALLER Variant caller to use. Choices: 'medaka-longshot'. Default: 'medaka-longshot' --create_envs_only Create conda environments for snakemake analysis, but do no further analysis. Useful for initial pipeline setup. Default: False --snv_min SNV_MIN Minimum variant allele frequency for an SNV to be kept Default: 0.2 --delete_reads Delete demultiplexed reads after analysis --redo_analysis Delete entire analysis output directory and contents for a fresh run --version show program's version number and exit --minknow_data MINKNOW_DATA Location of MinKNOW data root. Default: /var/lib/minknow/data --max_memory Maximum memory (in MB) that you would like to provide to snakemake. Default: 53456MB --quiet Stop printing of snakemake commands to screen. --report Generate report (currently non-functional).">

                                       ,d
                                       88
              ,adPPYba,  ,adPPYYba, MM88MMM
              a8"     "8a ""     `Y8   88   
              8b       d8 ,adPPPPP88   88   
              "8a,   ,a8" 88,    ,88   88,  
              `"YbbdP"'  `"8bbdP"Y8   "Y888

        OAT: ONT Analysis Toolkit (version 0.2.0)

usage: oat [options] 
          
           

A pipeline for sequencing and analysis of viral genomes using an ONT MinION

positional arguments:
  samples_file          Path to file with sample metadata (.csv format). See
                        example spreadsheet for minimum necessary information.

optional arguments:
  -h, --help            show this help message and exit
  -b 
           
            , --barcode_kit 
            
             
                        Barcode kit that you used: '12' (SQK-RBK004) or '96'
                        (SQK-RBK110-96) (default: 12)
  -c 
             
              , --consensus_freq 
              
                Variant allele frequency threshold for a variant to be incorporated into consensus genome. Variants below this frequency will be incorporated with an IUPAC ambiguity. Default: 0.8 -d, --demultiplex Demultiplex reads using guppy_barcoder. By default, assumes reads were already demultiplexed by MinKNOW. Reads are demultiplexed into the output directory. -f, --force Force overwriting of completed files in snakemake analysis (Default: files not overwritten) -n, --dry_run Dry run only -m rampart | analysis | all, --module rampart | analysis | all Pipeline module to run: 'rampart', 'analysis' or 'all' (rampart followed by analysis). Default: 'all' -o OUTDIR, --outdir OUTDIR Output directory. Default: /path/to/ont- analysis-toolkit/analysis_results + 'run_name' from samples spreadsheet --rampart_outdir RAMPART_OUTDIR Output directory. Default: /path/to/ont- analysis-toolkit/rampart_files -p, --print_dag Save directed acyclic graph (DAG) of workflow to outdir -r REFERENCE, --reference REFERENCE Reference genome to use: 'MN908947.3' (SARS-CoV-2), 'NC_006273.2' (CMV Merlin). Other references can be used, but the corresponding assembly (fasta) and annotation (gff3 from Ensembl) must be added to /home/cfos/miniconda3/envs/oat/lib/python3.9/site- packages/oat/references (Default: MN908947.3) -t 
               
                , --threads 
                
                  Number of threads to use -v VARIANT_CALLER, --variant_caller VARIANT_CALLER Variant caller to use. Choices: 'medaka-longshot'. Default: 'medaka-longshot' --create_envs_only Create conda environments for snakemake analysis, but do no further analysis. Useful for initial pipeline setup. Default: False --snv_min SNV_MIN Minimum variant allele frequency for an SNV to be kept Default: 0.2 --delete_reads Delete demultiplexed reads after analysis --redo_analysis Delete entire analysis output directory and contents for a fresh run --version show program's version number and exit --minknow_data MINKNOW_DATA Location of MinKNOW data root. Default: /var/lib/minknow/data --max_memory 
                 
                   Maximum memory (in MB) that you would like to provide to snakemake. Default: 53456MB --quiet Stop printing of snakemake commands to screen. --report Generate report (currently non-functional). 
                 
                
               
              
             
            
           
          

What does the pipeline do?

RAMPART Module

All input files for RAMPART are generated based on your input spreadsheet, and a web browser is launched to view the sequencing in real time.

Analysis Module

Reads are mapped to the relevant reference genome with minimap2. Amplicon primers are trimmed using samtools ampliconclip. Variants are called using medaka and longshot, followed by filtering and consensus genome assembly using bcftools. The amino acid consequences of SNPs are inferred using bcftools csq. If analysing SARS-CoV-2, lineages are inferred using pangolin. Finally, a variety of sample QC metrics are combined into a final QC file.

Other Notes

Protocols

A protocol for the amplicon scheme needs (a) to be installed in the pipeline, and (b) named in the run_data.csv spreadsheet for analyses to work correctly. The pipeline comes with the Midnight protocol for SARS-CoV-2 pre-installed (https://www.protocols.io/view/sars-cov2-genome-sequencing-protocol-1200bp-amplic-bwyppfvn). Adding additional protocols is fairly easy:

  1. Make a directory called /path/to/ont-analysis-toolkit/oat/protocols/ARTICV3 (needs to be in all caps)

  2. Make a directory within ARTICV3 called 'rampart'

    (a) Put the normal rampart files within that directory (genome.json, primers.json, protocol.json, references.fasta)

  3. Make a directory within ARTICV3 called 'schemes'

    (a) Put the 'scheme.bed' file with primer coordinates in the 'schemes' directory

  4. Make sure you're in /path/to/ont-analysis-toolkit/, then activate the conda environment and use the following command: pip install .

Done!

Amino acid consequences

For the amino acid consequences step to work, a requirement is an annotation file for the chosen reference genome. The annotations must be in gff3 format, and must be in the 'Ensembl flavour' of gff3. There is a script included in the repository that can convert an NCBI gff3 file into an 'Ensembl flavour' gff3 file: /path/to/ont-analysis-toolkit/oat/scripts/gff2gff.py.

Credits

  • When this pipeline is used, citations should be found for the programs used internally.
  • The gff3 file I included for SARS-CoV-2 was originally sent to me by Torsten Seemann.
  • Being new to using snakemake + wrapper scripts, I used pangolin as a guide for directory structure and rule creation - so thanks to them.
  • The analysis module was heavily influenced by the ARTIC team, especially the artic minion pipeline.
  • gff2gff.py is based on work by Damien Farrell https://dmnfarrell.github.io/bioinformatics/bcftools-csq-gff-format
You might also like...
Red Team Toolkit is an Open-Source Django Offensive Web-App which is keeping the useful offensive tools used in the red-teaming together.
Red Team Toolkit is an Open-Source Django Offensive Web-App which is keeping the useful offensive tools used in the red-teaming together.

RedTeam Toolkit Note: Only legal activities should be conducted with this project. Red Team Toolkit is an Open-Source Django Offensive Web-App contain

A Static Analysis Tool for Detecting Security Vulnerabilities in Python Web Applications
A Static Analysis Tool for Detecting Security Vulnerabilities in Python Web Applications

This project is no longer maintained March 2020 Update: Please go see the amazing Pysa tutorial that should get you up to speed finding security vulne

SpiderFoot automates OSINT collection so that you can focus on analysis.
SpiderFoot automates OSINT collection so that you can focus on analysis.

SpiderFoot is an open source intelligence (OSINT) automation tool. It integrates with just about every data source available and utilises a range of m

Yuyu Scanner is a Web Reconnaissance & Web Analysis Scanner to find assets and information about targets.
Yuyu Scanner is a Web Reconnaissance & Web Analysis Scanner to find assets and information about targets.

Yuyu Scanner Yuyu Scanner is a Web Reconnaissance & Web Analysis Scanner to find assets and information about targets. installation ! run as root

ThePhish: an automated phishing email analysis tool
ThePhish: an automated phishing email analysis tool

ThePhish ThePhish is an automated phishing email analysis tool based on TheHive, Cortex and MISP. It is a web application written in Python 3 and base

Lazarus analysis tools and research report
Lazarus analysis tools and research report

Lazarus Research This repository publishes analysis reports and analysis tools for Operation Dream Job and Operation JTrack for Lazarus. Tools Python

IDA scripts for hypervisor (Hyper-v) analysis and reverse engineering automation
IDA scripts for hypervisor (Hyper-v) analysis and reverse engineering automation

Re-Scripts IA32-VMX-Helper (IDA-Script) IA32-MSR-Decoder (IDA-Script) IA32 VMX Helper It's an IDA script (Updated IA32 MSR Decoder) which helps you to

Android Malware (Analysis | Scoring) System
Android Malware (Analysis | Scoring) System

An Obfuscation-Neglect Android Malware Scoring System Quark-Engine is also bundled with Kali Linux, BlackArch. A trust-worthy, practical tool that's r

Malware arcane - Scripts and notes on my malware analysis journey

Malware Arcane Repository of notes and scripts I use when doing malware analysis

Releases(v0.10.1)
  • v0.10.1(Apr 27, 2022)

    While the 'oat' pipeline has undergone continuous development since its inception, this is the first designated release (well, pre-release). The pipeline should be fully functional, but please let me know if you encounter any errors or bugs.

    The pipeline has the most incorporated features for SARS-CoV-2, but works well with other viruses if you install them properly as per the instructions in the README.md file.

    Source code(tar.gz)
    Source code(zip)
This program is a WiFi cracker, you can test many passwords for a desired wifi to find the wifi password!

WiFi_Cracker About the Program: This program is a WiFi cracker! Just run code and select a desired wifi to start cracking 💣 Note: you can use this pa

Sina.f 13 Dec 08, 2022
M.E.A.T. - Mobile Evidence Acquisition Toolkit

M.E.A.T. - Mobile Evidence Acquisition Toolkit Meet M.E.A.T! From Jack Farley - BlackStone Discovery This toolkit aims to help forensicators perform d

1 Nov 11, 2021
Get important strings inside [Info.plist] & and Binary file also all output of result it will be saved in [app_binary].json , [app_plist_file].json file

Get important strings inside [Info.plist] & and Binary file also all output of result it will be saved in [app_binary].json , [app_plist_file].json file

12 Sep 28, 2022
Passphrase-wordlist - Shameless clone of passphrase wordlist

This repository is NOT official -- the original repository is located on GitLab

Jeff McJunkin 2 Feb 05, 2022
Genpyteal - Experiment to rewrite Python into PyTeal using RedBaron

genpyteal Converts Python to PyTeal. Your mileage will vary depending on how muc

Jason Livesay 9 Oct 19, 2022
Dumping revelant information on compromised targets without AV detection

DonPAPI Dumping revelant information on compromised targets without AV detection DPAPI dumping Lots of credentials are protected by DPAPI (link ) We a

Login Securite 580 Jan 09, 2023
Rouge Spammers with a mission to disrupt the peace of the valley ? Fear not we will STOMP the Spammers

Rouge Spammers with a mission to disrupt the peace of the valley ? Fear not we will STOMP the Spammers New Update : adding 'on-review' tag on an issue

A N U S H 13 Sep 19, 2021
Find existing email addresses by nickname using API/SMTP checking methods without user notification. Please, don't hesitate to improve cat's job! 🐱🔎 📬

mailcat The only cat who can find existing email addresses by nickname. Usage First install requirements: pip3 install -r requirements.txt Then just

282 Dec 30, 2022
The Decompressoin tool for Vxworks MINIFS

MINIFS-Decompression The Decompression tool for Vxworks MINIFS filesystem. USAGE python minifs_decompression.py [target_firmware] The example of Mercu

8 Jan 03, 2023
AIL LeakFeeder: A Module for AIL Framework that automate the process to feed leaked files automatically to AIL

AIL LeakFeeder: A Module for AIL Framework that automates the process to feed leaked files automatically to AIL, So basically this feeder will help you ingest AIL with your leaked files automatically

ail project 8 May 03, 2022
Yesitsme - Simple OSINT script to find Instagram profiles by name and e-mail/phone

Simple OSINT script to find Instagram profiles by name and e-mail/phone

108 Jan 07, 2023
StarUML cracker - StarUML cracker With Python

StarUML_cracker Usage On Linux Clone the repo. git clone https://github.com/mana

Bibek Manandhar 9 Jun 20, 2022
A simple way to store your passwords without requiring third party applications

SimplePasswordManager A simple way to store your passwords without requiring third party applications Simple To Use. Store Your Passwords For Each Web

Leone Odinga 1 Dec 23, 2021
An forensics tool to help aid in the investigation of spoofed emails based off the email headers.

A forensic tool to make analysis of email headers easy to aid in the quick discovery of the attacker. Table of Contents About mailMeta Installation Us

Syed Modassir Ali 59 Nov 26, 2022
Vulnerability Exploitation Code Collection Repository

Introduction expbox is an exploit code collection repository List CVE-2021-41349 Exchange XSS PoC = Exchange 2013 update 23 = Exchange 2016 update 2

0x0021h 263 Feb 14, 2022
"Video Moment Retrieval from Text Queries via Single Frame Annotation" in SIGIR 2022.

ViGA: Video moment retrieval via Glance Annotation This is the official repository of the paper "Video Moment Retrieval from Text Queries via Single F

Ran Cui 38 Dec 31, 2022
OLOP: One-Line & Obfuscated Python

OLOP: One-Line & Obfuscated Python This repository contains useful python modules for one-line and obfuscated python. pip install olop-ShadowLugia650

1 Jan 09, 2022
Tool-X is a kali linux hacking Tool installer.

Tool-X is a kali linux hacking Tool installer. Tool-X developed for termux and other Linux based systems. using Tool-X you can install almost 370+ hacking tools in termux app and other linux based di

Rajkumar Dusad 4.2k May 29, 2022
FOSSLight Scanner performs open source analysis after downloading the source by passing a link that can be cloned by wget or git.

FOSSLight Scanner Analyze at once for Open Source Compliance. FOSSLight Scanner performs open source analysis after downloading the source by passing

FOSSLight 8 Nov 03, 2022
Malware for Discord, designed to steal passwords, tokens, and inject discord folders for long-term use.

Vital What is Vital? Vital is malware primarily used to collect and extract information from the Discord desktop client. While it has other features (

HellSec 59 Dec 01, 2022