A pipeline that creates consensus sequences from Nanopore reads.

Overview
Authors: 
Ada Madejska, MCDB, UCSB  (contact: [email protected])
Nick Noll, UCSB

This pipeline takes error-prone Nanopore reads and builds consensus sequences from them, with the
aim of increasing the percent identity of the BLAST hits used to identify species. Reads in FASTQ
format are put through the pipeline, which includes the following steps.
1. Quality control 
    - very short and very long reads (reads whose length deviates strongly from the usual length
    of the 16S sequence) are dropped.
2. Kmer frequency matrix
    - build a kmer frequency matrix from the reads that pass the quality control step. The value
    of k can be changed (k=5 or 6 is recommended).
3. UMAP projection and HDBSCAN clustering
    - the kmer frequency matrix is used to create a UMAP projection, which is then clustered with
    HDBSCAN. The default parameters for the UMAP and HDBSCAN functions were chosen based on a mock
    dataset but can be changed.
4. Refinement 
    - in our tests on mock datasets, reads from different species sometimes clustered together.
    To prevent that, we include a refinement step that runs a multiple sequence alignment (MSA)
    with Clustal Omega on each cluster. The alignment outputs a guide tree, which is used to divide
    the cluster into smaller subclusters. The distance threshold can be changed to suit each dataset.
5. Consensus making
    - the last step creates a consensus sequence for each defined cluster by majority calling.
    The direction of the reads is fixed using minimap2, the alignment is performed by MAFFT, and
    the consensus is created using em_cons. The consensus sequences are then run through BLASTN
    to check the identity of each cluster.
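
Steps 1 and 2 can be sketched in a few lines of plain Python. This is an illustration only, not the
pipeline's own code: the function names, the length bounds (1200 to 1700 bases) and the choice to
normalise counts to frequencies per read are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def filter_by_length(reads, min_len=1200, max_len=1700):
    """Keep reads whose length lies within [min_len, max_len].

    The default bounds are assumed values around a full-length 16S
    sequence; they are not taken from the pipeline.
    """
    return [r for r in reads if min_len <= len(r) <= max_len]


def kmer_frequency_matrix(reads, k=5):
    """Return (kmers, matrix) with one row per read and one column per DNA k-mer."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    matrix = []
    for read in reads:
        seq = read.upper()
        counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        # Only k-mers made of A, C, G, T are counted; anything with N is skipped.
        total = sum(c for kmer, c in counts.items() if kmer in index)
        row = [0.0] * len(kmers)
        for kmer, c in counts.items():
            if kmer in index and total:
                row[index[kmer]] = c / total
        matrix.append(row)
    return kmers, matrix
```

With k=5 each read becomes a vector of 4^5 = 1024 frequencies, which is the kind of matrix a
dimensionality reduction method such as UMAP takes as input.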

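The majority calling in step 5 can be illustrated on an already aligned set of reads. The sketch
below is not a reimplementation of em_cons, which has its own scoring options; the function name
and the rule of dropping columns where a gap is the most common symbol are assumptions made for
the example.

```python
from collections import Counter


def majority_consensus(aligned, gap="-"):
    """Column-wise majority call over equal-length aligned sequences."""
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must all have the same length")
    consensus = []
    for column in zip(*aligned):
        base, _ = Counter(column).most_common(1)[0]
        if base != gap:  # drop columns where a gap is the most common symbol
            consensus.append(base)
    return "".join(consensus)
```

For the three aligned reads "ACGT-A", "ACGTTA" and "AGGT-A" this returns "ACGTA": the single
mismatch in the second column is outvoted, and the fifth column is dropped because most reads
have a gap there.
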
Software Dependencies:

To run the pipeline successfully, the following software needs to be installed.
1. Minimap2 - for the consensus making step (https://github.com/lh3/minimap2)
2. MAFFT - for alignment in the consensus making step (https://mafft.cbrc.jp/alignment/software/)
3. EM_CONS - for creating the consensus (http://emboss.sourceforge.net/apps/cvs/emboss/apps/cons.html)
4. BLASTN (NCBI BLAST+) - for identification of the consensus sequences in the database
    (https://ftp.ncbi.nlm.nih.gov/blast/executables/LATEST/) (a 16S database is also required)
5. CLUSTALO - for the refinement step (http://www.clustal.org/omega/)
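
A quick way to confirm that these tools are installed is to look them up on PATH before starting a
run. The executable names below are assumptions (they can differ between installations, for
example `cons` instead of `em_cons`), and `missing_tools` is a hypothetical helper, not part of the
pipeline.

```python
import shutil

# Assumed executable names; adjust them to match the local installation.
TOOLS = ["minimap2", "mafft", "em_cons", "blastn", "clustalo"]


def missing_tools(tools=TOOLS):
    """Return the names in `tools` that are not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]
```

Calling `missing_tools()` returns an empty list when every tool was found.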

Specifications:

This pipeline runs on Python 3.8.10 and Julia 1.4.1.

The following Python libraries are also required:
BioPython
hdbscan
matplotlib
pandas
sklearn (installed as scikit-learn)
umap (installed as umap-learn)
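
The Python libraries can be checked in a similar spirit without importing them. The import names
in `MODULES` are assumed to correspond to the libraries listed above (BioPython is imported as
`Bio`), and `missing_modules` is a hypothetical helper.

```python
import importlib.util

# Import names assumed to correspond to the libraries listed above.
MODULES = ["Bio", "hdbscan", "matplotlib", "pandas", "sklearn", "umap"]


def missing_modules(modules=MODULES):
    """Return the top-level module names that cannot be found in this environment."""
    return [m for m in modules if importlib.util.find_spec(m) is None]
```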

The following Julia packages are also required:
Pkg
DataFrames
CSV