Gapmm2: gapped alignment using minimap2 (align transcripts to genome)

Last update: Jan 27, 2022

Related tags

Overview

gapmm2: gapped alignment using minimap2

This tool is a wrapper for minimap2 to run spliced/gapped alignment, ie aligning transcripts to a genome. You are probably saying, yes minimap2 runs this with -x splice --cs option (you are correct). However, there are instances where the terminal exons from stock minimap2 alignments are missing. This tool detects those alignments that have unaligned terminal eons and uses edlib to find the terminal exon positions. The tool then updates the PAF output file with the updated information.

Rationale

We can pull out a gene model in GFF3 format that has a short 5' terminal exon:

scaffold_9	funannotate	gene	408904	409621	.	-	.	ID=OPO1_006919;
scaffold_9	funannotate	mRNA	408904	409621	.	-	.	ID=OPO1_006919-T1;Parent=OPO1_006919;product=hypothetical protein;
scaffold_9	funannotate	exon	409609	409621	.	-	.	ID=OPO1_006919-T1.exon1;Parent=OPO1_006919-T1;
scaffold_9	funannotate	exon	409320	409554	.	-	.	ID=OPO1_006919-T1.exon2;Parent=OPO1_006919-T1;
scaffold_9	funannotate	exon	409090	409255	.	-	.	ID=OPO1_006919-T1.exon3;Parent=OPO1_006919-T1;
scaffold_9	funannotate	exon	408904	409032	.	-	.	ID=OPO1_006919-T1.exon4;Parent=OPO1_006919-T1;
scaffold_9	funannotate	CDS	409609	409621	.	-	0	ID=OPO1_006919-T1.cds;Parent=OPO1_006919-T1;
scaffold_9	funannotate	CDS	409320	409554	.	-	2	ID=OPO1_006919-T1.cds;Parent=OPO1_006919-T1;
scaffold_9	funannotate	CDS	409090	409255	.	-	1	ID=OPO1_006919-T1.cds;Parent=OPO1_006919-T1;
scaffold_9	funannotate	CDS	408904	409032	.	-	0	ID=OPO1_006919-T1.cds;Parent=OPO1_006919-T1;

If we then map this transcript against the genome, we get the following PAF alignment.

$ minimap2 -x splice --cs genome.fasta cds-transcripts.fa | grep 'OPO1_006919'
OPO1_006919-T1	543	13	543	-	scaffold_9	658044	408903	409554	530	530	60	NM:i:0	ms:i:530	AS:i:466	nn:i:0	ts:A:+	tp:A:P	cm:i:167	s1:i:510	s2:i:0	de:f:0	rl:i:0	cs:Z::129~ct57ac:166~ct64ac:235

The --cs flag in minimap2 can be used to parse the coordinates (below) and you can see we are missing the 5' exon.

>>> cs2coords(408903, 13, 543, '-', ':129~ct57ac:166~ct64ac:235')
([(409320, 409554), (409090, 409255), (408904, 409032)],

So if we run this same alignment with gapmm2 we are able to properly align the 5' terminal exon.

$ gapmm2 genome.fa cds-transcripts.fa | grep 'OPO1_006919'
OPO1_006919-T1	543	0	543	-	scaffold_9	658044	408903	409621	543	543	60	tp:A:P	ts:A:+	NM:i:0	cs:Z::129~ct57ac:166~ct64ac:235~ct54ac:13

>>> cs2coords(408903, 0, 543, '-', ':129~ct57ac:166~ct64ac:235~ct54ac:13')
([(409609, 409621), (409320, 409554), (409090, 409255), (408904, 409032)]

Usage:

gapmm2 can be run as a command line script:

$ gapmm2
usage: gapmm2 [-o] [-t] [-m] [-d] [-h] [--version] reference query

gapmm2: gapped alignment with minimap2. Performs minimap2/mappy alignment with splice options and refines terminal alignments with edlib. Output is PAF format.

Positional arguments:
  reference         reference genome (FASTA)
  query             transcipts in FASTA or FASTQ

Optional arguments:
  -o , --out        output in PAF format (default: stdout)
  -t , --threads    number of threads to use with minimap2 (default: 3)
  -m , --min-mapq   minimum map quality value (default: 1)
  -d, --debug       write some debug info to stderr (default: False)

Help:
  -h, --help        Show this help message and exit
  --version         Show program's version number and exit

It can also be run as a python module. The splice_aligner function will return a list of lists containing PAF formatted data for each alignment and a dictionary of simple stats.

>>> from gapmm2.align import splice_aligner
>>> results, stats = splice_aligner('genome.fa', 'transcripts.fa')
>>> stats
{'n': 6926, 'low-mapq': 0, 'refine-left': 409, 'refine-right': 63}
>>> len(results)
6926
>>> results[0]
['OPO1_000001-T1', 2184, 0, 2184, '+', 'scaffold_1', 1803704, 887, 3127, 2184, 2184, 60, 'tp:A:P', 'ts:A:+', 'NM:i:0', 'cs:Z::958~gt56ag:1226']
>>>

To install the python package, you can do this with pip:

python -m pip install gapmm2

To install the most updated code in master you can run:

python -m pip install git+https://github.com/nextgenusfs/gapmm2.git

You might also like...

[NAACL & ACL 2021] SapBERT: Self-alignment pretraining for BERT.

Releases(v0.2.0)

v0.2.0(Jan 27, 2022)
supports output in GFF3 format as well as PAF

some fixes to the --version flag and readme

output is now sorted by contig and then contig start position

Source code(tar.gz)
Source code(zip)
v0.1.1(Jan 27, 2022)
add option to control terminal search space, -i, --max_intron

fix off-by-one bug on left minus alignments in cs flag

Source code(tar.gz)
Source code(zip)
v0.1.0(Jan 26, 2022)

Initial release of gapmm2.
Source code(tar.gz)
Source code(zip)

Gapmm2: gapped alignment using minimap2 (align transcripts to genome)

Related tags

Overview

gapmm2: gapped alignment using minimap2

Rationale

Usage:

You might also like...

[NAACL & ACL 2021] SapBERT: Self-alignment pretraining for BERT.

the code of the paper: Recurrent Multi-view Alignment Network for Unsupervised Surface Registration (CVPR 2021)

Pytorch implementation for "Implicit Feature Alignment: Learn to Convert Text Recognizer to Text Spotter".

The PyTorch improved version of TPAMI 2017 paper: Face Alignment in Full Pose Range: A 3D Total Solution.

🧠 A PyTorch implementation of 'Deep CORAL: Correlation Alignment for Deep Domain Adaptation.', ECCV 2016

An official implementation of the paper Exploring Sequence Feature Alignment for Domain Adaptive Detection Transformers

Collective Multi-type Entity Alignment Between Knowledge Graphs (WWW'20)

Seeing Dynamic Scene in the Dark: High-Quality Video Dataset with Mechatronic Alignment (ICCV2021)

The tl;dr on a few notable transformer/language model papers + other papers (alignment, memorization, etc).

Releases(v0.2.0)

v0.2.0(Jan 27, 2022)

v0.1.1(Jan 27, 2022)

v0.1.0(Jan 26, 2022)

Owner

Jon Palmer

Deep RGB-D Saliency Detection with Depth-Sensitive Attention and Automatic Multi-Modal Fusion (CVPR'2021, Oral)

Lighting the Darkness in the Deep Learning Era: A Survey, An Online Platform, A New Dataset

pyhsmm - library for approximate unsupervised inference in Bayesian Hidden Markov Models (HMMs) and explicit-duration Hidden semi-Markov Models (HSMMs), focusing on the Bayesian Nonparametric extensions, the HDP-HMM and HDP-HSMM, mostly with weak-limit approximations.

A computational optimization project towards the goal of gerrymandering the results of a hypothetical election in the UK.

Foreground-Action Consistency Network for Weakly Supervised Temporal Action Localization

A PyTorch implementation of " EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks."

YOLOX-RMPOLY

Learning Dense Representations of Phrases at Scale (Lee et al., 2020)

Self-supervised Product Quantization for Deep Unsupervised Image Retrieval - ICCV2021

Benchmark for Answering Existential First Order Queries with Single Free Variable

Expand human face editing via Global Direction of StyleCLIP, especially to maintain similarity during editing.

Code for the paper Relation Prediction as an Auxiliary Training Objective for Improving Multi-Relational Graph Representations (AKBC 2021).

Official implementation for the paper "Attentive Prototypes for Source-free Unsupervised Domain Adaptive 3D Object Detection"

Code repository for the paper "Doubly-Trained Adversarial Data Augmentation for Neural Machine Translation" with instructions to reproduce the results.

RL and distillation in CARLA using a factorized world model

Implementation of Invariant Point Attention, used for coordinate refinement in the structure module of Alphafold2, as a standalone Pytorch module

CNN visualization tool in TensorFlow

YOLOX + ROS(1, 2) object detection package

Covid-19 Test AI (Deep Learning - NNs) Software. Accuracy is the %96.5, loss is the 0.09 :)

A library for low-memory inferencing in PyTorch.