Collapse a set of redundant kmers to use IUPAC degenerate bases

Overview

kmer-collapse

Collapse a set of redundant kmers to use IUPAC degenerate bases

Overview

Given an input set of kmers, find the smallest set of kmers that encapsulates all diversity in the input set using IUPAC degenerate bases. This aims to solve the problem described here: https://www.biostars.org/p/9498272/

Usage

Install the marisa-trie library, if necessary.

Modify the script's input variable to specify desired sequences, and then run the script:

$ python kmer-collapse.py
{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "ACAAAAAAAA",
        "AGAAAAAAAA"
    ],
    "encoded_output": [
        "WAAAAAAAAA",
        "ASAAAAAAAA"
    ]
}

Notes

This has not been tested with any kmer sets but those examples provided. However, it aims to be scalable by pruning combinations of sub-kmers along the way, which would otherwise yield incorrect encodings. This also uses a trie for faster prefix testing. If futher performance is needed, some easy wins would be to cache sub-kmer tests, since most of these test outcomes would be redundant.

Additionally, no error checking is done on the input kmer alphabet or on the consistency of kmer lengths. It may be useful to validate input before using this script.

Examples

These examples are available from the script by uncommenting the relevant input.

A

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA"
    ],
    "encoded_output": [
        "WAAAAAAAAA"
    ]
}

B

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "GCGAAAAAAA"
    ],
    "encoded_output": [
        "GCGAAAAAAA",
        "WAAAAAAAAA"
    ]
}

C

{
    "input": [
        "AAAAAAAAAA"
    ],
    "encoded_output": [
        "AAAAAAAAAA"
    ]
}

D

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "CAAAAAAAAA",
        "GAAAAAAAAA"
    ],
    "encoded_output": [
        "NAAAAAAAAA"
    ]
}

E

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "TTAAAAAAAA",
        "ATAAAAAAAA"
    ],
    "encoded_output": [
        "WWAAAAAAAA"
    ]
}

F

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "CAAAAAAAAA",
        "GAAAAAAAAA",
        "TACAGATACA",
        "AACAGAAAAA"
    ],
    "encoded_output": [
        "NAAAAAAAAA",
        "TACAGATACA",
        "AACAGAAAAA"
    ]
}

G

{
    "input": [
        "AAAAAAAAAA",
        "TAAAAAAAAA",
        "ACAAAAAAAA",
        "AGAAAAAAAA"
    ],
    "encoded_output": [
        "ASAAAAAAAA",
        "WAAAAAAAAA"
    ]
}
Owner
Alex Reynolds
Pug caregiver, curler, cyclist, gardener, beginning French scholar
Alex Reynolds
CD for MachineLearnia

Codebase supporting my talk on CI/CD for MachineLearnia (Nov 12 2021) The dataset used is available here. The point of the talk is to demonstrate a si

0 Feb 23, 2022
FindUncommonShares.py is a Python equivalent of PowerView's Invoke-ShareFinder.ps1 allowing to quickly find uncommon shares in vast Windows Domains.

FindUncommonShares The script FindUncommonShares.py is a Python equivalent of PowerView's Invoke-ShareFinder.ps1 allowing to quickly find uncommon sha

Podalirius 184 Jan 03, 2023
A password genarator/manager for passwords uesing a pseudorandom number genarator

pseudorandom-password-genarator a password genarator/manager for passwords uesing a pseudorandom number genarator when you give the program a word eg

1 Nov 18, 2021
Problem statements on System Design and Software Architecture as part of Arpit's System Design Masterclass

Problem statements on System Design and Software Architecture as part of Arpit's System Design Masterclass

Relog 1.1k Jan 04, 2023
Weakly-Divisable - Takes an interger and seee if it is weakly divisible by seven

Weakly Divisble Project by Diana Arce-Hernandez, Ryan McAlpine, and Rommel Ravan

Diana Arce-Hernandez 1 Jan 12, 2022
Hoopoe - Get notified of important stuff, right away.

Hoopoe - Get notified of important stuff, right away. Report a Bug · Request a Feature . Ask a Question Table of Contents About Getting Started Prereq

Vahid Al 8 Nov 12, 2022
This is a simple analogue clock made with turtle in python...

Analogue-Clock This is a simple analogue clock made with turtle in python... Requirements None, only you need to have windows 😉 ...Enjoy! Installatio

Abhyush 3 Jan 14, 2022
MoBioTools A simple yet versatile toolkit to automatically setup quantum mechanics/molecular mechanics

A simple yet versatile toolkit to setup quantum mechanical/molecular mechanical (QM/MM) calculations from molecular dynamics trajectories.

MoBioChem 17 Nov 27, 2022
A faster Python generator that get function results from multi-process workers

multiyield This package implements a Python generator that get function results from multi-process workers. The faster_fifo Queue (instead of the stan

Xin Du 1 Nov 18, 2021
Donatus Prince 6 Feb 25, 2022
Simple Denial of Service Program yang di bikin menggunakan bahasa pemograman Python,

Peringatan Tujuan kami share code Indo-DoS hanya untuk bertujuan edukasi / pembelajaran! Dilarang memperjual belikan source ini / memperjual-belikan s

SonLyte 8 Nov 07, 2021
Whole-day timezone comparison

Timezone Converter Compare a full day of your local timezone with foreign ones $ timezone-converter tijuana --zone $ timezone-converter tijuana new_yo

Iago Alonso 12 Nov 24, 2022
Model synchronization from dbt to Metabase.

dbt-metabase Model synchronization from dbt to Metabase. If dbt is your source of truth for database schemas and you use Metabase as your analytics to

Mike Gouline 270 Jan 08, 2023
A Trace Explorer for Reverse Engineers

Tenet - A Trace Explorer for Reverse Engineers Overview Tenet is an IDA Pro plugin for exploring execution traces. The goal of this plugin is to provi

1k Jan 02, 2023
En este repositorio realizaré la tarea del laberinto.

Laberinto Perfil de GitHub del autor de este proyecto: @jmedina28 En este repositorio queda resuelta la composición de un laberinto 5x5 con sus muros

Juan Medina 1 Dec 11, 2021
C++ Environment InitiatorVisual Studio Code C / C++ Environment Initiator

Visual Studio Code C / C++ Environment Initiator Latest Version : v 1.0.1(2021/11/08) .exe link here About : Visual Studio Code에서 C/C++환경을 MinGW GCC/G

Junho Yoon 2 Dec 19, 2021
Catalogue CRUD Application

This Python program creates a relational SQL database hosted on the Snowflake platform, then opens a CRUD GUI to manipulate and view the data. In this application, it is used as a book catalogue. CUR

0 Dec 13, 2022
Safe temperature monitor for baby's room. Made for Raspberry Pi Pico.

Baby Safe Temperature Monitor This project is meant to build a temperature safety monitor for a baby or small child's room. Studies have shown the ris

Jeff Geerling 72 Oct 09, 2022
LinuxHelper - A collection of utilities for non-technical Linux users accessible via a GUI

Linux Helper A collection of utilities for non-technical Linux users accessible via a GUI This app is still in very early development, expect bugs and

Seth 7 Oct 03, 2022
Dump Data from FTDI Serial Port to Binary File on MacOS

Dump Data from FTDI Serial Port to Binary File on MacOS

pandy song 1 Nov 24, 2021