DNA sequence classification by Deep Neural Network

Overview

DNA sequence classification by Deep Neural Network: Project Overview

  • worked on the DNA sequence classification problem where the input is the DNA sequence and the output class states whether a certain histone protein is present on the sequence or not.
  • used one of the datasets from 12 different datasets that we have collected. The name of the dataset is H3K4me2
  • To represent a sequence, we have utilized k-mer representation
  • For the sequence embedding we have used one-hot encoding
  • Different word embedding models: Word2Vec, BERT, Keras Embedding layer, Bi-LSTM, and CNN

Bioinformatics Project - B.Sc. in Computer Science and Engineering (CSE)

Created by: - Md. Tarek Hasan, Mohammed Jawwadul Islam, Md Fahad Al Rafi, Arifa Akter, Sumayra Islam

Date of Completion: - Fall 2021 Trimester (Nov 2021 - Jan 2022)

Linkedin of Jawwadul

Linkedin of Tarek

Linkedin of Fahad

Linkedin of Arifa

Linkedin of Sumayra

Code and Resources Used

  • Python Version: 3.7.11
  • Packages: numpy, pandas, keras, tensorflow, sklearn
  • Dataset from: Nguyen who is one the authors of the paper titled “DNA sequence classification by convolutional neural network”

Features of the Dataset

DNA sequences wrapped around histone proteins are the subject of datasets

  • For our experiment, we selected one of the datasets entitled H3K4me2.
  • H3K4me2 has 30683 DNA sequences whose 18143 samples fall under the positive class, the rest of the samples fall under the negative class, and it makes the problem binary class classification.
  • The ratio of the positive-negative class is around (59:41)%.
  • The class label represents the presence of H3K4me2 histone proteins in the sequences.
  • The base length of the sequences is 500.

Data Preprocessing

  • The datasets were gathered in.txt format. We discovered that the dataset contains id, sequence, and class label during the Exploratory Data Analysis phase of our work.
  • We dropped the id column from the dataset because it is the only trait that all of the samples share.
  • Except for two samples, H3K4me2 includes 36799 DNA sequences, the majority of which are 500 bases long. Those two sequences have lengths of 310 and 290, respectively. To begin, we employed the zero-padding strategy to tackle the problem. However, because there are only two examples of varying lengths, we dropped those two samples from the dataset later for experiments, as these samples may cause noise.
  • we have used the K-mer sequence representation technique to represent a DNA sequence, we have used the K-mer sequence representation technique
  • For sequence emdedding after applying the 3-mer representation technique, we have experimented using different embedding techniques. The first three embedding methods are named SequenceEmbedding1D, SequenceEmbedding2D, SequenceEmbedding2D_V2, Word2Vec and BERT.
    • SequenceEmbedding1D is the one-dimensional representation of a single DNA sequence which is basically the one-hot encoding.
    • SequenceEmbedding2D is the two-dimensional representation of a single DNA sequence where the first row is the one-hot encoding of a sequence after applying 3-mer representation. The second row is the one-hot encoding of a left-rotated sequence after applying 3-mer representation.
    • the third row of SequenceEmbedding2D_V2 is the one-hot encoding of a right-rotated sequence after applying 3-mer representation.
    • Word2Vec and BERT are the word embedding techniques for language modeling.

Deep Learning Models

After the completion of sequence embedding, we have used deep learning models for the classification task. We have used two different deep learning models for this purpose, one is Convolutional Neural Network (CNN) and the other is Bidirectional Long Short-Term Memory (Bi-LSTM).

Experimental Analysis

After the data cleaning phase, we had 36797 samples. We have used 80% of the whole dataset for training and the rest of the samples for testing. The dataset has been split using train_test_split from sklearn.model_selection stratifying by the class label. We have utilized 10% of the training data for validation purposes. For the first five experiments we have used batch training as it was throwing an exception of resource exhaustion.

The evaluation metrics we used for our experiments are accuracy, precision, recall, f1-score, and Matthews Correlation Coefficient (MCC) score. The minimum value of accuracy, precision, recall, f1-score can be 0 and the maximum value can be 1. The minimum value of the MCC score can be -1 and the maximum value can be 1.

image

Discussion

MCC score 0 indicates the model's randomized predictions. The recall score indicates how well the classifier can find all positive samples. We can say that the model's ability to classify all positive samples has been at an all-time high over the last five experiments. The highest MCC score we received was 0.1573, indicating that the model is very near to predicting in a randomized approach. We attain a maximum accuracy of 60.27%, which is much lower than the state-of-the-art result of 71.77%. To improve the score, we need to emphasize more on the sequence embedding approach. Furthermore, we can experiment with various deep learning techniques.

Owner
Mohammed Jawwadul Islam Fida
CSE student. Founding Vice President of Students' International Affairs Society at CIAC, UIU
Mohammed Jawwadul Islam Fida
A lightweight library designed to accelerate the process of training PyTorch models by providing a minimal

A lightweight library designed to accelerate the process of training PyTorch models by providing a minimal, but extensible training loop which is flexible enough to handle the majority of use cases,

Chris Hughes 110 Dec 23, 2022
2020 CCF大数据与计算智能大赛-非结构化商业文本信息中隐私信息识别-第7名方案

2020CCF-NER 2020 CCF大数据与计算智能大赛-非结构化商业文本信息中隐私信息识别-第7名方案 bert base + flat + crf + fgm + swa + pu learning策略 + clue数据集 = test1单模0.906 词向量

67 Oct 19, 2022
Space-event-trace - Tracing service for spaceteam events

space-event-trace Tracing service for TU Wien Spaceteam events. This service is

TU Wien Space Team 2 Jan 04, 2022
Generative vs Discriminative: Rethinking The Meta-Continual Learning (NeurIPS 2021)

Generative vs Discriminative: Rethinking The Meta-Continual Learning (NeurIPS 2021) In this repository we provide PyTorch implementations for GeMCL; a

4 Apr 15, 2022
All public open-source implementations of convnets benchmarks

convnet-benchmarks Easy benchmarking of all public open-source implementations of convnets. A summary is provided in the section below. Machine: 6-cor

Soumith Chintala 2.7k Dec 30, 2022
PyTorch implementation of Towards Accurate Alignment in Real-time 3D Hand-Mesh Reconstruction (ICCV 2021).

Towards Accurate Alignment in Real-time 3D Hand-Mesh Reconstruction Introduction This is official PyTorch implementation of Towards Accurate Alignment

TANG Xiao 96 Dec 27, 2022
List of papers, code and experiments using deep learning for time series forecasting

Deep Learning Time Series Forecasting List of state of the art papers focus on deep learning and resources, code and experiments using deep learning f

Alexander Robles 2k Jan 06, 2023
This repository contains the implementation of the following paper: Cross-Descriptor Visual Localization and Mapping

Cross-Descriptor Visual Localization and Mapping This repository contains the implementation of the following paper: "Cross-Descriptor Visual Localiza

Mihai Dusmanu 81 Oct 06, 2022
CausaLM: Causal Model Explanation Through Counterfactual Language Models

CausaLM: Causal Model Explanation Through Counterfactual Language Models Authors: Amir Feder, Nadav Oved, Uri Shalit, Roi Reichart Abstract: Understan

Amir Feder 39 Jul 10, 2022
Replication attempt for the Protein Folding Model

RGN2-Replica (WIP) To eventually become an unofficial working Pytorch implementation of RGN2, an state of the art model for MSA-less Protein Folding f

Eric Alcaide 36 Nov 29, 2022
[NeurIPS 2021] “Improving Contrastive Learning on Imbalanced Data via Open-World Sampling”,

Improving Contrastive Learning on Imbalanced Data via Open-World Sampling Introduction Contrastive learning approaches have achieved great success in

VITA 24 Dec 17, 2022
Hough Transform and Hough Line Transform Using OpenCV

Hough transform is a feature extraction method for detecting simple shapes such as circles, lines, etc in an image. Hough Transform and Hough Line Transform is implemented in OpenCV with two methods;

Happy N. Monday 3 Feb 15, 2022
Material related to the Principles of Cloud Computing course.

CloudComputingCourse Material related to the Principles of Cloud Computing course. This repository comprises material that I use to teach my Principle

Aniruddha Gokhale 15 Dec 02, 2022
Python scripts form performing stereo depth estimation using the HITNET model in Tensorflow Lite.

TFLite-HITNET-Stereo-depth-estimation Python scripts form performing stereo depth estimation using the HITNET model in Tensorflow Lite. Stereo depth e

Ibai Gorordo 22 Oct 20, 2022
Symbolic Parallel Adaptive Importance Sampling for Probabilistic Program Analysis in JAX

SYMPAIS: Symbolic Parallel Adaptive Importance Sampling for Probabilistic Program Analysis Overview | Installation | Documentation | Examples | Notebo

Yicheng Luo 4 Sep 13, 2022
ICNet and PSPNet-50 in Tensorflow for real-time semantic segmentation

Real-Time Semantic Segmentation in TensorFlow Perform pixel-wise semantic segmentation on high-resolution images in real-time with Image Cascade Netwo

Oles Andrienko 219 Nov 21, 2022
Supervised Sliding Window Smoothing Loss Function Based on MS-TCN for Video Segmentation

SSWS-loss_function_based_on_MS-TCN Supervised Sliding Window Smoothing Loss Function Based on MS-TCN for Video Segmentation Supervised Sliding Window

3 Aug 03, 2022
Adversarial-autoencoders - Tensorflow implementation of Adversarial Autoencoders

Adversarial Autoencoders (AAE) Tensorflow implementation of Adversarial Autoencoders (ICLR 2016) Similar to variational autoencoder (VAE), AAE imposes

Qian Ge 236 Nov 13, 2022
以孤立语假设和宽度优先搜索为基础,构建了一种多通道堆叠注意力Transformer结构的斗地主ai

ddz-ai 介绍 斗地主是一种扑克游戏。游戏最少由3个玩家进行,用一副54张牌(连鬼牌),其中一方为地主,其余两家为另一方,双方对战,先出完牌的一方获胜。 ddz-ai以孤立语假设和宽度优先搜索为基础,构建了一种多通道堆叠注意力Transformer结构的系统,使其经过大量训练后,能在实际游戏中获

freefuiiismyname 88 May 15, 2022
PyTorch implementation of ENet

PyTorch-ENet PyTorch (v1.1.0) implementation of ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation, ported from the lua-torc

David Silva 333 Dec 29, 2022