InDels analysis of CRISPR lines by NGS amplicon sequencing technology for a multicopy gene family.

Overview

CRISPRanalysis

InDels analysis of CRISPR lines by NGS amplicon sequencing technology for a multicopy gene family.

In this work, we present a workflow to analyze InDels from the multicopy α-gliadin gene family from wheat based on NGS data without the need to pre-viously establish a reference sequence for each genetic background. The pipeline was tested it in a multiple sample set, including three generations of edited wheat lines (T0, T1, and T2), from three different backgrounds and ploidy levels (hexaploid and tetraploid). Implementation of Bayesian optimization of Usearch parameters, inhouse Python, and bash scripts are reported.

Workflow:

Step1:

Bayesian optimization was implemented to optimize Usearch v9.2.64 parameters from merge to search steps for the α-gliadin amplicons on the wild type lines.

python Step1_Bayesian_usearch.py --database 
   
     --file_intervals 
    
      --trim_primers 
     
       --path_usearch_control 
      

      
     
    
   


Help:

Argument Help
--database File fasta with database sequences. Example: /path/to/database/database.fasta.
--file_intervals File with intervals for parameters. Example in /Examples/Example_intervals.txt.
--trim_primers Trim primers in reads if you use database without primers. Optios: YES | NO.
--path_usearch_control Path of usearch and control raw data separated by "," without white spaces. Example: /paht/to/usearch,/path/to/reads_control.


Outputs:

  • Bayesian_usearch.txt File with optimal values, optimal function value, samples or observations, obatained values and search space.
  • Bayesian.png Convergence plot.
  • Bayesian_data_res.txt File with the minf(x) after n calls in each iteration.

Step 2:

Usearch pipeline optimazed on wild type lines for studying results of optimization.

Step2_usearch_WT_to_DB.sh dif pct maxee amp id path_control name_dir_usearch path_database trim_primers


Help:

Arguments must be disposed in the order indicated before.

  • dif Optimal value for dif Usearch parameter.
  • pct Optimal value for pct Usearch parameter.
  • maxee Optimal value for maxee Usearch parameter.
  • amp Optimal value for amp Usearch parameter.
  • id Optimal value for id Usearch parameter.
  • path_control Path of the wild type lines fastq files.
  • name_dir_usearch Path of Usearch.
  • path_database Path of alpha-gliadin amplicon database.
  • trim_primers Trim primers in reads if you use database without primers. Optios: YES | NO.


Outputs:

Usearch merge files, filter files, unique amplicons file, unique denoised amplicon (Amp/otu) file, otu table file.

Step 3:

Usearch pipeline optimazed on all lines (wild types and CRISPR lines) for studying denoised unique amplicon relative abundances.

Step3_usearch_ALL_LINES.sh dif pct maxee amp id path_ALL name_dir_usearch trim_primers


Help:

Arguments must be disposed in the order indicated before.

  • dif Optimal value for dif Usearch parameter.
  • pct Optimal value for pct Usearch parameter.
  • maxee Optimal value for maxee Usearch parameter.
  • amp Optimal value for amp Usearch parameter.
  • id Optimal value for id Usearch parameter.
  • path_ALL Path of all lines (wild type and CRISPR lines) fastq files.
  • name_dir_usearch Path of Usearch.
  • trim_primers Trim primers in reads if you use database without primers. Optios: YES | NO.


Outputs:

Usearch merge files, filter files, unique amplicon file, unique denoised amplicon (Amp/otu) file, otu table file.

Before Step 4, otu table file must be normalized by TMM normalization method (edgeR package in R). Results of TMM normalized unique denoised amplicons table can be represented as heatmaps. Unique denoised amplicons can be compared between them to detect Insertions and Deletions (InDels) in CRISPR lines.

Step 4:

Create tables with the presence or absence of unique denoised amplicons in each CRISPR line compared to the wild type lines.

python Step4_usearch_to_table.py --file_otu 
   
     --file_group 
    
      --prefix_output 
     
       --genotype 
      

      
     
    
   


Help:

Argument Help
--file_otu File of TMM normalized otu_table from usearch. Remove "#OTU" from the first line.
--file_group Path to file of genotypes in wild type and CRISPR lines. Example in /Examples/Example_groups.txt.
--prefix_output Prefix to output name. Example: if you are working with BW208 groups: BW.
--genotype Genotype name. Example: if you are working with BW208 groups: BW208.

Default threshold 0.3 % of frequency of each unique denoised amplicon (Amp) in each line.


Outputs:

Substitute "name" in output names for the prefix_output string.

  • Amptable_frequency.txt Table of Amps (otus) transformed to frequencies for apply the threshold.
  • Amptable_brutes_name.txt Table with number of reads contained in the unique denoised amplicons (Amps) present in each line.
  • Amps_name.txt Table with number of unique denoised amplicons (Amps) in each line.

Python 3.6 or later is required.

INFO-H515 - Big Data Scalable Analytics

INFO-H515 - Big Data Scalable Analytics Jacopo De Stefani, Giovanni Buroni, Théo Verhelst and Gianluca Bontempi - Machine Learning Group Exercise clas

Yann-Aël Le Borgne 58 Dec 11, 2022
Generate lookml for views from dbt models

dbt2looker Use dbt2looker to generate Looker view files automatically from dbt models. Features Column descriptions synced to looker Dimension for eac

lightdash 126 Dec 28, 2022
A real-time financial data streaming pipeline and visualization platform using Apache Kafka, Cassandra, and Bokeh.

Realtime Financial Market Data Visualization and Analysis Introduction This repo shows my project about real-time stock data pipeline. All the code is

6 Sep 07, 2022
Data pipelines built with polars

valves Warning: the project is very much work in progress. Valves is a collection of functions for your data .pipe()-lines. This project aimes to host

14 Jan 03, 2023
📊 Python Flask game that consolidates data from Nasdaq, allowing the user to practice buying and selling stocks.

Web Trader Web Trader is a trading website that consolidates data from Nasdaq, allowing the user to search up the ticker symbol and price of any stock

Paulina Khew 21 Aug 30, 2022
Find exposed data in Azure with this public blob scanner

BlobHunter A tool for scanning Azure blob storage accounts for publicly opened blobs. BlobHunter is a part of "Hunting Azure Blobs Exposes Millions of

CyberArk 250 Jan 03, 2023
Fit models to your data in Python with Sherpa.

Table of Contents Sherpa License How To Install Sherpa Using Anaconda Using pip Building from source History Release History Sherpa Sherpa is a modeli

134 Jan 07, 2023
signac-flow - manage workflows with signac

signac-flow - manage workflows with signac The signac framework helps users manage and scale file-based workflows, facilitating data reuse, sharing, a

Glotzer Group 44 Oct 14, 2022
Hangar is version control for tensor data. Commit, branch, merge, revert, and collaborate in the data-defined software era.

Overview docs tests package Hangar is version control for tensor data. Commit, branch, merge, revert, and collaborate in the data-defined software era

Tensorwerk 193 Nov 29, 2022
pandas: powerful Python data analysis toolkit

pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive.

pandas 36.4k Jan 03, 2023
Data Intelligence Applications - Online Product Advertising and Pricing with Context Generation

Data Intelligence Applications - Online Product Advertising and Pricing with Context Generation Overview Consider the scenario in which advertisement

Manuel Bressan 2 Nov 18, 2021
Full automated data pipeline using docker images

Create postgres tables from CSV files This first section is only relate to creating tables from CSV files using postgres container alone. Just one of

1 Nov 21, 2021
Improving your data science workflows with

Make Better Defaults Author: Kjell Wooding [email protected] This is the git re

Kjell Wooding 18 Dec 23, 2022
A library to create multi-page Streamlit applications with ease.

A library to create multi-page Streamlit applications with ease.

Jackson Storm 107 Jan 04, 2023
SNV calling pipeline developed explicitly to process individual or trio vcf files obtained from Illumina based pipeline (grch37/grch38).

SNV Pipeline SNV calling pipeline developed explicitly to process individual or trio vcf files obtained from Illumina based pipeline (grch37/grch38).

East Genomics 1 Nov 02, 2021
A data parser for the internal syncing data format used by Fog of World.

A data parser for the internal syncing data format used by Fog of World. The parser is not designed to be a well-coded library with good performance, it is more like a demo for showing the data struc

Zed(Zijun) Chen 40 Dec 12, 2022
A lightweight interface for reading in output from the Weather Research and Forecasting (WRF) model into xarray Dataset

xwrf A lightweight interface for reading in output from the Weather Research and Forecasting (WRF) model into xarray Dataset. The primary objective of

National Center for Atmospheric Research 43 Nov 29, 2022
Retentioneering 581 Jan 07, 2023
Data and code accompanying the paper Politics and Virality in the Time of Twitter

Politics and Virality in the Time of Twitter Data and code accompanying the paper Politics and Virality in the Time of Twitter. In specific: the code

Cardiff NLP 3 Jul 02, 2022
Mining the Stack Overflow Developer Survey

Mining the Stack Overflow Developer Survey A prototype data mining application to compare the accuracy of decision tree and random forest regression m

1 Nov 16, 2021