SNV calling pipeline developed explicitly to process individual or trio vcf files obtained from Illumina based pipeline (grch37/grch38).

Overview

SNV Pipeline

SNV calling pipeline developed explicitly to process individual or trio vcf files obtained from Illumina based pipeline (grch37/grch38). The pipeline requires user defined datasets & annotation sources, available tools and input set of vcf files. It generates analysis scripts that can be incorporated into high performance cluster (HPC) computing to process the samples. This results in list of filtered variants per family that can be used for interpreation, reporting and further downstream analysis.

For demonstration purpose below example is presented for GRCh37. However, the same can be replicated for GRCh38.

Installation

git clone https://github.com/ajaarma/snv.git

Required Installation packages

Install anaconda v2.0
Follow this link for installation: https://docs.anaconda.com/anaconda/install/linux/
Conda environment commands
$ conda create --name snv
$ source activate snv
$ conda install python=2.7.16
$ pip install xmltodict
$ pip install dicttoxml

$ conda install -c bioconda gvcfgenotyper
$ conda install -c anaconda gawk	
$ conda install samtools=1.3
$ conda install vcftools=0.1.14
$ conda install bcftools=1.9
$ conda install gcc #(OSX)
$ conda install gcc_linux-64 #(Linux)
$ conda install parallel
$ conda install -c mvdbeek ucsc_tools
** conda-develop -n 
    
    
     /demo/softwares/vep/Plugins/

$ conda install -c r r-optparse
$ conda install -c r r-dplyr
$ conda install -c r r-plyr
$ conda install -c r r-data.table
$ conda install -c aakumar r-readbulk
$ conda install -c bioconda ensembl-vep=100.4
$ vep_install -a cf -s homo_sapiens -y GRCh37 -c 
     
      /demo/softwares/vep/grch37 --CONVERT
$ vep_install -a cf -s homo_sapiens -y GRCh38 -c 
      
       /demo/softwares/vep/grch38 --CONVERT

      
     
    
   

Data directory and datasets

Default datasets provided
1. exac_pli: demo/resources/gnomad/grch37/gnomad.v2.1.1.lof_metrics.by_transcript_forVEP.txt
2. ensembl: demo/resources/ensembl/grch37/ensBioMart_grch37_v98_ENST_lengths_191208.txt
3. region-exons: demo/resources/regions/grch37/hg19_refseq_ensembl_exons_50bp_allMT_hgmd_clinvar_20200519.txt
4. region-pseudo-autosomal: demo/resources/regions/grch37/hg19_non_pseudoautosomal_regions_X.txt
5. HPO: demo/resources/hpo/phenotype_to_genes.tar.gz
Other datasets that require no entry to user-configuration file
6. Curated: 
	6.1. Genelist: demo/resources/curated/NGC_genelist_allNamesOnly-20200519.txt
	6.2. Somatic mosaicism genes: demo/resources/curated/haem_somatic_mosaicism_genes_20191015.txt
	6.3. Imprinted gene list: demo/resources/curated/imprinted_genes_20200424.txt
	6.4. Polymorphic gene list: demo/resources/curated/polymorphic_genes_20200509.txt
7. OMIM: demo/resources/omim/omim_20200421_geneInfoBase.txt

Download link for following dataset and place them in corresponding directories as shown

' | awk -v OFS="\t" '{ if(/^#/){ print }else{ print $1,$2,$3,$4,$5,$6,$7,"ID="$3";"$8 } }' | bgzip -c > hgmd_pro_2019.4_hg19_wID.vcf.gz $ bcftools index -t hgmd_pro_2019.4_hg19_wID.vcf.gz $ bcftools index hgmd_pro_2019.4_hg19_wID.vcf.gz Put it in this directory: demo/resources/hgmd/grch37/hgmd_pro_2019.4_hg19_wID.vcf.gz Edit the user config flat file CONFIG/UserConfig.txt : hgmd=hgmd/grch37/hgmd_pro_2019.4_hg19_wID.vcf.gz 8. CLINVAR: Download link: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/weekly/clinvar_20200506.vcf.gz https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/weekly/clinvar_20200506.vcf.gz.tbi Put it in this directory: demo/resources/clinvar/grch37/clinvar_20200506.vcf.gz Edit the user config flat file CONFIG/UserConfig.txt : clinvar=clinvar/grch37/clinvar_20200506.vcf.gz ">
1. HPO: Extract HPO phenotypes mapping:
	$ cd 
   
    /demo/resources/hpo/
	$ tar -zxvf phenotypes_to_genes.tar.gz 

2. REFERENCE SEQUENCE GENOME (FASTA file alongwith Index)
	Download link: https://drive.google.com/drive/folders/1Ro3pEYhVdYkMmteSr8YRPFeTvb_K0lVf?usp=sharing
	Download file: Homo_sapiens.GRCh37.74.dna.fasta
		Get the corresponding index and dict files: *.fai and *.dict
	Put this in folder: demo/resources/genomes/grch37/Homo_sapiens.GRCh37.74.dna.fasta

3. GNOMAD
	Download link (use wget): 
	Genomes: https://storage.googleapis.com/gnomad-public/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz
	Exomes: https://storage.googleapis.com/gnomad-public/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.vcf.bgz
	Put it in this folder: 
		demo/resources/gnomad/grch37/gnomad.genomes.r2.1.1.sites.vcf.bgz
		demo/resources/gnomad/grch37/gnomad.exomes.r2.1.1.sites.vcf.bgz
	Edit User config flat file CONFIG/UserConfig.txt : 
		gnomad_g=gnomad/grch37/gnomad.genomes.r2.1.1.sites.vcf.bgz
		gnomad_e=gnomad/grch37/gnomad.exomes.r2.1.1.sites.vcf.bgz

4. ExAC:
	Download Link: https://drive.google.com/drive/folders/11Ya8XfAxOYmlKZ9mN8A16IDTLHdHba_0?usp=sharing
	Download file: ExAC.r0.3.1.sites.vep.decompose.norm.prefixed_PASS-only.vcf.gz
		also the index files (*.csi and *.tbi)
	Put it in this folder as: 
		demo/resources/exac/grch37/ExAC.r0.3.1.sites.vep.decompose.norm.prefixed_PASS-only.vcf.gz
	Edit User config flat file CONFIG/UserConfig.txt : 
		exac=exac/grch37/ExAC.r0.3.1.sites.vep.decompose.norm.prefixed_PASS-only.vcf.gz
		exac_t=exac/grch37/ExAC.r0.3.1.sites.vep.decompose.norm.prefixed_PASS-only.vcf.gz

5. CADD:
	Download link (use wget):
		https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh37/whole_genome_SNVs.tsv.gz
		https://krishna.gs.washington.edu/download/CADD/v1.6/GRCh37/InDels.tsv.gz
		(Also download the corresponding tabix index files as well)
	Put it in this directory: 
		demo/resources/cadd/grch37/whole_genome_SNVs.tsv.gz
		demo/resource/cadd/grch37/InDels.tsv.gz
	Edit the user config flat file CONFIG/UserConfig.txt :
		cadd_snv=cadd/grch37/whole_genome_SNVs.tsv.gz
		cadd_indel=cadd/grch37/InDels.tsv.gz

6. REVEL:
	Download link: https://drive.google.com/drive/folders/12Tl1YU5bI-By_VawTPVWHef7AXzn4LuP?usp=sharing
	Download file: new_tabbed_revel.tsv.gz
	         Also the index file: *.tbi
	Put it in this directory: demo/resources/revel/grch37/new_tabbed_revel.tsv.gz
	Edit the user config flat file CONFIG/UserConfig.txt : 
		revel=revel/grch37/new_tabbed_revel.tsv.gz

7. HGMD:
	Download link: http://www.hgmd.cf.ac.uk/ac/index.php (Require personal access login)
	Put it in this directory: demo/resources/hgmd/grch37/hgmd_pro_2019.4_hg19.vcf

	Use this command to process HGMD file inside this directory:
		$ cat hgmd_pro_2019.4_hg19.vcf | sed '/##comment=.*\"/a  ##INFO=
    
     ' | awk -v OFS="\t" '{ if(/^#/){ print }else{ print $1,$2,$3,$4,$5,$6,$7,"ID="$3";"$8 } }' | bgzip -c  > hgmd_pro_2019.4_hg19_wID.vcf.gz
		$ bcftools index -t hgmd_pro_2019.4_hg19_wID.vcf.gz
		$ bcftools index hgmd_pro_2019.4_hg19_wID.vcf.gz	

	Put it in this directory: demo/resources/hgmd/grch37/hgmd_pro_2019.4_hg19_wID.vcf.gz
	Edit the user config flat file CONFIG/UserConfig.txt :
		hgmd=hgmd/grch37/hgmd_pro_2019.4_hg19_wID.vcf.gz

8. CLINVAR:
	Download link: 
		https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/weekly/clinvar_20200506.vcf.gz
		https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/weekly/clinvar_20200506.vcf.gz.tbi
	Put it in this directory: demo/resources/clinvar/grch37/clinvar_20200506.vcf.gz
	Edit the user config flat file CONFIG/UserConfig.txt :
		clinvar=clinvar/grch37/clinvar_20200506.vcf.gz

    
   
Customized Curated Annotation sets
Default present with this distribution. Can be found in XML file with these tags:
	(1) GeneList: 
   
    
	(2) Somatic mosaicism genes: 
    
     
	(3) Imprinted genes: 
     
      
	(4) Polymorphic genes: 
      
       
	(5) HPO terms: 
       
         (6) OMIM: 
         
        
       
      
     
    
   

Activate the conda environment

$ source activate snv

Step - 1:

1. Edit CONFIG/UserConfig.txt: 
	(a) Add the absolute path prefix for the resources directory with tag: resourceDir. 
	    An example can be seen in CONFIG/Example-UserConfig.txt file.
	(b) Manually check if datasets corresponding to other field tags are correctly downloaded and 
	    put in respective folders.
2. Create user defined XML file from input User Configuration flat file and Base-XML file
Command:
$ python createAnalysisXML.py -u 
   
     
		    	      -b 
    
      
		              -o 
     

     
    
   
Example:
$ python createAnalysisXML.py -u CONFIG/UserConfig.txt 
		       	      -b CONFIG/Analysis_base_grch37.xml 
		              -o CONFIG/Analysis_user_grch37.xml
Outputs:
CONFIG/Analysis_user_grch37.xml

Step-2:

1. Put the respective vcf files in the directory. For example: demo/example/vcf/ 
2. Create manifest file in same format as shown in demo/example/example_manifest.txt
3. Assign gender to each family members (illumina or sample id). For example: demo/example/example_genders.txt
4. List of all the family ids that needed to be analyzed.
         For e.g: demo/example/manifest/example_family_analysis.txt

Step -3:

Generate all the shell scripts that can be incorporated into user specific HPC cluster network. For e.g: Slurm/PBS/LSF network.

Command:
$ python processSNV.py 	-a 
   
    
	    		-p 
    
     
	      		-m 
     
      
	     		-e 
      
       
			-w 
       
         -g 
        
          -d 
         
           -f 
          
            -s 
           
             -r 
             
            
           
          
         
        
       
      
     
    
   
Example:
$ python processSNV.py 	-a CONFIG/Analysis_user_grch37.xml \ 
			-p 20210326 \
			-m 
   
    /demo/example/example_manifest.txt \
			-e gvcfGT \
			-w 
    
     /demo/example/ \
			-g 
     
      /demo/example/example_genders.txt \
			-d 
      
       /demo/example/exeter_samples_norm.fof \
			-f 
       
        /demo/example/manifest/example_family_analysis.txt \ -s 
        
         /demo/example/manifest/example_family.fof \ -r 
         
          /demo/example/manifest/example_family_header.txt (optional) 
         
        
       
      
     
    
   
Outputs:
Two scripts in the directory: 
   
    /demo/example/20210326/tmp_binaries/
Launch the scripts in these 2 stages sequentially after each of them gets finished.

   (1) genotypeAndAnnotate_chr%.sh where %=1..22,X,Y and MT
	scatter the annotation and frequency filtering per chromosome for all families.
   (2) mergeAndFilter.sh:
	Merge all the chromosome and apply inheritance filtering.

   

Step-4

Final output of list of filtered variant is present in:

   
    /demo/example/20210326/fam_filter/
    
     /
     
      .filt_
      
       .txt

      
     
    
   
For any questions/issues/bugs please mail us at:
Owner
East Genomics
Bringing together genomic medicine across the East Midlands and East of England
East Genomics
A tool to compare differences between dataframes and create a differences report in Excel

similarpanda A module to check for differences between pandas Dataframes, and generate a report in Excel format. This is helpful in a workplace settin

Andre Pretorius 9 Sep 15, 2022
MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation [ECCV2020]

MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation [ECCV2020] by Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wa

112 Dec 28, 2022
Extract Thailand COVID-19 Cluster data from daily briefing pdf.

Thailand COVID-19 Cluster Data Extraction About Extract Clusters from Thailand Daily COVID-19 briefing PDF Download latest data Here. Data will be upd

Noppakorn Jiravaranun 5 Sep 27, 2021
Snakemake workflow for converting FASTQ files to self-contained CRAM files with maximum lossless compression.

Snakemake workflow: name A Snakemake workflow for description Usage The usage of this workflow is described in the Snakemake Workflow Catalog. If

Algorithms for reproducible bioinformatics (Koesterlab) 1 Dec 16, 2021
Numerical Analysis toolkit centred around PDEs, for demonstration and understanding purposes not production

Numerics Numerical Analysis toolkit centred around PDEs, for demonstration and understanding purposes not production Use procedure: Initialise a new i

George Whittle 1 Nov 13, 2021
Yet Another Workflow Parser for SecurityHub

YAWPS Yet Another Workflow Parser for SecurityHub "Screaming pepper" by Rum Bucolic Ape is licensed with CC BY-ND 2.0. To view a copy of this license,

myoung34 8 Dec 22, 2022
We're Team Arson and we're using the power of predictive modeling to combat wildfires.

We're Team Arson and we're using the power of predictive modeling to combat wildfires. Arson Map Inspiration There’s been a lot of wildfires in Califo

Jerry Lee 3 Oct 17, 2021
yt is an open-source, permissively-licensed Python library for analyzing and visualizing volumetric data.

The yt Project yt is an open-source, permissively-licensed Python library for analyzing and visualizing volumetric data. yt supports structured, varia

The yt project 367 Dec 25, 2022
TheMachineScraper πŸ±β€πŸ‘€ is an Information Grabber built for Machine Analysis

TheMachineScraper πŸ±β€πŸ‘€ is a tool made purely for analysing machine data for any reason.

doop 5 Dec 01, 2022
Analyzing Covid-19 Outbreaks in Ontario

My group and I took Covid-19 outbreak statistics from ontario, and analyzed them to find different patterns and future predictions for the virus

Vishwaajeeth Kamalakkannan 0 Jan 20, 2022
Includes all files needed to satisfy hw02 requirements

HW 02 Data Sets Mean Scale Score for Asian and Hispanic Students, Grades 3 - 8 This dataset provides insights into the New York City education system

7 Oct 28, 2021
Pandas and Spark DataFrame comparison for humans

DataComPy DataComPy is a package to compare two Pandas DataFrames. Originally started to be something of a replacement for SAS's PROC COMPARE for Pand

Capital One 259 Dec 24, 2022
Py-price-monitoring - A Python price monitor

A Python price monitor This project was focused on Brazil, so the monitoring is

Samuel 1 Jan 04, 2022
Stock Analysis dashboard Using Streamlit and Python

StDashApp Stock Analysis Dashboard Using Streamlit and Python If you found the content useful and want to support my work, you can buy me a coffee! Th

StreamAlpha 27 Dec 09, 2022
WAL enables programmable waveform analysis.

This repro introcudes the Waveform Analysis Language (WAL). The initial paper on WAL will appear at ASPDAC'22 and can be downloaded here: https://www.

Institute for Complex Systems (ICS), Johannes Kepler University Linz 40 Dec 13, 2022
An experimental project I'm undertaking for the sole purpose of increasing my Python knowledge

5ePy is an experimental project I'm undertaking for the sole purpose of increasing my Python knowledge. #Goals Goal: Create a working, albeit lightwei

Hayden Covington 1 Nov 24, 2021
Autopsy Module to analyze Registry Hives based on bookmarks provided by EricZimmerman for his tool RegistryExplorer

Autopsy Module to analyze Registry Hives based on bookmarks provided by EricZimmerman for his tool RegistryExplorer

Mohammed Hassan 13 Mar 31, 2022
Python Package for DataHerb: create, search, and load datasets.

The Python Package for DataHerb A DataHerb Core Service to Create and Load Datasets.

DataHerb 4 Feb 11, 2022
A set of tools to analyse the output from TraDIS analyses

QuaTradis (Quadram TraDis) A set of tools to analyse the output from TraDIS analyses Contents Introduction Installation Required dependencies Bioconda

Quadram Institute Bioscience 2 Feb 16, 2022
Efficient matrix representations for working with tabular data

Efficient matrix representations for working with tabular data

QuantCo 70 Dec 14, 2022