Location of public benchmarking; primarily final results

Overview

CSL_public_benchmark

This repo is intended to provide a periodically-updated, public view into genome sequencing benchmarks managed by HudsonAlpha's Clinical Services Lab (CSL). The benchmarking results primarily provide the CSL a systematic approach to evalute various reference genome, aligner, and variant caller combinations against each other. All of the datasets we used for testing were generated at HudsonAlpha. The short-read PCR-free datasets were generated using standard clinical processes in the CSL and are currently private datasets. The long-read PacBio datasets were generated by the Genome Sequencing Center and are publicly hosted through the Genome in a Bottle consortium (see below).

The benchmarks or "truth sets" themselves are large-scale publicly available benchmarks created for a handful of reference samples. Most of the benchmarks we use were generated by the Genome in a Bottle (GIAB) Consortium.

Current status

This initial release just includes the final results files that are reviewed after the pipelines have completed.

What is the pipeline?

The benchmarking pipeline itself is maintained in a private repo. Briefly, it is a snakemake pipeline that built around a systematic final evaluation that mostly uses RTG vcfeval to measure sensitivity and precision. The primary "wildcards" in this evaluation are the reference, the aligner, and the variant caller; with versioning where appropriate. This allows us to quickly add new tools by defining new rules to run a particular tool (typically one per aligner or caller), and then evaluate in a standard way. In general, we try to use docker images or conda environments when these are already available to increase downstream portability; however, these are not always available.

As a result, many rules are tied to our cluster ecosystem, either through modules and/or file paths to installed software. Additionally, all the metadata (e.g. fastq pairs for a given sample) is tracked using an internal system. This means that this pipeline, even if publically available, would definitely not run "out-of-the-box" for anyone outside of HudsonAlpha. A very long-term goal would be to create a public version that can run out-of-the-box given user-provided metadata.

However, in the interest of transparency, we will be making efforts to clarify any questions about the implementation over time. This will largely be driven by questions we receive from the community (i.e. create issues if you have questions, so we can begin tackling this). Examples of things already on the TODO radar:

  • Rules for aligners and callers
  • Rules for evaluation
  • Description/links to specific reference files
Comments
  • truvari options

    truvari options

    Hi, Thanks for testing dysgu, I have a few questions about whats causing the low precision values of dysgu. Firstly, I just wanted to compare notes about how you test using truvari. I have tested with: grep '#\|PASS' HG002.dysgu.vcf | grep '#\|DEL' > HG002.dysgu_pass.del.vcf; bgzip HG002.dysgu_pass.del.vcf; tabix HG002.dysgu_pass.del.vcf; truvari bench -b HG002_SVs_Tier1_v0.6.vcf.gz --includebed HG002_SVs_Tier1_v0.6_include.bed --sizemax 260000000 --giabreport -c HG002.dysgu_pass.del.vcf.gz -o truvari_dysgu_pass_only_del --pctsim 0 -s 50 --passonly Also was interested to know what coverage and read length your samples are. Finally I noticed in the dysgu results the stdev of the precision values seemed very high, for example on the hg38 GIAB masked dragmap-1.2.1 precision score was 0.4321+-0.2749. Im not sure whats causing this, but possibly the insert size metrics are not being worked out properly? This should be available in the log file from dysgu. Thanks!

    opened by kcleal 10
  • Adds the tandem-repeat option to sniffles

    Adds the tandem-repeat option to sniffles

    • This adds the tandem-repeat option under temporary name "sniffles_tr-{version}"; this will eventually replace "sniffles-{version}"
    opened by holtjma 2
  • Deletion split

    Deletion split

    • Splits deletion benchmark into RESTRICTED (requiring a high-confidence BED file) and UNRESTRICTED (no BED files, it's everything in the benchmark VCF)
    • RESTRICTED now only contains HG002
    • UNRESTRICTED has both HG001 and HG002, but with caveats around precision
    • Sniffles v2.0.2 happened to drop while I was testing this, so I went ahead and added it (it ran quite fast)
    opened by holtjma 2
  • Benchmark metadata

    Benchmark metadata

    • Adds a benchmarks folder to describe where benchmark files came from
    • Small variants and CMRG both have a simple shell script for downloading the exact files used
    • Deletion files are originally in hg19, and a semi-manual process was used to liftover the files to hg38. The steps for this process are described in the README and the final VCF files are stored in this repo due to their relatively small size.
    opened by holtjma 1
  • References metadata

    References metadata

    Primarily adds reference metadata for the two primary reference files we are currently testing with:

    • hg38_asm5_alt - includes scripts for download, creating ALT contigs, and the final alt files
    • hg38_GIAB_masked - includes scripts for download and the dummy alt file
    opened by holtjma 1
  • Add more information regarding the references

    Add more information regarding the references

    We've had some questions about the reference origins. We should probably add some links to file and bash scripts where applicable with regards to reference acquisition.

    documentation 
    opened by holtjma 1
  • Release 20220225

    Release 20220225

    • Adds hg38_T2T_masked reference genome and metadata around generated corresponding reference files
    • Benchmark results adds hg38_T2T_masked results as well
    • On the backend, Truvari was updated to v3.1.0, this did not seem to have a significant impact on results
    opened by holtjma 0
  • adding results files

    adding results files

    • Added haplotyping results for caller cyrius
    • Updated versions of pbmm2 and pbsv; there are some changes associated with the results of these, so the previous version is maintained for this release
    opened by holtjma 0
Releases(2022-06-10)
  • 2022-06-10(Jun 10, 2022)

  • 2022-05-19(May 19, 2022)

    This release is primarily adding two new samples to the PCR-free datasets. The following updates occurred as a result of this change:

    • Adds two new samples to our PCR-free datasets corresponding to HG006 and HG007
    • PCR-free results all shifted slightly (the vast majority to slightly worse performance); we did not notice any drastic changes across the results; all average sensitivities, precisions, and F1-scores shifted <0.001
    • Updates our expected CYP2D6 outputs to include expectations for HG006 and HG007
    Source code(tar.gz)
    Source code(zip)
    results_20220519.pdf(143.74 KB)
    small_summary_20220519.csv(5.36 KB)
  • 2022-04-29(Apr 29, 2022)

  • 2022-04-08(Apr 8, 2022)

  • 2022-03-18(Mar 18, 2022)

  • 2022-02-25(Feb 25, 2022)

    Reference changes:

    • Adds the hg38_T2T_masked reference which is version 2 of the hg38_GIAB_masked reference. A brief description and direct download links are provided with the reference metadata.
    • The hg38_T2T_masked results tend to be very slightly better than the v1 results, so hg38_GIAB_masked will likely be retired in a future release.

    Method changes:

    Source code(tar.gz)
    Source code(zip)
    results_20220225.pdf(164.79 KB)
    small_summary_20220225.csv(8.97 KB)
  • 2022-02-18(Feb 18, 2022)

  • 2022-02-11(Feb 11, 2022)

    Software changes:

    • Added dnascope-1.0-202112.01-PO for PCR-free datasets, dnascope-0.5-202010.04-PO will be removed in future releases. Additionally, the pass-only filter (e.g. -PO) is recommended for DNAscope, so the unfiltered version has been remove from reporting. Thanks to @DonFreed for the recommendations!
    • Added dysgu-1.3.4-PO, which is a pass-only filtered version of dysgu-1.3.4, for PCR-free and PacBio datasets. Additionally, the pass-only filter (e.g. -PO) is recommended for dysgu, so the unfiltered version will be removed in future released. Thanks to @kcleal for the recommendations!

    Other changes:

    • Added a note in the README on release cadence. In order to reduce overhead, going forward we will limit formal releases to at most once a week. New or partial results may appear through the week with the intention to summarize any changes in the weekly release.
    Source code(tar.gz)
    Source code(zip)
    results_20220211.pdf(157.26 KB)
    small_summary_20220211.csv(8.80 KB)
  • 2022-02-09(Feb 9, 2022)

  • 2022-02-08(Feb 8, 2022)

    Method change:

    • Splits deletion benchmark into RESTRICTED (requiring a high-confidence BED file) and UNRESTRICTED (no BED files, it's everything in the benchmark VCF)
    • RESTRICTED now only contains HG002 with the Tier1 regions. This set is more fair when judging the precision of the aligner/caller pair.
    • UNRESTRICTED has both HG001 and HG002, but with caveats around precision. This set includes more total variants and two samples, but precision is less accurate.

    Software addition:

    • Sniffles v2.0.2 was added as a new caller
    Source code(tar.gz)
    Source code(zip)
    results_20220208.pdf(153.97 KB)
    small_summary_20220208.csv(8.83 KB)
  • 2022-02-07(Feb 7, 2022)

  • 2022-01-28(Jan 28, 2022)

    • Added the first haplotyping caller to our results with cyrius-1.1.1; note that this caller is designed to work on short-read datasets and the upstream tooling (both reference and aligner) can have significant impact on its performance
    • Updated versions of pbmm2 (1.4.0 -> 1.7.0) and pbsv (2.6.2 -> 2.8.0); there are some changes in performance between the previous versions so they are retained in this release; they will be removed in subsequent releases
    Source code(tar.gz)
    Source code(zip)
  • 2022-01-07(Jan 7, 2022)

    Two variant caller updates:

    • Clair3 was update to v0.1-r9: Our previous version was v0.1-r5, and it was running in a conda environment after some back-and-forth with the developers. They now have a docker image that is much easier to use, so we have switched to that for both the Illumina and PacBio tests.
    • PEPPER-Margin-DeepVariant was added a full caller on version r0.7: Previously, we were treating this process as a BAM modifier (basically for phasing) and ignoring any variant calling results. With this change, it now operates as a variant caller and the VCF is analyzed with the rest of the callers. We are using the developer-released docker image for our analysis.
    • We have removed old versions of both tools to avoid any confusion around the analysis implementation
    Source code(tar.gz)
    Source code(zip)
    results_20220107.pdf(145.82 KB)
  • 2021-12-17(Dec 17, 2021)

    Two main results changes:

    • Adds the dnascope-0.5-202010.04-PO variant caller: This is the same data as dnascope-0.5-202010.04 but with a PASS-only filter applied to the VCF file. The short-read DNAscope callers uses the FILTER field to annotate variants that are rejected by the model as likely false positives. This has significant impact on the results and is recommended as best practice by the developers. Thanks to @DonFreed for helping diagnose the issue!
    • Updates minimap2 aligner from v2.22 to v2.23: Overall, this had minimal impact in our benchmark. v2.22 will be retired from the benchmark next release.
    Source code(tar.gz)
    Source code(zip)
    results_20211217.pdf(145.62 KB)
  • 2021-12-13(Dec 13, 2021)

    Metadata changes:

    • Adds a references folder for tracking references that are used in the analysis
    • Adds the hg38_asm5_alt reference including links to the reference and a script demonstrating how the ALT contigs were remapped
    • Adds the hg38_GIAB_masked reference including links to the reference and a dummy ALT file used for the pipeline

    Results changes:

    • Added the SNAP-2.0.0 caller that was recently released, this was run with the -hc- option so GATK-based results are expected to not be as accurate
    Source code(tar.gz)
    Source code(zip)
    results_20211213.pdf(140.63 KB)
Owner
HudsonAlpha Institute for Biotechnology
HudsonAlpha Institute for Biotechnology
HSPyLib is a Python library that will elevate your experience to another level.

HomeSetup Python Library - HSPyLib Your mature python application HSPyLib is a Python library that will elevate your experience to another level. It r

Hugo Saporetti Junior 4 Dec 14, 2022
This tool don't used illegal ativity

ETHICALTOOL This tool for only educational purposes don't used illegal ativity @onlinehacking this tool for pkg update && pkg upgrade && pkg install g

Mrkarthick 4 Dec 23, 2021
i3wm helper tool for workspaces on multiple monitors

i3screens A helper tool for managing i3wm workspaces on multiple monitors. Use-case You have a multi-monitor setup and want to have the "same" workspa

Sebastian Neef 1 Dec 05, 2022
A responsive package for Buttons, DropMenus and Combinations

A responsive package for Buttons, DropMenus and Combinations, This module makes the process a lot easier !

Skr Phoenix YT 0 Jan 30, 2022
Dotfiles & list of programs

dotfiles & list of programs So I wanted to just backup my most used files. I have a bad habit, sometimes I get tired of a distro and do a wipe and sta

2 Sep 04, 2022
Lock a program and kills it indefinitely if it is started.

Kill By Lock Lock a program and kills it indefinitely if it is started. How start it? It' simple, you just have to double-click on the python file (.p

1 Jan 12, 2022
A Python Perforce package that doesn't bring in any other packages to work.

P4CMD 🌴 A Python Perforce package that doesn't bring in any other packages to work. Relies on p4cli installed on the system. p4cmd The p4cmd module h

Niels Vaes 13 Dec 19, 2022
51AC8 is a stack based golfing / esolang that I am trying to make.

51AC8 is a stack based golfing / esolang that I am trying to make.

7 May 22, 2022
Collection of system-wide scripts that I use on my Gentoo

linux-scripts Collection of scripts that I use on my Gentoo machine. I tend to put all scripts in /scripts directory. It is not likely that you would

Xoores 1 Jan 09, 2022
Return-Parity-MDP - Towards Return Parity in Markov Decision Processes

Towards Return Parity in Markov Decision Processes Code for the AISTATS 2022 pap

Jianfeng Chi 3 Nov 27, 2022
Audio2Face - a project that transforms audio to blendshape weights,and drives the digital human,xiaomei,in UE project

Audio2Face - a project that transforms audio to blendshape weights,and drives the digital human,xiaomei,in UE project

FACEGOOD 732 Jan 08, 2023
nbsafety adds a layer of protection to computational notebooks by solving the stale dependency problem when executing cells out-of-order

nbsafety adds a layer of protection to computational notebooks by solving the stale dependency problem when executing cells out-of-order

150 Jan 07, 2023
The fastest way to copy to (not from) high speed flash storage.

FastestCopy The fastest way to copy to (not from) high speed flash storage. This is about 3-6x faster than file copy on explorer.exe to usb flash driv

Derek Frombach 0 Nov 03, 2021
A simple string parser based on CLR to check whether a string is acceptable or not for a given grammar.

A simple string parser based on CLR to check whether a string is acceptable or not for a given grammar.

Bharath M Kulkarni 1 Dec 15, 2021
Nimbus - Open Source Cloud Computing Software - 100% Apache2 licensed

⚠️ The Nimbus infrastructure project is no longer under development. ⚠️ For more information, please read the news announcement. If you are interested

Nimbus 194 Jun 30, 2022
bamboo-engine 是一个通用的流程引擎,他可以解析,执行,调度由用户创建的流程任务,并提供了如暂停,撤销,跳过,强制失败,重试和重入等等灵活的控制能力和并行、子流程等进阶特性,并可通过水平扩展来进一步提升任务的并发处理能力。

bamboo-engine 是一个通用的流程引擎,他可以解析,执行,调度由用户创建的流程任务,并提供了如暂停,撤销,跳过,强制失败,重试和重入等等灵活的控制能力和并行、子流程等进阶特性,并可通过水平扩展来进一步提升任务的并发处理能力。 整体设计 Quick start 1. 安装依赖 2. 项目初始

腾讯蓝鲸 96 Dec 15, 2022
Script para generar automatización de registro de formularios IEEH

Formularios_IEEH Script para generar automatización de registro de formularios IEEH Corresponde a un conjunto de script en python que permiten la auto

vhevia11 1 Jan 06, 2022
Feature engineering library that helps you keep track of feature dependencies, documentation and schema

Feature engineering library that helps you keep track of feature dependencies, documentation and schema

28 May 31, 2022
A simple and usefull python calculator.

simplepy-calculator Your simple and fresh calculator. Getting Started Install python3 from the oficial python website or via terminal. Clone this repo

Felix Sanchez 1 Jan 18, 2022
Waydroid is a container-based approach to boot a full Android system on a regular GNU/Linux system like Ubuntu.

Waydroid is a container-based approach to boot a full Android system on a regular GNU/Linux system like Ubuntu.

WayDroid 4.7k Jan 08, 2023