当前位置：网站首页>A new method for evaluating the quality of metagenome assembly - magista

A new method for evaluating the quality of metagenome assembly - magista

2022-04-23 04:11:00 【Gu he Niubo】

Gu He health

Insert picture description here

Despite the diversity of microbial groups on earth , but Only a small part Got it Culture and effective naming . Because most bacteria cannot... Under very specific conditions Culture, isolation and identification .

In the past decade , Metagenomic research The importance of has been highlighted , Because it can evaluate bacterial gene bank and find out what current laboratory culture technology can't master New bacterial genome . These data are for expand Our understanding of microbial diversity on earth is crucial .

Because the metagenome sequencing data is from Multiple species and bacteria Plant DNA The sequence consists of fragments , There are usually thousands from different Life field , Therefore, the main challenge of such analysis is to correctly identify each DNA Sequence fragment The real source . Unfortunately , These steps It's easy to make mistakes , Therefore, the results must be strictly reviewed , To avoid publishing incomplete and low-quality genomes .

lately , Belgian researchers have newly developed MAGISTA, This is an assessment of metagenomes A new method of assembly quality , Estimation based on random forest method MAGs Integrity and contamination of , Solves current problems based on Some defects of the method of reference gene are often ignored .

MAGISTA It's based on the metagenome bins Inside Of contig fragment Non aligned distance distribution between , Instead of a set of reference genes . This method utilizes data from the whole bin Information about . In order to properly evaluate this method , And explain the disadvantages of reference based tools , lately , Belgian researchers have built a highly complex DNA Simulated community , from 227 It consists of three bacterial strains , And have different degrees of similarity .

Insert picture description here

Fang Law

Training set to （HC227） since 227 A bacterial strain , The test data set consists of five publicly available Short reading （short reads） Subsets make up , There are four relatively low complexity genomes DNA Simulated community reads. See the figure below for details .

Complexity List the number of indicator bacteria ;Assembly tool The list shows the software used for assembly ;Binning method The list shows the tools used to divide boxes ;Binning parameters The list shows the used for Evaluate the quality of crating Indicators of ,comp by Completeness ,cov by coverage .

Insert picture description here

MAGISTA computational procedure ：

Input binning After each bins

-●-

The first 1 Step ： Select the appropriate fragment size and distance calculation method

-●-

First, put each bin Each of the contig Split into fixed length segments , And then use Four different methods （ namely PaSiT4、MMZ3、MMZ4 and Freq4） Calculate a bin All distances between clips in . For each method , All selected a specific fragment length , In order to produce different characteristic distributions for different organisms .

Of each method Final fragment length Your choice is through Different methods of analysis and integration Decisive , The method is shown in the figure below . At least two genomes in each group's design come from the same family , The two genomes come from the same order but from different families . These genomes are artificially divided into fragments of the required length , And calculate the target feature for each segment .

For each group of five genomes , All fragments were mixed and principal component analysis was performed according to their characteristics （PCA）, Then proceed Quadratic discrimination analysis , Used to generate classifiers , The aim is to distinguish the two most overlapping genomes in each group . Of the classifier The accuracy is averaged , The results are used to select the final combination of method and fragment length .

Insert picture description here

-●-

The first 2 Step ： Selection of characteristic variables in the model

-●-

After selecting the fragment length for each method , Use average 、 Standard deviation 、 skewness 、 Kurtosis and median and 2.5%、5%、10%、90%、95% and 97.5% The percentile calculates the distance distribution . Besides , And calculated 1 kb Fragment GC Content distribution . And each bin Size , total 66 Characteristic variables .

-●-

The first 3 Step ： model building

-●-

Use R (v 4.0.3) package “RandomForest” Medium “RandomForest” Function and default parameters to train the random forest model . Use at the same time R package lm Then a linear model is established to perform linear regression , Input the characteristic variable value after logarithmic conversion , For cross validation analysis .

Lord want junction fruit

A highly complex genome DNA Simulated community

From 227 The genome of a bacterial strain DNA form , These strains belong to 8 A door (Actinobacteria, Bacteroidetes,Deinococcus-Thermus, Firmicutes,Fusobacteria,Planctomycetes, Proteobacteria and Verrucomicrobia),18 class ,47 Objective ,85 Families, ,175 Belong to ,197 Kind of .

Insert picture description here

edit

The above figure shows the genome size and size of bacterial strains in the simulated community GC content （ from 26.3% To 73.4%） Scatter plot ;

Insert picture description here

edit

The figure shows the relationship between species in the training set and the test set . The red line indicates the species that exist in the training set , The gray line indicates the bacteria in the training set . Different colors in the ring chart represent different classification levels . In the legend, there are bacteria in the training set * Mark , The phylum present in archaea is marked with dark gray bands .

CheckM Based on single copy markers (SCMG) To assess the bin Defects in quality

Insert picture description here

chart a and b They are from CheckM Integrity index and pollution degree of output in . Use R^2y∼x（ Explain the percentage of variance ）,RMSE（ Root mean square error relative to actual value ） Two parameter evaluation results . The result shows CheckM Overestimate bin The quality of the . Many contaminated bins Predicted to be close to uncontaminated .

Use MAGISTA Analyze the bins

Insert picture description here

First, the optimal fragment size is selected to calculate the distance distribution , As shown in the figure above , Considering 1、5、10、20、30、40、50、75 and 100 kb Fragments of , Finally, the fragment size shown in bold is selected .

Insert picture description here

The graph is concont、MetaBAT and MaxBin Produced bins Integrity and contamination information .

Because generating such a data set through simulation can not accurately represent the real results , So we used binning Software Result , Provides a set of Different quality real bins. The integrity and uncontaminated degree of the training data set are in 90% above .

Insert picture description here

Finally, the model construction , establish Prediction model of integrity and pollution degree . The model is evaluated , As shown in the figure . Respectively for CheckM、MAGISTA and MAGISTIC Its performance is tested .CheckM It is now a mainstream evaluation bin Quality tools .MAGISTIC It's a combination of CheckM and MAGISTA Tools for . Use scores that explain variance （R2y∼x） And root mean square error （RMSE） As an indicator of performance . Prediction of integrity ,MAGISTA be better than CheckM. Prediction of pollution degree ,MAGISTA Better than CheckM.

junction On

Researchers have developed a new method for predicting highly complex metagenome assembly genomes bin The quality of the method ,MAGISTA. Is based on SCMG An equally good alternative to the low complexity metagenomic approach . except MAGISTA outside , Also by combining CheckM Result , Use MAGISTIC Generated a more accurate prediction .

The researchers pointed out in the article MAGISTA and CheckM Are not accurate enough to be considered reliable .MAGISTIC Produced a better than MAGISTA Better results .

In additional analysis , Divide the test set into two subsets , From reality and simulation reads From bins, Analyze this again , The result shows ,CheckM about “ real ” Subsets perform well （ But compared with MAGISTA and MAGISTIC Still poor ）, about “ simulation ” Subsets perform poorly . and MAGISTIC comparison MAGISTA It would be more accurate . however The article does not elaborate MAGISTIC workflow .

See the author in github Software description disclosed on , The address is as follows . However, the contents of the output file are not explained and given . Personally, I think it may not be very mature .

reference ：

Goussarov G, Claesen J, Mysara M, Cleenwerck I, Leys N, Vandamme P, Van Houdt R. Accurate prediction of metagenome-assembled genome completeness by MAGISTA, a random forest model built on alignment-free intra-bin statistics. Environ Microbiome. 2022 Mar 5;17(1):9. doi: 10.1186/s40793-022-00403-7. PMID: 35248155; PMCID: PMC8898458.

版权声明
本文为[Gu he Niubo]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/04/202204220600369475.html

当前位置：网站首页>A new method for evaluating the quality of metagenome assembly - magista

A new method for evaluating the quality of metagenome assembly - magista

边栏推荐

猜你喜欢

随机推荐