Background: Sequencing of environmental DNA (often called metagenomics) has shown tremendous potential to
uncover the vast number of unknown microbes that cannot be cultured and sequenced by traditional methods.
Because the output from metagenomic sequencing is a large set of reads of unknown origin, clustering together
reads that were sequenced from the same species is a crucial analysis step. Many effective approaches to this
task rely on sequenced genomes in public databases, but these genomes are a highly biased sample that is not
necessarily representative of environments interesting to many metagenomics projects.
Results: We present SCIMM (Sequence Clustering with Interpolated Markov Models), an unsupervised sequence
clustering method. SCIMM achieves greater clustering accuracy than previous unsupervised approaches. We
examine the limitations of unsupervised learning on complex datasets and propose PHYSCIMM, a hybrid of SCIMM
and the supervised learning method Phymm, which performs better when evolutionarily close training
genomes are available.
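The abstract does not spell out the clustering procedure, so the following is only a rough illustration of what unsupervised sequence clustering with Markov models can look like, not the SCIMM implementation. It assumes a k-means-style loop (train one model per cluster, reassign each read to its best-scoring model, repeat) and swaps in plain fixed-order Markov chains for interpolated Markov models. The function names, the round-robin initialization, and the pseudocount are all hypothetical choices made for this sketch.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(seqs, order=2, pseudocount=1.0):
    """Fit a fixed-order Markov chain and return a log P(base | context) function.

    Assumption: a fixed-order chain stands in for an interpolated Markov model.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1.0

    def logp(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return math.log((ctx.get(base, 0.0) + pseudocount) / total)

    return logp


def score(seq, logp, order=2):
    """Log-likelihood of a sequence under one cluster's model."""
    return sum(logp(seq[i - order:i], seq[i]) for i in range(order, len(seq)))


def cluster(seqs, k, order=2, max_iters=20):
    """Alternate between training per-cluster models and reassigning sequences."""
    # Hypothetical initialization: deal sequences out round-robin.
    labels = [i % k for i in range(len(seqs))]
    for _ in range(max_iters):
        # An empty cluster yields a uniform model (pseudocounts only).
        models = [
            train_markov([s for s, c in zip(seqs, labels) if c == j], order)
            for j in range(k)
        ]
        new_labels = [
            max(range(k), key=lambda j: score(s, models[j], order)) for s in seqs
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return labels
```

On a toy input such as three AT-repeat reads followed by three GC-repeat reads, `cluster(reads, 2, order=1)` separates the two groups. Real metagenomic reads are far noisier, and the result of a loop like this depends heavily on the initialization and the chosen number of clusters.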
Conclusions: SCIMM and PHYSCIMM are highly accurate methods to cluster metagenomic sequences. SCIMM
operates entirely unsupervised, making it ideal for environments containing mostly novel microbes. PHYSCIMM
uses supervised learning to improve clustering in environments containing microbial strains from well-characterized
genera. SCIMM and PHYSCIMM are available as open source from http://www.cbcb.umd.edu/software/scimm.
https://doi.org/10.1186/1471-2105-11-54