IIBR Informatics: Advancing Bioinformatics Methods using Ensembles of Profile Hidden Markov Models
Participants:
- Tandy Warnow (PI)
- Baqiao Liu (PhD student)
- Minhyuk Park (PhD student)
- Chengze Shen (PhD student)
- Eleanor Wedell (PhD student)
- Paul Zaharias (postdoc)
Funding: U.S. National Science Foundation grant DBI-2006069
(ABI Innovation), $500,000.
Project Overview:
Profile Hidden Markov Models (i.e., profile HMMs) are probabilistic graphical models that are in wide use in bioinformatics. Research over the last decade has shown that ensembles of profile HMMs (e-HMMs) can provide greater accuracy than a single profile HMM for many applications in bioinformatics, including phylogenetic placement, multiple sequence alignment, and taxonomic identification of metagenomic reads. Although these improvements have been substantial, the design of these e-HMMs has been fairly ad hoc, and their use can be computationally intensive, which reduces their appeal in practice. This project advances the use of e-HMMs by developing statistically rigorous techniques for building them, with the goal of improving both their accuracy and our understanding of them, and by developing methods that use e-HMMs in different bioinformatics problems. Broader impacts include software schools, engagement with under-represented groups, and open-source software.
Journal publications supported by this grant:
-
C. Shen, B. Liu, K.P. Williams, and T. Warnow (2023). EMMA: a new method for computing multiple sequence alignments given a constraint subset alignment. Algorithms for Molecular Biology, 18(1), p.21.
(HTML)
DOI: 10.1186/s13015-023-00247-x
-
M. Park, S. Ivanovic, G. Chu, C. Shen, and T. Warnow (2023).
UPP2: Fast and Accurate Alignment Estimation of Datasets with Fragmentary Sequences. Bioinformatics, Volume 39, Issue 1, January 2023, btad007, (HTML)
DOI: 10.1093/bioinformatics/btad007
-
M. Park and T. Warnow (2023).
HMMerge: an Ensemble Method for Improving Multiple Sequence Alignment.
Bioinformatics Advances, special issue for ISCB-LA 2022.
(HTML).
DOI: 10.1093/bioadv/vbad052
-
B. Liu and T. Warnow (2023).
WITCH-NG: Efficient and Accurate Alignment of Datasets with Sequence Length Heterogeneity.
Bioinformatics Advances, special issue for ISCB-LA 2022
(HTML)
DOI: 10.1093/bioadv/vbad024
-
P. Zaharias and T. Warnow (2023).
Recent Progress on Methods for Estimating and Updating Large Phylogenies. The Philosophical Transactions of the Royal Society, Vol. 377, Issue 1861 (Discussion meeting issue for "Genomic population structures of microbial pathogens").
(HTML)
DOI: 10.1098/rstb.2021.0244.
-
P. Zaharias, V. Smirnov, and T. Warnow (2022).
Large-Scale Multiple Sequence Alignment and the
Maximum Weight Trace Alignment Merging Problem.
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2022.
(HTML)
DOI: 10.1109/TCBB.2022.3191848
- C. Shen, P. Zaharias, and T. Warnow (2022). MAGUS+eHMMs: Improved Multiple Sequence Alignment Accuracy for Fragmentary Sequences. Bioinformatics, Vol. 38, Issue 4,
Pages 918-924, 2022
(HTML)
DOI: 10.1093/bioinformatics/btab788
- C. Shen, M. Park, and T. Warnow (2022). WITCH: Improved Multiple Sequence Alignment through Weighted Consensus HMM alignment. J. Computational Biology, special issue for Mike Waterman.
(HTML)
DOI: 10.1089/cmb.2021.0585
-
E. Wedell, Y. Cai, and T. Warnow (2022). SCAMPP: scaling alignment-based phylogenetic placement to large trees. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 20(2), pp.1417-1430.
(PDF)
DOI: 10.1109/TCBB.2022.3170386
Preprints supported by this grant (not otherwise published):
-
E. Wedell, C. Shen, and T. Warnow (2022).
BATCH-SCAMPP: Scaling phylogenetic placement methods to place many sequences.
(HTML)
DOI: 10.1101/2022.10.26.513936
Project Software:
-
SCAMPP and BSCAMPP, available on github.
These are codes for phylogenetic placement based on maximum likelihood, and use
divide-and-conquer to enable
pplacer (a very accurate method) to be used on large trees.
SCAMPP is designed to maximize accuracy, whereas BSCAMPP is designed
to improve running time compared to SCAMPP.
These methods were designed by Eleanor Wedell and Tandy Warnow,
and Chengze Shen contributed to the development of BSCAMPP.
-
EMMA, available at https://github.com/c5shen/EMMA, is
code for multiple sequence alignment, and in particular for adding sequences into
an existing alignment.
EMMA was designed and implemented by Chengze Shen, with support
from Baqiao Liu, Kelly Williams, and Tandy Warnow.
-
WITCH-NG,
available at https://github.com/RuneBlaze/WITCH-NG, is code for multiple sequence alignment,
addressing especially the case of aligning datasets that have a large degree of sequence
length heterogeneity.
- WITCH provides improved alignment estimation compared
to MAGUS and PASTA when working with
sequences with substantial sequence length heterogeneity.
WITCH was developed by Chengze Shen and Tandy Warnow, and is available in open source form on github at
https://github.com/c5shen/WITCH.
- UPP2 is an improvement on UPP with respect to speed: rather than
performing all-against-all comparisons between query sequences
and HMMs in the ensemble, it performs only a logarithmic number of
these comparisons. The result is comparable accuracy with much faster
alignments.
UPP2 was developed by Minhyuk Park, and is available in open source form on github at
https://github.com/gillichu/sepp.
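The idea of replacing all-against-all comparisons with a logarithmic number of comparisons, as described for UPP2 above, can be illustrated with a small toy sketch. This is not the UPP2 implementation. It assumes, for illustration only, that the ensemble is organized as a binary hierarchy and that each query descends from the root toward the better-scoring child. The names `Node`, `toy_score`, `build_hierarchy`, and `hierarchical_search` are hypothetical, and a single number stands in for each profile HMM (a real method would use an HMM bit score instead of `toy_score`).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    """Toy stand-in for one profile HMM in a binary hierarchy (hypothetical)."""
    name: str
    center: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def toy_score(node: Node, query: float) -> float:
    # Assumption: higher is better. A real method would score the query
    # sequence against the node's profile HMM here.
    return -abs(query - node.center)


def build_hierarchy(centers: list, name: str = "root") -> Node:
    # Each node summarizes a contiguous block of values; its two children
    # split that block in half, until a single value remains.
    node = Node(name, sum(centers) / len(centers))
    if len(centers) > 1:
        mid = len(centers) // 2
        node.left = build_hierarchy(centers[:mid], name + "L")
        node.right = build_hierarchy(centers[mid:], name + "R")
    return node


def count_nodes(node: Optional[Node]) -> int:
    if node is None:
        return 0
    return 1 + count_nodes(node.left) + count_nodes(node.right)


def hierarchical_search(root: Node, query: float) -> Tuple[Node, int]:
    # Score the root, then repeatedly score both children and follow the
    # better one, remembering the best-scoring node seen on the path.
    best, best_score = root, toy_score(root, query)
    comparisons = 1
    node = root
    while node.left is not None and node.right is not None:
        left_score = toy_score(node.left, query)
        right_score = toy_score(node.right, query)
        comparisons += 2
        if left_score >= right_score:
            node, score = node.left, left_score
        else:
            node, score = node.right, right_score
        if score > best_score:
            best, best_score = node, score
    return best, comparisons


root = build_hierarchy([float(i) for i in range(16)])
best, comparisons = hierarchical_search(root, 3.0)
print(best.name, comparisons, "of", count_nodes(root), "models scored")
```

In this toy setting the query is scored against two models per level of the hierarchy, so the number of comparisons grows with the depth of the hierarchy rather than with the total number of models. The trade-off is that a greedy descent can miss the globally best model.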
Symposia and Software Schools:
The grant will provide symposia and software schools to train researchers
(from students through faculty) in new methods.
We will hold a Phylogenomics Software School as part of
the Joint Congress
on Evolutionary Biology in Athens, Georgia, on June 20, 2025.
Presentations:
See http://tandy.cs.illinois.edu/talks.html for the full list of talks.
Course materials:
My CS 581: Algorithmic Genomic Biology
course covers this material, as well as related material.
The lectures are available for download, mostly in PDF format.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.