IIBR Informatics: Advancing Bioinformatics Methods using Ensembles of Profile Hidden Markov Models

Participants:

Tandy Warnow (PI)
Jian Peng (Co-PI)
Baqiao Liu (PhD student)
Minhyuk Park (PhD student)
Chengze Shen (PhD student)
Paul Zaharias (postdoc)

Funding: U.S. National Science Foundation grant DBI-2006069 (ABI Innovation), $500,000.

Project Overview: Profile Hidden Markov Models (profile HMMs) are probabilistic graphical models that are in wide use in bioinformatics. Research over the last decade has shown that ensembles of profile HMMs (e-HMMs) can provide greater accuracy than a single profile HMM for many applications in bioinformatics, including phylogenetic placement, multiple sequence alignment, and taxonomic identification of metagenomic reads. Although these improvements have been substantial, the design of these e-HMMs has been fairly ad hoc, and their use can be computationally intensive, which reduces their appeal in practice. This project will advance the use of e-HMMs by developing statistically rigorous techniques for building them, with the goal of improving both their accuracy and our understanding of them, and will also develop methods that use e-HMMs for protein structure and function prediction. Broader impacts include software schools, engagement with under-represented groups, and open-source software. Project software and papers are available at http://tandy.cs.illinois.edu/eHMMproject.html.

Journal publications supported by this grant:

M. Park, S. Ivanovic, G. Chu, C. Shen, and T. Warnow (2023). UPP2: Fast and Accurate Alignment Estimation of Datasets with Fragmentary Sequences. Bioinformatics, Volume 39, Issue 1, January 2023, btad007. (HTML) DOI: 10.1093/bioinformatics/btad007
M. Park and T. Warnow (2023). HMMerge: an Ensemble Method for Improving Multiple Sequence Alignment. Bioinformatics Advances, special issue for ISCB-LA 2022. (HTML). DOI: 10.1093/bioadv/vbad052
B. Liu and T. Warnow (2023). WITCH-NG: Efficient and Accurate Alignment of Datasets with Sequence Length Heterogeneity. Bioinformatics Advances, special issue for ISCB-LA 2022. (HTML) DOI: 10.1093/bioadv/vbad024
P. Zaharias and T. Warnow (2023). Recent Progress on Methods for Estimating and Updating Large Phylogenies. The Philosophical Transactions of the Royal Society, Vol. 377, Issue 1861 (Discussion meeting issue for "Genomic population structures of microbial pathogens"). DOI: 10.1098/rstb.2021.0244.
P. Zaharias, V. Smirnov, and T. Warnow (2022). Large-Scale Multiple Sequence Alignment and the Maximum Weight Trace Alignment Merging Problem. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2022. (HTML) DOI: 10.1109/TCBB.2022.3191848
C. Shen, P. Zaharias, and T. Warnow (2022). MAGUS+eHMMs: Improved Multiple Sequence Alignment Accuracy for Fragmentary Sequences. Bioinformatics, Vol. 38, Issue 4, Pages 918-924, 2022 (HTML) DOI: 10.1093/bioinformatics/btab788
C. Shen, M. Park, and T. Warnow (2022). WITCH: Improved Multiple Sequence Alignment through Weighted Consensus HMM alignment. J. Computational Biology, special issue for Mike Waterman. (HTML) DOI: 10.1089/cmb.2021.0585

Preprints supported by this grant:

C. Shen, B. Liu, and T. Warnow (2022). SALMA: Scalable ALignment using MAFFT-Add. bioRxiv 2022.05.23.493139; (HTML)
C. Shen, B. Liu, and T. Warnow (2022). EMMA: A New Method for Computing Multiple Sequence Alignments given a Constraint Subset Alignment. (HTML) This paper appeared at WABI 2022 and is under review at a journal.

Project Software:

WITCH provides improved alignment estimation compared to MAGUS and PASTA when working with sequences with substantial sequence length heterogeneity. WITCH is available in open source form on GitHub at https://github.com/c5shen/WITCH.
UPP2 is an improvement on UPP with respect to speed: rather than performing all-against-all comparisons between query sequences and the HMMs in the ensemble, it performs only a logarithmic number of comparisons per query sequence. The result is comparable accuracy with much faster alignment. UPP2 is available in open source form on GitHub at https://github.com/gillichu/sepp.
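To illustrate how a logarithmic number of comparisons can replace an all-against-all search, the following is a minimal sketch, not UPP2's actual implementation. It assumes the models in the ensemble are arranged in a balanced binary tree and that a query descends from the root toward whichever child scores better. The `Node` class, `build_tree`, `hierarchical_search`, and the toy numeric scoring function are all hypothetical stand-ins; in a real setting the score would be something like the fit of a query sequence to a profile HMM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for one model in a hierarchically organized ensemble."""
    center: float                      # toy summary of the sequences the model covers
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(values: List[float]) -> Node:
    """Build a balanced binary tree; each node summarizes its subset by the mean."""
    node = Node(center=sum(values) / len(values))
    if len(values) > 1:
        mid = len(values) // 2
        node.left = build_tree(values[:mid])
        node.right = build_tree(values[mid:])
    return node


def count_nodes(node: Optional[Node]) -> int:
    """Number of models in the tree (the cost of an all-against-all search)."""
    if node is None:
        return 0
    return 1 + count_nodes(node.left) + count_nodes(node.right)


def hierarchical_search(root: Node, score: Callable[[Node], float]) -> Tuple[Node, int]:
    """Greedy descent: score both children, follow the better one.

    Returns the best-scoring node seen on the path and the number of
    scoring calls made, which grows with tree depth rather than tree size.
    """
    best, best_score = root, score(root)
    comparisons = 1
    node = root
    while node.left is not None and node.right is not None:
        left_score, right_score = score(node.left), score(node.right)
        comparisons += 2
        if left_score >= right_score:
            node, node_score = node.left, left_score
        else:
            node, node_score = node.right, right_score
        if node_score > best_score:
            best, best_score = node, node_score
    return best, comparisons


if __name__ == "__main__":
    root = build_tree([0, 1, 2, 3, 4, 5, 6, 7])
    query = 6.2
    # Toy score: closer centers score higher (assumption, not an HMM score).
    best, used = hierarchical_search(root, lambda n: -abs(n.center - query))
    print(f"best model center: {best.center}")
    print(f"comparisons used: {used} of {count_nodes(root)} models")
```

In this toy setup the greedy descent scores only the models along one root-to-leaf path, so the work per query grows with the depth of the tree instead of the number of models. A greedy descent can miss the globally best model, which is one reason this should be read as an illustration of the idea rather than a description of the method.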

Symposia and Software Schools: The grant will provide symposia and software schools to train researchers (from students through faculty) in new methods.

Presentations: See http://tandy.cs.illinois.edu/talks.html for the full list of talks.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.