New methods for multiple sequence alignment with improved accuracy and scalability


Funding: U.S. National Science Foundation grant 1458652 (ABI Innovation).

Project Overview: Multiple sequence alignment (MSA) is one of the most basic bioinformatics steps, in which a set of molecular sequences (i.e., DNA, RNA, or amino acid sequences) are arranged inside a matrix to identify corresponding positions. MSA calculation is a fundamental first step in many biological analyses. Because of its broad applicability and importance, many MSA methods have been developed and are in wide use today. Unfortunately, many real world biological datasets have features (large size and fragmentary sequences, for example) that make accurate MSA calculation very difficult. Because poorly estimated alignments result in errors in downstream biological analyses, new MSA techniques are needed that can produce accurate alignments on difficult datasets. This project will develop MSA methods with greatly improved accuracy, and that can analyze the large and heterogeneous sequence datasets being assembled in different biology projects nationally. The project also has a substantial outreach component to women's colleges and minority serving institutions, and summer software schools to train biologists in the use of the project software.

Multiple sequence alignment (MSA) and phylogeny estimation are two very basic bioinformatics problems, which sit at the intersection of machine learning, statistical estimation, and evolutionary and structural biology. MSA has particular importance in constructing evolutionary trees, understanding the function and structure of proteins, detecting interactions between proteins, and even genome assembly. Large-scale MSA and phylogeny estimation also require high performance computing and parallel algorithms, in order to provide adequate scalability. The team will develop new machine learning techniques to greatly improve MSA methods, and hence also phylogeny estimation, since it depends on accurate multiple sequence alignments. The core of this project is algorithm development, utilizing a variety of machine learning techniques (including Hidden Markov Models), statistical estimation methods (especially Bayesian MCMC and maximum likelihood), and novel algorithmic strategies, all focused on improving scalability and accuracy.


Project Software:

Summer Symposia and Software Schools: The grant will provide summer symposia and software schools to train researchers (from students through faculty) in new multiple sequence alignment methods, and other topics within phylogenomics.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.