To include or not to include: The impact of missing data on summary methods for species tree estimation

Erin Molloy
Department of Computer Science
University of Illinois at Urbana-Champaign

Because the inclusion of genes with substantial missing data is widely assumed to reduce accuracy, many phylogenomic studies restrict their datasets to genes that have data for all or nearly all of the species of interest. We compare three statistically consistent coalescent-based species tree methods, ASTRAL, ASTRID, and MP-EST, in the presence of missing data. Our study, on a collection of simulated and biological datasets, provides overwhelming evidence that including genes with missing data is either benign or beneficial. Further, our study shows that ASTRAL and ASTRID produce more accurate species trees than MP-EST when genes with missing data are included.