A major challenge in biological sciences is the reconstruction of the Tree of Life. To this effect, large genomic databases like GenBank and SwissProt are being mined for clusters from which phylogenies can be inferred. Systematists and comparative biologists commonly combine such phylogenies into informative supertrees that reveal information which was not explicitly displayed in any of the original phylogenies. However, whether a supertree is informative depends on particular overlap properties among the clusters from which it originates. In this work we formally introduce the concept of groves - sets of clusters with the potential to construct informative supertrees. Thus maximal potential candidate clusters for informative supertree construction can be identified in large databases through groves, prior to inferring trees for each cluster. Groves also have the potential to lead to informative supermatrix construction. We developed methods that (i) efficiently identify particular types of groves and (ii) find lower and upper bounds on the minimal number of groves needed to cover all the trees or data sets in a database. Finally, we apply our methods to the green plant sequences from GenBank.
ASJC Scopus subject areas
- Discrete Mathematics and Combinatorics