      The performance of coalescent-based species tree estimation methods under models of missing data

      research-article
      BMC Genomics
      BioMed Central
      RECOMB-CG - 2017 : The Fifteenth RECOMB Comparative Genomics Satellite Conference (RECOMB-CG 2017)
      04-06 October 2017
      Species tree, Multi-species coalescent, Missing data, Incomplete lineage sorting, ASTRAL, ASTRID, MP-EST, SVDquartets


          Abstract

          Background

          Estimation of species trees from multiple genes is complicated by processes such as incomplete lineage sorting, gene duplication and loss, and horizontal gene transfer, which cause gene trees to differ from each other and from the species phylogeny. Methods to estimate species trees in the presence of gene tree discord due to incomplete lineage sorting have been developed and proved to be statistically consistent, provided that gene tree discord is due only to incomplete lineage sorting and every gene tree includes the full set of species.

          Results

          We establish statistical consistency of certain coalescent-based species tree estimation methods under some models of taxon deletion from genes. We also evaluate the impact of missing data on four species tree estimation methods (ASTRAL-II, ASTRID, MP-EST, and SVDquartets) using simulated datasets with varying levels of incomplete lineage sorting, gene tree estimation error, and degrees/patterns of missing data.

          Conclusions

          All the species tree estimation methods improved in accuracy as the number of genes increased and often produced highly accurate species trees even when the amount of missing data was large. These results together indicate that accurate species tree estimation is possible under a variety of conditions, even when there are substantial amounts of missing data.

          Electronic supplementary material

          The online version of this article (10.1186/s12864-018-4619-8) contains supplementary material, which is available to authorized users.
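
The abstract refers to models of taxon deletion from genes without spelling them out. As a rough illustration only, the sketch below simulates one simple possibility: each species is removed from each gene independently with a fixed probability. This i.i.d. scheme, the function names `delete_taxa_iid` and `missing_fraction`, and the parameter `p_missing` are assumptions made for this example and are not taken from the article.

```python
import random


def delete_taxa_iid(gene_taxa, p_missing, rng):
    """Return a copy of gene_taxa in which each taxon is dropped from each
    gene independently with probability p_missing (an assumed toy model)."""
    reduced = []
    for taxa in gene_taxa:
        kept = [t for t in taxa if rng.random() >= p_missing]
        reduced.append(kept)
    return reduced


def missing_fraction(original, reduced):
    """Fraction of (gene, taxon) entries that were deleted."""
    total = sum(len(taxa) for taxa in original)
    kept = sum(len(taxa) for taxa in reduced)
    return 1.0 - kept / total


# Hypothetical input: 1000 genes, each initially sampling all five species.
species = ["A", "B", "C", "D", "E"]
genes = [list(species) for _ in range(1000)]

rng = random.Random(0)  # fixed seed so the run is repeatable
reduced = delete_taxa_iid(genes, 0.3, rng)
frac = missing_fraction(genes, reduced)
print(f"fraction of entries deleted: {frac:.3f}")
```

In a real study the reduced taxon sets would be used to prune gene trees or alignments before species tree estimation; genes left with fewer than four taxa carry no quartet information and would typically be filtered out at that stage.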

                Author and article information

                Contributors
                nute2@illinois.edu
                emolloy2@illinois.edu
                warnow@illinois.edu
                Journal: BMC Genomics
                Publisher: BioMed Central (London)
                ISSN: 1471-2164
                Published: 8 May 2018
                Volume: 19
                Issue: Suppl 5
                Issue sponsor: Publication of this supplement has not been supported by sponsorship. Information about the source of funding for publication charges can be found in the individual articles. The articles have undergone the journal's standard peer review process for supplements. The Supplement Editors declare that they have no competing interests.
                Article number: 286
                Affiliations
                [1] ISNI 0000 0004 1936 9991, GRID grid.35403.31, Department of Statistics, University of Illinois at Urbana-Champaign, 725 S. Wright St., Champaign, IL 61820, USA
                [2] ISNI 0000 0004 1936 9991, GRID grid.35403.31, Department of Mathematics, University of Illinois at Urbana-Champaign, 1409 W. Green St., Urbana, IL 61801, USA
                [3] ISNI 0000 0004 1936 9991, GRID grid.35403.31, Department of Computer Science, University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, IL 61801, USA
                Article
                4619
                10.1186/s12864-018-4619-8
                5998899
                29745854
                6e2d1e9c-d92e-4eb5-a445-9e4521f40a73
                © The Author(s) 2018

                Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

                Conference: RECOMB-CG 2017, The Fifteenth RECOMB Comparative Genomics Satellite Conference, Barcelona, Spain, 04-06 October 2017
                Genetics