Species trees from consensus single nucleotide polymorphism (SNP) data: Testing phylogenetic approaches with simulated and empirical data

Alexander N. Schmidt-Lebuhn*, Nicola C. Aitken, Aaron Chuah

*Corresponding author for this work

    Research output: Contribution to journalArticlepeer-review

    12 Citations (Scopus)

    Abstract

    Datasets of hundreds or thousands of SNPs (Single Nucleotide Polymorphisms) from multiple individuals per species are increasingly used to study population structure, species delimitation and shallow phylogenetics. The principal software tool to infer species or population trees from SNP data is currently the BEAST template SNAPP which uses a Bayesian coalescent analysis. However, it is computationally extremely demanding and tolerates only small amounts of missing data. We used simulated and empirical SNPs from plants (Australian Craspedia, Asteraceae, and Pelargonium, Geraniaceae) to compare species trees produced (1) by SNAPP, (2) using SVD quartets, and (3) using Bayesian and parsimony analysis with several different approaches to summarising data from multiple samples into one set of traits per species. Our aims were to explore the impact of tree topology and missing data on the results, and to test which data summarising and analyses approaches would best approximate the results obtained from SNAPP for empirical data. SVD quartets retrieved the correct topology from simulated data, as did SNAPP except in the case of a very unbalanced phylogeny. Both methods failed to retrieve the correct topology when large amounts of data were missing. Bayesian analysis of species level summary data scoring the two alleles of each SNP as independent characters and parsimony analysis of data scoring each SNP as one character produced trees with branch length distributions closest to the true trees on which SNPs were simulated. For empirical data, Bayesian inference and Dollo parsimony analysis of data scored allele-wise produced phylogenies most congruent with the results of SNAPP. In the case of study groups divergent enough for missing data to be phylogenetically informative (because of additional mutations preventing amplification of genomic fragments or bioinformatic establishment of homology), scoring of SNP data as a presence/absence matrix irrespective of allele content might be an additional option. As this depends on sampling across species being reasonably even and a random distribution of non-informative instances of missing data, however, further exploration of this approach is needed. Properly chosen data summary approaches to inferring species trees from SNP data may represent a potential alternative to currently available individual-level coalescent analyses especially for quick data exploration and when dealing with computationally demanding or patchy datasets.

    Original languageEnglish
    Pages (from-to)192-201
    Number of pages10
    JournalMolecular Phylogenetics and Evolution
    Volume116
    DOIs
    Publication statusPublished - Nov 2017

    Fingerprint

    Dive into the research topics of 'Species trees from consensus single nucleotide polymorphism (SNP) data: Testing phylogenetic approaches with simulated and empirical data'. Together they form a unique fingerprint.

    Cite this