TY - JOUR
T1 - smartsnp, an R package for fast multivariate analyses of big genomic data
AU - Herrando-Pérez, Salvador
AU - Tobler, Raymond
AU - Huber, Christian D.
N1 - Publisher Copyright: © 2021 British Ecological Society
PY - 2021/11
Y1 - 2021/11
N2 - Principal component analysis (PCA) is a powerful tool for the analysis of population structure, a genetic property that is essential to understand the evolutionary processes driving biological diversification and (pre)historical colonizations, migrations and extinctions. In the current era of high-throughput sequencing technologies, population structure can be quantified from scores of genetic markers across hundreds to thousands of genomes. However, these big genomic datasets pose substantial computing and analytical challenges. We present the R package smartsnp for fast and user-friendly computation of PCA on single-nucleotide polymorphism (SNP) data. Inspired by the current field-standard software EIGENSOFT, smartsnp includes appropriate SNP scaling for genetic drift and allows projection of ancient samples onto a modern genetic space while also providing permutation-based multivariate tests for population differences in genetic diversity (both location and dispersion). Our extensive benchmarks show that smartsnp's PCA is 2–4 times faster than EIGENSOFT's SMARTPCA algorithm across a wide range of sample and SNP sizes. All four smartsnp functions (smart_pca, smart_permanova, smart_permdisp and smart_mva) process datasets with up to 100 samples and 1 million simulated SNPs in less than 30 s and accurately recreate previously published SMARTPCA of ancient-human and wolf genotypes. The package smartsnp provides fast and robust multivariate ordination and hypothesis testing for big genomic data that is also suitable for ancient and low-coverage modern DNA. The simple implementation should appeal to biological conservation, evolutionary, ecological and (palaeo)genomic researchers, and be useful for phenotype, ancestry and lineage studies.
AB - Principal component analysis (PCA) is a powerful tool for the analysis of population structure, a genetic property that is essential to understand the evolutionary processes driving biological diversification and (pre)historical colonizations, migrations and extinctions. In the current era of high-throughput sequencing technologies, population structure can be quantified from scores of genetic markers across hundreds to thousands of genomes. However, these big genomic datasets pose substantial computing and analytical challenges. We present the R package smartsnp for fast and user-friendly computation of PCA on single-nucleotide polymorphism (SNP) data. Inspired by the current field-standard software EIGENSOFT, smartsnp includes appropriate SNP scaling for genetic drift and allows projection of ancient samples onto a modern genetic space while also providing permutation-based multivariate tests for population differences in genetic diversity (both location and dispersion). Our extensive benchmarks show that smartsnp's PCA is 2–4 times faster than EIGENSOFT's SMARTPCA algorithm across a wide range of sample and SNP sizes. All four smartsnp functions (smart_pca, smart_permanova, smart_permdisp and smart_mva) process datasets with up to 100 samples and 1 million simulated SNPs in less than 30 s and accurately recreate previously published SMARTPCA of ancient-human and wolf genotypes. The package smartsnp provides fast and robust multivariate ordination and hypothesis testing for big genomic data that is also suitable for ancient and low-coverage modern DNA. The simple implementation should appeal to biological conservation, evolutionary, ecological and (palaeo)genomic researchers, and be useful for phenotype, ancestry and lineage studies.
KW - SMARTPCA
KW - ancient DNA
KW - genetic drift
KW - population structure
KW - single nucleotide polymorphism
UR - http://www.scopus.com/inward/record.url?scp=85112025885&partnerID=8YFLogxK
U2 - 10.1111/2041-210X.13684
DO - 10.1111/2041-210X.13684
M3 - Article
SN - 2041-210X
VL - 12
SP - 2084
EP - 2093
JO - Methods in Ecology and Evolution
JF - Methods in Ecology and Evolution
IS - 11
ER -