Statistical considerations underpinning an alignment-free sequence comparison method

Junmei Jing, Conrad J. Burden, Sylvain Forêt, Susan R. Wilson*

*Corresponding author for this work

    Research output: Contribution to journal, Article, peer-review

    2 Citations (Scopus)

    Abstract

    The D2 statistic is defined as the number of word matches of prespecified length k, with up to t mismatches, shared between two given sequences. This statistic finds its application in alignment-free comparisons of biological sequences. It has two main advantages over alignment-based methods for nucleotide and amino-acid sequence comparisons, such as BLAST (basic local alignment search tool). These are (i) D2 does not assume that homologous segments are contiguous, and (ii) the algorithm is computationally extremely fast, the runtime being proportional to the size of the sequences in the case of exact matches. This review article summarises results to date on determining the distributional properties of the D2 statistic for a range of biologically relevant parameters, describes existing applications of the method, and outlines future research directions.
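The definition in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `d2_exact` and `d2_mismatch` are hypothetical, and it assumes that overlapping words are counted and that a "match" is a pair of positions, one in each sequence. For exact matches (t = 0), tallying words in a hash table gives a runtime roughly proportional to the combined sequence length for fixed k, consistent with the linear scaling the abstract describes. The mismatch version is a plain brute-force comparison, included only to make the definition concrete.

```python
from collections import Counter


def d2_exact(seq_a, seq_b, k):
    """Count exact k-word matches (t = 0) between two sequences.

    Each sequence is scanned once and its overlapping k-words are tallied;
    the statistic is the sum over shared words of the product of counts.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    counts_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    counts_b = Counter(seq_b[j:j + k] for j in range(len(seq_b) - k + 1))
    # A missing key in a Counter reads as 0, so unshared words contribute nothing.
    return sum(n * counts_b[word] for word, n in counts_a.items())


def d2_mismatch(seq_a, seq_b, k, t):
    """Count k-word matches with up to t mismatches, by brute force.

    Every pair of start positions (i, j) is compared directly, so the cost
    grows with the product of the sequence lengths. Illustration only.
    """
    if k <= 0 or t < 0:
        raise ValueError("k must be positive and t non-negative")
    total = 0
    for i in range(len(seq_a) - k + 1):
        word_a = seq_a[i:i + k]
        for j in range(len(seq_b) - k + 1):
            word_b = seq_b[j:j + k]
            mismatches = sum(x != y for x, y in zip(word_a, word_b))
            if mismatches <= t:
                total += 1
    return total
```

For example, under these assumptions `d2_exact("ACGTACGT", "ACGT", 2)` returns 6, and `d2_mismatch` with `t=0` gives the same value.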

    Original language: English
    Pages (from-to): 325-335
    Number of pages: 11
    Journal: Journal of the Korean Statistical Society
    Volume: 39
    Issue number: 3
    Publication status: Published - Sept 2010
