Asymptotic behavior of k-word matches between two uniformly distributed sequences

M. R. Kantorovitz*, H. S. Booth, C. J. Burden, S. R. Wilson

*Corresponding author for this work

    Research output: Contribution to journalArticlepeer-review

    18 Citations (Scopus)

    Abstract

    Given two sequences of length n over a finite alphabet A of size \A\ = d, the D2 statistic is the number of k-letter word matches between the two sequences. This statistic is used in bioinformatics for EST sequence database searches. Under the assumption of independent and identically distributed letters in the sequences, Lippert, Huang and Waterman (2002) raised questions about the asymptotic behavior of D2 when the alphabet is uniformly distributed. They expressed a concern that the commonly assumed normality may create errors in estimating significance. In this paper we answer those questions. Using Stein's method, we show that, for large enough k, the D2 statistic is approximately normal as n gets large. When k = 1, we prove that, for large enough d, the D2 statistic is approximately normal as n gets large. We also give a formula for the variance of D2 in the uniform case.

    Original languageEnglish
    Pages (from-to)788-805
    Number of pages18
    JournalJournal of Applied Probability
    Volume44
    Issue number3
    DOIs
    Publication statusPublished - Sept 2007

    Fingerprint

    Dive into the research topics of 'Asymptotic behavior of k-word matches between two uniformly distributed sequences'. Together they form a unique fingerprint.

    Cite this