Approximate word matches between two random sequences

Conrad J. Burden, Miriam R. Kantorovitz, Susan R. Wilson

    Research output: Contribution to journal › Article › peer-review

    18 Citations (Scopus)

    Abstract

    Given two sequences over a finite alphabet L, the D2 statistic is the number of m-letter word matches between the two sequences. This statistic is used in bioinformatics for expressed sequence tag database searches. Here we study a generalization of the D2 statistic in the context of DNA sequences, under the assumption of strand symmetric Bernoulli text. For k < m, we look at the count of m-letter word matches with up to k mismatches. For this statistic, we compute the expectation, give upper and lower bounds for the variance, and prove that its distribution is asymptotically normal.
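As an illustration of the statistic described in the abstract, the following is a minimal sketch, not the authors' code. It assumes non-periodic sequences (words are not wrapped around the ends), which may differ from the boundary conventions used in the paper. The function names `approx_word_matches` and `expected_matches` are hypothetical. The expected count relies only on the Bernoulli (i.i.d. letters) assumption: each aligned letter pair matches independently with probability equal to the sum of squared letter probabilities, so the number of mismatches in an m-letter word pair is binomial.

```python
from math import comb


def approx_word_matches(a: str, b: str, m: int, k: int) -> int:
    """Count pairs (i, j) whose m-letter words a[i:i+m] and b[j:j+m]
    differ in at most k positions. Brute force, for illustration only."""
    count = 0
    for i in range(len(a) - m + 1):
        for j in range(len(b) - m + 1):
            mismatches = sum(x != y for x, y in zip(a[i:i + m], b[j:j + m]))
            if mismatches <= k:
                count += 1
    return count


def expected_matches(len_a: int, len_b: int, m: int, k: int, probs: dict) -> float:
    """Expected count under i.i.d. letters with probabilities `probs`
    (assumption: non-periodic sequences, independent of each other)."""
    p2 = sum(p * p for p in probs.values())  # chance two random letters agree
    # Probability that an m-letter word pair has at most k mismatches.
    per_pair = sum(comb(m, t) * p2 ** (m - t) * (1 - p2) ** t for t in range(k + 1))
    return (len_a - m + 1) * (len_b - m + 1) * per_pair


# Strand symmetry means P(A) = P(T) and P(C) = P(G); uniform is one such case.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
exact = approx_word_matches("ACGT", "ACGA", m=2, k=0)
approx = approx_word_matches("ACGT", "ACGA", m=2, k=1)
mean_k1 = expected_matches(4, 4, m=2, k=1, probs=uniform)
```

Setting k = 0 recovers the ordinary D2 count of exact m-letter word matches.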

    Original language: English
    Pages (from-to): 1-21
    Number of pages: 21
    Journal: Annals of Applied Probability
    Volume: 18
    Issue number: 1
    Publication status: Published - Feb 2008
