Abstract
Given two sequences over a finite alphabet L, the D2 statistic is the number of m-letter word matches between the two sequences. This statistic is used in bioinformatics for expressed sequence tag database searches. Here we study a generalization of the D2 statistic in the context of DNA sequences, under the assumption of strand-symmetric Bernoulli text. For k < m, we look at the count of m-letter word matches with up to k mismatches. For this statistic, we compute the expectation, give upper and lower bounds for the variance, and prove that its distribution is asymptotically normal.
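As an illustration of the statistic described above, the following is a minimal brute-force sketch, not the paper's method. It assumes that a "word match with up to k mismatches" means a pair of m-letter windows, one from each sequence, whose Hamming distance is at most k. The function names `hamming` and `approx_word_matches` are hypothetical and chosen only for this example. Setting k = 0 gives the plain D2 count.

```python
def hamming(u, v):
    """Number of positions at which two equal-length words differ."""
    return sum(a != b for a, b in zip(u, v))


def approx_word_matches(seq_a, seq_b, m, k=0):
    """Count window pairs (i, j) whose m-letter words differ in at most k positions.

    Assumption: every pair of start positions is compared, so the cost is
    O(len(seq_a) * len(seq_b) * m). This is for illustration only.
    """
    if m <= 0 or not 0 <= k < m:
        raise ValueError("require m > 0 and 0 <= k < m")
    count = 0
    for i in range(len(seq_a) - m + 1):
        word_a = seq_a[i:i + m]
        for j in range(len(seq_b) - m + 1):
            if hamming(word_a, seq_b[j:j + m]) <= k:
                count += 1
    return count


if __name__ == "__main__":
    # Exact 2-letter matches (k = 0), then matches allowing one mismatch (k = 1).
    print(approx_word_matches("ACGT", "ACGA", m=2, k=0))
    print(approx_word_matches("ACGT", "ACGA", m=2, k=1))
```

The quadratic scan keeps the definition visible. A practical implementation would more likely index the words of one sequence first.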
| Original language | English |
|---|---|
| Pages (from-to) | 1-21 |
| Number of pages | 21 |
| Journal | Annals of Applied Probability |
| Volume | 18 |
| Issue number | 1 |
| Publication status | Published - Feb 2008 |
| Title | Approximate word matches between two random sequences |