TY - GEN
T1 - The distribution of short word match counts between Markovian sequences
AU - Burden, Conrad J.
AU - Leopardi, Paul
AU - Forêt, Sylvain
PY - 2013
Y1 - 2013
N2 - The D2 statistic, which counts the number of word matches between two given sequences, has long been proposed as a measure of similarity for biological sequences. Much of the mathematically rigorous work carried out to date on the properties of the D2 statistic has been restricted to the case of 'Bernoulli' sequences composed of independent and identically distributed letters. Here the properties of the distribution of this statistic are studied for the biologically more realistic case of Markovian sequences. The approach is novel in that Markovian dependency is defined for sequences with periodic boundary conditions, and this enables exact analytic formulae for the mean and variance to be derived. The formulae are confirmed using numerical simulations, and asymptotic approximations to the full distribution are tested.
AB - The D2 statistic, which counts the number of word matches between two given sequences, has long been proposed as a measure of similarity for biological sequences. Much of the mathematically rigorous work carried out to date on the properties of the D2 statistic has been restricted to the case of 'Bernoulli' sequences composed of independent and identically distributed letters. Here the properties of the distribution of this statistic are studied for the biologically more realistic case of Markovian sequences. The approach is novel in that Markovian dependency is defined for sequences with periodic boundary conditions, and this enables exact analytic formulae for the mean and variance to be derived. The formulae are confirmed using numerical simulations, and asymptotic approximations to the full distribution are tested.
KW - Biological sequence comparison
KW - Word matches
UR - http://www.scopus.com/inward/record.url?scp=84877974186&partnerID=8YFLogxK
M3 - Conference contribution
SN - 9789898565358
T3 - BIOINFORMATICS 2013 - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms
SP - 25
EP - 33
BT - BIOINFORMATICS 2013 - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms
T2 - International Conference on Bioinformatics Models, Methods and Algorithms, BIOINFORMATICS 2013
Y2 - 11 February 2013 through 14 February 2013
ER -