TY - GEN
T1 - Fast and space efficient string kernels using suffix arrays
AU - Choon, Hui Teo
AU - Vishwanathan, S. V.N.
PY - 2006
Y1 - 2006
N2 - String kernels which compare the set of all common substrings between two given strings have recently been proposed by Vishwanathan & Smola (2004). Surprisingly, these kernels can be computed in linear time and linear space using annotated suffix trees. Even though, in theory, the suffix tree based algorithm requires O(n) space for an n length string, in practice at least 40n bytes are required - 20n bytes for storing the suffix tree, and an additional 20n bytes for the annotation. This large memory requirement coupled with poor locality of memory access, inherent due to the use of suffix trees, means that the performance of the suffix tree based algorithm deteriorates on large strings. In this paper, we describe a new linear time yet space efficient and scalable algorithm for computing string kernels, based on suffix arrays. Our algorithm is a) faster and easier to implement, b) on the average requires only 19n bytes of storage, and c) exhibits strong locality of memory access. We show that our algorithm can be extended to perform linear time prediction on a test string, and present experiments to validate our claims.
AB - String kernels which compare the set of all common substrings between two given strings have recently been proposed by Vishwanathan & Smola (2004). Surprisingly, these kernels can be computed in linear time and linear space using annotated suffix trees. Even though, in theory, the suffix tree based algorithm requires O(n) space for an n length string, in practice at least 40n bytes are required - 20n bytes for storing the suffix tree, and an additional 20n bytes for the annotation. This large memory requirement coupled with poor locality of memory access, inherent due to the use of suffix trees, means that the performance of the suffix tree based algorithm deteriorates on large strings. In this paper, we describe a new linear time yet space efficient and scalable algorithm for computing string kernels, based on suffix arrays. Our algorithm is a) faster and easier to implement, b) on the average requires only 19n bytes of storage, and c) exhibits strong locality of memory access. We show that our algorithm can be extended to perform linear time prediction on a test string, and present experiments to validate our claims.
UR - http://www.scopus.com/inward/record.url?scp=34250766728&partnerID=8YFLogxK
U2 - 10.1145/1143844.1143961
DO - 10.1145/1143844.1143961
M3 - Conference contribution
SN - 1595933832
SN - 9781595933836
T3 - ACM International Conference Proceeding Series
SP - 929
EP - 936
BT - ACM International Conference Proceeding Series - Proceedings of the 23rd International Conference on Machine Learning, ICML 2006
T2 - 23rd International Conference on Machine Learning, ICML 2006
Y2 - 25 June 2006 through 29 June 2006
ER -