Fast and space efficient string kernels using suffix arrays

Hui Teo Choon*, S. V.N. Vishwanathan

*Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

    14 Citations (Scopus)

    Abstract

    String kernels which compare the set of all common substrings between two given strings have recently been proposed by Vishwanathan & Smola (2004). Surprisingly, these kernels can be computed in linear time and linear space using annotated suffix trees. Even though, in theory, the suffix tree based algorithm requires O(n) space for an n length string, in practice at least 40n bytes are required - 20n bytes for storing the suffix tree, and an additional 20n bytes for the annotation. This large memory requirement coupled with poor locality of memory access, inherent due to the use of suffix trees, means that the performance of the suffix tree based algorithm deteriorates on large strings. In this paper, we describe a new linear time yet space efficient and scalable algorithm for computing string kernels, based on suffix arrays. Our algorithm is a) faster and easier to implement, b) on the average requires only 19n bytes of storage, and c) exhibits strong locality of memory access. We show that our algorithm can be extended to perform linear time prediction on a test string, and present experiments to validate our claims.

    Original languageEnglish
    Title of host publicationACM International Conference Proceeding Series - Proceedings of the 23rd International Conference on Machine Learning, ICML 2006
    Pages929-936
    Number of pages8
    DOIs
    Publication statusPublished - 2006
    Event23rd International Conference on Machine Learning, ICML 2006 - Pittsburgh, PA, United States
    Duration: 25 Jun 200629 Jun 2006

    Publication series

    NameACM International Conference Proceeding Series
    Volume148

    Conference

    Conference23rd International Conference on Machine Learning, ICML 2006
    Country/TerritoryUnited States
    CityPittsburgh, PA
    Period25/06/0629/06/06

    Fingerprint

    Dive into the research topics of 'Fast and space efficient string kernels using suffix arrays'. Together they form a unique fingerprint.

    Cite this