TY - JOUR
T1 - Sequential latent Dirichlet allocation
AU - Du, Lan
AU - Buntine, Wray
AU - Jin, Huidong
AU - Chen, Changyou
PY - 2012/6
Y1 - 2012/6
N2 - Understanding how topics within a document evolve over the structure of the document is an interesting and potentially important problem in exploratory and predictive text analytics. In this article, we address this problem by presenting a novel variant of latent Dirichlet allocation (LDA): Sequential LDA (SeqLDA). This variant directly considers the underlying sequential structure, i.e., a document consists of multiple segments (e.g., chapters, paragraphs), each of which is correlated to its antecedent and subsequent segments. Such progressive sequential dependency is captured by using the hierarchical two-parameter Poisson-Dirichlet process (HPDP). We develop an efficient collapsed Gibbs sampling algorithm to sample from the posterior of the SeqLDA based on the HPDP. Our experimental results on patent documents show that by considering the sequential structure within a document, our SeqLDA model has a higher fidelity over LDA in terms of perplexity (a standard measure of dictionary-based compressibility). The SeqLDA model also yields a nicer sequential topic structure than LDA, as we show in experiments on several books such as Melville's 'Moby Dick'.
AB - Understanding how topics within a document evolve over the structure of the document is an interesting and potentially important problem in exploratory and predictive text analytics. In this article, we address this problem by presenting a novel variant of latent Dirichlet allocation (LDA): Sequential LDA (SeqLDA). This variant directly considers the underlying sequential structure, i.e., a document consists of multiple segments (e.g., chapters, paragraphs), each of which is correlated to its antecedent and subsequent segments. Such progressive sequential dependency is captured by using the hierarchical two-parameter Poisson-Dirichlet process (HPDP). We develop an efficient collapsed Gibbs sampling algorithm to sample from the posterior of the SeqLDA based on the HPDP. Our experimental results on patent documents show that by considering the sequential structure within a document, our SeqLDA model has a higher fidelity over LDA in terms of perplexity (a standard measure of dictionary-based compressibility). The SeqLDA model also yields a nicer sequential topic structure than LDA, as we show in experiments on several books such as Melville's 'Moby Dick'.
KW - Collapsed Gibbs sampler
KW - Document structure
KW - Latent Dirichlet allocation
KW - Poisson-Dirichlet process
KW - Topic model
UR - http://www.scopus.com/inward/record.url?scp=84860918264&partnerID=8YFLogxK
U2 - 10.1007/s10115-011-0425-1
DO - 10.1007/s10115-011-0425-1
M3 - Article
SN - 0219-1377
VL - 31
SP - 475
EP - 503
JO - Knowledge and Information Systems
JF - Knowledge and Information Systems
IS - 3
ER -