TY - JOUR
T1 - A segmented topic model based on the two-parameter Poisson-Dirichlet process
AU - Du, Lan
AU - Buntine, Wray
AU - Jin, Huidong
PY - 2010/10
Y1 - 2010/10
N2 - Documents come naturally with structure: a section contains paragraphs, which themselves contain sentences; a blog page contains a sequence of comments and links to related blogs. Structure, of course, implies something about shared topics. In this paper we take the simplest form of structure, a document consisting of multiple segments, as the basis for a new form of topic model. To make this computationally feasible, and to allow the form of collapsed Gibbs sampling that has worked well to date with topic models, we use the marginalized posterior of a two-parameter Poisson-Dirichlet process (or Pitman-Yor process) to handle the hierarchical modelling. Experiments using either paragraphs or sentences as segments show that, on the held-out perplexity measure, the method significantly outperforms both standard topic models applied to whole documents or to segments and previous segmented models.
AB - Documents come naturally with structure: a section contains paragraphs, which themselves contain sentences; a blog page contains a sequence of comments and links to related blogs. Structure, of course, implies something about shared topics. In this paper we take the simplest form of structure, a document consisting of multiple segments, as the basis for a new form of topic model. To make this computationally feasible, and to allow the form of collapsed Gibbs sampling that has worked well to date with topic models, we use the marginalized posterior of a two-parameter Poisson-Dirichlet process (or Pitman-Yor process) to handle the hierarchical modelling. Experiments using either paragraphs or sentences as segments show that, on the held-out perplexity measure, the method significantly outperforms both standard topic models applied to whole documents or to segments and previous segmented models.
KW - Document structure
KW - Latent Dirichlet allocation
KW - Segmented topic model
KW - Two-parameter Poisson-Dirichlet process
UR - http://www.scopus.com/inward/record.url?scp=77955656991&partnerID=8YFLogxK
U2 - 10.1007/s10994-010-5197-4
DO - 10.1007/s10994-010-5197-4
M3 - Article
SN - 0885-6125
VL - 81
SP - 5
EP - 19
JO - Machine Learning
JF - Machine Learning
IS - 1
ER -