Abstract
In this paper, we study the problem of automatically segmenting written text into paragraphs. This is inherently a sequence labeling problem; however, previous approaches ignore this sequential dependency. We propose a novel approach for automatic paragraph segmentation, namely, training semi-Markov models discriminatively using a max-margin method. This method allows us to model the sequential nature of the problem and to incorporate features of a whole paragraph, such as paragraph coherence, which cannot be used in previous models. Experimental evaluation on four text corpora shows improvement over the previous state-of-the-art method on this task.
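To illustrate why a semi-Markov formulation can use whole-paragraph features, the following is a minimal sketch of segment-level dynamic-programming decoding. It is not the paper's implementation: the function names `segment` and `toy_score`, the `max_len` bound, and the scoring rule are all hypothetical and chosen only for illustration. In a trained model, the score of a candidate paragraph would presumably come from learned weights over paragraph-level features.

```python
from typing import Callable, List, Tuple


def segment(
    n: int,
    score: Callable[[int, int], float],
    max_len: int,
) -> List[Tuple[int, int]]:
    """Split sentences 0..n-1 into contiguous segments [start, end)
    so that the sum of per-segment scores is maximal.

    `score(start, end)` sees the whole candidate segment at once, which is
    what lets segment-level (paragraph-level) features enter the model.
    """
    best = [float("-inf")] * (n + 1)  # best[i]: best total score for the first i sentences
    back = [0] * (n + 1)              # back[i]: start index of the last segment ending at i
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            candidate = best[start] + score(start, end)
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start

    # Follow the back-pointers from the end to recover the segments.
    segments: List[Tuple[int, int]] = []
    end = n
    while end > 0:
        segments.append((back[end], end))
        end = back[end]
    segments.reverse()
    return segments


def toy_score(start: int, end: int) -> float:
    # Hypothetical scoring rule: prefer paragraphs of exactly three sentences.
    return -abs((end - start) - 3)


paragraphs = segment(6, toy_score, max_len=5)
print(paragraphs)  # [(0, 3), (3, 6)]
```

A per-sentence classifier would have to decide each boundary from local evidence alone, whereas here each candidate paragraph is scored as a unit before the best overall segmentation is chosen.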
Original language | English |
---|---|
Pages | 640-648 |
Number of pages | 9 |
Publication status | Published - 2007 |
Event | 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL 2007 - Prague, Czech Republic. Duration: 28 Jun 2007 → 28 Jun 2007 |
Conference
Conference | 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL 2007 |
---|---|
Country/Territory | Czech Republic |
City | Prague |
Period | 28/06/07 → 28/06/07 |