Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Helmholtz principle on word embeddings for automatic document segmentation

Krzeminski, Dominik, Balinsky, Helen and Balinsky, Alexander 2018. Helmholtz principle on word embeddings for automatic document segmentation. Presented at: DocEng 18: 18th ACM Symposium on Document Engineering, Halifax, Nova Scotia, Canada, 28-31 August 2018. DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018. New York, NY: ACM, doi: 10.1145/3209280.3229103
Item availability restricted.

File: helmholtz-principle-word.pdf (PDF, Accepted Post-Print Version, 1MB)
Restricted to Repository staff only


Automatic document segmentation is receiving increasing attention in the natural language processing field. The problem is defined as dividing a text into lexically coherent fragments. Most realistic documents are not homogeneous, so extracting their underlying structure might improve the performance of various algorithms on problems such as topic recognition, document summarization, and document categorization. At the same time, recent advances in word embedding procedures have accelerated the development of various text mining methods. Models such as word2vec and GloVe allow efficient learning of representations from large textual datasets and thus introduce more robust measures of word similarity. This study proposes a new document segmentation algorithm that combines an embedding-based measure of the relation between words with the Helmholtz Principle for text mining. We compare two of the most common word embedding models and show that our approach improves results on a benchmark dataset.
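
As a rough illustration of what an embedding-based measure of relatedness can look like in a segmentation setting, the following minimal sketch places a boundary wherever the cosine similarity between the averaged word vectors of two adjacent sentences falls below a threshold. This is not the algorithm proposed in the paper and it does not implement the Helmholtz Principle at all. The toy vectors, the function names and the threshold value are assumptions made purely for illustration; in practice the vectors would come from a trained word2vec or GloVe model.

```python
import math

# Hypothetical 3-dimensional toy "embeddings" (assumption for illustration);
# real word2vec or GloVe vectors would be loaded from a trained model.
TOY_VECTORS = {
    "cat": (0.9, 0.1, 0.0),
    "dog": (0.8, 0.2, 0.0),
    "pet": (0.85, 0.15, 0.0),
    "stock": (0.0, 0.1, 0.9),
    "market": (0.1, 0.0, 0.95),
    "price": (0.05, 0.1, 0.9),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_vector(tokens, vectors):
    """Average the vectors of all known tokens; zero vector if none are known."""
    dim = len(next(iter(vectors.values())))
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return (0.0,) * dim
    return tuple(sum(v[i] for v in known) / len(known) for i in range(dim))


def segment(sentences, vectors, threshold=0.5):
    """Return indices i such that a segment boundary lies before sentence i.

    A boundary is placed when adjacent sentence vectors have cosine
    similarity below `threshold` (an arbitrary, hypothetical cut-off).
    """
    vecs = [sentence_vector(s, vectors) for s in sentences]
    boundaries = []
    for i in range(len(vecs) - 1):
        if cosine(vecs[i], vecs[i + 1]) < threshold:
            boundaries.append(i + 1)
    return boundaries


if __name__ == "__main__":
    doc = [
        ["cat", "dog"],
        ["pet", "dog"],
        ["stock", "market"],
        ["price", "market"],
    ]
    print(segment(doc, TOY_VECTORS))
```

On the toy document in the `__main__` block, the only low-similarity gap is between the second and third sentences, so the sketch reports a single boundary before sentence index 2. A fixed threshold is the simplest possible decision rule; the paper instead relies on the Helmholtz Principle, which is not reproduced here.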

Item Type: Conference or Workshop Item (Paper)
Date Type: Publication
Status: Published
Schools: Mathematics
Publisher: ACM
ISBN: 978-1-4503-5769-2
Date of First Compliant Deposit: 15 June 2018
Last Modified: 23 Oct 2022 14:00
