Krzeminski, Dominik, Balinsky, Helen ORCID: https://orcid.org/0000-0002-8151-4462 and Balinsky, Alexander ORCID: https://orcid.org/0000-0002-8151-4462
2018.
Helmholtz principle on word embeddings for automatic document segmentation.
Presented at: DocEng '18: 18th ACM Symposium on Document Engineering,
Halifax, Nova Scotia, Canada,
28-31 August 2018.
DocEng '18: Proceedings of the ACM Symposium on Document Engineering 2018.
New York, NY:
ACM,
10.1145/3209280.3229103
Abstract
Automatic document segmentation is receiving increasing attention in the natural language processing field. The problem is defined as dividing text into lexically coherent fragments. Most realistic documents are not homogeneous, so extracting their underlying structure might improve the performance of various algorithms in problems such as topic recognition, document summarization, and document categorization. At the same time, recent advances in word embedding procedures have accelerated the development of various text mining methods. Models such as word2vec and GloVe allow efficient learning of representations of large textual datasets and thus introduce more robust measures of word similarity. This study proposes a new document segmentation algorithm that combines an embedding-based measure of the relation between words with the Helmholtz Principle for text mining. We compare two of the most common word embedding models and show that our approach yields an improvement on a benchmark dataset.
Item Type: | Conference or Workshop Item (Paper) |
---|---|
Date Type: | Publication |
Status: | Published |
Schools: | Mathematics |
Publisher: | ACM |
ISBN: | 978-1-4503-5769-2 |
Date of First Compliant Deposit: | 15 June 2018 |
Last Modified: | 23 Oct 2022 14:00 |
URI: | https://orca.cardiff.ac.uk/id/eprint/112497 |