Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Helmholtz principle on word embeddings for automatic document segmentation

Krzeminski, Dominik, Balinsky, Helen and Balinsky, Alexander 2018. Helmholtz principle on word embeddings for automatic document segmentation. Presented at: DocEng 18: 18th ACM Symposium on Document Engineering, Halifax, Nova Scotia, Canada, 28-31 August 2018. ACM,
Item availability restricted.

PDF - Accepted Post-Print Version
Restricted to Repository staff only

Abstract

Automatic document segmentation is receiving increasing attention in natural language processing. The problem is defined as dividing a text into lexically coherent fragments. Most realistic documents are not homogeneous, so extracting their underlying structure might improve the performance of algorithms for tasks such as topic recognition, document summarization, and document categorization. At the same time, recent advances in word embedding procedures have accelerated the development of various text mining methods. Models such as word2vec and GloVe allow efficient learning of representations from large textual datasets and thus provide more robust measures of word similarity. This study proposes a new document segmentation algorithm that combines an embedding-based measure of the relation between words with the Helmholtz principle for text mining. We compare two of the most common word embedding models and show that our approach yields an improvement on a benchmark dataset.

Item Type: Conference or Workshop Item (Paper)
Date Type: Publication
Status: In Press
Schools: Mathematics
Publisher: ACM
ISBN: 978-1-4503-5769-2
Date of First Compliant Deposit: 15 June 2018
Last Modified: 07 Aug 2018 15:05
URI: http://orca.cf.ac.uk/id/eprint/112497
