Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 
WelshClear Cookie - decide language by browser settings

Semantically enhanced document clustering

Stankov, Ivan 2013. Semantically enhanced document clustering. PhD Thesis, Cardiff University.
Item availability restricted.

PDF - Accepted Post-Print Version
Download (3MB) | Preview
[img] PDF - Additional Metadata
Restricted to Repository staff only

Download (765kB)


This thesis advocates the view that traditional document clustering could be significantly improved by representing documents at different levels of abstraction at which the similarity between documents is considered. The improvement is with regard to the alignment of the clustering solutions to human judgement. The proposed methodology employs semantics with which the conceptual similarity be-tween documents is measured. The goal is to design algorithms which implement the meth-odology, in order to solve the following research problems: (i) how to obtain multiple deter-ministic clustering solutions; (ii) how to produce coherent large-scale clustering solutions across domains, regardless of the number of clusters; (iii) how to obtain clustering solutions which align well with human judgement; and (iv) how to produce specific clustering solu-tions from the perspective of the user’s understanding for the domain of interest. The developed clustering methodology enhances separation between and improved coher-ence within clusters generated across several domains by using levels of abstraction. The methodology employs a semantically enhanced text stemmer, which is developed for the pur-pose of producing coherent clustering, and a concept index that provides generic document representation and reduced dimensionality of document representation. These characteristics of the methodology enable addressing the limitations of traditional text document clustering by employing computationally expensive similarity measures such as Earth Mover’s Distance (EMD), which theoretically aligns the clustering solutions closer to human judgement. A threshold for similarity between documents that employs many-to-many similarity matching is proposed and experimentally proven to benefit the traditional clustering algorithms in pro-ducing clustering solutions aligned closer to human judgement. 4 The experimental validation demonstrates the scalability of the semantically enhanced document clustering methodology and supports the contributions: (i) multiple deterministic clustering solutions and different viewpoints to a document collection are obtained; (ii) the use of concept indexing as a document representation technique in the domain of document clustering is beneficial for producing coherent clusters across domains; (ii) SETS algorithm provides an improved text normalisation by using external knowledge; (iv) a method for measuring similarity between documents on a large scale by using many-to-many matching; (v) a semantically enhanced methodology that employs levels of abstraction that correspond to a user’s background, understanding and motivation. The achieved results will benefit the research community working in the area of document management, information retrieval, data mining and knowledge management.

Item Type: Thesis (PhD)
Status: Unpublished
Schools: Engineering
Subjects: T Technology > TA Engineering (General). Civil engineering (General)
Uncontrolled Keywords: document clustering, semantics, human judgement, concept indexing, many-to-many matching
Date of First Compliant Deposit: 30 March 2016
Last Modified: 17 Jun 2017 07:50

Actions (repository staff only)

Edit Item Edit Item


Downloads per month over past year

View more statistics