Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Application of improved automated text mining to transcriptome datasets

Leong, Hui Sun 2009. Application of improved automated text mining to transcriptome datasets. PhD Thesis, Cardiff University.

Full text: U570958.pdf (PDF, Accepted Post-Print Version, 23MB)

Abstract

A major challenge in microarray data analysis is the functional interpretation of gene lists. A common approach is over-representation analysis (ORA), which uses the hypergeometric test (or its variants) to evaluate whether a particular functionally defined group of genes is represented more often than expected by chance within a gene list. Existing applications of ORA have been largely limited to controlled vocabularies such as Gene Ontology (GO) terms and KEGG pathways. This work therefore aims to determine whether ORA can be extended to the mining of free text. Initial explorations using the classical hypergeometric distribution to analyse tokens from PubMed abstracts revealed an unexpected feature: gene lists derived from typical microarray experiments tend to have more annotation (PubMed abstracts) associated with them than would be expected by chance. This annotation bias, a result of patterns of research activity within the biomedical community, is a major problem for ORA based on the classical hypergeometric test, which cannot account for it. Its effect is a marked over-representation of many common (and likely uninformative) terms, interspersed with terms that appear to convey real biological insight. Several solutions were developed to address this issue. The first is based on a permutation test, but this nonparametric approach is computationally intensive. Two computationally tractable approaches were subsequently developed, based on the detection of outliers and on the extended hypergeometric distribution. The performance of the proposed text-based ORA approaches was demonstrated on a wide range of published datasets covering different species. A comparison with existing tools that use GO terms suggests that mining PubMed abstracts can reveal additional biological insight that may not be obtainable by mining pre-defined ontologies alone.
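As an illustration of the classical hypergeometric test that the abstract takes as its starting point, the following is a minimal sketch in Python. It is not code from the thesis: the function name `hypergeom_upper_tail` and the numbers in the usage example are hypothetical, chosen only to show how an over-representation p-value is computed for a single term. It does not correct for the annotation bias described above.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """Return P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: number of genes in the background (e.g. all genes on the array)
    K: background genes annotated with the term of interest
    n: size of the gene list under study
    k: genes in the list that are annotated with the term
    """
    total = comb(N, n)    # number of ways to draw any list of size n
    upper = min(K, n)     # the overlap cannot exceed either group
    # Sum the probabilities of observing k, k+1, ..., upper annotated genes.
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / total


# Hypothetical example: 20000 background genes, 200 of them linked to a term,
# and a list of 100 genes of which 5 are linked to that term.
p_value = hypergeom_upper_tail(5, 20000, 200, 100)
print(f"over-representation p-value: {p_value:.3g}")
```

A small p-value here indicates that the term occurs in the gene list more often than random sampling from the background would predict. Under the annotation bias the abstract describes, this classical calculation would tend to flag many common terms as well.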

Item Type: Thesis (PhD)
Status: Unpublished
Schools: Medicine
Subjects: R Medicine > R Medicine (General)
Date of First Compliant Deposit: 30 March 2016
Last Modified: 19 Mar 2016 23:32
URI: https://orca.cardiff.ac.uk/id/eprint/55528
