Cardiff University | Prifysgol Caerdydd ORCA
Online Research @ Cardiff 

Dimension reduction for exponential family data with applications to text data

Smallman, Luke 2019. Dimension reduction for exponential family data with applications to text data. PhD Thesis, Cardiff University.
Item availability restricted.

PDF - Accepted Post-Print Version

Abstract

In this manuscript, we will address the problem of dimension reduction for data modelled by an exponential family distribution, with a particular focus on text data modelled by a Poisson-count model. We are motivated to develop new methods for such data by links between principal component analysis (PCA) and the Gaussian log-likelihood, which suggest both a simple way to extend PCA to the exponential family (of which the Gaussian distribution is a member), and the unsuitability of PCA when the data is appropriately modelled by a distribution which is not well-approximated by the Gaussian distribution. We will present three novel methods for exponential family dimension reduction. The first is “Poisson Inverse Regression”, a supervised method from the family of inverse regression methods. We will demonstrate that this method provides a sufficient dimension reduction; that is, the transformed data is statistically sufficient with respect to the response. The second is Sparse Generalised Principal Component Analysis, which extends the method of Generalised Principal Component Analysis put forward by Landgraf and Lee (2015b). This method is unsupervised, and is motivated by a modification of the PCA objective function to accommodate other exponential family distributions. We demonstrate that this method performs as well as or better than other state-of-the-art methods. This work has been published as Smallman, Artemiou, et al. (2018). The third is Sparse Simple Exponential/Poisson Principal Component Analysis. This method extends Simple Exponential Principal Component Analysis, put forward by Li and Tao (2013), by enforcing sparsity in the equivalent of the loadings matrix. This method is also unsupervised, and we demonstrate its state-of-the-art performance. This work was done jointly with William Underwood from Oxford University, and is published in Smallman, Underwood, et al. (2019).
Finally, we present a new framework for analysing and synthesising dimension reduction methods, which we call “Quasi-Likelihood PCA”. This is based on tensor estimating equations, which we also present as a new development. We apply this framework to analyse several methods in the literature.
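
The link between PCA and the Gaussian log-likelihood mentioned in the abstract can be illustrated with a short derivation. This is a sketch in standard textbook notation, not necessarily the notation or exact formulation used in the thesis. The symbols $X$ (an $n \times p$ data matrix), $\Theta$ (the matrix of natural parameters), $b$ (the cumulant function) and $k$ (the target rank) are assumptions introduced here for illustration, and unit dispersion is assumed throughout.

```latex
% Assumption: independent entries x_{ij} from a one-parameter exponential
% family with natural parameter \theta_{ij} and unit dispersion.
\begin{align}
  f(x_{ij};\theta_{ij}) &= \exp\{ x_{ij}\theta_{ij} - b(\theta_{ij}) + c(x_{ij}) \}, \\
  -\log L(\Theta; X) &= \sum_{i=1}^{n}\sum_{j=1}^{p}
      \bigl[\, b(\theta_{ij}) - x_{ij}\theta_{ij} \,\bigr] + \text{const}.
\end{align}
% Gaussian case with unit variance: b(\theta) = \theta^2/2, so
\begin{align}
  -\log L(\Theta; X) &= \tfrac{1}{2}\sum_{i,j} (x_{ij} - \theta_{ij})^2 + \text{const}
      = \tfrac{1}{2}\,\lVert X - \Theta \rVert_F^2 + \text{const}.
\end{align}
% Minimising this over matrices \Theta of rank at most k (for centred X)
% is solved by the rank-k truncated SVD of X, i.e. by PCA (Eckart--Young).
% Poisson case: b(\theta) = e^{\theta}, with \theta_{ij} = \log\lambda_{ij}, so
\begin{align}
  -\log L(\Theta; X) &= \sum_{i,j} \bigl[\, e^{\theta_{ij}} - x_{ij}\theta_{ij} \,\bigr] + \text{const}.
\end{align}
```

Under these assumptions, replacing the squared-error loss with the negative log-likelihood of another exponential family member, while keeping the low-rank constraint on $\Theta$, gives one natural generalisation of PCA to count data such as word frequencies.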

Item Type: Thesis (PhD)
Date Type: Completion
Status: Unpublished
Schools: Mathematics
Subjects: Q Science > QA Mathematics
Funders: Cardiff and Vale University Health Board
Date of First Compliant Deposit: 16 March 2020
Last Modified: 16 Mar 2020 10:44
URI: http://orca.cf.ac.uk/id/eprint/130420
