Ling, Yurong
(2022)
High-dimensional non-Gaussian data analysis based on sample relationship.
Doctoral thesis (Ph.D), University College London.
Abstract
High-dimensional data are omnipresent. Although many statistical methods developed for analysing high-dimensional data adopt the normality assumption, the Gaussian distribution can be a poor approximation of real data in many applications. This thesis investigates how to properly analyse such high-dimensional non-Gaussian data. Quantifying sample relationships, for example by measuring inter-sample proximity or determining the neighbours of each sample, is an important step in numerous statistical approaches. This thesis therefore develops three methods, each based on the sample relationship, for analysing different types of high-dimensional non-Gaussian data: dimension reduction for single-cell RNA-sequencing data with missingness using a proposed proximity measure, dimension reduction for small-count data using a newly developed proximity measure, and modelling of skewed survival data using a proposed procedure for identifying the neighbours of samples. In chapter 3, I develop an unbiased estimator of the Gram matrix, which characterises the proximity between samples. The proposed estimator improves a broad spectrum of dimension reduction methods when applied to single-cell RNA-sequencing data with missingness. In addition, the consequences of directly applying existing dimension reduction methods to data with missingness are clarified both empirically and theoretically. In chapter 4, I develop a dissimilarity measure for count data with an excess of zeros, based on the Kullback-Leibler divergence and empirical Bayes estimators. The proposed measure is shown to have better discriminative power than other popular measures, and it boosts the performance of standard dimension reduction methods on count data containing many zeros. In chapter 5, I show that graphs derived from the features themselves can benefit the analysis of high-dimensional survival data when used in graph convolutional networks. Furthermore, a sequential forward floating selection algorithm is proposed to simultaneously perform survival analysis and uncover the local neighbourhoods of samples with the aid of graph convolutional networks.
Type: | Thesis (Doctoral) |
---|---|
Qualification: | Ph.D |
Title: | High-dimensional non-Gaussian data analysis based on sample relationship |
Open access status: | An open access version is available from UCL Discovery |
Language: | English |
Additional information: | Copyright © The Author 2022. Original content in this thesis is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) Licence (https://creativecommons.org/licenses/by-nc/4.0/). Any third-party copyright material present remains the property of its respective owner(s) and is licensed under its existing terms. Access may initially be restricted at the author’s request. |
UCL classification: | UCL; UCL > Provost and Vice Provost Offices > UCL BEAMS; UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Maths and Physical Sciences; UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Maths and Physical Sciences > Dept of Statistical Science |
URI: | https://discovery-pp.ucl.ac.uk/id/eprint/10149318 |