Zhu, Wilbur (2021). Regularized Risk Prediction Models in Subject/Patient Analytics in a Time to Event Setting. Doctoral thesis (Ph.D), UCL (University College London).
Text: Zhu_10136088_thesis_chaps_4-6_redacted.pdf (754kB)
Abstract
This thesis comprises five investigations and focuses on risk prediction modelling from a computational statistics and machine learning perspective, with applications in subject (e.g. gym user, patient) analytics in a time to event setting. The work was conducted in collaboration with eGym and UCL Hospitals (UCLH). A variety of computational statistics (e.g. logistic lasso) and machine learning based risk prediction methods are applied, ranging from kernel methods and ensemble methods to decision trees, from both a classification and a survival perspective. The thesis is concerned with modelling gym user behaviour and predicting treatment times and types. The underlying goal of this thesis is to develop generalizable and useful models to predict gym user behaviour and patient treatment times. This is what leads us to our methodological work in chapter 6. This thesis conducts the following investigations.

1. Weibull full likelihood implementation

The first investigation implements a Weibull full likelihood survival model in R. The aim is to build the Weibull proportional hazards model, which is formulated via the log likelihood. We then apply this model to simulated data to see whether it can reveal the true pattern of the data. The results show that, on synthetic data, the model we build in R recovers the parameters and coefficients from which the data were generated.

2. Predicting gym user behaviour through churn and visits

The second investigation, consisting of two sub-investigations, considers the use of time to event models to predict gym user behaviour and churn. The data set has been provided by the gym equipment manufacturer eGym. The first sub-investigation considers whether we can predict if a user will churn, using a range of methods across computational statistics and machine learning, from logistic regression to survival random forests.
Our findings indicate that with demographics alone we are unable to produce machine learning models that outperform a baseline learner. This tells us that we are unable to predict, right at the beginning, whether or not a user will churn. However, when we apply machine learning based survival models, including elastic net Cox and Cox boosting, we are able to outperform the baseline. This sub-investigation serves as an introduction to considering gym user churn in a time to event setting through both classification and survival models. In the second sub-investigation, we apply risk prediction modelling to predict gym user visits via a moving window model; we find we are marginally able to outperform the majority vote baseline in some settings.

3. Predicting patient treatment times and treatment types for patient rehabilitative care

The third investigation, also consisting of two sub-investigations, concerns the use of time to event modelling to predict patient treatment times and treatment types for patient rehabilitative care. The underlying goal is to help design treatment plans aimed at helping patients return to work, by predicting the combination of treatment time and treatment types required for each patient. The data has been provided by UCLH. All patients in the data set were eventually discharged from the treatment programme. The aim of the first sub-investigation is to predict how much treatment time patients required before they were discharged and which patients are more likely to take longer. We model this problem using regression and survival analysis, with methods ranging from generalized additive models to Cox boosting. Our results show that, using demographic variables, we are able to outperform the baseline. In the second sub-investigation, we use risk prediction models, such as logistic regression and AdaBoost, to predict treatment types based on demographics.
We are able to outperform the baseline for some treatments in a deterministic setting but not in a probabilistic setting.

4. Regularization problems in gym user/patient setting

As alluded to above, our model performance is mixed in both application settings. Our aim therefore is to investigate how we can improve model performance and usefulness. This motivates our methodological studies: improving model performance via hyper-parameter tuning based on the relevant loss function. We begin by using F1, Brier score and net benefit as the scoring functions for parameter tuning to build LASSO models. We then run the models on the gym user data and hospital data and compare their performance. We find we are able to outperform the conventional LASSO models in terms of F1, Brier score and net benefit when using each as the tuning function, respectively. The different LASSO models provide different variable selections and insights.

We then use the integrated Brier score to tune the parameters of Cox proportional hazards LASSO models in a survival setting. Compared with the conventional performance measure, the concordance index, the integrated Brier score better reflects the error over all time. We find that by tuning parameters for the integrated Brier score we are able to obtain better integrated Brier score performance and different variable selections. We also examine whether the integrated Brier score is useful for improving survival performance not only over all times but at specific times too. We apply the Cox proportional hazards LASSO models with integrated Brier score and concordance index as the scoring functions to the gym user and hospital data sets. The results show the models perform better on the corresponding loss functions, but the integrated Brier score LASSO model does not guarantee better performance at a specific time.
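The first part of investigation 4, choosing the LASSO penalty by F1, Brier score or net benefit rather than a default criterion, can be sketched roughly as follows. This is a minimal illustration and not the thesis code: the function names, the 0.5 threshold, the penalty values and the validation predictions are all hypothetical, and fitting the LASSO models themselves is assumed to have happened elsewhere. Net benefit uses the standard decision-curve form TP/n - (FP/n) * pt/(1 - pt).

```python
from typing import Callable, Dict, Sequence, Tuple


def brier_score(y_true: Sequence[int], p_hat: Sequence[float]) -> float:
    """Mean squared gap between predicted probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_hat)) / len(y_true)


def _counts(y_true: Sequence[int], p_hat: Sequence[float],
            threshold: float) -> Tuple[int, int, int]:
    """True positives, false positives and false negatives at a probability threshold."""
    tp = sum(1 for y, p in zip(y_true, p_hat) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, p_hat) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, p_hat) if p < threshold and y == 1)
    return tp, fp, fn


def f1_score(y_true: Sequence[int], p_hat: Sequence[float],
             threshold: float = 0.5) -> float:
    """F1 of the thresholded predictions (higher is better)."""
    tp, fp, fn = _counts(y_true, p_hat, threshold)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def net_benefit(y_true: Sequence[int], p_hat: Sequence[float],
                threshold: float = 0.5) -> float:
    """Decision-curve net benefit at a threshold in (0, 1) (higher is better)."""
    tp, fp, _ = _counts(y_true, p_hat, threshold)
    n = len(y_true)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def select_penalty(y_true: Sequence[int],
                   preds_by_penalty: Dict[float, Sequence[float]],
                   score: Callable[[Sequence[int], Sequence[float]], float],
                   higher_is_better: bool = True) -> float:
    """Return the candidate penalty whose validation predictions score best."""
    sign = 1.0 if higher_is_better else -1.0
    return max(preds_by_penalty,
               key=lambda lam: sign * score(y_true, preds_by_penalty[lam]))


# Hypothetical validation outcomes and predictions for two candidate penalties.
y_val = [1, 0, 1, 0]
preds = {
    0.01: [0.55, 0.45, 0.55, 0.45],  # separates classes, but weakly calibrated
    0.10: [0.95, 0.05, 0.95, 0.55],  # sharper probabilities, one false positive
}

print("Brier choice:", select_penalty(y_val, preds, brier_score, higher_is_better=False))
print("F1 choice:", select_penalty(y_val, preds, f1_score))
print("Net benefit choice:", select_penalty(y_val, preds, net_benefit))
```

On this toy data the Brier score and the F1 score prefer different penalties, which is the kind of divergence in selected models the abstract reports; the same selection step would apply to an integrated Brier score or concordance index in the survival setting.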
Finally, we extend our methodology to more modern machine learning methods such as support vector machines (SVMs). We use F1 score, Brier score and net benefit as scoring functions to tune the hyper-parameters, including C, of SVMs and run the models on the gym user data and hospital data. The results show they only slightly outperform the conventional model and perform particularly poorly in the deterministic setting due to the data imbalance.

Contributions to Science

This thesis makes the following contributions to science.

1. Applies logistic regression, linear discriminant analysis, support vector machines and random forests to predict gym user attendance and churn.
2. Introduces the idea of comparing gym user prediction models to a majority vote baseline.
3. Introduces moving window prediction models for gym user visit prediction.
4. Discovers the relationship between patient demographics and rehabilitative care treatment times.
5. Introduces machine learning and computational statistics to predict patient treatment times and types for neurological rehabilitation patients.
6. Introduces F1, Brier score and, in particular, net benefit LASSO models for gym user churn prediction and treatment type prediction.
7. Introduces the use of the integrated Brier score for tuning Cox LASSO models.
8. Extends the idea of parameter tuning via the F1, Brier score and net benefit to modern machine learning methods.
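As a reference point for investigation 1, a Weibull proportional hazards model with right censoring can be written as below. This is a sketch under an assumed parameterisation (scale $\lambda$, shape $\gamma$, coefficients $\beta$, event indicator $\delta_i$, observed time $t_i$); these symbols are not taken from the thesis, which may parameterise the model differently.

$$
h(t \mid x) = \lambda \gamma t^{\gamma - 1} \exp(x^\top \beta), \qquad
S(t \mid x) = \exp\{-\lambda t^{\gamma} \exp(x^\top \beta)\}
$$

$$
\ell(\lambda, \gamma, \beta) = \sum_{i=1}^{n} \Big[ \delta_i \big( \log\lambda + \log\gamma + (\gamma - 1)\log t_i + x_i^\top \beta \big) - \lambda t_i^{\gamma} \exp(x_i^\top \beta) \Big]
$$

Each term is $\delta_i \log h(t_i \mid x_i) + \log S(t_i \mid x_i)$. Maximising $\ell$ on data simulated from known values of $\lambda$, $\gamma$ and $\beta$, and checking that the estimates land close to those values, is the kind of recovery check the first investigation describes.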
Type: | Thesis (Doctoral) |
---|---|
Qualification: | Ph.D |
Title: | Regularized Risk Prediction Models in Subject/Patient Analytics in a Time to Event Setting |
Event: | UCL (University College London) |
Open access status: | An open access version is available from UCL Discovery |
Language: | English |
Additional information: | Copyright © The Author 2021. Original content in this thesis is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) Licence (https://creativecommons.org/licenses/by-nc/4.0/). Any third-party copyright material present remains the property of its respective owner(s) and is licensed under its existing terms. Access may initially be restricted at the author’s request. |
UCL classification: | UCL; UCL > Provost and Vice Provost Offices > UCL BEAMS; UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of the Built Environment; UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of the Built Environment > Bartlett School Env, Energy and Resources |
URI: | https://discovery-pp.ucl.ac.uk/id/eprint/10136088 |