
Why Does Rebalancing Class-unbalanced Data Improve AUC for Linear Discriminant Analysis?

Xue, J; Hall, P; (2015) Why Does Rebalancing Class-unbalanced Data Improve AUC for Linear Discriminant Analysis? IEEE Transactions on Pattern Analysis and Machine Intelligence, 37 (5) pp. 1109-1112. 10.1109/TPAMI.2014.2359660. Green open access

Abstract

Many established classifiers fail to identify the minority class when it is much smaller than the majority class. To tackle this problem, researchers often first rebalance the class sizes in the training dataset, by oversampling the minority class or undersampling the majority class, and then use the rebalanced data to train the classifiers. This leads to interesting empirical patterns. In particular, using the rebalanced training data can often improve the area under the receiver operating characteristic curve (AUC) for the original, unbalanced test data. The AUC is a widely used quantitative measure of classification performance, but the property that it increases with rebalancing has, as yet, no theoretical explanation. In this note, using Gaussian-based linear discriminant analysis (LDA) as the classifier, we demonstrate that, at least for LDA, there is an intrinsic, positive relationship between the rebalancing of class sizes and the improvement of AUC. We show that the largest improvement of AUC is achieved, asymptotically, when the two classes are fully rebalanced to be of equal sizes.
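
The setting described in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' experiment: the Gaussian parameters, class sizes, and the helper names `lda_direction`, `auc`, and `sample` are all assumptions chosen for illustration. It assumes the two classes have different covariance matrices, so that the size-weighted pooled covariance, and hence the LDA projection direction, changes when the majority class is undersampled. It trains on unbalanced and on undersampled data, then evaluates AUC on the same unbalanced test set.

```python
import numpy as np

def lda_direction(X0, X1):
    """LDA projection vector using a pooled covariance weighted by class sizes."""
    n0, n1 = len(X0), len(X1)
    S0 = np.cov(X0, rowvar=False)
    S1 = np.cov(X1, rowvar=False)
    pooled = ((n0 - 1) * S0 + (n1 - 1) * S1) / (n0 + n1 - 2)
    return np.linalg.solve(pooled, X1.mean(axis=0) - X0.mean(axis=0))

def auc(scores_neg, scores_pos):
    """AUC as the Mann-Whitney estimate of P(positive score > negative score)."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

def sample(rng, n0, n1):
    """Draw both classes; all parameters here are illustrative assumptions."""
    cov0 = np.array([[1.0, 0.0], [0.0, 1.0]])
    cov1 = np.array([[4.0, 1.5], [1.5, 1.0]])  # deliberately differs from cov0
    X0 = rng.multivariate_normal([0.0, 0.0], cov0, size=n0)  # majority class
    X1 = rng.multivariate_normal([1.0, 1.0], cov1, size=n1)  # minority class
    return X0, X1

rng = np.random.default_rng(0)
X0_train, X1_train = sample(rng, 5000, 100)
X0_test, X1_test = sample(rng, 5000, 100)  # test data stay unbalanced

# Train on the original, unbalanced data.
w_unbalanced = lda_direction(X0_train, X1_train)

# Rebalance by undersampling the majority class to the minority class size.
keep = rng.choice(len(X0_train), size=len(X1_train), replace=False)
w_balanced = lda_direction(X0_train[keep], X1_train)

auc_unbalanced = auc(X0_test @ w_unbalanced, X1_test @ w_unbalanced)
auc_balanced = auc(X0_test @ w_balanced, X1_test @ w_balanced)
print(f"AUC, unbalanced training: {auc_unbalanced:.4f}")
print(f"AUC, balanced training:   {auc_balanced:.4f}")
```

Only the projection direction is compared here because the AUC depends on the ranking of scores, which an intercept or class prior does not alter. The paper's result is asymptotic, so a single finite-sample run like this one need not show an improvement; averaging over many seeds would give a fairer picture.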

Type: Article
Title: Why Does Rebalancing Class-unbalanced Data Improve AUC for Linear Discriminant Analysis?
Open access status: An open access version is available from UCL Discovery
DOI: 10.1109/TPAMI.2014.2359660
Publisher version: http://dx.doi.org/10.1109/TPAMI.2014.2359660
Language: English
Additional information: Copyright © 2014 The Author(s). This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/
Keywords: Training, Training data, Covariance matrices, Vectors, Educational institutions, Data mining, Linear discriminant analysis
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Maths and Physical Sciences
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Maths and Physical Sciences > Dept of Statistical Science
URI: https://discovery-pp.ucl.ac.uk/id/eprint/1448839