Why Does Rebalancing Class-unbalanced Data Improve AUC for Linear Discriminant Analysis?

Xue, J; Hall, P (2015) Why Does Rebalancing Class-unbalanced Data Improve AUC for Linear Discriminant Analysis? IEEE Transactions on Pattern Analysis and Machine Intelligence, 37 (5) pp. 1109-1112. DOI: 10.1109/TPAMI.2014.2359660. Green open access

Abstract

Many established classifiers fail to identify the minority class when it is much smaller than the majority class. To tackle this problem, researchers often first rebalance the class sizes in the training dataset, through oversampling the minority class or undersampling the majority class, and then use the rebalanced data to train the classifiers. This leads to interesting empirical patterns. In particular, using the rebalanced training data can often improve the area under the receiver operating characteristic curve (AUC) for the original, unbalanced test data. The AUC is a widely-used quantitative measure of classification performance, but the property that it increases with rebalancing has, as yet, no theoretical explanation. In this note, using Gaussian-based linear discriminant analysis (LDA) as the classifier, we demonstrate that, at least for LDA, there is an intrinsic, positive relationship between the rebalancing of class sizes and the improvement of AUC. We show that the largest improvement of AUC is achieved, asymptotically, when the two classes are fully rebalanced to be of equal sizes.
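The asymptotic claim in the abstract can be illustrated with a small numerical sketch. It is not taken from the paper; it rests on the following assumptions: two Gaussian classes in two dimensions with different covariance matrices, an LDA direction that converges to the inverse of the class-proportion-weighted pooled covariance applied to the mean difference, and the standard binormal formula for the AUC of a linear score. The function name `asymptotic_lda_auc` and all parameter values are hypothetical and chosen only for illustration.

```python
import math


def inv2(m):
    """Invert a 2x2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))


def matvec(m, v):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return (m[0][0] * v[0] + m[0][1] * v[1],
            m[1][0] * v[0] + m[1][1] * v[1])


def dot(u, v):
    """Inner product of two length-2 vectors."""
    return u[0] * v[0] + u[1] * v[1]


def asymptotic_lda_auc(pi_minority, mu0, mu1, cov0, cov1):
    """Large-sample AUC of LDA trained with minority proportion pi_minority.

    Assumption (not from the source): the pooled covariance converges to
    (1 - pi) * cov0 + pi * cov1, so the training class proportions change
    the discriminant direction whenever cov0 != cov1.
    """
    pi0 = 1.0 - pi_minority
    pooled = tuple(
        tuple(pi0 * cov0[i][j] + pi_minority * cov1[i][j] for j in range(2))
        for i in range(2)
    )
    delta = (mu1[0] - mu0[0], mu1[1] - mu0[1])
    w = matvec(inv2(pooled), delta)  # LDA direction
    # The score w.x is Gaussian in each class, so
    # AUC = Phi( w.delta / sqrt(w'cov0 w + w'cov1 w) ).
    signal = dot(w, delta)
    noise = dot(w, matvec(cov0, w)) + dot(w, matvec(cov1, w))
    z = signal / math.sqrt(noise)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical class parameters: class 0 is the majority, class 1 the minority.
MU0, MU1 = (0.0, 0.0), (1.0, 1.0)
COV0 = ((1.0, 0.0), (0.0, 1.0))
COV1 = ((4.0, 0.0), (0.0, 0.25))

for pi in (0.05, 0.1, 0.25, 0.5, 0.75, 0.9):
    auc = asymptotic_lda_auc(pi, MU0, MU1, COV0, COV1)
    print(f"minority proportion in training = {pi:.2f}  AUC = {auc:.4f}")
```

Under these assumptions the AUC does not depend on the class proportions of the test data, only on the direction learned from the training data, and the printed values peak at a training proportion of 0.5. This matches the abstract's statement that full rebalancing gives the largest asymptotic improvement. When the two class covariances are equal, the pooled covariance no longer depends on the proportions and the sketch returns the same AUC for every value of `pi_minority`.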

Type:	Article
Title:	Why Does Rebalancing Class-unbalanced Data Improve AUC for Linear Discriminant Analysis?
Open access status:	An open access version is available from UCL Discovery
DOI:	10.1109/TPAMI.2014.2359660
Publisher version:	http://dx.doi.org/10.1109/TPAMI.2014.2359660
Language:	English
Additional information:	Copyright © 2014 The Author(s). This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/
Keywords:	Training, Training data, Covariance matrices, Vectors, Educational institutions, Data mining, Linear discriminant analysis
UCL classification:	UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Maths and Physical Sciences > Dept of Statistical Science
URI:	https://discovery-pp.ucl.ac.uk/id/eprint/1448839
