Data imbalance in classification: Experimental evaluation

Date
2020-03
Authors
Thabtah, Fadi
Hammoud, Suhel
Kamalov, Firuz
Gonsalves, Amanda
Publisher
Elsevier Inc.
Abstract
The advent of Big Data has ushered in a new era of scientific breakthroughs. One of the common issues that affects raw data is the class imbalance problem, which refers to an imbalanced distribution of values of the response variable. This issue is present in fraud detection, network intrusion detection, medical diagnostics, and a number of other fields where negatively labeled instances significantly outnumber positively labeled instances. Modern machine learning techniques struggle with imbalanced data because they focus on minimizing the error rate for the majority class while ignoring the minority class. The goal of our paper is to demonstrate the effects of class imbalance on classification models. Concretely, we study the impact of varying class imbalance ratios on classifier accuracy. By highlighting the precise nature of the relationship between the degree of class imbalance and the corresponding effects on classifier performance, we hope to help researchers better tackle the problem. To this end, we carry out extensive experiments using 10-fold cross-validation on a large number of datasets. In particular, we determine that the relationship between the class imbalance ratio and the accuracy is convex. © 2019 Elsevier Inc.
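The kind of experiment the abstract describes (varying the class imbalance ratio and measuring accuracy under 10-fold cross-validation) can be illustrated with a minimal sketch. This is not the authors' code, datasets, or classifiers: the synthetic one-dimensional Gaussian data, the error-minimizing threshold classifier, the plain (unstratified) fold split, and every function name below are assumptions chosen only to keep the example self-contained.

```python
import random


def make_dataset(n, minority_fraction, rng):
    """Hypothetical synthetic data: minority class 1 ~ N(1, 1), majority class 0 ~ N(-1, 1)."""
    n_pos = max(1, round(n * minority_fraction))
    data = [(rng.gauss(1.0, 1.0), 1) for _ in range(n_pos)]
    data += [(rng.gauss(-1.0, 1.0), 0) for _ in range(n - n_pos)]
    rng.shuffle(data)
    return data


def fit_stump(train):
    """Return the threshold t that minimizes training errors for 'predict 1 if x > t'."""
    pts = sorted(train)
    # With t = -inf everything is predicted 1, so the errors are the class-0 points.
    errors = sum(1 for _, y in pts if y == 0)
    best_t, best_err = float("-inf"), errors
    for x, y in pts:
        # Raising t to x flips this point's prediction from 1 to 0.
        errors += 1 if y == 1 else -1
        if errors < best_err:
            best_t, best_err = x, errors
    return best_t


def cross_validate(data, k=10):
    """Plain k-fold cross-validation; returns (accuracy, minority-class recall)."""
    correct = 0
    true_pos = 0
    n_pos = sum(y for _, y in data)
    for fold in range(k):
        test = data[fold::k]
        train = [p for i, p in enumerate(data) if i % k != fold]
        t = fit_stump(train)
        for x, y in test:
            pred = 1 if x > t else 0
            correct += pred == y
            true_pos += pred == 1 and y == 1
    return correct / len(data), true_pos / n_pos


def run(fractions=(0.5, 0.3, 0.1, 0.05, 0.01), n=2000, seed=0):
    """Evaluate the classifier at several minority fractions (assumed values)."""
    rng = random.Random(seed)
    results = []
    for f in fractions:
        acc, recall = cross_validate(make_dataset(n, f, rng))
        results.append((f, acc, recall))
    return results


if __name__ == "__main__":
    for f, acc, recall in run():
        print(f"minority fraction {f:.2f}: accuracy {acc:.3f}, minority recall {recall:.3f}")
```

Reporting minority-class recall next to accuracy is a choice made for this sketch: a classifier that minimizes overall error can score high accuracy on skewed data while recovering few minority instances, which is the behavior the abstract attributes to error-rate minimization.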
Description
This article is not available in the CUD collection. The version of record of this article is published in Information Sciences (2020), available online at: https://doi.org/10.1016/j.ins.2019.11.004
Keywords
Class imbalance, Classification, Data analysis, Machine learning, Statistical analysis, Supervised learning, Data reduction, Diagnosis, Intrusion detection, Large dataset, Learning systems, Statistical methods, 10-fold cross-validation, Class imbalance problems, Classification models, Classifier performance, Experimental evaluation, Network intrusion detection, Scientific breakthrough, Classification (of information)
Citation
Thabtah, F., Hammoud, S., Kamalov, F., & Gonsalves, A. (2020). Data imbalance in classification: Experimental evaluation. Information Sciences, 513, 429–441. https://doi.org/10.1016/j.ins.2019.11.004