Stratified over-sampling bagging method for random forests on imbalanced data

He Zhao*, Xiaojun Chen, Tung Nguyen, Joshua Zhexue Huang, Graham Williams, Hui Chen

*Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

    10 Citations (Scopus)

    Abstract

    Imbalanced data presents a major challenge to random forests (RF). Over-sampling is a commonly used sampling method for imbalanced data; it increases the number of minority-class instances to balance the class distribution. However, if only minority-class instances are over-sampled, the resulting sample data sets are often highly correlated, which reduces the generalizability of RF. To solve this problem, we propose a stratified over-sampling bagging (SOB) method that generates training data sets for RF that are both balanced and diverse. We first cluster the training data set multiple times to produce multiple clustering results. The small individual clusters are then grouped according to their entropies. Next, we draw a set of training data sets from the groups of clusters using stratified sampling. Finally, these training data sets are used to train RF. The data sets sampled with SOB are guaranteed to be balanced and diverse, which improves the performance of RF on imbalanced data. We have conducted a series of experiments, and the results show that the proposed method is more effective than some existing sampling methods.
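
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the clustering algorithm, how clusters are grouped by entropy, or the sample sizes. The sketch therefore assumes that the cluster assignments are already computed by some clustering method, that groups are formed by equal-width binning of per-cluster class entropy, and that each class is drawn round-robin across groups with replacement. All function and parameter names (`label_entropy`, `group_clusters_by_entropy`, `stratified_balanced_sample`, `sob_style_bags`, `per_class`, `n_groups`) are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


def label_entropy(labels):
    """Shannon entropy (in bits) of the class labels inside one cluster."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def group_clusters_by_entropy(cluster_ids, y, n_groups=2):
    """Merge clusters into strata by entropy (assumed equal-width binning)."""
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)
    max_h = math.log2(len(set(y))) or 1.0  # largest possible entropy
    strata = defaultdict(list)
    for idxs in members.values():
        h = label_entropy([y[i] for i in idxs])
        group = min(int(h / max_h * n_groups), n_groups - 1)
        strata[group].extend(idxs)
    return dict(strata)


def stratified_balanced_sample(strata, y, per_class, rng):
    """Draw `per_class` indices of every class, spread across the strata."""
    sample = []
    for cls in sorted(set(y)):
        # One pool of candidate indices per stratum that contains this class.
        pools = [[i for i in idxs if y[i] == cls] for idxs in strata.values()]
        pools = [p for p in pools if p]
        for k in range(per_class):
            pool = pools[k % len(pools)]     # round-robin over strata
            sample.append(rng.choice(pool))  # with replacement: over-sampling
    return sample


def sob_style_bags(clusterings, y, per_class, n_groups=2, seed=0):
    """Build one balanced bag of instance indices per clustering result."""
    rng = random.Random(seed)
    bags = []
    for cluster_ids in clusterings:
        strata = group_clusters_by_entropy(cluster_ids, y, n_groups)
        bags.append(stratified_balanced_sample(strata, y, per_class, rng))
    return bags


# Toy data: 8 majority instances (class 0) and 2 minority instances (class 1).
y = [0] * 8 + [1] * 2
# Two made-up clustering results over the same 10 instances.
clusterings = [
    [0, 0, 0, 0, 1, 1, 1, 2, 2, 2],
    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
]
bags = sob_style_bags(clusterings, y, per_class=4)
```

Each entry of `bags` is a list of instance indices with four draws per class, so the minority class is over-sampled to parity. In a full pipeline each bag would presumably be used to train one tree (or a subset of trees) of the forest; that step is omitted here because the abstract gives no detail on it.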

    Original language: English
    Title of host publication: Intelligence and Security Informatics - 11th Pacific Asia Workshop, PAISI 2016, Proceedings
    Editors: Michael Chau, G. Alan Wang, Hsinchun Chen
    Publisher: Springer Verlag
    Pages: 63-72
    Number of pages: 10
    ISBN (Print): 9783319318622
    Publication status: Published - 2016
    Event: 11th Pacific Asia Workshop on Intelligence and Security Informatics, PAISI 2016 - Auckland, New Zealand
    Duration: 19 Apr 2016 → 19 Apr 2016

    Publication series

    Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    Volume: 9650
    ISSN (Print): 0302-9743
    ISSN (Electronic): 1611-3349

    Conference

    Conference: 11th Pacific Asia Workshop on Intelligence and Security Informatics, PAISI 2016
    Country/Territory: New Zealand
    City: Auckland
    Period: 19/04/16 → 19/04/16
