TY - JOUR
T1 - Classifying very high-dimensional data with random forests built from small subspaces
AU - Xu, Baoxun
AU - Huang, Joshua Zhexue
AU - Williams, Graham
AU - Wang, Qiang
AU - Ye, Yunming
PY - 2012/4
Y1 - 2012/4
N2 - The selection of feature subspaces for growing decision trees is a key step in building random forest models. However, the common approach of randomly sampling a few features for the subspace is not suitable for high-dimensional data consisting of thousands of features, because such data often contains many features that are uninformative for classification, and random sampling often fails to include informative features in the selected subspaces. Consequently, the classification performance of the random forest model is significantly affected. In this paper, we propose an improved random forest method that uses a novel feature weighting method for subspace selection and therefore enhances classification performance on high-dimensional data. A series of experiments on nine real-life high-dimensional datasets demonstrated that, using a subspace size of [log2(M) + 1] features, where M is the total number of features in the dataset, our random forest model significantly outperforms existing random forest models.
AB - The selection of feature subspaces for growing decision trees is a key step in building random forest models. However, the common approach of randomly sampling a few features for the subspace is not suitable for high-dimensional data consisting of thousands of features, because such data often contains many features that are uninformative for classification, and random sampling often fails to include informative features in the selected subspaces. Consequently, the classification performance of the random forest model is significantly affected. In this paper, we propose an improved random forest method that uses a novel feature weighting method for subspace selection and therefore enhances classification performance on high-dimensional data. A series of experiments on nine real-life high-dimensional datasets demonstrated that, using a subspace size of [log2(M) + 1] features, where M is the total number of features in the dataset, our random forest model significantly outperforms existing random forest models.
KW - Classification
KW - Decision tree
KW - High-dimensional data
KW - Random forests
KW - Random subspace
UR - http://www.scopus.com/inward/record.url?scp=84866622935&partnerID=8YFLogxK
U2 - 10.4018/jdwm.2012040103
DO - 10.4018/jdwm.2012040103
M3 - Article
SN - 1548-3924
VL - 8
SP - 44
EP - 63
JO - International Journal of Data Warehousing and Mining
JF - International Journal of Data Warehousing and Mining
IS - 2
ER -