Do We Need Hundreds of Classifiers to Solve Real World Classification Problems?
We evaluate 179 classifiers arising from 17 families (discriminant analysis, Bayesian, neural networks, support vector machines, decision trees, rule-based classifiers, boosting, bagging, stacking, random forests and other ensembles, generalized linear models, nearest neighbors, partial least squares and principal component regression, logistic and multinomial regression, multiple adaptive regression splines, and other methods), implemented in Weka, R (with and without the caret package), C, and Matlab, including all the relevant classifiers available today. We use 121 data sets, which represent the whole UCI data base (excluding the large-scale problems) together with other real-world problems of our own, so that the conclusions about classifier behavior are significant and not dependent on the particular data set collection. The classifiers most likely to be the best are the random forest (RF) versions, the best of which (implemented in R and accessed via caret) achieves 94.1% of the maximum accuracy, exceeding 90% in 84.3% of the data sets. However, the difference is not statistically significant with respect to the second best, the SVM with Gaussian kernel implemented in C using LibSVM, which achieves 92.3% of the maximum accuracy. A few models are clearly better than the rest: random forest, SVM with Gaussian and polynomial kernels, extreme learning machine with Gaussian kernel, C5.0, and avNNet (a committee of multi-layer perceptrons implemented in R with the caret package). Random forest is clearly the best family of classifiers (three of the five best classifiers are RF), followed by SVM (four classifiers in the top 10), and then neural networks and boosting ensembles (five and three members in the top 20, respectively).
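The headline numbers above are percentages of the maximum accuracy: each classifier's accuracy on a data set is divided by the best accuracy achieved by any classifier on that data set. As a minimal illustration of this comparison (not the paper's actual code, which uses R, Weka, C, and Matlab), the sketch below evaluates a random forest and a Gaussian-kernel SVM with scikit-learn on a toy data set; the data set, hyperparameters, and variable names are illustrative assumptions.

```python
# Illustrative sketch: compare a random forest and an RBF-kernel SVM,
# reporting each model's accuracy relative to the best accuracy obtained
# on this data set (the "% of maximum accuracy" measure).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)  # toy stand-in for a UCI data set

models = {
    "random_forest": RandomForestClassifier(n_estimators=500, random_state=0),
    "svm_rbf": SVC(kernel="rbf", C=1.0, gamma="scale"),
}

# Mean cross-validated accuracy for each model.
acc = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}

# Express each accuracy as a percentage of the best accuracy achieved here.
best = max(acc.values())
pct_of_max = {name: 100.0 * a / best for name, a in acc.items()}

for name in models:
    print(f"{name}: accuracy={acc[name]:.3f}, %max={pct_of_max[name]:.1f}")
```

In the paper this ratio is averaged over all 121 data sets, which is how a single classifier can reach 94.1% of the maximum accuracy overall without being the top performer on every problem.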
Keywords: classification, UCI data base, random forest, support vector machine, neural networks, decision trees, ensembles, rule-based classifiers, discriminant analysis, Bayesian classifiers, generalized linear models, partial least squares and principal component regression, multiple adaptive regression splines, nearest neighbors, logistic and multinomial regression