"Most classification algorithms will only perform optimally when the number of samples of each class is roughly the same. Highly skewed datasets, where the minority is heavily outnumbered by one or more classes, have proven to be a challenge while at the same time becoming more and more common."
Does DAAL have algorithms to support the classification of imbalanced data sets? I didn't find anything in the user guide. Thanks!
There are several techniques for dealing with imbalanced data sets, including:
- Over-sampling: the instances of the minority class are replicated to balance class distribution
- Under-sampling: the instances of the majority class are removed to balance class distribution
- Bagging (Bootstrap Aggregation): N training sets are generated by sampling with replacement, and the classifier is trained on each of them; the results of the N classifiers are aggregated to get the final prediction
- Boosting: a sequence of weak learners is trained, where each consecutive weak learner pays more attention to the samples that were misclassified by the previously trained weak learners.
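To make the first two options concrete, here is a minimal sketch of random over- and under-sampling using NumPy. This is not a DAAL API — DAAL does not provide these resampling routines — just an illustration of the idea; the function names and the choice of balancing exactly to the other class's size are my own assumptions:

```python
import numpy as np

def random_oversample(X, y, minority_label, rng=None):
    """Replicate minority-class rows (with replacement) until classes balance.

    Illustrative helper, not part of DAAL.
    """
    rng = np.random.default_rng(rng)
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    # Draw extra minority rows with replacement to match the majority count
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]

def random_undersample(X, y, minority_label, rng=None):
    """Drop majority-class rows at random until classes balance.

    Illustrative helper, not part of DAAL.
    """
    rng = np.random.default_rng(rng)
    minority = np.flatnonzero(y == minority_label)
    majority = np.flatnonzero(y != minority_label)
    # Keep only as many majority rows as there are minority rows
    keep = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, keep])
    rng.shuffle(idx)
    return X[idx], y[idx]
```

Note that under-sampling discards information, so it is usually preferred only when the majority class is very large.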
The choice of method depends on the characteristics of the imbalanced data set.
Intel DAAL provides several boosting algorithms, including AdaBoost and BrownBoost for binary classification and LogitBoost for multiclass classification. Please have a look at those algorithms from the perspective of improving classification accuracy on your imbalanced data sets. Let us know if those approaches do not work in your application, or if you have in mind other techniques that would be useful but are still missing from the library.
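For intuition about why boosting can help with imbalance, here is a from-scratch sketch of the AdaBoost reweighting idea using decision stumps. This is purely illustrative and is not the DAAL interface (for that, see the AdaBoost section of the DAAL documentation); the exhaustive stump search and all function names are my own:

```python
import numpy as np

def train_stump(X, y, w):
    """Exhaustively pick the best single-feature threshold stump
    under sample weights w; labels y must be in {-1, +1}."""
    best = (None, None, None, np.inf)  # (feature, threshold, polarity, weighted error)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(X[:, j] <= t, pol, -pol)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, t, pol, err)
    return best

def adaboost(X, y, rounds=10):
    """Train a small AdaBoost ensemble of stumps."""
    n = len(y)
    w = np.full(n, 1.0 / n)              # start from uniform sample weights
    ensemble = []
    for _ in range(rounds):
        j, t, pol, err = train_stump(X, y, w)
        err = max(err, 1e-10)            # guard against log(0)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(X[:, j] <= t, pol, -pol)
        w *= np.exp(-alpha * y * pred)   # up-weight misclassified samples
        w /= w.sum()
        ensemble.append((alpha, j, t, pol))
    return ensemble

def predict(ensemble, X):
    """Weighted vote of all stumps."""
    score = sum(a * np.where(X[:, j] <= t, pol, -pol)
                for a, j, t, pol in ensemble)
    return np.sign(score)
```

The key line for imbalanced data is the weight update: minority-class samples that keep getting misclassified accumulate weight, so later weak learners are forced to focus on them.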