I'm currently analysing features using the Correlation Distance Matrix before feeding the more useful ones into the SVM classifier. I noticed that even if the numeric table's data dictionary describes a feature as categorical, a reasonable-looking correlation still comes out.
Now I'm assuming the correlation is Pearson(?), and so shouldn't be used for categorical (or ordinal) data and I should probably be implementing chi-squared or similar. Is this correct? Or is DAAL doing something clever?
I'm also assuming the SVM classifier is ok with categorical data and it internally expands from n categories to n-1 features? Unfortunately the samples are light on examples with categorical data, and my use-case is rather dependent on them :)
You are correct: the present version of the library only supports computing the Pearson correlation matrix. Extending the supported types of correlation matrices (such as Kendall rank) and adding statistical tests (such as the chi-squared test for independence), which might be helpful when analysing ordinal/categorical data, is in our plans.
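To illustrate why Pearson correlation on categorical codes can mislead, here is a minimal sketch in plain Python. It is not DAAL code: the helpers `pearson` and `chi2_statistic` and the toy data are made up for this example. It shows that the Pearson coefficient depends on the arbitrary integer labels given to the categories, while the chi-squared statistic for independence does not.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def chi2_statistic(xs, ys):
    """Pearson chi-squared statistic for independence of two categorical columns."""
    n = len(xs)
    joint = Counter(zip(xs, ys))  # observed count per (x, y) cell
    row = Counter(xs)             # marginal counts of xs
    col = Counter(ys)             # marginal counts of ys
    stat = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            stat += (joint[(a, b)] - expected) ** 2 / expected
    return stat


# Hypothetical data: a 3-level categorical feature stored as integer codes.
feature = [0, 1, 2, 0, 1, 2]
target = [1, 2, 3, 1, 2, 3]

# Same categories, different (equally arbitrary) integer labels.
relabel = {0: 2, 1: 0, 2: 1}
feature_relabelled = [relabel[v] for v in feature]

r_original = pearson(feature, target)
r_relabelled = pearson(feature_relabelled, target)
chi2_original = chi2_statistic(feature, target)
chi2_relabelled = chi2_statistic(feature_relabelled, target)

print(r_original, r_relabelled)        # Pearson changes with the labelling
print(chi2_original, chi2_relabelled)  # chi-squared stays the same
```

The statistic alone is not a full test; turning it into a p-value needs the chi-squared distribution with (rows - 1) * (columns - 1) degrees of freedom, which this sketch leaves out.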
The Intel DAAL version of the SVM classifier relies on the algorithm described in the paper by Rong-En Fan, Pai-Hsuen Chen, and Chih-Jen Lin, "Working Set Selection Using Second Order Information for Training Support Vector Machines", which is also used in the libSVM library. The method is known to be more stable when an m-valued categorical feature is represented as a binary vector of size m-1. However, the present version of the library does not apply this type of conversion internally and assumes that all required conversions are done on the user's side by properly filling the numeric table and the respective dictionary. I'm not sure the library documentation states this clearly, so this information should be added to the docs. More generally, extended support for categorical variables like the one discussed above is planned for both the data management and the algorithmic components of the library.
Please let me know if this answers your question. Also, please tell us if you need extra functionality in the library that would simplify your analysis of datasets that contain categorical variables.
Many thanks for your quick reply. I'll work on expanding the m categories to m-1 features - I assume the values would be +/-1.0?
It would be nice if the categorical features could be expanded "on the fly" just to save memory when there are a lot of them, but it's not a deal breaker. Chi2 etc would be very handy to have in the tool kit.
As a matter of interest, is the categorical/ordinal description used in any algorithm?
Hi Harvey, thanks for the feedback.
I would encode the categorical variable using a vector of 0/1 values, but in general it may depend on the specific application.
To answer your question: yes, some algorithms such as k-means or stumps use the description, and its use will be expanded in future versions of the library.
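As a rough sketch of that encoding, assuming plain Python lists rather than a DAAL numeric table: the helper `dummy_encode` below is hypothetical (it is not part of the library) and expands one categorical column with m levels into m-1 columns of 0.0/1.0, treating the first level in sorted order as the reference that maps to all zeros.

```python
def dummy_encode(values, categories=None):
    """Encode a categorical column with m levels as m-1 columns of 0.0/1.0.

    The first category is the reference level and maps to all zeros.
    """
    if categories is None:
        categories = sorted(set(values))
    kept = categories[1:]  # drop the reference level
    return [[1.0 if v == c else 0.0 for c in kept] for v in values]


# Hypothetical 3-level feature; sorted levels are blue, green, red,
# so "blue" becomes the reference level.
colours = ["red", "green", "blue", "green"]
encoded = dummy_encode(colours)

for value, row in zip(colours, encoded):
    print(value, row)
```

The resulting rows would still have to be copied into the numeric table by the caller. Passing an explicit `categories` list keeps the column order stable between training and prediction data.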