I have been looking at the examples for classifications and tried out the naive bayes. It is not evident from the documentation how to specify that individual features are DAAL_CONTINUOUS or DAAL_CATEGORICAL.
The example loads CSRNumericTable:
NumericTableDictionary * aDict = testData->getDictionary();
is correct (20)
but all features are set to "DAAL_CONTINUOUS". Is this dictionary even considered? I tried changing this to DAAL_CATEGORICAL, but Compute() returned exactly the same as before. It appeared to ignore dictionary settings completely.
Is "dictionary" referenced only for certain table types? Which ones? AOS and SOA only? For homogeneous tables, are the "Features" implicitly always "Continuous"?
In the upcoming release of Intel DAAL 2018 we are considering support for a one-hot encoder for categorical features in the Data Source component. Using a one-hot encoder, you can transform a categorical feature with N categories into a set of N binary features at the stage of reading input data from the data source. Most of the algorithms will process this set of binary features properly.
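To illustrate the transformation described above, here is a minimal, library-independent sketch in plain C++. It does not use the DAAL Data Source API; the function name `oneHotEncode` and the assumption that categories are already coded as integers 0..N-1 are mine, for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of DAAL): expand one categorical column
// into nCategories binary columns.
// Assumption: every category code lies in [0, nCategories).
// Returns one row per observation; row[c] is 1.0f only for the observed category c.
std::vector<std::vector<float>> oneHotEncode(const std::vector<int>& categories,
                                             int nCategories)
{
    std::vector<std::vector<float>> encoded;
    encoded.reserve(categories.size());
    for (int category : categories)
    {
        assert(category >= 0 && category < nCategories);
        std::vector<float> row(static_cast<std::size_t>(nCategories), 0.0f);
        row[static_cast<std::size_t>(category)] = 1.0f;
        encoded.push_back(row);
    }
    return encoded;
}
```

For example, a column holding the codes 2, 0, 1 with N = 3 becomes three rows of three binary values each, which could then be stored in an ordinary homogeneous table.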
In the present version of Intel DAAL several algorithms provide support for categorical features including Decision Tree, Decision Forest, Boosting algorithms and K-means. Presence of categorical features in the input data will not affect the result in other algorithms. That’s why you do not see the changes in the results when you use categorical features with Naïve Bayes algorithm. We continue analysis to extend support of categorical features for other types of algorithms in DAAL.
The code sequence that you provide to set the type of numeric table’s features to DAAL_CATEGORICAL is correct. You can use this code with any type of the numeric tables derived from the NumericTable class, including homogeneous, AOS, SOA and other tables.
The type of each feature in the numeric table is stored in the numeric table dictionary associated with the table. If all the features in the numeric table are the same, the dictionary can contain only one feature that describes all the features in the numeric table. You can use Dictionary::getFeaturesEqual() method to check if the features in the numeric table are the same.
By default, all features in the numeric table are DAAL_CONTINUOUS. You can change the type of the features in the dictionary associated with the numeric table as shown in the example: datastructures_soa.cpp
In the present version of Intel DAAL several algorithms provide support for categorical features including Decision Tree, Decision
Are these two only present in the latest beta version of DAAL 2018? There is no mention of them in the DAAL 2017 docs.
Presence of categorical features in the input data will not affect the result in other algorithms.
They will be ignored or treated as "continuous"? Which algorithms consider "ordinal" feature type in their computation?
Additionally, when using sparse matrices, are values which are not specified considered to be "missing values", or are they treated as "implicit zero"?
Does the library support "missing values" in the columns of features?
I would also like to report that:
a.) serialization of KNN model results in memory corruption.
b.) Multi-class classifier works only with SVM and does not accept other boosting binary classifiers. Is this by design or a bug?
Decision Tree and Decision Forest algorithms were introduced in DAAL 2018 Beta, please take a look at the release notes for more details.
About categorical features: If the algorithm is not aware of ‘categorical’ and ‘ordinal’ features, it will treat all the features as ‘continuous’. In the present version of DAAL only the Decision Tree algorithm supports 'ordinal' features.
As for the sparse numeric tables, the work with them is algorithm-specific. Most of the algorithms that work with the sparse data treat the values that are not specified as implicit zeros. However, there are exceptions: Implicit ALS treats those values as missing.
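As a rough illustration of the "implicit zero" interpretation, the sketch below expands a CSR matrix into a dense array, filling every unspecified position with 0. The `CsrMatrix` struct is a hypothetical minimal container with zero-based indices, not DAAL's CSRNumericTable, and the function name is an assumption as well.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal CSR container (zero-based indices); not a DAAL type.
struct CsrMatrix
{
    std::size_t nRows;
    std::size_t nCols;
    std::vector<float> values;            // explicitly stored entries
    std::vector<std::size_t> colIndices;  // column of each stored entry
    std::vector<std::size_t> rowOffsets;  // nRows + 1 offsets into values
};

// Expand to a row-major dense array. Positions that are not stored in the
// CSR structure keep the initial value 0.0f, i.e. they are implicit zeros.
std::vector<float> toDenseWithImplicitZeros(const CsrMatrix& m)
{
    assert(m.rowOffsets.size() == m.nRows + 1);
    std::vector<float> dense(m.nRows * m.nCols, 0.0f);
    for (std::size_t r = 0; r < m.nRows; ++r)
    {
        for (std::size_t k = m.rowOffsets[r]; k < m.rowOffsets[r + 1]; ++k)
        {
            dense[r * m.nCols + m.colIndices[k]] = m.values[k];
        }
    }
    return dense;
}
```

An algorithm that treats unspecified values as missing would instead have to skip those positions rather than read them as 0, which is why the two interpretations can give different results on the same table.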
>> Does the library support "missing values" in the columns of features?
This is also algorithm specific. Implicit ALS supports missing values when the input numeric table is sparse. But in case you work with the dense data, you cannot provide the information about the missing values. Can you please provide more details about the use case? Which algorithms are you interested in to work with missing values? What is the percentage of the missing values in your data?
Regarding the bug reports: our analysis showed that the issue with KNN model serialization does not reproduce with DAAL 2018 Beta Update 1. If you are using an older version of DAAL, can you please check on your side that the issue is fixed in DAAL 2018 Beta Update 1?
The issue with multi-class classifier was reproduced on our side. We will work to fix it in the future releases of the library.
Our analysis showed that the issue with KNN model serialization does not reproduce with DAAL 2018 Beta Update 1
Were you aware of this bug before, and do you have it marked as fixed? It does not fail immediately. Here is a short test case.
1.) Append these lines at end of trainModel() inside of the kdtree_knn_dense_batch.cpp
bModel = trainingResult->get(classifier::training::model);
2.) Modify the main() as follows:
for (int i = 0; i < 10; i++)
3.) Run and the exception happens at:
> kdtree_knn_dense_batch.exe!daal::data_management::interface1::DataArchive::write(unsigned char * ptr=0x00007ff6b79e19d8, unsigned __int64 size=350681133952) Line 290 C++
or some other location. I will have to wait with installation of DAAL 2018 Beta.
But in case you work with the dense data, you cannot provide the information about the missing values. Can you please provide more details about the use case? Which algorithms are you interested in to work with missing values? What is the percentage of the missing values in your data?
There are various strategies for specifying missing values. For categorical values, the table could contain 999. For real values, missing value could be a NAN. For ordinal integers, it could be LONG_MAX.
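As a sketch of the NaN convention for real-valued features mentioned above, the following plain C++ helpers count NaN-encoded missing entries. This is application-side preprocessing only; the function names are hypothetical and nothing here is a DAAL API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: count entries of one feature column that are
// marked as missing by the NaN convention.
std::size_t countMissing(const std::vector<double>& column)
{
    std::size_t missing = 0;
    for (double value : column)
    {
        if (std::isnan(value)) ++missing;
    }
    return missing;
}

// Hypothetical helper: fraction of rows (row-major data, nCols columns)
// that contain at least one NaN-encoded missing value.
double fractionOfRowsWithMissing(const std::vector<double>& data, std::size_t nCols)
{
    assert(nCols > 0 && data.size() % nCols == 0);
    const std::size_t nRows = data.size() / nCols;
    if (nRows == 0) return 0.0;
    std::size_t rowsWithMissing = 0;
    for (std::size_t r = 0; r < nRows; ++r)
    {
        for (std::size_t c = 0; c < nCols; ++c)
        {
            if (std::isnan(data[r * nCols + c]))
            {
                ++rowsWithMissing;
                break;  // one missing value is enough to count this row
            }
        }
    }
    return static_cast<double>(rowsWithMissing) / static_cast<double>(nRows);
}
```

Statistics like these would answer the question about the percentage of missing values. Sentinel codes such as 999 or LONG_MAX could be detected the same way with an equality test instead of `std::isnan`.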
Typical example application:
Real estate price estimation. Input:
Price based on data from the last 1 month, 3 months, 6 months, or any selected range.
Depending on the application, you can have a lot of missing data (20%) per vector and 90% of vectors with one or more values missing.
Even if there are no missing values, which algorithms in DAAL could be used for this purpose?
Intel DAAL 2018 Beta had a bug in serialization of kNN model that was fixed in Intel DAAL 2018 Beta Update 1.
Thank you for providing the details about use cases related to missing values. We will analyze them from perspective of options to support handling of missing values in the library.
The issue you reported, "Multi-class classifier works only with SVM and does not accept other boosting binary classifiers", is fixed in Intel DAAL 2018 Update 1, which is available now. Please try it out and let us know if you have any questions. Thank you!