How do I use this library with Apache Spark?

Zhang_Z_Intel · ‎02-24-2015

Ilya_B_Intel · ‎03-26-2015

Intel® DAAL provides algorithm implementations for distributed computing.

All data management objects support Java serialization and can be elements of Apache Spark RDD collections.

Each algorithm is divided into two or more steps that correspond to general map/reduce computation tasks (or, in more complex cases, to a sequence of map/reduce tasks).

The simplest scheme works as follows:

Run map over the portions of data stored in the RDD:
- for each element of the RDD, run the first step of the chosen algorithm and obtain a PartialResult object
- return the PartialResult objects as a new RDD
Call collect() on the RDD of PartialResult objects:
- run the second step of the chosen algorithm on the collected partial results and obtain a Result object, which contains the final algorithm results
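
To make the two-step shape concrete, here is a minimal library-free sketch in plain Java. It does not use the DAAL or Spark APIs: a Java stream's map/collect stands in for the RDD map and collect() calls, and the TwoStepSketch, PartialResult, Result, step1 and step2 names are hypothetical stand-ins chosen for illustration. The "algorithm" is just mean and variance, picked because it splits cleanly into a per-block step and a final merge step.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class TwoStepSketch {

    // Hypothetical stand-in for a step-1 output. It is Serializable so that,
    // in a real cluster job, it could be shipped between nodes.
    static final class PartialResult implements Serializable {
        private static final long serialVersionUID = 1L;
        final long count;
        final double sum;
        final double sumOfSquares;

        PartialResult(long count, double sum, double sumOfSquares) {
            this.count = count;
            this.sum = sum;
            this.sumOfSquares = sumOfSquares;
        }
    }

    // Hypothetical stand-in for the final output of step 2.
    static final class Result {
        final double mean;
        final double variance; // population variance

        Result(double mean, double variance) {
            this.mean = mean;
            this.variance = variance;
        }
    }

    // Step 1 ("map" side): runs independently on each block of data.
    static PartialResult step1(double[] block) {
        double sum = 0.0;
        double sumOfSquares = 0.0;
        for (double x : block) {
            sum += x;
            sumOfSquares += x * x;
        }
        return new PartialResult(block.length, sum, sumOfSquares);
    }

    // Step 2 ("reduce" side): runs once over all gathered partial results.
    static Result step2(List<PartialResult> partials) {
        long n = 0;
        double sum = 0.0;
        double sumOfSquares = 0.0;
        for (PartialResult p : partials) {
            n += p.count;
            sum += p.sum;
            sumOfSquares += p.sumOfSquares;
        }
        if (n == 0) {
            throw new IllegalArgumentException("no data");
        }
        double mean = sum / n;
        return new Result(mean, sumOfSquares / n - mean * mean);
    }

    // The stream map/collect pair plays the role of the RDD map and collect().
    static Result run(List<double[]> blocks) {
        List<PartialResult> partials = blocks.stream()
                .map(TwoStepSketch::step1)
                .collect(Collectors.toList());
        return step2(partials);
    }

    public static void main(String[] args) {
        List<double[]> blocks = Arrays.asList(
                new double[] {1, 2}, new double[] {3}, new double[] {4, 5, 6});
        Result r = run(blocks);
        System.out.println("mean=" + r.mean + " variance=" + r.variance);
    }
}
```

In an actual Spark job the list of blocks would be an RDD and step1 would run on the workers, while step2 would run on the driver after collect(). The real class and method names for each algorithm are the ones given in the programming guide, not the ones used here.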

More details for each algorithm can be found in the programming guide.

The package ships with four Spark samples that implement this scheme for PCA (correlation and SVD methods) and for the QR and SVD decompositions. We will extend the number of samples in future updates. Let us know if you are interested in specific algorithms.