Intel® oneAPI Data Analytics Library

How do I use this library?


How do I use this DAAL library?


Intel® Data Analytics Acceleration Library (Intel DAAL) is a software library that covers the whole data analytics flow: acquiring data from a data source, analysis, model training, and prediction. The library provides C++ and Java* APIs to help optimize data analytics applications. It supports a set of data sources including MySQL*, CSV files, HDFS, and RDDs.

With this library you can do data analysis in different modes:

- Offline (batch), which assumes that the whole dataset is available for processing.

- Online (streaming), which supports data arriving in blocks. This mode can also be used if your data is stored on disk and is too big to fit into your computer's memory.

- Distributed, which assumes that the dataset is stored in blocks on the nodes of your cluster.

The typical steps for using Intel DAAL in batch mode are shown below, with Principal Component Analysis (PCA) applied to data stored in CSV format as the example:

/* Construct the object for accessing the data in csv format */
FileDataSource<CSVFeatureManager> dataSource(dataFileName, DataSource::doAllocateNumericTable, DataSource::doDictionaryFromContext);

/* Construct object to run PCA analysis in batch mode */
pca::Batch<> algorithm;

/* Retrieve the nVectors observations from CSV file */
dataSource.loadDataBlock(nVectors);

/* Provide dataset into PCA algorithm */
algorithm.input.set(pca::data, dataSource.getNumericTable());

/* Run PCA */
algorithm.compute();

/* Access the PCA result */
SharedPtr<pca::Result> result = algorithm.getResult();

/* Access eigen-values and eigen-vectors to choose the principal components */
SharedPtr<NumericTable> eigenvals = result->get(pca::eigenvalues);
SharedPtr<NumericTable> eigenvecs = result->get(pca::eigenvectors);
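
The post does not say how to choose the principal components once you have the eigenvalues. One common rule is to keep the smallest number of components whose eigenvalues explain a given share of the total variance. Here is a minimal sketch of that rule in plain C++. It does not use the DAAL API: chooseComponents is a hypothetical helper, and it assumes you have already copied the eigenvalues out of the numeric table into a std::vector, sorted in descending order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Intel DAAL).
// Returns the smallest k such that the first k eigenvalues account for at
// least `threshold` (a fraction in (0, 1]) of the total variance.
// Assumes the eigenvalues are non-negative and sorted in descending order.
std::size_t chooseComponents(const std::vector<double>& eigenvalues, double threshold) {
    assert(!eigenvalues.empty());
    assert(threshold > 0.0 && threshold <= 1.0);

    double total = 0.0;
    for (double v : eigenvalues) total += v;
    assert(total > 0.0);

    double running = 0.0;
    for (std::size_t k = 0; k < eigenvalues.size(); ++k) {
        running += eigenvalues[k];
        if (running / total >= threshold) return k + 1;
    }
    // Rounding can leave the ratio just under the threshold: keep everything.
    return eigenvalues.size();
}
```

For example, with eigenvalues 5, 3, 1.5 and 0.5 and a threshold of 0.75, the first two components cover 80% of the variance, so the function returns 2.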

In the examples that describe the online and distributed modes, the steps for data source construction and result extraction are similar to the ones above and are omitted.

Using Intel DAAL to analyze streaming data is a minor generalization of the scheme above.

You load the next data block, apply the PCA algorithm to update the intermediate results, and finalize the computation once you have processed the last block of your dataset. The example below assumes that each new data block is loaded into the same memory:

/* Construct object to run PCA in online mode */
pca::Online<> algorithm;

/* Provide dataset into PCA algorithm */
algorithm.input.set(pca::data, dataSource.getNumericTable());

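/* loadDataBlock returns the number of rows loaded; stop when none remain */
while( dataSource.loadDataBlock(nVectorsInBlock) > 0 )
{
       /* Update PCA intermediate result */
       algorithm.compute();
}

/* Finalize PCA analysis */
algorithm.finalizeCompute();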
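
To make the update/finalize split more concrete, here is a rough sketch in plain C++ of what an online algorithm keeps between blocks. It is not how DAAL implements PCA internally, and CovarianceAccumulator is a made-up name. It only shows the idea: each block is folded into running sums (the "compute" step), and the covariance matrix that PCA would decompose is produced once at the end (the "finalize" step).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration (not a DAAL class): keeps the running sums
// needed to build a sample covariance matrix one block at a time.
struct CovarianceAccumulator {
    std::size_t nFeatures;
    std::size_t nObservations = 0;
    std::vector<double> sum;           // per-feature sums, length p
    std::vector<double> crossProduct;  // sums of x_i * x_j, row-major p x p

    explicit CovarianceAccumulator(std::size_t p)
        : nFeatures(p), sum(p, 0.0), crossProduct(p * p, 0.0) {
        assert(p > 0);
    }

    // "Compute" step: fold one row-major block (nRows x p) into the sums.
    void update(const std::vector<double>& block) {
        assert(block.size() % nFeatures == 0);
        const std::size_t nRows = block.size() / nFeatures;
        for (std::size_t r = 0; r < nRows; ++r) {
            for (std::size_t i = 0; i < nFeatures; ++i) {
                const double xi = block[r * nFeatures + i];
                sum[i] += xi;
                for (std::size_t j = 0; j < nFeatures; ++j)
                    crossProduct[i * nFeatures + j] += xi * block[r * nFeatures + j];
            }
        }
        nObservations += nRows;
    }

    // "Finalize" step: turn the sums into the sample covariance matrix.
    std::vector<double> finalize() const {
        assert(nObservations > 1);
        const double n = static_cast<double>(nObservations);
        std::vector<double> cov(nFeatures * nFeatures);
        for (std::size_t i = 0; i < nFeatures; ++i)
            for (std::size_t j = 0; j < nFeatures; ++j)
                cov[i * nFeatures + j] =
                    (crossProduct[i * nFeatures + j] - sum[i] * sum[j] / n) / (n - 1.0);
        return cov;
    }
};
```

Only the sums stay in memory, never the blocks themselves, which is why this mode also works for data that is too big to load at once. Note that this textbook sum-of-products formula can lose precision on badly scaled data, so treat it as an illustration only.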

Finally, let's consider the basic steps for using Intel DAAL to process distributed data.

Note that the library API for processing distributed data is agnostic to the underlying communication technology. The user is responsible for delivering intermediate results from the local nodes to the master node, where the results are finalized.

The library is shipped with MPI*, Hadoop*, and Spark* samples, which can be used as a starting point for the respective application.

The code below shows how to run PCA analysis of the data blocks on the local nodes.

/* Construct object to run PCA in distributed mode on local nodes */
pca::Distributed<step1Local> localAlgorithm;

/* Provide data block into PCA algorithm */
localAlgorithm.input.set(pca::data, dataSource.getNumericTable());

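/* Run PCA */
localAlgorithm.compute();

/*  Extract partial result for its delivery to master node */
/*  getPartialResult() returns a shared pointer; the type below assumes the default correlation method */
SharedPtr<pca::PartialResult<pca::correlationDense> > PartialResulti = localAlgorithm.getPartialResult();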


The code below demonstrates computation of the PCA final results on the master node from partial results delivered from the local nodes:

/* Construct object to run PCA in distributed mode on master node */
pca::Distributed<step2Master> masterAlgorithm;

/* Provide partial results arrived from local nodes into PCA algorithm */
masterAlgorithm.input.add( pca::partialResults, PartialResult1 );
masterAlgorithm.input.add( pca::partialResults, PartialResult2 );

/* Run PCA */
masterAlgorithm.compute();

/* Finalize PCA analysis */
masterAlgorithm.finalizeCompute();

Note that the library provides methods for data serialization/deserialization and compression, which help prepare PCA partial results for sending to, and receiving by, the master node.
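
If the local/master split is new to you, the following plain C++ sketch may help. It does not use DAAL's serialization classes or its PCA partial result type. PartialMoments, computeLocal, pack, unpack and finalizeVariance are all hypothetical names, and the partial result is reduced to per-feature sums to keep it short. The pattern is the same as above: each node summarizes its own block, the summary is flattened into a buffer you can send with whatever transport you use, and the master merges the summaries and finalizes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a partial result (not a DAAL type).
struct PartialMoments {
    std::size_t n = 0;          // number of observations seen on this node
    std::vector<double> sum;    // per-feature sums
    std::vector<double> sumSq;  // per-feature sums of squares
};

// Step 1 (local node): summarize one row-major block with nFeatures columns.
PartialMoments computeLocal(const std::vector<double>& block, std::size_t nFeatures) {
    assert(nFeatures > 0 && block.size() % nFeatures == 0);
    PartialMoments p;
    p.n = block.size() / nFeatures;
    p.sum.assign(nFeatures, 0.0);
    p.sumSq.assign(nFeatures, 0.0);
    for (std::size_t r = 0; r < p.n; ++r) {
        for (std::size_t j = 0; j < nFeatures; ++j) {
            const double x = block[r * nFeatures + j];
            p.sum[j] += x;
            p.sumSq[j] += x * x;
        }
    }
    return p;
}

// Flatten a partial result into a buffer of doubles, ready to be sent.
std::vector<double> pack(const PartialMoments& p) {
    std::vector<double> buf;
    buf.push_back(static_cast<double>(p.n));
    buf.insert(buf.end(), p.sum.begin(), p.sum.end());
    buf.insert(buf.end(), p.sumSq.begin(), p.sumSq.end());
    return buf;
}

// Rebuild a partial result from a received buffer.
PartialMoments unpack(const std::vector<double>& buf) {
    assert(!buf.empty() && (buf.size() - 1) % 2 == 0);
    const std::size_t p = (buf.size() - 1) / 2;
    PartialMoments out;
    out.n = static_cast<std::size_t>(buf[0]);
    out.sum.assign(buf.begin() + 1, buf.begin() + 1 + p);
    out.sumSq.assign(buf.begin() + 1 + p, buf.end());
    return out;
}

// Step 2 (master node): merge partial results and finalize the
// per-feature sample variance.
std::vector<double> finalizeVariance(const std::vector<PartialMoments>& parts) {
    assert(!parts.empty());
    const std::size_t p = parts[0].sum.size();
    std::size_t n = 0;
    std::vector<double> sum(p, 0.0), sumSq(p, 0.0);
    for (const PartialMoments& part : parts) {
        assert(part.sum.size() == p && part.sumSq.size() == p);
        n += part.n;
        for (std::size_t j = 0; j < p; ++j) {
            sum[j] += part.sum[j];
            sumSq[j] += part.sumSq[j];
        }
    }
    assert(n > 1);
    const double nd = static_cast<double>(n);
    std::vector<double> var(p);
    for (std::size_t j = 0; j < p; ++j)
        var[j] = (sumSq[j] - sum[j] * sum[j] / nd) / (nd - 1.0);
    return var;
}
```

The master never sees the raw rows, only the small packed summaries, so the amount of data sent over the network does not grow with the size of the blocks on each node.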

Additional details about using Intel DAAL are available in the Intel DAAL Programming Guide and Reference Manual.


Thanks Andrey, it helps me
