Is this library related to Hadoop? How do I use it on Hadoop?

6 Replies
Sergey_M_Intel2
Employee

This library is for Hadoop and other Big Data infrastructures. 

The beauty of Intel® Data Analytics Acceleration Library, or Intel DAAL, is that its APIs are abstracted from the cross-device/node communication layer. This makes it flexible enough to support a variety of usage scenarios, including a variety of approaches to distributed computing.

To simplify integration of Intel DAAL with popular distributed computing infrastructures and technologies, the library ships with code samples: C++ samples for DAAL distributed algorithms relying on MPI*, and Java* samples for DAAL distributed computing with HDFS and Spark* RDDs.
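The pattern behind DAAL's distributed algorithms can be sketched in plain Java, with no DAAL calls (all names below are illustrative, not DAAL APIs): each node computes a partial result from its local data block, and a master step merges the partials into the final result. The communication layer that moves the partials around (MPI, Spark, plain sockets) is entirely up to the application.

```java
// Illustrative compute/merge pattern behind distributed algorithms:
// partial results are computed independently per data block, then
// merged on a master node. Shown here for a simple mean computation.
public class PartialMergeSketch {
    /** Local step on each node: partial sum and count for one block. */
    static double[] computeLocal(double[] block) {
        double sum = 0.0;
        for (double x : block) sum += x;
        return new double[] { sum, block.length };   // {partialSum, n}
    }

    /** Master step: merge the partials into the final mean. */
    static double merge(double[][] partials) {
        double sum = 0.0, n = 0.0;
        for (double[] p : partials) { sum += p[0]; n += p[1]; }
        return sum / n;
    }

    public static void main(String[] args) {
        // Two "nodes", each holding one block of the data set.
        double[][] partials = {
            computeLocal(new double[] { 1.0, 2.0, 3.0 }),
            computeLocal(new double[] { 4.0, 5.0 })
        };
        System.out.println("mean = " + merge(partials)); // prints "mean = 3.0"
    }
}
```

Because the local and merge steps are separate calls, the same pair can be driven by an MPI rank, a Spark task, or a Hadoop mapper/reducer without changes.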

Thank you,

Sergey Maidanov

zhengda1936
Beginner

I also have the same question about how to use it on Hadoop. To be more specific, how do we use it with MapReduce? Even more specifically, do we invoke the functions in this library from the Map or Reduce function?

Thanks,

Da


Sergey_M_Intel2
Employee

Hi Da,

Yes, Intel DAAL functions are invoked from Hadoop map/reduce functions.

Let us take the distributed computation of Principal Component Analysis using the SVD method (see the DAAL Programming Guide for the workflow details).

Map

    public void map(Object key, InputData inputData, Context context) throws IOException, InterruptedException {

        /* This is a local part of the input data */
        double[] data = inputData.getArray(nFeatures, nVectorsInBlock);
        daal.data.HomogenNumericTable ntData = new daal.data.HomogenNumericTable(data, nFeatures, nVectorsInBlock);

        /* This will contain the partial result on the local node */
        daal.data.HomogenNumericTable ntNodeComputations =
                new HomogenNumericTable(Double.class, nFeatures, nFeatures, NumericTable.AllocationFlag.DoAllocate);

        /* Create the algorithm object */
        PCA pcaAlgorithm = new PCA(Double.class, daal.data.PCA.Method.PCASVD, daal.data.PCA.InputDataType.normalizedDataSet);
        pcaAlgorithm.setComputeMode(daal.data.ComputeMode.Distributed);

        /* Do the computations */
        pcaAlgorithm.compute(ntData, ntNodeComputations);

        /* The number of observations processed on this node is part of the partial result */
        long[] nObservationsArray = { nVectorsInBlock };

        /* Serialize the partial result (ntNodeComputations and nObservationsArray) here */

        context.write(new Text(), serializedPartialData);
    }

Reduce

    public void reduce(Text key, Iterable<SerializedType> values, Context context) throws IOException, InterruptedException {

        /* Arrays for the partial results from the nodes */
        HomogenNumericTable[] computeResults = new HomogenNumericTable[nBlocks];
        HomogenNumericTable[] nObservations = new HomogenNumericTable[nBlocks];

        NumericTable[] mergeInputs = new NumericTable[2 * nBlocks];
        for (int i = 0; i < nBlocks; i++) {
            /* Deserialize the i-th partial result into nObservations[i] and computeResults[i] here */
            mergeInputs[2 * i] = nObservations[i];
            mergeInputs[2 * i + 1] = computeResults[i];
        }

        /* Create numeric tables for storing the PCA results */
        daal.data.HomogenNumericTable eigenvectors =
                new daal.data.HomogenNumericTable(Double.class, nFeatures, nFeatures, NumericTable.AllocationFlag.DoAllocate);
        daal.data.HomogenNumericTable eigenvalues =
                new daal.data.HomogenNumericTable(Double.class, nFeatures, 1, NumericTable.AllocationFlag.DoAllocate);
        daal.data.NumericTable[] results = { eigenvectors, eigenvalues };

        PCA pcaAlgorithm = new PCA(Double.class, PCA.Method.PCASVD, PCA.InputDataType.normalizedDataSet);
        pcaAlgorithm.setComputeMode(ComputeMode.Distributed);
        pcaAlgorithm.merge(mergeInputs, results);

        double[] eigenvaluesArray = eigenvalues.getDoubleArray();
        double[] eigenvectorsArray = eigenvectors.getDoubleArray();
    }

The eigenvaluesArray and eigenvectorsArray will contain the final eigenvalues and eigenvectors, respectively.

I hope it helps,

Thank you,

Sergey Maidanov

zhengda1936
Beginner

I guess the DAAL Programming Guide isn't released yet?

The workflow isn't intuitive to me. I suppose the input is a matrix? So first map() partitions the matrix and runs PCA on its part of the matrix entirely on the local node? Then a single MapReduce pass can generate the final eigenvalues and eigenvectors?

Computing eigenvalues/eigenvectors requires a sequence of matrix-vector multiplications, and each matrix-vector multiplication requires data shuffling in MapReduce. Unless DAAL performs the computation in a distributed fashion in the background, I don't understand how a single MapReduce pass can accomplish the task.
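One way to see why a single map/reduce round can suffice for covariance/SVD-based PCA: when the data matrix is row-partitioned across nodes, X = [X1; X2; ...], the Gram matrix X^T X is just the sum of the per-block Gram matrices Xi^T Xi. Each mapper can therefore emit its small p-by-p local Gram matrix, one reducer sums them, and the eigen-decomposition then runs on that small merged matrix with no iterative shuffling over the full data. A minimal sketch in plain Java (no DAAL calls; this illustrates the algebra, not DAAL's exact internals):

```java
// Demonstrates that per-block Gram matrices sum to the full Gram matrix,
// which is why one map (local Gram) + one reduce (sum) is enough to set
// up the eigenproblem for covariance-based PCA.
public class GramMergeSketch {
    /** Map step: X^T X for one block of rows (p = number of features). */
    static double[][] gram(double[][] block, int p) {
        double[][] g = new double[p][p];
        for (double[] row : block)
            for (int i = 0; i < p; i++)
                for (int j = 0; j < p; j++)
                    g[i][j] += row[i] * row[j];
        return g;
    }

    /** Reduce step: element-wise sum of two partial Gram matrices. */
    static double[][] merge(double[][] a, double[][] b) {
        int p = a.length;
        double[][] s = new double[p][p];
        for (int i = 0; i < p; i++)
            for (int j = 0; j < p; j++)
                s[i][j] = a[i][j] + b[i][j];
        return s;
    }

    public static void main(String[] args) {
        double[][] block1 = { { 1, 2 }, { 3, 4 } };     // rows held by node 1
        double[][] block2 = { { 5, 6 } };               // rows held by node 2
        double[][] full   = { { 1, 2 }, { 3, 4 }, { 5, 6 } };

        double[][] merged = merge(gram(block1, 2), gram(block2, 2));
        double[][] direct = gram(full, 2);
        // Both are {{35.0, 44.0}, {44.0, 56.0}} -- identical.
        System.out.println(java.util.Arrays.deepToString(merged));
        System.out.println(java.util.Arrays.deepToString(direct));
    }
}
```

The expensive pass over the data happens once, locally, inside each mapper; only the small p-by-p partials cross the network.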

Zhang_Z_Intel
Employee

zhengda1936 wrote:

I guess the DAAL Programming Guide isn't released yet?

The programming guide is available as part of the "User and Reference Guide for Intel Data Analytics Acceleration Library". You can find it after you download and install the library. Please visit https://software.intel.com/en-us/articles/announcing-intel-data-analytics-acceleration-library-2016-... and follow the links to download it.

Priyanka_K_
Beginner

Hadoop has native implementations of certain components, both for performance reasons and because Java implementations are unavailable. These components are available in a single, dynamically linked native library called the native hadoop library. On *nix platforms the library is named libhadoop.so.
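As a side note, the platform-specific file name of such a native library can be derived with the standard `System.mapLibraryName` call from the JDK (nothing Hadoop-specific here):

```java
// Maps a library's base name to its platform-specific file name:
// "libhadoop.so" on Linux, "libhadoop.dylib" on macOS, "hadoop.dll" on Windows.
public class NativeLibName {
    public static void main(String[] args) {
        System.out.println(System.mapLibraryName("hadoop"));
    }
}
```

This is the same mapping `System.loadLibrary("hadoop")` applies when it resolves the library on `java.library.path`.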

Thanks,

Priyanka,

Hadoop Developer @ Catch Experts,

http://www.catchexperts.com/hadoop/online-training