I have compiled some of the distributed DAAL samples and can run them from the command line just fine. I have changed OMP_NUM_THREADS and do seem to see faster results. But how do I run across multiple nodes? Do I simply call the executable using mpirun? Also, what is the largest data set in the examples (the one that takes the most CPU time)? I am trying to do some timed test runs to verify changes when running sequentially, with multiple threads, and ultimately with MPI. I use the "time" command, but most run times are so short that the differences may fall within the normal run-to-run deviation. If a long-running test set is available, it would be useful for timing comparisons. Jorge
The "mpi" folder in the Intel(R) DAAL release contains readme.htm with detailed instructions on how to build and run the MPI samples. Depending on the OS, the samples are built and run using either the launcher.bat utility (which invokes mpiexec) or a makefile (which invokes mpirun). The datasets used in the samples are small to keep the size of the release package down. For performance experiments in a distributed environment, it makes sense to use datasets available in open repositories such as the UCI Machine Learning Repository or similar.
Can you please clarify for which algorithm you saw an impact of the OMP_NUM_THREADS variable?
Thanks for the MPI hint. The "samples" and "examples" directories point to different areas of the distribution and contain different code. I assume the code in the examples directory is for distributed use with Spark/Hadoop, and of course the code in the samples directory is for MPI?
The samples directory is what I was after. As a comment, maybe there should be a reference in the DAAL guide to these pages, as this is exactly what I was looking for to use DAAL over MPI. Textual references are made, but there was no direct way to find the code and instructions. I was able to run the examples and I am now on my way to testing DAAL using MPI.
As for the examples where I was experimenting with OMP_NUM_THREADS, I tried three: the one from my_first_daal_program, kmeans_distributed, and covariance_csr_distributed.
But for now I was just focused on trying to get something going and trying to figure out how this all works. At our end, processor affinity will be crucial to get the best use of our clusters. Therefore, libraries that allow us to control or take advantage of the ranks and threads are crucial. I have done k-means analysis experimentation under Hadoop, but I am a beginning user of these types of algorithms.
Thank you, Jorge, for your suggestion to improve the documentation. We will think about how to address it.
The examples demonstrate use of the library interfaces in different usage scenarios, such as batch or online computation. The samples show the use of the algorithms with other technologies such as Hadoop*, Spark*, or MySQL*. In the same samples folder, under the java directory, you will find Hadoop* and Spark* samples that rely on library algorithms such as PCA, covariance, or k-means.
The impact of the OMP_NUM_THREADS value is a side effect, and we are currently analyzing whether it is possible to eliminate it.
The library provides service functionality that helps you set the number of threads. Please have a look at the example set_number_of_threads.cpp and the respective description of setNumberOfThreads() in the Environment class. The library also contains a CPU dispatcher that automatically detects the type of CPU on the computational node and runs the respective code path.
Please let us know if you have more questions or comments on the library and its content.
Is there a white paper or document that describes how an MPI-based DAAL application uses threads? I assume it obeys I_MPI_PIN_DOMAIN and limits threads to that domain, but does it use all available thread contexts? Does MKL_NUM_THREADS control threads within ranks? Are you considering a new class of environment variables like DAAL_NUM_THREADS and DAAL_DOMAIN_xxxx_threads to set thread counts for differing domains within DAAL?
Thanks for your questions.
While Intel® DAAL does not internally rely on a communication technology such as MPI, it is shipped with MPI samples that demonstrate use of the distributed algorithms with MPI. We do not have a document that describes how an MPI-based DAAL application uses threads. Yes, you are absolutely right: an MPI-based Intel® DAAL application obeys I_MPI_PIN_DOMAIN and limits threads to that domain. It is the Intel® TBB library that decides whether or not to use all available thread contexts.
The library provides the tools to control the number of threads. To set the number of threads used by Intel DAAL, please use the setNumberOfThreads method of the Environment class.
You can also find the respective example in the set_number_of_threads examples for C++ and Java.
We are investigating the option of supporting the environment variables DAAL_NUM_THREADS and DAAL_DOMAIN_xxxx_threads that you mention above.