How to use "I_MPI_ADJUST" option efficiently?

youn__kihang · 06-10-2020

Hi All,

I found that the I_MPI_ADJUST options can be used to test whether changing the MPI communication algorithm improves performance.
I also learned that there is an autotuning function that can test multiple options at once. I would like to use this feature to find the best options for my application.
However, the application takes a long time to run and there are many I_MPI_ADJUST options, so I need to exclude unnecessary experiments and prioritize the rest.

Before constructing an experiment with a specific application, I would like to ask the experts some questions.

Q1. The autotuning page says that Intel MPI performance depends on the platform.
Can it also depend on the volume of transfer, the number of calls, or the number of nodes used by the application?
https://software.intel.com/en-us/node/810193

Q2. The I_MPI_ADJUST page says that I_MPI_ADJUST defaults to 0 and that "The default value of zero selects the optimized default settings".
Does this mean one of the options from 1 to N? Or is it Intel's secret recipe?
software.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-windows/top/environment-variable-reference/i-mpi-adjust...

Q3. Is there an I_MPI_ADJUST option for Waitall? There is one for MPI_Barrier. If I apply it, does it affect Waitall?

Q4. Is it possible that the optimal I_MPI_ADJUST option for the application differs between Intel MPI versions?

I plan to test it with the following process.
If any step is unnecessary or missing, please comment.

P1. Check the total time summary of MPI functions in the APS report and list the top 5 functions by elapsed time.

P2. For each of the 5 functions, use autotuning to list the top 5 options that work well on the current platform.

P3. Run the application in 25 cases (5 functions * 5 options) and compare each with the baseline (no options set) to check the performance improvement.

P4. Run a final case that combines the best option for each function and compare it with the baseline.
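The P1 to P4 plan above could be enumerated with a small script like the sketch below. It only builds the environment overrides for each run; it does not launch anything. The function names and algorithm ids are hypothetical placeholders (the real ones would come from the APS report and from the algorithm lists on the I_MPI_ADJUST page), and only collectives that have an I_MPI_ADJUST_<opname> variable can appear here.

```python
import os

# Hypothetical placeholders: replace with the top functions from the APS
# report and the candidate algorithm ids chosen for each of them.
TOP_FUNCTIONS = ["ALLREDUCE", "BCAST", "REDUCE", "ALLGATHER", "BARRIER"]
CANDIDATE_ALGORITHMS = {name: [1, 2, 3, 4, 5] for name in TOP_FUNCTIONS}


def single_option_cases(functions, candidates):
    """P3: one environment override per (function, algorithm) pair."""
    return [
        {f"I_MPI_ADJUST_{name}": str(alg)}
        for name in functions
        for alg in candidates[name]
    ]


def combined_case(best_per_function):
    """P4: merge the best algorithm of every function into one override."""
    return {
        f"I_MPI_ADJUST_{name}": str(alg)
        for name, alg in best_per_function.items()
    }


def build_env(overrides, base_env=None):
    """Copy the base environment and apply the overrides on top of it."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(overrides)
    return env


if __name__ == "__main__":
    cases = single_option_cases(TOP_FUNCTIONS, CANDIDATE_ALGORITHMS)
    print(len(cases), "single-option cases, plus one baseline run")
    for overrides in cases[:3]:
        print(overrides)
```

Each dictionary returned by `build_env` can be passed as the `env` argument of whatever launcher wrapper is used for the application, with an empty override for the baseline run.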

Thank you for reading this long question.
If you have any comments, please share them with us.

Best Regards,
Kihang

PrasanthD_intel · 06-11-2020

Hi Kihang,

Thanks for reaching out to us.

Regarding your first question:

Q1. Can it also depend on the volume of transfer/number of calls/number of nodes used in the application?

To our knowledge, Autotune runs the program with all the available algorithms for the collective operations at various data sizes and produces a binary file that records the optimal algorithms for that program. When this binary file is provided to the program on a later run, the MPI library knows the optimal algorithms from that data and uses them.

So the volume of transfer, the number of calls, and the number of nodes are taken into consideration when the algorithm is selected.

For the remaining questions, we will contact the relevant experts and get back to you.
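A minimal sketch of the two-run workflow described above, under some assumptions: it uses the I_MPI_TUNING_MODE, I_MPI_TUNING_BIN_DUMP and I_MPI_TUNING_BIN variables as we understand them from the autotuning documentation, so please check the exact names and values against the developer reference for your Intel MPI version. The file name, rank count and application path are hypothetical placeholders.

```python
import os
import subprocess


def tuning_env(dump_path, base_env=None):
    """First run: enable autotuning and dump the results to dump_path."""
    env = dict(os.environ if base_env is None else base_env)
    env["I_MPI_TUNING_MODE"] = "auto"
    env["I_MPI_TUNING_BIN_DUMP"] = dump_path
    return env


def production_env(dump_path, base_env=None):
    """Later runs: load the tuning file produced by the first run."""
    env = dict(os.environ if base_env is None else base_env)
    env["I_MPI_TUNING_BIN"] = dump_path
    return env


def launch_command(ranks, app):
    """Build the launcher command line (placeholder arguments)."""
    return ["mpirun", "-n", str(ranks), app]


def run(ranks, app, env):
    """Launch the application with the given environment."""
    return subprocess.run(launch_command(ranks, app), env=env, check=True)


if __name__ == "__main__":
    # Hypothetical values, shown for illustration only; nothing is launched.
    dump = "./tuning-results.dat"
    print(launch_command(4, "./my_app"))
    print({k: v for k, v in tuning_env(dump, {}).items()})
    print({k: v for k, v in production_env(dump, {}).items()})
```

Calling `run(4, "./my_app", tuning_env(dump))` once and then `run(4, "./my_app", production_env(dump))` for later runs would correspond to the tuning run and the tuned run respectively.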

Regards

Prasanth

youn__kihang · 06-11-2020

Hi Prasanth,

What I am trying to ask in Q1 is closer to the following question.

I wonder whether the algorithm recommended by Autotune is optimal for any application.
At the least, it should be a reasonable recommendation for the application.
If not, I need to test all algorithms on a per-application basis, regardless of the Autotune results.

Best Regards,
Kihang

PrasanthD_intel · 06-15-2020

Hi Kihang,

We are forwarding your query to the relevant team.

We will get back to you at the earliest.

Regards

Prasanth

DrAmarpal_K_Intel · 06-18-2020

Hi Kihang,

To answer your questions in the same order:

1. Performance of any MPI library is typically a function of the variables you listed.

2. Yes. 0 selects the default from the out-of-the-box tuning data.

3. No. See https://software.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-windows/top/environment-variable-reference/i-mpi-adjust-family-environment-variables.html

4. Yes, this is possible. Beyond the version of the Intel MPI Library, the optimal settings for your application also depend on the workload, number of nodes, number of ranks per node, message sizes, etc.

Autotuner has a very low overhead. It also tunes all collective operations in your application in a single run, so there is no need to tune different functions separately. A single Autotuner run can provide the best tuning settings for all collective functions.

What are your top MPI functions?
Which application are you trying to tune?
How many nodes are you trying to tune for?
Which interconnect does your cluster provide?

Best regards,
Amar

DrAmarpal_K_Intel · 07-07-2020

Hi Kihang,

I hope you have been able to successfully use Autotuner for your tuning requirements!

"Having not received your response, we are assuming that your concerns have been addressed fully and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only."

Best regards,

Amar