How to profile hybrid OpenMP/MPI code

Rohit_G_1 · ‎08-17-2016

Hi All

I have a piece of code that uses both OpenMP and MPI, and I wish to profile it in different configurations, e.g.

One Haswell node with 20 cores, in the following configurations:

1. 20 MPI tasks with no OpenMP parallelization (or 1 OpenMP thread per task)

2. 10 MPI tasks and 2 OpenMP threads per task

3. 4 MPI tasks and 5 OpenMP threads per task

4. 2 MPI tasks and 10 OpenMP threads per task
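For reference, the four splits above could be launched along the following lines. This is a dry-run sketch, not the exact commands used in this thread: it assumes Intel MPI's mpirun, passes OMP_NUM_THREADS and I_MPI_PIN_DOMAIN=omp through -genv (so that each rank's pinning domain matches its thread count), and omits the solver's input arguments. The helper name launch_line is hypothetical; the binary name is the one that appears later in the thread.

```shell
#!/bin/sh
# launch_line RANKS THREADS
# Prints (does not run) an mpirun command for one rank/thread split.
# Assumption: Intel MPI, where -genv sets a variable for all ranks.
launch_line() {
    echo "mpirun -n $1 -genv OMP_NUM_THREADS $2 -genv I_MPI_PIN_DOMAIN omp ./idr_hybd_test"
}

# The four splits of a 20-core node listed above: "ranks threads".
while read -r r t; do
    # Sanity check: every split should use exactly 20 cores.
    if [ $(( r * t )) -ne 20 ]; then
        echo "split $r x $t does not use 20 cores" >&2
        continue
    fi
    launch_line "$r" "$t"
done <<EOF
20 1
10 2
4 5
2 10
EOF
```

Printing the commands first makes it easy to confirm that each split really occupies all 20 cores before any timing is compared.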

I am running completely independent tasks (linear solvers) with different data sets, so there is NO communication among the MPI tasks. The reason I have MPI tasks is that in the future I would like to have a more fine-grained parallel task that can also use cores from other nodes on the InfiniBand network.

I had expected that, for the matrix I am using in the linear solver, I would see more than a 10x improvement between variants 1 and 4. To be concrete: in variant 1, all 20 tasks finish in 130 seconds (the time of the slowest task). In variant 4, the 2 tasks finish in 13 seconds, but to complete the same amount of work (20 solves) I must run variant 4 ten times. That again adds up to 130 seconds, so there is no net gain from parallelizing with OpenMP.
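The comparison above can be written out as a small shell calculation. It is only an illustration: the function name total_time is hypothetical, and it assumes every batch takes the same time as the one measured (130 s for variant 1, 13 s for variant 4).

```shell
#!/bin/sh
# total_time TOTAL_TASKS CONCURRENT BATCH_SECONDS
# Wall time to finish TOTAL_TASKS independent solves when CONCURRENT of
# them run at once and each batch takes BATCH_SECONDS.
total_time() {
    # Ceiling division: number of batches needed to cover all tasks.
    batches=$(( ($1 + $2 - 1) / $2 ))
    echo $(( batches * $3 ))
}

total_time 20 20 130   # variant 1: one batch of 20 solves  -> 130
total_time 20 2 13     # variant 4: ten batches of 2 solves -> 130
```

Both variants come out at 130 seconds for 20 solves, which is why a 10x shorter solve time in variant 4 shows no overall gain.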

This is what I wish to understand with a tool or a set of tools. I was advised by my cluster administrator to use VTune for OpenMP analysis and ITAC for MPI analysis.

Is there an integrated way of looking at the possible issues with my test? Kindly advise.

with kind regards and thanks in advance for reading my message

Rohit

P.S.: Please note that in order to get these numbers I used the information provided in the articles listed below:

https://software.intel.com/en-us/node/528819

https://software.intel.com/en-us/node/522691#AFFINITY_TYPES

So in my code I use the options listed there to map processes to cores.

Dmitry_P_Intel1 · ‎08-17-2016

Hello Rohit,

In each combination of MPI processes and OpenMP threads you have 20 execution flows running in parallel. Assuming they do the same amount of work in parallel with equivalent affinity, we should expect similar execution times, especially given that you have no real message passing between ranks.

To get statistics on both MPI and OpenMP, I can recommend using MPI Performance Snapshot (https://software.intel.com/en-us/articles/getting-started-with-the-mpi-performance-snapshot), which is part of Parallel Studio XE Cluster Edition.

You can also use VTune. E.g., in VTune Amplifier XE 2016 Update 4 you can find the HPC Performance Characterization analysis type, which should give you basic numbers on MPI and OpenMP efficiency/cost along with other metrics on memory efficiency and FLOPs/vectorization (for processors that support it).

Thanks & Regards, Dmitry

Rohit_G_1 · ‎08-18-2016

Hi Dmitry

I do not understand your first remark.

"In each case of combination of MPI processes and OpenMP threads you have 20 execution flows in parallel"

In cases 2, 3 and 4 I have 10, 4 and 2 execution flows in parallel. Each of them has multiple cores at its disposal, so the parallel regions of the code can be completed with more threads.

I assumed that in case 1, where I have 20 execution flows, the fact that more processes are working on independent sets of data would lead to less effective cache use and more thrashing (since multiple processes contend for the cache, L2 and upwards). My motivation for running fewer processes and using OpenMP to parallelize was to reduce cache thrashing, as fewer processes contend for the caches (L2 and upwards).

I will do the analysis using the tools you have specified.

Thanks for your answer.

Rohit

Dmitry_P_Intel1 · ‎08-19-2016

Hello Rohit,

Now I see what you mean. In the case of 20 MPI ranks, 20 threads do independent pieces of work, while in the case of, say, 5 MPI ranks with 4 threads each, the threads of a rank divide the loop iterations of the same piece of work, and you expect better cache reuse in that case. Yes, it would be interesting to look at the HPC Performance Characterization results to see detailed memory metrics (cache and DRAM access cost) along with the MPI/OpenMP cost in order to judge this.

Thanks & Regards, Dmitry

Rohit_G_1 · ‎08-19-2016

Hi Dmitry

I have generated the VTune data, but when I try opening it in the GUI I always get the error "The data cannot be displayed there is not valid viewpoint of the data".

Do I need to compile my code differently in order to run this analysis? Also, in the standard error output from mpirun I see two errors during collection:

amplxe: Error: Cannot load data file `/pscratch/rogupta/vtune_results.pn134/data.0/sep2aaab7b6f700.20160819T135517.354484.tb6' (Failed to bind sampling data file!).

amplxe: Error: 0x40000025 (Inapplicable report) -- The report 'summary' is not applicable to the result /pscratch/rogupta/vtune_results.pn134/vtune_results.pn134.amplxe.

The command I use to execute my test case is as follows:

mpirun -n 2 -env I_MPI_PIN_DOMAIN socket amplxe-cl -data-limit=1500 -result-dir /pscratch/rogupta/vtune_results -collect hpc-performance ./idr_hybd_test /pscratch/rogupta/Aidr4m.mtx /pscratch/rogupta/rhs4m.vec 10 8

I am connected to the login node of my HPC server via PuTTY with X11 forwarding. After the code finishes executing, I open the .amplxe file created in the pscratch folder using amplxe-gui, but I get the error I mentioned above.

Also, as you might notice in the errors above, generating a command-line report with the summary option doesn't work either.

I am using VTune from Intel Parallel Studio 2016:

"vtune_amplifier_xe_2016.2.0.444464"

I am running VTune on a cluster running Linux.

Could you guide me on what might be going wrong?

with kind regards

Rohit

P.S.:- If you need more information to pin-point the cause of why I see such errors I'll be more than willing to help.

Dmitry_P_Intel1 · ‎08-19-2016

Hello Rohit,

A couple of clarifying questions: do you have the VTune sampling driver installed? To check, please run <vtune_install_dir>/bin64/sep -version

Is it possible to update VTune to 2016 Update 4 version?

If you do the collection on one node, I can also suggest changing the collection command line as follows:

amplxe-cl -data-limit=1500 -result-dir /pscratch/rogupta/vtune_results -collect hpc-performance mpirun -n 2 -env I_MPI_PIN_DOMAIN socket ./idr_hybd_test /pscratch/rogupta/Aidr4m.mtx /pscratch/rogupta/rhs4m.vec 10 8

This will work for a single-node MPI run, but it will not profile all the nodes if you have a multi-node MPI run.

Thanks & Regards, Dmitry

Rohit_G_1 · ‎08-19-2016

Hi Dmitry

This is the output of the sep -version command:

-bash-4.1$ /hpc/shared/apps/intel/vtune_amplifier_xe_2016/bin64/sep -version
Sampling Enabling Product version: 3.15 (private) built by patbbinn on Dec 22 2015 02:30:51
SEP User Mode Version: 3.15.5
SEP Driver Version: Error retrieving SEP driver version
PAX Driver Version: Error retrieving PAX driver version
Platform type: 92
CPU name: Intel(R) Xeon(R) E5/E7 v2 processor
PMU: ivytown

I think I have Update 2 of VTune; I ran the following to find that out:

-bash-4.1$ amplxe-cl --version
Intel(R) VTune(TM) Amplifier XE 2016 Update 2 (build 444464) Command Line Tool
Copyright (C) 2009-2015 Intel Corporation. All rights reserved.

I don't know if they'll install Update 4 quickly, but I have an extended trial license on Windows, which I guess I can update if I can get the update. However, that won't work, right? Not until I have Update 4 on the cluster as well?

Could you also answer my question about the HPC characterization reporting? I mean why doesn't it generate the report? Is it something to do with the update 4?

with kind regards

Rohit

Rohit_G_1 · ‎08-22-2016

Hi Dmitry

I ran the command you suggested and the program completed successfully with the output as expected. VTune created a .amplxe file in my scratch space on the cluster. I tried opening the file with amplxe-gui (I have an X-based connection via SSH) and pointed it to the file in the scratch space.

Then I get the error:

the data cannot be displayed: there is no viewpoint applicable for the data

If I run from the command prompt I get the error:

amplxe: Using result path `/pscratch/rogupta/vtune_res_2_10_h'

amplxe: Opening `/pscratch/rogupta/vtune_res_2_10_h' in read-only mode.
amplxe: Error: 0x40000006 (Insufficient permissions) -- /pscratch/rogupta/vtune_res_2_10_h/sqlite-db

Could you suggest what might be going wrong?

with kind regards

Rohit

Dmitry_P_Intel1 · ‎08-22-2016

Hello Rohit,

What MPI do you use?

Thanks & Regards, Dmitry

Rohit_G_1 · ‎08-22-2016

Hi Dmitry

I am using Intel MPI 5.1.3.181

with kind regards

Rohit

Dmitry_P_Intel1 · ‎08-23-2016

Hello Rohit,

I just noticed that, according to your output, the SEP driver is not installed/loaded (though you mentioned that you profile a HSW node, while the output says IVT). So if you run collections such as HPC Performance Characterization, they will go through perf. It is highly recommended to install 2016 Update 4, since there have been a number of fixes since your version to improve collection over perf.

Thanks & Regards, Dmitry

Rohit_G_1 · ‎08-23-2016

Hi Dmitry

I probably ran the command to check the SEP driver on the login node (which could be an Ivy Bridge), which would explain the inconsistent result.

Meanwhile, I also checked the error log of my code. These messages were generated by VTune:
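Since the login node and the compute nodes can differ, the driver check is worth repeating on the node where the collection actually runs. A minimal sketch, assuming the check is done by scanning the output of sep -version for the error line shown earlier in this thread; the helper name sep_driver_status is hypothetical, and the commented usage line assumes mpirun places the single rank on a compute node.

```shell
#!/bin/sh
# Reads `sep -version` output on stdin and reports whether the SEP driver
# version could be retrieved (i.e. whether the sampling driver is loaded).
sep_driver_status() {
    if grep -q "SEP Driver Version: Error"; then
        echo "not loaded"
    else
        echo "loaded"
    fi
}

# Example with the line reported earlier in the thread:
printf 'SEP Driver Version: Error retrieving SEP driver version\n' | sep_driver_status

# Assumed usage on a compute node (install path as given in the thread):
# mpirun -n 1 /hpc/shared/apps/intel/vtune_amplifier_xe_2016/bin64/sep -version | sep_driver_status
```

If this reports "not loaded" on the compute node as well, the collection will fall back to perf, as described in the previous reply.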

amplxe: Error: Cannot load data file `/pscratch/rogupta/vtune_res_2_10_htsptv2/data.0/sep2aaab7b6f700.20160823T172035.087839.tb6' (Failed to bind sampling data file!).

and

amplxe: Error: 0x40000025 (Inapplicable report) -- The report 'summary' is not applicable to the result /pscratch/rogupta/vtune_res_2_10_htsptv2/vtune_res_2_10_htsptv2.amplxe.

Do these relate to the sep driver and/or the update 4?

Please suggest.

with kind regards

Rohit