Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

Intel Processor Checker

youn__kihang

 

Hello everybody,

 

I have a question about results that vary randomly between runs when a specific processor core is used.

I have a forecast model whose output values vary when it is executed repeatedly on a specific compute node.

Even at the code level, the differences do not appear at the same source location every time, nor at the same point in the run.

By changing the following option, I found that the problem occurs only when a particular core is in use.

I_MPI_PIN_PROCESSOR_LIST="0-6,8-47": No error
I_MPI_PIN_PROCESSOR_LIST="0-7,9-47": Error

I therefore suspect a problem with the 8th core (logical processor ID 7, counting from zero).
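The two pin lists above differ only in which core is left out. To repeat the same experiment for every core, the list can be generated rather than typed by hand. The following is a minimal sketch, assuming 48 logical CPUs numbered 0-47 as in the settings above; the function name pin_list_excluding is hypothetical and not part of Intel MPI.

```python
def pin_list_excluding(excluded: int, total: int) -> str:
    """Return a pin list covering 0..total-1 except `excluded`, e.g. "0-6,8-47"."""
    if not 0 <= excluded < total:
        raise ValueError("excluded core must be in range(total)")

    def span(lo: int, hi: int) -> str:
        # A single ID is written alone; a run of IDs is written as "lo-hi".
        return str(lo) if lo == hi else f"{lo}-{hi}"

    parts = []
    if excluded > 0:
        parts.append(span(0, excluded - 1))
    if excluded < total - 1:
        parts.append(span(excluded + 1, total - 1))
    return ",".join(parts)


if __name__ == "__main__":
    # Print one candidate setting per core so that each can be tried in turn.
    for core in range(48):
        print(f'I_MPI_PIN_PROCESSOR_LIST="{pin_list_excluding(core, 48)}"')
```

Running the model once per generated setting shows whether the failure follows one core only or appears with several.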

This raises three questions.

1) What factors (e.g. hardware or software) could make a single core produce different results?

2) Is this related to the source code or libraries (e.g. MKL, MPI)?

- I ask because I want to know whether other codes are affected as well.

3) Is there a tool (like HPL) that finds this kind of problem quickly?

- I manage a lot of nodes, so I am hoping for a simple test tool, ideally one that, when run, reports whether a node produces differing results.

 

Thank you in advance

Kihang

PrasanthD_intel

Hi Kihang,


Rather than a faulty CPU, we think these errors might be caused by data races in the code.

Have you run the test enough times to confirm that the behaviour appears consistently when that CPU is involved?

Can you provide the command line you use to launch the MPI job and, if possible, the code as well?
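Whether the behaviour really follows the pin list can be checked by running the same command several times under each setting and counting the distinct outputs. The sketch below is only an illustration under stated assumptions: it assumes the program writes its results to standard output (a real forecast model more likely writes files, which would need to be hashed instead), the function name run_digests is hypothetical, and the command in the example is a stand-in for the real launch line.

```python
import hashlib
import os
import subprocess
import sys


def run_digests(cmd, pin_list, repeats=5):
    """Run `cmd` `repeats` times with the given pin list; return the set of stdout digests."""
    # Pass the pin list through the environment, as in the settings quoted above.
    env = dict(os.environ, I_MPI_PIN_PROCESSOR_LIST=pin_list)
    digests = set()
    for _ in range(repeats):
        out = subprocess.run(cmd, env=env, check=True, capture_output=True).stdout
        digests.add(hashlib.sha256(out).hexdigest())
    return digests


if __name__ == "__main__":
    # Placeholder command: replace with the real launch line, for example
    # ["mpirun", "-np", "47", "./model"] (hypothetical executable name).
    cmd = [sys.executable, "-c", "print(sum(i * 0.1 for i in range(1000)))"]
    for pin_list in ("0-6,8-47", "0-7,9-47"):
        n = len(run_digests(cmd, pin_list))
        print(pin_list, "reproducible" if n == 1 else f"{n} distinct outputs")
```

A single digest per setting means the runs were bitwise identical. More than one digest under one setting only, repeated over many trials, points towards the core that setting includes; differences under both settings are more consistent with a race condition.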


You can check the correctness of the code using ITAC (Intel Trace Analyzer and Collector), which will also report race conditions and other errors.

source <itac_installdir>/bin/itacvars.sh

mpirun -np <number_of_processes> -check_mpi ./<executable>


If your program uses OpenMP, TBB, or any other threading model, use Intel Inspector to analyse the application.

You can see how to use Inspector for MPI here: https://software.intel.com/content/www/us/en/develop/documentation/inspector-user-guide-linux/top/mpi-applications-support/collecting-mpi-performance-correctness-data.html


Regards

Prasanth


PrasanthD_intel

Hi Kihang,


Can you provide the CPU information of that processor?

Command: cpuinfo


Intel provides tools for finding problems in HPC applications:

1) To check whether the problem lies in the cluster setup or hardware, you can check the hardware status using Intel Cluster Checker.

source <install_dir>/clck/latest/env/vars.sh

clck -F<Framework> -f nodefile

The nodefile should list the hostnames to check; the framework can be cpu_info, hyper_threading, etc., depending on your requirements. You can get the list of available frameworks using clck -Xlist

For more info please check https://software.intel.com/content/www/us/en/develop/documentation/cluster-checker-user-guide/top/getting-started.html
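Alongside Cluster Checker, a crude per-core consistency check can be sketched in a few lines. This is not an Intel tool and makes several assumptions: it is Linux-only (it relies on os.sched_setaffinity), the names kernel_digest and check_cores are hypothetical, and the kernel is a scalar Python loop, so it does not exercise the vector units that a compiled model or MKL would use. A core that passes here may still misbehave under a real workload.

```python
import hashlib
import os
import struct


def kernel_digest(n: int = 50_000) -> str:
    """Run a fixed floating-point loop and hash the exact bits of the result."""
    acc = 0.0
    x = 0.5
    for i in range(1, n + 1):
        # Logistic map plus a division: deterministic IEEE-754 arithmetic.
        x = 3.9 * x * (1.0 - x)
        acc += x / i
    return hashlib.sha256(struct.pack("<dd", acc, x)).hexdigest()


def check_cores(repeats: int = 3) -> dict:
    """Map each CPU ID available to this process to the set of digests it produced."""
    original = os.sched_getaffinity(0)
    results = {}
    try:
        for cpu in sorted(original):
            os.sched_setaffinity(0, {cpu})  # pin this process to a single CPU
            results[cpu] = {kernel_digest() for _ in range(repeats)}
    finally:
        os.sched_setaffinity(0, original)  # restore the original affinity mask
    return results


if __name__ == "__main__":
    results = check_cores()
    reference = results[min(results)]  # digests from the lowest-numbered CPU
    for cpu, digests in results.items():
        ok = len(digests) == 1 and digests == reference
        print(cpu, "ok" if ok else "MISMATCH")
```

Because every core runs exactly the same arithmetic, any core whose digest differs from the others, or changes between repeats, deserves a closer look with proper hardware diagnostics.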

 

2) To check for problems in the code, you can analyze the application using ITAC.

source <itac_installdir>/bin/itacvars.sh

mpiicc -g -trace <app_name> <args..>

mpirun -genv VT_LOGFILE_FORMAT=SINGLESTF -trace -n 16 -ppn 2 -f hosts.txt ./<executable> 

traceanalyzer ./<application_name>.stf &

For more info please check https://software.intel.com/content/www/us/en/develop/documentation/itac-vtune-mpi-openmp-tutorial-lin/top/identify-communication-issues-with-intel-trace-analyzer-and-collector.html


I hope this information helps you debug the problem.


Regards

Prasanth


PrasanthD_intel

Hi Kihang,


We are closing this thread on the assumption that your issue is resolved.

Please start a new thread for any further questions. Any further interaction in this thread will be considered community only.

Regards

Prasanth

