
Intel Processor Checker

 

Hello everybody,

 

I have a question about results that differ randomly when calculations run on a specific processor.

I have a forecast model whose output values vary when it is run repeatedly on a specific compute node.

Even when I analyze it at the code level, the difference in values does not appear at the same source code location every time, nor at the same point in the run.

By changing the pinning options as follows, I found that the problem occurs only when one specific processor is used.

I_MPI_PIN_PROCESSOR_LIST="0-6,8-47": No error
I_MPI_PIN_PROCESSOR_LIST="0-7,9-47": Error

Therefore, I suspect a problem with the 8th core (processor number 7, counting from 0), since the error appears only when that processor is included in the list.
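
The two pinning lists above can be compared mechanically to confirm which processor numbers differ between them. The following is a minimal sketch, assuming the lists use only the plain comma-and-range form shown here; `expand_pin_list` is a hypothetical helper written for this illustration, not part of Intel MPI, and it does not handle the other syntaxes that I_MPI_PIN_PROCESSOR_LIST accepts.

```python
def expand_pin_list(spec):
    """Expand a list such as "0-6,8-47" into a set of processor numbers.

    Hypothetical helper: handles only comma-separated numbers and ranges.
    """
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


ok = expand_pin_list("0-6,8-47")   # list that ran without error
bad = expand_pin_list("0-7,9-47")  # list that produced the error

only_in_bad = sorted(bad - ok)
only_in_ok = sorted(ok - bad)
print("only in failing list:", only_in_bad)  # [7]
print("only in working list:", only_in_ok)   # [8]
```

Processor 7 is the only one present in the failing run and absent from the working run, which is the 8th processor when counting from 1.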

In this case, the following three questions arise.

1) What factors (e.g. hardware, software) in the core could cause the results to differ?

2) Is this related to the source code (e.g. MKL, MPI)?

- I ask because I want to know whether it affects other codes as well.

3) Is there a tool (like HPL) that quickly finds this problem?

- I have a lot of nodes to manage, so I am hoping for a simple test tool. Ideally it would be a tool that, when run, reports whether or not the node produces differing result values.

 

Thank you in advance

Kihang

2 Replies
Moderator

Hi Kihang,


Can you provide the CPU information of that processor?

command:  cpuinfo


Intel does provide tools to find problems in HPC applications:

1) To check whether the problem is in the cluster setup or hardware, you can check the status of the hardware using Intel Cluster Checker.

source <install_dir>/clck/latest/env/vars.sh

clck -F<Framework> -f nodefile

The nodefile should list the hostnames; the framework can be cpu_info, hyper_threading, etc., depending on your requirement. You can get the list of available frameworks using clck -Xlist

For more info please check https://software.intel.com/content/www/us/en/develop/documentation/cluster-checker-user-guide/top/ge...

 

2) To check for problems in the code, you can analyze the application using ITAC (Intel Trace Analyzer and Collector).

source <itac_installdir>/bin/itacvars.sh

mpiicc -g -trace <app_name> <args..>

mpirun -genv VT_LOGFILE_FORMAT=SINGLESTF -trace -n 16 -ppn 2 -f hosts.txt ./<executable> 

traceanalyzer ./<application_name>.stf &

For more info please check https://software.intel.com/content/www/us/en/develop/documentation/itac-vtune-mpi-openmp-tutorial-li...


Hope this information will help in debugging the problem.


Regards

Prasanth


Moderator

Hi Kihang,


Rather than a specific CPU being faulty, we think these errors might be due to data races in the code.

Have you run the test enough times and found similar behaviour when that CPU is involved?
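
One way to run such a repeated test on each CPU in isolation is sketched below. This is only an illustration, not an Intel tool: `workload` and `check_core` are hypothetical names, the script relies on `os.sched_setaffinity` (available on Linux only), and a pure Python loop does not exercise the vector units the way MKL or the forecast model does, so a clean result here does not prove the core is healthy. A more meaningful check would substitute the real numerical kernel for `workload`.

```python
import hashlib
import os
import struct


def workload(n=50_000):
    """Deterministic floating-point loop; returns a digest of the result bits."""
    x = 0.5
    acc = 0.0
    for i in range(1, n + 1):
        x = 3.9 * x * (1.0 - x)  # logistic map, stays inside (0, 1)
        acc += x / i
    return hashlib.sha256(struct.pack("<dd", x, acc)).hexdigest()


def check_core(cpu, repeats=3):
    """Pin this process to one CPU and return the distinct digests observed."""
    os.sched_setaffinity(0, {cpu})
    return {workload() for _ in range(repeats)}


def main():
    if not hasattr(os, "sched_setaffinity"):
        print("CPU pinning is not available on this platform")
        return
    allowed = sorted(os.sched_getaffinity(0))
    # Reference digest from an unpinned run; if this run lands on a faulty
    # core, every other core will be reported as a mismatch instead.
    reference = workload()
    try:
        for cpu in allowed:
            digests = check_core(cpu)
            status = "OK" if digests == {reference} else "MISMATCH"
            print(f"cpu {cpu}: {status}")
    finally:
        os.sched_setaffinity(0, allowed)  # restore the original CPU mask


if __name__ == "__main__":
    main()
```

If only one CPU is ever reported as MISMATCH across many runs, that points toward hardware; if the mismatches move around, the application itself is the more likely cause.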

Can you provide your command line (how you were launching the MPI job) and, if possible, the code too?


You can check the correctness of the code using ITAC. (This will report any race conditions and other errors too.)

source <itac_installdir>/bin/itacvars.sh

mpirun -np <number_of_processes> -check_mpi ./<executable>


If your program involves OpenMP/TBB or any other threading, use Intel Inspector to analyse the application.

You can see how to use Inspector for MPI here: https://software.intel.com/content/www/us/en/develop/documentation/inspector-user-guide-linux/top/mp...


Regards

Prasanth

