I am writing to ask about a problem of randomly differing results when computing on a specific processor.
I have a forecast model whose output values vary across repeated runs on one particular compute node.
Even when analyzing at the code level, the value differences do not occur at the same source code location every time, nor at the same point in the run.
I have found that the problem is tied to a specific processor by using the following pinning option:
I_MPI_PIN_PROCESSOR_LIST="0-6,8-47": No error
Since the error disappears when logical processor 7 is excluded, I am judging that there is a problem with the 8th core (processor 7).
In this case, the following three questions arise.
1) What factors (e.g. hardware, software) in that core could cause the results to differ?
2) Could this be related to the software stack (e.g. MKL, MPI)?
- I would like to know whether it could affect other codes as well.
3) Is there a tool (like HPL) that can find this problem quickly?
- I have many nodes to manage, so I am hoping for a simple test tool, ideally one that simply reports whether repeated runs produce differing result values.
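For reference, a minimal sketch of the kind of check I have in mind (hypothetical Python; run_workload is a placeholder for a real model step, and pinning to the suspect core would be done externally, e.g. with taskset -c 7 or I_MPI_PIN_PROCESSOR_LIST):

```python
import hashlib

def run_workload():
    # Placeholder numeric kernel; replace with the real model computation.
    s = 0.0
    for i in range(1, 100001):
        s += 1.0 / i
    return repr(s)

def is_reproducible(workload, runs=5):
    """Run the workload several times; True if every run's output is byte-identical."""
    digests = {hashlib.sha256(workload().encode()).hexdigest()
               for _ in range(runs)}
    return len(digests) == 1

if __name__ == "__main__":
    # On a healthy core this should report "stable" every time.
    print("stable" if is_reproducible(run_workload) else "NONDETERMINISTIC")
```

Something like this, launched once per core, would be enough for my purpose.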
Thank you in advance
Can you provide the CPU information for that node?
Intel does provide tools to find problems in HPC applications:
1) To check whether the problem is in the cluster setup/hardware, you can check the status of the hardware using Intel Cluster Checker:
clck -F <framework> -f <nodefile>
Fill the nodefile with hostnames; the framework can be cpu_info, hyper_threading, etc., depending on your requirement. You can get the list of available frameworks using clck -Xlist
For more info please check https://software.intel.com/content/www/us/en/develop/documentation/cluster-checker-user-guide/top/ge...
2) To check for problems in the code, you can analyze the application using ITAC:
mpiicc -g -trace <app_name> <args..>
mpirun -genv VT_LOGFILE_FORMAT=SINGLESTF -trace -n 16 -ppn 2 -f hosts.txt ./<executable>
traceanalyzer ./<application_name>.stf &
For more info please check https://software.intel.com/content/www/us/en/develop/documentation/itac-vtune-mpi-openmp-tutorial-li...
Hope this information helps you debug the problem.
Rather than a specific CPU being faulty, we think these errors might be due to data race conditions in the code.
Have you run the test enough times to confirm that the behaviour appears only when that CPU is involved?
Can you provide your command line (how you launch the MPI job) and, if possible, the code as well?
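As a side note on why a race can shift both the location and the timing of the difference: floating-point addition is not associative, so if a race (or different thread scheduling) causes partial sums to be combined in a different order each run, the result changes even on perfectly healthy hardware. A toy Python illustration (the values are contrived, not from your model):

```python
# Contrived partial results; imagine each produced by a different thread/rank.
vals = [1e16, 1.0, -1e16, 1.0]

# One scheduling: accumulate left to right. 1e16 + 1.0 rounds back to 1e16,
# so the first 1.0 contribution is absorbed and lost.
left_to_right = ((vals[0] + vals[1]) + vals[2]) + vals[3]

# Another scheduling: the large terms cancel first, so both 1.0s survive.
reordered = (vals[0] + vals[2]) + (vals[1] + vals[3])

print(left_to_right, reordered)  # 1.0 2.0 -- same data, different order
```

This is why the mismatch can appear at different source locations and different times on each run.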
You can check the correctness of the code using ITAC; it will also report any race conditions and other errors:
mpirun -np <n> -check_mpi ./<executable>
If your program uses OpenMP/TBB or any other threading, use Intel Inspector to analyse the application.
You can see how to use Inspector for MPI here: https://software.intel.com/content/www/us/en/develop/documentation/inspector-user-guide-linux/top/mp...