Hello everybody,
I am writing to ask about a problem of randomly differing results when computing on a specific processor.
I have a forecast model whose output values vary between repeated runs on one particular compute node.
Even when analyzing at the code level, the difference does not appear at the same source location every run, nor at the same point in execution.
By testing the following pinning options, I found that the problem occurs only when a specific processor is used:
I_MPI_PIN_PROCESSOR_LIST="0-6,8-47": no error
I_MPI_PIN_PROCESSOR_LIST="0-7,9-47": error
Therefore, I judge that there is a problem with the 8th core (logical processor 7).
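For reference, the runs are launched roughly like this (a sketch only; the executable name and rank count are placeholders, not my exact command line):
export I_MPI_PIN_PROCESSOR_LIST="0-6,8-47"   # or "0-7,9-47" for the failing case
mpirun -np 47 ./forecast_model               # placeholder executable; 47 ranks, one per pinned processor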
This raises three questions.
1) What factors in that core (e.g. hardware, software) could cause the results to differ?
2) Is this related to the libraries used in the code (e.g. MKL, MPI)?
- I ask because I want to know whether other codes could be affected as well.
3) Is there a tool (like HPL) that can quickly detect this kind of problem?
- I have many nodes to manage, so I am hoping for a simple test tool, ideally one that simply reports whether repeated runs produce different result values (something like the sketch below).
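Something along these lines is what I have in mind (a rough sketch only; the executable, output file, and run count are placeholders):
export I_MPI_PIN_PROCESSOR_LIST="0-7,9-47"   # the pinning that shows the problem
for i in $(seq 1 10); do
    mpirun -np 47 ./forecast_model           # placeholder executable
    md5sum output.dat >> checksums.txt       # placeholder result file, overwritten each run
done
# more than one distinct checksum means the results are not reproducible
sort -u checksums.txt | wc -l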
Thank you in advance
Kihang
Hi Kihang,
Rather than a specific CPU being faulty, we think these errors might be due to data races in the code.
Have you run the test enough times to confirm that the behaviour appears only when that CPU is involved?
Can you provide your command line (how you launch the MPI job) and, if possible, the code as well?
You can check the correctness of the code using ITAC; the -check_mpi library also reports data races and other MPI usage errors.
source <itac_installdir>/bin/itacvars.sh
mpirun -np <num_procs> -check_mpi ./<executable>
If your program uses OpenMP, TBB, or any other threading, use Intel Inspector to analyse the application.
You can see how to use Inspector for MPI here: https://software.intel.com/content/www/us/en/develop/documentation/inspector-user-guide-linux/top/mpi-applications-support/collecting-mpi-performance-correctness-data.html
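For example, a threading-error analysis under MPI might look roughly like this (a sketch only; the rank count, result directory, and executable are placeholders):
source <inspector_installdir>/inspxe-vars.sh
# ti2 = "Detect Deadlocks and Data Races"; under MPI a result directory is created per rank
mpirun -np 4 inspxe-cl -collect ti2 -result-dir insp_results -- ./<executable>
# afterwards, point the report at the per-rank result directory Inspector created
inspxe-cl -report problems -result-dir <per_rank_result_dir>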
Regards
Prasanth
Hi Kihang,
Can you provide the CPU information of that processor?
command: cpuinfo
Intel provides tools for finding problems in HPC applications:
1) To check whether the problem lies in the cluster setup or hardware, you can check the status of the hardware with Intel Cluster Checker.
source <install_dir>/clck/latest/env/vars.sh
clck -F <framework> -f nodefile
The nodefile should contain the hostnames of the nodes to check; the framework can be cpu_info, hyper_threading, etc., depending on your requirement. You can get the list of available frameworks using clck -Xlist.
For more info please check https://software.intel.com/content/www/us/en/develop/documentation/cluster-checker-user-guide/top/getting-started.html
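For example, to run the cpu_info framework against two nodes (hostnames are placeholders):
cat > nodefile << EOF
node01
node02
EOF
clck -F cpu_info -f nodefile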
2) To check for problems in the code, you can analyze the application using ITAC.
source <itac_installdir>/bin/itacvars.sh
mpiicc -g -trace -o <executable> <source_files>
mpirun -genv VT_LOGFILE_FORMAT=SINGLESTF -trace -n 16 -ppn 2 -f hosts.txt ./<executable>
traceanalyzer ./<executable>.stf &
For more info please check https://software.intel.com/content/www/us/en/develop/documentation/itac-vtune-mpi-openmp-tutorial-lin/top/identify-communication-issues-with-intel-trace-analyzer-and-collector.html
I hope this information helps you debug the problem.
Regards
Prasanth
Hi Kihang,
We are closing this thread assuming your issue is resolved.
Please raise a new thread for any further questions. Any further interaction in this thread will be considered community only.
Regards
Prasanth