I am attempting to run an application that uses mpich/mpiexec to assign threads to cores, and it fails on my E5-2697A v4 system. I have run the same application extensively and without trouble on similar hardware (a dual Intel Xeon E5-2680-V3 system) under the same version of Fedora (23), which suggests to me that the problem is memory/hardware related.
I note that the E5-2680-V3 system (which runs the application without issue) does not have TSX-NI, whereas the E5-2697A v4 system does. Could this be the issue? Is it possible to disable TSX-NI on my E5-2697A v4 system to diagnose this? Otherwise, would updating the CPU microcode help?
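As a starting point for the TSX-NI and microcode questions above, the kernel's view of both can be read from /proc/cpuinfo. On Linux, TSX shows up as the `hle` and `rtm` CPU flags, and the loaded microcode revision appears in the `microcode` field. The following is a minimal sketch, not a definitive diagnostic; the function names `tsx_flags` and `microcode_rev` are hypothetical helpers made up for illustration.

```shell
#!/bin/sh
# Sketch: report TSX flags and microcode revision as seen by the Linux kernel.
# tsx_flags and microcode_rev are hypothetical helper names, not existing tools.
# Both read /proc/cpuinfo by default; a different file can be passed as $1.

tsx_flags() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # Take the first "flags" line, split it into one token per line,
    # and keep only tokens that are exactly "hle" or "rtm".
    grep -m1 '^flags' "$cpuinfo" | tr ' \t' '\n\n' | grep -x -E 'hle|rtm' | sort -u
}

microcode_rev() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # Print the value of the first "microcode" field, e.g. a hex revision.
    grep -m1 '^microcode' "$cpuinfo" | awk -F': ' '{print $2}'
}

found=$(tsx_flags)
# $found is left unquoted on purpose so the flags print on one line.
echo "TSX flags:" ${found:-none reported}
echo "Microcode revision: $(microcode_rev)"
```

Running this on both the working and the failing machine would show whether the two actually differ in TSX support and which microcode revision each has loaded. It does not show whether TSX is involved in the segfault.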
Very little debugging info is given when the application fails, but given how quickly it fails after launch, it is quite clear that something is very wrong here:
[wri@wrimodels12 runs]$ ems_domain --localize midatl
Starting UEMS Program ems_domain (V15.99.8) on wrimodels12 at Sat Dec 2 20:18:28 2017 UTC
* Localizing "midatl" domain - /home/wri/wrfems/uems/runs/midatl
Projection : lat-lon
Standard Longitude : -41 Degrees
Reference Latitude : 42 Degrees
Reference Longitude : -41 Degrees
Grid NX x NY : 495 x 165
Grid Spacing : 0.170 Degrees
Geog Dset Res : modis_lakes+modis_30s+modis_15s+10m
* Burn'n up 32 processors to localize your domain. Please ignore the smoke - Failed (11)
! Error running GEOGRID - System Signal Code (SN) : 11 (Invalid Memory Reference - Seg Fault)
While perusing the log/domain_geogrid_stdout.log file I saw the following:
> YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
Also use the --nogeogrid flag for debugging.
[wri@wrimodels12 static]$ /home/wri/wrfems/uems/util/mpich2/bin/mpiexec -n 32 /home/wri/wrfems/uems/bin/geogrid
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 47823 RUNNING AT wrimodels12
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
bcapasso: Thank you very much for contacting the Intel® communities. We will do our best to try to provide the information you are looking for.
Regarding your question of whether the problem with the application is related to the Intel® E5-2697A v4 processor supporting TSX-NI: it is hard to tell for sure, as it depends on the requirements of the application itself. Depending on the model of the board, you might be able to disable TSX-NI in the board's BIOS settings or through a BIOS update.
Please keep in mind that Intel's tests were done using Windows as the operating system. Since you are using Fedora, we recommend visiting their forums for further technical assistance on this subject:
If you have any further questions, please let me know.