forrtl: severe (168): Program Exception - illegal instruction

Fan_F_ · 10-11-2015

Hello there,

I am using Intel Fortran Compiler version 14.0.2 on a Linux cluster. After compilation, the executable ran without any problem on the login node. However, when I tried to run it on the compute node, an error occurred: forrtl: severe (168): Program Exception - illegal instruction.

The Vendor ID of the login node is GenuineIntel, and the Vendor ID of the compute node is AuthenticAMD. Is it this difference that causes the error? How can I solve it?

Any help will be much appreciated.

Kevin_D_Intel · 10-11-2015

This suggests the compilation host (i.e. the login node) has processor features that the compiler has defaulted to using but that are not supported on the non-Intel compute node.

You might check whether or not you're using any -m option (https://software.intel.com/en-us/node/579304).

If not, you likely need to make a specific selection for the processor features appropriate for the login and compute nodes.

TimP · 10-11-2015

-msse3 will cover 98% of remaining AMD and Intel nodes.

mecej4 · 10-11-2015

Merely having a different vendor-id is probably not the real issue, although that may be the most noticeable difference between the login and compute nodes.

Look at the contents of /proc/cpuinfo on the different nodes, and recompile with CPU options that are compatible with both. Or, if you are content with running only on the compute nodes, select options to match those.

In particular, study the flags entry in /proc/cpuinfo. With the information there, you may be able to spot the specific "illegal" instruction. Once in a while, some instructions may not be available unless a BIOS flag is correctly set, but that is somewhat rare.
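A minimal sketch of that flags comparison, assuming you have copied the flags line from each node into a shell string. The function name flags_missing_on_target is hypothetical, and it only does plain matching on flag names; it cannot tell which of the missing flags correspond to instructions the compiler actually emitted.

```shell
# Hypothetical helper: list CPU flags that the build host has but the
# target host lacks. Both arguments are space-separated flag strings,
# i.e. the text after "flags :" in /proc/cpuinfo.
flags_missing_on_target() {
    build_flags=$1
    # Unquoted expansion collapses tabs/newlines into single spaces.
    target_flags=" $(echo $2) "
    missing=""
    for flag in $build_flags; do
        case "$target_flags" in
            *" $flag "*) ;;                   # present on both hosts
            *) missing="$missing $flag" ;;    # build host only
        esac
    done
    # Drop the leading space before printing.
    printf '%s\n' "${missing# }"
}

# To get the flag string on a given Linux node:
#   grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2
# Pass the build (login) node string first, the target (compute) node second.
```

For example, flags_missing_on_target "sse sse2 ssse3 sse4_1" "sse sse2 sse4a" prints ssse3 sse4_1. When reading the lists, keep in mind that Linux reports SSE3 under the name pni rather than sse3.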

Fan_F_ · 10-11-2015

Kevin Davis (Intel) wrote:

This suggests the compilation host (i.e. the login node) has processor features that the compiler has defaulted to using but that are not supported on the non-Intel compute node.

You might check whether or not you're using any -m option (https://software.intel.com/en-us/node/579304).

If not, you likely need to make a specific selection for the processor features appropriate for the login and compute nodes using the -x option, https://software.intel.com/en-us/node/579311.

Thank you, Kevin.

Firstly, I didn't use any -m option during compilation. I just compiled as:

ifort -o a.out f1.f90 fn.f90 -I/netcdf_dir/include -L/netcdf_dir/lib -lnetcdf -lnetcdff -limf -assume byterecl

====

Following mecej4's instructions, I looked up the flags entry in /proc/cpuinfo. Because I am a beginner here, I don't know what the flags mean, so I list the information here:

Login node: /proc/cpuinfo

flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat dts tpr_shadow vnmi flexpriority ept vpid

Compute node: /proc/cpuinfo

flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid amd_dcm pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt nodeid_msr npt lbrv svm_lock nrip_save pausefilter

====

So I tried using the -msse2 option on the compute node, but the same error occurred.

TimP · 10-11-2015

As -msse2 has always been the default for x86_64, and also is the default for ifort 14.0 in 32-bit mode, it's not surprising that setting -msse2 made no difference. Your netcdf library must also have been built with a compatible option. If you add -traceback to the compile options, you should be able to determine where any instruction fault occurs if it is in the code you have compiled.

I'm surprised that sse3 doesn't appear in the list of supported flags, as I thought it came before sse4a, but certainly all AMD Opteron CPUs supported sse2.

Fan_F_ · 10-11-2015

Sorry, I misunderstood your instructions earlier.

I compiled with -traceback like this:

FC=ifort
$FC -c cal_entropy.f90 -g -traceback -check all
$FC -c shared_data.f90 -g -traceback -check all
$FC -c sis_1point.f90 -g -traceback -check all
$FC -c resample.f90 -g -traceback -check all
$FC -c init_random_seed.f90 -g -traceback -check all
$FC -c get_cmd_line.f90 -g -traceback -check all
$FC -c main_cal_pp.f90 -g -traceback -check all -I/ncdir/include -L/ncdir/lib -lnetcdf -lnetcdff -limf -assume byterecl
$FC *.o -o cal_pp -I/ncdir/include -L/ncdir/lib -lnetcdf -lnetcdff -limf -assume byterecl

Then I got a run-time error:

forrtl: severe (193): Run-Time Check Failure. The variable 'fengfan_$PP0_SUM' is being used without being defined
Image PC Routine Line Source
cal_pp 00000000004087A6 Unknown Unknown Unknown
libc.so.6 000000310C01ECDD Unknown Unknown Unknown
cal_pp 0000000000408629 Unknown Unknown Unknown

But I have checked the program fengfan, and the variable pp0_sum is defined before it is used.

I am really confused here. Any ideas?

And my netcdf library was built using the same compiler.

What's more, when I ran this program on the other cluster in our institute, everything worked just fine on both the login node and the compute node. But that cluster is now broken, so I have to use the first cluster, where the error occurs.

Lorri_M_Intel · 10-13-2015

I think we're confusing many issues.

If I understand correctly, you're building on the "login node", and running on both the "login node", and the "compute node".
It succeeds on the "login node" but not on the "compute node".
Do I understand correctly?

Now my comments...

In note #5 you say you used -msse2 on the compute node.

I'm sorry, I think you misunderstood --- you need to add "-msse2" when you build on the "login node", to force the instruction set to be acceptable on the "compute node".

Your executable should run on both nodes now.

It is possible that the optimizations being done on the "login node" chose instructions from an instruction set on that node that is not on the "compute node".

Now, the issue in note #7. When you set "-check all" you turn off optimizations *and* turn on uninitialized variable checking.
It is likely that compiling without optimizations generates machine instructions that are acceptable on both the "login node" and "compute node".

So, instead, you're seeing a failure caused by the request for uninitialized variable checking; apparently there is a path where pp0_sum is not getting initialized before being referenced.

I hope this helps --

--Lorri

mecej4 · 10-13-2015

To round out what Lorri said in #8, I'll mention that your statement in #5 that "...I tried using -msse2 option on the compute node, but the same error occurred" indicates that you used -msse2 as an option to your program rather than to the ifort compiler. If so, the option would have had absolutely no effect, since (i) your program is unlikely to check for any command line arguments and, even if it did, (ii) it would be in no position to act on "-msse2".

Here is a suggestion to help you avoid all this confusion. Build and debug your program on a single CPU system rather than a cluster. It may be necessary to run with small array sizes, coarse meshes and small data sets in order to do so, but such size reductions are necessary for making debugging possible. If there are undefined variables in your program, you should fix that before attempting to run on a cluster.

Once your program is running, passing all checks and is giving correct results, you can move it to the cluster. On the cluster, when running the compiler, use a target architecture option (such as -msse2) that is appropriate for the compute node(s), whether the compiler itself is run on the login node or the compute node.
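As a rough sketch of where the target option goes: it belongs on the ifort command lines, for every compile step and the link step, and not on the command line of your own program. The snippet below only prints the commands so you can inspect them first. The variable names FC, ARCH_FLAG and NC_DIR and the two function names are placeholders of mine; -msse2 is just the example value from this thread and /ncdir is the path as written in #7. Other options from #7 are omitted for brevity.

```shell
FC=ifort            # compiler driver, as used earlier in the thread
ARCH_FLAG=-msse2    # example target option; pick one that suits the compute node
NC_DIR=/ncdir       # netCDF install prefix as written in the thread

# Print (do not run) the compile command for one source file.
compile_cmd() {
    printf '%s %s -g -traceback -c %s -I%s/include\n' \
        "$FC" "$ARCH_FLAG" "$1" "$NC_DIR"
}

# Print the link command: first argument is the executable name,
# the remaining arguments are the object files.
link_cmd() {
    out=$1
    shift
    printf '%s %s %s -o %s -L%s/lib -lnetcdf -lnetcdff\n' \
        "$FC" "$ARCH_FLAG" "$*" "$out" "$NC_DIR"
}
```

For instance, compile_cmd main_cal_pp.f90 prints ifort -msse2 -g -traceback -c main_cal_pp.f90 -I/ncdir/include; once the printed line looks right, it can be piped to sh to execute it. Keeping the option in one variable makes it harder to forget it on one of the steps.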

Of course, if the cluster provides compilers only on the login node, be aware that you are doing a kind of cross-compiling, so you may not be able to debug the program on the login node and you may not be able to compile on the compute node, and you have to arrange your workflow to comply. If the login node CPU is capable of executing all the instructions that the compute nodes are capable of, however, you can do debugging/testing on the login node itself.

Fan_F_ · 10-14-2015

Thank you all for your time and patience. I think I didn't state my problem thoroughly in note #7; sorry for that.

We have two clusters in my institute. Name them cluster1 and cluster2. At first I was using cluster1, building on the "login node" and running the executable on the "compute node". Everything worked just fine and the result was correct, so I don't think there is any obvious mistake in my code. But now cluster1 is broken, so I can't provide its /proc/cpuinfo information.

Now I am using cluster2, with the same code except for some necessary changes to file paths etc. When I build and run on the "login node", no error or warning occurs and it gives the correct result. However, when I build on the "login node" OR on the "compute node" and run on the "compute node", I get forrtl: severe (168).

Following mecej4's suggestion in note #9, I am going to recheck the code and build with -traceback and -check all to make sure it passes all checks, although it will take some time.