Daniel_H

[SOLVED] Executable from Intel compiler crashes the PC


Hello,

I'm not 100% sure this is the right place, but it seems related to running executables compiled with the Intel compiler...
(otherwise, feel free to redirect me to the proper place)

Here is my problem:

I'm using Intel C++ 2018.2 (Linux version) to compile a tool that processes scientific data; it is heavily multithreaded and uses the MKL libraries.
If I try to run it on my most recent PC, it crashes the PC completely (leaving it in an undefined state that requires a hard reset).
Crashes are random but occur quite rapidly. I tried compiling with different options without any luck. On reboot (after all the hassle of fsck) I got messages like the ones below, each listed under the compile command that was used:

icpc -O3 -Wuninitialized -funroll-loops -unroll-aggressive -restrict -wd3802 -xHOST

mai 29 14:20:32 erichthonios kernel: [Firmware Bug]: TSC ADJUST differs within socket(s), fixing all errors
mai 29 14:20:32 erichthonios kernel:   #2  #3  #4  #5  #6  #7  #8  #9 #10 #11 #12 #13 #14
mai 29 14:20:32 erichthonios kernel: mce: [Hardware Error]: Machine check events logged
mai 29 14:20:32 erichthonios kernel: mce: [Hardware Error]: CPU 14: Machine Check: 0 Bank 0: f200000000000005
mai 29 14:20:32 erichthonios kernel:  #15
mai 29 14:20:32 erichthonios kernel: mce: [Hardware Error]: TSC 0
mai 29 14:20:32 erichthonios kernel:  #16
mai 29 14:20:32 erichthonios kernel: mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1527596388 SOCKET 0 APIC 30 microcode 2000043
mai 29 14:20:32 erichthonios kernel:  #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35

icpc -O3 -Wuninitialized -funroll-loops -unroll-aggressive -restrict -wd3802 -xAVX

mai 29 14:32:43 erichthonios kernel: [Firmware Bug]: TSC ADJUST differs within socket(s), fixing all errors
mai 29 14:32:43 erichthonios kernel:   #2  #3  #4  #5  #6  #7  #8  #9 #10 #11 #12 #13 #14
mai 29 14:32:43 erichthonios kernel: mce: [Hardware Error]: Machine check events logged
mai 29 14:32:43 erichthonios kernel: mce: [Hardware Error]: CPU 14: Machine Check: 0 Bank 0: b200000000070005
mai 29 14:32:43 erichthonios kernel:  #15
mai 29 14:32:43 erichthonios kernel: mce: [Hardware Error]: TSC 0
mai 29 14:32:43 erichthonios kernel:  #16
mai 29 14:32:43 erichthonios kernel: mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1527597117 SOCKET 0 APIC 30 microcode 2000043
mai 29 14:32:43 erichthonios kernel:  #17 #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35

icpc -O3 -Wuninitialized -funroll-loops -unroll-aggressive -restrict -wd3802

mai 29 14:43:49 erichthonios kernel: mce: [Hardware Error]: Machine check events logged
mai 29 14:43:49 erichthonios kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 0: f200000000000005
mai 29 14:43:49 erichthonios kernel:   #3
mai 29 14:43:49 erichthonios kernel: mce: [Hardware Error]: TSC 0
mai 29 14:43:49 erichthonios kernel:   #4
mai 29 14:43:49 erichthonios kernel: mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1527597785 SOCKET 0 APIC 4 microcode 2000043

icpc -restrict -wd3802

This is dmesg this time, sorry...
[    0.003333] [FirmwareBug]: TSC ADJUST differs within socket(s), fixing all errors
[    0.150022]   #2  #3  #4
[    0.223347] mce: [Hardware Error]: Machine check events logged
[    0.223353] mce: [Hardware Error]: CPU 4: Machine Check: 0 Bank 0: f200000000000005
[    0.250024]   #5
[    0.253338] mce: [Hardware Error]: TSC 0
[    0.286687]   #6
[    0.290007] mce: [Hardware Error]: PROCESSOR 0:50654 TIME 1527676563 SOCKET 0 APIC 8 microcode 2000043


More troubling is that an executable compiled with g++ -O3 works perfectly... and normal daily usage (python, mail, etc.) is also stable.

My hardware:

  •  MB:   TUF X299 Mark 1 Bios 1301
  •  RAM:  128GB
  •  CPU:  Intel(R) Core(TM) i9-7980XE CPU @ 2.60GHz
  •  Video: nvidia GTX1060
  •  Memtest OK for a whole night (5 passes)
  •  Running on 12 cores. Temperature around 68C.

OS: Archlinux
 kernel: Linux version 4.16.11-1-ARCH (builduser@heftig-1505) (gcc version 8.1.0 (GCC)) #1 SMP PREEMPT Tue May 22 21:40:27 UTC 2018

On older hardware (i7-3930K CPU @ 3.20GHz;  16GB RAM;  Asus P9X79), apart from overheating (89C), everything works fine using exactly the same executable under the same conditions.

I tried googling around about this but didn't find any helpful answer. Could it be that the Intel compiler is producing code which is for some reason (e.g. this TSC issue) incompatible with the i9-7980XE?

I'm dismayed at no longer being able to use the added value provided by the Intel compiler (which I bought for this purpose), especially considering the high degree of vectorization involved in the application.

Any help and/or suggestion would be greatly appreciated,

Daniel

AYee1

Those are the symptoms of the system being unstable under AVX and/or AVX512 workloads. "Basic" stress tests like Memtest will not catch it.

It's likely that the Intel Compiler and/or the MKL is issuing AVX(512) whereas GCC is not. Which is why it only crashes with ICC/MKL.
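
To see which vector extensions the CPU advertises, and hence what a runtime-dispatching library could choose, one can read the flags line of /proc/cpuinfo. A minimal Python sketch, assuming a Linux system; the helper name vector_flags is made up for illustration:

```python
def vector_flags(cpuinfo_text: str) -> set:
    """Return the AVX-family flags found on the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Everything after the colon is a space-separated flag list.
            flags = line.split(":", 1)[1].split()
            return {f for f in flags if f.startswith("avx")}
    return set()

# Typical use on Linux (not executed here):
#   with open("/proc/cpuinfo") as fh:
#       print(sorted(vector_flags(fh.read())))
```

If avx512f shows up in the result, code built with -xHOST, or a library that dispatches at run time, may well use AVX-512 on that machine.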

Are you overclocked? Even if you aren't doing it intentionally, some of the X299 motherboards have BIOS bugs that implement the AVX and AVX512 clock offsets improperly (or not at all). This causes the processor to try to run AVX/AVX512 at much higher speeds than it is specified for, which may cause instability. In effect, many X299 motherboards are improperly overclocking the processor out of the box because they don't follow Intel's specifications.

Daniel_H

Hi Mysticial,

Thanks for pointing that out... I'm not, at least willingly, overclocked. I'll nevertheless dig in that direction in the "BIOS"; I've noticed some obscure settings (like "AVX Core Instruction Ratio Negative Offset").

I'm just a bit confused by the fact that even with no optimization, it fails too... unless MKL is issuing AVX instructions even without the agreement of the user...

I'll give updates of the results...

Daniel
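
On the MKL question: MKL is generally understood to pick its code path at run time from what the CPU reports, independently of the -x option passed to the compiler, so an unoptimized build can still execute AVX code inside MKL calls. One way to probe this, sketched below, is to launch the program with the MKL_ENABLE_INSTRUCTIONS environment variable capping MKL at a lower instruction set. The binary name ./my_tool is a placeholder, and the accepted values should be checked against the MKL documentation for the installed version.

```python
import os
import subprocess

def mkl_capped_env(isa: str = "AVX2") -> dict:
    """Copy the current environment and ask MKL to stay at or below `isa`."""
    env = dict(os.environ)
    env["MKL_ENABLE_INSTRUCTIONS"] = isa
    return env

def run_capped(cmd, isa: str = "AVX2") -> int:
    """Run `cmd` (a list such as ["./my_tool"]) with MKL capped at `isa`."""
    return subprocess.run(cmd, env=mkl_capped_env(isa)).returncode

# Hypothetical use: run_capped(["./my_tool"], isa="AVX2")
```

This only limits MKL; vector code emitted by the compiler itself is unaffected.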

AYee1

Daniel H wrote:

Hi Mysticial,

Thanks for pointing that out... I'm not, at least willingly, overclocked. I'll nevertheless dig in that direction in the "BIOS"; I've noticed some obscure settings (like "AVX Core Instruction Ratio Negative Offset").

I'm just a bit confused by the fact that even with no optimization, it fails too... unless MKL is issuing AVX instructions even without the agreement of the user...

I'll give updates of the results...

Daniel

One thing you can almost be sure of is that there's at least some kind of hardware/environment problem. Under no circumstances should a user-mode application cause a machine check exception.
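
The Bank 0 status words in the logs above can be unpacked to see what the CPU reported. A minimal sketch, assuming the architectural IA32_MCi_STATUS layout from Intel's Software Developer's Manual (bit 63 valid, 62 overflow, 61 uncorrected, 60 enabled, 57 processor context corrupt, bits 15:0 the MCA error code); the function name is illustrative:

```python
def decode_mci_status(status: int) -> dict:
    """Split a 64-bit machine-check status word into its main fields."""
    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),
        "overflow": bit(62),           # an earlier error was overwritten
        "uncorrected": bit(61),
        "enabled": bit(60),
        "misc_valid": bit(59),
        "addr_valid": bit(58),
        "context_corrupt": bit(57),    # PCC: processor state unreliable
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }

for word in (0xF200000000000005, 0xB200000000070005):
    print(hex(word), decode_mci_status(word))
```

Both words decode as uncorrected errors with processor context corrupt and MCA error code 0x0005, which the manual's table of simple error codes appears to list as an internal parity error. That points at the hardware rather than at anything the user code did.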

You can try to confirm the hardware/environment problem with a real AVX/AVX512 stress test. At which point you can pretty much ignore the compiler/MKL factor until the hardware stability is fixed.

Intel doesn't actually seem to specify* the AVX/AVX512 speeds for the Skylake X chips. Which is probably why the mobo manufacturers seem to be getting it all wrong. Though in practice, offsets of -4 for AVX and -7 for AVX512 seem to be appropriate for stock speeds. So you can try entering those into the BIOS. If that still fails, you can try decreasing them further. But if that still doesn't work, then there are probably other issues involved.

*If I'm wrong here, can someone point me to an official doc with these specs?
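
As a rough illustration of what these offsets do, assuming each step is one multiplier bin of a nominal 100 MHz base clock (the core ratio of 40 below is a made-up example, not a value from this thread):

```python
def avx_clock_mhz(core_ratio: int, negative_offset: int,
                  bclk_mhz: float = 100.0) -> float:
    """Clock the core would run at while the offset is in effect."""
    return (core_ratio - negative_offset) * bclk_mhz

print(avx_clock_mhz(40, 4))  # AVX offset of -4     -> 3600.0
print(avx_clock_mhz(40, 7))  # AVX-512 offset of -7 -> 3300.0
```

The ratio that actually applies depends on the BIOS and on how many cores are active, so real figures will differ.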


jimdempseyatthecove

>>a part from overheating (89C)

You shouldn't go over 80C. Get a different heatsink and/or fan.

You may also have a satisfactory heatsink and/or fan but the case temperature is too high (different case fans).

Also, your O/S version has to support AVX512.

Jim Dempsey

Daniel_H

Hello all,

Mysticial wrote:

Though in practice, offsets of -4 for AVX and -7 for AVX512 seem to be appropriate for stock speeds. So you can try entering those into the BIOS.

This solved the problem (tested with FIRESTARTER and my code). I still have some questions (probably off-topic for this forum) about having to reduce the frequencies so much, from 4400 MHz (CPU) to 3300 MHz in AVX and 2800 MHz in AVX-512; the motherboard's auto setting gives 3700 MHz and 3500 MHz respectively.

Thanks again for your help.

jimdempseyatthecove wrote:

You shouldn't go over 80C. Get a different heatsink and/or fan.

You may also have a satisfactory heatsink and/or fan but the case temperature is too high (different case fans).

Also, your O/S version has to support AVX512.

I agree the temperature should remain as low as possible. That old case was never great at cooling; I only used it to demonstrate that the code could run correctly.

FYI, Linux has been able to handle AVX-512 since kernel 3.15, and I'm using 4.16, a few generations later.

I guess this solves my case.

Daniel

jimdempseyatthecove

Daniel,

I would like to ask you to submit your issue, and solution, to the motherboard manufacturer, as a means to help improve their product.

Glad you are up and running.

Jim Dempsey

Daniel_H

jimdempseyatthecove wrote:

Daniel,

I would like to ask you to submit your issue, and solution, to the motherboard manufacturer, as a means to help improve their product.

Glad you are up and running.

Jim Dempsey

I'll try this, although I'm not sure ASUS will be responsive to a support request from a Linux user.

Daniel
