Solved: PCM V2.6 on Xeon E5 2667 v2, Fedora 20: floating point exception

Oleg_M_Intel · ‎05-15-2014

Hi, PCM runs fine on my workstation with an i7-3930K under Scientific Linux (a RHEL clone). However, on my server board with a Xeon E5-2667 v2 under Fedora 20, I get an exception:

root@node02 IntelPerformanceCounterMonitorV2.6# ./pcm.x 1

Intel(r) Performance Counter Monitor V2.6 (2013-11-04 13:43:31 +0100 ID=db05e43)

Floating point exception
root@node02 IntelPerformanceCounterMonitorV2.6#

Please advise. Thanks.

Bernard · ‎05-16-2014

Do you know the exact type of exception? I mean x87 type or SIMD type?

Can you run pcm under GDB? It should probably catch the exception and show the IP of faulting code.

Emre_Eraltan · ‎05-16-2014

Hi,

I got the same issue today on a quad-socket Ivy Bridge system (E7-8891 v2 @ 3.20GHz) running RHEL 6.5. I don't think it is related to the OS.

pcm.x[19170] trap divide error ip:40975c sp:7fffdf6431b0 error:0 in pcm.x[400000+30000]

Here is the end of the strace log:

read(3, " 2\nsiblings\t: 20\ncore id\t\t: 4\ncp"..., 1024) = 1024
read(3, " yes\nfpu_exception\t: yes\ncpuid l"..., 1024) = 1024
read(3, "e mce cx8 apic sep mtrr pge mca "..., 1024) = 1024
read(3, "t tm pbe syscall nx pdpe1gb rdts"..., 1024) = 1024
read(3, "ology nonstop_tsc aperfmperf pni"..., 1024) = 1024
read(3, "e3 cx16 xtpr pdcm pcid dca sse4_"..., 1024) = 1024
read(3, "x f16c rdrand lahf_lm ida arat x"..., 1024) = 1024
read(3, "pid fsgsbase smep erms\nbogomips\t"..., 1024) = 1024
read(3, "ss sizes\t: 46 bits physical, 48 "..., 1024) = 1024
read(3, "r_id\t: GenuineIntel\ncpu family\t:"..., 1024) = 1024
read(3, "91 v2 @ 3.20GHz\nstepping\t: 7\ncpu"..., 1024) = 1024
read(3, "\nsiblings\t: 20\ncore id\t\t: 7\ncpu "..., 1024) = 1024
read(3, "es\nfpu_exception\t: yes\ncpuid lev"..., 1024) = 1024
read(3, "mce cx8 apic sep mtrr pge mca cm"..., 1024) = 1024
read(3, "tm pbe syscall nx pdpe1gb rdtscp"..., 1024) = 1024
read(3, "ogy nonstop_tsc aperfmperf pni p"..., 1024) = 1024
read(3, " cx16 xtpr pdcm pcid dca sse4_1 "..., 1024) = 1024
read(3, " f16c rdrand lahf_lm ida arat xs"..., 1024) = 1024
read(3, "pid fsgsbase smep erms\nbogomips\t"..., 1024) = 1024
read(3, "ess sizes\t: 46 bits physical, 48"..., 1024) = 1024
read(3, "dor_id\t: GenuineIntel\ncpu family"..., 1024) = 1024
read(3, "-8891 v2 @ 3.20GHz\nstepping\t: 7\n"..., 1024) = 1024
read(3, "\t: 2\nsiblings\t: 20\ncore id\t\t: 11"..., 1024) = 1024
read(3, "u\t\t: yes\nfpu_exception\t: yes\ncpu"..., 1024) = 1024
read(3, "sr pae mce cx8 apic sep mtrr pge"..., 1024) = 1024
read(3, "2 ss ht tm pbe syscall nx pdpe1g"..., 1024) = 1024
read(3, "od xtopology nonstop_tsc aperfmp"..., 1024) = 1024
read(3, " tm2 ssse3 cx16 xtpr pdcm pcid d"..., 1024) = 330
read(3, "", 1024)                       = 0
close(3)                                = 0
munmap(0x7f5745645000, 4096)            = 0
--- SIGFPE (Floating point exception) @ 0 (0) ---
+++ killed by SIGFPE (core dumped) +++
Floating point exception (core dumped)

I will try to debug it, but I would appreciate any information on this issue, which I have not seen on Sandy Bridge processors.

Thanks,

Emre

Bernard · ‎05-17-2014

Are you sure that the floating point exception is thrown by PCM code?

hilgeman · ‎05-20-2014

I am having the same issue with E5-2667 v2 processors on RHEL6.5. Strangely enough, other processors like the E5-2680 v2 and E5-2697 v2 are fine.

I attached a debugger and got the following trace:

(gdb) r /bin/sleep 2
Starting program: /home/dell-guest/src/IntelPerformanceCounterMonitorV2.6/pcm-power.x /bin/sleep 2
[Thread debugging using libthread_db enabled]

Intel(r) Performance Counter Monitor V2.6 (2013-11-04 13:43:31 +0100 ID=db05e43)

Power Monitoring Utility
Copyright (c) 2011-2012 Intel Corporation

Program received signal SIGFPE, Arithmetic exception.
0x000000000040826c in PCM::PCM (this=0x623080) at cpucounters.cpp:785
785 std::cout << "Number of physical cores: " << (num_cores/threads_per_core) << std::endl;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.107.el6.x86_64 libgcc-4.4.7-3.el6.x86_64 libstdc++-4.4.7-3.el6.x86_64
(gdb)

The problem is a divide by zero: threads_per_core is 0 at this point. We are not using HT, so threads_per_core should be 1. When I set it to 1, it works fine.

regards,

-Martin
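
A note on why an integer division shows up as a "floating point exception": on x86 Linux, dividing an integer by zero raises a divide error, which the kernel delivers as SIGFPE. That matches the "trap divide error" line in the kernel log earlier in the thread. Below is a minimal sketch of guarding the division. physical_core_count is a hypothetical helper, not a PCM function, and falling back to one thread per core is only the workaround described above, not proper detection.

```cpp
#include <cstddef>

// Hypothetical helper, not part of PCM.
// Returns num_cores / threads_per_core, treating an undetected (zero)
// threads_per_core as one thread per core instead of dividing by zero.
std::size_t physical_core_count(std::size_t num_cores, std::size_t threads_per_core)
{
    if (threads_per_core == 0)
        threads_per_core = 1;  // assumption: HT is off when detection failed
    return num_cores / threads_per_core;
}
```

With this guard, physical_core_count(16, 0) yields 16 rather than terminating the process, and physical_core_count(16, 2) yields 8.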

Bernard · ‎05-20-2014

I had also started to suspect the division instruction, and now you have confirmed it.

Emre_Eraltan · ‎05-20-2014

I believe Martin found the issue.

Indeed, I had also disabled HT on my Dell platform. It seems that PCM is initializing threads_per_core to 0

PCM::PCM() :
    UnsupportedMessage("Error: unsupported processor. Only Intel(R) processors are supported (Atom(R) and microarchitecture codename Nehalem, Westmere, Sandy Bridge and Ivy Bridge)."),
    cpu_family(-1),
    cpu_model(-1),
    original_cpu_model(-1),
    threads_per_core(0),
   ...

and cannot get the information properly from /proc/cpuinfo, i.e. ++threads_per_core is never executed.

Initializing threads_per_core to 1 fixes the issue, but this is just a workaround. Another workaround is to enable HT.

Thanks,
Emre

Roman_D_Intel · ‎05-20-2014

Thanks for reporting this.

Could you please share your /proc/cpuinfo file (for example, attach it to your reply) so that we can fix this properly?

Thank you

Roman

Emre_Eraltan · ‎05-21-2014

Hi Roman,

You can find the cpuinfo attached.

Regards,

Emre

Roman_D_Intel · ‎05-22-2014

thanks for the data.

Could you try this fix (put it above the line throwing the exception - cpucounters.cpp:785):

if (threads_per_core == 0)
{
    for (int i = 0; i < num_cores; ++i)
    {
        // count the logical CPUs that share socket and core with CPU 0
        if (topology[i].socket == topology[0].socket && topology[i].core_id == topology[0].core_id)
            ++threads_per_core;
    }
}

thanks,

Roman
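
For anyone who wants to try the counting logic of the fix above outside PCM, here is a self-contained sketch. TopologyEntry and count_threads_per_core are hypothetical names introduced for illustration. Only the socket and core_id fields come from the snippet in this thread, and PCM's real topology type may look different.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for PCM's per-logical-CPU topology record; only the
// two fields used by the fix in this thread are modelled.
struct TopologyEntry
{
    int socket;
    int core_id;
};

// Counts the logical CPUs that share socket and core_id with entry 0,
// i.e. the number of hardware threads on the first core.
// Returns 0 only when the topology is empty.
int count_threads_per_core(const std::vector<TopologyEntry>& topology)
{
    int threads_per_core = 0;
    for (std::size_t i = 0; i < topology.size(); ++i)
    {
        if (topology[i].socket == topology[0].socket &&
            topology[i].core_id == topology[0].core_id)
            ++threads_per_core;
    }
    return threads_per_core;
}
```

With HT disabled, every (socket, core_id) pair appears once, so the function returns 1. With HT enabled, each pair appears twice and it returns 2. The sketch assumes all cores have the same thread count, since it inspects only the first core.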

Emre_Eraltan · ‎05-22-2014

Hi Roman,

Thanks for the fix, which I have attached as a patch file.

I confirm that it now works on my quad-socket Ivy Bridge platform with HT disabled.

Regards,

Emre

Roman_D_Intel · ‎05-23-2014

Emre, thanks a lot for testing