Hi, I am running PCM on my workstation with an i7-3930K and Scientific Linux (a clone of RHEL) just fine. However, on my server board with a Xeon E5-2667 v2 running Fedora 20, PCM fails with an exception:
root@node02 IntelPerformanceCounterMonitorV2.6# ./pcm.x 1
Intel(r) Performance Counter Monitor V2.6 (2013-11-04 13:43:31 +0100 ID=db05e43)
Copyright (c) 2009-2013 Intel Corporation
Floating point exception
root@node02 IntelPerformanceCounterMonitorV2.6#
Please advise. Thanks.
Hi Roman,
Thanks for the fix, which I have attached as a patch file.
I confirm that it now works on my quad-socket Ivy Bridge platform with HT disabled.
Regards,
Emre
Do you know the exact type of exception: x87 or SIMD?
Can you run pcm under GDB? It should probably catch the exception and show the IP of faulting code.
Hi,
I hit the same issue today on a quad-socket Ivy Bridge system (E7-8891 v2 @ 3.20GHz) running RHEL 6.5, so I don't think it is related to the OS.
pcm.x[19170] trap divide error ip:40975c sp:7fffdf6431b0 error:0 in pcm.x[400000+30000]
Here is the end of the strace log:
read(3, " 2\nsiblings\t: 20\ncore id\t\t: 4\ncp"..., 1024) = 1024
read(3, " yes\nfpu_exception\t: yes\ncpuid l"..., 1024) = 1024
read(3, "e mce cx8 apic sep mtrr pge mca "..., 1024) = 1024
read(3, "t tm pbe syscall nx pdpe1gb rdts"..., 1024) = 1024
read(3, "ology nonstop_tsc aperfmperf pni"..., 1024) = 1024
read(3, "e3 cx16 xtpr pdcm pcid dca sse4_"..., 1024) = 1024
read(3, "x f16c rdrand lahf_lm ida arat x"..., 1024) = 1024
read(3, "pid fsgsbase smep erms\nbogomips\t"..., 1024) = 1024
read(3, "ss sizes\t: 46 bits physical, 48 "..., 1024) = 1024
read(3, "r_id\t: GenuineIntel\ncpu family\t:"..., 1024) = 1024
read(3, "91 v2 @ 3.20GHz\nstepping\t: 7\ncpu"..., 1024) = 1024
read(3, "\nsiblings\t: 20\ncore id\t\t: 7\ncpu "..., 1024) = 1024
read(3, "es\nfpu_exception\t: yes\ncpuid lev"..., 1024) = 1024
read(3, "mce cx8 apic sep mtrr pge mca cm"..., 1024) = 1024
read(3, "tm pbe syscall nx pdpe1gb rdtscp"..., 1024) = 1024
read(3, "ogy nonstop_tsc aperfmperf pni p"..., 1024) = 1024
read(3, " cx16 xtpr pdcm pcid dca sse4_1 "..., 1024) = 1024
read(3, " f16c rdrand lahf_lm ida arat xs"..., 1024) = 1024
read(3, "pid fsgsbase smep erms\nbogomips\t"..., 1024) = 1024
read(3, "ess sizes\t: 46 bits physical, 48"..., 1024) = 1024
read(3, "dor_id\t: GenuineIntel\ncpu family"..., 1024) = 1024
read(3, "-8891 v2 @ 3.20GHz\nstepping\t: 7\n"..., 1024) = 1024
read(3, "\t: 2\nsiblings\t: 20\ncore id\t\t: 11"..., 1024) = 1024
read(3, "u\t\t: yes\nfpu_exception\t: yes\ncpu"..., 1024) = 1024
read(3, "sr pae mce cx8 apic sep mtrr pge"..., 1024) = 1024
read(3, "2 ss ht tm pbe syscall nx pdpe1g"..., 1024) = 1024
read(3, "od xtopology nonstop_tsc aperfmp"..., 1024) = 1024
read(3, " tm2 ssse3 cx16 xtpr pdcm pcid d"..., 1024) = 330
read(3, "", 1024) = 0
close(3) = 0
munmap(0x7f5745645000, 4096) = 0
--- SIGFPE (Floating point exception) @ 0 (0) ---
+++ killed by SIGFPE (core dumped) +++
Floating point exception (core dumped)
I will try to debug it, but I would appreciate any information on this issue, which I have not seen on Sandy Bridge processors.
Thanks,
Emre
Are you sure that floating point exception is thrown by PCM code?
I am having the same issue with E5-2667 v2 processors on RHEL 6.5. Strangely enough, other processors like the E5-2680 v2 and E5-2697 v2 are fine.
I attached a debugger and got the following trace:
(gdb) r /bin/sleep 2
Starting program: /home/dell-guest/src/IntelPerformanceCounterMonitorV2.6/pcm-power.x /bin/sleep 2
[Thread debugging using libthread_db enabled]
Intel(r) Performance Counter Monitor V2.6 (2013-11-04 13:43:31 +0100 ID=db05e43)
Power Monitoring Utility
Copyright (c) 2011-2012 Intel Corporation
Program received signal SIGFPE, Arithmetic exception.
0x000000000040826c in PCM::PCM (this=0x623080) at cpucounters.cpp:785
785 std::cout << "Number of physical cores: " << (num_cores/threads_per_core) << std::endl;
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.107.el6.x86_64 libgcc-4.4.7-3.el6.x86_64 libstdc++-4.4.7-3.el6.x86_64
(gdb)
The problem is an integer divide by zero: threads_per_core is 0 at that point. We are not using HT, so threads_per_core should be 1; when I set it to 1, it works fine.
regards,
-Martin
I had also started to suspect a division instruction; now you have confirmed it.
I believe Martin found the issue.
Indeed, I had also disabled HT on my Dell platform. It seems that PCM initializes threads_per_core to 0:
PCM::PCM() :
    UnsupportedMessage("Error: unsupported processor. Only Intel(R) processors are supported (Atom(R) and microarchitecture codename Nehalem, Westmere, Sandy Bridge and Ivy Bridge)."),
    cpu_family(-1),
    cpu_model(-1),
    original_cpu_model(-1),
    threads_per_core(0),
    ...
and then does not obtain the value properly from /proc/cpuinfo, i.e. ++threads_per_core is never executed.
Initializing threads_per_core to 1 fixes the issue, but this is just a workaround. Another workaround is to enable HT.
Thanks,
Emre
Thanks for reporting this.
Could you please share your /proc/cpuinfo file (for example, attach it to your reply) so that we can fix this properly?
Thank you
Roman
Thanks for the data.
Could you try this fix? Put it just above the line that throws the exception (cpucounters.cpp:785):
if(threads_per_core == 0)
{
    // count the logical CPUs that share socket and core with topology[0]
    for (int i = 0; i < num_cores; ++i)
    {
        if(topology[i].socket == topology[0].socket && topology[i].core_id == topology[0].core_id)
            ++threads_per_core;
    }
}
Thanks,
Roman
Emre, thanks a lot for testing.