Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

4-Socket System - Xeon E7-4850. Unstable performance on Socket 0.

Andreas_Hollmann
Beginner

Hi,

I discovered uneven utilization on my NUMA machine.
The workload is the STREAM benchmark; it is evenly distributed
across the sockets and initialized in parallel.

It's a 4-socket Westmere-EX machine with 10 cores per socket (NUMA node)
and 80 logical CPUs in total. The machine is equipped with 256 GByte of RAM.

It runs well for some time, then the performance drops and htop
shows the CPUs on socket 0 at 100 % utilization while sockets 1-3
sit at only about 50 %. I'm using likwid-pin, which does strict
thread-to-core pinning: the first thread goes to the first core in
the list, and so on.

I'm running the STREAM benchmark like this:

wget www.cs.virginia.edu/stream/FTP/Code/stream.c

gcc -fopenmp -mcmodel=medium -O -DSTREAM_ARRAY_SIZE=1000000000 \
    -DNTIMES=1000 stream.c -o stream.1000M.1000

likwid-pin -c 0-39 ./stream.1000M.1000

The performance also drops without pinning, but then it's not clear which
thread is running on which core.

What could cause such drops, and how can I detect the cause?

- thermal throttling (checked and everything seems OK; see the throttle-counter sketch right after this list)
- power capping? (is this implemented on Westmere-EX?)
- bad memory? (but why does it run well in the beginning?)
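
One way to double-check the thermal point is to watch the kernel's per-CPU throttle counters while the benchmark runs. A minimal sketch, assuming this kernel exposes the sysfs thermal_throttle files:

[cpp]#include <stdio.h>

/* Print the per-CPU core throttle counters exposed by the Linux kernel
   (files may be absent depending on kernel config). A counter that keeps
   climbing while the bandwidth drops would point to thermal throttling. */
int main(void) {
    char path[128];
    for (int cpu = 0; cpu < 80; cpu++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/thermal_throttle/core_throttle_count", cpu);
        FILE *f = fopen(path, "r");
        if (!f) continue;                  /* not exposed for this CPU */
        long count = 0;
        if (fscanf(f, "%ld", &count) == 1 && count > 0)
            printf("cpu%d: core_throttle_count = %ld\n", cpu, count);
        fclose(f);
    }
    return 0;
}[/cpp]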

Here is also the output of Intel's Performance Counter Monitor (PCM) tool.

It looks like this in the beginning (output of /opt/PCM/pcm.x 2 -nc; the values are for a 2-second interval):

 EXEC  : instructions per nominal CPU cycle
 IPC   : instructions per CPU cycle
 FREQ  : relation to nominal CPU frequency = 'unhalted clock ticks'/'invariant timer ticks'
         (includes Intel Turbo Boost)
 AFREQ : relation to nominal CPU frequency while in active state (not in a power-saving C state)
         = 'unhalted clock ticks'/'invariant timer ticks while in C0 state'
         (includes Intel Turbo Boost)
 READ  : bytes read from memory controller (in GBytes)
 WRITE : bytes written to memory controller (in GBytes)
 TEMP  : temperature reading in 1 degree Celsius relative to the TjMax temperature (thermal headroom):
         0 corresponds to the max temperature


 Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | READ  | WRITE | TEMP
----------------------------------------------------------------
 SKT    0     0.13   0.26   0.50    1.00    27.20    11.52     N/A
 SKT    1     0.13   0.26   0.50    1.00    27.14    11.51     N/A
 SKT    2     0.13   0.27   0.50    1.00    27.13    11.51     N/A
 SKT    3     0.13   0.27   0.50    1.00    27.12    11.51     N/A
----------------------------------------------------------------
 TOTAL  *     0.13   0.27   0.50    1.00    108.59    46.05     N/A

 Instructions retired:   42 G ; Active cycles:  160 G ; Time (TSC): 4020 Mticks
 C0 (active,non-halted) core residency: 49.95 %

 C1 core residency: 49.64 %; C3 core residency: 0.03 %; C6 core residency: 0.37 %;
 C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %;

 PHYSICAL CORE IPC                 : 0.53 => corresponds to 13.26 % utilization for cores in active state
 Instructions per nominal CPU cycle: 0.26 => corresponds to 6.62 % core utilization over time interval
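
As a sanity check, the derived metrics above can be reproduced by hand from the interval totals. A minimal sketch (assuming 80 logical CPUs, 2 hardware threads per core, and PCM's utilization percentages being relative to a peak of 4 instructions per cycle):

[cpp]#include <stdio.h>

int main(void) {
    /* Interval totals taken from the PCM output above (2-second interval). */
    const double instructions  = 42e9;     /* instructions retired              */
    const double active_cycles = 160e9;    /* unhalted core cycles, all threads */
    const double tsc_ticks     = 4020e6;   /* invariant TSC ticks               */
    const int    logical_cpus  = 80;
    const int    smt_per_core  = 2;
    const double max_ipc       = 4.0;      /* assumed peak retirement rate      */

    double exec     = instructions  / (tsc_ticks * logical_cpus);  /* ~0.13 */
    double ipc      = instructions  / active_cycles;               /* ~0.26 */
    double freq     = active_cycles / (tsc_ticks * logical_cpus);  /* ~0.50 */
    double core_ipc = ipc * smt_per_core;                          /* ~0.53 */

    printf("EXEC=%.2f IPC=%.2f FREQ=%.2f CORE_IPC=%.2f (~%.1f %% of peak)\n",
           exec, ipc, freq, core_ipc, 100.0 * core_ipc / max_ipc);
    return 0;
}[/cpp]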


--- And then it drops to these values ---

 Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | READ  | WRITE | TEMP
----------------------------------------------------------------
 SKT    0     0.06   0.12   0.50    1.00    13.82    5.28     N/A
 SKT    1     0.06   0.25   0.23    1.00    12.68    4.97     N/A
 SKT    2     0.06   0.25   0.23    1.00    12.67    4.97     N/A
 SKT    3     0.06   0.25   0.23    1.00    12.67    4.96     N/A
----------------------------------------------------------------
 TOTAL  *     0.06   0.20   0.30    1.00    51.84    20.17     N/A

 Instructions retired:   19 G ; Active cycles:   96 G ; Time (TSC): 4021 Mticks
 C0 (active,non-halted) core residency: 29.84 %

 C1 core residency: 53.53 %; C3 core residency: 0.00 %; C6 core residency: 16.62 %;
 C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %;

 PHYSICAL CORE IPC                 : 0.40 => corresponds to 9.91 % utilization for cores in active state
 Instructions per nominal CPU cycle: 0.12 => corresponds to 2.96 % core utilization over time interval

----

These lines are interesting: SKT 0 has an IPC of 0.12 while SKT 1 has 0.25.
SKT 0: FREQ is 0.50 (half of the logical CPUs are idle because of Hyper-Threading).
SKT 1: Here FREQ is only 0.23, which means that even the "active" threads are idling. Why?

 Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | READ  | WRITE | TEMP
 SKT    0     0.06   0.12   0.50    1.00    13.82    5.28     N/A
 SKT    1     0.06   0.25   0.23    1.00    12.68    4.97     N/A


perf shows similar results; performance and bandwidth go down by 50 %:

perf stat --per-socket --interval-print 2000 -a \
  -e "uncore_mbox_0/event=bbox_cmds_read/","uncore_mbox_1/event=bbox_cmds_write/" \
  sleep 3600

Help or suggestions would be much appreciated.

The same happens when I run STREAM on the first socket only: at first good performance, then it drops and becomes unstable.

Thanks and best regards,
Andreas

McCalpinJohn
Honored Contributor III

I don't know why socket 0 is slowing down, but once it does, it is not surprising that the other threads go idle. Although it may vary by product and version, the first result that I found suggests that the default time for OpenMP threads to "spin" before "sleeping" is 200 milliseconds. So if the threads on socket 0 slow down enough, the threads running on the other sockets will finish spinning and go to sleep.

Other than thermal throttling (which could be either in the processor or in the DRAM), the most likely cause of such a slowdown would be interference with other processes.  You left lots of free thread contexts, which usually takes care of this problem, but won't help if the contending process is moving lots of data.  If you have "transparent huge pages" enabled, the daemon that coalesces small pages into large pages might be causing trouble.   You might want to dump the file system caches before running the job, just to be sure that the OS does not decide to write a lot of dirty buffer data.

I would probably revert to a manually instrumented version of the code so that I could check the CPU core frequency and elapsed time for each thread in each iteration of each kernel.
 

TimP
Honored Contributor III

I believe all versions of Intel OpenMP have the same default KMP_BLOCKTIME=200, after which OpenMP threads go idle. It can be changed either by environment variable or by function call (taking effect at the next parallel region).
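
For example, a minimal sketch of the function-call route (kmp_set_blocktime() is an Intel-runtime extension, not standard OpenMP, so this assumes linking against the Intel runtime, libiomp5):

[cpp]/* Intel OpenMP runtime extension; declared explicitly in case the
   compiler's own omp.h (e.g. gcc's) has no prototype for it. */
extern void kmp_set_blocktime(int msec);

int main(void) {
    kmp_set_blocktime(10000);   /* spin up to 10 s before sleeping (default 200 ms) */
    #pragma omp parallel
    {
        /* ... parallel work; threads now spin much longer at barriers ... */
    }
    return 0;
}[/cpp]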

gcc's libgomp has had similar controls for about five years, via OMP_WAIT_POLICY and GOMP_SPINCOUNT. On Linux, gcc builds can be linked against the Intel OpenMP runtime (e.g. as supplied with MKL) so that KMP_BLOCKTIME applies.

Customers have disagreed vociferously about whether 200 ms or any other value is an appropriate default, and about whether requiring this value to be adjusted (e.g. to switch threading models or to maintain CPU binding) is appropriate, but I have heard no expectation of a change.

I would think you would use facilities such as OMP_PROC_BIND to bind threads to cores, and (for Intel OpenMP) KMP_AFFINITY=verbose to get a report on such binding.  libgomp by itself may be difficult unless you disable HyperThreading in BIOS setup.

In my understanding, you would need nontemporal intrinsics to see full STREAM performance in gcc, to match the streaming-store options of icc.
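
For reference, a minimal sketch of what that looks like for the triad kernel with the SSE2 intrinsics available on Westmere (it assumes 16-byte-aligned arrays and a length divisible by 2):

[cpp]#include <emmintrin.h>   /* SSE2: _mm_load_pd, _mm_stream_pd */

/* Triad with streaming (nontemporal) stores: a[j] = b[j] + scalar*c[j].
   Assumes a, b, c are 16-byte aligned and n is a multiple of 2. */
void triad_nt(double *a, const double *b, const double *c, double scalar, long n)
{
    const __m128d s = _mm_set1_pd(scalar);
    for (long j = 0; j < n; j += 2) {
        __m128d v = _mm_add_pd(_mm_load_pd(&b[j]),
                               _mm_mul_pd(s, _mm_load_pd(&c[j])));
        _mm_stream_pd(&a[j], v);   /* store bypasses the cache */
    }
    _mm_sfence();                  /* order the streaming stores before later accesses */
}[/cpp]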

McCalpinJohn
Honored Contributor III

I don't think that the non-temporal stores make much performance difference for the Xeon E7 parts. 

If I understand correctly, the implementation of the directories for the coherence protocol requires that even streaming stores (effectively) read the cache line before overwriting it, so the primary difference between streaming and non-streaming stores is one of timing.  I.e., cached stores don't get written back to memory until they become victims, while streaming stores (probably?) still get written to memory as soon as possible.  

This does not make a lot of difference in performance unless you have chosen the array sizes and offsets that result in bad DRAM bank conflicts.  In this case the addresses of the streaming stores are more likely to "line up" with the addresses of the loads and make the bank conflicts worse.

In any case, it seems unlikely that this would have anything to do with changes in STREAM performance during the middle of a single run.  

More data would be helpful -- especially hardware performance counter measurements that exclude the OpenMP spin loops.  To get this, I typically "strip-mine" the loops manually to get an outer loop with one iteration per OpenMP thread and an inner loop that I can instrument.   An example code fragment is:

[cpp]#pragma omp parallel for private(j,jstart,jstop)
    for (i=0; i<numthreads; i++) {
        rdtsc64(starthi[i],startlo[i]);          /* per-thread start timestamp */
        jstart = i*(N/numthreads);
        jstop  = (i+1)*(N/numthreads)-1;
        for (j=jstart; j<=jstop; j++) a[j] = b[j]+scalar*c[j];
        rdtsc64(endhi[i],endlo[i]);              /* per-thread end timestamp */
    }[/cpp]
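
(rdtsc64() here is not a library routine; one possible definition is a small macro that reads the time-stamp counter into 32-bit high/low halves:)

[cpp]/* Possible definition of the rdtsc64() helper used above:
   RDTSC returns the 64-bit TSC in EDX:EAX. */
#define rdtsc64(hi,lo) \
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi))[/cpp]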

Patrick_F_Intel1
Employee

Hello Andreas,

Certainly Dr. McCalpin is the expert on STREAM.

I have seen behavior somewhat like this on servers where the memory gets too hot. If I recall correctly, the issue in that case was a fan that was not kicking in (or something like that).

Have you tried running stream on just 1 processor and its associated memory for a long time to see if the peak MB/s can be maintained? And then try the next processor, etc, to see if maybe there is one processor/memory combo causing issues?

Or, in a really seems-dumb-to-me-but-maybe-it-works kind of thing, point a big fan into the server and see if the slowdown goes away.

There is a utility 'powertop' at https://01.org/powertop which may help diagnose power-related problems. I'm not really familiar with what exactly it can do but, knowing the authors, I imagine powertop might provide useful insights.

Pat

Andreas_Hollmann
Beginner

@ John D. McCalpin, Tim Prince and Patrick Fay

First of all, thank you for your helpful comments.

I was away for one week and couldn't reply earlier.

I've made a screenshot that shows the situation.

I'm using pcm-sensor.x and plotting the memory reads and writes at 10 Hz (0.1-second interval). All sockets except socket 0 show a more or less straight line. In the beginning everything is fine on socket 0 too, but then the behavior starts to change, as shown in the image below. This also happens with different kernels (3.10 and 3.13). I've written a small triad benchmark to show this behavior, and I'm running it with

OMP_NUM_THREADS=10 GOMP_CPU_AFFINITY="0-9" ./triad
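
The benchmark is essentially just a repeated OpenMP triad over arrays much larger than the caches; a minimal sketch (not the exact code; the array size is a placeholder), built with something like gcc -O2 -fopenmp triad.c -o triad:

[cpp]#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (200L * 1000 * 1000)   /* ~1.6 GB per array, well beyond the caches */

int main(void) {
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    double scalar = 3.0;

    /* First-touch initialization in parallel so pages land on the local node. */
    #pragma omp parallel for
    for (long j = 0; j < N; j++) { a[j] = 1.0; b[j] = 2.0; c[j] = 0.5; }

    for (int iter = 0; ; iter++) {            /* run until interrupted */
        double t = omp_get_wtime();
        #pragma omp parallel for
        for (long j = 0; j < N; j++)
            a[j] = b[j] + scalar * c[j];
        t = omp_get_wtime() - t;
        /* three arrays of 8-byte elements are touched per iteration */
        printf("%d  %.2f GB/s\n", iter, 3.0 * N * 8 / t / 1e9);
        fflush(stdout);
    }
    return 0;
}[/cpp]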

 

Tomorrow I'll try the big fan, since I'm quite sure that it is a hardware problem, but it will be hard to find the responsible DIMM module.

@Patrick

I know from the lm-sensors developers that registered DDR3 DIMM modules are equipped with thermal sensors, but I haven't found a way to read them. There is a kernel module (jc42) for these thermal sensors, but if the SMBus is multiplexed on the mainboard, support has to be implemented for each mainboard and the vendor has to provide the information about the multiplexing.

Reading these sensors is supported neither in the BIOS nor in the tools that came with the IPMI device. Are there mainboards that support on-DIMM thermal sensors?

It would be really helpful to identify broken memory modules. Broken in the sense of thermal problems.
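
If the jc42 module does bind on a given board, the readings should show up through the hwmon interface; a minimal sketch of scanning for them (assuming the usual /sys/class/hwmon layout):

[cpp]#include <stdio.h>
#include <string.h>

/* Look for jc42 DIMM temperature sensors via the hwmon sysfs interface.
   Only works if the jc42 module is loaded and the SMBus path is reachable. */
int main(void) {
    char path[128], name[64];
    for (int i = 0; i < 32; i++) {
        snprintf(path, sizeof(path), "/sys/class/hwmon/hwmon%d/name", i);
        FILE *f = fopen(path, "r");
        if (!f) continue;
        if (fgets(name, sizeof(name), f) && strncmp(name, "jc42", 4) == 0) {
            long milli_c = 0;
            snprintf(path, sizeof(path), "/sys/class/hwmon/hwmon%d/temp1_input", i);
            FILE *t = fopen(path, "r");
            if (t && fscanf(t, "%ld", &milli_c) == 1)
                printf("hwmon%d (jc42): %.1f degC\n", i, milli_c / 1000.0);
            if (t) fclose(t);
        }
        fclose(f);
    }
    return 0;
}[/cpp]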

 

Andreas_Hollmann
Beginner

Patrick Fay (Intel) wrote:

I have seen behavior somewhat like this on servers where the memory gets too hot. If I recall correctly, the issue in that case was a fan that was not kicking in (or something like that).

Have you tried running stream on just 1 processor and its associated memory for a long time to see if the peak MB/s can be maintained? And then try the next processor, etc, to see if maybe there is one processor/memory combo causing issues?

Or, in a really seems-dumb-to-me-but-maybe-it-works kind of thing, point a big fan into the server and see if the slowdown goes away.

This solution really worked: a 40-watt fan solved the problem. It is almost certainly a problem of overheated memory modules.

Thanks for your help!
