Hi,
I ran into an interesting performance loss in my measurements.
I have a two-socket system; each socket holds an E5-2680 processor with 8 cores and Hyper-Threading. Hyper-Threading was ignored.
On this system I started a program 16 times at the same time, each instance pinned to a different core. First, I set all cores to 2.7 GHz and saw:
Program 0 Runtime 7.7s
Program 8 Runtime 7.63s
Then I set the cores on the second socket to 1.2 GHz and saw:
Program 0 Runtime 12.18s
Program 8 Runtime 15.73s
Program 8 ran slower, which is expected because core 8 now had a lower frequency. But why was program 0 also slower? Its core's frequency wasn't touched.
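For illustration, a setup of this kind can be scripted roughly as follows; taskset and the cpufreq sysfs interface are one possible choice of tools, and ./prog is only a placeholder for the measured program:

# pin one instance to each of the 16 physical cores and start them together
for c in $(seq 0 15); do
    taskset -c $c ./prog &
done
wait

# second run: cap cores 8-15 (second socket) at 1.2 GHz before starting
for c in $(seq 8 15); do
    echo 1200000 | sudo tee /sys/devices/system/cpu/cpu$c/cpufreq/scaling_max_freq
done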
Regards,
Bo
- Tags:
- Parallel Computing
Did you verify that you actually can set different clock rates per socket? (measure the rates too)
Jim Dempsey
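For example, the frequency each core is actually running at can be read from the cpufreq sysfs files (best while the programs are under load, since an idle governor may lower it), assuming the cpufreq driver exposes them:

# current frequency of every logical CPU, in kHz
grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
# or the values reported by the kernel
grep MHz /proc/cpuinfo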
Yes. I get the following output with "cat /proc/cpuinfo | grep MHz":
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
With "numactl --hardware", I get:
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 32735 MB
node 0 free: 30458 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
BTW, you can also recognize the new frequency from the different runtimes of the two measurements.
To check whether the new frequency had been set, I ran "cat /proc/cpuinfo | grep MHz" and "numactl --hardware" again and got the same output as above.
Did you allocate memory on the fast node or the slow one?
--Vladimir
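One way to see where the memory of a running instance actually ended up is to look at its per-node statistics; numastat ships with the numactl package, and <pid> stands for the process id of one of the instances:

# per-node page counts of the process
numastat -p <pid>
# or, mapping by mapping (the N0=/N1= fields give the pages on each node)
cat /proc/<pid>/numa_maps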
Each program has its own local memory, i.e. the memory is distributed across the two sockets.
In your motherboard BIOS you can configure the memory in two different ways:
UMA: all memory banks attached to both sockets are interleaved (sequential addresses are distributed in turn across all banks), so that on average memory access is uniform from everywhere. Depending on who wrote the BIOS user guide, what "interleaved" means is occasionally mistranslated; some guides describe it backwards.
NUMA: the memory attached to each socket forms a contiguous address block, meaning CPU 0 can access its locally attached block faster than the remotely attached one.
Now then, for your sample program: if your memory system is configured as UMA, slowing down one CPU will slow down both CPUs' access to memory. If it is configured as NUMA, and provided that memory is allocated from the addresses local to each CPU, each CPU should show the results you expected.
You will have to read up on how to configure your memory system (UMA or NUMA), as well as on the rules to follow to ensure that your memory allocations, and their use, reside on the socket you expect.
Jim Dempsey
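If the board is configured as NUMA, the local-versus-remote effect described above can be checked directly with numactl; ./prog is again a placeholder for the measured program:

# run on socket 0 with memory forced onto node 0 (local) ...
numactl --cpunodebind=0 --membind=0 ./prog
# ... and with memory forced onto node 1 (remote)
numactl --cpunodebind=0 --membind=1 ./prog
# UMA-like behaviour can be approximated by interleaving pages across both nodes
numactl --interleave=all ./prog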
Right, the simplest way to check what's going on is to use VTune Amplifier and look at the difference in hotspots.
--Vladimir
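On the command line this would be something along the lines of the following; amplxe-cl is the CLI that ships with VTune Amplifier XE, and r_fast/r_slow are arbitrary result-directory names:

# collect hotspots for one instance in each frequency configuration
amplxe-cl -collect hotspots -result-dir r_fast -- taskset -c 0 ./prog
amplxe-cl -collect hotspots -result-dir r_slow -- taskset -c 0 ./prog
# then compare the two profiles
amplxe-cl -report hotspots -result-dir r_fast
amplxe-cl -report hotspots -result-dir r_slow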