Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

performance loss

Bo_W_3
Beginner
672 Views

Hi,

I observed an unexpected performance loss in my measurements.

I have a system with two sockets, each holding an E5-2680 processor. Each processor has 8 cores with hyper-threading; the hyper-threads were not used.

On this system, I started 16 instances of a program at the same time and pinned each instance to a different core. At first, I set all cores to 2.7 GHz and saw:

Program 0 Runtime 7.7s

Program 8 Runtime 7.63s

Then I set the cores on the second socket to 1.2 GHz and saw:

Program 0 Runtime 12.18s

Program 8 Runtime 15.73s

Program 8 ran slower, which is expected because core 8 had a lower frequency. But why was program 0 also slower? Its frequency wasn't touched.
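
As a quick sanity check on these numbers, the slowdown of each program can be compared with the clock ratio. This is a minimal sketch that uses only the runtimes and frequencies quoted in this post; the variable names are made up for illustration. It assumes that a purely clock-bound program would slow down by roughly the ratio of the two frequencies.

```python
# Runtimes in seconds, copied from the measurements above.
runtime_all_fast = {0: 7.70, 8: 7.63}   # all cores at 2.7 GHz
runtime_mixed = {0: 12.18, 8: 15.73}    # second socket at 1.2 GHz

# Rough upper bound on the slowdown of a purely clock-bound program
# running on the slowed socket (assumption, see lead-in).
clock_ratio = 2.7 / 1.2

# Slowdown factor of each program between the two runs.
slowdown = {p: runtime_mixed[p] / runtime_all_fast[p] for p in runtime_all_fast}

for program, factor in sorted(slowdown.items()):
    print(f"program {program}: {factor:.2f}x slower (clock ratio {clock_ratio:.2f}x)")
```

Program 8 comes out close to the clock ratio, while program 0 slows down by more than half even though its own clock is unchanged, which suggests that something shared between the sockets is involved.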

 

Regards,

Bo

8 Replies
jimdempseyatthecove
Honored Contributor III

Did you verify that you actually can set different clock rates per socket? (measure the rates too)

Jim Dempsey

Bo_W_3
Beginner

Yes. I get the following output with "cat /proc/cpuinfo | grep MHz":

cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000
cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 2700.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000
cpu MHz        : 1200.000

With "numactl --hardware", I get:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 32735 MB
node 0 free: 30458 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
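
To make the relation between the two listings explicit, the reported frequencies can be grouped by NUMA node. The following is a minimal sketch, assuming the `cpu MHz` lines appear in logical-CPU order; the sample text is rebuilt from the output above and the helper names are hypothetical.

```python
import re

def parse_cpu_mhz(cpuinfo_text):
    """Return the 'cpu MHz' values in the order they appear (logical CPU order)."""
    return [float(m) for m in re.findall(r"^cpu MHz\s*:\s*([\d.]+)", cpuinfo_text, re.M)]

def parse_node_cpus(numactl_text):
    """Map NUMA node id -> list of logical CPU ids from `numactl --hardware` output."""
    nodes = {}
    for node, cpus in re.findall(r"^node (\d+) cpus:((?: \d+)+)", numactl_text, re.M):
        nodes[int(node)] = [int(c) for c in cpus.split()]
    return nodes

def mhz_per_node(cpuinfo_text, numactl_text):
    """Return node id -> sorted list of distinct frequencies on that node's CPUs."""
    mhz = parse_cpu_mhz(cpuinfo_text)
    return {node: sorted({mhz[c] for c in cpus})
            for node, cpus in parse_node_cpus(numactl_text).items()}

# Sample data rebuilt from the listings above: 8 fast CPUs, 8 slow CPUs,
# then the same pattern again for the hyper-thread siblings.
sample_cpuinfo = "\n".join(
    f"cpu MHz        : {mhz:.3f}" for mhz in ([2700.0] * 8 + [1200.0] * 8) * 2
)
sample_numactl = (
    "node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23\n"
    "node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31\n"
)

print(mhz_per_node(sample_cpuinfo, sample_numactl))
```

This only checks what the kernel reports for each node; it is not a measurement of the clock rate itself.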

 

 

 

Bo_W_3
Beginner

BTW, you can also recognize the new frequency from the different runtimes of the two measurements.

Vladimir_P_1234567890

Did you allocate memory on the fast node or the slow one?

--Vladimir

Bo_W_3
Beginner

Each program uses local memory, i.e. the memory is distributed across the two sockets.

jimdempseyatthecove
Honored Contributor III

In your motherboard BIOS you can configure the memory in two different ways:

UMA: all banks attached to both sockets are interleaved (sequential addresses are distributed in turn across all banks), so that on average memory access is uniform everywhere. Depending on who wrote the BIOS user guide, there is sometimes a mistranslation of what "interleaved" means; some guides list this backwards.

NUMA: the memory attached to each socket occupies a contiguous block of addresses, meaning CPU0 can access its locally attached block faster than the remotely attached block.

Now then, in your sample program, if your memory system is configured as UMA, slowing down one CPU will slow down both CPUs' access to memory. If your memory system is configured as NUMA, and memory is allocated from addresses local to each CPU, then each CPU would show the results you expected.
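
The difference between the two layouts can be illustrated with a toy address-to-socket mapping. This is only a sketch: the interleave granularity (64 bytes here), the per-socket memory size, and the function names are assumptions made for illustration, not values taken from any particular BIOS or chipset.

```python
INTERLEAVE_BYTES = 64           # hypothetical interleave granularity
SOCKET_MEM_BYTES = 32 * 2**30   # hypothetical 32 GiB attached to each socket
NUM_SOCKETS = 2

def home_socket_uma(addr):
    """Interleaved layout: consecutive chunks alternate between the sockets."""
    return (addr // INTERLEAVE_BYTES) % NUM_SOCKETS

def home_socket_numa(addr):
    """Contiguous layout: each socket owns one contiguous address block."""
    return addr // SOCKET_MEM_BYTES

def remote_fraction(home_socket, cpu_socket, start, length, step=INTERLEAVE_BYTES):
    """Fraction of chunks in [start, start + length) that live on another socket."""
    addrs = range(start, start + length, step)
    remote = sum(1 for a in addrs if home_socket(a) != cpu_socket)
    return remote / len(addrs)

# A 1 MiB buffer at the start of socket 0's memory, accessed from socket 0.
buf = 2**20
print("interleaved:", remote_fraction(home_socket_uma, 0, 0, buf))
print("contiguous: ", remote_fraction(home_socket_numa, 0, 0, buf))
```

In this toy model the interleaved layout sends half of the accesses to the other socket no matter where the program runs, whereas the contiguous layout keeps a locally allocated buffer entirely on the local socket.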

You will have to read up on how to configure your memory system (UMA or NUMA), as well as on the rules to follow to ensure that your memory allocations, and their use, reside on the socket you expect.
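
On Linux, one common way to keep a process and its memory on the same socket is to launch it under `numactl`. The sketch below only builds such a command line and, separately, pins the current process with `os.sched_setaffinity`. The helper names and the program path `./program` are placeholders, and it assumes that node 0 holds CPU 0, as in the "numactl --hardware" listing earlier in the thread.

```python
import os

def numactl_command(program, cpu, node):
    """Build a numactl invocation that pins to one CPU and allocates on one node."""
    return ["numactl", f"--physcpubind={cpu}", f"--membind={node}", program]

def pin_current_process(cpu):
    """Restrict the calling process to a single logical CPU (Linux only)."""
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)

# Command for "program 0": CPU 0 with memory from node 0 (placeholder binary).
print(" ".join(numactl_command("./program", cpu=0, node=0)))

# Pin this Python process to one of the CPUs it is currently allowed to use.
if hasattr(os, "sched_setaffinity"):
    first_allowed = min(os.sched_getaffinity(0))
    try:
        print(pin_current_process(first_allowed))
    except OSError as exc:
        print("pinning not permitted here:", exc)
```

Whether the binding has any effect still depends on the BIOS exposing the memory as NUMA in the first place.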

Jim Dempsey

Vladimir_P_1234567890

Right, the simplest way to check what's going on is to take VTune Amplifier and look at the difference in hotspots between the two runs.

--Vladimir
