Hi,
I ran into an interesting performance loss in my measurements.
I have a two-socket system; each socket holds an E5-2680 processor with 8 cores and Hyper-Threading. Hyper-Threading was ignored.
On this system I started a program 16 times at the same time, each instance pinned to a different core. First, I set all cores to 2.7 GHz and saw:
Program 0 Runtime 7.7s
Program 8 Runtime 7.63s
Then I set the cores on the second socket to 1.2 GHz and saw:
Program 0 Runtime 12.18s
Program 8 Runtime 15.73s
Program 8 ran slower, which is expected because core 8 now had a lower frequency. But why was program 0 also slower? Its core's frequency wasn't touched.
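For illustration, a setup of this kind can be scripted roughly as follows; taskset and the cpufreq sysfs interface are one possible choice of tools, and ./prog is only a placeholder for the measured program:

# pin one instance to each of the 16 physical cores and start them together
for c in $(seq 0 15); do
    taskset -c $c ./prog &
done
wait

# second run: cap cores 8-15 (second socket) at 1.2 GHz before starting
for c in $(seq 8 15); do
    echo 1200000 | sudo tee /sys/devices/system/cpu/cpu$c/cpufreq/scaling_max_freq
done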
Regards,
Bo
- Tags:
- Parallel Computing
Did you verify that you actually can set different clock rates per socket? (measure the rates too)
Jim Dempsey
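For example, the frequency each core is actually running at can be read from the cpufreq sysfs files (best while the programs are under load, since an idle governor may lower it), assuming the cpufreq driver exposes them:

# current frequency of every logical CPU, in kHz
grep . /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
# or the values reported by the kernel
grep MHz /proc/cpuinfo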
Yes. I get the following output with "cat /proc/cpuinfo | grep MHz":
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 2700.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
cpu MHz : 1200.000
With "numactl --hardware", I get:
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 32735 MB
node 0 free: 30458 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
BTW, you can also recognize the new frequency from the different runtimes of the two measurements.
To check whether the new frequency had been set, I ran "cat /proc/cpuinfo | grep MHz" and "numactl --hardware" again and got the same output as above.
Did you allocate memory on the fast node or the slow one?
--Vladimir
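One way to see where the memory of a running instance actually ended up is to look at its per-node statistics; numastat ships with the numactl package, and <pid> stands for the process id of one of the instances:

# per-node page counts of the process
numastat -p <pid>
# or, mapping by mapping (the N0=/N1= fields give the pages on each node)
cat /proc/<pid>/numa_maps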
Each program has its own local memory, i.e. the memory is distributed across the two sockets.
In your motherboard BIOS you can configure the memory in two different ways:
UMA: all memory banks attached to both sockets are interleaved (sequential addresses are distributed in turn across all banks), so that on average memory access is uniform from everywhere. Depending on who wrote the BIOS user guide, what "interleaved" means is occasionally mistranslated; some guides describe it backwards.
NUMA: the memory attached to each socket forms a contiguous address block, meaning CPU 0 can access its locally attached block faster than the remotely attached one.
Now then, for your sample program: if your memory system is configured as UMA, slowing down one CPU will slow down both CPUs' access to memory. If it is configured as NUMA, and provided that memory is allocated from the addresses local to each CPU, each CPU should show the results you expected.
You will have to read up on how to configure your memory system (UMA or NUMA), as well as on the rules to follow to ensure that your memory allocations, and their use, reside on the socket you expect.
Jim Dempsey
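If the board is configured as NUMA, the local-versus-remote effect described above can be checked directly with numactl; ./prog is again a placeholder for the measured program:

# run on socket 0 with memory forced onto node 0 (local) ...
numactl --cpunodebind=0 --membind=0 ./prog
# ... and with memory forced onto node 1 (remote)
numactl --cpunodebind=0 --membind=1 ./prog
# UMA-like behaviour can be approximated by interleaving pages across both nodes
numactl --interleave=all ./prog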
Right, the simplest way to check what's going on is to use VTune Amplifier and look at the difference in hotspots.
--Vladimir
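On the command line this would be something along the lines of the following; amplxe-cl is the CLI that ships with VTune Amplifier XE, and r_fast/r_slow are arbitrary result-directory names:

# collect hotspots for one instance in each frequency configuration
amplxe-cl -collect hotspots -result-dir r_fast -- taskset -c 0 ./prog
amplxe-cl -collect hotspots -result-dir r_slow -- taskset -c 0 ./prog
# then compare the two profiles
amplxe-cl -report hotspots -result-dir r_fast
amplxe-cl -report hotspots -result-dir r_slow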