Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Sandy Bridge performance degradation compared to Westmere

amk21
Beginner
11,732 Views

I created a simple memtest that allocates a large vector, then repeatedly draws a random number and updates the vector cell at that index.
Pseudocode:

DataCell* dataCells = new DataCell[VECTOR_SIZE];
for (int cycles = 0; cycles < gCycles; cycles++) {
    u64 randVal = random();
    // take the address of the randomly chosen cell
    DataCell* dataCell = &dataCells[randVal % VECTOR_SIZE];
    dataCell->m_count = cycles;
    dataCell->m_random = randVal;
    dataCell->m_flag = 1;
}
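
For readers who want to reproduce the access pattern, here is a minimal C sketch of the loop above. It is not the original memtest source. The three 8-byte fields are an assumption chosen so that each cell is 24 bytes, which matches the reported "vector size 240000456" for 10000019 cells; the xorshift generator `next_random` and the function name `run_memtest` are hypothetical stand-ins for the parts the post does not show.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed layout: three 8-byte fields, 24 bytes per cell. The real field
 * types are not shown in the post. */
typedef struct {
    uint64_t m_count;
    uint64_t m_random;
    uint64_t m_flag;
} DataCell;

/* xorshift64: a small stand-in generator (hypothetical choice, not the
 * random() call from the post). The state must be non-zero. */
static uint64_t next_random(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Performs `cycles` random-index updates on `cells` and returns the number
 * of updates done. Each update writes all three fields of one cell. */
static uint64_t run_memtest(DataCell *cells, uint64_t vector_size,
                            uint64_t cycles, uint64_t seed)
{
    uint64_t state = seed ? seed : 1;
    uint64_t i;
    for (i = 0; i < cycles; i++) {
        uint64_t r = next_random(&state);
        DataCell *cell = &cells[r % vector_size];
        cell->m_count = i;
        cell->m_random = r;
        cell->m_flag = 1;
    }
    return i;
}
```

To mirror the runs in this thread, one would allocate 10000019 cells with calloc, call run_memtest with 100000000 cycles, and time the call. Because consecutive updates do not depend on each other, the CPU can keep several cache misses in flight at once.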


I'm using the perf utility to gather performance counter data.
The most interesting results appear when the vector is larger than the cache size (tix8: 20MB, tix2: 12MB).

Hardware specification:

tix2 - CPU: X5680 3.33GHz, motherboard: Supermicro X8DTU, memory: 64GB split as 32GB per bank, at 1.33GHz

tix8 - CPU: E5-2690 2.90GHz, motherboard: Intel S2600GZ, memory: 64GB split as 32GB per bank, at 1.60GHz

compiled with gcc 4.6.1 -O3 -mtune=native -march=native

amk@tix2:~/amir/memtest$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 21800971556 nano time 6542908630 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

21842742688 cycles # 0.000 M/sec
5869556879 instructions # 0.269 IPC
1700665337 L1-dcache-loads # 0.000 M/sec
221870903 L1-dcache-load-misses # 0.000 M/sec
1130278738 L1-dcache-stores # 0.000 M/sec
0 L1-dcache-store-misses # 0.000 M/sec

6.628680493 seconds time elapsed

amk@tix8:~/amir/memtest$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 24362574412 nano time 8424126698 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

24409499958 cycles # 0.000 M/sec
5869656821 instructions # 0.240 IPC
1192635035 L1-dcache-loads # 0.000 M/sec
94702716 L1-dcache-load-misses # 0.000 M/sec
1373779283 L1-dcache-stores # 0.000 M/sec
306775598 L1-dcache-store-misses # 0.000 M/sec

8.525456817 seconds time elapsed

What am I missing? Is Sandy Bridge slower than Westmere?

Amir.

0 Kudos
42 Replies
matt_garman
Beginner
5,084 Views
I am having similar problems with our own proprietary application. I haven't been able to isolate the problematic code to a nice tidy snippet like you have.

Just looking at your program run times, 8.5 seconds versus 6.6 seconds, that's about a 23% difference.

For the purpose of testing and eliminating variables, I suggest trying the following:
- configure the BIOS in both systems to disable any and all power-saving features (C-states, C1E, memory power saving, etc)
- enable turbo boost on both
- use the idle=poll kernel commandline param on your Sandy Bridge server (needed, as the BIOS settings alone won't keep the CPU from leaving C0 state)

In this setup, you can use the "i7z" program to see what speed all your cores are running at. At least on my systems, taking all the above steps results in all cores constantly running above their "advertised" clock speed, i.e. turbo boost is kicking in. Yes, this will make the servers run hot and use lots of power. :)

These are tunings for a low-latency environment, but I think they might be appropriate for testing/experimenting in your case. At least, if you do these things, and see the difference between Westmere and Sandy Bridge narrow, then you can attribute it to one of these tweaks. At least in my low-latency world, the aggressive power-saving features are bad for performance.

Just a random guess here, but: perhaps your application is such that, during execution, it allows the CPU to drop into some kind of a sleep state many times. There is a latency penalty for coming out of a sleep state. If you drop in and out of sleep states many times during execution, you might see a cumulative effect in increased overall runtime.
0 Kudos
amk21
Beginner
5,084 Views
All power saving is disabled and hyper-threading is disabled. i7z reports a CPU frequency of 3290.1, but the performance is even worse:

amk@tix8:~/amir/memtest$ perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 25724416756 nano time 21437013963 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

29300656750 cycles # 0.000 M/sec
5869414958 instructions # 0.200 IPC
1190853811 L1-dcache-loads # 0.000 M/sec
94650151 L1-dcache-load-misses # 0.000 M/sec
1379446403 L1-dcache-stores # 0.000 M/sec
306750238 L1-dcache-store-misses # 0.000 M/sec

8.990783606 seconds time elapsed

The results below are without turbo boost:

amk@tix8:~/amir/memtest$ perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 24314509968 nano time 8404600749 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

24360101110 cycles # 0.000 M/sec
5869421474 instructions # 0.241 IPC
1191790678 L1-dcache-loads # 0.000 M/sec
94483286 L1-dcache-load-misses # 0.000 M/sec
1374772009 L1-dcache-stores # 0.000 M/sec
306899965 L1-dcache-store-misses # 0.000 M/sec

8.506839690 seconds time elapsed
0 Kudos
TimP
Honored Contributor III
5,084 Views
The web sites appear to confirm that those motherboards are the usual full featured ones, with 8 channels on Sandy Bridge and 6 on Westmere. My E5-2670 has 1 stick in each channel. I do see lower performance than the 5680 on operations where performance is proportional to clock speed and doesn't need the superior memory system. I suppose gcc 4.6 doesn't use nontemporal stores directly, and I guess you have excluded use of simd instructions.
0 Kudos
amk21
Beginner
5,084 Views
I don't fully understand your answer. First of all, we are using the E5-2690; is the statement "E5-2670 has 1 stick in each channel" relevant to this CPU?

What I understand from your answer ("I do see lower performance than the 5680 on operations where performance") is that Sandy Bridge (E5-2690) is slower than Westmere (5680) on the pseudocode I posted previously (I can supply the code for this test), and that there is nothing I can do to solve this issue (change compiler, change compile flags, change BIOS settings, ...).
0 Kudos
Patrick_F_Intel1
Employee
5,084 Views
Hello amk21,

For the pseudo-code where you pick an index into the array... I assume that random() returns something in the range of VECTOR_SIZE.

The test that you've generated is sort of a memory latency test. I say 'sort of' because the usual latency test uses a linked list of dependent addresses (so that only one load is outstanding at a time). Doing a random list can generate more than one load outstanding at a time.

Do you know if the prefetchers are disabled in the BIOS? If one system has the prefetchers enabled and another system has them disabled, things can get confusing.

Do you have 2 processors on the system or just 1 chip? If you have more than 1 chip, do you know if NUMA is enabled on both systems?

For latency tests, it is better to have the prefetchers disabled (just to make things simpler). If both systems are configured optimally, I would expect the sandybridge-based system (tix8) to have lower latency than the westmere-based system (tix2). Optimally means 1 DIMM per slot and numa enabled (if there is more than 1 processor).

Are you running on Windows? If so, the cpu-z folks have a memory latency tool that you could run to see if their tool gets similar results to what you are seeing. Try running the latency.exe in http://www.cpuid.com/medias/files/softwares/misc/latency.zip and, if you could, send the output.

On linux you can use lmbench to get latency... see http://sourceforge.net/projects/lmbench/ But I'm not too familiar with lmbench so I can't help too much with running instructions.

Running these industry standard benchmarks will give us more information on the relative performance of the systems.

Pat
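
The dependent-load idea described above can be sketched in C as follows. This is an illustration under stated assumptions, not the code of lmbench or any Intel tool: `build_chain` and `chase` are hypothetical names, the chain is built with Sattolo's algorithm so that it forms one cycle through the whole array, and the index generator is a simple xorshift whose modulo bias is ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Fills next[] so that following next[i] from any start visits every slot
 * exactly once before returning to the start (Sattolo's algorithm builds a
 * single cycle). `seed` drives a small xorshift generator. */
static void build_chain(size_t *next, size_t n, uint64_t seed)
{
    uint64_t x = seed ? seed : 1;
    size_t i;
    if (n == 0)
        return;
    for (i = 0; i < n; i++)
        next[i] = i;
    for (i = n - 1; i > 0; i--) {
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        size_t j = (size_t)(x % i);   /* j is in [0, i-1], never i itself */
        size_t tmp = next[i];
        next[i] = next[j];
        next[j] = tmp;
    }
}

/* Follows the chain for `steps` hops. The address of each load depends on
 * the value returned by the previous load, so only one miss can be
 * outstanding at a time. */
static size_t chase(const size_t *next, size_t start, size_t steps)
{
    size_t p = start;
    while (steps--)
        p = next[p];
    return p;
}
```

Timing a long call to chase over an array much larger than the last-level cache and dividing by the number of hops gives an estimate of the per-load latency. The result of chase should be used (printed, for example) so the compiler cannot drop the loop. This sketch packs eight indices into each 64-byte cache line; a more careful test would pad each element to a full line.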
0 Kudos
amk21
Beginner
5,084 Views
I have 2 processors on the system and NUMA is enabled. I'll verify the tix2 BIOS settings and run lmbench, but I need the simple memtest because it simulates my application: I have a very large map that is actually a 2-dimensional vector, and I found that finding the right line in the map is the most costly operation.

"For latency tests, it is better to have the prefetchers disabled" - which BIOS setting did you have in mind?

Run 1: data prefetcher disabled

Numa optimized - Enabled
MLC streamer - Enabled
MLC spatial prefetcher - Enabled
DCU Data prefetcher - Disabled
DCU instruction prefetcher - Enabled

amk@tix8:~/amir/memtest$ perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 24457792172 nano time 8457051235 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

24504834353 cycles # 0.000 M/sec
5869424898 instructions # 0.240 IPC
1193553992 L1-dcache-loads # 0.000 M/sec
94548506 L1-dcache-load-misses # 0.000 M/sec
1370182667 L1-dcache-stores # 0.000 M/sec
306627891 L1-dcache-store-misses # 0.000 M/sec

8.559050619 seconds time elapsed

Run 2: data prefetcher and NUMA optimized both disabled

Numa optimized - Disabled
MLC streamer - Enabled
MLC spatial prefetcher - Enabled
DCU Data prefetcher - Disabled
DCU instruction prefetcher - Enabled

amk@tix8:~/amir/memtest$ perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 33150418300 nano time 11462800242 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

33191154216 cycles # 0.000 M/sec
5869420947 instructions # 0.177 IPC
1190593871 L1-dcache-loads # 0.000 M/sec
94498148 L1-dcache-load-misses # 0.000 M/sec
1382188152 L1-dcache-stores # 0.000 M/sec
306662218 L1-dcache-store-misses # 0.000 M/sec

11.568955857 seconds time elapsed

Run 3: NUMA optimized disabled

Numa optimized - Disabled
MLC streamer - Enabled
MLC spatial prefetcher - Enabled
DCU Data prefetcher - Enabled
DCU instruction prefetcher - Enabled

amk@tix8:~/amir/memtest$ perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 33150283136 nano time 11462753504 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

33190933585 cycles # 0.000 M/sec
5869420768 instructions # 0.177 IPC
1190685322 L1-dcache-loads # 0.000 M/sec
94769556 L1-dcache-load-misses # 0.000 M/sec
1382058359 L1-dcache-stores # 0.000 M/sec
306649458 L1-dcache-store-misses # 0.000 M/sec

11.569743183 seconds time elapsed
0 Kudos
Roman_D_Intel
Employee
5,084 Views
Hi,

Regarding the latency measurement using lmbench: build the utility called "lat_mem_rd" in the package. Then run

numactl --cpunodebind=0 --membind=1 ./lat_mem_rd -t 1024

to measure the memory access latency between NUMA node 0 and 1. The latency test increases the working set and converges towards the end to the memory latency. Run

numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 1024

to measure the local memory latency on NUMA node 0.

--
Roman
0 Kudos
amk21
Beginner
5,084 Views
Results for: numactl --cpunodebind=0 --membind=1 ./lat_mem_rd -t 1024

tix8 BIOS settings:
Numa optimized - Enabled
MLC streamer - Enabled
MLC spatial prefetcher - Enabled
DCU Data prefetcher - Enabled
DCU instruction prefetcher - Enabled

Both machines ran the command from ~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu and reported "stride=64". Columns: array size in MB, then latency in ns on each machine.

size (MB)     tix8 (ns)   tix2 (ns)
0.00049       1.383       1.200
0.00098       1.383       1.200
0.00195       1.383       1.200
0.00293       1.383       1.200
0.00391       1.383       1.200
0.00586       1.383       1.200
0.00781       1.383       1.200
0.01172       1.383       1.200
0.01562       1.383       1.200
0.02344       1.383       1.200
0.03125       1.383       1.200
0.04688       4.149       3.000
0.06250       4.149       3.000
0.09375       4.149       3.000
0.12500       4.982       3.005
0.18750       5.461       3.290
0.25000       5.746       4.042
0.37500       15.573      16.192
0.50000       15.997      16.536
0.75000       16.331      16.592
1.00000       16.418      16.844
1.50000       18.140      18.993
2.00000       20.505      20.285
3.00000       23.942      23.431
4.00000       25.148      24.892
6.00000       26.235      25.694
8.00000       26.562      26.324
12.00000      28.049      53.074
16.00000      30.442      108.794
24.00000      101.998     121.599
32.00000      129.382     124.198
48.00000      139.500     124.514
64.00000      139.948     125.408
96.00000      141.216     125.025
128.00000     141.265     124.773
192.00000     140.899     124.447
256.00000     140.582     124.205
384.00000     140.045     123.776
512.00000     139.745     123.546
768.00000     139.379     123.323
1024.00000    139.220     123.189
0 Kudos
amk21
Beginner
5,084 Views
Results for: numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 1024

tix8 BIOS settings:
Numa optimized - Enabled
MLC streamer - Enabled
MLC spatial prefetcher - Enabled
DCU Data prefetcher - Enabled
DCU instruction prefetcher - Enabled

Both machines ran the command from ~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu and reported "stride=64". Columns: array size in MB, then latency in ns on each machine.

size (MB)     tix8 (ns)   tix2 (ns)
0.00049       1.383       1.200
0.00098       1.383       1.200
0.00195       1.383       1.200
0.00293       1.383       1.200
0.00391       1.383       1.200
0.00586       1.383       1.200
0.00781       1.383       1.200
0.01172       1.383       1.200
0.01562       1.383       1.200
0.02344       1.383       1.200
0.03125       1.383       1.200
0.04688       4.149       3.000
0.06250       4.149       3.000
0.09375       4.147       3.000
0.12500       4.149       3.001
0.18750       4.905       3.726
0.25000       5.253       4.169
0.37500       15.746      16.219
0.50000       15.817      16.600
0.75000       16.226      16.765
1.00000       16.954      16.668
1.50000       18.774      18.637
2.00000       20.563      20.386
3.00000       23.922      23.847
4.00000       25.201      25.480
6.00000       26.089      29.075
8.00000       26.732      31.644
12.00000      28.367      53.601
16.00000      30.853      74.186
24.00000      75.662      83.683
32.00000      89.364      85.331
48.00000      94.962      86.139
64.00000      96.098      86.394
96.00000      96.829      87.177
128.00000     96.941      87.167
192.00000     96.801      87.413
256.00000     96.716      87.230
384.00000     96.468      87.255
512.00000     96.297      86.998
768.00000     96.103      87.018
1024.00000    95.989      86.786
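
One way to read the lat_mem_rd numbers above is to convert nanoseconds into core clock cycles and to express the large-array gap as a percentage. The helpers below are a small sketch for that arithmetic; the names `ns_to_cycles` and `percent_slower` are hypothetical, and using the nominal clocks from the hardware list (2.90 GHz for tix8, 3.33 GHz for tix2) is an assumption, since the actual core frequency during these runs was not reported.

```c
/* Converts a latency in nanoseconds to core clock cycles at `ghz` GHz.
 * 1 GHz is one cycle per nanosecond, so the product is already in cycles. */
static double ns_to_cycles(double latency_ns, double ghz)
{
    return latency_ns * ghz;
}

/* Relative slowdown of `a` versus `b` in percent (positive: a is slower). */
static double percent_slower(double a, double b)
{
    return (a - b) / b * 100.0;
}
```

With these inputs, 1.383 ns at 2.90 GHz and 1.200 ns at 3.33 GHz both come out at about 4 cycles, so the smallest working sets cost the same number of cycles on both machines and differ only through the clock. For the 1024 MB local-memory row, percent_slower(95.989, 86.786) is about 10.6, i.e. tix8 is roughly 10.6% slower than tix2 there.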
0 Kudos
amk21
Beginner
5,084 Views
adding graph of lat_mem_rd results
0 Kudos
Patrick_F_Intel1
Employee
5,084 Views
Hello amk21,

A coworker made a suggestion... Sandybridge-EP power management is probably putting the 2nd processor into a low power state. In this low power state the snoops will take longer since the 2nd processor is running at (probably) a low frequency.

Can you try pinning and running this 'spin loop' program on the 2nd processor when you run the latency program on the 1st processor? The spin.c program... you'll have to kill it with control-c.

#include <stdio.h>

int main(int argc, char **argv)
{
    int i = 0;
    printf("begin spin loop\n");
    while (1) { i++; }
    /* not reached: the loop runs until the process is killed */
    printf("i= %d\n", i);
    return 0;
}

In order for us to compare your latency numbers with our numbers, you'll need to disable all the prefetchers and enable numa. But I'd still like to see the impact of the spinner on your 'prefetchers on, numa on' latency.

Pat
0 Kudos
amk21
Beginner
5,084 Views
Hello Pat,

"disable all the prefetchers" - which BIOS settings are you referring to? These are the settings I found in the BIOS and their state:

Numa optimized - Enabled
MLC streamer - Enabled
MLC spatial prefetcher - Enabled
DCU Data prefetcher - Enabled
DCU instruction prefetcher - Enabled

Regarding the power-saving features (C1, C3 and C6): all of them are disabled, including turbo boost (see the results with turbo boost above).

Can you share the reference latency numbers?

We tried replacing the memory with other modules. Currently we are using 8GB x 8 (part number ACT8GHR72Q4H1600S, CL-11); we tried replacing it with the memory tix2 is using, 4GB x 8 (part number 25L3205, CL-9).

What is the lowest-latency memory type and memory setup we can use, assuming we need at least 48GB of memory?

Amir
0 Kudos
amk21
Beginner
5,084 Views
lat_mem_rd results with the spin loop running on the second CPU

BIOS settings:
Numa optimized - Enabled
MLC streamer - Enabled
MLC spatial prefetcher - Enabled
DCU Data prefetcher - Enabled
DCU instruction prefetcher - Enabled
turbo boost - disabled

Commands:
numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 1024
numactl --cpunodebind=0 --membind=1 ./lat_mem_rd -t 1024

Both runs reported "stride=64". Columns: array size in MB, then latency in ns for each --membind value.

size (MB)     membind=0 (ns)   membind=1 (ns)
0.00049       1.383            1.383
0.00098       1.383            1.383
0.00195       1.383            1.383
0.00293       1.383            1.383
0.00391       1.383            1.383
0.00586       1.383            1.383
0.00781       1.383            1.383
0.01172       1.383            1.383
0.01562       1.383            1.383
0.02344       1.383            1.383
0.03125       1.383            1.383
0.04688       4.149            4.149
0.06250       4.149            4.149
0.09375       4.149            4.654
0.12500       4.149            4.528
0.18750       4.703            4.906
0.25000       5.405            5.292
0.37500       15.608           15.496
0.50000       15.790           15.884
0.75000       16.268           16.188
1.00000       17.013           16.672
1.50000       18.484           18.687
2.00000       20.319           20.402
3.00000       23.514           23.885
4.00000       25.144           25.180
6.00000       26.056           26.213
8.00000       26.881           26.657
12.00000      28.537           28.135
16.00000      32.171           30.406
24.00000      75.093           100.014
32.00000      89.267           129.697
48.00000      94.837           139.414
64.00000      95.840           140.246
96.00000      96.386           141.176
128.00000     95.148           141.207
192.00000     95.833           140.858
256.00000     95.996           140.624
384.00000     95.885           140.085
512.00000     95.866           139.800
768.00000     95.681           139.485
1024.00000    95.692           139.227
0 Kudos
Patrick_F_Intel1
Employee
5,084 Views
Thanks amk21, When you say "with spin loop running on second cpu" do you mean that you are running the spin loop on one of the cpus on the 2nd processor? That is, you are running the spin loop with something like "numactl --cpunodebind=1 --membind=1 ./spin" ? I don't see any difference in the latency with the spin loop versus no spin loop so I'm wondering why. Pat
0 Kudos
amk21
Beginner
5,084 Views
the spin loop is running on the second package with "numactl --cpunodebind=1 --membind=1 ./spin"
0 Kudos
Patrick_F_Intel1
Employee
5,084 Views
Ok... I'm going to have to find a box and run on it myself. I'll let you know what I find. Pat
0 Kudos
amk21
Beginner
5,083 Views
What is the best memory setup and type (CAS latency) I should use for low latency if I need at least 48GB?
0 Kudos
Patrick_F_Intel1
Employee
5,010 Views
I've run on a Sandybridge-EP box now. I'm kind of confused by your results.

I used an array size of 1 GB, numa enabled, and a dependent load latency test, ran on cpu 3 of the 1st socket. The cpu speed is 2.7 GHz, the memory is hynix 1600 MHz, HMT31GR7BFR4C-P.

Here is a table of my results.

prefetcher   spin   turbo   latency (nanosec)
off          off    off     86.487
off          off    on      79.664
off          on     off     75.072
off          on     on      66.470
on           off    off     11.347
on           off    on      9.531
on           on     off     10.684
on           on     on      8.771

where prefetcher off means
MLC streamer - disabled
MLC spatial prefetcher - disabled
DCU Data prefetcher - disabled
DCU instruction prefetcher - disabled
and prefetcher on means all of the above prefetchers enabled.

'spin on' means running the spin program on cpu 2 of the other socket. 'spin off' means not running the spin program. 'turbo on' means turbo enabled, off means turbo disabled.

So... after all the explanations... I don't see how you are getting latencies of about 95 ns with the prefetchers enabled. I get about 8.8-11.3 ns. Your numbers look like prefetchers are disabled. So I'm puzzled. You are enabling/disabling prefetchers using the bios right?

Pat
0 Kudos