Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Sandy Bridge performance degradation compared to Westmere

amk21
Beginner
11,785 Views

I created a simple memtest that allocates a large vector, then repeatedly draws a random number and updates the vector element that number selects.
Pseudo code:

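// Allocate the vector, then update one randomly chosen cell per iteration.
DataCell* dataCells = new DataCell[VECTOR_SIZE];
for (int cycles = 0; cycles < gCycles; cycles++) {
    u64 randVal = random();
    // Take the address of the selected cell (the original assigned the cell itself).
    DataCell* dataCell = &dataCells[randVal % VECTOR_SIZE];
    dataCell->m_count = cycles;
    dataCell->m_random = randVal;
    dataCell->m_flag = 1;
}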
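
For readers who want to reproduce the loop, here is a minimal self-contained sketch of it. The `DataCell` layout, the `runMemtest` name, and the use of `std::mt19937_64` in place of `random()` are assumptions made for illustration; the real main.cpp was an attachment and is not shown in the thread. The 24-byte cell size is inferred from the reported "vector size 240000456" for 10000019 cells.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical cell layout: three 8-byte fields, 24 bytes in total.
struct DataCell {
    std::uint64_t m_count;
    std::uint64_t m_random;
    std::uint64_t m_flag;
};
static_assert(sizeof(DataCell) == 24, "sketch assumes a 24-byte cell");

// Performs `cycles` random updates over `cells` and returns elapsed seconds.
// The random number is generated inside the timed loop, as in the pseudo code.
double runMemtest(std::vector<DataCell>& cells, std::uint64_t cycles,
                  std::uint64_t seed = 1) {
    std::mt19937_64 rng(seed);  // stand-in for random(); an assumption
    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t c = 0; c < cycles; ++c) {
        const std::uint64_t randVal = rng();
        DataCell& cell = cells[randVal % cells.size()];
        cell.m_count = c;
        cell.m_random = randVal;
        cell.m_flag = 1;
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling `runMemtest(cells, 100000000)` on a `std::vector<DataCell>` of 10000019 elements would correspond roughly to `./memtest -v 10000019 -c 100000000`, although absolute timings will differ because the random number generator is not the same.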


I'm using the perf utility to gather performance counter data.
The most interesting results are when the vector size is larger than the cache size (tix8: 20 MB, tix2: 12 MB).

Hardware specification:

tix2 - CPU X5680 3.33GHz, motherboard - Supermicro X8DTU, memory - 64GB divided 32GB to each bank at 1.33GHz

tix8 - CPU E5-2690 2.90GHz, motherboard - Intel S2600GZ, memory - 64GB divided 32GB to each bank at 1.60GHz

Compiled with gcc 4.6.1 -O3 -mtune=native -march=native

amk@tix2:~/amir/memtest$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 21800971556 nano time 6542908630 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

21842742688 cycles # 0.000 M/sec
5869556879 instructions # 0.269 IPC
1700665337 L1-dcache-loads # 0.000 M/sec
221870903 L1-dcache-load-misses # 0.000 M/sec
1130278738 L1-dcache-stores # 0.000 M/sec
0 L1-dcache-store-misses # 0.000 M/sec

6.628680493 seconds time elapsed

amk@tix8:~/amir/memtest$ perf stat -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 24362574412 nano time 8424126698 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

24409499958 cycles # 0.000 M/sec
5869656821 instructions # 0.240 IPC
1192635035 L1-dcache-loads # 0.000 M/sec
94702716 L1-dcache-load-misses # 0.000 M/sec
1373779283 L1-dcache-stores # 0.000 M/sec
306775598 L1-dcache-store-misses # 0.000 M/sec

8.525456817 seconds time elapsed
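
The IPC column and the wall-clock gap in the two runs above can be cross-checked from the raw counters. The sketch below is only that arithmetic; the helper names are made up, and it assumes each CPU ran at its nominal frequency (3.33 GHz and 2.90 GHz), which ignores turbo and power management.

```cpp
#include <cassert>
#include <cstdint>

// Instructions per cycle, the ratio perf prints in its "IPC" column.
double ipc(std::uint64_t instructions, std::uint64_t cycles) {
    assert(cycles != 0);
    return static_cast<double>(instructions) / static_cast<double>(cycles);
}

// Seconds needed for `cycles` cycles if the core runs at a fixed `hz`.
double secondsAt(std::uint64_t cycles, double hz) {
    assert(hz > 0.0);
    return static_cast<double>(cycles) / hz;
}
```

With the counters above, `ipc(5869556879, 21842742688)` is about 0.269 for tix2 and `ipc(5869656821, 24409499958)` is about 0.240 for tix8. `secondsAt(21842742688, 3.33e9)` is about 6.56 and `secondsAt(24409499958, 2.90e9)` is about 8.42, close to the measured 6.63 and 8.53 seconds. Under these assumptions, tix8 needs about 12% more cycles for the same instruction count and also runs at a roughly 13% lower nominal clock, and the two effects compound.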

What am I missing? Is Sandy Bridge slower than Westmere?

Amir.

0 Kudos
42 Replies
amk21
Beginner
5,918 Views
Hi Pat,

I'll post results based on your instructions shortly, but I have several questions regarding the latency test:

"I used an array size of 1 GB, numa enabled, and a dependent load latency test, ran on cpu 3 of the 1st socket. The cpu speed is 2.7 GHz, the memory is hynix 1600 MHz, HMT31GR7BFR4C-P."

What CPU are you using (CPU ID, how many cores)?
What operating system are you using?
And most importantly, can you share the exact test command (are you using lat_mem_rd)?

Regards,
Amir
0 Kudos
amk21
Beginner
5,918 Views
The best results are with prefetcher on, spin on, turbo on, and they are still slower than tix2 (Westmere). The results are below.

The spin loop is running with the following command:
numactl --cpunodebind=1 --membind=1 ./spin_loop

prefetcher on, spin - off, turbo - off
amk@tix8:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 1024
"stride=64
0.00049 1.383
0.00098 1.383
0.00195 1.383
0.00293 1.383
0.00391 1.383
0.00586 1.383
0.00781 1.383
0.01172 1.383
0.01562 1.383
0.02344 1.383
0.03125 1.383
0.04688 4.149
0.06250 4.149
0.09375 4.754
0.12500 4.527
0.18750 4.654
0.25000 5.783
0.37500 15.656
0.50000 16.030
0.75000 16.210
1.00000 16.435
1.50000 18.233
2.00000 20.576
3.00000 23.694
4.00000 25.267
6.00000 26.345
8.00000 26.782
12.00000 28.119
16.00000 31.143
24.00000 74.321
32.00000 89.466
48.00000 94.858
64.00000 95.909
96.00000 96.681
128.00000 96.835
192.00000 96.770
256.00000 96.673
384.00000 96.380
512.00000 96.267
768.00000 96.091
1024.00000 95.831

prefetcher on, spin - on, turbo - off
amk@tix8:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 1024
"stride=64
0.00049 1.383
0.00098 1.383
0.00195 1.383
0.00293 1.383
0.00391 1.383
0.00586 1.383
0.00781 1.383
0.01172 1.383
0.01562 1.383
0.02344 1.383
0.03125 1.383
0.04688 4.149
0.06250 4.149
0.09375 4.147
0.12500 4.149
0.18750 5.157
0.25000 8.717
0.37500 15.632
0.50000 16.089
0.75000 16.283
1.00000 16.971
1.50000 17.534
2.00000 20.677
3.00000 23.613
4.00000 25.086
6.00000 26.135
8.00000 26.648
12.00000 28.132
16.00000 34.167
24.00000 74.176
32.00000 89.376
48.00000 94.873
64.00000 95.977
96.00000 96.707
128.00000 96.886
192.00000 96.766
256.00000 96.649
384.00000 96.413
512.00000 96.282
768.00000 96.090
1024.00000 95.963

prefetcher on, spin - on, turbo - on
amk@tix8:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 1024
"stride=64
0.00049 1.216
0.00098 1.216
0.00195 1.216
0.00293 1.216
0.00391 1.216
0.00586 1.216
0.00781 1.216
0.01172 1.216
0.01562 1.216
0.02344 1.216
0.03125 1.216
0.04688 3.647
0.06250 3.647
0.09375 3.646
0.12500 3.980
0.18750 4.359
0.25000 9.690
0.37500 13.562
0.50000 14.044
0.75000 14.167
1.00000 14.640
1.50000 16.854
2.00000 17.950
3.00000 20.417
4.00000 22.197
6.00000 23.251
8.00000 24.041
12.00000 25.474
16.00000 28.963
24.00000 69.301
32.00000 83.078
48.00000 88.359
64.00000 88.757
96.00000 90.156
128.00000 90.073
192.00000 90.063
256.00000 89.684
384.00000 89.604
512.00000 89.322
768.00000 89.100
1024.00000 88.997

prefetcher on, spin - off, turbo - on
amk@tix8:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 1024
"stride=64
0.00049 1.216
0.00098 1.215
0.00195 1.215
0.00293 1.215
0.00391 1.216
0.00586 1.215
0.00781 1.215
0.01172 1.215
0.01562 1.215
0.02344 1.216
0.03125 1.216
0.04688 3.646
0.06250 3.647
0.09375 4.092
0.12500 3.981
0.18750 4.670
0.25000 10.002
0.37500 12.216
0.50000 14.128
0.75000 14.218
1.00000 14.486
1.50000 16.220
2.00000 17.906
3.00000 20.454
4.00000 22.157
6.00000 23.333
8.00000 24.089
12.00000 25.540
16.00000 29.205
24.00000 77.929
32.00000 93.621
48.00000 99.870
64.00000 100.751
96.00000 101.882
128.00000 102.161
192.00000 102.135
256.00000 102.118
384.00000 101.911
512.00000 101.754
768.00000 101.620
1024.00000 101.544

prefetcher off, spin - on, turbo - on
amk@tix8:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 1024
"stride=64
0.00049 1.216
0.00098 1.216
0.00195 1.216
0.00293 1.216
0.00391 1.216
0.00586 1.216
0.00781 1.216
0.01172 1.216
0.01562 1.216
0.02344 1.216
0.03125 1.216
0.04688 3.647
0.06250 3.647
0.09375 3.647
0.12500 8.731
0.18750 8.086
0.25000 10.492
0.37500 12.626
0.50000 14.122
0.75000 14.221
1.00000 14.736
1.50000 16.544
2.00000 17.951
3.00000 20.269
4.00000 21.904
6.00000 23.859
8.00000 24.570
12.00000 25.762
16.00000 29.485
24.00000 69.711
32.00000 82.572
48.00000 88.484
64.00000 88.633
96.00000 90.292
128.00000 90.326
192.00000 90.139
256.00000 89.840
384.00000 89.481
512.00000 89.381
768.00000 89.130
1024.00000 89.065

prefetcher off, spin - off, turbo - on
amk@tix8:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd -t 1024
"stride=64
0.00049 1.216
0.00098 1.216
0.00195 1.216
0.00293 1.216
0.00391 1.216
0.00586 1.216
0.00781 1.216
0.01172 1.216
0.01562 1.216
0.02344 1.216
0.03125 1.216
0.04688 3.647
0.06250 3.647
0.09375 3.647
0.12500 4.112
0.18750 7.247
0.25000 8.032
0.37500 12.501
0.50000 14.032
0.75000 14.215
1.00000 14.862
1.50000 16.577
2.00000 17.771
3.00000 20.397
4.00000 21.801
6.00000 23.880
8.00000 24.416
12.00000 26.486
16.00000 34.013
24.00000 76.538
32.00000 93.146
48.00000 99.596
64.00000 100.588
96.00000 101.902
128.00000 102.158
192.00000 102.238
256.00000 102.145
384.00000 101.958
512.00000 101.832
768.00000 101.627
1024.00000 101.539
0 Kudos
amk21
Beginner
5,918 Views
One more thing: I reran my own simple benchmark (random access to a large vector) and the results are still slow, above 8 seconds; Westmere takes 6.6 seconds. The test and makefile are attached (just remove the .txt from the file names).
0 Kudos
Patrick_F_Intel1
Employee
5,918 Views
Hello Amir,

You asked:

1) What CPU are you using (CPU ID, how many cores)?
I used a pre-production Sandybridge-EP chip, cpuid.1.eax= 0x206d5 (so ext_model= 0x2d, stepping 0x5). It has 8 cores/16 threads per socket.

2) What operating system are you using?
Microsoft Windows Server 2008 R2 Enterprise

3) And most importantly, can you share the exact test command (are you using lat_mem_rd)?
I'm using my own latency utility so the command line won't correspond to lat_mem_rd. The latency results of my utility agree with the latency of the main public latency utilities (such as cpu-z latency utility). I'll see if I can get someone to install linux on the box and run lat_mem_rd directly.

But I'm 95% sure that something is wrong. Here is a short table of your results (using just the 1GB latency #):

row, prefetch, spin, turbo, latency(ns)
1, on, off, off, 95.831
2, on, on, off, 95.963
3, on, off, on, 101.544
4, off, on, on, 89.065
5, off, off, on, 101.539

The latency for the 'only difference is the state of the prefetcher' case (rows 3 and 5) shows 101.544 vs. 101.539. So prefetcher makes NO difference for a 64 byte stride? This can't be right. This IS the test I use to see whether prefetchers are enabled or disabled and, on this system, the prefetchers are disabled, always.

How are you enabling/disabling the prefetchers? Using the BIOS settings, right? After you enable all 4 prefetchers in BIOS and boot, and then reboot, does the BIOS still show the prefetchers enabled?

I'll ask someone to install linux on the box so I can run lat_mem_rd. This will take a while. But I don't expect the results and conclusions to change much.

Pat
0 Kudos
amk21
Beginner
5,918 Views
Hi Pat,

I'm updating the prefetchers using the BIOS and then rebooting the computer. The next time I enter the BIOS I can see that the settings are correct (as I set them before the reboot). I ordered the exact memory you are using; hopefully I'll have it next week. Any other ideas I should check?

Thanks,
Amir
0 Kudos
Patrick_F_Intel1
Employee
5,918 Views
Maybe see if the system vendor has a more recent BIOS. Is my cpuid model (0x2d) the same as yours? Unfortunately, on pre-production chips, they don't put the name string (like E5-2670) in the cpuid info. I'll check on getting Linux on the box.

Pat
0 Kudos
amk21
Beginner
5,918 Views
Regarding the BIOS, we already have the newest version.
I'm a bit confused about the cpuid; the model is 0xd, as you can see in the output of /proc/cpuinfo and cpuid below.

processor : 15
vendor_id : GenuineIntel
cpu family : 6
model : 45
model name : Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz
stepping : 7
cpu MHz : 1200.000
cache size : 20480 KB
physical id : 1
siblings : 8
core id : 7
cpu cores : 8
apicid : 46
initial apicid : 46
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave avx lahf_lm ida arat tpr_shadow vnmi flexpriority ept vpid
bogomips : 5786.05
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual

-----------------------------------------------------------------------------------------------------

cpuid
eax in eax ebx ecx edx
00000000 0000000d 756e6547 6c65746e 49656e69
00000001 000206d7 00200800 1fbee3ff bfebfbff
00000002 76035a01 00f0b0ff 00000000 00ca0000
00000003 00000000 00000000 00000000 00000000
00000004 00000000 00000000 00000000 00000000
00000005 00000040 00000040 00000003 00021120
00000006 00000077 00000002 00000009 00000000
00000007 00000000 00000000 00000000 00000000
00000008 00000000 00000000 00000000 00000000
00000009 00000001 00000000 00000000 00000000
0000000a 07300803 00000000 00000000 00000603
0000000b 00000000 00000000 0000005f 00000000
0000000c 00000000 00000000 00000000 00000000
0000000d 00000000 00000000 00000000 00000000
80000000 80000008 00000000 00000000 00000000
80000001 00000000 00000000 00000001 2c100800
80000002 20202020 49202020 6c65746e 20295228
80000003 6e6f6558 20295228 20555043 322d3545
80000004 20303936 20402030 30392e32 007a4847
80000005 00000000 00000000 00000000 00000000
80000006 00000000 00000000 01006040 00000000
80000007 00000000 00000000 00000000 00000100
80000008 0000302e 00000000 00000000 00000000

Vendor ID: "GenuineIntel"; CPUID level 13

Intel-specific functions:
Version 000206d7:
Type 0 - Original OEM
Family 6 - Pentium Pro
Model 13 -
Stepping 7
Reserved 8

Extended brand string: " Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz"
CLFLUSH instruction cache line size: 8
Hyper threading siblings: 32

Feature flags bfebfbff:
FPU Floating Point Unit
VME Virtual 8086 Mode Enhancements
DE Debugging Extensions
PSE Page Size Extensions
TSC Time Stamp Counter
MSR Model Specific Registers
PAE Physical Address Extension
MCE Machine Check Exception
CX8 COMPXCHG8B Instruction
APIC On-chip Advanced Programmable Interrupt Controller present and enabled
SEP Fast System Call
MTRR Memory Type Range Registers
PGE PTE Global Flag
MCA Machine Check Architecture
CMOV Conditional Move and Compare Instructions
FGPAT Page Attribute Table
PSE-36 36-bit Page Size Extension
CLFSH CFLUSH instruction
DS Debug store
ACPI Thermal Monitor and Clock Ctrl
MMX MMX instruction set
FXSR Fast FP/MMX Streaming SIMD Extensions save/restore
SSE Streaming SIMD Extensions instruction set
SSE2 SSE2 extensions
SS Self Snoop
HT Hyper Threading
TM Thermal monitor
31 reserved

TLB and cache info:
5a: unknown TLB/cache descriptor
03: Data TLB: 4KB pages, 4-way set assoc, 64 entries
76: unknown TLB/cache descriptor
ff: unknown TLB/cache descriptor
b0: unknown TLB/cache descriptor
f0: unknown TLB/cache descriptor
ca: unknown TLB/cache descriptor
Processor serial: 0002-06D7-0000-0000-0000-0000
0 Kudos
Patrick_F_Intel1
Employee
5,918 Views
The cpuid signature is cpuid.1.eax (input value=1, output register eax). In your data above, the signature is 000206d7. The model is 0xd. The extended model is 0x2d. the family is 0x6. The stepping is 0x7. So we are using the same chip but your chip is 2 steppings after my chip. You can see the explanation of model, extended model, etc in Intel CPUID app note at http://www.intel.com/content/www/us/en/processors/processor-identification-cpuid-instruction-note.html Pat
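
To make the field breakdown concrete, here is a small sketch that splits a cpuid.1.eax signature into its fields using the bit positions from Intel's CPUID documentation. The struct and function names are made up for this example. Note that what the thread calls the "extended model" 0x2d is computed here as `displayModel`, the 4-bit extended model field (0x2) combined with the 4-bit model field (0xd).

```cpp
#include <cstdint>

// Fields of the CPUID leaf 1 EAX signature, plus the combined values.
struct CpuSignature {
    unsigned stepping;       // bits 3:0
    unsigned model;          // bits 7:4
    unsigned family;         // bits 11:8
    unsigned extModel;       // bits 19:16
    unsigned extFamily;      // bits 27:20
    unsigned displayFamily;  // family, plus extFamily when family == 0xF
    unsigned displayModel;   // (extModel << 4) | model for family 6 or 0xF
};

CpuSignature decodeSignature(std::uint32_t eax) {
    CpuSignature s{};
    s.stepping = eax & 0xF;
    s.model = (eax >> 4) & 0xF;
    s.family = (eax >> 8) & 0xF;
    s.extModel = (eax >> 16) & 0xF;
    s.extFamily = (eax >> 20) & 0xFF;
    s.displayFamily = (s.family == 0xF) ? s.family + s.extFamily : s.family;
    s.displayModel = (s.family == 0x6 || s.family == 0xF)
                         ? ((s.extModel << 4) | s.model)
                         : s.model;
    return s;
}
```

For the signature 0x000206d7 this gives family 6, model field 0xd, extended model field 0x2, stepping 7, and a combined model of 0x2d (45 decimal), which matches the "model : 45" line in /proc/cpuinfo. The same call on 0x206d5 differs only in the stepping, which is 5.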
0 Kudos
amk21
Beginner
5,918 Views
could this explain the performance issues ?
0 Kudos
Patrick_F_Intel1
Employee
5,918 Views
I doubt it...
0 Kudos
Patrick_F_Intel1
Employee
5,918 Views
Can you try running the prefetcher-enabled and prefetcher-disabled cases on Sandybridge-EP again, using lat_mem_rd without the '-t' option? The '-t' option says to 'thrash' memory, so it doesn't really do (near as I can tell) sequential 64 byte strides. It would be good to run it on the westmere-based EP box too. Sorry for all the email/forum thrashing.

Pat
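
The difference between a sequential 64-byte stride and a thrashing access order can be illustrated with a dependent-load (pointer-chasing) sketch. This is not lat_mem_rd's implementation or Pat's utility; `Node`, `buildChain`, and `nsPerLoad` are hypothetical names, and the code assumes C++17 so that the 64-byte aligned nodes are allocated correctly by `std::vector`.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// One node per 64-byte cache line; `next` is the index of the node to load next.
struct alignas(64) Node {
    std::size_t next;
};

// Links all n nodes into a single cycle. With randomOrder == false the cycle
// walks memory in sequential 64-byte strides (easy for a hardware prefetcher);
// with randomOrder == true it visits the nodes in a shuffled order.
std::vector<Node> buildChain(std::size_t n, bool randomOrder, unsigned seed = 1) {
    assert(n > 0);
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    if (randomOrder) {
        std::mt19937 rng(seed);
        std::shuffle(order.begin(), order.end(), rng);
    }
    std::vector<Node> nodes(n);
    for (std::size_t i = 0; i < n; ++i) {
        nodes[order[i]].next = order[(i + 1) % n];
    }
    return nodes;
}

// Average nanoseconds per dependent load: each load address comes from the
// previous load, so the loads cannot overlap.
double nsPerLoad(const std::vector<Node>& nodes, std::size_t loads) {
    assert(!nodes.empty() && loads > 0);
    std::size_t i = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t k = 0; k < loads; ++k) {
        i = nodes[i].next;
    }
    const auto stop = std::chrono::steady_clock::now();
    volatile std::size_t sink = i;  // keep the chase from being optimized away
    (void)sink;
    return std::chrono::duration<double, std::nano>(stop - start).count() /
           static_cast<double>(loads);
}
```

With a chain much larger than the last-level cache, comparing `nsPerLoad(buildChain(n, false), loads)` against `nsPerLoad(buildChain(n, true), loads)` should show whether the prefetchers are helping: a sequential chain benefits from them, while a shuffled one largely does not.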
0 Kudos
amk21
Beginner
5,918 Views
prefetcher on, spin - off, turbo - on
amk@tix8:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd 1024
"stride=64
0.00049 1.215
0.00098 1.215
0.00195 1.215
0.00293 1.215
0.00391 1.215
0.00586 1.215
0.00781 1.215
0.01172 1.215
0.01562 1.215
0.02344 1.216
0.03125 1.216
0.04688 3.646
0.06250 3.647
0.09375 3.653
0.12500 3.663
0.18750 3.697
0.25000 3.700
0.37500 4.947
0.50000 4.944
0.75000 4.961
1.00000 4.952
1.50000 4.971
2.00000 4.967
3.00000 5.073
4.00000 5.079
6.00000 5.077
8.00000 5.073
12.00000 5.073
16.00000 5.083
24.00000 9.123
32.00000 9.333
48.00000 9.401
64.00000 9.399
96.00000 9.396
128.00000 9.398
192.00000 9.398
256.00000 9.396
384.00000 9.396
512.00000 9.398
768.00000 9.396
1024.00000 9.397

prefetcher on, spin - on, turbo - on
amk@tix8:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd 1024
"stride=64
0.00049 1.215
0.00098 1.215
0.00195 1.215
0.00293 1.215
0.00391 1.215
0.00586 1.215
0.00781 1.215
0.01172 1.215
0.01562 1.215
0.02344 1.215
0.03125 1.216
0.04688 3.647
0.06250 3.647
0.09375 3.649
0.12500 3.665
0.18750 3.695
0.25000 3.699
0.37500 4.981
0.50000 4.977
0.75000 4.976
1.00000 4.972
1.50000 4.971
2.00000 4.973
3.00000 5.091
4.00000 5.092
6.00000 5.092
8.00000 5.098
12.00000 5.095
16.00000 5.097
24.00000 8.525
32.00000 8.722
48.00000 8.791
64.00000 8.787
96.00000 8.787
128.00000 8.787
192.00000 8.790
256.00000 8.786
384.00000 8.788
512.00000 8.781
768.00000 8.795
1024.00000 8.784

prefetcher off, spin - off, turbo - on
amk@tix8:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd 1024
"stride=64
0.00049 1.215
0.00098 1.215
0.00195 1.215
0.00293 1.215
0.00391 1.215
0.00586 1.215
0.00781 1.215
0.01172 1.215
0.01562 1.215
0.02344 1.215
0.03125 1.216
0.04688 3.646
0.06250 3.646
0.09375 3.647
0.12500 3.649
0.18750 7.202
0.25000 8.710
0.37500 12.205
0.50000 12.205
0.75000 12.209
1.00000 12.208
1.50000 12.210
2.00000 12.210
3.00000 12.279
4.00000 12.279
6.00000 12.278
8.00000 12.278
12.00000 12.349
16.00000 18.538
24.00000 61.805
32.00000 77.941
48.00000 82.244
64.00000 82.315
96.00000 82.262
128.00000 82.311
192.00000 82.304
256.00000 82.307
384.00000 82.319
512.00000 82.325
768.00000 82.333
1024.00000 82.328

prefetcher off, spin - on, turbo - on
amk@tix8:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd 1024
"stride=64
0.00049 1.215
0.00098 1.215
0.00195 1.215
0.00293 1.215
0.00391 1.215
0.00586 1.215
0.00781 1.215
0.01172 1.215
0.01562 1.215
0.02344 1.215
0.03125 1.216
0.04688 3.646
0.06250 3.646
0.09375 3.646
0.12500 3.646
0.18750 7.204
0.25000 6.586
0.37500 12.211
0.50000 12.210
0.75000 12.210
1.00000 12.210
1.50000 12.209
2.00000 12.209
3.00000 12.277
4.00000 12.278
6.00000 12.277
8.00000 12.277
12.00000 12.287
16.00000 23.342
24.00000 54.411
32.00000 66.018
48.00000 69.587
64.00000 69.298
96.00000 69.551
128.00000 69.295
192.00000 69.205
256.00000 69.046
384.00000 69.004
512.00000 68.928
768.00000 68.905
1024.00000 68.874
0 Kudos
Patrick_F_Intel1
Employee
5,918 Views
Thanks Amir,

Below is a shorter version of your results of running lat_mem_rd without the -t option. These numbers are about what I got on my SNB-EP system. Can you run the same tests (lat_mem_rd without the -t option) on the westmere-based system please? Then we'll have a pretty complete set of data to investigate.

SNB-EP
prefetch, spin, turbo, latency(ns)
on, on, on, 8.784
on, off, on, 9.397
off, on, on, 68.874
off, off, on, 82.328

Thanks,
Pat
0 Kudos
amk21
Beginner
5,918 Views
prefetcher on, spin - off
amk@tix2:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd 1024
"stride=64
0.00049 1.200
0.00098 1.200
0.00195 1.200
0.00293 1.200
0.00391 1.200
0.00586 1.200
0.00781 1.200
0.01172 1.200
0.01562 1.200
0.02344 1.200
0.03125 1.200
0.04688 3.000
0.06250 3.000
0.09375 3.002
0.12500 3.014
0.18750 3.044
0.25000 3.066
0.37500 3.916
0.50000 3.916
0.75000 3.916
1.00000 3.916
1.50000 3.916
2.00000 3.917
3.00000 3.956
4.00000 3.957
6.00000 3.957
8.00000 3.957
12.00000 5.359
16.00000 7.706
24.00000 8.369
32.00000 8.544
48.00000 8.553
64.00000 8.459
96.00000 8.537
128.00000 8.518
192.00000 8.451
256.00000 8.555
384.00000 8.574
512.00000 8.512
768.00000 8.516
1024.00000 8.552

prefetcher on, spin - on
amk@tix2:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd 1024
"stride=64
0.00049 1.200
0.00098 1.200
0.00195 1.200
0.00293 1.200
0.00391 1.200
0.00586 1.200
0.00781 1.200
0.01172 1.200
0.01562 1.200
0.02344 1.200
0.03125 1.200
0.04688 3.000
0.06250 3.000
0.09375 3.000
0.12500 3.001
0.18750 3.055
0.25000 3.067
0.37500 3.916
0.50000 3.917
0.75000 3.916
1.00000 3.916
1.50000 3.916
2.00000 3.917
3.00000 3.957
4.00000 3.956
6.00000 3.957
8.00000 3.957
12.00000 5.311
16.00000 7.582
24.00000 8.327
32.00000 8.522
48.00000 8.535
64.00000 8.476
96.00000 8.526
128.00000 8.547
192.00000 8.494
256.00000 8.503
384.00000 8.521
512.00000 8.473
768.00000 8.472
1024.00000 8.482

prefetcher off, spin - off
amk@tix2:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd 1024
"stride=64
0.00049 1.200
0.00098 1.200
0.00195 1.200
0.00293 1.200
0.00391 1.200
0.00586 1.200
0.00781 1.200
0.01172 1.200
0.01562 1.200
0.02344 1.200
0.03125 1.200
0.04688 3.000
0.06250 3.000
0.09375 3.001
0.12500 3.000
0.18750 3.000
0.25000 3.004
0.37500 15.026
0.50000 15.026
0.75000 15.026
1.00000 15.026
1.50000 15.026
2.00000 15.027
3.00000 15.099
4.00000 15.100
6.00000 15.099
8.00000 15.573
12.00000 30.821
16.00000 59.533
24.00000 66.361
32.00000 67.914
48.00000 67.940
64.00000 67.909
96.00000 67.985
128.00000 67.943
192.00000 67.890
256.00000 67.845
384.00000 67.834
512.00000 67.831
768.00000 67.807
1024.00000 67.803

prefetcher off, spin - on
amk@tix2:~/amir/lmbench-3.0-a9/bin/x86_64-linux-gnu$ numactl --cpunodebind=0 --membind=0 ./lat_mem_rd 1024
"stride=64
0.00049 1.200
0.00098 1.200
0.00195 1.200
0.00293 1.200
0.00391 1.200
0.00586 1.200
0.00781 1.200
0.01172 1.200
0.01562 1.200
0.02344 1.200
0.03125 1.200
0.04688 3.001
0.06250 3.000
0.09375 3.000
0.12500 3.000
0.18750 3.000
0.25000 6.381
0.37500 15.026
0.50000 15.027
0.75000 15.027
1.00000 15.027
1.50000 15.027
2.00000 15.027
3.00000 15.100
4.00000 15.099
6.00000 15.099
8.00000 15.099
12.00000 34.131
16.00000 59.888
24.00000 66.616
32.00000 67.897
48.00000 67.942
64.00000 67.927
96.00000 67.987
128.00000 67.922
192.00000 67.890
256.00000 67.858
384.00000 67.828
512.00000 67.813
768.00000 67.811
1024.00000 67.799
0 Kudos
amk21
Beginner
5,918 Views
prefetcher off, spin - off, turbo - on
amk@tix8:~/amir/memtest$ numactl --cpunodebind=0 --membind=0 perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 25692816948 nano time 21410680790 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

29266867942 cycles # 0.000 M/sec
5869414932 instructions # 0.201 IPC
1190560945 L1-dcache-loads # 0.000 M/sec
94484644 L1-dcache-load-misses # 0.000 M/sec
1380084899 L1-dcache-stores # 0.000 M/sec
306792765 L1-dcache-store-misses # 0.000 M/sec

8.992366456 seconds time elapsed

prefetcher off, spin - on, turbo - on
amk@tix8:~/amir/memtest$ numactl --cpunodebind=0 --membind=0 perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 23101364825 nano time 19251137354 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

26324320151 cycles # 0.000 M/sec
5869414840 instructions # 0.223 IPC
1191849861 L1-dcache-loads # 0.000 M/sec
94700405 L1-dcache-load-misses # 0.000 M/sec
1374421515 L1-dcache-stores # 0.000 M/sec
306818694 L1-dcache-store-misses # 0.000 M/sec

8.084655008 seconds time elapsed

prefetcher on, spin - off, turbo - on
amk@tix8:~/amir/memtest$ numactl --cpunodebind=0 --membind=0 perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 25695167924 nano time 21412639936 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

29269690219 cycles # 0.000 M/sec
5869414932 instructions # 0.201 IPC
1190419590 L1-dcache-loads # 0.000 M/sec
94353778 L1-dcache-load-misses # 0.000 M/sec
1380351700 L1-dcache-stores # 0.000 M/sec
306799151 L1-dcache-store-misses # 0.000 M/sec

8.988294003 seconds time elapsed

prefetcher on, spin - on, turbo - on
amk@tix8:~/amir/memtest$ numactl --cpunodebind=0 --membind=0 perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 23141081595 nano time 19284234662 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

26370388712 cycles # 0.000 M/sec
5869414841 instructions # 0.223 IPC
1192518843 L1-dcache-loads # 0.000 M/sec
94714263 L1-dcache-load-misses # 0.000 M/sec
1372891292 L1-dcache-stores # 0.000 M/sec
306802914 L1-dcache-store-misses # 0.000 M/sec

8.094156100 seconds time elapsed
0 Kudos
amk21
Beginner
5,918 Views
prefetcher on, spin - off
amk@tix2:~/amir/memtest$ numactl --cpunodebind=0 --membind=0 perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 21767099764 nano time 6530783007 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

21809907154 cycles # 0.000 M/sec
5869320950 instructions # 0.269 IPC
1700577108 L1-dcache-loads # 0.000 M/sec
222105053 L1-dcache-load-misses # 0.000 M/sec
1130245449 L1-dcache-stores # 0.000 M/sec
0 L1-dcache-store-misses # 0.000 M/sec

6.617275213 seconds time elapsed

prefetcher on, spin - on
amk@tix2:~/amir/memtest$ numactl --cpunodebind=0 --membind=0 perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000
total Time (rdtsc) 21759419860 nano time 6528478805 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000':

21803939087 cycles # 0.000 M/sec
5869320950 instructions # 0.269 IPC
1700577108 L1-dcache-loads # 0.000 M/sec
221837111 L1-dcache-load-misses # 0.000 M/sec
1130245449 L1-dcache-stores # 0.000 M/sec
0 L1-dcache-store-misses # 0.000 M/sec

6.614343550 seconds time elapsed
0 Kudos
Patrick_F_Intel1
Employee
5,918 Views
Hello Amir.

Below I summarize our results so far. Looking at table 1, it seems that the latency of your wsm (85.707ns) and snb (89.065ns) systems is about the same. The frequency of your wsm system is about 1.148x higher than the snb box. In your memtest main.cpp, it seems like the 2 main components of the time are a) the random number generation and b) the loading of the random memory location. Given that your memory latencies look about equal, I wonder how much of the difference is due to the higher wsm frequency.

If you want to test this, there are 2 ways:
1) change the frequency of the cpus (see the attached how_to_change_frequency_on_linux_pub.txt file) or
2) move the 'generate the random numbers' out of the timing loop.

For 2), you can see my win_main.cpp which is a modified for windows version of your main.cpp. I put the random numbers into an array. I'm sorry that the 2 systems I used were not more similar and that they were not linux.

Pat

Amir's 2 systems:
wsm-ep tix2 - cpu X5680 3.33GHz, mother board - Supermicro X8DTU, memory - 64GB divided 32GB to each bank at 1.33GHz
snb-ep tix8 - cpu E5-2690 2.90GHz, mother board - Intel S2600GZ, memory - 64GB divided 32GB to each bank at 1.60GHz
Frequency ratio wsm/snb = 1.148x

Table 1 below: Amir running lmbench lat_mem_rd -t (random memory accesses)
system, prefetch, spin, turbo, random, latency(ns), Best snb/wsm
snb, off, on, on, on, 89.065, 1.039179997x
wsm, off, ?, ?, on, 85.707 (via private msg)

Table 2 below: Amir running his memtest microkernel
system, prefetch, spin, turbo, random, time(secs), Best snb/wsm
snb, off, on, on, on, 8.084655008
snb, off, off, on, on, 8.992366456
wsm, on, off, ?, on, 6.617275213, 1.221749851x

Pat's systems:
wsm-ep - cpu L5640 @ 2.27GHz, mother board - Intel S5500WB, memory - 12GB total divided 2GB per channel, 3 DIMMs per node at 1.33GHz
snb-ep - cpu @ 2.70GHz, cpuid signature 0x206d5, mother board - ASUSTek Z9PP-D24, memory - 64GB total divided 8GB per channel, 4 DIMMs per node at 1.60GHz
Frequency ratio snb/wsm = 1.189x

Table 3 below: Pat running a modified version of Amir's memtest (modified memtest now generates random numbers outside of timing loop)
system, prefetch, spin, turbo, random, time(secs), Best wsm/snb
snb, off, on, on, on, 6.41873
wsm, off, on, on, on, 7.02422, 1.094331745x

Table 4 below: Pat running a memory latency test with a random memory access
system, prefetch, spin, turbo, random, latency(ns), Best wsm/snb
snb, off, off, on, on, 96.714
snb, off, on, on, on, 87.844
wsm, off, off, on, on, 99.976, 1.138108465x
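
Option 2), moving the random number generation out of the timing loop, could look like the sketch below. It is not Pat's win_main.cpp, which was an attachment and is not shown here; the `DataCell` layout, the function names, and the use of `std::mt19937_64` are assumptions for illustration only.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical 24-byte cell, matching the reported vector size per element.
struct DataCell {
    std::uint64_t m_count;
    std::uint64_t m_random;
    std::uint64_t m_flag;
};

// Generates all random values up front, outside any timed region.
std::vector<std::uint64_t> makeRandoms(std::size_t count, std::uint64_t seed) {
    std::mt19937_64 rng(seed);  // stand-in for random(); an assumption
    std::vector<std::uint64_t> values(count);
    for (std::uint64_t& v : values) v = rng();
    return values;
}

// Times only the random-access updates. The loop reads `randoms`
// sequentially, so the timed cost is dominated by the scattered stores.
double timeUpdates(std::vector<DataCell>& cells,
                   const std::vector<std::uint64_t>& randoms) {
    assert(!cells.empty());
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t c = 0; c < randoms.size(); ++c) {
        DataCell& cell = cells[randoms[c] % cells.size()];
        cell.m_count = c;
        cell.m_random = randoms[c];
        cell.m_flag = 1;
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

One design point to keep in mind: with the indices precomputed, the address of each update no longer depends on finishing a random number computation first, so the hardware is freer to overlap several cache misses. The timed loop therefore measures something closer to memory throughput than the combined generator-plus-latency cost of the original loop.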
0 Kudos
amk21
Beginner
5,918 Views
Hi Pat,

I added the following data in an attached file as well, because the forum misbehaves (it deletes spaces).

I think that you found the problem. I made some more tests based on your instruction to separate the random call from the memory access.

test 0 - original test: one loop with the random call and a memory access based on the random value
test 1 - separate the random call from the memory access by running 2 loops: one calling random and storing the value in a vector, and another loop getting the random number from that vector and accessing the large vector
test 2 - only the random call
test 3 - same as test 1, plus calling random inside the second loop and storing its value

Results

test 0

amk@tix8:~/amir/memtest$ numactl --cpunodebind=0 --membind=0 perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000 -q 0
total Time (rdtsc) 25694935989 nano time 21412446657 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000 -q 0':

29269105173 cycles # 0.000 M/sec
5869651728 instructions # 0.201 IPC
1190715446 L1-dcache-loads # 0.000 M/sec
94428062 L1-dcache-load-misses # 0.000 M/sec
1380723917 L1-dcache-stores # 0.000 M/sec
306786820 L1-dcache-store-misses # 0.000 M/sec

8.986841975 seconds time elapsed

amk@tix2:~/amir/memtest$ perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000 -q 0
total Time (rdtsc) 21768192992 nano time 6531111008 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000 -q 0':

21811501362 cycles # 0.000 M/sec
5869557137 instructions # 0.269 IPC
1700665349 L1-dcache-loads # 0.000 M/sec
221906581 L1-dcache-load-misses # 0.000 M/sec
1130278735 L1-dcache-stores # 0.000 M/sec
0 L1-dcache-store-misses # 0.000 M/sec

6.616472433 seconds time elapsed

test 1

amk@tix8:~/amir/memtest$ numactl --cpunodebind=0 --membind=0 perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000 -q 1
total Time (rdtsc) 5628116479 nano time 4690097065 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000 -q 1':

9058106175 cycles # 0.000 M/sec
6269846648 instructions # 0.692 IPC
1499386470 L1-dcache-loads # 0.000 M/sec
99173796 L1-dcache-load-misses # 0.000 M/sec
1253847318 L1-dcache-stores # 0.000 M/sec
323522565 L1-dcache-store-misses # 0.000 M/sec

3.099189926 seconds time elapsed

amk@tix2:~/amir/memtest$ perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000 -q 1
total Time (rdtsc) 6830226432 nano time 2049272856 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000 -q 1':

9913974959 cycles # 0.000 M/sec
6269752348 instructions # 0.632 IPC
1700860367 L1-dcache-loads # 0.000 M/sec
235049597 L1-dcache-load-misses # 0.000 M/sec
1230473719 L1-dcache-stores # 0.000 M/sec
0 L1-dcache-store-misses # 0.000 M/sec

3.263528592 seconds time elapsed

test 2

amk@tix8:~/amir/memtest$ numactl --cpunodebind=0 --membind=0 perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000 -q 2
total Time (rdtsc) 2186068316 nano time 1821723596 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000 -q 2':

2553096540 cycles # 0.000 M/sec
4869650845 instructions # 1.907 IPC
1206799012 L1-dcache-loads # 0.000 M/sec
92179 L1-dcache-load-misses # 0.000 M/sec
830236117 L1-dcache-stores # 0.000 M/sec
6999 L1-dcache-store-misses # 0.000 M/sec

0.860435270 seconds time elapsed

amk@tix2:~/amir/memtest$ perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000 -q 2
total Time (rdtsc) 2397898132 nano time 719441383 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000 -q 2':

2462879102 cycles # 0.000 M/sec
4869556479 instructions # 1.977 IPC
1600664741 L1-dcache-loads # 0.000 M/sec
34091 L1-dcache-load-misses # 0.000 M/sec
830278129 L1-dcache-stores # 0.000 M/sec
0 L1-dcache-store-misses # 0.000 M/sec

0.805532006 seconds time elapsed

test 3

amk@tix8:~/amir/memtest$ numactl --cpunodebind=0 --membind=0 perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000 -q 3
total Time (rdtsc) 25908789550 nano time 21590657958 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000 -q 3':

32109608407 cycles # 0.000 M/sec
11066621621 instructions # 0.345 IPC
2541732011 L1-dcache-loads # 0.000 M/sec
95383828 L1-dcache-load-misses # 0.000 M/sec
2284592360 L1-dcache-stores # 0.000 M/sec
306323402 L1-dcache-store-misses # 0.000 M/sec

10.108541779 seconds time elapsed

amk@tix2:~/amir/memtest$ perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u ./memtest -v 10000019 -c 100000000 -q 3
total Time (rdtsc) 21862175024 nano time 6559308438 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000 -q 3':

24521143944 cycles # 0.000 M/sec
11066527039 instructions # 0.451 IPC
3400860832 L1-dcache-loads # 0.000 M/sec
235216594 L1-dcache-load-misses # 0.000 M/sec
2030474184 L1-dcache-stores # 0.000 M/sec
0 L1-dcache-store-misses # 0.000 M/sec

7.650797013 seconds time elapsed

I made some more runs comparing test 1 and test 3 using perf.

amk@tix8:~/amir/memtest$ numactl --cpunodebind=0 --membind=0 perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u -e l1-icache-loads:u -e l1-icache-load-misses:u ./memtest -v 10000019 -c 100000000 -q 1
total Time (rdtsc) 5632905513 nano time 4694087927 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000 -q 1':

9061230941 cycles # 0.000 M/sec
6269846648 instructions # 0.692 IPC
1518818209 L1-dcache-loads # 0.000 M/sec
99191827 L1-dcache-load-misses # 0.000 M/sec
1253671516 L1-dcache-stores # 0.000 M/sec
323370488 L1-dcache-store-misses # 0.000 M/sec
7318275 L1-icache-loads # 0.000 M/sec
7262 L1-icache-load-misses # 0.000 M/sec

3.100873645 seconds time elapsed

amk@tix8:~/amir/memtest$ numactl --cpunodebind=0 --membind=0 perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u -e l1-icache-loads:u -e l1-icache-load-misses:u ./memtest -v 10000019 -c 100000000 -q 3
total Time (rdtsc) 26078419396 nano time 21732016163 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000 -q 3':

32296048271 cycles # 0.000 M/sec
11066621630 instructions # 0.343 IPC
2534574377 L1-dcache-loads # 0.000 M/sec
95835472 L1-dcache-load-misses # 0.000 M/sec
2279501544 L1-dcache-stores # 0.000 M/sec
306153140 L1-dcache-store-misses # 0.000 M/sec
385461391 L1-icache-loads # 0.000 M/sec
12590 L1-icache-load-misses # 0.000 M/sec

10.168108997 seconds time elapsed

amk@tix2:~/amir/memtest$ perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u -e l1-icache-loads:u -e l1-icache-load-misses:u ./memtest -v 10000019 -c 100000000 -q 1
total Time (rdtsc) 6824750972 nano time 2047630054 vector size 240000456

Performance counter stats for './memtest -v 10000019 -c 100000000 -q 1':

9491721818 cycles # 0.000 M/sec (scaled from 62.32%)
6285584526 instructions # 0.662 IPC (scaled from 75.06%)
1705801008 L1-dcache-loads
# 0.000 M/sec (scaled from 75.16%) 119200389 L1-dcache-load-misses # 0.000 M/sec (scaled from 75.16%) 1232065739 L1-dcache-stores # 0.000 M/sec (scaled from 75.16%) 62005805 L1-dcache-store-misses # 0.000 M/sec (scaled from 75.16%) 2020404631 L1-icache-loads # 0.000 M/sec (scaled from 49.69%) 951758 L1-icache-load-misses # 0.000 M/sec (scaled from 49.69%) 3.139811349 seconds time elapsed amk@tix2:~/amir/memtest$ perf stat -c -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u -e L1-dcache-stores:u -e L1-dcache-store-misses:u -e l1-icache-loads:u -e l1-icache-load-misses:u ./memtest -v 10000019 -c 100000000 -q 3 total Time (rdtsc) 21801073812 nano time 6540976241 vector size 240000456 Performance counter stats for './memtest -v 10000019 -c 100000000 -q 3': 24523308462 cycles # 0.000 M/sec (scaled from 62.32%) 11103166989 instructions # 0.453 IPC (scaled from 74.90%) 3406669306 L1-dcache-loads # 0.000 M/sec (scaled from 75.03%) 117732566 L1-dcache-load-misses # 0.000 M/sec (scaled from 75.11%) 2031002083 L1-dcache-stores # 0.000 M/sec (scaled from 75.11%) 62171871 L1-dcache-store-misses # 0.000 M/sec (scaled from 75.11%) 3641162479 L1-icache-loads # 0.000 M/sec (scaled from 49.86%) 956647 L1-icache-load-misses # 0.000 M/sec (scaled from 49.79%) 7.632643782 seconds time elapsed Node Tix2 Tix8 test 1 3 3/1 1 3 3/1 cycles 9491721818 24523308462 2.58365225321862 9061230941 32296048271 3.56420098784457 instructions 6285584526 11103166989 1.76644939592687 6269846648 11066621630 1.7650545940434 L1-dcache-loads 1705801008 3406669306 1.99710827348743 1518818209 2534574377 1.66878060980634 L1-dcache-load-misses 119200389 117732566 0.987686088843217 99191827 95835472 0.966162988408309 L1-dcache-stores 1232065739 2031002083 1.6484526910459 1253671516 2279501544 1.81826061684247 L1-dcache-store-misses 62005805 62171871 1.00267823311059 323370488 306153140 0.94675658837488 L1-icache-loads 2020404631 3641162479 1.80219468077432 7318275 385461391 
52.6710722130557 L1-icache-load-misses 951758 956647 1.00513680998741 7262 12590 1.7336821812173 please lookat L1-icache-loads in sandy bridge Regards Amir
matt_garman
Beginner
5,918 Views
Patrick Fay (Intel) wrote:
1) change the frequency of the cpus (see the attached how_to_change_frequency_on_linux_pub.txt file) or
Hi Pat,

That file wasn't attached in your previous message. Could you please post it?

Thank you!
Matt
Patrick_F_Intel1
Employee
5,829 Views
Try #2 at attaching how_to_change_frequency_on_linux_pub.txt
Patrick_F_Intel1
Employee
5,829 Views
Try #2 at attaching win_main.cpp