Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

i3-2120 FSB Speed and Memory bandwidth

k_sarnath
Beginner
All,

The URL below says that the i3-2120 has a 3.3 GHz CPU clock and a bus/core ratio of 33.

http://ark.intel.com/products/53426/Intel-Core-i3-2120-Processor-%283M-Cache-3_30-GHz%29

This means that the FSB base clock is 3.3 GHz / 33 = 100 MHz.
Since FSBs are quad-pumped, we can look at this as 100 * 4 MT/s = 400 MT/s.
If each transaction transfers 8 bytes (64 bits), this leads to 3200 MB/s, or 3.2 GB/s.

The URL above says that there are 2 memory channels.
Assuming 2 CPUs can read simultaneously (not sure how), we can say that the max bandwidth to the CPU is around 6.4 GB/s.

However, the URL specifies that the supported RAM is DDR3-1066/1333 and lists the max memory bandwidth as 1333 * 8 * 2 MB/s = 21 GB/s.
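To restate the arithmetic in one place (this tiny program just recomputes the numbers quoted above, nothing new):

[cpp]#include <cstdio>

int main()
{
    double base_clock_mhz = 3300.0 / 33.0;                 // 3.3 GHz / 33      = 100 MHz
    double fsb_mt_per_s   = base_clock_mhz * 4.0;          // quad-pumped       = 400 MT/s
    double fsb_gb_per_s   = fsb_mt_per_s * 8.0 / 1000.0;   // 8 bytes/transfer  = 3.2 GB/s
    double ddr3_gb_per_s  = 1333.0 * 8.0 * 2.0 / 1000.0;   // DDR3-1333, 8 B, 2 channels = 21.3 GB/s
    printf("FSB-style estimate     : %.1f GB/s\n", fsb_gb_per_s);
    printf("DDR3-1333 dual channel : %.1f GB/s\n", ddr3_gb_per_s);
    return 0;
}[/cpp]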

My question is: what is the point of having super-fast memory when data can be transferred to the CPU only at a much lower rate? I am confused by all these numbers. Can someone help me out?

Thanks,
Best Regards,
Sarnath

1 Solution
Patrick_F_Intel1
Employee
Hello Sarnath,
The 2nd generation Core processors (codenamed Sandy Bridge), such as the i3-2120, have integrated memory controllers and no longer use FSB technology.
On my 2.3 GHz Sandy Bridge-based system, the dual-channel integrated memory controller is able to hit 18.5 GB/s using DDR3-1333 memory (with a read memory test running on all CPUs).
So Sandy Bridge (and Nehalem) can make good use of the high-speed memory.
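For reference, a minimal sketch of the kind of all-core read test described (not the actual test referenced above; the buffer size, thread handling, and timer are illustrative; build with g++ -O3 -std=c++11 -pthread):

[cpp]#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <chrono>
#include <thread>
#include <vector>

static const size_t BYTES_PER_THREAD = 64u * 1024u * 1024u;   // well beyond the LLC

static void read_worker(const long* p, size_t n, volatile long* sink)
{
    long j = 0;
    for (size_t i = 0; i < n; i += 8)      // one 64-byte cache line per 8 longs
        j += p[i];
    *sink = j;                             // keep the loads from being optimized away
}

int main()
{
    unsigned nthreads = std::thread::hardware_concurrency();
    if (nthreads == 0) nthreads = 2;
    size_t n = BYTES_PER_THREAD / sizeof(long);

    std::vector<long*> bufs(nthreads);
    std::vector<long>  sinks(nthreads);
    for (unsigned t = 0; t < nthreads; ++t) {
        bufs[t] = static_cast<long*>(malloc(BYTES_PER_THREAD));
        memset(bufs[t], 1, BYTES_PER_THREAD);          // touch every page up front
    }

    auto t0 = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < nthreads; ++t)
        threads.emplace_back(read_worker, bufs[t], n, (volatile long*)&sinks[t]);
    for (auto& th : threads) th.join();
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    double gb   = double(BYTES_PER_THREAD) * nthreads / 1e9;
    printf("read bandwidth ~= %.2f GB/s\n", gb / secs);
    return 0;
}[/cpp]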
I hope this helps,
Pat

33 Replies
TimP
Honored Contributor III
Yes, memcpy should be using 128-bit SIMD for large enough transfers. Current compilers are capable of applying auto-vectorization to accomplish this without requiring SSE intrinsics or asm. Recent glibc memcpy with 64-bit SSE2 (including non-temporal) is reasonably competitive. 64-bit was chosen there to maximize performance on CPUs like Pentium M, Atom, Athlon, and so on.
I haven't heard of an investigation of which AVX CPUs might benefit from 256-bit moves.
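As a sketch of that comparison, here is the same aligned copy written once with 128-bit SSE2 moves and once with 256-bit AVX moves, so the two can be timed against each other on a given CPU (buffer size and alignment handling are only illustrative; the 256-bit path needs -mavx or -xAVX):

[cpp]#include <immintrin.h>
#include <cstdlib>
#include <cstring>
#include <cstdio>

// 128-bit copy: assumes 16-byte aligned src/dst and a size that is a multiple of 16 bytes.
static void copy128(void* dst, const void* src, size_t bytes)
{
    __m128i*       d = static_cast<__m128i*>(dst);
    const __m128i* s = static_cast<const __m128i*>(src);
    for (size_t i = 0; i < bytes / 16; ++i)
        _mm_store_si128(d + i, _mm_load_si128(s + i));
}

// 256-bit copy: assumes 32-byte alignment and a size that is a multiple of 32 bytes.
static void copy256(void* dst, const void* src, size_t bytes)
{
    float*       d = static_cast<float*>(dst);
    const float* s = static_cast<const float*>(src);
    for (size_t i = 0; i < bytes / sizeof(float); i += 8)
        _mm256_store_ps(d + i, _mm256_load_ps(s + i));
}

int main()
{
    const size_t bytes = 1u << 22;     // 4 MB, illustrative
    void *src, *dst;
    if (posix_memalign(&src, 32, bytes) || posix_memalign(&dst, 32, bytes))
        return 1;
    memset(src, 1, bytes);
    copy128(dst, src, bytes);          // time these two separately in a real test
    copy256(dst, src, bytes);
    printf("done\n");
    return 0;
}[/cpp]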
SergeyKostrov
Valued Contributor II
...
loops = 0;
...
while((time_end = your_timer_routine()) < 10) // spin for 10 seconds
{
loops++;
...
}
...
printf("time= %f, MB/sec = %f j=%d \n", time_end - time_beg, 1.0e-6 * bytes/(time_end-time_beg)
...

Hi Patrick,

Why do you use 'loops++' in the 'while' loop?

It takes time to increment, and the variable is not used later for analysis in 'printf'.

Best regards,
Sergey
k_sarnath
Beginner

Usually the only way you can beat the system memcpy is if you know something about how you are going to use the memcpy... like you KNOW the source and dest are 16 byte aligned and you KNOW that the size is a multiple of 16 bytes, or something like that which the compiler can't figure out at compile time.


This is correct. I do know that everything is 16-byte aligned and that the size is a multiple of 16 bytes. But the system memcpy should find this out at run time in probably 20 to 30 cycles and then choose an optimized path, no? I don't expect it to be 2x slower... Most of that 2x comes from the 'non-temporal' writes; only about 10% of the performance comes from prefetching. I use 4 MB for the transfers. We work in image processing.


Also, for a general memcpy, the most common sizes are usually less than 256 bytes, and generally less than 64 bytes. At least, that was the case when I profiled memcpy usages a decade ago. These short cases are harder to optimize.


Thanks for sharing your experience. Maybe REP MOVS... works well for those sizes. But my question remains: how do I force the compiler to use "fast_memcpy"? I will try adding "__declspec(align(16))" and see if that helps. I understand that the declspec directive helps in aligning the start address of data items, but how do I qualify a pointer as *pointing* to a 16-byte-aligned data structure? I will muck around with this for some time today and post if I find something interesting...
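(For reference, the usual compiler-specific spellings for this look something like the sketch below. The extensions shown are real, but which one applies depends on the compiler; the array size and the copy function are just for illustration.)

[cpp]#include <cstddef>

#define N 4096

// 1) Align the definitions themselves.  On Windows the equivalent spelling is
//    __declspec(align(16)) float src[N];
float src[N] __attribute__((aligned(16)));
float dst[N] __attribute__((aligned(16)));

// 2) Tell the compiler that pointer *arguments* are 16-byte aligned, so it can
//    vectorize without a runtime alignment check.
void copy(float* __restrict d, const float* __restrict s, size_t n)
{
#if defined(__INTEL_COMPILER)
    __assume_aligned(d, 16);                 // icc / icpc
    __assume_aligned(s, 16);
#elif defined(__GNUC__)
    d = static_cast<float*>(__builtin_assume_aligned(d, 16));          // gcc 4.7+ / clang
    s = static_cast<const float*>(__builtin_assume_aligned(s, 16));
#endif
    for (size_t i = 0; i < n; ++i)
        d[i] = s[i];
}

int main()
{
    copy(dst, src, N);
    return 0;
}[/cpp]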

So, unless you have time to burn, I'd recommend making sure you have a current glibc and/or Intel compiler. Or profile your application and check whether a significant amount of time is actually being spent in memcpy.


I tried with "glibc" memcpy and I find that it comes very to my performance. I am 99.99% sure that glibc memcpy is using non-temporal writes to accelerate which is probably *not* being used by "rep movs" based intel's memcpy....I tired Encoding 4MB manually to hint the compiler about a big size during compile time.. But that does not change a thing. The only thing remaining is to use pure global variables so that compiler knows that they are 16-byte aligned.. Lets see how that one goes.

Thanks for all your time and help! This thread has been immensely useful!
Best Regards,
Sarnath

k_sarnath
Beginner


The evaluation copy of icc should behave the same as a fully licensed copy.


This is great! Thanks!

I didn't see what architecture setting you tried; I guess it must be -xSSSE3 or later, so it might be of interest to see what other choices do.

It was set to -xSSE4.2

k_sarnath
Beginner
[cpp]   for(i=0; i < BIG; i+=64)
   {
      j += array[i];
   }[/cpp]


Since "j += .." introduces a dependence chain, I was just thinking how the miroarch would be handling it. Let me just share what I think would happen. Please Correct me if I am wrong:

1. The branch predictor would predict the control flow correctly most of the time. So, branch misprediction is a non-issue for this loop.

2. Since the loop body is very small, it is possible that the LSD logic kicks in and the micro-op queue will be locked down to feed a stream of micro-ops to the renamer unit until a misprediction breaks the stream. So there are absolutely NO front-end bandwidth issues for this code.

3. If the compiler had generated an ADD REG, [MEMORY] type instruction, the micro-op queue would un-laminate it into two micro-ops; or, if the compiler had generated LD + ADD, that would also result in a similar muop sequence.

4. All the ADD muops that add to "j" will form a huge dependent chain. The renamer unit has no choice but to honour the dependency chain (so renaming cannot be done). The performance of this loop is limited by the resources that the renamer unit has to handle dependence chains.

5. As the "ADD toj"muops stack up on one another waiting for LD data to arrive, the "LD" continues to execute in the execute pipeline. The only other ADD that executes without getting stacked upis the one that calculates the address of the "LD"s. These "Address ADD" instructions probably are incrementing a register and will form a dependency chain among themselves. They will execute one after another. So, addresses toLDs are not available immediately every cycle. Due to latencies and dependence chains, it is quite possible that only 1 LDmuop executes per cycle though SNB provides for 2 LD ports (Can vTune confirm this?)

6. The DCU handles the LOADs with ease. Since there are no outstanding stores, memory disambiguation is not a factor here; LOADs just slide through the DCU. The hardware prefetcher will kick in and start prefetching cache lines ahead of the execution pipeline. This can mitigate the effect of the "less than optimal" number of LOADs executed by the OOO execution engine.

7. As the loaded data becomes available, the stacked-up dependent ADDs enter the ALU pipe one after another. There is no real pipelining here; each has to wait for the previous muop to complete even if its LOAD data is available.

Is this understanding correct?
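(One way to poke at points 4 and 5 empirically, as a sketch rather than anything from this thread: write the same reduction once with a single accumulator, i.e. one long ADD dependence chain, and once with four independent accumulators, and time both. With a cache-resident array the single chain should be the limiter; when the data streams from DRAM, memory dominates and the two should look similar. Note that -O3 may auto-vectorize both loops, so inspect the generated assembly when timing.)

[cpp]#include <cstdio>
#include <cstdlib>

#define N (64 * 1024)    // small enough to stay cache-resident; illustrative

static long sum_one_chain(const long* a)
{
    long j = 0;
    for (size_t i = 0; i < N; ++i)
        j += a[i];                        // every ADD waits on the previous one
    return j;
}

static long sum_four_chains(const long* a)
{
    long j0 = 0, j1 = 0, j2 = 0, j3 = 0;
    for (size_t i = 0; i < N; i += 4) {   // four independent ADD chains
        j0 += a[i + 0];
        j1 += a[i + 1];
        j2 += a[i + 2];
        j3 += a[i + 3];
    }
    return j0 + j1 + j2 + j3;
}

int main()
{
    long* a = static_cast<long*>(malloc(N * sizeof(long)));
    for (size_t i = 0; i < N; ++i) a[i] = 1;
    printf("%ld %ld\n", sum_one_chain(a), sum_four_chains(a));
    free(a);
    return 0;
}[/cpp]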

Well, after reading how much smartness goes into Intel's chips, I do feel it is fully justified to say "Intel inside; idiot outside" :-)

Btw, learning how to edit, quote, and paste code in this forum is more difficult than learning Intel microarchitecture. I toggle to the HTML editor for the blockquote thing; it just does not work as intended inside the normal text box.

Best Regards,
Sarnath

k_sarnath
Beginner

The only thing remaining is to use pure global variables so that compiler knows that they are 16-byte aligned..


I did and it worked! The compiler uses 'intel_fast_memcpy' while copying global arrays.....

I just benchmarked. My fast implementation is as good as the g++ libc one, and both of us are ~2x faster than Intel's implementation... But there's a catch here.
Since "memcpy"ed data are often immediately re-used (temporal nature of the data), it is possible that Intel is using "cached" writes (instead of non-temporal). That is good for small and typical use. Good!

However, this makes no sense when the copy size is greater than the LLC size. For example, if I copy 5 MB of data, libc and my implementation clock around 9.7 GB/s, while Intel's memcpy clocks around 6.594 GB/s. The theoretical peak of my system is ~10 GB/s. Maybe the ICPC guys can look into this aspect and fix it.
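(A sketch of the threshold idea: dispatch on size between an ordinary cached copy and a streaming copy. The cutoff value and the helper names below are invented for illustration; they are not what intel_fast_memcpy actually does.)

[cpp]#include <emmintrin.h>
#include <cstdlib>
#include <cstring>
#include <cstdio>

static const size_t NT_THRESHOLD = 2u * 1024u * 1024u;   // illustrative, roughly LLC-sized

static void copy_streaming(void* dst, const void* src, size_t bytes)
{
    // assumes 16-byte alignment and a size that is a multiple of 16 bytes
    __m128i*       d = static_cast<__m128i*>(dst);
    const __m128i* s = static_cast<const __m128i*>(src);
    for (size_t i = 0; i < bytes / 16; ++i)
        _mm_stream_si128(d + i, _mm_load_si128(s + i));
    _mm_sfence();
}

static void copy_with_threshold(void* dst, const void* src, size_t bytes)
{
    if (bytes < NT_THRESHOLD)
        memcpy(dst, src, bytes);          // small: keep the destination in cache
    else
        copy_streaming(dst, src, bytes);  // large: don't pollute the caches
}

int main()
{
    const size_t bytes = 5u * 1024u * 1024u;   // 5 MB, as in the example above
    void *src, *dst;
    if (posix_memalign(&src, 16, bytes) || posix_memalign(&dst, 16, bytes))
        return 1;
    memset(src, 1, bytes);
    copy_with_threshold(dst, src, bytes);
    printf("copied %zu bytes\n", bytes);
    return 0;
}[/cpp]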

Thanks!
TimP
Honored Contributor III
intel_fast_memcpy() (as well as recent glibc implementations) should have a threshold size above which it chooses non-temporal stores.
Patrick_F_Intel1
Employee
For 1), true, branch mispredict should not be a factor.

2) This is probably true, but the only way to know for sure is to look at the disassembly and at front-end events with VTune (or a similar tool). In general, if something is going to memory then memory latency is the bottleneck, not the front-end. Front-end issues usually cost up to a few cycles, or explain why you are not retiring more than 1 uop/cycle. If you are fetching from memory, a load can take hundreds of cycles.
For instance, with the prefetchers disabled, and using a load-to-use, dependent-load latency test, a memory load takes 187 cycles on my system (a sketch of that kind of test follows this list).
The prefetchers and the out-of-order execution help keep multiple loads outstanding, so the effective latency is actually much less.

3) I don't know about this one. I'm not an expert on the uarch details. They keep changing the uarch on me.
4) Probably true.
5) For the case where the loads are coming from memory, most probably less than 1 load is executed per cycle. Note that if the data comes from anywhere besides the L1D then a full cache line is moved into the L1D. If you run this simple read test with an array size that fits in L1 then you are just loading a register's worth of data per load (which messes up the bandwidth calculation).
The CPU can speculatively execute 2 loads per cycle even with the dependency, as long as it is careful to update 'j' in order.
You can use VTune to count the number of loads executed per cycle.
6) Yes, the out-of-order engine and prefetchers kick in to keep multiple loads outstanding.
7) Yes. The stores to 'j' will execute in order.
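A sketch of the kind of load-to-use (dependent-load) latency test mentioned in point 2: each load's address comes from the previous load, so the loads cannot overlap. The array size, shuffle, and timer below are illustrative, and the 187-cycle figure above was measured with the prefetchers disabled, which this sketch does not do (build with -O3 ... -lrt):

[cpp]#include <cstdio>
#include <cstdlib>
#include <time.h>

int main()
{
    const size_t N = 64u * 1024u * 1024u / sizeof(size_t);   // ~64 MB, far beyond the LLC
    size_t* next = static_cast<size_t*>(malloc(N * sizeof(size_t)));

    // Sattolo's shuffle: builds a single random cycle so the hardware
    // prefetchers see no simple access pattern.
    for (size_t i = 0; i < N; ++i) next[i] = i;
    srand(1);
    for (size_t i = N - 1; i > 0; --i) {
        size_t k = (size_t)rand() % i;
        size_t tmp = next[i]; next[i] = next[k]; next[k] = tmp;
    }

    const size_t CHASES = 10u * 1000u * 1000u;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t i = 0; i < CHASES; ++i)
        p = next[p];                      // dependent load: cannot overlap with the next one
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("p=%zu  average load-to-use latency ~= %.1f ns\n", p, ns / CHASES);
    return 0;
}[/cpp]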

I know the 'Intel inside; idiot outside' line is a jest, but I'd prefer 'Intel inside; genius outside'. The chips are tools; without the great software creators and users of the software, the tools are useless.

Pat
k_sarnath
Beginner

intel_fast_memcpy() (as well as recent glibc implementations) should have a threshold size above which it chooses non-temporal stores.

Here is the sample test that transfers 40 MB of data from source to destination. I memset both src and dest (which are global arrays) to 0 and 1 respectively before the test begins, so there is no loss of bandwidth due to page faults. Why is 40 MB not enough for "intel_fast_memcpy" to use non-temporal access? I think something is amiss here. We are using the latest eval copy from the Intel website, so I believe I have the latest version.

sarnath@SandyBridge:~/intel_forums$ cat Makefile
gcc:
g++ -O3 -o memcpy memcpy.cpp -lrt
g++ -O3 -o memread memread.cpp -lrt
icc:
icpc -O3 -xSSE4.2 -o memcpy memcpy.cpp -lrt
icpc -O3 -xSSE4.2 -o memread memread.cpp -lrt
sarnath@SandyBridge:~/intel_forums$ make gcc && ./memcpy && make icc && ./memcpy
g++ -O3 -o memcpy memcpy.cpp -lrt
g++ -O3 -o memread memread.cpp -lrt
Data = 83.89MB, time= 0.009030, MB/sec (RW Bandwidth) = 9289.441357
icpc -O3 -xSSE4.2 -o memcpy memcpy.cpp -lrt
icpc -O3 -xSSE4.2 -o memread memread.cpp -lrt
Data = 83.89MB, time= 0.012870, MB/sec (RW Bandwidth) = 6518.093703
sarnath@SandyBridge:~/intel_forums$
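(The memcpy.cpp source itself is not shown above; a minimal test of this shape might look roughly like the sketch below. The 40 MB buffers and the output format are approximations chosen to match the numbers in the transcript, not the actual file. Build with -O3 ... -lrt.)

[cpp]#include <cstdio>
#include <cstring>
#include <time.h>

#define MB    (1024 * 1024)
#define BYTES (40 * MB)

static char src[BYTES];
static char dst[BYTES];

static double now_seconds()
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main()
{
    memset(src, 1, BYTES);     // touch every page before timing (no page-fault cost)
    memset(dst, 0, BYTES);

    double t0 = now_seconds();
    memcpy(dst, src, BYTES);
    double t1 = now_seconds();

    double mb = 2.0 * BYTES / 1e6;   // read + write traffic
    printf("Data = %.2fMB, time= %f, MB/sec (RW Bandwidth) = %f\n",
           mb, t1 - t0, mb / (t1 - t0));
    return 0;
}[/cpp]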

Thanks for your time on this, Best Regards, Sarnath
k_sarnath
Beginner
Hello Pat,

Thanks for all your clarifications! It has been great to have your support in the forum.
It is a great help to developers.

Btw, we are using Ubuntu 10.04 and Intel VTune (eval) will not even install; it takes an exception and stops.
Is Ubuntu supported?

Best Regards,
Sarnath
Patrick_F_Intel1
Employee
I've reported this to the compiler team.
They are looking into it. I'll let you know what becomes of it.
Thanks,
Pat
k_sarnath
Beginner
Thanks for that!

Also, the fastMemcpy() code that I had written runs slightly faster when compiled with g++ than with icpc :-(
Shannon_C_Intel
Employee
Hi Sarnath,
Intel VTune Amplifier XE is supported on Ubuntu* 10.10, 11.04 and 11.10 currently.
Thanks,
Shannon