Solved: i3-2120 FSB Speed and Memory bandwidth

k_sarnath · ‎01-29-2012

All,

The URL below says that the i3-2120 has a 3.3 GHz CPU clock and a bus/core ratio of 33.

http://ark.intel.com/products/53426/Intel-Core-i3-2120-Processor-%283M-Cache-3_30-GHz%29

This means that the FSB base clock is 3.3 GHz / 33 = 100 MHz.
Since FSBs are quad-pumped, we can look at it as 100 * 4 = 400 MT/s.
Say each transaction transfers 8 bytes (64 bits); this leads to 3200 MB/s, or 3.2 GB/s.

The URL above says that there are 2 memory channels.
Assuming 2 CPUs can simultaneously read (not sure how), we can say that the max bandwidth to CPU is around 6.4GB/s

However, the URL specifies that the RAM used is DDR3-1066/1333 and specifies the max-memory bandwidth as "1333*8*2MB/s" = 21GB/s
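
As a cross-check of that last figure, here is a minimal sketch of the arithmetic in C. The function name `peak_bandwidth_mbs` is made up for illustration, and the sketch assumes each memory channel is 64 bits (8 bytes) wide and that 1 MB means 10^6 bytes. It gives a theoretical ceiling only, not what a program will actually measure.

```c
#include <assert.h>

/* Theoretical peak DRAM bandwidth in MB/s (1 MB = 10^6 bytes).
 *   mega_transfers     : millions of transfers per second (1333 for DDR3-1333)
 *   bytes_per_transfer : channel width in bytes (assumed 8 for a 64-bit channel)
 *   channels           : number of populated memory channels
 */
double peak_bandwidth_mbs(double mega_transfers, int bytes_per_transfer, int channels)
{
    assert(bytes_per_transfer > 0 && channels > 0);
    return mega_transfers * bytes_per_transfer * channels;
}
```

Calling `peak_bandwidth_mbs(1333.0, 8, 2)` returns 21328 MB/s (about 21.3 GB/s), and the same call with 1 channel returns 10664 MB/s.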

My question is what is the point in having a super-fast memory while the data to CPU can be transferred only at a much lower rate... I am so confused by all these numbers. Can someone lend me some help?

Thanks,
Best Regards,
Sarnath

Patrick_F_Intel1 · ‎01-29-2012

Hello Sarnath,
The 2nd generation core processors (formerly codenamed sandybridge) such as the i3-2120 have integrated memory controllers and don't use FSB technology anymore.
On my 2.3 GHz Sandybridge-based system, the dual channel integrated memory controller is able to hit 18.5 GB/sec using 1333 MT/s memory (with a read memory test running on all cpus).
So Sandybridge (and Nehalem) can make good use of the high speed memory.
I hope this helps,
Pat

k_sarnath · ‎01-30-2012

Hi Patrick,
Thanks for answering this over a weekend! Appreciate it very much.
This is a great piece of info.

Few more questions:
1. Can you tell me what exact purpose the FSB serves in these new chipsets?
What other clocks are derived from the FSB?
2. What does "2 memory channels" mean? Does it mean that 2 CPUs can read simultaneously?
Or does it mean that there are 2 paths every CPU can take to memory depending on the load?
3. Is the dual-port a motivation for the Sandy Bridge micro-architecture to dedicate 2 ports for memory loads?
If so, does the Nehalem micro-arch also dedicate 2 ports for loads? (I will look up the manual anyway for the last one)

Thanks a lot,
Your answer has been very helpful,

Best Regards,
Sarnath

TimP · ‎01-30-2012

You're opening up a lot of questions which might require reading references; I'll try some over-simplified replies.
1. As there is no "FSB" on Nehalem and Sandy Bridge, FSB functions of prior architectures are taken over by QPI and "un-core" et al. As far as I know, the clocks you would be interested in are derived from QPI clock. Nehalem and Sandy Bridge differ in that I believe Nehalem has more possible model-dependent ratios of CPU clock to uncore clock.
2. You have only 1 CPU, it presumably interleaves access on the 2 memory channels (at least when fetching whole cache lines from memory to last level cache).
3. If you're talking about memory channels, Nehalem server and desktop usually had 3 per CPU. Sandy Bridge will soon add servers (2, and later, 4 CPUs) with 4 channels per CPU. Evidently, such expense (both money and power consumption) is outside the possibility for low end mobile.

k_sarnath · ‎01-30-2012

Hi Tim,
Thanks for your answer. I am pleasantly surprised that Intel forum is so well attended and active. Thanks!

Coming back,

I just wrote a memcpy routine that is able to hit 7 GB/s on my Sandy Bridge system here, running on 1 CPU. I use non-temporal writes and aggressive software prefetch. This beats Intel's "ssse3_rep"-based memcpy routine, but I think that is expected; I don't think REP MOV instructions make use of non-temporal writes. I am not too happy with my performance, because going by Pat's answer, 18.5 GB/s should be reachable per CPU on the system. So I think I am at least 2.5x away from the best performance.
I think I am not using both the memory channels effectively. I want to understand what is the correct way of reading memory so that I can load both the memory controllers simultaneously. Pardon my ignorance. Thanks for all your time,

Best Regards,
Sarnath

Patrick_F_Intel1 · ‎01-30-2012

Note that my 18+ GB/s result was running on all 8 cpus (4 cores with HT enabled) of the processor.
If you run multiple instances (1 instance/cpu) of your memcpy, I'm pretty sure you will get into the 16-19 GB/s range.
I was not able to max out the memory bandwidth with just 1 cpu.

You don't have to do anything special to use both memory channels. The hardware will automatically use both memory channels.
Pat

TimP · ‎01-30-2012

Evidently, the bandwidth of a high end desktop (particularly if using DDR3-1600) with 4 channels, using multiple threads, is better than a low end laptop with 2 channels.

k_sarnath · ‎01-30-2012

Hello Patrick,

Thanks for this useful piece of info. The takeaway is: one has to read from all cpus, including the hyper-threaded cpus, to get this number.

If you don't mind, can you publish the benchmark numbers with:

1. Only 2 CPUs - no hyper-threading

2. Only 2 cores - with hyper-threading

I wonder what role hyper-threading plays here. Is it key to take advantage of the dual-memory channels?

Thanks,

Best Regards,

Sarnath

Patrick_F_Intel1 · ‎01-30-2012

So... I had all those numbers this weekend and then deleted them.

BW(GB/sec), # cpus (config)
15.5, 1 (use 1st thread on core 0)
18.9, 2 (1 thread on core 0 and 2nd thread on core 1)
19.2, 4 (1 thread on each core)

16.8, 2 (both threads on core 0)
18.8, 4 (use both threads on core 0 and both threads on core 1)

You see that you can hit close to the max bw with just 2 CPUs.
Hyper-threading plays no role in utilizing the dual channel memory.
Just using the 2 HT threads on the 1st core probably can't quite keep the memory busy enough.

Now some comments about my memory bw benchmark, parameters, measuring bw, etc.
The memory test I used is a 'read' memory bw test. All this benchmark does is read the memory.
I use this test as a sanity check of the memory system.
It is also easy to count the bw, since the bw is just '# of bytes read/elapsed_time'.
Also, each thread reads its own 40MB array.

Computing bandwidth for a memcpy is harder. The actual amount of memory moved may be 2 or 3 times the size of the dest array size.
For a standard memcpy (no non-temporal stores), the total memory moved is 3 times the dest array size. You do an RFO (read for ownership) of the dest, a read of the source, and (eventually) a writeback of the dest.
For the non-temporal store memcpy, (if everything is done correctly) you do a read of the source and a store of the dest. So the total memory moved is 2x the dest.
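
The 3x and 2x accounting above can be written down as a small sketch in C. The function names are hypothetical, and the sketch assumes the arrays are much larger than the caches, so that every cache line really does travel to or from memory.

```c
#include <assert.h>

/* Bytes moved over the memory bus when copying n bytes.
 *   nontemporal == 0: read of source + RFO of dest + writeback of dest = 3n
 *   nontemporal != 0: read of source + streaming store of dest          = 2n
 */
double memcpy_bus_bytes(double n, int nontemporal)
{
    assert(n >= 0.0);
    return n * (nontemporal ? 2.0 : 3.0);
}

/* Convert a reported copy rate (dest bytes per second) into the memory
 * traffic per second that the copy actually generates. */
double memcpy_bus_bandwidth(double copy_rate, int nontemporal)
{
    return memcpy_bus_bytes(copy_rate, nontemporal);
}
```

By this accounting, a non-temporal copy reported at 7 GB/s corresponds to about 14 GB/s of memory traffic, which is the number to compare against a read-only bandwidth test.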

Usually I like to check the bandwidth that people quote for memcpy with VTune Amplifier counters.

Pat

k_sarnath · ‎01-30-2012

Hello Patrick,

Thanks for this great deal of info. The "memcpy" 3x thing is news to me. I have been scratching my head for some time on this. I get it completely now. Thanks!

The numbers you get are pretty impressive compared to the 7GBps that I am getting for my version of memcpy. I was just using prefetch to accelerate the loads (keep the memory busy - the manual says 64 muops can be in flight in DCU - I just prefetch around 16 cache-lines every 4th iteration and then copy 4 cache-lines per iteration using non-temporal stores).
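
For readers who want something concrete, the following is a minimal sketch of a copy loop built from software prefetch plus non-temporal stores, using SSE2 intrinsics. It is not the routine described above: the name `nt_memcpy` and the prefetch distance `PREFETCH_AHEAD` are made up for illustration, and it assumes an x86 CPU with SSE2, 16-byte aligned non-overlapping buffers, and a size that is a multiple of 64 bytes.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_stream_si128 */
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_prefetch, _mm_sfence */

#define CACHE_LINE 64
/* Hypothetical prefetch distance (16 cache lines); tune it by measurement. */
#define PREFETCH_AHEAD (16 * CACHE_LINE)

/* Copy n bytes using streaming (non-temporal) stores.
 * Assumptions, checked by the asserts: dst and src are 16-byte aligned
 * and n is a multiple of 64. The buffers must not overlap. */
void nt_memcpy(void *dst, const void *src, size_t n)
{
    const char *s = (const char *)src;
    char *d = (char *)dst;
    size_t off;

    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
    assert(n % CACHE_LINE == 0);

    for (off = 0; off < n; off += CACHE_LINE) {
        /* Ask for a source line we will need later, staying inside the buffer. */
        if (off + PREFETCH_AHEAD < n)
            _mm_prefetch(s + off + PREFETCH_AHEAD, _MM_HINT_NTA);

        /* Load one 64-byte cache line as four 16-byte vectors. */
        __m128i a = _mm_load_si128((const __m128i *)(s + off));
        __m128i b = _mm_load_si128((const __m128i *)(s + off + 16));
        __m128i c = _mm_load_si128((const __m128i *)(s + off + 32));
        __m128i e = _mm_load_si128((const __m128i *)(s + off + 48));

        /* Streaming stores write the dest without first reading it (no RFO). */
        _mm_stream_si128((__m128i *)(d + off), a);
        _mm_stream_si128((__m128i *)(d + off + 16), b);
        _mm_stream_si128((__m128i *)(d + off + 32), c);
        _mm_stream_si128((__m128i *)(d + off + 48), e);
    }
    _mm_sfence();  /* make the streaming stores visible before returning */
}
```

Whether the prefetch helps at all, and at what distance, depends on the machine; the hardware prefetchers may already cover a simple sequential stream like this one.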

I want to profile this with Intel's fast memcpy. But my intel compiler (trial version) is using only "ssse3 rep movs" version of memcpy. How do I make the compiler use fast memcpy? I see that symbol defined in libirc.a. Not sure how to use it.

Meanwhile, I'd appreciate any ideas to improve memcpy.

Intel arch is intriguing and I think it is going to take some time to master it.
I am sure I can do this with all the support you guys are giving,
Thanks a lot for your time on this!

Best Regards,
Sarnath

Patrick_F_Intel1 · ‎01-31-2012

Hey Sarnath,
The first thing I'd recommend is just making a simple loop, without prefetching, that just reads the memory, and seeing what sort of performance you get.
I like to start simple and then get more complicated.
This will let you test your framework to make sure you are timing correctly, counting correctly, etc.
You can use a simple loop like:

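#include <stdio.h>
#include <string.h>
#include <time.h>

#define BIG (40*1024*1024)
static char array[BIG];

/* Stand-in timer using POSIX clock_gettime; replace with your timer routine. Returns seconds. */
static double your_timer_routine(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1.0e-9 * ts.tv_nsec;
}

int main(void)
{
    int i, loops = 0;
    long long j = 0;
    double time_beg, time_end, bytes = 0;

    memset(array, 1, sizeof(array));
    time_beg = your_timer_routine();
    while ((time_end = your_timer_routine()) - time_beg < 10) /* spin for 10 seconds */
    {
        loops++;
        bytes += BIG;
        for (i = 0; i < BIG; i += 64) /* read one byte from each 64-byte cache line */
        {
            j += array[i];
        }
    }

    /* print results. Add print of j just to make sure compiler doesn't optimize everything away. */
    printf("time= %f, MB/sec = %f j=%lld \n", time_end - time_beg, 1.0e-6 * bytes/(time_end-time_beg), j);
    return 0;
}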

If you only have 1 DIMM in your system, then I believe you will only exercise 1 channel.
How many DIMMs do you have?
Pat

SergeyKostrov · ‎01-31-2012

Quoting k_sarnath

...
I want to profile this with Intel's fast memcpy. But my intel compiler (trial version) is using only "ssse3 rep movs" version of memcpy. How do I make the compiler use fast memcpy? I see that symbol defined in libirc.a. Not sure how to use it.
...

Hi everybody,

I'd like to get more information for Intel's fast 'memcpy' ( SSE based ):

- Is it available on Windows platforms?

- Is there source code for the function, and if yes, how could I download it? If no, please provide details on
which Intel library has that function.

Best regards,
Sergey

k_sarnath · ‎01-31-2012

Hello Pat,

Thanks for the code and your time! I understand that you are going the extra mile by typing out the code. Appreciate it very much!

I hear from my colleague that the machine has only one 4GB RAM thing inserted in the slot, although the machine has 2 slots (I hope that is what DIMM means). So that means I am limited to 10GBps :-(
Btw, that means the RAM controller interleaves memory addresses among the DIMMs, and that threshold size must be quite small (64 bytes or 128 bytes). Any idea what this size could be?

I will check out your code (which looks cool) and post the results,
Thanks,
Best Regards,
Sarnath

k_sarnath · ‎01-31-2012

Hello Pat,
The code that you gave me reaches 10GBps on my machine.....

sarnath@SandyBridge:~/intel_forums$ ./a.out
time= 0.406786, MB/sec = 10310.828824 j=65536000

That possibly confirms the fact that my system has only 1 DIMM.
btw, This is a "great" learning for me -- that DIMMs are tied to the channels. Thanks!

Best Regards,
Sarnath

TimP · ‎01-31-2012

This article gives a summary of some of the memcpy issues as of several years ago.
fast_memcpy from the Intel compiler library is set up for automatic substitution by the Intel compilers (all versions). Both explicit memcpy() and several versions of source code loops will be converted.
If you wish to see source code for linux, various versions of glibc (2.6 and later) may be your best choice. Several of those were designed to achieve nearly full performance on CPUs of more than one brand which were in service at the time of writing.

k_sarnath · ‎01-31-2012

Hello Tim,

Thanks for the link. I just stumbled on it last weekend when I was browsing around for memcpy thingies...I did not read it fully though. I will check out..

Intel uses "ssse3_rep_movsb" stuff as replacement for my invocation of memcpy. I will check that again with "const restrict" pointers and see if that changes something....

Patrick_F_Intel1 · ‎01-31-2012

This performance sounds reasonable considering you only have 1 DIMM (so you are only using 1 of the 2 channels).
You'll need to have both slots populated to get the dual channel performance.
Maybe you can trade the one 4GB DIMM for two 2GB DIMMs (or just add another 4GB DIMM).
Pat

k_sarnath · ‎01-31-2012

Hello Pat,

Thanks for all your help! The DIMM thing is definitely a new learning for me! I am glad I asked around...

Moreover,
I need to find why ICPC is replacing my call to "memcpy" with calls to "__intel_ssse3_rep_memcpy" instead of "intel_fast_memcpy". I just tried casting the src pointer as "const void * restrict". But that does not change a thing... Any help?
The rep_memcpy is almost 2x slower than what I wrote. I was expecting Intel's implementation to beat mine so that I could learn what Intel is doing differently. At least, I wanted to know what my ceiling is, but this "rep" thing is killing me. Btw, I have only an eval copy of icc (Intel Composer Studio on 64-bit Linux). Does that matter?

Thanks,
Best Regards,
Sarnath

Patrick_F_Intel1 · ‎01-31-2012

In general, if you are using a current glibc or the Intel compiler, it will be hard to beat the system memcpy performance.
I wrote a version of the memcpy & memset routines used by the Intel compiler at one point in time.
Usually the only way you can beat the system memcpy is if you know something about how you are going to use the memcpy... like you KNOW the source and dest are 16 byte aligned and you KNOW that the size is a multiple of 16 bytes, or something like that which the compiler can't figure out at compile time.
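
To illustrate the kind of specialization meant here, the following is a minimal sketch in C of a copy that relies on the caller's knowledge. The name `copy_aligned16` is hypothetical, and the code assumes an x86 CPU with SSE2, non-overlapping buffers, 16-byte aligned pointers, and a size that is a multiple of 16. It is not the routine shipped in any compiler or libc.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_store_si128 */
#include <stddef.h>
#include <stdint.h>

/* Copy n bytes when the caller KNOWS that dst and src are 16-byte aligned
 * and n is a multiple of 16. A general memcpy must test for these cases at
 * run time and handle unaligned heads and tails; this version skips all that. */
void copy_aligned16(void *dst, const void *src, size_t n)
{
    const __m128i *s = (const __m128i *)src;
    __m128i *d = (__m128i *)dst;
    size_t i;

    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
    assert(n % 16 == 0);

    for (i = 0; i < n / 16; i++)
        _mm_store_si128(d + i, _mm_load_si128(s + i));  /* aligned 16-byte move */
}
```

The saving comes only from dropping the alignment and size checks, so it matters mostly for short copies; for large copies the memory system is the limit either way.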
Also, for a general memcpy, the most common sizes are usually less than 256 bytes, and generally less than 64 bytes. At least, that was the case when I profiled memcpy usages a decade ago. These short cases are harder to optimize.
So, unless you have time to burn, I'd recommend making sure you have a current glibc and/or Intel compiler.
Or profile your application and check whether a significant amount of time is actually being spent in memcpy.
Pat

TimP · ‎01-31-2012

The evaluation copy of icc should behave the same as a fully licensed copy. I didn't see what architecture setting you tried; I guess it must be -xSSSE3 or later, so it might be of interest to see what other choices do. The SNB CPU was supposed to improve the rep memcpy performance, but not so much as to make that the preferred method, except possibly for short string lengths, depending on alignment.

SergeyKostrov · ‎01-31-2012

Thank you, Tim! This thread is getting "hot". :)

When you're speaking about Intel's 'memcpy', do you mean a version that uses 128-bit Streaming SIMD registers?

Best regards,
Sergey