Xeon processors route each store through a Write Combining Buffer (WCB). On the assumption that most programs write into a small number of cache lines until those lines are full, there are 6 or 8 of these cache-line-sized buffers, depending on the model. If your program writes to a cache line that is not represented in the buffers, it stalls until one is flushed and becomes available. The hardware tries to keep 2 buffers available by initiating a write-back of the 2 least recently used buffers whenever fewer than 2 are free. Hence the recommendation to limit your program to writing 4 data streams at a time. This number can be increased to 6 on the Prescott models when HyperThreading is not in use. When HT is in use on the older models, each logical processor is limited to 2 WCBs, plus one that the hardware attempts to keep free.
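As a rough sketch of what "limiting write streams" looks like in practice (the function and array names here are mine, purely for illustration): a loop that writes six output arrays touches six distinct cache-line streams per iteration, while splitting it into two loops keeps each pass within the ~4-stream guideline above. The numerical results are identical either way.

```c
#include <stddef.h>

/* Illustrative only: two passes of three write streams each, instead of
 * one pass of six, to stay under the suggested limit of 4 concurrent
 * write streams per loop. */
void fill_six_split(float *a, float *b, float *c,
                    float *d, float *e, float *f,
                    const float *src, size_t n)
{
    /* First pass: three write streams (a, b, c). */
    for (size_t i = 0; i < n; i++) {
        a[i] = src[i] + 1.0f;
        b[i] = src[i] * 2.0f;
        c[i] = src[i] - 3.0f;
    }
    /* Second pass: three more write streams (d, e, f). src is read-only
     * and likely cache-resident after the first pass. */
    for (size_t i = 0; i < n; i++) {
        d[i] = src[i] * src[i];
        e[i] = src[i] + src[i];
        f[i] = -src[i];
    }
}
```

The trade-off is that `src` is traversed twice, but reads do not occupy WCBs, so the second traversal is usually cheap compared with the store stalls it avoids.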
If a program is properly written (or compiled), repeated writes to a single memory location might be postponed until after a loop exits, in order to save the WCBs for writes to arrays.
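That postponement is essentially scalar replacement, which a compiler will often do on its own. A minimal hand-written sketch (names are mine): accumulate in a local variable, which the compiler keeps in a register, and perform a single store after the loop, rather than storing to the memory location on every iteration.

```c
#include <stddef.h>

/* Sketch of scalar replacement: no store to *out inside the loop, so the
 * WCBs stay available for array traffic; one store after the loop exits. */
void sum_into(const float *x, size_t n, float *out)
{
    float acc = 0.0f;           /* held in a register across iterations */
    for (size_t i = 0; i < n; i++)
        acc += x[i];            /* register update, not a memory store */
    *out = acc;                 /* the single postponed store */
}
```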
The Intel 8.x compiler vectorizer attempts to split up the code where necessary to optimize WCB use. For non-vectorizable code segments in loops, the "distribute point" pragmas and directives are available for the programmer to suggest points where the loop could be split.
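For C, the pragma in question is spelled `#pragma distribute_point` in the Intel compiler documentation; placed inside a loop body, it suggests where the compiler may split (distribute) the loop. A hedged sketch, with names of my own choosing; other compilers simply ignore the unrecognized pragma, and the results are identical either way:

```c
#include <stddef.h>

/* Illustrative use of the Intel-specific distribute_point pragma: hint that
 * the loop may be split between the two stores, so each resulting loop
 * writes fewer data streams. Semantics are unchanged if it is ignored. */
void scale_and_offset(float *y1, float *y2, const float *x, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        y1[i] = 2.0f * x[i];
#pragma distribute_point        /* suggested split point for the compiler */
        y2[i] = x[i] + 1.0f;
    }
}
```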
If you write to a large number of different cache lines within a loop, it is difficult to avoid these stalls.
Thank you TCprince. So, if I understand correctly, the store buffer sits on top of the L1 cache. What is the purpose of this buffer? Is it faster than the L1 cache? I guess so. And what does "stall" mean? I guess that when a stall happens, the CPU is waiting for data to be brought from the L1 cache to the WCB; is that right?
So stalls will affect the program's running speed, but not its memory bandwidth usage (I assume the cache misses are the same whether or not stalls happen).
As I've been told, the write combining buffers work directly with L2. The data written wouldn't be needed in L1 until it is re-read, and that gets into the subject of "store forwarding." As I hinted above, much of the store buffer stall time is probably spent freeing up a new buffer, and that could involve cache misses.
In the minds of many people, store buffer stalls are considered to reduce effective memory bandwidth.