I am hitting this event quite a lot. I am still trying to understand what it is; here is the description of the event in the VTune reference guide:
Stalls of Store Buffer Resources (non-standard)
This event counts the number of allocation stalls due to lack of store buffers.
Do you know what this means? What is a store buffer, and how does it cause allocation stalls?
3 Replies
Xeon processors route each store through a Write Combining Buffer (WCB). On the assumption that most programs write into a small number of cache lines until those lines are full, there are 6 or 8 of these cache-line-sized buffers, depending on the model. If your program writes to a cache line that is not represented in the buffers, it stalls until one is flushed and becomes available. The hardware tries to keep 2 buffers free by initiating a write-back of the 2 least recently used buffers whenever fewer than 2 are available. Hence the recommendation to limit your program to writing 4 data streams at a time. This number can be increased to 6 on the Prescott models when Hyper-Threading is not in use. When HT is in use on the older models, each logical processor is limited to 2 WCBs, plus one that the hardware attempts to keep free.
If a program is properly written (or compiled), repeated writes to a single memory location might be postponed until after a loop exits, in order to save the WCBs for writes to arrays.
The Intel 8.x compiler's vectorizer attempts to split up code where necessary to optimize WCB use. For non-vectorizable code segments in loops, the "distribute point" pragmas and directives are available for the programmer to suggest points where a loop could be split.
If you write to a large number of different cache lines within a loop, it is difficult to avoid these stalls.
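The loop-splitting idea above can be sketched in C. This is only an illustration (the function names and array sizes are mine, not from the thread): the stall itself is a microarchitectural effect that is invisible in the C semantics, so the code just shows the store-stream pattern before and after loop distribution (fission). Both versions compute the same results.

```c
#include <stddef.h>

#define N 1024

/* Original loop: six output streams (a..f) compete for the small pool of
 * write-combining buffers, so stores can stall waiting for a free buffer. */
void fused(double *a, double *b, double *c,
           double *d, double *e, double *f, const double *x)
{
    for (size_t i = 0; i < N; i++) {
        a[i] = x[i] + 1.0;
        b[i] = x[i] + 2.0;
        c[i] = x[i] + 3.0;
        d[i] = x[i] + 4.0;
        e[i] = x[i] + 5.0;
        f[i] = x[i] + 6.0;
    }
}

/* Distributed version: each loop writes at most three streams, staying
 * within the recommended limit of about 4 concurrent write streams.
 * This is what a "distribute point" hint asks the compiler to do. */
void distributed(double *a, double *b, double *c,
                 double *d, double *e, double *f, const double *x)
{
    for (size_t i = 0; i < N; i++) {
        a[i] = x[i] + 1.0;
        b[i] = x[i] + 2.0;
        c[i] = x[i] + 3.0;
    }
    for (size_t i = 0; i < N; i++) {
        d[i] = x[i] + 4.0;
        e[i] = x[i] + 5.0;
        f[i] = x[i] + 6.0;
    }
}
```

The split costs an extra pass over `x`, but `x` is likely still cached on the second pass, whereas each avoided WCB stall saves a write-back to memory.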
Thank you, TCprince. So if I understand correctly, the store buffer sits on top of the L1 cache. What is the purpose of this buffer? Is it faster than the L1 cache? I guess so. And what does "stalls" mean? I guess that when a stall happens, the CPU is waiting for data to be brought from the L1 cache into a WCB; is that right?
So, "stalls" will affect the program running speed, but not on memory bandwidth usage( I assume the cache misses are the same no matter "stalls" happens or not).
So, "stalls" will affect the program running speed, but not on memory bandwidth usage( I assume the cache misses are the same no matter "stalls" happens or not).
As I've been told, write-combining buffers work directly with L2. The data written would not be needed in L1 until it is re-read, which gets into the subject of "store forwarding." As I hinted above, much of the store buffer stall time is probably spent freeing up a new buffer, and that could involve cache misses.
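The store-forwarding point can be illustrated with a small C sketch. This is my own example, not from the thread: the forwarding behavior happens inside the CPU's store buffer and is not visible in the C semantics, so the code only shows the two access patterns. A load of the same size and address as a pending store can typically be served straight from the store buffer; a narrow store followed by a wider overlapping load generally cannot be forwarded, so the load must wait for the store to reach the cache.

```c
#include <stdint.h>
#include <string.h>

/* Same-size, same-address reload: the hardware can usually forward the
 * pending store's data to the load without waiting for the cache. */
uint32_t forwarded(uint32_t *p, uint32_t v)
{
    *p = v;      /* 4-byte store */
    return *p;   /* 4-byte load of the same address: forwarding succeeds */
}

/* Narrow store followed by a wider overlapping load: on typical
 * implementations forwarding fails, and the load stalls until the
 * store has drained from the store buffer. */
uint32_t blocked(uint8_t *bytes)
{
    bytes[0] = 0x7f;       /* 1-byte store */
    uint32_t w;
    memcpy(&w, bytes, 4);  /* 4-byte load overlapping that store */
    return w;
}
```

Both functions are correct C; the difference is only in how expensively the hardware can satisfy the reload.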
Many people consider store buffer stalls as reducing effective memory bandwidth.