spatial locality: writes versus loads (IA32 Architecture question)
Hi Forum people, this is an IA32 architecture question about performance, but I am not sure which forum would suit it, so I am posting it here.
It's about the balance between the spatial locality of loads and that of stores. I have a dilemma in my algorithm implementation: I could improve the spatial locality of stores at the expense of the spatial locality of loads, but I don't know *how much* to trade, and I have no quantitative criterion for it, expressed in bytes.
I think spatial locality in stores matters more for performance than spatial locality in loads, because it reduces partial evictions of the write-combining buffers (WCBs). However, I don't know where the limit is.
For example, one extreme would be ~100% consecutive stores, but very scattered (~0% consecutive) loads.
Do you know where or how I can determine such a balance?
As input parameters for this criterion, I know:
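To make the two extremes concrete, here is a minimal C sketch. It assumes, purely for illustration, that the algorithm boils down to reordering an array; the function names and the index arrays are hypothetical and not taken from the actual implementation. The gather form keeps the stores consecutive and lets the loads scatter, while the scatter form does the opposite.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Gather form: stores are consecutive (dst[0], dst[1], ...),
 * loads follow src_index and may land anywhere in src. */
void permute_gather(uint32_t *dst, const uint32_t *src,
                    const size_t *src_index, size_t n)
{
    assert(dst != NULL && src != NULL && src_index != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[src_index[i]];
}

/* Scatter form: loads are consecutive (src[0], src[1], ...),
 * stores follow dst_index and may land anywhere in dst. */
void permute_scatter(uint32_t *dst, const uint32_t *src,
                     const size_t *dst_index, size_t n)
{
    assert(dst != NULL && src != NULL && dst_index != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[dst_index[i]] = src[i];
}
```

If dst_index is the inverse permutation of src_index, both functions produce the same dst, so the choice between them is purely a question of which access stream gets the locality.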
- the number and size of the WCBs (from the Pentium model; let's say Nwd)
- the cache parameters (L1 and L2 sizes, cache line size, etc.)
I would expect something like
optimum ratio: [load transaction size] VERSUS [store transaction size]
It looks like you have spotted the main issues, those which are big enough to show up consistently above all the others.
A WCB eviction, full or partial, is likely to cost as much as, or more than, reading from 2 pairs of cache lines which aren't present in L1. So, there is a clear preference for optimizing writes in simple cases. In more complicated situations, it may be hard to justify anything other than the simplest and most maintainable way of writing the code.
I have a doubt about your answer: you mention both partial and full evictions. Suppose that, for a given run of my algorithm, I get f full evictions and p partial evictions of the WCBs. Improving the spatial locality of those stores should turn partial evictions into full ones and also reduce the total number of evictions. Say my improved version (in terms of spatial locality) results in f' and p', so
( f' + p' ) < (f + p)
and p' < p, f' > f (meaning the improved algorithm will have more full evictions, fewer partial evictions, and fewer total evictions). Right?
But what about the penalties of full versus partial evictions? In other words, I guess that improving spatial locality (if my assumptions above are correct) gains more than just fewer total evictions, since I also get a better ratio of full evictions. So how much is this second improvement worth? I think those 8-byte chunks written out by partial evictions will have some impact.
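The counts f and p above can be explored with a toy model before going to the hardware counters. The C sketch below is only that: it assumes 64-byte lines, least-recently-used replacement, and a buffer that is written out as soon as every byte in it has been stored. None of these policies is confirmed for any real Pentium model, and all names (wcb_model, wcb_store, and so on) are hypothetical. It counts evictions only and says nothing about their cost in cycles.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64u   /* assumed line size */
#define MAX_WCB    16u   /* upper bound on modeled buffers */

typedef struct {
    uint64_t line[MAX_WCB];   /* line address held by each buffer */
    uint64_t mask[MAX_WCB];   /* one bit per byte written; 0 = buffer free */
    uint64_t stamp[MAX_WCB];  /* last-use time, for LRU replacement */
    unsigned count;           /* number of buffers (Nwd in the post) */
    uint64_t clock;
    uint64_t full_evictions;     /* f: every byte of the line was written */
    uint64_t partial_evictions;  /* p: some bytes were missing */
} wcb_model;

void wcb_init(wcb_model *m, unsigned count)
{
    assert(count >= 1 && count <= MAX_WCB);
    memset(m, 0, sizeof *m);
    m->count = count;
}

static void wcb_evict(wcb_model *m, unsigned i)
{
    if (m->mask[i] == UINT64_MAX) m->full_evictions++;
    else                          m->partial_evictions++;
    m->mask[i] = 0;
}

/* Model one store of `size` bytes that does not cross a line boundary. */
void wcb_store(wcb_model *m, uint64_t addr, unsigned size)
{
    uint64_t line = addr / LINE_BYTES;
    unsigned off  = (unsigned)(addr % LINE_BYTES);
    assert(size >= 1 && off + size <= LINE_BYTES);
    uint64_t bits = (size == 64 ? UINT64_MAX
                                : ((UINT64_C(1) << size) - 1)) << off;

    unsigned slot = m->count;               /* sentinel: no buffer found */
    for (unsigned i = 0; i < m->count; ++i)
        if (m->mask[i] != 0 && m->line[i] == line) { slot = i; break; }

    if (slot == m->count) {                 /* miss: take a free buffer, else LRU */
        slot = 0;
        for (unsigned i = 0; i < m->count; ++i) {
            if (m->mask[i] == 0) { slot = i; break; }
            if (m->stamp[i] < m->stamp[slot]) slot = i;
        }
        if (m->mask[slot] != 0) wcb_evict(m, slot);
        m->line[slot] = line;
    }
    m->mask[slot] |= bits;
    m->stamp[slot] = ++m->clock;
    if (m->mask[slot] == UINT64_MAX)        /* assumed: flush when complete */
        wcb_evict(m, slot);
}

/* Evict whatever is still buffered at the end of the run. */
void wcb_flush(wcb_model *m)
{
    for (unsigned i = 0; i < m->count; ++i)
        if (m->mask[i] != 0) wcb_evict(m, i);
}
```

Under these assumptions, with 4 buffers, 32 consecutive 8-byte stores end in 4 full and 0 partial evictions, while the same 32 stores spaced 64 bytes apart end in 0 full and 32 partial evictions. That matches the direction of the inequalities above, but the model cannot tell how the two kinds of eviction compare in cost.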
I think you have taken my point: you want to reduce the total number of WCB evictions, if it is convenient to do so. Partial evictions go with extremely poor performance when they come together with buffer allocation stalls and WCB thrashing. Among the events available in VTune, the one most often directly associated with such performance problems is the count of cycles stalled on store buffer allocation. A high rate of partial evictions is likely in that case, as the CPU tries to resolve the allocation stalls, so partial evictions may well be a symptom of a situation where improvement is needed. As evidence of how difficult it is to correlate these events quantitatively with performance, VTune's own performance impact assessment is often exaggerated. I hope this addresses your question.