Hi Forum people,

This is an IA-32 architecture performance question, but I am not sure which forum suits it best, so I am posting it here.

It's about the balance between the spatial locality of loads and that of stores. I have a dilemma in my algorithm implementation: I could improve the spatial locality of stores at the expense of the spatial locality of loads, but I don't know *how much* to trade, and I have no quantitative criterion for it, e.g. one expressed in bytes.

I think that spatial locality of stores matters more for performance than spatial locality of loads, because it reduces partial evictions of the write-combining buffers (WCBs). However, I don't know where the limit is.

For example, one extreme would be ~100% consecutive stores combined with very scattered (~0% consecutive) loads.
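To make the two extremes concrete, here is a minimal C sketch (an illustration only, not the algorithm from this post). It assumes the work can be expressed as a permutation copy; the names `permute_gather`, `permute_scatter`, `perm` and `inv` are hypothetical. Both functions produce the same `dst` when `inv` is the inverse permutation of `perm`; they differ only in which side of the copy is sequential.

```c
#include <stddef.h>

/* Extreme A: ~100% consecutive stores, scattered loads.
 * dst is written front to back; src is read in permuted order. */
void permute_gather(float *dst, const float *src,
                    const size_t *perm, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[perm[i]];
}

/* Extreme B: ~100% consecutive loads, scattered stores.
 * src is read front to back; dst is written in permuted order.
 * inv must be the inverse of perm (perm[inv[i]] == i) for the
 * result to match permute_gather. */
void permute_scatter(float *dst, const float *src,
                     const size_t *inv, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[inv[i]] = src[i];
}
```

Timing both variants on the target CPU with realistic data sizes is probably the most direct way to see which side of the trade-off dominates for a given access pattern.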

Do you know where/how I can determine such balance?

As inputs to such a criterion, I know:

- the number and size of the WCBs (from the Pentium model; let's say Nwd)

- the cache parameters (L1 and L2 sizes, cache line size, etc.)

I would expect something like

optimum ratio: [load transaction size] VERSUS [store transaction size]

Thanks!

daniel.

3 Replies

A WCB eviction, full or partial, is likely to cost as much or more than reading from 2 pairs of cache lines which aren't present in L1. So, there is a clear preference for optimizing writes in simple cases. In more complicated situations, it may be hard to justify anything other than the simplest and most maintainable way of writing the code.

Message Edited by tim18 on 10-26-2004 09:21 AM

Hi Tim,

I have a question about your answer: you mention both partial and full evictions. Suppose that, for a given run of my algorithm, I get f full evictions and p partial evictions of the WCBs. Improving the spatial locality of those stores should shift evictions from partial to full and also reduce the total number of evictions. Say my improved version (in terms of spatial locality) yields f' and p', so

( f' + p' ) < (f + p)

and p' < p, f' > f (meaning the improved algorithm will have more full evictions, fewer partial evictions, and fewer total evictions). Right?

But what about the penalties of full versus partial evictions? In other words, I guess that improving spatial locality (if my assumptions above are correct) gains more than just a lower total number of evictions, since a larger share of the evictions become full ones. So how much is this kind of improvement worth? I think the 8-byte chunks produced by partial evictions will have some impact.
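One rough way to estimate f and p for a given store stream is a toy model of the buffers, sketched below in C. Everything about it is an assumption for illustration: the 64-byte buffer width, LRU replacement, and the rule that a buffer is evicted as soon as all its bytes are written. Real hardware policies differ and are not fully documented, and the names (`wcb_model`, `wcb_store`, `wcb_flush`) are hypothetical. The model counts evictions only; it says nothing about their cycle cost.

```c
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64u   /* assumed buffer width in bytes */
#define MAX_WCB    16u   /* upper bound on modelled buffers */

typedef struct {
    uint64_t line[MAX_WCB];   /* line index held by each buffer */
    uint64_t mask[MAX_WCB];   /* one bit per byte already written */
    uint64_t stamp[MAX_WCB];  /* last-use time; 0 means never used or evicted */
    int      used[MAX_WCB];
    unsigned nbuf;
    uint64_t clock;
    uint64_t full, partial;   /* eviction counters (f and p) */
} wcb_model;

void wcb_init(wcb_model *m, unsigned nbuf)
{
    memset(m, 0, sizeof *m);
    m->nbuf = nbuf == 0 ? 1 : (nbuf < MAX_WCB ? nbuf : MAX_WCB);
}

static void wcb_evict(wcb_model *m, unsigned i)
{
    if (m->mask[i] == UINT64_MAX) m->full++; else m->partial++;
    m->used[i] = 0;
    m->mask[i] = 0;
    m->stamp[i] = 0;
}

static void wcb_store_byte(wcb_model *m, uint64_t addr)
{
    uint64_t line = addr / LINE_BYTES;
    unsigned hit = m->nbuf, victim = 0;

    for (unsigned i = 0; i < m->nbuf; ++i) {
        if (m->used[i] && m->line[i] == line) { hit = i; break; }
        /* free buffers have stamp 0, so they win over any used buffer */
        if (m->stamp[i] < m->stamp[victim]) victim = i;
    }
    if (hit == m->nbuf) {              /* miss: take a free or LRU buffer */
        if (m->used[victim]) wcb_evict(m, victim);
        m->used[victim] = 1;
        m->line[victim] = line;
        hit = victim;
    }
    m->mask[hit] |= 1ull << (addr % LINE_BYTES);
    m->stamp[hit] = ++m->clock;
    if (m->mask[hit] == UINT64_MAX)    /* assumption: evict once complete */
        wcb_evict(m, hit);
}

/* Record a store of `size` bytes starting at `addr`. */
void wcb_store(wcb_model *m, uint64_t addr, unsigned size)
{
    for (unsigned b = 0; b < size; ++b)
        wcb_store_byte(m, addr + b);
}

/* Evict whatever is still buffered at the end of the run. */
void wcb_flush(wcb_model *m)
{
    for (unsigned i = 0; i < m->nbuf; ++i)
        if (m->used[i]) wcb_evict(m, i);
}
```

Feeding the model the store addresses of both versions of an algorithm gives (f, p) and (f', p') under these assumptions, which can then be compared against measured run times instead of guessed penalties.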

Thanks!

Daniel.

Among the events available in VTune, the one most often directly associated with such performance problems is cycles stalled on store buffer allocation. There is likely a high rate of partial evictions as the CPU tries to resolve these allocation stalls, so partial evictions may well be a symptom of a situation where improvement is needed. As evidence of how difficult it is to make a quantitative correlation with performance, VTune's own performance impact assessment is often exaggerated.

I hope this addresses your question.
