Hi,
I am trying to leverage the write-combining buffer in x86 processors to perform some memory and I/O optimizations.
My question is:
is it possible for me to probe the write-combining buffer and know exactly how many bytes are being evicted? Are there any hacks or performance counters that can give me this information?
Thanks
Srinath
7 Replies
Hello Srinath,
I don't think there is any way to do what you want to do.
On older processors there was a partial WC buffer write event (I think), but this event doesn't exist on current processors. And the event didn't tell how many bytes were written, only the number of times a partial write occurred.
Sorry,
Pat
Thanks, Patrick, for the quick response.
Is there a way to retrieve the data addresses (virtual or physical) accessed by loads and stores? I am specifically looking for non-temporal stores. I was initially using PIN to do that and the performance was terrible. Is there a way to get that information from the hardware directly, say at the time of instruction retirement?
Thanks
Srinath
Hey
I actually have one more question. Is there any information that I can get from the write-combining buffers on modern processors?
Srinath
Only by looking it up under the current equivalent name "fill buffer," and not a great deal.
Let me see if I understand your questions correctly -- are you trying to use non-temporal stores but you are not seeing the improvement?
My original question is as follows:
Is there any meta-information (possibly through hardware counters) that I can retrieve from the write-combining buffers (WCBs)? I mmap-ed my address space in WC mode to bypass the cache.
My question was not about performance. I am trying to save some cache space by sending writes that don't need any locality directly to a storage device, bypassing the cache. This is similar to how framebuffer writes bypass the cache on their way to the graphics device.
One of the problems I am facing is that I need to provide some correctness guarantees. I need to verify that the cache lines flushed from the WCBs have all reached the destination. To do that I need some information from the processor side, such as
1) Physical addresses of the cache lines flushed from the WCBs
2) Number of bytes
3) At least the number of partial/full lines flushed
Any combination of 1), 2) or 3) is fine.
Is there a way for me to retrieve such information from the processor?
Srinath
As you probably know, non-temporal load/store instructions (movntps, movntdq, etc.) are a way to bypass the cache on read and write.
Since non-temporal stores are weakly ordered, you need to issue mfence/lfence/sfence before using the data (which one depends on what you are doing with the data; most likely sfence in your case).
Those fencing instructions are the only guarantee that the data has reached the destination before you use it.
As far as I know, the metrics that you are looking for are not available as CPU counters -- perhaps they can be observed through ITP debugging but I am not sure about that, and such hardware is extremely expensive anyway.
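To make the store-then-fence pattern described above concrete, here is a minimal sketch in C using the SSE2 intrinsics `_mm_stream_si128` (which compiles to movntdq) and `_mm_sfence`. The helper name `nt_fill` and its alignment and size requirements are assumptions made for this illustration, not something from the thread. It writes to whatever buffer it is given, so it does not itself set up a WC mapping, and it reports nothing about how many bytes the write-combining buffers evicted.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_set1_epi8, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (name and contract are assumptions for this sketch):
 * fill `len` bytes at `dst` with `value` using non-temporal stores.
 * `dst` must be 16-byte aligned and `len` a multiple of 64, so that every
 * 64-byte line is written in full by four consecutive 16-byte stores.
 */
static void nt_fill(void *dst, uint8_t value, size_t len)
{
    assert(((uintptr_t)dst % 16) == 0);
    assert((len % 64) == 0);

    __m128i v = _mm_set1_epi8((char)value);
    __m128i *p = (__m128i *)dst;
    size_t chunks = len / 16;  /* number of 16-byte stores */

    for (size_t i = 0; i < chunks; i += 4) {
        /* Four back-to-back stores cover one full 64-byte line. */
        _mm_stream_si128(p + i + 0, v);
        _mm_stream_si128(p + i + 1, v);
        _mm_stream_si128(p + i + 2, v);
        _mm_stream_si128(p + i + 3, v);
    }

    /* Non-temporal stores are weakly ordered: fence before the data
       (or a "data is ready" flag written afterwards) is relied upon. */
    _mm_sfence();
}
```

Writing each 64-byte line with consecutive stores is a design choice intended to give the hardware a chance to combine them into full-line writes; whether it actually does so is not observable from this code, which is the limitation discussed in the thread.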
