I've been looking at load (LD) latency on my IB (Ivy Bridge). The opt guide discusses the effect that displacement size has on load latency, so I decided to test it. As best I can tell, you only get 4 clks of load latency if you DON'T have a displacement. As soon as a displacement is added I observe 1 extra clk of latency, which is counter to what the optimization guide states. The test is quite simple: create a pointer chase in which each element points to the next 8 bytes after the current one (or whatever stride strikes your fancy), then offset the address stored at each hop by some number of bytes, and correct for that offset with a displacement in the load that performs the chase.
To the best of my knowledge, it appears that all loads which have a displacement take 5 clks. Can someone from Intel comment on this and, if it's correct, clarify/correct it in the opt guide?
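A minimal sketch of the kind of pointer-chase test described above, for anyone who wants to reproduce it. The names `DISP`, `build_ring`, `chase` and `ns_per_hop` are hypothetical, the displacement value is arbitrary, and `clock()` is only a coarse stand-in for a proper cycle counter. It also assumes the compiler encodes the constant `DISP` as a `[reg + disp]` addressing mode, which should be confirmed in the disassembly before trusting any numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical value: displacement baked into every load, in bytes.
   Kept as a compile-time constant so the compiler can encode it as [reg + DISP]. */
#define DISP 8

/* Slot i stores (address of slot i+1) - DISP, so the chase must add DISP back. */
void build_ring(uintptr_t *ring, size_t slots) {
    assert(slots > 0);
    for (size_t i = 0; i < slots; i++)
        ring[i] = (uintptr_t)&ring[(i + 1) % slots] - DISP;
}

/* Perform `hops` dependent loads; each load address depends on the previous result. */
uintptr_t chase(uintptr_t p, size_t hops) {
    for (size_t i = 0; i < hops; i++)
        p = *(volatile uintptr_t *)(p + DISP);
    return p;
}

/* Average nanoseconds per hop, measured with the coarse standard C clock(). */
double ns_per_hop(uintptr_t *ring, size_t slots, size_t hops) {
    build_ring(ring, slots);
    uintptr_t start = (uintptr_t)&ring[0] - DISP;  /* first load hits slot 0 */
    clock_t t0 = clock();
    uintptr_t end = chase(start, hops);
    clock_t t1 = clock();
    (void)end;
    return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / (double)hops;
}
```

With a ring small enough to stay in L1 and a large hop count, multiplying the result of `ns_per_hop` by the core clock in GHz gives a rough clks-per-load figure. Rebuilding with `DISP` set to 0 gives the no-displacement baseline to compare against.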
Perfwise
Page 2-19 in the latest opt guide.
Perfwise
This is very easy to test, and with all the discussion about moves and their performance, knowing the latency of a load is of the utmost importance. If anyone has difficulty writing a test to illustrate this load-latency observation with base+displacement, let me know and I'll post something to illustrate it. However, I've been unable to produce the stated 4-cycle latency on loads when using any displacement, which I find somewhat strange.
Perfwise
perfwise wrote:
However I've been unable to generate the said 4 cycle latency on lds when I'm using any displacement which I find somewhat strange.
FWIW I just read a message from someone with the same concerns as you here: http://www.realworldtech.com/forum/?threadid=133659&curpostid=133745
In my observations I have HT disabled in the BIOS, and Intel quotes this latency for SB and IB. I don't believe this is an HT issue.
Thanks for the update though. Perfwise
perfwise wrote:
In my observations I have HT disabled in the bios... and Intel quotes this latency for SB and IB. I don't believe this is an HT issue.
The poster there says that "it's a hyper-threading optimization that causes it to go from 3 cycles to 4." He is referring to a DL1$ design decision (answering another poster's argument); whether HT is enabled or not isn't relevant (it's easy to test by measuring the speed of your code with HT enabled vs. disabled).
But he also says this: "But what about the 5th cycle for base+offset+scale? What's up with that?" He is measuring 5 clocks instead of 4 clocks with small offsets (< 2048B), just like you, and also for pointer-chasing code.
If your code is amenable to threading, HT should be a great way to improve your overall throughput; you can hope for a 1.3x or better speedup, provided you are not RAM bandwidth bound.
I wrote a more detailed test, and have determined that loads with a displacement in the 0:2047 range can in fact execute with 4-cycle latency, but not all of them can. It depends on the address range of the base and on whether a carry is performed into bit 12. I believe it all centers on timing, power consumption and catching the "lion's share" of the performance/power opportunity by covering a significant fraction of addresses. If you have to borrow, or your displacement is >= 2048, it takes 5 clks. If you cross into another page, it takes 9 clks, which is quite surprising to me, but it's the law of probabilities: you don't build solutions which are perfect/absolute, but ones which are a compromise for the greater good. So in certain cases, depending upon your code and the memory addresses it's touching, you could see some dramatic performance drawbacks on the latest Core products.
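The observations above can be written down as a small classifier, shown here as a sketch only. The function name and constants are hypothetical, "borrow" is assumed to mean a negative displacement, 4 KiB pages are assumed, and checking the displacement range before the page-cross case is an assumption too, since the post does not say which rule wins when both apply. The numbers are the measurements reported in this thread, not documented figures.

```c
#include <stdint.h>

#define PAGE_SHIFT 12  /* assumes 4 KiB pages: bit 12 is the lowest page-number bit */

/* Latencies in clks as reported in the post above (observations, not spec values). */
enum { LAT_FAST = 4, LAT_SLOW = 5, LAT_PAGE_CROSS = 9 };

/* Classify a base+displacement load according to the reported behaviour. */
int observed_load_latency(uint64_t base, int64_t disp) {
    /* Assumption: "borrow" means a negative displacement. */
    if (disp < 0 || disp >= 2048)
        return LAT_SLOW;

    /* A carry into bit 12 or above means base and effective address
       fall in different pages. */
    uint64_t ea = base + (uint64_t)disp;
    if ((ea >> PAGE_SHIFT) != (base >> PAGE_SHIFT))
        return LAT_PAGE_CROSS;

    return LAT_FAST;
}
```

For example, a base of `0x1FF8` with a displacement of 16 lands in the next page and is classified as the 9-clk case, while a base of `0x1000` with the same displacement stays in the 4-clk case.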
Perfwise
>>>if your code is amenable to threading, HT should be a great way to improve your overall throughput, you can hope for 1.3x or better speedup, provided you are not RAM bandwidth bound>>>
And provided you are not saturating the FP execution units under a heavy floating-point load.
iliyapolak wrote:
>>>if your code is amenable to threading, HT should be a great way to improve your overall throughput, you can hope for 1.3x or better speedup, provided you are not RAM bandwidth bound>>>
And not saturating in case of heavy floating point load fp execution units.
I meant it for his pointer-chasing code. But pointer chasing with FP pointers would be fun; we'd need a new floating-point addressing mode that lets us see what's lurking inside the bits.
Yes, I see what you mean. Btw, are you still working on the Khronos project?
iliyapolak wrote:
Yes I see what you mean.Btw are you still working on Khronos project?
I never worked with Khronos, and I'm using neither OpenGL nor OpenCL; I'm a big fan of pure software rendering on CPUs.
bronxzv wrote:
iliyapolak wrote:
Yes I see what you mean. Btw are you still working on Khronos project?
never worked with KHRONOS, and I'm using neither Open GL nor Open CL, I'm a big fan of pure software rendering on CPUs
Sorry, it was not Khronos. I cannot remember the exact name of the pure software rendering project that you mentioned a few times in our past conversations.
iliyapolak wrote:
bronxzv wrote:
iliyapolak wrote:
Yes I see what you mean. Btw are you still working on Khronos project?
never worked with KHRONOS, and I'm using neither Open GL nor Open CL, I'm a big fan of pure software rendering on CPUs
Sorry it was not Khronos, I cannot remember the exact name of pure software rendering project you few times mentioned the name in our past conversation.
The engine I'm working on is called "Kribi 3D", so there is also an initial "K": http://www.inartis.com/
Though I now spend most of my time developing a business web application based on this engine, which is described here: http://www.planogrambuilder.com/
We have a lot of users all over the world in big companies such as Procter & Gamble and Logitech. For all these people pure software rendering is a daily reality, but they are not aware of it and they don't care; it just works well.
Thanks
Put simply, a large number of posts were lost during the new IDZ website upgrade, and I was not able to find the exact name of your project.