IB gpr load latency and displacement size

perfwise · ‎05-28-2013

I've been looking at load latency on my IB. The optimization guide discusses the effect displacement size has on load latency, so I decided to test that. As best I can tell, you only get 4 clks of load latency if you DON'T have a displacement. As soon as a displacement is added I observe 1 extra clk of latency, which is counter to what the optimization guide states. The test is actually quite simple: create a pointer chase in which each element points to the next one 8 bytes further on (or whatever stride strikes your fancy), then offset each stored address by some number of bytes, which you correct in the pointer-chase load via a displacement.
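A minimal sketch of the kind of test described above, assuming a C compiler that folds the constant into the load's addressing mode (typical for x86-64 at -O2, but worth confirming in the disassembly). The names `DISP`, `build_chain` and `chase` are hypothetical and not from the original post. Each slot stores the address of the next slot minus `DISP`, so the dependent load has to add `DISP` back, which puts a base+displacement load on the critical path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Displacement used by the chasing load. Hypothetical default;
   build with -DDISP=0 to get the no-displacement baseline. */
#ifndef DISP
#define DISP 8
#endif

/* Build a circular chain of n slots, stride bytes apart.
   Slot i lives at buf + DISP + i*stride and stores the address of the
   next slot minus DISP, so the chase must load from [p + DISP].
   stride must be a multiple of the pointer size to keep slots aligned. */
static char *build_chain(size_t n, size_t stride)
{
    assert(n > 0 && stride >= sizeof(char *) && stride % sizeof(char *) == 0);
    char *buf = malloc(DISP + n * stride);
    if (buf == NULL)
        return NULL;
    for (size_t i = 0; i < n; i++) {
        size_t next = (i + 1) % n;
        *(char **)(buf + DISP + i * stride) = buf + next * stride;
    }
    return buf; /* starting pointer: [buf + DISP] is slot 0 */
}

/* Follow the chain for the given number of dependent loads.
   The volatile-qualified access keeps the compiler from removing
   or hoisting the loads. */
static char *chase(char *p, size_t steps)
{
    for (size_t i = 0; i < steps; i++)
        p = *(char *volatile *)(p + DISP);
    return p;
}
```

To estimate latency, time `chase` over many steps on a chain small enough to stay in the L1 data cache and divide by the step count, then compare against a build with `-DDISP=0`. Whatever timer is used needs to be converted to core clocks, since a timestamp counter does not necessarily tick at the core frequency.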

To the best of my knowledge, it appears that all loads which have a displacement take 5 clks. Can someone from Intel comment on this and, if it's correct, clarify/correct the opt guide?

Perfwise

SergeyKostrov · ‎05-28-2013

>>...I observe 1 extra clk in latency, which is counter to what the optimization guide states...

On what page is that statement in the manual?

perfwise · ‎05-29-2013

Page 2-19 in the latest opt guide.

Perfwise

perfwise · ‎05-29-2013

This is very easy to test, with all the discussions about moves and their performance, knowing the latency of a ld is of the utmost importance. If anyone has difficulty writing a test to illustrate this load latency observation with base+displacement, let me know and I'll post something to illustrate it. However I've been unable to generate the said 4 cycle latency on lds when I'm using any displacement which I find somewhat strange.

Perfwise

bronxzv · ‎05-29-2013

perfwise wrote:
However I've been unable to generate the said 4 cycle latency on lds when I'm using any displacement which I find somewhat strange.

FWIW I just read a message from someone with the same concerns as you here: http://www.realworldtech.com/forum/?threadid=133659&curpostid=133745

perfwise · ‎05-30-2013

In my observations I have HT disabled in the bios... and Intel quotes this latency for SB and IB. I don't believe this is an HT issue.

Thanks for the update though.. Perfwise

SergeyKostrov · ‎05-30-2013

>>...This is very easy to test, with all the discussions about moves and their performance, knowing the latency of a ld is of the utmost importance...

If you think your results are very interesting and your code demonstrates / proves it, then simply post all of it for everybody's review. Create a complete Visual Studio project, comment the code, etc., and upload it. Personally, I wouldn't expect 99% of IDZ users to be really concerned about this or to try to implement a test case based on your requirements. If somebody finds your results and code interesting, a follow-up will come. Does that make sense?

PS: I could quickly verify your results on my Ivy Bridge system with Windows 7 Professional 64-bit.

bronxzv · ‎05-31-2013

perfwise wrote:
In my observations I have HT disabled in the bios... and Intel quotes this latency for SB and IB. I don't believe this is an HT issue.

the poster there says that "it's a hyper-threading optimization that causes it to go from 3 cycles to 4.", he is referring to a DL1$ design decision (answering another poster's argument), whether HT is enabled or not isn't relevant (it's easy to test by measuring the speed of your code with HT enabled vs. disabled)

but he also says this: "But what about the 5th cycle for base+offset+scale? What's up with that?", he is measuring 5 clocks instead of 4 clocks with small offsets (< 2048B), just like you, also for pointer chasing code

if your code is amenable to threading, HT should be a great way to improve your overall throughput, you can hope for 1.3x or better speedup, provided you are not RAM bandwidth bound

perfwise · ‎06-03-2013

I wrote a more detailed test and have determined that lds with a displacement in the 0:2047 range can in fact execute with 4 cycle latency, but not all of them can. It depends on the address range of the base and whether a carry is performed into bit 12. I believe it all centers on timing, power consumption and catching the "lion's share" of the performance/power opportunity by covering a significant fraction of addresses. If you have to borrow or your displacement is >= 2048 it takes 5 clks. If you cross into another page, it takes 9 clks, which is quite surprising to me, but it's the law of probabilities: you don't build solutions which are perfect/absolute but ones that are a compromise for the greater good. So in certain cases, depending upon your code and the memory addresses it's touching, you could see some dramatic performance drawbacks on the latest Core products.
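One way to read the numbers reported above, written as a small C model. This is an interpretation of the poster's measurements, not a documented Intel rule. The function name `modeled_load_latency` and the 4 KiB page constant are assumptions, "borrow" is taken here to mean a negative displacement, and the post does not say what happens when a displacement >= 2048 also crosses a page (the sketch returns 5 for that case).

```c
#include <stdint.h>

/* 4 KiB pages, matching the "carry into bit 12" remark (assumption). */
#define PAGE_SHIFT 12

/* Latency in clks for a load from [base + disp], following one reading
   of the measurements reported in this thread:
     - negative displacement (borrow) or disp >= 2048  -> 5
     - disp in 0..2047 but base + disp lands in another page -> 9
     - disp in 0..2047 within the same page              -> 4 */
static int modeled_load_latency(uint64_t base, int64_t disp)
{
    if (disp < 0 || disp >= 2048)
        return 5;

    uint64_t addr = base + (uint64_t)disp;
    if ((addr >> PAGE_SHIFT) != (base >> PAGE_SHIFT))
        return 9; /* carry into bit 12: the sum is in the next page */

    return 4;
}
```

For example, under this reading a base of `0x1FF8` with a displacement of 16 carries into bit 12 and is modeled as 9 clks, while the same displacement from a base of `0x1000` is modeled as 4 clks.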

Perfwise

Bernard · ‎06-03-2013

>>>if your code is amenable to threading, HT should be a great way to improve your overall throughput, you can hope for 1.3x or better speedup, provided you are not RAM bandwidth bound>>>

And provided you are not saturating the FP execution units under a heavy floating-point load.

bronxzv · ‎06-03-2013

iliyapolak wrote:

>>>if your code is amenable to threading, HT should be a great way to improve your overall throughput, you can hope for 1.3x or better speedup, provided you are not RAM bandwidth bound>>>

And provided you are not saturating the FP execution units under a heavy floating-point load.

I meant for his pointer chasing code, but pointer chasing with FP pointers would be fun: we'd need a new floating point addressing mode to see what's lurking inside the bits

Bernard · ‎06-03-2013

Yes, I see what you mean. Btw, are you still working on the Khronos project?

bronxzv · ‎06-03-2013

iliyapolak wrote:

Yes, I see what you mean. Btw, are you still working on the Khronos project?

never worked with Khronos, and I'm using neither OpenGL nor OpenCL, I'm a big fan of pure software rendering on CPUs

Bernard · ‎06-03-2013

bronxzv wrote:

never worked with Khronos, and I'm using neither OpenGL nor OpenCL, I'm a big fan of pure software rendering on CPUs

Sorry, it was not Khronos. I cannot remember the exact name of the pure software rendering project you mentioned a few times in our past conversations.

bronxzv · ‎06-04-2013

iliyapolak wrote:

Sorry, it was not Khronos. I cannot remember the exact name of the pure software rendering project you mentioned a few times in our past conversations.

the engine I'm working on is called "Kribi 3D" (so there is also an initial "K"): http://www.inartis.com/

though I now spend most of my time developing a business web application based on this engine, it's described here: http://www.planogrambuilder.com/

we have a lot of users all over the world in big companies such as Procter & Gamble and Logitech, for all these people pure software rendering is a daily reality but they are not aware of it and they don't care, it just works well

Bernard · ‎06-04-2013

Thanks

Simply put, during the IDZ website upgrade a large number of posts were lost, and I was not able to find the exact name of your project.