Analyzers
Talk to fellow users of Intel Analyzer tools (Intel VTune™ Profiler, Intel Advisor)

Core i7 40% slower than Core2, VTune capture of issue?

jmx1024
Beginner
I wrote a realtime raytracer sometime last year, and at the core of it is obviously a sphere/ray test. This one function that does my tests (4 at a time) runs MUCH slower on the Core i7 (965) compared to my Core 2 (Q6600). To narrow things down further, I've clocked both machines at 2700 MHz and reduced the raytracer to 1 thread. HT makes no difference to this issue on the Core i7, so for these tests it was left on (I tried it both ways).

The output looks like this:

[screenshot of the raytracer output omitted]

Here's the code and the VTune data.

Core2:

[VTune screenshot omitted]

Core i7:

[VTune screenshot omitted]

It's worth noting that if I simply return NULL from my intersect function, the Core i7 runs at 75 fps and the Core2 runs at 40 fps... so the Core i7 really does run all the other code faster, it seems; it's just down to this one problem area. My raytracer's "shading" code seems to run at about the same speed on both machines, with the i7 winning by a bit. Also worth noting that every benchmark I've run on the two machines shows the i7 killing the Core 2... so it's not a problem with my PC.

Also, if I expand the source out into ASM view, the branch mispredictions are of course on the multiple "jbe" instructions within that loop.
14 Replies
Vladimir_T_Intel
Moderator
Hi,

I think you'd give us much more information if you provided the asm for the whole function.
Is the index r passed as the function's parameter?
BTW, why is the source code of the function Intersect4SSE different in those two cases?

jimdempseyatthecove
Distinguished Contributor III

You might try the following untested code:
[cpp]if(det > 0.0f)
{
  f32 b = -((f32*)&_ba)[r];  // negate the r-th lane of _ba
  int r1 = 1 << (r*2);       // bit position for this lane's result
  if(fabsf(b) < det)
  {
     // |b| < det: b - det is negative, so only b + det can be a valid hit
     f32 i2 = b + det;
     if(i2 < distance)
     {
       distance = i2;
       retval |= r1 + r1;    // equivalent to setting bit (r*2 + 1)
     }
  }
  else
  {
    if(b >= 0.0f)
    {
      // b >= det, so the nearer root b - det is non-negative
      f32 i1 = b - det;
      if(i1 < distance)
      {
        distance = i1;
        retval |= r1;        // set bit (r*2)
      }
    }
  }
}
[/cpp]

Jim Dempsey
Vladimir_T_Intel
Moderator
Unfortunately, I can neither recognize the code as open source nor find it by googling. Would the testing make any sense with arbitrary input values and array sizes? I don't think so. What am I missing?
jmx1024
Beginner
Quoting - Vladimir_T_Intel
Unfortunately, I can neither recognize the code as open source nor find it by googling. Would the testing make any sense with arbitrary input values and array sizes? I don't think so. What am I missing?

I've reconstructed the test in a small VS2005 project which has a #define you can toggle to re-create the issue.
http://jmx.ls1howto.com/core/corei7_bugtest.zip

And interestingly, in the test app it runs slower on the Core2 as well, rather than only being slow on the i7.

I ended up writing a branchless SSE version of the function over the weekend, so my problems are gone now, but still, that original function does run very poorly on *only* my Core i7.
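
A minimal sketch of what such a branchless test can look like, with one ray tested against 4 spheres at once; the function name, parameters and data layout below are assumptions for illustration, not the actual code from the project:

[cpp]#include <xmmintrin.h>   // SSE intrinsics

// Branchless test of one ray against 4 spheres at once (illustrative sketch).
//   b4   : per-sphere "b" term of the quadratic (the value the scalar code negates)
//   det4 : per-sphere discriminant, i.e. what gets passed to the square root
//   dist : in/out, current nearest hit distance broadcast into all 4 lanes
// Returns a 4-bit mask; bit i is set when sphere i produced a closer hit.
static inline int Intersect4_branchless(__m128 b4, __m128 det4, __m128& dist)
{
    const __m128 zero = _mm_setzero_ps();

    // Lanes with det <= 0 are misses; clamp to 0 so _mm_sqrt_ps never sees a
    // negative value (this keeps -1.#IND out of the comparisons).
    __m128 hit  = _mm_cmpgt_ps(det4, zero);
    __m128 root = _mm_sqrt_ps(_mm_max_ps(det4, zero));

    // Nearer positive root per lane: b - root when it is > 0, otherwise b + root.
    __m128 t0       = _mm_sub_ps(b4, root);
    __m128 t1       = _mm_add_ps(b4, root);
    __m128 nearMask = _mm_cmpgt_ps(t0, zero);
    __m128 t        = _mm_or_ps(_mm_and_ps(nearMask, t0), _mm_andnot_ps(nearMask, t1));

    // Accept lanes that hit, lie in front of the ray, and beat the current distance.
    __m128 ok = _mm_and_ps(hit, _mm_and_ps(_mm_cmpgt_ps(t, zero), _mm_cmplt_ps(t, dist)));

    // Branchless per-lane select of the new nearest distance.
    dist = _mm_or_ps(_mm_and_ps(ok, t), _mm_andnot_ps(ok, dist));

    return _mm_movemask_ps(ok);   // bit i set => sphere i produced a closer hit
}
[/cpp]

The point is that every comparison produces a lane mask in an xmm register, so there are no data-dependent branches left for the predictor to miss.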

Also, the source code was slightly different between those 2 screenshots as I wanted to eliminate the array of results as a source of problems, and instead just made a bitmask for results to be stored in. It made no difference and I forgot to change it back before taking the 2nd screenshot.

Jon
jmx1024
Beginner
Also, I will say once this strange slowdown was "fixed" (it went away after I changed 1 line), the app on the i7 completely blows away the Core2. At full throttle the raytracer can trace half a million pixels at 100+fps, compared to ~55fps on the Core2.
Vladimir_T_Intel
Moderator
Yeah... I identified the problem. Investigating...
Which compiler were you using?
jmx1024
Beginner
Quoting - Vladimir_T_Intel
Yeah... I identified the problem. Investigating...
Which compiler were you using?

I tried the following 3 compilers (all showed the issue):
VS 2005 win32
VS 2005 x64
VS 2008 win32

I'm less surprised that C code can produce a bad set of assembly; I was more surprised that the exact same executable that was fast on my Core2 was ultra-slow on the i7. Could a slight change in branch prediction cause all the fuss?
Vladimir_T_Intel
Moderator
Hi,

I found out that on every iteration of the loop the value of dets is -1.#IND00.
This is due to the calculation _mm_sqrt_ps(_dets), where _dets for some reason contains negative values.
Thus, on every iteration the code only executes 'if (-1.#IND00 > 0.0f)', which is surprisingly very slow on Core i7.
I doubt that your results were collected under the same conditions. Would you advise how to fix the code?
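
(As a side note, a quick way to catch such values while debugging is a lane-mask check before the square root; the helper names below are made up for illustration, and only _dets comes from the code under discussion.)

[cpp]#include <xmmintrin.h>

// Bit i set => lane i of v is NaN (a NaN never compares equal to itself).
static inline int NanLanes(__m128 v)
{
    return _mm_movemask_ps(_mm_cmpneq_ps(v, v));
}

// Bit i set => lane i of v is negative, i.e. _mm_sqrt_ps(v) would produce -1.#IND there.
static inline int NegativeLanes(__m128 v)
{
    return _mm_movemask_ps(_mm_cmplt_ps(v, _mm_setzero_ps()));
}

// Usage: assert(NegativeLanes(_dets) == 0); right before the _mm_sqrt_ps call.
[/cpp]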
Vladimir_T_Intel
Moderator
OK. That doesn't mean any conclusions can be drawn yet, just observations.

First of all, check the asm generated by your compiler. What you'll see might be surprising. What I got:
With the MSFT compiler I get extremely ineffective FP comparison code (I checked both debug and release). Here half of the CPU clock ticks are stalled because of this damned fcomp instruction, which modifies the C1 FPU flag, and the pipeline waits for this instruction to retire. In release, fcom st(1) is used, which is no better. There is some contribution to the stalls from branch misprediction; the short trip count is probably the reason. Besides that, a decent number of FP assists is generated to cope with the undefined value.
It's not surprising that on my Core i7 this code required 25% more clock ticks. The more pipeline stages, the bigger the penalty for partial flag updates and mispredictions.

Of course I wanted to play with the Intel Compiler. Version 11.1 gave me a 60x speedup on Core2 (not forgetting that nothing useful is being calculated, just populating the values and repeating the loop with one comparison instruction). This is due to moving all the code to SIMD (by default the Intel Compiler sets the /arch:SSE2 option). Instead of an FPU comparison we get the comiss instruction, a scalar compare of values in xmm registers.

The interesting things started when I tried the code on Core i7 (although it's not equal to the Core2 in clock rate). Here the Intel code is still faster than MSFT's FPU-based code, but not by as much, and the speedup is smaller than on Core 2: approx. 20x.
There are many branch mispredictions and some other stalls that might be due to floating-point assists. Unfortunately, I don't see a SIMD_ASSIST event on Core i7, so I can't check it.

But again, this doesn't look like a normal execution path; it's a kind of anomaly, and I'd like to know how to fix the code to try a real execution (I could fix it myself, but the values I would choose might not make any sense for the application and might expose another anomaly).
TimP
Distinguished Contributor III

Quoting - Vladimir_T_Intel
Of course I wanted to play with the Intel Compiler. Version 11.1 gave me a 60x speedup on Core2 (not forgetting that nothing useful is being calculated, just populating the values and repeating the loop with one comparison instruction). This is due to moving all the code to SIMD (by default the Intel Compiler sets the /arch:SSE2 option). Instead of an FPU comparison we get the comiss instruction, a scalar compare of values in xmm registers.

Am I to understand there was a requirement for specific unstated options in use of the Microsoft compiler, and it may have been the 32-bit one? Apparently, the normal options /arch:SSE2 /fp:fast /Ox aren't acceptable? Those options are more conservative than the Intel compiler defaults.
Vladimir_T_Intel
Moderator
Quoting - tim18
Am I to understand there was a requirement for specific unstated options in use of the Microsoft compiler, and it may have been the 32-bit one? Apparently, the normal options /arch:SSE2 /fp:fast /Ox aren't acceptable? Those options are more conservative than the Intel compiler defaults.

I didn't see any requirements for the MSFT compiler, so I took the project as is; maybe I messed something up. Anyway, with the MSFT options set to /arch:SSE2 (/fp:fast as well) I observe almost the same behavior (although Intel is "faster" than MSFT by 4x on Core2; yes, it looks like MSFT's options are more conservative).
jmx1024
Beginner
Quoting - Vladimir_T_Intel
Hi,

I found out that on every iteration of the loop the value of dets is -1.#IND00.
This is due to the calculation _mm_sqrt_ps(_dets), where _dets for some reason contains negative values.
Thus, on every iteration the code only executes 'if (-1.#IND00 > 0.0f)', which is surprisingly very slow on Core i7.
I doubt that your results were collected under the same conditions. Would you advise how to fix the code?

Sorry for the delayed response. Got sidetracked on another bit of code, and had no time for this personal realtime raytracing project.

I can confirm that the -1.#IND results are actually happening in the real code I was testing too; while the demo application hits that condition 100% of the time, my real code hits it a mere 90% of the time. ;) Are you saying that this could be the issue causing the difference between the i7 and Core2? I suppose that would be more reassuring, since it seems like a poorly programmed edge case. I mostly want to understand the issue so that it doesn't crop up in other bits of code.

I'd love to try the Intel compiler for this code some day, but I suspect it'll be hard to beat the MSFT compiler for performance. I do 100% game console dev at work, so I've not been able to justify buying the Intel compiler and VTune, so I suppose my home projects will have to stay MSFT for the time being.

I have another week or two on this vtune trial if you'd like for me to investigate the actual code for something specific.

Thanks for your help in figuring this one out.
Vladimir_T_Intel
Moderator

Hi,

I did some tests with the code with the -1.#IND values completely eliminated and found that the performance of Core2 and Core i7 is practically the same. So I made up a synthetic test with a comiss %xmm1, %xmm0 instruction handling a -1.#IND value in one of the operand registers and found that Core i7 is noticeably slower executing this combination than Core2, most likely due to internal assists (no matter which compiler is used). Although this sounds like a big issue with Core i7, I believe it is not, since in real code we should avoid using such values in any calculations (in this example we should check the input of the square root operation for negative values). So yes, as you mentioned, this is an example of a "poorly programmed edge case", and I'm glad you've found another way: that branchless SSE version of the function.
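
A minimal sketch of that kind of input check in the SSE path (assuming the discriminant vector is the _dets referred to earlier; the function name is made up for illustration): clamping negative lanes to zero before the square root keeps NaNs out of the later comparisons entirely.

[cpp]#include <xmmintrin.h>

// Clamp negative discriminants to zero before the square root so no lane ever
// becomes -1.#IND; those lanes simply fail the later 'det > 0.0f' test.
static inline __m128 SqrtOfClampedDets(__m128 _dets)
{
    return _mm_sqrt_ps(_mm_max_ps(_dets, _mm_setzero_ps()));
}
[/cpp]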

jmx1024
Beginner

Quoting - Vladimir_T_Intel
Hi,

I did some tests with the code with the -1.#IND values completely eliminated and found that the performance of Core2 and Core i7 is practically the same. So I made up a synthetic test with a comiss %xmm1, %xmm0 instruction handling a -1.#IND value in one of the operand registers and found that Core i7 is noticeably slower executing this combination than Core2, most likely due to internal assists (no matter which compiler is used). Although this sounds like a big issue with Core i7, I believe it is not, since in real code we should avoid using such values in any calculations (in this example we should check the input of the square root operation for negative values). So yes, as you mentioned, this is an example of a "poorly programmed edge case", and I'm glad you've found another way: that branchless SSE version of the function.



Indeed, I've run a test that measures the cycle count of an instruction with various types of invalid floats passed in:
[plain]base
    1735593 [ base ] : minps
    1735590 [ base ] : cmpeqps
    1735599 [ base ] : comiss
    1735593 [ base ] : ucomiss
denormal rhs
  119225166 [ 68.69] : minps
  137290494 [ 79.10] : cmpeqps
  128877306 [ 74.26] : comiss
  128997459 [ 74.32] : ucomiss
infinite rhs
    1735599 [  1.00] : minps
    1735593 [  1.00] : cmpeqps
    1735593 [  1.00] : comiss
    1735593 [  1.00] : ucomiss
nan on the right
    1735599 [  1.00] : minps
    1735581 [  1.00] : cmpeqps
  150906552 [ 86.95] : comiss
    1735590 [  1.00] : ucomiss[/plain]

This shows that comiss with a NaN is 87x slower than with a valid float, and on Core2 this penalty does not seem to exist. Thanks again for your help in solving the mystery.
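
A rough sketch of how a per-instruction timing like this can be put together (the harness below is illustrative, not the actual test code; it uses __rdtsc and _mm_comilt_ss, which compiles down to a comiss):

[cpp]#include <intrin.h>      // __rdtsc (MSVC; on GCC/Clang use <x86intrin.h>)
#include <xmmintrin.h>   // SSE intrinsics
#include <limits>
#include <cstdio>

// Time N back-to-back scalar compares against a given right-hand-side value.
static unsigned long long TimeComiss(float rhs, int iterations)
{
    volatile float vrhs = rhs;        // volatile reload keeps the compare in the loop
    volatile int   sink = 0;          // keep the result observable
    __m128 a = _mm_set_ss(1.0f);

    unsigned long long start = __rdtsc();
    for (int i = 0; i < iterations; ++i)
    {
        __m128 b = _mm_set_ss(vrhs);
        sink += _mm_comilt_ss(a, b);  // emits a comiss plus a flag test
    }
    return __rdtsc() - start;
}

int main()
{
    const int N = 1000000;
    unsigned long long base = TimeComiss(2.0f, N);
    unsigned long long nan  = TimeComiss(std::numeric_limits<float>::quiet_NaN(), N);
    std::printf("base: %.0f cycles, NaN rhs: %.0f cycles (%.1fx)\n",
                (double)base, (double)nan, (double)nan / (double)base);
    return 0;
}
[/cpp]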