poor code generation; store-forward stall

TimP · 06-07-2001

I am investigating some cases where the code generated by CVF 6.5A is particularly slow on the P4 ("NetBurst"). It is particularly important to set /architecture:p6, and with that setting performance is excellent in many situations. One of the worst cases is where fnstcw (a 16-bit store) is always followed by a 32-bit load from the same location, even though only those 16 bits are used. Modifying every instance of that load instruction to avoid the store-forward stall may more than double performance.
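The access pattern described above can be sketched in C. This is only an illustration of a narrow store followed by a wide reload, not the code CVF actually emits: the `slot32` type and the helper names are hypothetical, the byte layout assumes little-endian x86, and an optimizing C compiler is free to rewrite these accesses, so the sketch shows the logical pattern rather than a benchmark.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 32-bit slot standing in for the memory location
   that fnstcw writes the 16-bit x87 control word into. */
typedef struct { unsigned char bytes[4]; } slot32;

/* Narrow store: only the first 16 bits of the slot are written. */
static void store16(slot32 *s, uint16_t v) {
    memcpy(s->bytes, &v, sizeof v);
}

/* Wide reload: read all 32 bits, then keep 16 of them.
   The load is wider than the preceding store, which is the
   mismatch the post reports as slow on the P4.
   Masking the low half assumes little-endian byte order. */
static uint16_t reload_wide(const slot32 *s) {
    uint32_t w;
    memcpy(&w, s->bytes, sizeof w);
    return (uint16_t)(w & 0xFFFFu);
}

/* Matched reload: read back exactly the 16 bits that were stored. */
static uint16_t reload_matched(const slot32 *s) {
    uint16_t v;
    memcpy(&v, s->bytes, sizeof v);
    return v;
}
```

Both reloads return the same value, which is why the wider load can be narrowed without changing results; only the size relationship between the store and the load differs.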

Are these obstacles to P4 performance already under review, and would bug reports be appropriate? I didn't find anything by searching the forum, but the forum responds extremely slowly over my home ISP.

Where the NetBurst parallel instructions are needed to achieve the potential of P4, a new architecture switch would be required. Is there any interest in this?

Steven_L_Intel1 · 06-07-2001

We're always interested in specific examples of places where we can generate better code - though I think the one you describe is one we already know about. Please send a short example, if you can, and a description of how you think the code should be improved, to us at vf-support@compaq.com. We've received a number of examples from folks at AMD - we'd welcome them from Intel as well.

In any event - the next update to CVF will include a P4 architecture switch to specify that the processor is a P4 so that we generate appropriate instructions for it. We have found that Pentium III assumptions don't hold for the P4, which is why you have to say /arch:P6 in CVF 6.5.

I will say, though, that we've found the P4 to be an uneven performer, even using Intel's compiler (which is sometimes better, sometimes worse than CVF on a P4). We have some benchmark programs where a 1.4GHz P4 performs worse than an 850MHz PIII. A 1.1GHz AMD Athlon usually outperforms the 1.4GHz P4 across the board. The P4 is REALLY good at memory-bandwidth-intensive programs, though.

It's not clear to us that generating SSE2 instructions is the key to "achieve the potential of P4" - Intel's own published papers say that SSE2 gained only 5% on the SPEC benchmark tests. Nevertheless, we are quite interested in seeing what is available to boost performance on each of our supported processors - so feel free to send us specific suggestions.

Steve

TimP · 06-07-2001

CVF frequently out-performs the P-II-compatible code generated by the Intel compiler, and the Intel compiler, when SSE is enabled, sometimes chooses SSE code even where it is slower than P-II-compatible code. The few cases where the Intel compiler generates SSE code that is clearly faster than CVF generic code involve vectorization, which may produce a 90% improvement; storing real to integer (which would be helped a great deal by fixing the issue raised above); and math functions, where the internal firmware is not the best choice.
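For readers wondering why "storing real to integer" ties back to the fnstcw issue, here is a minimal C sketch of that kind of loop. The helper name is hypothetical. The connection rests on an assumption about typical x87 code generation, not on anything stated in this thread: a truncating conversion is commonly compiled as saving the control word with fnstcw, loading a truncating rounding mode with fldcw, storing with fistp, and then restoring the original mode.

```c
#include <stddef.h>

/* Hypothetical helper: convert n reals to integers, truncating
   toward zero as a C cast does (Fortran INT behaves the same way).
   On x87-only targets each conversion typically needs a rounding
   mode switch, which is where the control word save and reload
   discussed above would appear. */
static void store_real_to_int(const float *src, int *dst, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dst[i] = (int)src[i];   /* truncates toward zero */
    }
}
```

Because every iteration performs a conversion, any per-conversion stall is multiplied by the trip count, which would be consistent with the large gains reported from fixing the load.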