Hi - I'm using Visual Studio C++ 2012 on Windows 8 with Intel Compiler 16.0 to develop some code implementing a digital signal processing algorithm. The main loop iterates over received 'symbol' data (1200 symbols) and lends itself well to vectorization. My laptop has an i5-4300U, which supports AVX2 instructions.
I've coded up different implementations (different sub-classes of a parent class), which initially were all in one file. Then I split the sub-classes into separate .cpp files, each its own compilation unit, and the performance of one of the implementations more than halved. That is, the time to process each symbol rose from approx 5 ns to 14 ns (measured using Windows QueryPerformanceCounter()).
If I remove the #pragma simd before the main loop, the performance is the same for both single and multiple compilation unit builds, which is why I'm raising this as a vectorization issue rather than just a general Intel compiler query.
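For context, the loop has roughly the shape below (a simplified sketch with placeholder names, not my actual kernel). With the Intel compiler, #pragma simd instructs the compiler to vectorize the loop; other compilers will typically just warn about an unknown pragma.

```cpp
#include <cstddef>

// Sketch only: placeholder names, one independent operation per symbol.
// '#pragma simd' (Intel compiler) forces vectorization of the loop below.
void process_symbols(const float* in, float* out, std::size_t n, float gain)
{
#pragma simd
    for (std::size_t i = 0; i < n; ++i)
        out[i] = gain * in[i];
}
```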
I've now got the code so that I can select whether to compile the implementations as a single unit or as multiple units depending on the value of a single #define. (When set to single unit, Visual Studio still actually compiles the same set of files, but all except one are effectively empty.) I've tried comparing assembly listings, and while I can see they are (very) different, I'm not experienced enough to understand which differences (if any) correspond to the big change in performance.
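The arrangement is something like the following fragment (macro and file names are illustrative, not my actual ones): every implementation file is always compiled, but in the single-unit build all but one of them preprocess to nothing.

```cpp
// config.h (assumed name) -- flip this one #define to switch builds
#define BUILD_SINGLE_UNIT 1

// impl_foo.cpp (one such guard in each implementation file)
#include "config.h"
#if !BUILD_SINGLE_UNIT
// ... implementation lives here only in the multiple-unit build ...
#endif
```

This keeps the Visual Studio project file identical for both configurations, so the only variable is where the code ends up.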
Looking at the vectorization reports in the assembler listings corresponding to the main loop in the implementation, for the single compilation unit version (fast):
; optimization report ; LOOP WAS VECTORIZED ; SIMD LOOP ; VECTORIZATION SPEEDUP COEFFECIENT 5.085938 ; VECTOR TRIP COUNT IS KNOWN CONSTANT ; VECTOR LENGTH 16 ; NORMALIZED VECTORIZATION OVERHEAD 0.187500
And for the multiple units version (slow):
; optimization report ; LOOP WAS VECTORIZED ; SIMD LOOP ; VECTORIZATION SPEEDUP COEFFECIENT 3.699219 ; VECTOR TRIP COUNT IS KNOWN CONSTANT ; VECTOR LENGTH 16 ; NORMALIZED VECTORIZATION OVERHEAD 0.125000
The different speedup coefficient values tie in with the observed behaviour, but why are they different?
Any suggestions gratefully received :)
Thanks Tim. I've turned on optimisation reports, and am currently wading through the output trying to understand what it all means.
I was writing 6502/Z80 assembler ~35 years ago, but have only recently started on IA (and Visual Studio); I'm on a bit of a learning curve at the moment...
Well... the multiple compilation unit build is now as fast as the single unit build. Unfortunately I'm not (yet) sure why.
I don't think it's related to alignment. With the multiple unit build, I can add or remove a '#pragma vector aligned' before the loop, and see from the optimization report that the accesses are aligned or unaligned as expected. But this doesn't affect the performance to a measurable degree.
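For anyone trying the same experiment, this is the kind of thing I mean (a simplified sketch, not my actual code). With icl, '#pragma vector aligned' asserts that every array accessed in the loop is suitably aligned for the vector width, so the compiler can emit aligned loads/stores; the caller must then actually guarantee alignment (e.g. via alignas or _mm_malloc), otherwise the behaviour is undefined.

```cpp
#include <cstddef>

// Sketch: tells the Intel compiler all accesses in the loop are aligned.
// Other compilers ignore the pragma with a warning.
void scale_aligned(float* __restrict data, std::size_t n, float k)
{
#pragma vector aligned
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= k;
}
```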
I think it might be to do with some of my classes. To allow me to experiment with different storage memory layouts (e.g. row or column major), I had defined a base matrix class with pure virtual access functions. This resulted in messages in the optimization report stating that some code couldn't be optimized due to function pointers, so I 'flattened' the matrix classes to only implement specific layouts, with no virtuals.
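To illustrate the kind of change (simplified, with assumed names): the original design was roughly the first shape below, where every element access goes through a virtual call (an indirect call the optimizer reports as a function pointer, blocking inlining and vectorization of the hot loop). The 'flattened' version is a concrete row-major class whose accessor the compiler can inline.

```cpp
#include <cstddef>
#include <vector>

// Before (sketch): pure virtual accessor -> indirect call per element.
struct MatrixBase {
    virtual ~MatrixBase() = default;
    virtual float& at(std::size_t r, std::size_t c) = 0;
};

// After (sketch): concrete row-major layout, non-virtual, inlinable accessor.
class RowMajorMatrix {
public:
    RowMajorMatrix(std::size_t rows, std::size_t cols)
        : cols_(cols), data_(rows * cols) {}
    float& at(std::size_t r, std::size_t c)       { return data_[r * cols_ + c]; }
    float  at(std::size_t r, std::size_t c) const { return data_[r * cols_ + c]; }
private:
    std::size_t cols_;
    std::vector<float> data_;
};
```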
While I'm happy to accept that function pointers might cause problems for optimization, I don't understand why the compiler would handle the original code differently between single and multiple compilation unit builds. That said, it might not have been this that was causing the problem. Now that it's fixed I'm not going to spend more time getting to the bottom of it, although clearly it would be advantageous to fully understand what's going on for future reference. It's on my to-do list.