Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Any way of overriding /Qax to force alternate code path generation?

Mike_C1
Beginner

I'm compiling a relatively simple maths routine that takes 8-bit unsigned image data from a byte buffer, performs some simple linear floating-point operations on that data and writes the modified 8-bit data back to the same buffer. There are nested column and row loops to allow for a custom row stride. The routine is written as a template function with two type parameters so that I can test the effect of using different types in the interim calculation (short/int for integer calculation, float/double for floating-point calculation).
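For readers following along, a routine of the kind described above might look roughly like the sketch below. This is not the original code: the names `transform_in_place` and `clamp_to_byte`, the gain/offset form of the linear operation, and the choice of what the two template parameters stand for are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: saturate an interim value to the 0..255 byte range.
// Fractional parts are truncated, not rounded.
template <typename Calc>
inline std::uint8_t clamp_to_byte(Calc v) {
    if (v < Calc(0)) return 0;
    if (v > Calc(255)) return 255;
    return static_cast<std::uint8_t>(v);
}

// Hypothetical routine: Calc is the interim calculation type (short, int,
// float, double) and Coef is the coefficient type. Both are assumptions.
// Applies out = in * gain + offset in place, honouring a custom row stride.
template <typename Calc, typename Coef>
void transform_in_place(std::uint8_t* buf, int width, int height,
                        std::ptrdiff_t stride, Coef gain, Coef offset) {
    assert(buf != nullptr && stride >= width);
    for (int y = 0; y < height; ++y) {
        std::uint8_t* row = buf + y * stride;  // start of row y
        for (int x = 0; x < width; ++x) {      // inner loop is the vectorization candidate
            Calc v = static_cast<Calc>(static_cast<Calc>(row[x]) * gain + offset);
            row[x] = clamp_to_byte(v);
        }
    }
}
```

A call such as `transform_in_place<float>(buf, width, height, stride, 2.0f, 1.0f)` would use float for the interim calculation; bytes between `width` and `stride` in each row are left untouched.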

I'm using the Parallel Studio 2015 C++ compiler within Visual Studio 2010, testing on Win7 x64 on an i7-4980HQ 2.8GHz processor and compiling the code as 32-bit optimized release code.

The code is clearly vectorizable, and in my performance tests I'm seeing a large 30% performance boost if I compile for SSE 4.1 rather than SSE2. Compiling for SSE 4.2/AVX/AVX2 gives anywhere from the same performance to 10% worse than compiling for SSE 4.1. So it seems that targeting SSE 4.1 is a sweet spot. Great!

However, if I use /Qax to create a processor-specific branch over a default baseline /Qx, then the performance is exactly the same as without /Qax, implying that the compiler has decided that the SSE 4.1 processor branch is not worthwhile, even though my tests show that it clearly is. Adding #pragma loop_count with a large number of iterations makes no difference.

I can see why it's extremely difficult for the compiler to guarantee that its decisions for /Qax always give the best performance on all CPUs, but it seems like I've got no flexibility when I disagree with the compiler's decision. I can't find any obvious #pragmas or other techniques to make /Qax do what I'd prefer. Any suggestions? Ideally I'd like to be able to specify or override /Qax with a #pragma for more fine-grained control. Can I improve the results with /Qax by avoiding inline or template functions or by tweaking my C++ coding style?

 

26 Replies
KitturGanesh
Employee

Hi Mike,

Just letting you know I am still investigating why my small test case worked :-( Yes, specifying it the way you've done (which is the correct syntax) won't work. BTW, I am still discussing this with our developers (I have already filed issue DPD200361474 with them) and will update you once I have clarified a few things. I appreciate your patience until then.

_Kittur

KitturGanesh
Employee

Hi Mike,

OK, here's the workaround that works:

"If the dispatch definition comes first, it fails. If it comes after all the cpu_specific definitions, it works." So moving the dispatch function after all the cpu_specific lines works (as below for Linux; on Windows you should use __declspec() as you did before):

--------------------------

__attribute__((cpu_specific(core_2nd_gen_avx, core_i7_sse4_2))) 
void dispatch_func() {
  printf("\nCode for 2nd generation Intel Core processors with support for AVX and SSE4_2 goes here\n");
}

__attribute__((cpu_dispatch(core_2nd_gen_avx, core_i7_sse4_2))) void dispatch_func() {};     

--------------------------

Please try the above workaround and let me know. As I mentioned before, an issue has been filed with the developers on this, FYI.

_Regards, Kittur 

KitturGanesh
Employee

Mike, on Windows I tried the following (moving the dispatch function after cpu_specific) and it works:

------------

#include<stdio.h>

__declspec(cpu_specific(generic, core_2_duo_sse4_1)) 
void Process()
{
  printf("code for sse41 and generic here\n");
}

__declspec(cpu_dispatch(generic, core_2_duo_sse4_1)) void Process() {};

int main() {
  Process();
  return 0;
}

-----------

 

KitturGanesh
Employee

Hi Mike,

With the test reproducer, I could only reproduce the case where SSE4.1 was faster than SSE4.2, but AVX was the fastest of all. Anyway, I've filed an issue with the developers for that as well (DPD200545938) and will update you as soon as I have news.

_Kittur

KitturGanesh
Employee

Mike, regarding your question about the compiler's decision to go with SSE4.2 instead of, say, SSE4.1: suppose you have multiple targets specified with cpu_specific(core_2_duo_sse4_1, core_i7_sse4_2) for a function or the whole program. The compiler will generate the corresponding functions (one for SSE4.1 and one for SSE4.2), but at run time it always checks the highest level supported by the system the application is running on and uses that function (so in this case it'll be SSE4.2). For the most part, performance should be at least as good as with the earlier instruction set, but depending on the context of the application there can sometimes be a penalty (for example, when switching from SSE to AVX); that needs investigation, and an issue should be filed accordingly.

Of course, if you find that, say, SSE4.1 is faster than the others, you can just use /Qx or /Qax with that target for the function or the whole program. That said, given more than one option attached to a function or the program, the compiler will select the highest level available on the system.
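The selection rule described here can be pictured with a small, portable sketch. It is only a model of the behaviour as explained in this post, not the compiler's actual dispatch code; the `Isa` enum, its ordering, and `pick_variant` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <vector>

// Illustrative ordering of instruction-set levels (an assumption for this
// sketch, not a compiler API). Higher enumerators mean newer extensions.
enum class Isa { Generic = 0, SSE2, SSE41, SSE42, AVX, AVX2 };

// Model of "highest available level wins": among the variants that were
// built, return the highest one the running CPU supports, falling back
// to Generic when none qualifies.
inline Isa pick_variant(const std::vector<Isa>& variants, Isa cpu_max) {
    Isa best = Isa::Generic;
    for (Isa v : variants) {
        if (v <= cpu_max && v > best) best = v;
    }
    return best;
}
```

Under this model, building SSE4.1 and SSE4.2 variants and running on an AVX2-capable CPU picks SSE4.2, whereas building only generic and SSE4.1 variants leaves SSE4.1 as the highest match. That is why restricting the list of targets is the practical way to steer the choice.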

Regards, Kittur  

KitturGanesh
Employee

Hi Mike, did you try moving the dispatch function after all the cpu_specific lines? It should work for now until the release with the fix is out.

Thanks, Kittur
