Mike_C1
Beginner

Any way of overriding /Qax to force alternate code path generation?

I'm compiling a relatively simple maths routine that takes 8-bit unsigned image data from a byte buffer, performs some simple linear floating-point operations on that data, and writes the modified 8-bit data back to the same buffer. There are nested column and row loops to allow for a custom row stride. The routine is written as a template function with two type parameters to test the effect of using different types in the interim calculation (short/int integer calculation, float/double floating-point calculation).

I'm using the Parallel Studio 2015 C++ compiler within Visual Studio 2010, testing on Win7 x64 on an i7-4980HQ 2.8GHz processor, compiling as 32-bit optimized release code.

The code is clearly vectorizable, and in my performance tests I'm seeing a large 30% performance boost when compiling for SSE 4.1 over compiling for SSE2. Compiling for SSE 4.2/AVX/AVX2 gives between the same and 10% worse performance than compiling for SSE 4.1. So it seems that targeting SSE 4.1 is a sweet spot. Great!

However, if I use /Qax to create a processor-specific branch over a default baseline /Qx then the performance is exactly the same as without /Qax, implying that the compiler has decided that the SSE 4.1 processor branch is not worthwhile even though my tests show that it clearly is. Adding #pragma loop_count with a large number of iterations makes no difference.

I can see why it's extremely difficult for the compiler to guarantee that its decisions for /Qax always give the best performance on all CPUs, but it seems like I've got no flexibility when I disagree with the compiler's decision. I can't find any obvious #pragmas or other techniques to make /Qax do what I'd prefer. Any suggestions? Ideally I'd like to be able to specify or override /Qax with a #pragma for more fine-grained control. Can I improve the results with /Qax by avoiding inline or template functions or tweaking my C++ coding style?

 

Kittur_G_Intel
Employee

Hi Mike,

The /Qax option basically lets the compiler generate multiple code paths that are feature-specific to particular processors, in addition to the baseline code path, if there's any performance gain; these are auto-dispatched. It's interesting that you mention the performance is better with SSE4.1 than the others. If you can attach a small test case that shows the performance degradation, I can file the issue with the developers accordingly.

Note that by default the baseline code path is SSE4.2, which of course you can change using the /Qx option if needed; that will be the minimum needed for that code path to be able to execute. I don't know of any option to override the compiler to generate a different code path with /Qax other than the values it permits for <x>. Yes, you can improve the results by tweaking the C++ code using several optimization methods, including improving vectorization, IPO and so on, depending on the context of your application.

I'll check with my peers on your question as well and ask them to respond if there's more valuable input. Also, you can search the Intel Developer Forum (https://software.intel.com/en-us) for various knowledge base articles on optimizing with Intel compilers. A few that might be of interest:
  - http://software.intel.com/en-us/articles/getting-started-with-intel-cilk-plus-array-notations
  - https://software.intel.com/en-us/articles/how-to-compile-for-intel-avx/
  - http://software.intel.com/en-us/articles/vectorization-essentials
  - http://software.intel.com/en-us/intro-to-vectorization-using-intel-cilk-plus
  - http://software.intel.com/en-us/code-samples/intel-c-compiler/

_Kittur

Mike_C1
Beginner

Timings and code:

The timings when compiled for various processors on the main code path are below. These timings are for T=short, FLOAT=float in the template arguments, which was the fastest combination of types. The AVX and AVX-I results were noticeably worse than SSE 4.1 or 4.2; AVX2 is almost as fast. The code is single-threaded and non-parallel at present, with no code running in other threads for this test, but other threads would be highly active in real-world usage. Timings averaged over many runs using QueryPerformanceCounter.

The buffer is width=1920, height=1080, step=2 * 1920, buffer size = 2 * 1920 * 1080 bytes. The input buffer is filled by srand() then rand() % 256 for each pixel rather than operating on a zeroed buffer (in case that makes a difference).

Leaving the /Qx option blank and supplying /Qax for SSE4.1 gives timing results similar to compiling for SSE2 only. All of these command-line options were set using the Visual Studio 2010 integration but compiled using the Intel compiler, obviously (not the MS compiler, which was much slower).

Code pasted below the timings. Any hints for making it go even faster gratefully received. It would be useful if I were able to use /Qax to get SSE 4.1 performance with backward compatibility to older CPUs.

I'll read the compiling-for-AVX article carefully. Any chance that matrix primitives in IPP or MKL would do a faster job even though the inputs and outputs are bytes?

Thanks,

 

SSE2 4.5ms

SSE3 4.3ms

SSSE3 4.0ms

SSE4.1 2.9ms

SSE4.2 2.9ms

AVX 3.5ms

AVX-I 3.5ms

AVX2 3.1ms

 

template <typename T, typename FLOAT> void Process(BYTE *pIn, int iInStep, int iWidth, int iHeight)
{
	BYTE *pInRow = pIn;
	const int numBlocks = iWidth / 2;

	for( int iRow=0; iRow<iHeight; iRow++ ) {
		BYTE * pix = pInRow;

		for (int i=0; i<numBlocks; i++) {
			const T cb    = pix[1] - 128;
			const T cr    = pix[3] - 128;
			const T y     = - ((FLOAT)0.115550) * cb - ((FLOAT)0.207938) * cr;
			T cbOut       =   ((FLOAT)1.018640) * cb + ((FLOAT)0.114618) * cr;
			T crOut       =   ((FLOAT)0.075049) * cb + ((FLOAT)1.025327) * cr;
			cbOut        += 128;
			crOut        += 128;
			const T y0    = pix[0] + y;
			const T y1    = pix[2] + y;

			pix[0] = y0	> 255 ? 255 : (y0 < 0 ? 0 : y0);
			pix[2] = y1	> 255 ? 255 : (y1 < 0 ? 0 : y1);
			pix[1] = cbOut	> 255 ? 255 : (cbOut < 0 ? 0 : cbOut);
			pix[3] = crOut	> 255 ? 255 : (crOut < 0 ? 0 : crOut);

			pix += 4;
		}
		pInRow  += iInStep;
	}
}

 

TimP
Black Belt

You're probably aware that 32-byte data alignment (as well as larger loop counts) may be required to gain an advantage from AVX2.  If the arrays aren't set up in the same compilation unit, you might expect to need alignment assertions to take advantage of alignment.  This ought to be discussed in one or more of the references Kittur gave.  If not, search further, e.g.

https://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-c...

In fact, 32-byte alignment sometimes shows an advantage even with SSE.

Prior to AVX2, there wasn't sufficient instruction level support for 8- and 16-bit integers to use AVX rather than SSE4.

SSE4.1 used to show a few significant advantages during unaligned moves over SSE4.2 on CPUs up through Westmere, but I'm not surprised that your tests show no difference on Haswell.

If you are generating code to run on both AVX2 and SSE4.1, you would try options like

/QaxCORE-AVX2 /QxSSE4.1

which sets SSE4.1 as the ISA when AVX2 support isn't detected.  I think this is what Kittur had in mind; it ought to overcome the problem if the compiler sees no advantage in SSE4 over SSE2 (by telling the compiler we don't want SSE2).

Mike_C1
Beginner

Thanks for the points about alignment.

I forgot to say that in the results above I'm already aligning the buffer on a 64-byte boundary. Adding #pragma vector aligned to both loops makes no difference to performance. In fact, un-aligning by 1 byte makes no difference either.

Converting the code to a non-inline non-template function also makes no difference.

My ideal would be /QxSSE2 /QaxSSE4.1 for best performance with backward compatibility, but the compiler seems to decide for me that there's no performance benefit from an SSE 4.1 branch and I can't force it. It would be much more flexible if alternate code paths could be specified or overridden by #pragma.

Is it common that compiling for a higher-spec CPU results in slower code? Should I expect to have to test compiling for all CPU variants and manually set the instruction set that works fastest, rather than leaving it to the compiler to decide?

 

TimP
Black Belt

It's common enough to see a performance deficit in one of the newer ISA options that it may be worth checking both performance and opt-report comparisons.  A case was reported recently on the C++ forum where pragma directed vectorization and alignment were needed to realize a gain rather than a loss by invoking AVX2.  Unfortunately, that slogan about directed vectorization also requires testing each case to see whether the directive helped or hindered, and whether a directive which helps one compiler works with another.

Kittur_G_Intel
Employee

Thanks, Tim, for responding to several of Mike's questions.  Mike, I've also passed your feedback on to the product team. As mentioned before, presently the only way to generate multiple code paths is to use the /Qax option and add the /QxSSE4.1 option as well, so the compiler is forced to execute SSE4.1 when AVX2 support isn't there on the system.  As far as I'm aware, there's no pragma or such to force otherwise.  Like Tim mentioned, it's good to check the opt reports and compare to see what's going on.  Mike, is it possible for you to file the issue at https://premier.intel.com and attach a test reproducer as well, so an engineer can triage, communicate directly with you, and get any other info if needed?

_Kittur

 

 

Mike_C1
Beginner

Hi,

I've checked out the optimization report file (annoyingly, I can't get any stats to show in the Visual Studio 2010 compiler optimization report window, though).

If the loop costs are directly comparable between these reports then it looks like the Intel compiler's estimate of the vector loop cost for AVX and AVX2 is much more optimistic than the results on my CPU.

Using /QxSSE2  /QaxSSE4.1 generates an identical optimization report to /QxSSE2 so it looks like the compiler doesn't give any information about its decision process for /Qax in the diagnostic output.

/QxSSE2
remark #15300: LOOP WAS VECTORIZED
remark #15460: masked strided loads: 4 
remark #15462: unmasked indexed (or gather) loads: 4 
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 92 
remark #15477: vector loop cost: 81.500 
remark #15478: estimated potential speedup: 1.120 
remark #15479: lightweight vector operations: 88 
remark #15481: heavy-overhead vector operations: 2 
remark #15487: type converts: 17 
remark #15488: --- end vector loop cost summary ---

 

/QxSSE4.1
remark #15300: LOOP WAS VECTORIZED
remark #15460: masked strided loads: 4 
remark #15462: unmasked indexed (or gather) loads: 4 
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 92 
remark #15477: vector loop cost: 32.000 
remark #15478: estimated potential speedup: 2.800 
remark #15479: lightweight vector operations: 89 
remark #15481: heavy-overhead vector operations: 1 
remark #15487: type converts: 9 
remark #15488: --- end vector loop cost summary ---

/QxAVX
remark #15300: LOOP WAS VECTORIZED
remark #15460: masked strided loads: 4 
remark #15462: unmasked indexed (or gather) loads: 4 
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 92 
remark #15477: vector loop cost: 28.250 
remark #15478: estimated potential speedup: 3.160 
remark #15479: lightweight vector operations: 89 
remark #15481: heavy-overhead vector operations: 1 
remark #15487: type converts: 9 
remark #15488: --- end vector loop cost summary ---

/QxCORE-AVX2
remark #15300: LOOP WAS VECTORIZED
remark #15460: masked strided loads: 4 
remark #15462: unmasked indexed (or gather) loads: 4 
remark #15475: --- begin vector loop cost summary ---
remark #15476: scalar loop cost: 95 
remark #15477: vector loop cost: 24.000 
remark #15478: estimated potential speedup: 3.530 
remark #15479: lightweight vector operations: 89 
remark #15481: heavy-overhead vector operations: 1 
remark #15487: type converts: 9 
remark #15488: --- end vector loop cost summary ---

 

Mike_C1
Beginner

Have also regression tested this and found the performance results are pretty much the same in the latest update of Parallel Studio 2013.

TimP
Black Belt

The cost assessments evidently are best case predictions, so appear reasonably consistent with your determination that SSE4.1 is favored over AVX.  It does seem that SSE4.1 should not be suppressed for vectorization if you asked for AVX plus SSE4.1 and SSE2 code generation.

Kittur_G_Intel
Employee

I agree with Tim that SSE4.1 should not be suppressed for vectorization and needs to be looked at.

> it looks like the compiler doesn't give any information about its decision process for /Qax in the diagnostic output.

Hi Mike, firstly the build log should clearly say whether /Qax resulted in the generation of the second code path (unfortunately, the message presently doesn't go into the opt report file). Also, in the opt report some messages get duplicated for both code paths; some (such as “LOOP WAS VECTORIZED”) do not, and those appear in a separate section that applies to both paths (see example below). I discussed this issue with my peer (who's an expert at this) and here's more info from that conversation. There is actually a manual CPU dispatch feature which you can use to control what gets generated (which I forgot to mention in my previous responses).

You can compile one function version with /QxSSE4.1, the other with /arch:SSE2, etc. Refer to the user manual under "Targeting processors manually". There you can see how __declspec(cpu_dispatch(cpuid, cpuid, ...)) and __declspec(cpu_specific(cpuid)) can be used in your code to declare each function version targeted at a particular type of processor. There are several examples in that section on how to use the cpu_dispatch and cpu_specific keywords to create function versions for different processors.

In particular, you don't need separate code for every possible processor type. BTW, the “SSE4.1” version will be used for all processors that support SSE4.2 and above; the SSE2 version will be used for all others.

Refer to a nice article on this at: 
    https://software.intel.com/en-us/articles/how-to-manually-target-2nd-generation-intel-core-processor..., which includes an example that uses keywords such as core_2_duo_sse4_1 and generic.

It's also possible to create processor-specific SIMD-enabled functions (vector functions), but it doesn't look like you're using these anyway. BTW, note that the optimized code path is taken only on Intel processors; non-Intel processors always take the default code path.

BTW, here’s a simple example that shows the opt report when two code paths are generated: 
-----------------------------------------
void dispatch_test(double v, double * c) {
  __assume_aligned(c,32);
  for (int i=0; i<8192; i++) c[i] *= v;
}

$ icc -opt-report-file=stderr -opt-report-phase=vec -axavx -c dispatch_test.cpp

Begin optimization report for: dispatch_test(double, double *)

    Report from: Vector optimizations [vec]

LOOP BEGIN at dispatch_test.cpp(3,3)
   remark #15399: vectorization support: unroll factor set to 4
   remark #15300: LOOP WAS VECTORIZED
LOOP END
dispatch_test.cpp(1): (col. 42) remark: _Z13dispatch_testdPd has been targeted for automatic cpu dispatch
------------------------------------------

Kittur_G_Intel
Employee

Mike, BTW, there's #pragma intel optimization_parameter target_arch, which you can use to flag functions that you want executed on particular processors. Again, you'll have to do this manually in the code, so there's no single pragma you can use for the entire application other than the options we discussed earlier. This pragma overrides the command-line options for the routines you have targeted. You can look at the user manual for more info.

Regards, Kittur

Kittur_G_Intel
Employee

Mike, BTW, my peer has filed an issue with our developers to make the opt-report clearer when it comes to manual dispatch logs, and I'll let you know when the release with the fix is out. Also, it would be helpful if you could file an issue with a reproducer for the case where AVX is slower than SSE (there could be loops that turn out to be short at run time) so our developers can investigate accordingly. Again, alignment helps, but you already say that's taken care of.

Thanks,
Kittur

Mike_C1
Beginner

Most of the relevant code to reproduce is above but I can add a small amount of wrapper code to make it self-contained. How should I file the report - on this forum?

I've verified that manual dispatch will work for me but it seems to require copying and pasting the same implementation for each cpu_specific function, which is pretty ungainly. Calling a separate function, even an inline one, will presumably result in the code being compiled as default rather than per the cpu_specific directive; calling my template from cpu_specific implementations seems to fail to call the code entirely. The same applies to #pragma optimization_parameter AFAICT.

Is there any way of manually dispatching to the same code compiled for different instruction sets without copying and pasting or other ugly hacks (e.g. #including a code fragment from a separate file or putting the code in a macro)?

Kittur_G_Intel
Employee

Hi Mike,

Thanks for your feedback, I am awaiting a response from our dev team and will update you on your question accordingly. 

With regard to filing the report, the forum is fine too if the reproducer is not very large. You can also file at https://premier.intel.com, where a support engineer communicates with you directly and gets any further information required, and the test attachment can be large. The content and communication are secure and confidential there as well. You can select the product and file the issue with the reproducer attachment. Let me know if you need any further clarification.

Regards,
Kittur

Mike_C1
Beginner

Posting the reproducer in this post as I'm not allowed access to premier.intel.com, AFAICT.

This uses the intel timer utility class from https://software.intel.com/en-us/code-samples/intel-c-compiler/utilities/Timer-Utility

Timing results in milliseconds on my system when changing /Qx to compile for various targets are below. sse4.1 and sse4.2 are significantly faster than all other instruction-set targets, including avx, avx-i and avx2. Input buffer aligned on a 64-byte boundary.

sse2 4.328

sse3 4.336

ssse3 4.146

sse4.1 2.942

sse4.2 2.944

avx 3.597

avx-i 3.599

avx2 3.102

sse2 with /Qax for sse4.1 produces the same results as sse2 alone

I'm using the Parallel Studio 2015 C++ compiler within Visual Studio 2010, testing on Win7 x64 on an i7-4980HQ 2.8GHz processor, compiling as 32-bit optimized release code at the /O3 optimization level. Parallel Studio 2013 results are the same.

 

TimP
Black Belt

I might have mentioned that the compiler estimates of best possible timing probably don't assume the same hardware.  That could account for part of the lower gain (or loss) when choosing a newer instruction set.   Some of the gain attributable to the new CPU could be achieved without going to the newer instruction set.

I don't know whether the compiler is sensitive to the exact order of option invocation; I didn't see your exact command line for requesting 3 code paths (e.g. AVX + SSE4 + SSE2).

Mike_C1
Beginner

Order of the command line options makes no difference when I change it.

/QxSSE2 /QaxSSE4.1 in either order gives SSE2 performance. 

/QxSSE4.1 /QaxCORE-AVX2 in either order gives AVX2 performance, which is 6% slower than SSE 4.1 code on my system; that's the best compromise for now and may be faster on different hardware. Will post results when I get a chance to test other AVX2 systems. Adding other /Qax paths still prefers AVX2.

So it seems that the /Qax option works fine, but the compiler is very biased against SSE 4.1 code despite it being the fastest on my system, and prefers generating both SSE2 and AVX2 code over SSE 4.1 code when given the choice. As my CPU is relatively new, perhaps the compiler's cost models don't reflect its performance characteristics so well.

 

Kittur_G_Intel
Employee

Hi Mike,

Thanks for attaching the test reproducer. I'll try it out and file the issue with the developers. BTW, I'm still awaiting a response from our developers on whether there's any other way to have the cpu_specific directive with multiple targets tagged to a single function, and will let you know as soon as I have an update.

Regards, Kittur 

Kittur_G_Intel
Employee

Hi Mike,

BTW, presently with the latest compiler you should be able to add multiple targets with the cpu_specific directive attached to the same routine.  I'm testing that as well and it seems to work for me, but I'm doing more relevant testing. Please try it out and let me know. I'll file an issue with the user documentation to add that change to cpu_specific accordingly. Let me know after you try cpu_specific with multiple targets tagged to the same routine, so you no longer have to cut and paste the same code for different targets for the same function, thanks.

_kittur

Mike_C1
Beginner

Thanks for the response. Using multiple cpu_specific tags looks like it would be a good workaround, but I'm not having much success with using multiple cpu_specific tags on a single function.

I'm getting the error message "internal error: 010101_46028" in the diagnostic output with a seemingly nonsensical line and column number in the middle of 'char' on the first line of the function.

What syntax are you using? Perhaps I've got my own syntax wrong. I'm currently using the syntax below:

__declspec(cpu_dispatch(generic, core_2_duo_sse4_1)) void Process(unsigned char *pIn, int iInStep, int iWidth, int iHeight) {};
 
__declspec(cpu_specific(generic, core_2_duo_sse4_1)) 
void Process(unsigned char *pIn, int iInStep, int iWidth, int iHeight)
{

  // code as before

}

Incidentally, generating a diag file (or perhaps it's some other aspect of diagnostic output) seems to suppress detailed error output to the command line, which makes the Intel compiler a pain to use in Visual Studio - you have to go manually trawling through the output logs looking for the actual error message.

I'm also not seeing any output in the compiler inline report and compiler optimization report windows although I've enabled all the diagnostic output AFAICT. I could look in ProcMon to see if the Visual Studio integration is looking in the wrong files for the data in this window - any hints as to where these windows should be getting their information from?

 
