Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Vector dependencies

piet_de_weer
Beginner
582 Views

I'm trying to squeeze the last bit of optimization out of a program using Intel C++ 10.1 (because with later versions I'm getting slower code - I'll look into that later).

When looking at the vectorization reports, I noticed two things I hadn't expected, and I wonder if they can be solved without rewriting lots of code (the total code base is over 2 MB and I'm working on it alone). I've tried to google them but didn't find any useful answers.

This one seems to be the most important:

fft_abs_sse2[2*cc] = max(fft_abs_sse2[2*cc], strength * m);

.\Clip1Ch.cpp(1999): (col. 13) remark: vector dependence: proven ANTI dependence between fft_abs_sse2 line 1999, and fft_abs_sse2 line 1999.
.\Clip1Ch.cpp(1999): (col. 13) remark: vector dependence: proven ANTI dependence between fft_abs_sse2 line 1999, and fft_abs_sse2 line 1999.
.\Clip1Ch.cpp(1999): (col. 13) remark: vector dependence: proven FLOW dependence between fft_abs_sse2 line 1999, and fft_abs_sse2 line 1999.
.\Clip1Ch.cpp(1999): (col. 13) remark: vector dependence: proven FLOW dependence between fft_abs_sse2 line 1999, and fft_abs_sse2 line 1999.
.\Clip1Ch.cpp(1999): (col. 13) remark: vector dependence: proven ANTI dependence between fft_abs_sse2 line 1999, and fft_abs_sse2 line 1999.
...

I know there's an _mm_max_ SIMD instruction, so I expected this to vectorize. The problem might be the definition of max; I'm using:
#define max(a,b) (((a)>(b)) ? (a) : (b))
The compiler might treat this as an if statement if it's unable to optimize everything out. Is there a better definition of max that doesn't cause the compiler to see dependencies where there are none?

Another situation that occurs very frequently in my code is this:

for (int c=0; c<f1; c++)
{
    temp[2*c] *= one_DIV_bass_static_clip_level_dynamic;
    temp[2*c+1] *= one_DIV_bass_static_clip_level_dynamic;
}

Clearly, there are no dependencies between temp[2*c] and temp[2*c+1], but the compiler thinks otherwise:

.\Clip1Ch.cpp(797): (col. 9) remark: loop was not vectorized: existence of vector dependence.
.\Clip1Ch.cpp(800): (col. 13) remark: vector dependence: proven FLOW dependence between temp line 800, and temp line 799.
.\Clip1Ch.cpp(800): (col. 13) remark: vector dependence: proven ANTI dependence between temp line 800, and temp line 799.
.\Clip1Ch.cpp(800): (col. 13) remark: vector dependence: proven OUTPUT dependence between temp line 800, and temp line 799.

I think if these two situations are solved at least 50% of the loops that currently don't get vectorized will be. Your help is greatly appreciated :)

8 Replies
TimP
Honored Contributor III
Why not simply use std::max() rather than macro replacement? If you didn't set /Qansi-alias, does that option help (assuming that you don't violate the standard's aliasing rules somewhere)?
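A minimal sketch of what the std::max() version of the loop might look like. The function name, the float element type, and the single `level` parameter (standing in for `strength * m`, assumed here not to change inside the loop) are illustrative assumptions, not taken from the original code. Whether a given compiler version vectorizes it still has to be checked in the vectorization report.

```cpp
#include <algorithm>  // std::max
#include <cstddef>    // std::size_t

// Hypothetical helper: raise every even-indexed element to at least `level`.
// `data` must hold at least 2 * n floats; odd-indexed elements are untouched.
void raise_even_elements(float* data, std::size_t n, float level)
{
    for (std::size_t cc = 0; cc < n; ++cc) {
        // Both arguments are float, so std::max<float> is deduced
        // and no macro expansion is involved.
        data[2 * cc] = std::max(data[2 * cc], level);
    }
}
```

Because std::max is a function template, its arguments are evaluated exactly once, unlike the macro, which repeats the expression text of whichever argument is selected.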
SergeyKostrov
Valued Contributor II
>>...I'm trying to squeeze the last bit of optimization out of a program...

In your code:

fft_abs_sse2[2*cc] = max( fft_abs_sse2[2*cc], strength * m ); // (A)

there is no need for the max macro; a single if statement is all that is needed instead of the if-else "hidden" in the max macro. Take a look in a disassembler at what the max macro compiles to, unless it is already optimized by the C++ compiler (Intel or Watcom, for example).

>>...remark: vector dependence: proven ANTI dependence between fft_abs_sse2 line 1999, and fft_abs_sse2 line 1999...

This message tells you that there is a case where fft_abs_sse2[2*cc] is assigned to fft_abs_sse2[2*cc], that is, to itself (!). These two comments relate to the same problem in your code (A).

>>...Clearly, there are no dependencies between temp[2*c] and temp[2*c+1], but the compiler thinks otherwise...

What do you think about the 2*c value?
SergeyKostrov
Valued Contributor II
>>...Why not simply use std::max() rather than using macro replacement?

Please try it; it would be nice if you posted the compiler's output. Thanks in advance.
jimdempseyatthecove
Honored Contributor III
Your prior code is trying to improve performance by manually unrolling a loop. In your case it seems to have defeated its intended purpose:

>>
for (int c=0; c < f1; c++)
{
    temp[2*c] *= one_DIV_bass_static_clip_level_dynamic;
    temp[2*c+1] *= one_DIV_bass_static_clip_level_dynamic;
}
<<

int f1x2 = f1 * 2;
for (int c=0; c < f1x2; c++)
{
    temp[c] *= one_DIV_bass_static_clip_level_dynamic;
}

Jim Dempsey
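A small sketch for checking that the flattened loop touches exactly the same elements as the two-statements-per-iteration form. The function names, the float element type, and the `gain` parameter (standing in for `one_DIV_bass_static_clip_level_dynamic`) are assumptions made for illustration.

```cpp
#include <cstddef>  // std::size_t

// Interleaved form from the question: one iteration per left/right frame.
void scale_interleaved(float* temp, std::size_t frames, float gain)
{
    for (std::size_t c = 0; c < frames; ++c) {
        temp[2 * c]     *= gain;  // even index (e.g. left sample)
        temp[2 * c + 1] *= gain;  // odd index (e.g. right sample)
    }
}

// Flattened form: one unit-stride pass over all 2 * frames samples.
void scale_flat(float* temp, std::size_t frames, float gain)
{
    const std::size_t count = 2 * frames;
    for (std::size_t i = 0; i < count; ++i) {
        temp[i] *= gain;
    }
}
```

Both functions multiply indices 0 through 2 * frames - 1 by the same factor, so they are interchangeable whenever both channels get identical treatment. The flat version gives the compiler a single unit-stride access to analyze.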
SergeyKostrov
Valued Contributor II
Hi Jim, are you absolutely sure that he wanted to unroll the loop?

>>...
>>for ( int c=0; c < f1; c++ )
>>{
>>temp[2*c] *= one_DIV_bass_static_clip_level_dynamic;
>>temp[2*c+1] *= one_DIV_bass_static_clip_level_dynamic;
>>}
>>...

Ideally, it should look like:

for ( int c=0; c < f1; c+=2 )
{
    temp[2*c] *= one_DIV_bass_static_clip_level_dynamic;
    temp[2*c+1] *= one_DIV_bass_static_clip_level_dynamic;
}

and a check that ( f1 % 2 ) equals 0 must be done; if it is not 0, additional processing needs to be done.
piet_de_weer
Beginner
Sorry for the delay in responding - I've been releasing a new software version and I've been extremely busy.

My goal wasn't to unroll a loop here (I would have expected the compiler to vectorize it); I was processing audio encoded as left/right/left/right/... It's definitely possible to rewrite the code and remove what looks like unrolling, but the current code better reflects the intention. I had expected the compiler to be able to figure this out (2*c and 2*c+1 with c increasing by 1 every step isn't rocket science), but apparently it can't.

About #define max vs. std::max (I didn't even know that existed... oops): no difference in behavior. But I did notice something else: if I *only* put something like a = max(a, b) in a loop, it does vectorize. However, if I put more code around it, it seems to get too complex for the compiler and it starts to complain about dependencies (a vs. a).
piet_de_weer
Beginner
Oh, and /Qansi-alias is set. It had quite a big effect in the past, when I was still using my own hand-optimized SSE2 FFT implementation, but since I switched to IPP it makes no difference anymore; the only place where it had a noticeable effect on performance was my FFT code.
SergeyKostrov
Valued Contributor II
But did you try this:

>>...fft_abs_sse2[2*cc] = max( fft_abs_sse2[2*cc], strength * m );

without the max macro, using an if statement instead?
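One possible reading of the if-statement suggestion, as a hedged sketch: the element is loaded once into a local and stored back only when it has to change. The function name, the float element type, and `level` (standing in for `strength * m`, assumed loop-invariant) are hypothetical. This is not a claim that this form vectorizes under Intel C++ 10.1; the vectorization report would need to confirm it.

```cpp
#include <cstddef>  // std::size_t

// Hypothetical helper: same result as data[2*cc] = max(data[2*cc], level),
// written with a single if and no else branch.
void raise_even_elements_if(float* data, std::size_t n, float level)
{
    for (std::size_t cc = 0; cc < n; ++cc) {
        const float current = data[2 * cc];  // read the element exactly once
        if (current < level) {
            data[2 * cc] = level;            // store only when the value changes
        }
    }
}
```

Compared with the macro, the array element appears only once on the read side, which makes it easy to see in the source that each iteration reads and writes the same single location.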