[beginner question] SSE2 / CPU and dynamic/static arrays (C) performance

vkeller · 03-10-2011

Dear colleagues,

I'm a little confused by the performance I get with different versions of the Intel compiler (on Intel architectures). In the past, I ran quite a lot of performance tests comparing dynamically and statically declared arrays in C, and my conclusion was that the performance is similar. Today I want to go a step further and vectorize the code. I tried a simple operation, the full 2D matrix-matrix multiplication in float, double and long double precision. I expected to get Perf(float) = 2*Perf(double) = 4*Perf(long double) by using the SSE (128-bit) hardware. Right or not?

So, if I declare the 2D arrays as static, that is exactly what I expected and what I measure (icc -O3 -align -axSSE4.2 -vec-report3) with the latest icc compiler (12.0.2 20110112). On the other hand, if I declare them as dynamic, the compiler reports FLOW and ANTI dependences and the performance drops by a factor of 4 (for floats) on my Linux box.

[cpp]...

#define ALIGN_16 __attribute__((aligned(64)))
#define _aligned_free(a) _mm_free(a)
#define _aligned_malloc(a, b) _mm_malloc(a, b)

int main(int argc, char *argv[])
{

	float **SAp ALIGN_16, **SBp ALIGN_16, **SCp ALIGN_16;

	SAp =  _aligned_malloc(sizeX * sizeof(float*), 16);
	SBp =  _aligned_malloc(sizeX * sizeof(float*), 16);
	SCp =  _aligned_malloc(sizeX * sizeof(float*), 16);

	for(i=0; i<sizeX; i++){
		SAp[i] =  _aligned_malloc(sizeX * sizeof(float), 16);
		SBp[i] =  _aligned_malloc(sizeX * sizeof(float), 16);
		SCp[i] =  _aligned_malloc(sizeX * sizeof(float), 16);
	}

	start_perf_counters();
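	for (k=0; k<sizeX; k++){
		for (i=0; i<sizeX; i++){
			for (j=0; j<sizeX; j++){
				/* assumed indexing: C[i][j] += A[i][k]*B[k][j] */
				SCp[i][j] = SCp[i][j] + SAp[i][k]*SBp[k][j];
			}
		}
	}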
	stop_perf_counters();

	for(i=0; i<sizeX; i++){
		_aligned_free(SAp[i]);
		_aligned_free(SBp[i]);
		_aligned_free(SCp[i]);
	}
	_aligned_free(SAp);
	_aligned_free(SBp);
	_aligned_free(SCp);
}[/cpp]

My question(s): Is the above pseudo-code correct for SSE vectorization, and if not, what is wrong? Is there a way to declare 2D ** arrays so that the alignment in memory is respected and the performance is therefore similar to what is measured with static arrays (that is, so the compiler does not "assume" ANTI/FLOW dependences)? If not, is there another hint or trick to "transform" static arrays into dynamic ones without a loss of performance? (The final goal is to optimize real-world apps.)

Many thanks in advance
Vince

TimP · 03-10-2011

As you're experimenting primarily with icc options, you could ask questions about that on the icc forum.
Did you omit the option -ansi-alias intentionally? Without that option, the compiler doesn't take advantage of the rules about type-dependent aliasing, so you would be better off using a recent gcc. I suppose the reason is that this rule is often violated in legacy Microsoft code. In this case, it would confirm your intention that there should be no aliasing among float, float *, and float ** data. The alternative "big hammer" is to set #pragma simd ahead of your inner loop.
Pseudo-code doesn't help, when the questions are about the actual implementation.
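
To make the aliasing point concrete, here is a minimal sketch, not the original poster's code. It assumes the goal is to keep the `m[i][j]` syntax while telling the compiler that the rows do not overlap. The helper names `alloc_matrix`, `free_matrix` and `matmul_acc` are hypothetical, and C11 `aligned_alloc` stands in for the thread's `_mm_malloc` so that the block only needs the standard library. Each matrix is one aligned block plus a table of row pointers, and the inner loop works on `restrict`-qualified row pointers, which is one way of stating the "no aliasing" intent in the source itself.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Allocate an n x n float matrix as ONE zero-filled, 16-byte-aligned block
   plus a table of row pointers, so m[i][j] still works.
   Returns NULL if n == 0 or on allocation failure.
   Note: row 0 is 16-byte aligned; the other rows are aligned too only
   when n is a multiple of 4 (4 floats = 16 bytes). */
static float **alloc_matrix(size_t n)
{
    if (n == 0)
        return NULL;
    size_t bytes = n * n * sizeof(float);
    bytes = (bytes + 15) & ~(size_t)15;   /* aligned_alloc wants a multiple of 16 */

    float **rows = malloc(n * sizeof *rows);
    float *data = aligned_alloc(16, bytes);
    if (!rows || !data) {
        free(rows);
        free(data);
        return NULL;
    }
    memset(data, 0, bytes);
    for (size_t i = 0; i < n; i++)
        rows[i] = data + i * n;
    return rows;
}

static void free_matrix(float **rows)
{
    if (rows) {
        free(rows[0]);   /* the single data block */
        free(rows);      /* the row-pointer table */
    }
}

/* C += A * B. The caller must pass a C that is distinct from A and B,
   otherwise the restrict qualifiers below would be a false promise. */
static void matmul_acc(size_t n, float **C, float **A, float **B)
{
    assert(C != A && C != B);
    for (size_t i = 0; i < n; i++) {
        float *restrict c = C[i];
        for (size_t k = 0; k < n; k++) {
            const float a = A[i][k];
            const float *restrict b = B[k];
            /* Unit-stride inner loop. With icc, "#pragma simd" could be
               placed here as the big-hammer alternative mentioned above. */
            for (size_t j = 0; j < n; j++)
                c[j] += a * b[j];
        }
    }
}
```

A caller would allocate three matrices with `alloc_matrix(n)`, fill `A` and `B`, call `matmul_acc(n, C, A, B)`, and release each with `free_matrix`. Whether a given compiler version then vectorizes the inner loop still has to be checked in its vectorization report.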

vkeller · 03-10-2011

Thanks Tim,

Sorry, of course, the -ansi-alias was there.

I'll try on icc forum.

Thanks
Vince