Dear colleagues,
I'm a little confused by the performance I get with different versions of the Intel compiler (on Intel architectures). In the past I ran quite a lot of performance tests comparing dynamically and statically declared arrays in C, and my conclusion was that the performance is similar. Today I want to go a step further with vectorization. I tried a simple operation, a full 2D matrix-matrix multiplication in float, double and long double precision. With SSE (128-bit) hardware I expected Perf(float) = 2*Perf(double) = 4*Perf(long double). Is that right or not?
If I declare the 2D arrays as static, that is exactly what I expect and measure (icc -O3 -align -axSSE4.2 -vec-report3 -ansi-alias). On the other hand, if I declare them dynamically, the compiler reports assumed FLOW and ANTI dependences and the performance drops by a factor of 4 (for floats, as in the following example) on my Linux box.
[cpp]#include <xmmintrin.h>   /* _mm_malloc / _mm_free */

#define ALIGN_16 __attribute__((aligned(64)))
#define _aligned_free(a)      _mm_free(a)
#define _aligned_malloc(a, b) _mm_malloc(a, b)

/* sizeX, start_perf_counters() and stop_perf_counters() are defined elsewhere */

int main(int argc, char *argv[])
{
  int i, j, k;
  float **SAp ALIGN_16, **SBp ALIGN_16, **SCp ALIGN_16;

  SAp = _aligned_malloc(sizeX * sizeof(float*), 16);
  SBp = _aligned_malloc(sizeX * sizeof(float*), 16);
  SCp = _aligned_malloc(sizeX * sizeof(float*), 16);
  for (i = 0; i < sizeX; i++) {
    SAp[i] = _aligned_malloc(sizeX * sizeof(float), 16);
    SBp[i] = _aligned_malloc(sizeX * sizeof(float), 16);
    SCp[i] = _aligned_malloc(sizeX * sizeof(float), 16);
  }

  start_perf_counters();
  for (k = 0; k < sizeX; k++) {
    for (i = 0; i < sizeX; i++) {
      for (j = 0; j < sizeX; j++) {
        SCp[i][j] = SCp[i][j] + SAp[i][k] * SBp[k][j];
      }
    }
  }
  stop_perf_counters();

  for (i = 0; i < sizeX; i++) {
    _aligned_free(SAp[i]);
    _aligned_free(SBp[i]);
    _aligned_free(SCp[i]);
  }
  _aligned_free(SAp);
  _aligned_free(SBp);
  _aligned_free(SCp);
}[/cpp]
My question(s): Is the above code correct for SSE vectorization, and if not, what is wrong? Is there a way to declare 2D ** arrays so that the alignment in memory is respected and the performance is similar to what I measure with static arrays (i.e. the compiler does not "assume" ANTI/FLOW dependences)? If not, is there another hint or trick to turn static arrays into dynamic ones without a loss of performance? (The final goal is to optimize real-world applications.)
Many thanks in advance
Vince
4 Replies
This may not vectorize, as memory stride is also an issue.
You may use the -guide compiler option to get Guided Auto Parallelization messages.
Also, your code did not compile.
Vince,
In the case of the static 2D array, the compiler knows where the array is located and where the rows of the array (relative to the array) are located. This means an access to the data within the array may require as little as one memory access.
An array of pointers to 1D data generally requires one memory read to get the address of the array of pointers, plus one memory read to get the row address from the array of pointers, and then a read of the memory contents from within the row itself. This is potentially 3x the number of memory references. Note that in a well-indexed nested loop the row address will tend to be remembered, and the average number of memory references might be reduced to 2x. Compiler optimizations, when favorable, can reduce this to near 1x.
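As a rough sketch of the difference (N is just an assumed dimension for illustration, not taken from the posted code):
[cpp]/* Illustration only: the extra indirection of a pointer-to-pointer layout
   compared with a static 2D array or a single contiguous block.            */
#define N 1024

static float A_static[N][N];        /* base address and row layout known at compile time */

float get_static(int i, int j)  { return A_static[i][j]; }    /* address = base + i*N + j,
                                                                 one data load            */

float get_ptr2d(float **A, int i, int j) { return A[i][j]; }  /* load row pointer A[i],
                                                                 then load A[i][j]        */

float get_flat(const float *A, int i, int j)                  /* one contiguous block:    */
{ return A[(long)i * N + j]; }                                /* index math, one data load */[/cpp]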
In your sample code, the loops are inefficiently laid out.
If you were to move the k loop to the innermost level, then you could greatly reduce the number of memory references to SCp[i][j].
Furthermore, after rearranging the k loop, if you rewrite it as two k loops, the first k loop to pick up SCp into a 1D array and the second k loop to perform the DOT, then the second loop will vectorize.
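One possible reading of that two-k-loop idea is sketched below; here it is the strided SBp column that is gathered into a contiguous 1D buffer, so that the second k loop becomes a unit-stride dot product. This is an interpretation, not the exact rewrite, and it assumes sizeX, SAp, SBp and SCp as in the posted code.
[cpp]/* Sketch of one reading of the two-k-loop suggestion: gather column j of SBp
   into a contiguous buffer, then accumulate a plain dot product over k.      */
for (j = 0; j < sizeX; j++) {
    float Bcol[sizeX];                      /* C99 VLA, or a preallocated buffer */
    for (k = 0; k < sizeX; k++)             /* first k loop: gather the column   */
        Bcol[k] = SBp[k][j];
    for (i = 0; i < sizeX; i++) {
        float sum = SCp[i][j];
        for (k = 0; k < sizeX; k++)         /* second k loop: vectorizable DOT   */
            sum += SAp[i][k] * Bcol[k];
        SCp[i][j] = sum;
    }
}[/cpp]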
Is this a beginning CS course problem?
I suggest you generate the assembler listing files of your test codes, examine the listings for memory references (those lines containing [something]) and simply count them. Usually the code with fewer [...]'s will be faster. Vectorized code is an improvement on top of that as well.
Jim Dempsey
You're getting a lot of conflicting advice, which you might have avoided by providing a working case. The loop nesting looks OK for vectorization, except that you will have an aliasing concern between the two array sections in the inner loop if the compiler doesn't analyze backwards to see that they are separate data regions according to malloc. If declaring them with the restrict keyword isn't applicable, #pragma ivdep should be sufficient, as would the biggest hammer, #pragma simd.
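For example, applied to the inner loop of the posted kernel, that might look like the following sketch (sizeX and the enclosing k and i loops are assumed as in the original):
[cpp]/* Sketch: restrict-qualified row pointers plus #pragma ivdep on the inner loop. */
float * restrict Ci  = SCp[i];        /* tell the compiler these rows  */
const float * restrict Bk = SBp[k];   /* do not alias each other       */
const float Aik = SAp[i][k];

#pragma ivdep                         /* ignore assumed (unproven) dependences;
                                         #pragma simd is the stronger form     */
for (j = 0; j < sizeX; j++)
    Ci[j] += Aik * Bk[j];[/cpp]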
Needless to say, if it is a course problem, you are probably expected to consider tiling, organizations that use a dot product, and so on, and there are plenty of those at the beck and call of your browser.
Dear Jim,
Thanks for your kind reply. Actually it is not a "beginning CS course problem"; it is just a test case to understand how the different versions of the Intel compilers behave with a very simple loop. It started from a simple question from one of our users: "why is the performance of my app so bad?"
I attached a working version of the code showing the different cases. While I can explain some of the results by the loop index orders, I'm unable to understand the differences between 32/64 bits, 2D/1D arrays, or even the static case.
I'm sorry to bother you with such dumb questions.
Thanks in advance
Vince
(I reimplemented the example in C++)
(target arch: dual Intel E5620 @2.4 GHz)
[plain]Simple precision (32 bits) MXM (STATIC)
Order    GF/s          time          FLOPS         verif
Static   1.540221e+01  1.394270e-01  2.147484e+09  3.006477e+10

Simple precision (32 bits) MXM (ALIGNED PTR 2D)
Order    GF/s          time          FLOPS         verif
I-J-K    2.421391e+00  8.868800e-01  2.147484e+09  3.006477e+10
I-K-J    6.100251e+00  3.520320e-01  2.147484e+09  3.006477e+10
J-I-K    2.428250e+00  8.843750e-01  2.147484e+09  3.006477e+10
J-K-I    2.426250e+00  8.851040e-01  2.147484e+09  3.006477e+10
K-I-J    4.068892e+00  5.277810e-01  2.147484e+09  3.006477e+10
K-J-I    2.458377e+00  8.735370e-01  2.147484e+09  3.006477e+10

Simple precision (32 bits) MXM (ALIGNED PTR in 1D)
Order    GF/s          time          FLOPS         verif
I-J-K    1.583235e+01  1.356390e-01  2.147484e+09  3.006477e+10
I-K-J    1.567289e+01  1.370190e-01  2.147484e+09  3.006477e+10
J-I-K    2.179602e-01  9.852643e+00  2.147484e+09  3.006477e+10
J-K-I    1.825269e+00  1.176530e+00  2.147484e+09  3.006477e+10
K-I-J    1.425412e+01  1.506570e-01  2.147484e+09  3.006477e+10
K-J-I    1.565107e+01  1.372100e-01  2.147484e+09  3.006477e+10

Double precision (64 bits) MXM (STATIC)
Order    GF/s          time          FLOPS         verif
Static   7.632567e+00  2.813580e-01  2.147484e+09  3.006477e+10

Double precision (64 bits) MXM (ALIGNED PTR 2D)
Order    GF/s          time          FLOPS         verif
I-J-K    7.426191e+00  2.891770e-01  2.147484e+09  3.006477e+10
I-K-J    2.559725e+00  8.389510e-01  2.147484e+09  3.006477e+10
J-I-K    7.447257e+00  2.883590e-01  2.147484e+09  3.006477e+10
J-K-I    7.496208e+00  2.864760e-01  2.147484e+09  3.006477e+10
K-I-J    7.553105e+00  2.843180e-01  2.147484e+09  3.006477e+10
K-J-I    7.445915e+00  2.884110e-01  2.147484e+09  3.006477e+10

Double precision (64 bits) MXM (ALIGNED PTR in 1D)
Order    GF/s          time          FLOPS         verif
I-J-K    6.020892e+00  3.566720e-01  2.147484e+09  3.006477e+10
I-K-J    6.020369e+00  3.567030e-01  2.147484e+09  3.006477e+10
J-I-K    2.069380e-01  1.037743e+01  2.147484e+09  3.006477e+10
J-K-I    6.636920e-01  3.235663e+00  2.147484e+09  3.006477e+10
K-I-J    6.019323e+00  3.567650e-01  2.147484e+09  3.006477e+10
K-J-I    6.037328e+00  3.557010e-01  2.147484e+09  3.006477e+10[/plain]
