A basic question about auto vectorization of 3-level nested loop - Page 2

susangao · ‎02-19-2013

Hi All,

I wish to get your kindly help on this basic question.

I find a nested loop in a paper as following: ("Model-Driven SIMD Code Generation for a Multi-Resolution Tensor Kernel")

for(int i = 0; i < LEN2; i++) {
for(int k = 0; k < LEN2; k++) {
for(int j = 0; j < LEN2; j++) {
cc += aa * bb;
}
}
}

I modifed it as:

for(int i = 0; i < LEN2; i++) {
for(int k = 0; k < LEN2; k++) {
t = bb;
for(int j = 0; j < LEN2; j++) {
cc += aa * t;
}
}
}

I use ICC to test these two pieces of code, I check execution time and the numbers of scalar and vector operations.

The CPU is Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
The ICC version is icc (ICC) 12.1.5 20120612.
OS: GNU/Linux

The output is:

Loop Time(Sec) Checksum
tensor_kernel 2.25 0.210001
PAPI_FP_OPS 8
PAPI_VEC_SP 26369718676

tensor_kernel 20.55 0.210001
PAPI_FP_OPS 9
PAPI_VEC_SP 26264735280

The execution time are quiet different: one is 2.x sec, the other is around 20 sec.

On the other hand, the PAPI result shows that the instruction numbers of scalar and vector instruction operations are similar. To comfirm the result from PAPI, I checked the .s code, and didn't find obviors difference. I dont understand what causes this execution time difference?

Code is as attached, it use the framework of TSVC. (By default the papi version will also be built, you could remove them from list.)

Thank you very much for reading this question.

Best Regards,

Susan

susangao · ‎03-07-2013

Hi Sergey,

Thank you very much for the info. That's really helpful.

Now I understand I should be very careful of this modification, it is better to be used with tiling (or other transformations) to avoid cache miss.

Best Regards,

Susan

SergeyKostrov · ‎04-10-2013

This is a follow up and you could try ( if interested ) a #pragma prefetch [var1 [:hint1 [:distance1]] [, var2 [:hint2 [:distance2]]]...] directive. Values for hint arguments are as follows: - _MM_HINT_T0 - for integer data that will be reused - _MM_HINT_NT1 - for integer and floating point data that will be reused from L2 cache - _MM_HINT_NT2 - for data that will be reused from L3 cache - _MM_HINT_NTA - for data that will not be reused and the directive has to be used as follows ( without hint value in my example ): ... float t; ... #pragma prefetch t for( int k = 0; k < LEN2; k++ ) { t = bb; for( int j = 0; j < LEN2; j++ ) { ...

SergeyKostrov · ‎05-03-2013

I detected a similar problem in a different algorithm and negative effect of such manual-optimizations is significant ( it slows execution in ~2x ). Just take a look at numbers: Intel C++ compiler - /O2 [ Problems with Cache Lines ( after manual-optimization similar to Susan's ) ] ... Matrix Size: 1024 x 1024 ... Calculating... Matrix multiplication - Pass 1 - Completed: 1.40400 secs Matrix multiplication - Pass 2 - Completed: 1.18500 secs Matrix multiplication - Pass 3 - Completed: 1.20200 secs Matrix multiplication - Pass 4 - Completed: 1.20100 secs Matrix multiplication - Pass 5 - Completed: 1.18500 secs ... [ No Problems with Cache Lines - ~50% faster (!) ] ... Matrix Size: 1024 x 1024 ... Calculating... Matrix multiplication - Pass 1 - Completed: 0.82700 secs Matrix multiplication - Pass 2 - Completed: 0.62400 secs Matrix multiplication - Pass 3 - Completed: 0.62400 secs Matrix multiplication - Pass 4 - Completed: 0.63900 secs Matrix multiplication - Pass 5 - Completed: 0.62400 secs ...

SergeyKostrov · ‎05-23-2013

This is a follow up. For example, If these codes: ... // (A) for( j = 0; j < uiSize; j++ ) { tC = ( T )0; for( k = 0; k < uiSize; k += 4 ) { tC += ( ( tA[k ] * tB[k ] ) + ( tA[k+1] * tB[k+1] ) + ( tA[k+2] * tB[k+2] ) + ( tA[k+3] * tB[k+3] ) ); } } ... Changed as follows: ... // (B) { register T tR = ( T )0; for( k = 0; k < uiSize; k += 4 ) { tR += ( ( tA[k ] * tB[k ] ) + ( tA[k+1] * tB[k+1] ) + ( tA[k+2] * tB[k+2] ) + ( tA[k+3] * tB[k+3] ) ); } tC = ( T )tR; } ... Then performance improves by ~3.9% compared to (A). If tB is transposed: ... // (C) { register T tR = ( T )0; for( k = 0; k < uiSize; k += 4 ) { tR += ( ( tA[k ] * tB[j ] ) + ( tA[k+1] * tB[j+1] ) + ( tA[k+2] * tB[j+2] ) + ( tA[k+3] * tB[j+3] ) ); } tC = ( T )tR; } ... Then performance improves by ~30.5% compared to (A).

jimdempseyatthecove · ‎05-24-2013

This may be off point of this thread. In examining your code it is a matrix multiplication where the destination is neither of the sources. It might not hurt to investigate producing ccT, the transposed value of the eventual result, then after complete computation of ccT, transpose it to produce cc.

Although this adds extra work for the final transpose, it removes latencies by improving vectorization and cache utilization.

Also, it wouldn't hurt to add a test using MKL.

Jim Dempsey

SergeyKostrov · ‎05-24-2013

>>...Although this adds extra work for the final transpose, it removes latencies by improving vectorization and >>cache utilization... There is also a special case when a matrix B is symmetric and transpose doesn't need to be done. But, a transposed-based version could be used (!).