Hi All,
I am trying to compile the following sample kernel with Intel ICC 14.0.0 20130728 (and also with versions > 12). I see strange behaviour with vectorization, and I have the following questions:
- If I change the _iml variable type to int instead of long int, the compiler doesn't vectorize the code. The vectorization report from -vec-report3 shows many assumed ANTI and FLOW dependencies, which seems correct. What I don't understand is what the compiler does differently to vectorize the loop when the iteration variable type is long int.
- The example below is an auto-generated kernel from a domain-specific language. We have a large array and process 18 elements of the array in every iteration (say those 18 elements represent a particle), so iterations are independent. This memory layout looks similar to AoS (array of structs with 18 elements). AoS is not good for vectorization, so I want to understand how the Intel compiler vectorizes this code.
The compute() function is the actual compute kernel that I want to vectorize. Please follow the comments for more explanation:
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define AOS_BLOCK 18

void compute(double *pdata, int num_mechs) {
    double *_p;
    /* ISSUE: if I change _iml to int instead of long int,
     * the compiler doesn't vectorize the loop. Why? */
    long int _iml;

    /* each iteration of the loop processes 18 elements of the 1-d pdata array */
    for (_iml = 0; _iml < num_mechs; ++_iml) {
        /* pointer to the start of this 18-element block */
        _p = &pdata[_iml * AOS_BLOCK];

        /* the calculations below are generated by a DSL-to-C converter,
         * ugly I know! They operate on these 18 elements only, so you
         * don't need to understand them */
        if (_p[16] == -35.0) { _p[16] = _p[16] + 0.0001; }
        _p[8] = (0.182 * (_p[16] - -35.0)) / (1.0 - exp(-(_p[16] - -35.0) / 9.0));
        _p[9] = (0.124 * (-_p[16] - 35.0)) / (1.0 - exp(-(-_p[16] - 35.0) / 9.0));
        _p[6] = _p[8] / (_p[8] + _p[9]);
        _p[7] = 1.0 / (_p[8] + _p[9]);
        if (_p[16] == -50.0) { _p[16] = _p[16] + 0.0001; }
        _p[12] = (0.024 * (_p[16] - -50.0)) / (1.0 - exp(-(_p[16] - -50.0) / 5.0));
        if (_p[16] == -75.0) { _p[16] = _p[16] + 0.0001; }
        _p[13] = (0.0091 * (-_p[16] - 75.0)) / (1.0 - exp(-(-_p[16] - 75.0) / 5.0));
        _p[10] = 1.0 / (1.0 + exp((_p[16] - -65.0) / 6.2));
        _p[11] = 1.0 / (_p[12] + _p[13]);
        _p[3] = _p[3] + (1. - exp(0.01 * (-1.0 / _p[7]))) * (-(_p[6] / _p[7]) / (-1.0 / _p[7]) - _p[3]);
        _p[4] = _p[4] + (1. - exp(0.01 * (-1.0 / _p[11]))) * (-(_p[10] / _p[11]) / (-1.0 / _p[11]) - _p[4]);
    }
}

int main(int argc, char *argv[]) {
    int i, n;
    double *data;

    if (argc < 2) {
        printf("\n Pass length of the array as argument \n");
        exit(1);
    }
    n = atoi(argv[1]);

    //data = _mm_malloc(sizeof(double) * n * AOS_BLOCK, 32);
    /* zero-initialize so compute() does not read indeterminate values */
    data = (double *) calloc((size_t)n * AOS_BLOCK, sizeof(double));

    /* main compute function */
    compute(data, n);

    if (argc > 3)
        for (i = 0; i < n; i++)
            printf("\t %lf", data[i * AOS_BLOCK]);

    free(data);
    //_mm_free(data);
    return 0;
}
Any comments that help me understand this code and its vectorization are appreciated.
Thanks!
To provide more information, here is my compilation/vectorization report:
[kumbhar@dom38 ~]$ icc -vec-report3 vec_test_intel.c
vec_test_intel.c(66): (col. 5) remark: LOOP WAS VECTORIZED
vec_test_intel.c(69): (col. 9) remark: loop was not vectorized: existence of vector dependence
vec_test_intel.c(14): (col. 5) remark: LOOP WAS VECTORIZED
If I change _iml to int, I see
[kumbhar@dom38 ~]$ icc -vec-report3 vec_test_intel.c
vec_test_intel.c(14): (col. 5) remark: loop was not vectorized: existence of vector dependence
vec_test_intel.c(22): (col. 13) remark: vector dependence: assumed ANTI dependence between _p line 22 and _p line 37
vec_test_intel.c(37): (col. 13) remark: vector dependence: assumed FLOW dependence between _p line 37 and _p line 22
vec_test_intel.c(22): (col. 13) remark: vector dependence: assumed ANTI dependence between _p line 22 and _p line 40
vec_test_intel.c(40): (col. 9) remark: vector dependence: assumed FLOW dependence between _p line 40 and _p line 22
vec_test_intel.c(22): (col. 13) remark: vector dependence: assumed ANTI dependence between _p line 22 and _p line 41
vec_test_intel.c(41): (col. 9) remark: vector dependence: assumed FLOW dependence between _p line 41 and _p line 22
vec_test_intel.c(22): (col. 13) remark: vector dependence: assumed ANTI dependence between _p line 22 and _p line 42
vec_test_intel.c(42): (col. 9) remark: vector dependence: assumed FLOW dependence between _p line 42 and _p line 22
Sorry about the duplicate post; I no longer have delete privileges.
The meaning of long int depends on the OS and on 32-bit vs. 64-bit targeting. Is this Intel64 Linux or Mac, where it would be a 64-bit data type?
Available vectorization modes depend on the target architecture. Presumably vectorization would use simulated gather/scatter (possibly vgather instructions on corei7-4) so as to pack multiple iterations into SIMD data and enable use of SVML functions (effectively converting to SoA).
The compilers have made significant stability improvements with updates since the initial 14.0.
If you would attach code which could be compiled, we could find out ourselves what happens.
The constants don't look precise enough to justify use of double data types unless to avoid overflow in the exponentials at those arbitrarily shifted singularity points.
Thanks Tim for the clarification. The details you asked for:
- x86_64 GNU/Linux
- Xeon(R) CPU E5-2670 0 @ 2.60GHz
- Compilers I tested: icc (ICC) 13.1.0 20130121, (ICC) 14.0.0 20130728
I have attached the code here (basically the same example I posted above). I am implementing an SoA version of the same kernel and want to compare the performance of the AoS and SoA memory layouts for this kernel. So it would be helpful if you could confirm what is happening with vectorization in the attached kernel (as I am not an assembly expert! :) ).
Thanks!
I took a quick look with the 14.0.2 compiler on Windows, as that's the only AVX(2) machine I have. I see the effect you reported: vectorization is enabled when I make the index a long long int (on Windows, plain long is only 32 bits). The vectorization is done with mostly scalar memory accesses (on account of the AoS data) to assemble short vectors in registers, enabling use of the short-vector exp() calls. Spills and reloads are done with AVX-128 memory accesses.
The scalar exp calls are also made (for consistency) through the short-vector library, discarding the extra slots.
Could you try -opt-subscript-in-range compiler option?
om-sachan (Intel) wrote:
Could you try -opt-subscript-in-range compiler option?
Yes, the compiler vectorizes the loop if I use the -opt-subscript-in-range flag. I found the following explanation of this flag:
If you specify -opt-subscript-in-range (Linux* OS and OS X*) or /Qopt-subscript-in-range (Windows* OS), the compiler assumes that there are no "large" integers being used or being computed inside loops. A "large" integer is typically > 2^31. This feature can enable more loop transformations.
But I don't completely understand this. In the following example, could you explain how the compiler decides to vectorize when we use -opt-subscript-in-range? (If you look at the example I posted, _iml is used only for loop iteration and for calculating the offset.)
/* this is not vectorizable */
int _iml;
for (_iml = 0; _iml < 1000000; ++_iml) {
    ..............
}

/* this is vectorizable */
long int _iml;
for (_iml = 0; _iml < 1000000; ++_iml) {
    ..............
}
It would be a great help to understand this better! (The full example for the loops above is attached in the top posts.)
Maybe the compiler took the magnitude of the _iml variable into account, but that doesn't make sense either, because _iml < 1e+6 is well inside the int range.
_iml is multiplied explicitly by 18 in the posted sample, so there may be a concern about signed overflow. The compiler must perform some kind of strength reduction in order to set up the gather of several iterations at stride 18 into parallel SIMD. It ought to pay off through the use of parallel SIMD divide and the short-vector exp function.
In my simpler example, icc under-performs gcc when a subscript is calculated by multiplying the for index. gcc performs classical strength reduction (and does not attempt vectorization), while icc "optimizes" to an lea chain and "vectorizes" to AVX-128 by simulated gather/scatter (it no longer performs the multiplication inside the inner loop). opt-subscript-in-range makes no difference.
I agree with Tim.
Om