I would like to vectorize the code below (just as an example); assume that somehow I need to use an array as an index inside another array.
PROGRAM TEST
IMPLICIT NONE
REAL, DIMENSION(2000):: A,B,C     !100000
INTEGER, DIMENSION(2000):: E
REAL(KIND=8):: TIME1,TIME2
INTEGER:: I

DO I=1, 2000                 !Actually only this loop could be vectorized
   B(I)=100.00               !by the compiler
   C(I)=200.00
   E(I)=I
END DO

!Computing computer's running time (start)
CALL CPU_TIME (TIME1)

DO I=1, 2000                 !This is the problem, somehow I should put
   A(E(I))=B(E(I))*C(E(I))   !an integer array E(I) inside an array
END DO                       !I would like to vectorize this loop also, but it didn't work

PRINT *, 'Results =', A(2000)
PRINT *, ' '

!Computing computer's running time (finish)
CALL CPU_TIME (TIME2)
PRINT *, 'Elapsed real time = ', TIME2-TIME1, 'second(s)'

END PROGRAM TEST
I thought at first that the compiler could understand what I want and would somehow vectorize it like this:
DO I=1, 2000, 4              !Unrolled 4 times for example
   A(E(I))  =B(E(I))  *C(E(I))
   A(E(I+1))=B(E(I+1))*C(E(I+1))
   A(E(I+2))=B(E(I+2))*C(E(I+2))
   A(E(I+3))=B(E(I+3))*C(E(I+3))
END DO
but I was wrong. Do you have any idea how I could vectorize it, or can it not be vectorized at all? Thanks in advance.
When unrolling is combined with vectorization, supposing that you are targeting SSE4 (where the vector register length for default real is 4), an unroll factor of 4 means each iteration of the generated loop effectively processes 16 elements.
You are correct that unrolling on top of vectorization is likely to be beneficial, particularly with AVX2.
Automatic unrolling of this nature is controlled by an ifort option, e.g. -unroll=4, or by a loop-level directive, e.g. !dir$ unroll(4). These are only suggestions to the compiler.
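As a minimal sketch (assuming ifort's directive syntax described above, and the loop from the original post), the loop-level form is placed directly above the loop it applies to; the compiler may still ignore it:

```fortran
! Sketch only: suggest a 4-way unroll of the indexed loop to ifort.
! The directive is advisory; the compiler decides whether to honor it.
!dir$ unroll(4)
DO I = 1, 2000
   A(E(I)) = B(E(I)) * C(E(I))
END DO
```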
Vectorization of indirect access is unlikely to be effective unless you target the specific ISA of your CPU, e.g. with the compiler option -xHost. If you want vectorization regardless of whether the compiler expects it to perform well, directives such as !dir$ vector always or !$omp simd are available.
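For placement, a sketch of the ifort-specific form on the loop in question (the !$omp simd alternative additionally requires compiling with -qopenmp or -qopenmp-simd):

```fortran
! Sketch only: ask ifort to vectorize the loop regardless of its own
! cost estimate, overriding the "seems inefficient" heuristic.
!dir$ vector always
DO I = 1, 2000
   A(E(I)) = B(E(I)) * C(E(I))
END DO
```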
If you wish to see what the compiler says about potential for vectorization with AVX512, you can set one of the AVX512 options along with -qopt-report=4 and look at the optrpt. It is claimed that the beta vectorization Advisor should be able to predict AVX512 performance by running on an older CPU with a build with options such as -axCORE-AVX512 -msse4.1 -debug inline-debug-info, but this doesn't work for me.
In your previous posts on this subject, responders have brought up the question of possible repeated entries in your index arrays, which raises the likelihood of indeterminate results with vectorization. However, the methods used by compilers for IA targets apparently preserve the order in which results are stored. With AVX-512, the compiler can use the conflict-detection instructions if it sees the need for them. With 256-bit AVX, the stores are done sequentially by a simulated scatter even though the compiler may report successful vectorization, with the reads done either by simulated gather or by AVX2 gather instructions.
@Tim.P: Thanks for the answer. Actually I have tried $omp simd (and compiled with the OpenMP flag, of course), like the following:
$OMP SIMD
DO I=1, 2000                 !But it still didn't work
   A(E(I))=B(E(I))*C(E(I))
END DO
$OMP END SIMD
but it still didn't work. Should I conclude that, for my case of indirect loop indexing, the claim that I can vectorize the loop using either SIMD directives or OpenMP SIMD directives does not hold?
Your declaration of OMP SIMD contains a typo: it should be !$OMP instead of $OMP.
Note that an update with an indirect index cannot be auto-vectorized because of potential index conflicts within one vector. For example, E(I)=1, E(I+1)=2, E(I+2)=1, E(I+3)=3 contains a conflicting index (1 appears twice), which causes a dependence within one vector.
The potential dependence can be ignored using SIMD, since in your case the values of E(I) are consecutive, without conflicts.
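To make the potential conflict concrete, here is a small sketch using the hypothetical index values from the example above; when E holds a repeated index, two lanes of one vector target the same element of A, so a scatter that did not preserve lane order could differ from the scalar loop:

```fortran
! Sketch only, with hypothetical data: index 1 appears twice in E.
INTEGER :: I
INTEGER, DIMENSION(4) :: E = (/ 1, 2, 1, 3 /)
REAL,    DIMENSION(4) :: A, B = 100.0, C = 200.0

! Scalar semantics: stores happen in order I = 1,2,3,4, so the second
! store to A(1) (at I = 3) is the one that survives. A vectorized
! scatter must preserve that ordering to be safe, which is why the
! compiler refuses to auto-vectorize unless directed (e.g. !$omp simd).
DO I = 1, 4
   A(E(I)) = B(E(I)) * C(E(I))
END DO
```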
See the optimization report using simd:
LOOP BEGIN at testind.f90(19,3)
   remark #15388: vectorization support: reference e has aligned access   [ testind.f90(20,6) ]
   remark #15388: vectorization support: reference e has aligned access   [ testind.f90(20,6) ]
   remark #15388: vectorization support: reference e has aligned access   [ testind.f90(20,6) ]
   remark #15329: vectorization support: scatter was emulated for the variable a: indirect access   [ testind.f90(20,6) ]
   remark #15328: vectorization support: gather was emulated for the variable b: indirect access   [ testind.f90(20,14) ]
   remark #15328: vectorization support: gather was emulated for the variable c: indirect access   [ testind.f90(20,22) ]
   remark #15305: vectorization support: vector length 4
   remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
   remark #15448: unmasked aligned unit stride loads: 3
   remark #15458: masked indexed (or gather) loads: 2
   remark #15459: masked indexed (or scatter) stores: 1
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 12
   remark #15477: vector loop cost: 32.250
   remark #15478: estimated potential speedup: 0.370
   remark #15488: --- end vector loop cost summary ---
   remark #25456: Number of Array Refs Scalar Replaced In Loop: 8
   remark #25015: Estimate of max trip count of loop=500
LOOP END
The compile options I used:
$ ifort -V
Intel(R) Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 184.108.40.206 Build 20160204
Copyright (C) 1985-2016 Intel Corporation.  All rights reserved.

$ ifort -O2 -qopt-report5 -qopenmp testind.f90
ifort: remark #10397: optimization reports are generated in *.optrpt files in the output location
Hope this helps.
In addition, if you change the loop as follows, using a direct index to update array A, the compiler will auto-vectorize the loop. This is because there is no dependence between iterations from the reads of B(E(I)) and C(E(I)).
DO I=1, 2000
   A(I)=B(E(I))*C(E(I))
END DO
@Chen, @Tim.P: Thanks for the answers. I found these interesting cases (I write only the second loop of my case above):
!$OMP SIMD
DO I=1, 2000
   A(E(I))= B(E(I))*C(E(I))   !assumed to be Case A
   A(I)   = B(E(I))*C(E(I))   !assumed to be Case B
END DO
!$OMP END SIMD
FYI, I am using Ifort 16.0.3 and, as a comparison, Gfortran 5.4.0. I tried some combinations using these two compilers, like these:
I compiled with the options:
The CPU-time results are:

Case B without !$OMP SIMD:
   Ifort:    17.469 s
   Gfortran: 10.727 s
Note: for Ifort the report stated that PERMUTED vectorization was successfully done, whereas for Gfortran it wasn't.
Now, I am a bit confused particularly for these points:
Do I have the wrong interpretation of the "vectorizing way"? I mean (regardless of how fast or slow Gfortran is; I am not comparing the speed between Intel and Gfortran) that my interpretation of vectorization is actually explained better by Gfortran's results, where direct indexing is always faster than indirect indexing.
Could you please explain this to me? Any help will be highly appreciated. Thanks in advance.
What's your target platform?
The performance result for Case A looks as expected: the vectorized version is inefficient because of the emulated gather/scatter. You can find the estimated speedup in the opt-report. If you use -xCORE-AVX512 or -xMIC-AVX512, you will get an estimated speedup of more than 2x thanks to hardware support for gather/scatter.
The result for Case B does look unexpected. How did you measure the time in seconds? For this small loop count of 2000, it runs within a millisecond in my test.
@Chen: Do you mean my machine as the target platform? I am working on a Dell PC with an Intel® Core™ i7-4790 CPU @ 3.60GHz × 8.
Yes, indeed it runs in milliseconds; that's why I put a loop for K=1 to 10000000 around it, as written below (sorry, I forgot to tell you before):
DO K=1, 10000000
   !$OMP SIMD
   DO I=1, 2000
      A(E(I))= B(E(I))*C(E(I))   !Case A
      A(I)   = B(E(I))*C(E(I))   !Case B
   END DO
   !$OMP END SIMD
END DO
Yes, you're right that Case A looks like what I expected, but Case B doesn't. Could you please explain to me why this could happen?
PS: Actually, all the code I've posted above is only an example of my true code, most parts of which use indirect indexing. I also compiled my true code with Ifort and Gfortran on the machine described above, and I was surprised to find that Ifort gives results 7 times faster than Gfortran (for single-core computation). I've gone through Ifort's optimization report part by part: it states that all parts that look like Case B above are successfully auto-vectorized by Ifort, whereas no parts could be auto-vectorized by Gfortran, even when using !$OMP SIMD. That's why I simply assumed that for Case B, Ifort would always give faster results than Case A, and faster than Gfortran. But when I ran this simple test of Case B, the outcome was totally different.
I hope to find out the answer. Thanks in advance.
I can reproduce the issue with your updated code for Case B.
Your machine supports AVX2; with -fast the compiler will generate code targeting the latest instruction set available on your machine, i.e. AVX2.
Interestingly, if I don't use -fast, I get even better performance for this specific case.
I checked the generated assembly and the opt report for both Case A and Case B. It looks like for this small case, the scalar loop works better than the vectorized version. The major difference for Case B with/without SIMD is that the compiler generates hardware vgather instructions, which are not very effective for this case. Case A with SIMD uses emulated gather instead of vgather, which is still better.
I guess it might be because the values of the index E(I) are consecutive rather than scattered, and vgather is not effective for such a case. I need to double-check this, though. How about your true code: are those indirect indices actually consecutive or random?
It's mysterious (to me) how a compiler decides between "simulated" (or "emulated") gather and AVX2 vgather.
In my own examples, it appears that AVX2 vgather may be slower than simulated gather when there is insufficient unrolling. I think it may not be surprising if simulated gather doesn't benefit from unrolling. If you suspect vgather, you might compare AVX vs. AVX2.
The OP does appear to be interested in performance (not only in getting a "vectorized" diagnosis from the compiler), which was one of my questions at the beginning of this thread. So it would be important to consider the full set of compile options, which the OP hasn't divulged.
For ifort, I would try -xHost -O3 -unroll4 -align array32byte (although the alignment is possibly counter-productive for scatter store, and not available for gfortran).
For gfortran, -march=native -O3 -funroll-loops --param max-unroll-times=4 -ffast-math (or replace -O3 by -O2 -ftree-vectorize).
The "fast" options exist mainly for the case where there is an arbitrary limit on the number of command-line options.
Failing to request similar optimizations on the command line is one of the hidden methods used to boost the apparent performance of one compiler over another. If there is complex arithmetic, I prefer to set or unset the limited-range options explicitly, but others prefer to set them in opposite ways when comparing two compilers.
@Chen: Thanks for the answer. You wrote "I guess it might be due to value of index E(I) is continuous instead of distinct ..."; does continuous here mean that E(I=1,2,3,...) = 1,2,3,...? I didn't know before that whether the values of an indirect index array are consecutive or not would have a significant effect; I had actually thought the consecutive case would be faster than the non-consecutive one, which is why I wrote the sample code above the way I did. My true code has many indirect indexes, but ALL values are non-consecutive. That's why I wrote them as indirect indexing; if I could avoid it, I wouldn't write indirect indexing at all. For example:
!Most parts of my true code look like this:
!For looping I=1,2,...
A(E(1)) = C(1)*D(1)*F(E(1))
A(E(2)) = C(2)*D(2)*F(E(2))
...
!Here for example the values of E(1) and E(2) will point to 3 and 7, which means
!the computations above will save the values for A(3) and A(7)
As my major is not computer science, I actually don't understand the syntax itself too deeply. My first aim is simply to solve a real problem with a numerical method as fast as possible. That's why I prefer the flag -fast: it performs 2-3 times faster on my true code than -O2 or -O3. So, since you say you get better performance for this example without -fast, I'm now totally confused. I would prefer a "convergent" behavior, i.e. performance improving steadily going from -O2 to -O3 to -fast for all cases, but now I know I was wrong.
@Tim.P: Thanks for the great explanation. Actually (as I wrote above about my main aim), I don't know much about how to get better performance in the "explicit way" you describe. I could certainly read about all the optimization flags provided by both compilers and might understand them all, but in the end it would just confuse me, since there would be too many combinations to try. That's why I prefer to use just the two flags -O3 and -fast. Based on your experience, do you have any suggestions on which flags are the most common or important to test explicitly? Thanks in advance.
>>As my major is not computer science, I actually don't understand the syntax itself too deeply. My first aim is simply to solve a real problem with a numerical method as fast as possible. That's why I prefer the flag -fast: it performs 2-3 times faster on my true code than -O2 or -O3.
You must also bear in mind that different numerical methods that produce equivalent results may have drastically different execution times. The compiler will not choose a better-performing numerical method for you.
I verified that this issue will be resolved in 17.0.
With "-fast -qopt-report5 -qopenmp", Case B will no longer get auto-vectorized with vgather.
Performance improved from 26 s to 5.47 s in my test. From the optimization report in ipo_out.optrpt:
LOOP BEGIN at test_ind2.f90(22,5)
   remark #25444: Loopnest Interchanged: ( 1 2 ) --> ( 2 1 )
   remark #15541: outer loop was not auto-vectorized: consider using SIMD directive   [ test_ind2.f90(22,5) ]

   LOOP BEGIN at test_ind2.f90(20,3)
      remark #15335: loop was not vectorized: vectorization possible but seems inefficient. Use vector always directive or -vec-threshold0 to override
      remark #15305: vectorization support: vector length 4
      remark #15399: vectorization support: unroll factor set to 4
      remark #15309: vectorization support: normalized vectorization overhead 0.417
      remark #15475: --- begin vector cost summary ---
      remark #15476: scalar cost: 10
      remark #15477: vector cost: 0.750
      remark #15478: estimated potential speedup: 13.330
      remark #15488: --- end vector cost summary ---
      remark #25438: unrolled without remainder by 2
   LOOP END
LOOP END
Your problem in Case B with ifort 16.0.3 appears to be a compiler issue: it incorrectly estimated the potential speedup and made the wrong decision in this case. I hope this helps resolve your confusion about choosing optimization options.
On Linux, -fast is shorthand for -ipo, -O3, -no-prec-div, -static, -fp-model fast=2, and -xHost. Generally it should improve performance through better optimization; when it does not, you may need to evaluate those options separately. Loop vectorization is not always helpful in every situation, considering the vector cost; the compiler uses its heuristics to decide whether or not to vectorize.
Hope this helps.