Solved: Thank you Tim, I meant 'gfort

Sidharth_Kashyap · ‎03-25-2016

Intel Compiler looks to be behaving oddly for the loop structure below when compared to GCC.

!$OMP DO collapse(3) private(l,j,k)
DO l=1,n
  DO k=1,n
    DO j=1,n
      a(j,k,l)=a(j,k,l)*b(j,k,l)
    ENDDO
  ENDDO
ENDDO
!$OMP END DO

I am using the vectorization flags (-xavx).

Are there any Best Known Methods when using the 'collapse' clause with Intel Fortran Compilers with Vectorization?

Regards,

Sid

TimP · ‎03-25-2016

It's hard to guess in what way you would expect ifort to match behavior of gcc, not knowing how you are nesting the loops for C and whether you asked gcc to vectorize. In my experience, gcc doesn't apply both vectorization and parallelization to the same level of loops, even if you ask it to do so in a case where it makes sense.

I assume you didn't set any ifort option that would prevent vectorization or peeling for alignment, so it doesn't make sense to thread-parallelize the inner loop when you have correctly set up your code to facilitate inner-loop simd vectorization. It's difficult to conceive of a case where l*k (the combined trip count of the two outer loops) is not large compared to the number of cores you would have available, yet is large enough that parallelization would be desirable.

Adding a simd clause to your OpenMP directive would confirm that you intended vectorization of the inner loop, but omitting it doesn't imply that you don't want simd (unless you have set an option such as -fno-vec). Still, when you ask the compiler to do something which doesn't make sense and may not have been adequately tested, it's difficult to form expectations about what will happen. In spite of the ads about ifort supporting directive-based vectorization, I haven't seen general usefulness of such an approach. There isn't even consensus among experts on the optimum way to split a loop into OpenMP chunks in combination with simd vectorization, but experiments have been done to ensure that important customer cases get satisfactory treatment.

You may notice that Intel has revived the slogan of 25 years ago, "concurrent outer vector inner," in a slightly different form: "vector inner parallel outer." This is so often the best strategy for nested loops, and has been so for decades, that it certainly deserves to be called a Best Known Method. In the case of 3 nested loops, my own opinion is that you would not use collapse at all unless you know that the number of cores to be used will be large enough that parallelization of the outer loop alone will create work imbalance.
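As a minimal sketch of "vector inner parallel outer" applied to the loop nest from the question: the program below is illustrative only and is not taken from the thread. The array size n = 64, the fill values, and the program name are assumptions. Threads share the outermost l loop, and the innermost j loop, which runs over the contiguous first index, is marked with an OpenMP simd directive.

```fortran
program vector_inner_parallel_outer
  implicit none
  integer, parameter :: n = 64            ! assumed size, for illustration only
  real, allocatable :: a(:,:,:), b(:,:,:)
  integer :: j, k, l

  allocate(a(n,n,n), b(n,n,n))
  a = 2.0
  b = 3.0

  ! Parallel outer: threads divide the iterations of the l loop.
  !$omp parallel do private(j, k)
  do l = 1, n
     do k = 1, n
        ! Vector inner: j indexes the contiguous (first) dimension.
        !$omp simd
        do j = 1, n
           a(j,k,l) = a(j,k,l) * b(j,k,l)
        end do
     end do
  end do
  !$omp end parallel do

  ! Every element should now be 2.0 * 3.0.
  print *, 'all elements equal 6.0: ', all(a == 6.0)
end program vector_inner_parallel_outer
```

This should build with OpenMP enabled (for example -fopenmp with gfortran or -qopenmp with ifort). Without those flags the directives are treated as comments and the program runs serially with the same result.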

Sidharth_Kashyap · ‎03-25-2016

Thank you Tim. I meant 'gfort' when I mentioned that gcc was doing the right thing; I guess your comments hold true for both cases.

I will experiment with -fno-vec to check the correctness of the operation. In that case, I will drop the collapse clause as you suggest and expect the outermost loop to create enough work items to be shared among all the threads.
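One way such an experiment could be set up is sketched below. This is a hypothetical test harness, not code from the thread: the size n = 32, the array names, and the program name are assumptions. It computes a serial reference, runs the loop nest once with only the outer loop parallelized and once with collapse(3), and reports the largest difference from the reference. It also prints the outer trip count next to the thread count, which gives a rough idea of whether the outer loop alone provides enough work items.

```fortran
program check_collapse
  use omp_lib, only: omp_get_max_threads
  implicit none
  integer, parameter :: n = 32            ! assumed size, for illustration only
  real, allocatable :: a0(:,:,:), b(:,:,:), ref(:,:,:), a1(:,:,:), a2(:,:,:)
  integer :: j, k, l

  allocate(a0(n,n,n), b(n,n,n), ref(n,n,n), a1(n,n,n), a2(n,n,n))
  call random_number(a0)
  call random_number(b)

  ref = a0 * b                            ! serial reference result
  a1 = a0
  a2 = a0

  ! Variant 1: parallelize the outermost loop only.
  !$omp parallel do private(j, k)
  do l = 1, n
     do k = 1, n
        do j = 1, n
           a1(j,k,l) = a1(j,k,l) * b(j,k,l)
        end do
     end do
  end do
  !$omp end parallel do

  ! Variant 2: collapse all three loops, as in the original question.
  !$omp parallel do collapse(3)
  do l = 1, n
     do k = 1, n
        do j = 1, n
           a2(j,k,l) = a2(j,k,l) * b(j,k,l)
        end do
     end do
  end do
  !$omp end parallel do

  print *, 'outer trip count:', n, ' max threads:', omp_get_max_threads()
  print *, 'max |outer-only - ref|:', maxval(abs(a1 - ref))
  print *, 'max |collapse(3) - ref|:', maxval(abs(a2 - ref))
end program check_collapse
```

Because each element is a single independent multiplication, both variants would be expected to match the reference; a nonzero difference would point at a compiler or setup problem rather than at rounding. Building the same source with and without the compiler's vectorization option allows the two builds to be compared as well.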

OpenMP Collapse(n)