I am experiencing some strange behavior when I use the OMP SIMD directive in a couple of loops with inlined subroutines and I am getting more and more convinced that this might be a bug in the compiler. If it is not, I would be really thankful if someone could point me to the error in my code.
The bug appears in four different pieces of the complex CFL code I am currently working on, but I have been able to reproduce it in a smaller code (main.f90, attached). Oddly, this example seems to work fine on Intel 2017, but not on 2016 or 2018. The actual code I am working on shows errors on all three compiler versions.
The code is basically a loop to solve a list of Riemann problems using the given subroutines. In this example the problem is always the same, so it is easy to check if the solution is correct. I suggest the following tests, so you can check it for yourselves (use ifort 2016 or 2018):
1) Compile with: ifort -O2 -fopenmp -cpp main.f90
- This does not vectorize the main loop and gives correct results for all problems: "-427.4213 -106.8553 213.7107".
2) Compile with: ifort -O2 -fopenmp -cpp -D_VEC -D_INLINE main.f90
- This does vectorize the main loop and inlines the subroutine into it, but gives WRONG results (0.0000 0.0000 0.0000).
3) Remove either vectorization (-D_VEC) or inlining (-D_INLINE) and you will notice that the code gives correct results again.
Actually, I've already found a workaround for this bug, but it is quite ugly. You can check it in main_scalars.f90. The workaround consists of replacing the arrays that are passed to the inlined subroutine with scalars, so that all parameters are scalars. For the example code that means sw(3) --> s1,s2,s3 and fw(3,3) --> fw11,fw12,fw13,fw21,fw22,fw23,fw31,fw32,fw33. It is awful, but the code works perfectly, with vectorization and inlining. Unfortunately, other pieces of my code contain 6x6 matrices, and using this "solution" would be even more awful for those pieces!
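For readers without the attachments, the workaround above has roughly the following shape (this is a minimal sketch; the loop structure, `q`, and the subroutine names are my assumptions, not taken from the attached files, and the `fw` matrix expands into nine scalars in the same way):

```fortran
! Original form: array dummy arguments, PRIVATE arrays in the SIMD loop
!$omp simd private(sw, fw)
do i = 1, num_problems
   call solve_single_problem(q(:, i), sw, fw)   ! sw(3), fw(3,3)
   s(i, :) = sw
end do

! Workaround: every dummy argument is a scalar
!$omp simd private(s1, s2, s3)
do i = 1, num_problems
   call solve_single_problem_scalars(q(:, i), s1, s2, s3)
   s(i, 1) = s1
   s(i, 2) = s2
   s(i, 3) = s3
end do
```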
I have also tried using vector functions instead of inlining, but performance is much better with the inlined code (plus the workaround above). In my experiments, the improvement in loop performance is about 40% on Haswells and 95% on KNLs. Thus, I would really like to have the subroutines inlined into the loop.
Any help on this issue would be highly appreciated. Thanks in advance,
While the compiler diagnostic may state that the inlined function vectorizes, the data layouts of s and fwave do not lend themselves to vectorization without gather/scatter. I suggest you rewrite the index order:
double precision s(3,num_problems), fwave(3,3,num_problems)
Then recode references to s and fwave accordingly (change index order).
First of all, thank you for your answer.
But could you please double check if your suggestion is correct? Because Fortran uses column-major order, I assumed that s(num_problems,3) and f(num_problems,3,3) would be the right order for vectorization. E.g., in that way, all s(: ,1) values will be stored contiguously, allowing vectorization without gather-scatter. Please let me know if I misunderstood that.
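To illustrate the point (a sketch only; `a` and `b` are hypothetical inputs standing in for the real computation):

```fortran
double precision :: s(num_problems, 3)
!$omp simd
do i = 1, num_problems
   s(i, 1) = a(i) + b(i)   ! s(1,1), s(2,1), s(3,1), ... are contiguous
end do                     ! in column-major order: unit-stride store
```

With s(3,num_problems) instead, consecutive iterations of the SIMD loop over `i` would write elements 3 apart in memory, which is exactly the strided-store situation.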
The compiler reports seem to confirm this. This can be found in the report for the code I've posted here:
remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
remark #15448: unmasked aligned unit stride loads: 10
remark #15449: unmasked aligned unit stride stores: 3
And this can be found in the report after I've followed your suggestion and changed the array order:
remark #15301: OpenMP SIMD LOOP WAS VECTORIZED
remark #15448: unmasked aligned unit stride loads: 10
remark #15453: unmasked strided stores: 3
Also, reordering the arrays did not fix the bug - I still get incorrect results.
Sorry for the delay in getting back to you.
My normal access to this forum is by way of MS Internet Explorer 11 on a Windows 7 x64 system. Lately I have been unable to log in to the IDZ site; I get those goofy internet-failure, ransomware, and other cutesy graphics (alien abduction).
Today, on a whim, after another failure I tried the site using Mozilla Firefox. I am glad to say I can now log in. Ergo, the login failures are due to a recent compatibility issue that affects MS IE 11 logins.
On to your post.
My original comment addressed the outermost OMP SIMD statement: .NOT. the (potentially) inlined subroutine, but rather the loop that followed the call.
In Fortran, Array(I,J,K) and Array(I+1,J,K) are adjacent elements in memory. Therefore, for optimization purposes, it is desirable to have the innermost loop index be the leftmost index of the array.
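A minimal sketch of that loop ordering (array names and bounds are hypothetical):

```fortran
do k = 1, nz
   do j = 1, ny
      do i = 1, nx                        ! innermost loop runs over the
         a(i, j, k) = 2.0d0 * b(i, j, k)  ! leftmost index: unit-stride,
      end do                              ! vectorizable access
   end do
end do
```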
Because the compute cost of the loop following the first call level is negligible compared to that of the called code (inlined or not), my suggestion would have been of little value performance-wise.
That said, placing OMP SIMD on the outermost loop is not practical for your program. The only practical placements would be on loops that do not contain calls to code of any complexity, i.e., only calls to subroutines that are declared as (and capable of) operating on vector arguments (those with the ELEMENTAL attribute).
Note, elemental routines generally cannot contain anything other than trivial IF-ENDIF, IF-THEN-ENDIF or IF-THEN-ELSE-ENDIF constructs; in other words, those that can be implemented with a SIMD vector masked instruction.
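For instance, something like this sketch (a hypothetical function, not taken from the posted code) is simple enough to map onto a masked SIMD select:

```fortran
elemental function clip_positive(x) result(y)
   double precision, intent(in) :: x
   double precision :: y
   if (x > 0.0d0) then       ! simple branch: becomes a masked
      y = x                  ! SIMD blend/select, no real control flow
   else
      y = 0.0d0
   end if
end function clip_positive
```

Being ELEMENTAL, it can then be applied to whole arrays elementwise, e.g. `s = clip_positive(s)`.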
I think this may be a case of "SIMD is good" ergo declare everything SIMD.
I suggest you remove all SIMD directives and use VTune to sample a truly representative problem. Look at the hot spots, and then determine if and where SIMD vectorization can be improved through use of inlining and/or vectorization.
riemanntype does not appear suitable for ELEMENTAL; thus solve_rpn2 (even when inlined) is not suitable for OMP SIMD.
riemann_aug_JCP suffers the same problem and cannot be (successfully) declared OMP SIMD.
BTW, I notice that your convergence routines have maxiter=1. If this is for testing purposes, then I suggest using the correct value when you run VTune; otherwise the profile data will not be representative of the application under actual use conditions.
while I appreciate your tips, I think you are missing the point here. Please think of this thread as a potential bug report, not as a request for help on optimizing the code. I apologize for not making that clearer before. Maybe I should have used a different channel? Should I have submitted a ticket instead?
I disagree when you say that SIMD is not suitable for the outer loop in my example. Indeed, my experiments with that loop show speedups of 1.11x on Haswells and 1.98x on KNLs when I use OMP SIMD and DECLARE SIMD in that code. When I use the ugly workaround I've shown in main_scalars.f90 (replacing PRIVATE arrays with lots of scalars), these speedups improve to 1.57x and 3.83x, respectively. Even if these are far from the 4x and 8x that would have been ideal for these machines, it is clear that vectorization still pays off for this loop, especially when the subroutine is inlined into it.
My point is: while vectorization (especially with SIMD directives) may deliver wrong results, I don't think that inlining a routine should change the computed results (please let me know if I am mistaken here). And that is exactly what is happening here. Could you think of a good reason why inlining that particular subroutine into that particular SIMD loop should deliver different results than when not inlining it?
With my code main_scalars.f90, I've shown that this routine can indeed be inlined and the loop vectorized with correct results and a quite good increase in performance. "All" I had to do was get rid of the PRIVATE arrays, as I explained before. Now, why this one works and the original one does not, is something I can't explain. That is the reason I am starting to believe that there might be a bug in the compiler optimizations.
Wow, that is an enormous amount of inlining. And not all calls in the loop are inlined (e.g. riemanntype).
Whilst inlining of functions should not of itself change results, when you use OMP SIMD, you are asserting that there are no dependencies other than the ones you declare. It's a bit hard to check all that inlined code - at the least, there's a large number of local scalars and arrays that the compiler needs to privatize.
When the functions are large, SIMD functions may be a good alternative. They will be more efficient, on platforms that support Intel AVX and higher, if compiled with -vecabi=cmdtarget (otherwise, the ABI requires use of SSE registers, not AVX ones).
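The SIMD-function alternative mentioned above looks roughly like this (a sketch; the subroutine name, arguments, and body are placeholders, not the real Riemann solver):

```fortran
subroutine wave_speed(ql, qr, s1)
!$omp declare simd(wave_speed)
   double precision, intent(in)  :: ql, qr
   double precision, intent(out) :: s1
   s1 = 0.5d0 * (ql + qr)   ! placeholder computation
end subroutine wave_speed
```

Compiled with something like `ifort -O2 -fopenmp -vecabi=cmdtarget main.f90`, the compiler generates vector variants of the routine that can use the full-width registers of the targeted instruction set rather than the default SSE-based vector ABI.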
Hi Martyn, thank you for looking into this. Some considerations:
- I tried your suggestion of compiling with -vecabi=cmdtarget, and that improved the performance of my code with a SIMD function (instead of inlining) by 4% on the Haswell and by 18% on the KNL. Even so, the code with the inlined function and the workaround I've posted here is a lot faster on both architectures.
- When simplifying the code to post here, I wanted to get rid of those variables (mu and nv) to make it simpler, but I missed that spot. Sorry for that. That is not a problem in the original code, because there they are properly initialized.
- There is no dependence between different iterations of the loop. The way it is organized, with PRIVATE variables and a function wrapping almost all computations makes it easy to notice that. Also, if that was not the case, the code where I avoided using PRIVATE arrays would probably not work correctly, since it is a straightforward adaptation of the original code.
I would also like to add that I've tried to "manually inline" all function calls into the loop (i.e., such that there is no call to other functions there) and declare all additional variables as PRIVATE. However, that led to the same errors. Thus, the problem is not with the actual inlining, but seems to be with the way the compiler deals with private arrays. But that is not always the case, as I've tried other simple SIMD loops with private arrays and they worked fine.
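As I understand it, the minimal shape of the failing pattern is simply a SIMD loop with a PRIVATE local array written and read inside the body, along these lines (a sketch with hypothetical names, since the real loop body is much larger):

```fortran
double precision :: sw(3)
!$omp simd private(sw)
do i = 1, n
   sw(1) = q1(i)
   sw(2) = q2(i)
   sw(3) = q3(i)                      ! each SIMD lane needs its own
   out(i) = sw(1) + sw(2) + sw(3)     ! privatized copy of sw
end do
```

This pattern works fine in other simple loops I have tried, which is why I suspect the failure depends on how the compiler privatizes the arrays in this particular case.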
Thank you for reporting this here. If you would be so kind as to submit a reproducible test case and a summary of the issue to the Online Service Center via https://supporttickets.intel.com/?lang=en-US, it will be investigated further and in more depth.