I'm attempting to understand how to optimize for cache performance but am being sidetracked by a puzzling problem while using the roofline analysis in Advisor. The attached code is run twice, first with the pragma active and then with it commented out. The version with the pragma commented out is nearly 6x faster than the version with the pragma active. Intel Advisor shows a 60X speedup for the no-pragma result and only 10X with the pragma enabled.
First of all, how can the speedup be greater than 8? My machine is an Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz with 4 physical cores (hyperthreading is disabled) and an L3 cache of 8388608 bytes. It's capable of AVX2 instructions, which should give a vector speedup of 8 on single-precision floats. Both results exceed my expectation for a single-threaded job.
Second of all, why does the pragma slow things down? For the pragma test, I can see in the roofline analysis generated by Advisor that the memory speed is approaching the L3 bandwidth limit. I don't expect this, as the array is far larger than the L3 cache. The really puzzling thing is that the roofline analysis for the no-pragma case shows memory bandwidth far below the DRAM bandwidth line and yet its speed is 6X greater. Is Advisor telling me lies?
Sorry if any of this is unclear or if I'm making some stupid assumptions somewhere. I think the Advisor has the makings of a fantastic tool but I'm puzzled by these results. I'm happy to clarify anything that might be pointed out to me.
Thanks in advance,
I forgot to mention that my compile line is as follows:
icc -O2 -xCORE_AVX2 -g -fargument-noalias -qopt-subscript-in-range -ansi-alias -par-affinity=scatter -qopenmp cachetest2.c -o cachetest2
There are some OpenMP flags in there simply because my real problem is threaded and I'm experimenting with loop tiling to improve cache performance. The pragma issue I've posted is a sidetrack from this primary problem.
In looking at my code, I realized that I forgot to include math.h and use -lm on the command line. In addition, I considered the possibility that the optimizer is smarter than I am and only calculated the last element of my array, as this is all I referenced after the loop. I corrected the math.h and -lm omissions and now scan the entire array for the maximum value, which is printed. This didn't change the runtime of the two tests. Still puzzled. New file attached.
1. Your code is single-threaded, so the number of cores should not have an impact.
2. The pragma you are using, #pragma omp simd, tells the compiler to vectorize the loop regardless of whether the result is legal or faster.
This can definitely slow down the code if you use it improperly.
Since you have a function call in the loop, the compiler may be serializing the call in order to vectorize... but math functions typically do have vector versions... so this may not be what is happening.
You can look at the Advisor recommendations and, if you need more information, you should look at the instructions generated and also the compiler optimization reports: https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-optimization-reports-viewing
Thanks for the question
Thanks for the quick reply. I carefully chose all constants to be a multiple of 8, so vectorization should be legal. It's easy to hand calculate the maximum value so I know it's correct.
Good comment on my use of sqrtf. I'm pretty sure it vectorizes, as I use it in other code (also, sqrtf doesn't show up on the roofline analysis, so it's not directly called). To be sure, I rewrote the loop to avoid calling sqrtf (I use j*j*j/1.e12). I also converted things to double so that I could get more exact answers. The runtime difference is even more dramatic: 28 seconds for the pragma case, 1.6 seconds for the no-pragma case! New code attached.
I don't think a lack of vectorization is the problem here. It's that I'm getting more than I expect. In this latest version of my test, Advisor reports a 24.83x speedup for that loop when no pragma is used and 3.4x when it is used. I expect a maximum of 4X on doubles using AVX2. I'd love to get this kind of performance in my real code! I'm questioning Advisor's reported vectorization gain, as it seems unreasonable. Something magic is happening and both Advisor and I are being confused by it.
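The 8X and 4X expectations in this thread come from simple register-width arithmetic. A minimal sketch of that calculation follows; the function name simd_lanes is made up for illustration, and it only models SIMD width, not any other effect a compiler or profiler might fold into a reported gain.

```c
#include <assert.h>

/* Hypothetical helper: number of elements that fit in one SIMD register.
   register_bits: vector register width in bits (256 for AVX2)
   element_bits:  width of one element (32 for float, 64 for double) */
int simd_lanes(int register_bits, int element_bits)
{
    assert(element_bits > 0 && register_bits % element_bits == 0);
    return register_bits / element_bits;
}
```

Calling simd_lanes(256, 32) gives 8 and simd_lanes(256, 64) gives 4. A reported gain well above these numbers hints that something other than vector width changed, for example that work was removed or reordered.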
I've never learned to read assembly code but I may have to do so to get to the bottom of this.
FYI, I'm running these tests on Centos 7.6.
I followed up on your suggestion to look at optimization reports using the option -qopt-report. In the no-pragma case, the compiler switched the order of the two loops. This is the big difference.
A few quick questions:
1) Why do you need iblock? The innermost loop doesn't seem to be a function of iblock, so it could be "optimized out" as invariant (and that might lead to numerous "surprising effects", unless I missed something). So basically, without the pragma, your inner j-loop might be executed only once, because the compiler was "smart" enough to see that there is no need to repeat the innermost loop 200 times... And with the pragma it could, by chance, be harder to apply such an aggressive optimization.
Also, could you please share a performance comparison between this code compiled with -O0 and the version without the pragma?
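The invariance concern can be illustrated with a small sketch. This is a hypothetical reconstruction, not the attached code: the names fill_repeated and fill_once are invented, and the loop body uses the j*j*j/1.e12 expression mentioned earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pattern: the outer iblock loop repeats the same stores,
   because nothing on the right-hand side depends on iblock. */
void fill_repeated(double *a, size_t n, int nblocks)
{
    assert(a != NULL && nblocks > 0);
    for (int iblock = 0; iblock < nblocks; ++iblock) {
        for (size_t j = 0; j < n; ++j) {
            a[j] = (double)j * j * j / 1.e12;  /* independent of iblock */
        }
    }
}

/* What an optimizer is allowed to reduce the above to: one pass. */
void fill_once(double *a, size_t n)
{
    assert(a != NULL);
    for (size_t j = 0; j < n; ++j) {
        a[j] = (double)j * j * j / 1.e12;
    }
}
```

Both functions leave identical contents in the array, so a compiler that proves this is free to do far less work than the source suggests. Whether the compiler actually did so here is something only the optimization report or the generated instructions can confirm.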
2) Could you please share the fully expanded Top-Down view from Intel Advisor (screenshots are enough)? I would basically like to see the Advisor Top-Down for the outer loop, the inner loop, and the sqrtf call, in particular the values in the Time and "Type" columns.
I would also be interested to see the "Code Optimization Details" in the Code Analytics section for the i-loop and the j-loop.
Thanks for the questions, Zakhar. As you point out, the inner loop could be optimized away. I suspect that without the pragma in place, the compiler was free to swap the order of the two loops behind my back. The optimization report didn't mention eliminating one of the loops, though. I think this order swap is where that ridiculous 60X speedup came from. The outer loop became the inner loop and I don't think advixe-gui was clever enough to keep up with the compiler. It annotated the original loop but timed the switched version! With the pragma in place, the compiler must have decided not to bother with switching loops.
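For readers unfamiliar with the transformation, here is a rough sketch of what a loop swap can look like. The function names are invented and the body is an assumption based on the j*j*j/1.e12 expression from earlier in the thread; the thread does not confirm that the compiler produced exactly this form.

```c
#include <assert.h>
#include <stddef.h>

/* As written in the source (assumed shape): iblock outer, j inner. */
void original_order(double *a, size_t n, int nblocks)
{
    assert(a != NULL && nblocks > 0);
    for (int iblock = 0; iblock < nblocks; ++iblock)
        for (size_t j = 0; j < n; ++j)
            a[j] = (double)j * j * j / 1.e12;
}

/* After interchange: j outer, iblock inner. The value no longer needs
   recomputing per block, and each a[j] is touched while it is hot in
   cache instead of streaming the whole array nblocks times. */
void interchanged_order(double *a, size_t n, int nblocks)
{
    assert(a != NULL && nblocks > 0);
    for (size_t j = 0; j < n; ++j) {
        double v = (double)j * j * j / 1.e12;  /* invariant in iblock */
        for (int iblock = 0; iblock < nblocks; ++iblock)
            a[j] = v;                          /* same store each time */
    }
}
```

The two versions produce the same array, but the second makes one pass over memory rather than nblocks passes. That would be consistent with the low memory traffic seen in the no-pragma roofline, although this is an interpretation rather than something measured here.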
Since posting the last version of that code, I modified it so that the block elements are summed at the end of the function and then I print this sum out. It should ensure that all iterations of both loops are executed. My crazy 60X speedup disappears with these changes. I suspect the loop switch optimization isn't possible with this change. I've reposted the current copy below. Note that I added a required command-line argument, so that I could increase the size of the allocated memory by the given multiple.
Given that the problem disappears with this latest version, I suspect that you aren't interested in all the displays you've requested. I'll wait to hear from you on this. My gut feeling is that this issue is explained, although it might be nice for others not to have a 60X speedup reported if they're careless enough to stumble like I did in an attempt to increase the amount of work being performed.
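One possible way to make every pass contribute to a printed result is sketched below. It is an assumption about the general idea, not the reposted code: the name sum_over_blocks and the accumulate-then-sum structure are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: each block pass accumulates into a[j], so the
   final contents depend on how many passes ran. The returned sum is
   meant to be printed by the caller so the work stays observable. */
double sum_over_blocks(double *a, size_t n, int nblocks)
{
    assert(a != NULL && nblocks > 0);
    for (size_t j = 0; j < n; ++j)
        a[j] = 0.0;

    for (int iblock = 0; iblock < nblocks; ++iblock)
        for (size_t j = 0; j < n; ++j)
            a[j] += (double)j * j * j / 1.e12;  /* depends on prior pass */

    double sum = 0.0;
    for (size_t j = 0; j < n; ++j)
        sum += a[j];
    return sum;
}
```

Printing the returned value ties the output to every iteration of both loops. An optimizing compiler may still restructure such loops, so it remains worth checking the optimization report after a change like this.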
Thanks for looking into this.