Hi, Intel technician,
When I build my application with the Intel 2017 Fortran compiler on a Broadwell computer, the build that uses AVX2 instructions runs slower than the build that uses the default SSE2 instructions. My application is an OpenMP-based parallel Fortran code that is vectorized and threaded. Could you tell me when I should use AVX2 instead of SSE2 when building with the Intel 2017 Fortran compiler? My goal is always to run the application as fast as possible. Which instruction set do you advise for the best performance on the Broadwell computer?
I look forward to hearing from you. Thanks in advance,
I assume you are using the -axcore-avx2 compiler option.
This option is a suggestion to the compiler: it MAY use instructions up to AVX2 where the compiler thinks that is beneficial.
Performance is complicated. Going to AVX2 does not guarantee better performance. If your application is memory-bandwidth bound, AVX2 will only put more pressure on the memory subsystem and slow performance. This is one scenario where AVX2 may be slower. There are others. I would recommend profiling your code with SSE2 and with AVX2 under VTune Amplifier XE to see how much memory bandwidth you're driving in each case, to see where in the code performance is slowing, and then to dive in and determine why.
Could you please tell us your environment and exactly what you are comparing? That means the exact command lines, the type of application, the OS (I presume this is 64-bit?), the number of threads, whether hyperthreading is enabled, and the compiler version.
Certainly, Intel AVX2 is the best performing instruction set for most applications running on Broadwell. Different applications have different characteristics and in some circumstances, AVX2 might not run fastest, as Ron indicated. If you can locate the bottleneck, you might be able to find ways to work around it.
Run time costs (seconds) in case of 32 cores used
Windows Edition: Windows Server 2012 R2 Standard
Processor: Intel® Xeon® CPU E5-2697A v4 @ 2.60GHz 2.59GHz (2 processors)
System type: 64-bit OS, x64-based processor
Thanks. You don't give the command lines. I suppose that you compile with /Qxcore-avx2 in one case, and with /Qxsse2 or with no /Qx switch in the other case? Do you compile with /O3? If not, you should try it. Also, aligning your data with /align:array32byte might help a little. Incidentally, you're submitting an issue seen on Windows to a forum for Linux and Mac OS, but that's OK. The OS may not make much difference for this sort of issue.
Reservoir simulation is typically very memory intensive, with performance limited by bandwidth to memory, so your application probably does fit the case described by Ron. In particular, even if your hot loops are vectorized, doing the arithmetic computations (roughly) twice as fast with AVX2 instructions compared to SSE2 is of no benefit if the time saved is all spent waiting for more data to arrive. If you can find a way to modify your application so that there is more data reuse and therefore less memory traffic, you may start to see a benefit from Intel AVX or AVX2, as well as some speed-up for both the SSE2 and AVX2 builds. Intel VTune Amplifier can help you look for memory bottlenecks; VTune or Intel Advisor can help you to see whether your hot loops are vectorized. The potential gains here, if you can reduce memory traffic, could be considerably greater than the small differences you are reporting, which are all less than 3% and may be sensitive to workload and trip counts at run-time.
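To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch in Python. It is only a simple roofline-style model that assumes a loop takes as long as the slower of its arithmetic and its memory traffic. The function name and all the numbers (bandwidth, flop rates, loop sizes) are made-up illustrative assumptions, not measurements of your code or of a Broadwell system.

```python
def loop_time(n_elems, flops_per_elem, bytes_per_elem, peak_flops, mem_bandwidth):
    """Estimate loop time (seconds) as the slower of compute and memory traffic."""
    compute_time = n_elems * flops_per_elem / peak_flops
    memory_time = n_elems * bytes_per_elem / mem_bandwidth
    return max(compute_time, memory_time)


# Hypothetical machine numbers, chosen only for illustration.
N = 100_000_000            # loop trip count
BANDWIDTH = 60e9           # bytes/s of memory bandwidth (assumed)
SSE2_FLOPS = 100e9         # flop/s with SSE2 (assumed)
AVX2_FLOPS = 2 * SSE2_FLOPS  # "roughly twice as fast" arithmetic with AVX2

# Streaming loop like a(i) = b(i) + s*c(i) in double precision:
# 2 flops per element, 24 bytes moved (read b, read c, write a).
stream_sse2 = loop_time(N, 2, 24, SSE2_FLOPS, BANDWIDTH)
stream_avx2 = loop_time(N, 2, 24, AVX2_FLOPS, BANDWIDTH)

# Hypothetical loop with heavy data reuse: 200 flops for the same 24 bytes.
reuse_sse2 = loop_time(N, 200, 24, SSE2_FLOPS, BANDWIDTH)
reuse_avx2 = loop_time(N, 200, 24, AVX2_FLOPS, BANDWIDTH)

print(f"streaming loop speedup from AVX2: {stream_sse2 / stream_avx2:.2f}")
print(f"high-reuse loop speedup from AVX2: {reuse_sse2 / reuse_avx2:.2f}")
```

Under these assumed numbers the streaming loop gets no speedup from AVX2 because memory time dominates in both builds, while the loop with more work per byte gets the full factor of two. A real application sits somewhere in between, and this model ignores caches, prefetching, and any frequency differences, so treat it only as a way to reason about where the time goes.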