Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

When is AVX2 used in building executable files?

dingjun_chencmgl_ca
1,006 Views

Hi, Intel technician,

When I build my application with the 2017 Fortran compiler on a Broadwell computer, once with AVX2 instructions and once with the default SSE2, the AVX2 build is not as fast as the SSE2 build on that machine. My application is OpenMP-based parallel Fortran code that is vectorized and threaded. Could you tell me when I should use AVX2 instead of SSE2 when building my application with the Intel 2017 Fortran compiler? My goal is always to run my application as fast as possible. Which instruction set is advised for the best performance of my application on the Broadwell computer?

I look forward to hearing from you. Thanks in advance.

Best regards,

Dingjun

0 Kudos
10 Replies
Ronald_G_2
Beginner
1,006 Views

I assume you use the -axcore-avx2 compiler option.

This is a suggestion or recommendation to the compiler. It says the compiler MAY use instructions up to AVX2 if it thinks that is beneficial.

Performance is complicated. Going to AVX2 does not guarantee better performance. If your application is memory bandwidth bound, AVX2 will only put more pressure on the memory subsystem and slow performance. This is one scenario where AVX2 may be slower. There are others. I would recommend profiling your code with SSE2 and AVX2 under VTune Amplifier XE to see the memory bandwidth you're driving in both cases, to see where in the code performance is slowing, and to dive in and determine why.

Ron

0 Kudos
Martyn_C_Intel
Employee
1,006 Views

Hi Dingjun,

Could you please tell us your environment and exactly what you are comparing? That is: the exact command lines, type of application, OS (I presume this is 64-bit?), number of threads, whether hyperthreading is enabled, and the compiler version.

Certainly, Intel AVX2 is the best performing instruction set for most applications running on Broadwell. Different applications have different characteristics and in some circumstances, AVX2 might not run fastest, as Ron indicated. If you can locate the bottleneck, you might be able to find ways to work around it.

 

0 Kudos
dingjun_chencmgl_ca
1,006 Views

Hi, Martyn,

Good questions. I will answer your questions soon.

Thanks,

Dingjun 

0 Kudos
dingjun_chencmgl_ca
1,006 Views
             
Run time (seconds), 32 cores used

          SSE2       AVX2
Run1      2166.37    2210.81
Run2      2169.29    2207.83
Run3      2166.39    2213.39
Run4      2165.35    2208.22
Run5      2168.62    2227.15
Average   2167.20    2213.48

Windows edition: Windows Server 2012 R2 Standard

System:
  Processor:     Intel® Xeon® CPU E5-2697A v4 @ 2.60GHz 2.59 GHz (2 processors)
  Installed RAM: 512 GB
  System type:   64-bit OS, x64-based processor
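
For readers who want to put the gap in perspective, here is a minimal Python sketch that summarizes the ten timings above. It is an illustration only, not part of the original measurements; the variable names are made up for this example. On these five runs per build the AVX2 slowdown comes out at a little over 2%, which is larger than the run-to-run spread of either build.

```python
from statistics import mean, stdev

# Wall-clock times in seconds, copied from the table above (32 cores).
sse2_runs = [2166.37, 2169.29, 2166.39, 2165.35, 2168.62]
avx2_runs = [2210.81, 2207.83, 2213.39, 2208.22, 2227.15]

sse2_mean = mean(sse2_runs)
avx2_mean = mean(avx2_runs)

# Relative slowdown of the AVX2 build with respect to the SSE2 build.
slowdown_pct = 100.0 * (avx2_mean - sse2_mean) / sse2_mean

# Sample standard deviation gives a rough idea of run-to-run noise.
sse2_sd = stdev(sse2_runs)
avx2_sd = stdev(avx2_runs)

print(f"SSE2: mean {sse2_mean:.2f} s, stdev {sse2_sd:.2f} s")
print(f"AVX2: mean {avx2_mean:.2f} s, stdev {avx2_sd:.2f} s")
print(f"AVX2 is slower by {slowdown_pct:.2f}%")
```

Five runs is a small sample, so the standard deviations here are only a rough guide.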
0 Kudos
dingjun_chencmgl_ca
1,006 Views
Compiler: Intel Fortran Compiler 2017.1.143
0 Kudos
dingjun_chencmgl_ca
1,006 Views

Hyperthreading is disabled.

Type of application: heavy oil reservoir simulator IMEX.

0 Kudos
Martyn_C_Intel
Employee
1,006 Views

Thanks. You don't give the command lines. I suppose that you compile with /Qxcore-avx2 in one case, and with /Qxsse2 or with no /Qx switch in the other case? Do you compile with /O3? (If not, you should try it.) Also, aligning your data with /align:array32byte might help a little. Incidentally, you're submitting an issue seen on Windows to a forum for Linux and Mac OS, but that's OK. The OS may not make much difference for this sort of issue.

Reservoir simulation is typically very memory intensive, with performance limited by bandwidth to memory, so your application probably does fit the case described by Ron. In particular, even if your hot loops are vectorized, doing the arithmetic computations (roughly) twice as fast with AVX2 instructions compared to SSE2 is of no benefit if the time saved is all spent waiting for more data to arrive. If you can find a way to modify your application so that there is more data reuse and therefore less memory traffic, you may start to see a benefit from Intel AVX or AVX2, as well as some speed-up for both SSE and AVX2. Intel VTune Amplifier can help you look for memory bottlenecks; VTune or Intel Advisor can help you to see whether your hot loops are vectorized. The potential gains here, if you can reduce memory traffic, could be considerably greater than the small differences you are reporting, which are all less than 3% and may be sensitive to workload and trip counts at run-time.
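
The bandwidth argument can be illustrated with a simple roofline-style estimate. This is a generic model brought in for illustration, not something measured on the system in this thread. The function name and all machine numbers below (bandwidth, peak rates, arithmetic intensities) are hypothetical. The only assumption taken from the discussion is that AVX2 roughly doubles arithmetic throughput over SSE2.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline estimate: a loop can go no faster than the compute peak,
    and no faster than memory bandwidth times its arithmetic intensity."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

# Hypothetical machine numbers, chosen only for illustration.
BANDWIDTH_GBS = 100.0   # sustained memory bandwidth, GB/s
PEAK_SSE2 = 500.0       # peak arithmetic rate with SSE2, GFLOP/s
PEAK_AVX2 = 1000.0      # assumed to be twice the SSE2 rate

# Arithmetic intensity = floating-point operations per byte moved from memory.
for intensity in (0.1, 1.0, 20.0):
    sse2 = attainable_gflops(PEAK_SSE2, BANDWIDTH_GBS, intensity)
    avx2 = attainable_gflops(PEAK_AVX2, BANDWIDTH_GBS, intensity)
    print(f"{intensity:5.1f} flop/byte: SSE2 {sse2:7.1f}  AVX2 {avx2:7.1f}  "
          f"speed-up {avx2 / sse2:.2f}x")
```

In this toy model the low-intensity loops run at the same rate with either instruction set, and only the high-intensity case sees the full factor of two. The model explains why AVX2 may bring no gain; it does not by itself explain a small slowdown, which is where profiling is needed.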

0 Kudos