When is AVX2 used in building executable files?

dingjun_chencmgl_ca · ‎01-09-2017

Hi, Intel technician,

When I build my application with the 2017 Fortran compiler, once with AVX2 instructions and once with the default SSE2, the AVX2 build is not as fast as the SSE2 build on a Broadwell computer. My application is OpenMP-based parallel Fortran code that is vectorized and threaded. Could you tell me when I should use AVX2 instead of SSE2 when building my application with the Intel 2017 Fortran compiler? My goal is always to run my application as fast as possible. Which instruction set do you advise for the best performance of my application on the Broadwell computer?

I look forward to hearing from you. Thanks in advance,

Best regards,

Dingjun

Ronald_G_2 · ‎01-09-2017

I assume you use the -axcore-avx2 compiler option.

This is a suggestion or recommendation to the compiler. It says the compiler MAY use instructions up to AVX2 if it thinks that is beneficial.

Performance is complicated. Going to AVX2 does not guarantee better performance. If your application is memory bandwidth bound, AVX2 will only put more pressure on the memory subsystem and slow performance. This is one scenario where AVX2 may be slower. There are others. I would recommend profiling your code with SSE2 and AVX2 under VTune Amplifier XE to see the memory bandwidth you're driving in both cases, to see where in the code performance is slowing, and to dive in to determine why.
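To make the bandwidth-bound scenario concrete, here is a minimal roofline-style sketch in Python. It is not a measurement of any real machine: the bandwidth and peak figures are hypothetical, chosen only for illustration, and the function name `attainable_gflops` is made up for this example. The model caps performance at the lower of the compute peak and bandwidth times arithmetic intensity (FLOPs per byte moved).

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline ceiling: the lower of the compute peak and bandwidth * intensity."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


# Hypothetical machine numbers, chosen only for illustration.
BANDWIDTH_GBS = 100.0  # sustained memory bandwidth, GB/s
PEAK_SSE2 = 300.0      # peak GFLOP/s with 128-bit vectors
PEAK_AVX2 = 600.0      # peak GFLOP/s with 256-bit vectors

for flops_per_byte in (0.25, 1.0, 4.0, 16.0):
    sse2 = attainable_gflops(PEAK_SSE2, BANDWIDTH_GBS, flops_per_byte)
    avx2 = attainable_gflops(PEAK_AVX2, BANDWIDTH_GBS, flops_per_byte)
    print(f"{flops_per_byte:6.2f} FLOP/byte: SSE2 {sse2:6.1f}  AVX2 {avx2:6.1f} GFLOP/s")
```

At low arithmetic intensity both builds hit the same bandwidth ceiling, so the wider vectors buy nothing. The model only explains why AVX2 need not be faster; it does not by itself explain a slowdown, which is why profiling is still needed.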

Ron

Martyn_C_Intel · ‎01-09-2017

Hi Dingjun,

Please can you tell us your environment and exactly what you are comparing? That is, the exact command lines, type of application, OS (I presume this is 64 bit?), number of threads, whether hyperthreading is enabled, and compiler version?

Certainly, Intel AVX2 is the best performing instruction set for most applications running on Broadwell. Different applications have different characteristics and in some circumstances, AVX2 might not run fastest, as Ron indicated. If you can locate the bottleneck, you might be able to find ways to work around it.

dingjun_chencmgl_ca · ‎01-09-2017

Hi, Martyn,

Good questions. I will answer your questions soon.

Thanks,

Dingjun

dingjun_chencmgl_ca · ‎01-10-2017


Run time (seconds) with 32 cores

             SSE2       AVX2
Run1      2166.37    2210.81
Run2      2169.29    2207.83
Run3      2166.39    2213.39
Run4      2165.35    2208.22
Run5      2168.62    2227.15
Average   2167.2     2213.48

Windows Edition: Windows Server 2012 R2 Standard

System:
    Processor:     Intel® Xeon® CPU E5-2697A v4 @ 2.60GHz 2.59 GHz (2 processors)
    Installed RAM: 512 GB
    System type:   64-bit OS, x64-based processor
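For reference, the size of the gap in the timings above can be worked out with a short Python sketch. The numbers are the ones reported in this post; the helper name `percent_slower` is just for illustration.

```python
from statistics import mean

# Run times in seconds, as reported above (32 cores).
sse2 = [2166.37, 2169.29, 2166.39, 2165.35, 2168.62]
avx2 = [2210.81, 2207.83, 2213.39, 2208.22, 2227.15]


def percent_slower(baseline, other):
    """How much slower `other` is than `baseline`, in percent."""
    return 100.0 * (other - baseline) / baseline


mean_sse2 = mean(sse2)
mean_avx2 = mean(avx2)
per_run = [percent_slower(s, a) for s, a in zip(sse2, avx2)]

print(f"mean SSE2: {mean_sse2:.2f} s, mean AVX2: {mean_avx2:.2f} s")
print(f"AVX2 slower on average by {percent_slower(mean_sse2, mean_avx2):.2f}%")
print(f"largest per-run gap: {max(per_run):.2f}%")
print(f"fastest AVX2 run minus slowest SSE2 run: {min(avx2) - max(sse2):.2f} s")
```

The last line is a rough check that the gap is not just run-to-run noise: in these five runs, even the fastest AVX2 run is slower than the slowest SSE2 run.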

dingjun_chencmgl_ca · ‎01-10-2017

Compiler: 2017 Intel Fortran Compiler 2017.1.143

dingjun_chencmgl_ca · ‎01-10-2017

Hyperthreading is disabled.

Type of application: heavy oil reservoir simulator IMEX.

Martyn_C_Intel · ‎01-10-2017

Thanks. You don't give the command lines. I suppose that you compile with /Qxcore-avx2 in one case, and with /Qxsse2 or with no /Qx switch in the other case? Do you compile with /O3? (if not, you should try it). Also, aligning your data with /align:array32byte might help a little. Incidentally, you're submitting an issue seen on Windows to a forum for Linux and Mac OS, but that's OK. The OS may not make much difference for this sort of issue.

Reservoir simulation is typically very memory intensive, with performance limited by bandwidth to memory, so your application probably does fit the case described by Ron. In particular, even if your hot loops are vectorized, doing the arithmetic computations (roughly) twice as fast with AVX2 instructions compared to SSE2 is of no benefit if the time saved is all spent waiting for more data to arrive. If you can find a way to modify your application so that there is more data reuse and therefore less memory traffic, you may start to see a benefit from Intel AVX or AVX2, as well as some speed-up for both the SSE2 and AVX2 builds. Intel VTune Amplifier can help you look for memory bottlenecks; VTune or Intel Advisor can help you to see whether your hot loops are vectorized. The potential gains here, if you can reduce memory traffic, could be considerably greater than the small differences you are reporting, which are all less than 3% and may be sensitive to workload and trip counts at run time.
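As a rough illustration of what "more data reuse and therefore less memory traffic" can mean, the Python sketch below counts the bytes streamed by two separate array loops versus one fused loop. The loops, the array length, and the helper `traffic_bytes` are all hypothetical and not taken from the application discussed here; the count also ignores cache effects such as write-allocate traffic.

```python
BYTES_PER_ELEMENT = 8  # double precision


def traffic_bytes(n_elements, arrays_streamed):
    """Bytes moved to or from memory when each array is streamed once."""
    return n_elements * BYTES_PER_ELEMENT * arrays_streamed


n = 100_000_000  # hypothetical array length, far larger than any cache

# Separate loops:  t = b + c   (read b, c; write t)
#                  d = t * e   (read t, e; write d)
separate = traffic_bytes(n, 3) + traffic_bytes(n, 3)

# Fused loop:      d = (b + c) * e   (read b, c, e; write d)
# The temporary t never leaves a register, so it costs no memory traffic.
fused = traffic_bytes(n, 4)

print(f"separate loops: {separate / 1e9:.1f} GB")
print(f"fused loop:     {fused / 1e9:.1f} GB")
print(f"traffic saved:  {100 * (separate - fused) / separate:.0f}%")
```

Under these assumptions the fused loop does the same arithmetic while moving a third less data, which in a bandwidth-bound code helps the SSE2 and AVX2 builds alike.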