Solved: AVX2 and FMA3 support

MSimm2 · ‎07-07-2013

The FAQ states "Yes, Intel OpenCL* SDK 2013 introduces performance improvements that include full code generation on the Intel Advanced Vector Extensions (Intel AVX and Intel AVX2)."

I'm trying to get it to produce code that utilises the AVX2 FMA3 instructions.

I'm using the Kernel Builder (CPU - 64 bit AVX2), i.e. with the target set to the AVX2 instruction set.

-------------
__kernel void dofma(const global float *a, const global float *b, const global float *c, global float *out)
{
    uint gid = get_global_id(0);
    float fa = a[gid];
    float fb = b[gid];
    float fc = c[gid];
    fa = mad(fa, fb, fc);
    out[gid] = fa;
}
-------------

This gives code that uses vmulps and vaddps rather than VFMADD213-type instructions.

Using fa = fma(fa,fb,fc); instead
produces a lot more code and a function call for the fma, which results in very low performance.

Raghupathi_M_Intel · ‎07-12-2013

Thanks for the code sample. I'll take a look and get back to you. Just to clarify, FMA3 is only supported in 4th Gen Intel Core Processors. What is your CPU config?

Thanks,
Raghu

MSimm2 · ‎07-12-2013

Raghu Muthyalampalli (Intel) wrote:
Thanks for the code sample. I'll take a look and get back to you. Just to clarify, FMA3 is only supported in 4th Gen Intel Core Processors. What is your CPU config?

i7-4770 (non-K)

However, that shouldn't matter if the Kernel Builder build options are set to target the AVX2 instruction set.
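
The two points are separate: the Kernel Builder target controls what the offline compiler emits, while the CPU question is about the machine that eventually runs the kernel. A minimal sketch of a host-side check is below, assuming GCC or Clang on x86 (it relies on their __builtin_cpu_supports built-in); the function name is hypothetical, and other compilers or architectures simply report false here.

```c
#include <stdbool.h>

/* Hypothetical helper: true when the CPU running this code reports both
   AVX2 and FMA3. Assumes GCC or Clang on x86; anything else returns false. */
bool host_has_avx2_fma3(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  /* make sure the feature bits have been read */
    return __builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma");
#else
    return false;
#endif
}
```

On an i7-4770 this would be expected to report true, since Haswell is the 4th Gen Core family mentioned above.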

MSimm2 · ‎08-01-2013

The Intel SPMD Program Compiler (ispc) does emit FMA instructions (vfmadd213ps %ymm0, %ymm1, %ymm2).
It's an example of how the OpenCL asm should look.

However, this isn't useful to me, since I need to target both CPUs and GPUs (and GPUs have more GFLOPS) and I don't want to maintain the code in two different APIs.

For example, with the file Test.ispc below and the command:

ispc -O2 Test.ispc -o Test.asm -h Test_ispc.h --target=avx2 --emit-asm

------------------------------------------
export void simple(uniform float a[], uniform float b[], uniform float c[], uniform float out[], uniform int count)
{
    foreach (index = 0 ... count)
    {
        float fa = a[index];
        float fb = b[index];
        float fc = c[index];
        fa = fb * fc + fa;
        out[index] = fa;
    }
}
------------------------------------------

MSimm2 · ‎12-18-2013

It still does not use the AVX2 FMA instructions. Isn't this an obvious thing to implement?

I'm still getting:

vmovups YMM1, YMMWORD PTR [R11 + 4*RDI]
vmulps YMM0, YMM1, YMM0
vmovups YMM1, YMMWORD PTR [R9 + 4*RDI]
vaddps YMM0, YMM0, YMM1
vmovups YMMWORD PTR [R8 + 4*RDI], YMM0

Where is the VFMADD213?

-------------------------------------

Using build options: -cl-unsafe-math-optimizations -cl-fast-relaxed-math -cl-mad-enable

Setting target instruction set architecture to: Advanced Vector Extension 2 (AVX2)
Intel OpenCL Intel CPU device was found!
Device name: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
Device version: OpenCL 1.2 (Build 78712)
Device vendor: Intel(R) Corporation
Device profile: FULL_PROFILE
Compilation started
Compilation done
Linking started
Linking done
Kernel <dofma> was successfully vectorized
Done.
Build succeeded!

MSimm2 · ‎05-21-2014

YES!

Intel OpenCL SDK 2014, 64-bit CPU runtime:

FMA is working.

It's generating vfmadd213ps %ymm0, %ymm1, %ymm2 instructions for both mad() and fma().