The FAQ states "Yes, Intel OpenCL* SDK 2013 introduces performance improvements that include full code generation on the Intel Advanced Vector Extensions (Intel AVX and Intel AVX2)."
I'm trying to get it to produce code that utilises the AVX2 FMA3 instructions.
I'm using the Kernel Builder (CPU - 64 bit AVX2), i.e. with the target set to the AVX2 instruction set.
-------------
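__kernel void dofma(const global float *a, const global float *b, const global float *c, global float *out)
{
    uint gid = get_global_id(0);
    float fa = a[gid];
    float fb = b[gid];
    float fc = c[gid];
    // mad() computes fa*fb + fc; OpenCL leaves the rounding of the product
    // undefined, so the compiler may use either an FMA or a separate mul + add
    fa = mad(fa, fb, fc);
    out[gid] = fa;
}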
------------------
This generates vmulps and vaddps instructions, but no VFMADD213-type instruction.
Using fa = fma(fa,fb,fc); instead produces a lot more code, including a function call for the fma, which results in very low performance.
Thanks for the code sample. I'll take a look and get back to you. Just to clarify, FMA3 is only supported on 4th Gen Intel Core processors. What is your CPU config?
Thanks,
Raghu
Raghu Muthyalampalli (Intel) wrote:
Thanks for the code sample. I'll take a look and get back to you. Just to clarify, FMA3 is only supported on 4th Gen Intel Core processors. What is your CPU config?
i7-4770 (non-K).
However, that shouldn't matter if the Kernel Builder build options are set to target the AVX2 instruction set.
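The Kernel Builder can generate AVX2/FMA code on any machine, but the resulting vfmadd* instructions only run on a CPU that implements FMA3 (the i7-4770 is a 4th Gen Haswell part, so it does). A minimal host-side check, assuming GCC or Clang on x86, where `__builtin_cpu_supports` is a compiler builtin; the helper name is hypothetical:

```c
#include <stdbool.h>

/* Hypothetical helper: true when the CPU running this code can execute
   both AVX2 and FMA3 instructions. Uses GCC/Clang x86 builtins;
   on other architectures it simply reports false. */
bool host_supports_avx2_fma(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();   /* only required before constructors run; harmless otherwise */
    return __builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma");
#else
    return false;
#endif
}
```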
The Intel SPMD Program Compiler (ispc) does emit FMA instructions (vfmadd213ps %ymm0, %ymm1, %ymm2), which is an example of how the OpenCL assembly should look.
However, this isn't useful to me, since I need to target both CPUs and GPUs (and GPUs have more GFLOPS) and I don't want to maintain the code in two different APIs.
For example, with the file Test.ispc below and the command
ispc -O2 Test.ispc -o Test.asm -h Test_ispc.h --target=avx2 --emit-asm
------------------------------------------
export void simple(uniform float a[], uniform float b[], uniform float c[], uniform float out[], uniform int count)
{
    foreach (index = 0 ... count)
    {
        float fa = a[index];
        float fb = b[index];
        float fc = c[index];
        fa = fb * fc + fa;
        out[index] = fa;
    }
}
It still does not use the AVX2 FMA instructions. Isn't this an obvious thing to implement?
I'm still getting:
vmovups YMM1, YMMWORD PTR [R11 + 4*RDI]
vmulps YMM0, YMM1, YMM0
vmovups YMM1, YMMWORD PTR [R9 + 4*RDI]
vaddps YMM0, YMM0, YMM1
vmovups YMMWORD PTR [R8 + 4*RDI], YMM0
Where is the VFMADD213?
-------------------------------------
Using build options: -cl-unsafe-math-optimizations -cl-fast-relaxed-math -cl-mad-enable
Setting target instruction set architecture to: Advanced Vector Extension 2 (AVX2)
Intel OpenCL Intel CPU device was found!
Device name: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
Device version: OpenCL 1.2 (Build 78712)
Device vendor: Intel(R) Corporation
Device profile: FULL_PROFILE
Compilation started
Compilation done
Linking started
Linking done
Kernel <dofma> was successfully vectorized
Done.
Build succeeded!
YES!
Intel OpenCL SDK 2014, 64-bit CPU runtime: FMA is working.
It's generating vfmadd213ps %ymm0, %ymm1, %ymm2 instructions for both mad() and fma().
