OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.

unknown optimization on x64

krishnaraj
Beginner

I have written an OpenCL benchmarking application: https://github.com/krrishnarraj/clpeak . One of the tests measures the compute capacity (GFLOPS) of the device. When run on 32-bit Windows, it gives the expected results on Sandy Bridge:

Platform: Intel(R) OpenCL
  Device:       Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
    Driver version: 1.2 (Win32)

    Single-precision compute (GFLOPS)
      float   : 25.19
      float2  : 50.48
      float4  : 50.37
      float8  : 51.75
      float16 : 51.85

The theoretical peak of this device is 76.8 GFLOPS.

But when the same code runs in 64-bit mode, it gives different results:

Platform: Intel(R) OpenCL
  Device:       Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
    Driver version: 1.2 (Win64)

    Single-precision compute (GFLOPS)
      float   : 25.15
      float2  : 99.25
      float4  : 172.25
      float8  : 80.07
      float16 : 96.42

It looks like the vector code (float2, float4) has been optimized down to float, or some out-of-order optimization has happened. I'm not sure what is going on!

The ASM output from the kernel analyzer shows all the fmad and fmul instructions generated properly. Is there some optimization specific to x64? Anything more advanced?

6 Replies
krishnaraj
Beginner

Anyone there?

Arik_N_Intel
Employee

Hello Krishnaraj,

Please note that the OpenCL compiler implicitly vectorizes the kernel for you. It does this along dimension zero of the workgroup's work-items. During this vectorization process, the user's explicit vectors (e.g. float2, float4) are broken into scalars and then re-vectorized across the work-item space.

With the above in mind, at the vector assembly level, each instruction in the compute_sp_v1 kernel is data-dependent on the previous one. In the other kernels (those using a vector type), after breaking the operations into scalars, we get two separate dependency chains. This allows the compiler to schedule independent instructions near each other and benefit from the processor's instruction-level parallelism.

Having said that, in 64-bit mode our compiler manages to expose this parallelism, while in 32-bit mode it does not. I assume the difference is due to the smaller number of registers available in 32-bit mode. This explains the higher performance you observed in 64-bit mode.

The theoretical peak GFLOPS of your CPU is higher than 150.

Arik

 

krishnaraj
Beginner

Thank you for the reply

A few questions:

http://download.intel.com/support/processors/corei7/sb/core_i7-3600_m.pdf says that the max CPU FLOPS of the 3630QM is 76 GFLOPS. Confused!

FLOPS = 4 cores * 8 AVX lanes * 2.4 GHz * 1 mul/add per clock = 76.8 GFLOPS

Parallelism in this context means pipelining, right? Because in the float8 kernel you already have AVX instructions (ILP exploited). Only when the pipeline is kept busy do you get max throughput. Right?

 

Thanks again

Arik_N_Intel
Employee

Hello Krishnaraj,

Sorry for the delay,

Unfortunately, there is a mistake in that document.

The correct peak GFLOPS calculation is: 4 cores * 2 AVX ALUs * 8 AVX lanes * 2.4 GHz * 1 mul/add per clock = 153.6 GFLOPS

The actual frequency might be higher with turbo (but I cannot calculate that, as it depends on too many factors).

Parallelism in this context is about using both of the AVX ALUs.

 

Arik

krishnaraj
Beginner

Thanks Arik. I know it's Christmas Eve.

Well, that explains everything. So 172 GFLOPS is the effect of turbo mode.
