I have written a benchmarking application for OpenCL: https://github.com/krrishnarraj/clpeak . One of the tests measures the compute capacity (GFLOPS) of the device. When run on 32-bit Windows, it gives the expected results on Sandy Bridge:
Platform: Intel(R) OpenCL
Device: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
Driver version: 1.2 (Win32)
Single-precision compute (GFLOPS)
float : 25.19
float2 : 50.48
float4 : 50.37
float8 : 51.75
float16 : 51.85
The theoretical peak of this device is 76.8 GFLOPS.
But when the same code runs on 64-bit Windows, it gives different results:
Platform: Intel(R) OpenCL
Device: Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz
Driver version: 1.2 (Win64)
Single-precision compute (GFLOPS)
float : 25.15
float2 : 99.25
float4 : 172.25
float8 : 80.07
float16 : 96.42
It looks like the vector code (float2, float4) has been optimized down to scalar float, or some out-of-order optimization has happened. Not sure what is going on!
The ASM output from the kernel analyzer shows all the fmad and fmul instructions generated properly. Is there any optimization specific to x64? Anything advanced?
Anyone there?
Hello Krishnaraj,
Please note that the OpenCL compiler implicitly vectorizes the kernel for you. It does this along dimension zero of the workgroup's work-items. During this vectorization, explicit user vectors (e.g. float2, float4) are broken into scalars and then re-vectorized across the work-item space.
With the above in mind, at the vector assembly level, each instruction in the compute_sp_v1 kernel is data-dependent on the previous one. In the other kernels (those using vector types), after breaking the operations into scalars, we get two separate dependency chains. This lets the compiler schedule independent instructions near each other and benefit from the processor's instruction-level parallelism.
In 64-bit mode our compiler manages to expose this parallelism, while in 32-bit mode it does not. I assume the difference is due to the smaller number of registers available in 32-bit mode. This explains the higher performance you observed in 64-bit mode.
The theoretical peak GFLOPS of your CPU is higher than 150.
Arik
Thank you for the reply
A few questions:
http://download.intel.com/support/processors/corei7/sb/core_i7-3600_m.pdf says that the max CPU FLOPS of the 3630QM is 76 GFLOPS. Confused!
FLOPS = 4 cores * 8 AVX lanes * 2.4 GHz * 1 mul/add per clock = 76.8 GFLOPS
Parallelism in this context means pipelining, right? Because in the float8 kernel you already have AVX instructions (ILP exploited), and only when the pipeline is kept busy do you get maximum throughput. Right?
Thanks again
Hello Krishnaraj,
Sorry for the delay.
Unfortunately, there is a mistake in that document.
The correct peak GFLOPS calculation is: 4 cores * 2 AVX ALUs * 8 AVX lanes * 2.4 GHz * 1 mul/add per clock = 153.6 GFLOPS.
The actual frequency might be higher with turbo, but I cannot calculate it, as it depends on too many factors.
Parallelism in this context is about using both of the AVX ALUs.
Arik
Thanks, Arik. I know it's Christmas Eve.
Well, that explains everything. So the 172 GFLOPS figure is the effect of turbo mode.