OpenCL* for CPU

Poor performance with opencl CPU driver

Ilia_E_

Source code: http://pastebin.com/FyZkMrvQ

The Intel® software used was the OpenCL CPU driver opencl_runtime_15.1_x64_5.0.0.57 from https://software.intel.com/en-us/articles/opencl-drivers#lin64

Comparison of Beignet (GPU, id 0) vs. the Intel® proprietary driver (CPU, id 1) vs. pocl (CPU, id 2):

user@host:~/.dev/OpenCL$ gcc perftest.c -std=c11 -O2 -lOpenCL -o perftest
user@host:~/.dev/OpenCL$ for id in 0 1 2; do time ./perftest $id; done
Succeeded to create a device group!
    Device: 0
        Name:                Intel(R) HD Graphics Haswell Ultrabook GT2 Mobile
        Vendor:                Intel
        Available:            Yes
        Compute Units:            20
        Clock Frequency:        1000 mHz
        Global Memory:            2048 mb
        Max Allocateable Memory:    1024 mb
        Local Memory:            65536 kb

Succeeded to create a compute context!
Succeeded to create a command commands!
Succeeded to create compute program!
Succeeded to create program executable!
Succeeded to create compute kernel!

real    0m25.741s
user    0m0.604s
sys    0m17.796s
Succeeded to create a device group!
    Device: 1
        Name:                Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
        Vendor:                Intel(R) Corporation
        Available:            Yes
        Compute Units:            4
        Clock Frequency:        1600 mHz
        Global Memory:            5664 mb
        Max Allocateable Memory:    1416 mb
        Local Memory:            32768 kb

Succeeded to create a compute context!
Succeeded to create a command commands!
Succeeded to create compute program!
Succeeded to create program executable!
Succeeded to create compute kernel!

real    0m50.082s
user    1m21.951s
sys    0m40.065s
Succeeded to create a device group!
    Device: 2
        Name:                pthread-Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
        Vendor:                GenuineIntel
        Available:            Yes
        Compute Units:            4
        Clock Frequency:        2600 mHz
        Global Memory:            5664 mb
        Max Allocateable Memory:    5664 mb
        Local Memory:            1643847680 kb

Succeeded to create a compute context!
Succeeded to create a command commands!
Succeeded to create compute program!
Succeeded to create program executable!
Succeeded to create compute kernel!

real    0m28.620s
user    0m49.843s
sys    0m4.252s


My clinfo output: http://pastebin.com/30jkBzzs
This looks strange: the open source library pocl (http://portablecl.org) beats the official Intel® software in such a simple test case (ignore the reported "Clock Frequency"; under load the CPU runs at 2300 MHz in both cases). If this isn't a bug in my system, wouldn't it be better for Intel® to support pocl (which still has a lot of problems with standards support and stability) instead of developing its own driver?

Robert_I_Intel
Employee

Ilia,

You are measuring the total execution time of a program that has a number of issues:

1. You are allocating and deallocating buffers in a loop, which is highly undesirable. The usual recommendation is to allocate buffers once, outside of the loop.

2. You are allocating buffers the wrong way for our platforms: you need to use the CL_MEM_USE_HOST_PTR flag, create arrays with aligned_alloc using 4096-byte alignment, and size your buffers in multiples of 64 bytes.

3. You shouldn't use clEnqueueReadBuffer and clEnqueueWriteBuffer; use clEnqueueMapBuffer instead, which should result in no copies to or from the device and almost instant execution.
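
The host-side allocation described in point 2 could look roughly like the sketch below. It is not taken from perftest.c; the names round_up, alloc_host_buffer, and PAGE_ALIGNMENT are made up for illustration. Because the original C11 text required the size passed to aligned_alloc to be a multiple of the alignment, the sketch rounds the size up to a multiple of 4096, which is also a multiple of 64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Alignment recommended in the reply for host buffers (assumed page size). */
#define PAGE_ALIGNMENT ((size_t)4096)

/* Round n up to the next multiple of m; m must be a power of two. */
size_t round_up(size_t n, size_t m)
{
    assert(m != 0 && (m & (m - 1)) == 0);
    return (n + m - 1) & ~(m - 1);
}

/*
 * Hypothetical helper: allocate a host buffer that is 4096-byte aligned
 * and whose size is a multiple of 4096 (and therefore of 64).
 * On success the rounded size is stored in *actual (if non-NULL).
 * Returns NULL on allocation failure; release the buffer with free().
 */
void *alloc_host_buffer(size_t requested, size_t *actual)
{
    size_t size = round_up(requested ? requested : 1, PAGE_ALIGNMENT);
    void *p = aligned_alloc(PAGE_ALIGNMENT, size);

    if (p != NULL && actual != NULL)
        *actual = size;
    return p;
}
```

A buffer obtained this way would be allocated once, before the loop, and its pointer and rounded size passed to clCreateBuffer together with CL_MEM_USE_HOST_PTR; the data would then be accessed through clEnqueueMapBuffer rather than through read and write calls.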

Please check this article on how to do performance measurements for OpenCL: https://software.intel.com/en-us/articles/intel-sdk-for-opencl-applications-performance-debugging-intro and this article on how to allocate "zero-copy" buffers: https://software.intel.com/en-us/articles/getting-the-most-from-opencl-12-how-to-increase-performance-by-minimizing-buffer-copies-on-intel-processor-graphics

Bottom line: you should be measuring kernel performance. What you are measuring is program build time, buffer allocation and deallocation, copying data back and forth, and only a little bit of kernel performance.
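
One way to separate those phases, instead of running the whole binary under `time`, is to take wall-clock readings around each phase. The sketch below is only an illustration under assumptions: time_phase, elapsed_seconds, and example_phase are hypothetical names, and example_phase is a stand-in for real work such as the program build or the kernel launch. OpenCL event profiling via clGetEventProfilingInfo is another option that is not shown here.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime under -std=c11 */
#include <assert.h>
#include <time.h>

/* Seconds elapsed between two CLOCK_MONOTONIC readings. */
double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/*
 * Run one phase and return its wall-clock duration in seconds.
 * For a kernel phase, the callback would need to wait for completion
 * (for example with clFinish) before returning, because enqueueing a
 * kernel does not block until the kernel has finished.
 */
double time_phase(void (*phase)(void *), void *arg)
{
    struct timespec t0, t1;

    assert(phase != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    phase(arg);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_seconds(t0, t1);
}

/* Hypothetical stand-in phase: counts how many times it was run. */
void example_phase(void *arg)
{
    int *runs = arg;
    (*runs)++;
}
```

Timing the build, the buffer setup, and the kernel launch as separate calls to time_phase would show how much of the 50 seconds reported for device 1 is actually spent in the kernel.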
