How to know the number of compute units used when using a CPU as an OpenCL device

LSolis · 10-05-2016

I am running a program using Intel OpenCL 1.2. My OpenCL device is a CPU:

[lvs@eredmithrim CapsBasic]$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 94
Model name:            Intel(R) Core(TM) i5-6600K CPU @ 3.50GHz
Stepping:              3
CPU MHz:               3501.000
BogoMIPS:              7007.99
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              6144K
NUMA node0 CPU(s):     0-3

And regarding the OpenCL runtime available:

[lvs@eredmithrim CapsBasic]$ ./CapsBasic 
Number of available platforms: 1
Platform names:
    [0] Intel(R) OpenCL [Selected]
Number of devices available for each type:
    CL_DEVICE_TYPE_CPU: 1
    CL_DEVICE_TYPE_GPU: 0
    CL_DEVICE_TYPE_ACCELERATOR: 0

*** Detailed information for each device ***

CL_DEVICE_TYPE_CPU[0]
    CL_DEVICE_NAME: Intel(R) Core(TM) i5-6600K CPU @ 3.50GHz
    CL_DEVICE_AVAILABLE: 1
    CL_DEVICE_VENDOR: Intel(R) Corporation
    CL_DEVICE_PROFILE: FULL_PROFILE
    CL_DEVICE_VERSION: OpenCL 1.2 (Build 57)
    CL_DRIVER_VERSION: 1.2.0.57
    CL_DEVICE_OPENCL_C_VERSION: OpenCL C 1.2 
    CL_DEVICE_MAX_COMPUTE_UNITS: 4
    CL_DEVICE_MAX_CLOCK_FREQUENCY: 3500
    CL_DEVICE_MAX_WORK_GROUP_SIZE: 8192
    CL_DEVICE_ADDRESS_BITS: 64
    CL_DEVICE_MEM_BASE_ADDR_ALIGN: 1024
    CL_DEVICE_MAX_MEM_ALLOC_SIZE: 4125402112
    CL_DEVICE_GLOBAL_MEM_SIZE: 16501608448
    CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 131072
    CL_DEVICE_GLOBAL_MEM_CACHE_SIZE: 262144
    CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE: 64
    CL_DEVICE_LOCAL_MEM_SIZE: 32768
    CL_DEVICE_PROFILING_TIMER_RESOLUTION: 1
    CL_DEVICE_IMAGE_SUPPORT: 1
    CL_DEVICE_ERROR_CORRECTION_SUPPORT: 0
    CL_DEVICE_HOST_UNIFIED_MEMORY: 1
    CL_DEVICE_EXTENSIONS: cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64 
    CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT: 1
    CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG: 1
    CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT: 1
    CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE: 1
    CL_DEVICE_NATIVE_VECTOR_WIDTH_INT: 8
    CL_DEVICE_NATIVE_VECTOR_WIDTH_LONG: 4
    CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT: 8
    CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE: 4
[lvs@eredmithrim CapsBasic]$

My application has four kernels, and each of them is launched with several work-groups.

I would like to know how many compute units this program actually uses. The only information I can see above is the maximum number, and I think CL_DEVICE_MAX_COMPUTE_UNITS is only a reference value, so the number of compute units actually used may be different.

I wonder if there is a way to control the number of compute units or if this is a runtime-based decision. Any comments on this?

Any info or pointers are appreciated.

Leonardo

Jeffrey_M_Intel1 · 10-09-2016

The CPU implementation is written on top of Threading Building Blocks (TBB). By default it will use the number of physical cores in your machine -- 4 in your case.

You can restrict it to fewer cores with device fission (clCreateSubDevices() in OpenCL 1.2):

https://software.intel.com/en-us/articles/opencl-device-fission-for-cpu-performance

Tamer_Assad · 10-13-2016

Hi Leonardo,

You can query your CL device using "clGetDeviceInfo()".

Multiple calls to "clGetDeviceInfo()", each passing a different value for the "cl_device_info" parameter, can give you all the information you need. Look up:

CL_DEVICE_MAX_WORK_GROUP_SIZE

CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS

CL_DEVICE_MAX_WORK_ITEM_SIZES

For a specific kernel you are setting up, you can use "clGetKernelWorkGroupInfo()". Depending on your query, the following are valid values for the "cl_kernel_work_group_info" parameter:

CL_KERNEL_WORK_GROUP_SIZE

CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE

You can then choose the work-group size, within the device limits reported by the queries above, when you enqueue the kernel with "clEnqueueNDRangeKernel()" (through its local_work_size argument).
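To make the sizing arithmetic concrete, here is a small sketch in plain C with no OpenCL calls. It relies on the OpenCL 1.2 rule that, when a local size is given, the global size must be evenly divisible by it. The helper names are hypothetical, and the limit passed as kernel_max is assumed to be the value returned for CL_KERNEL_WORK_GROUP_SIZE.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp a requested work-group size to the per-kernel limit
 * (assumed to come from CL_KERNEL_WORK_GROUP_SIZE). */
size_t clamp_local_size(size_t requested, size_t kernel_max)
{
    assert(requested > 0 && kernel_max > 0);
    return requested < kernel_max ? requested : kernel_max;
}

/* Round the number of work-items up to the next multiple of `local`,
 * so the global size is evenly divisible by the local size. */
size_t padded_global_size(size_t work_items, size_t local)
{
    assert(local > 0);
    return (work_items + local - 1) / local * local;
}

/* Number of work-groups the runtime will have to schedule. */
size_t work_group_count(size_t global, size_t local)
{
    assert(local > 0 && global % local == 0);
    return global / local;
}
```

For example, 1000 work-items with a local size of 64 give a padded global size of 1024, which is 16 work-groups. The kernel then needs a bounds check so the 24 padding work-items do nothing.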

Best regards,

Tamer Assad