OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.

intermittent hang

bob_p_1

Hi,

I am working on an Intel Mini PC NUC6i55YK with a Skylake chipset, and I'm getting an intermittent hang in a very complicated piece of kernel code. What is the best way to determine where the code is hanging? When it doesn't hang, the output is fine. This code works well on an Nvidia card in another Ubuntu 14.04.3 LTS system.

 Ubuntu 14.04.4 LTS, Intel_sdk_for_opencl_2016_6.0.0.1049_x64, opencl_runtime_16.1_x64_ubuntu_5.2.0.10002

Number of available platforms: 1
Platform names:
    [0] Intel(R) OpenCL [Selected]
Number of devices available for each type:
    CL_DEVICE_TYPE_CPU: 1
    CL_DEVICE_TYPE_GPU: 1
    CL_DEVICE_TYPE_ACCELERATOR: 0
 
*** Detailed information for each device ***
 
CL_DEVICE_TYPE_CPU[0]
    CL_DEVICE_NAME: Intel(R) Core(TM) i5-6260U CPU @ 1.80GHz
    CL_DEVICE_AVAILABLE: 1
    CL_DEVICE_VENDOR: Intel(R) Corporation
    CL_DEVICE_PROFILE: FULL_PROFILE
    CL_DEVICE_VERSION: OpenCL 2.0 (Build 10264)
    CL_DRIVER_VERSION: 1.2.0.10264
    CL_DEVICE_OPENCL_C_VERSION: OpenCL C 2.0 
    CL_DEVICE_MAX_COMPUTE_UNITS: 4
    CL_DEVICE_MAX_CLOCK_FREQUENCY: 1800
    CL_DEVICE_MAX_WORK_GROUP_SIZE: 8192
    CL_DEVICE_ADDRESS_BITS: 64
    CL_DEVICE_MEM_BASE_ADDR_ALIGN: 1024
    CL_DEVICE_MAX_MEM_ALLOC_SIZE: 4177714176
    CL_DEVICE_GLOBAL_MEM_SIZE: 16710856704
    CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 131072
    CL_DEVICE_GLOBAL_MEM_CACHE_SIZE: 262144
    CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE: 64
    CL_DEVICE_LOCAL_MEM_SIZE: 32768
    CL_DEVICE_PROFILING_TIMER_RESOLUTION: 1
    CL_DEVICE_IMAGE_SUPPORT: 1
    CL_DEVICE_ERROR_CORRECTION_SUPPORT: 0
    CL_DEVICE_HOST_UNIFIED_MEMORY: 1
    CL_DEVICE_EXTENSIONS: cl_khr_icd cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_3d_image_writes cl_intel_exec_by_local_thread cl_khr_spir cl_khr_fp64 cl_khr_image2d_from_buffer 
    CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT: 1
    CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG: 1
    CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT: 1
    CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE: 1
    CL_DEVICE_NATIVE_VECTOR_WIDTH_INT: 8
    CL_DEVICE_NATIVE_VECTOR_WIDTH_LONG: 4
    CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT: 8
    CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE: 4
 
CL_DEVICE_TYPE_GPU[0]
    CL_DEVICE_NAME: Intel(R) HD Graphics
    CL_DEVICE_AVAILABLE: 1
    CL_DEVICE_VENDOR: Intel(R) Corporation
    CL_DEVICE_PROFILE: FULL_PROFILE
    CL_DEVICE_VERSION: OpenCL 1.2 
    CL_DRIVER_VERSION: r2.0.54425
    CL_DEVICE_OPENCL_C_VERSION: OpenCL C 1.2 
    CL_DEVICE_MAX_COMPUTE_UNITS: 48
    CL_DEVICE_MAX_CLOCK_FREQUENCY: 950
    CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
    CL_DEVICE_ADDRESS_BITS: 64
    CL_DEVICE_MEM_BASE_ADDR_ALIGN: 1024
    CL_DEVICE_MAX_MEM_ALLOC_SIZE: 1713792614
    CL_DEVICE_GLOBAL_MEM_SIZE: 3427585229
    CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 1713792614
    CL_DEVICE_GLOBAL_MEM_CACHE_SIZE: 1048576
    CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE: 64
    CL_DEVICE_LOCAL_MEM_SIZE: 65536
    CL_DEVICE_PROFILING_TIMER_RESOLUTION: 83
    CL_DEVICE_IMAGE_SUPPORT: 1
    CL_DEVICE_ERROR_CORRECTION_SUPPORT: 0
    CL_DEVICE_HOST_UNIFIED_MEMORY: 1
    CL_DEVICE_EXTENSIONS: cl_intel_accelerator cl_intel_advanced_motion_estimation cl_intel_motion_estimation cl_intel_packed_yuv cl_intel_required_subgroup_size cl_intel_subgroups cl_intel_va_api_media_sharing cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_depth_images cl_khr_fp16 cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_icd cl_khr_image2d_from_buffer cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_spir 
    CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT: 4
    CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG: 1
    CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT: 1
    CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE: 1
    CL_DEVICE_NATIVE_VECTOR_WIDTH_INT: 4
    CL_DEVICE_NATIVE_VECTOR_WIDTH_LONG: 1
    CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT: 1
    CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE: 1

Thanks in advance

Bob

 

allanmac1

Bob,

I don't know if this is your problem, but I just ported a bunch of optimized CUDA kernels to OpenCL, and the only tricky issue I found was that the CUDA __syncthreads() primitive has more relaxed behavior than an OpenCL barrier(CLK_LOCAL_MEM_FENCE) on Intel.

The unexplained hanging in my kernels was fixed once I made sure that all work-items in the work-group participated in the OpenCL barrier(CLK_LOCAL_MEM_FENCE)/work_group_barrier(CLK_LOCAL_MEM_FENCE). Once I understood the problem, I was able to improve the kernel structure even more by using OpenCL 2.0 non-uniform work-groups.

The issue appears to be a relatively common CUDA idiom: some of a block's threads/warps exit the kernel early, and the remaining threads call __syncthreads() later on. That is a CUDA-ism and conflicts with both the OpenCL spec and the Intel implementation, which expect every work-item in a work-group to reach the barrier.
 
I left a post here and noted that OpenCL 2.0's non-uniform work-groups capability is a possible workaround.
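
To make the barrier point concrete, here is a minimal sketch of one way to restructure such a kernel without relying on non-uniform work-groups. It is an illustration only, not code from either poster: the kernel name `scale`, its `tile` argument, and the host helper `padded_global_size` are all hypothetical. The idea is to pad the global size up to a multiple of the local size and to guard only the memory accesses, so no work-item returns before the barrier.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical OpenCL C kernel source (illustrative only). Every work-item
 * reaches the barrier; the bounds check guards the global-memory accesses
 * instead of returning early. */
const char *kScaleKernelSrc =
    "__kernel void scale(__global float *data, uint n, __local float *tile)\n"
    "{\n"
    "    size_t gid = get_global_id(0);\n"
    "    size_t lid = get_local_id(0);\n"
    "    /* Avoid: if (gid >= n) return;  (those work-items skip the barrier) */\n"
    "    tile[lid] = (gid < n) ? data[gid] : 0.0f;\n"
    "    barrier(CLK_LOCAL_MEM_FENCE);\n"
    "    if (gid < n)\n"
    "        data[gid] = 2.0f * tile[lid];\n"
    "}\n";

/* Hypothetical host helper: round n up to the next multiple of local_size,
 * so the global size handed to clEnqueueNDRangeKernel divides evenly into
 * work-groups. The kernel's bounds check masks off the padding work-items. */
size_t padded_global_size(size_t n, size_t local_size)
{
    assert(local_size > 0);
    size_t rem = n % local_size;
    return rem == 0 ? n : n + (local_size - rem);
}
```

With 1000 elements and a local size of 64, the helper returns 1024, so 24 padding work-items still execute the barrier but never touch `data`.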
bob_p_1

Thanks for the reply. I did see your post earlier, and unfortunately this is not the issue. My code is also a port from CUDA; however, the hangs occur where clFinish() is called on the command queue to wait for all work to complete before moving on to the next stage. They also appear much more frequently with larger amounts of data to process, i.e. when multiple work-groups are launched.

Thanx,

Bob
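
One possible way to narrow down which command never completes, sketched here as an assumption rather than a known fix for this hang, is to replace the blocking clFinish() with a bounded poll on each command's event. The helper below is deliberately generic and does not link against OpenCL: `bounded_wait`, `poll_fn`, and `countdown_poll` are hypothetical names. In a real host, the callback could wrap clGetEventInfo() with CL_EVENT_COMMAND_EXECUTION_STATUS, which reports CL_COMPLETE (0) when done, a positive value while the command is queued, submitted, or running, and a negative value on abnormal termination.

```c
#include <stddef.h>

/* Status callback: 0 means complete, > 0 means still pending, < 0 means
 * error. This mirrors the convention of CL_EVENT_COMMAND_EXECUTION_STATUS. */
typedef int (*poll_fn)(void *ctx);

enum wait_result { WAIT_DONE = 0, WAIT_TIMED_OUT = 1, WAIT_ERROR = 2 };

/* Poll up to max_polls times instead of blocking forever. A timeout tells
 * the caller which command is stuck, so it can be logged before giving up. */
enum wait_result bounded_wait(poll_fn poll, void *ctx, unsigned long max_polls)
{
    for (unsigned long i = 0; i < max_polls; ++i) {
        int status = poll(ctx);
        if (status == 0)
            return WAIT_DONE;
        if (status < 0)
            return WAIT_ERROR;
        /* A real host would sleep briefly here rather than spin. */
    }
    return WAIT_TIMED_OUT;
}

/* Stand-in for a real status query, used only to exercise bounded_wait:
 * counts a positive value down to 0 and passes other values through. */
int countdown_poll(void *ctx)
{
    int *remaining = (int *)ctx;
    if (*remaining > 0)
        --*remaining;
    return *remaining;
}
```

Calling this once per enqueued kernel, in submission order, and logging the first one that returns WAIT_TIMED_OUT would show where the queue stalls; the commands need to have been flushed to the device (for example with clFlush()) before polling.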
