OpenCL* for CPU
Ask questions and share information on the Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPUs.

fail to verify number of compute units

JWong19
Beginner

I cannot verify the number of compute units of my GPU device (24 compute units, as reported via 'CL_DEVICE_MAX_COMPUTE_UNITS'). Test results are as follows. What's wrong?
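For reference, the compute-unit count was obtained with a query along these lines (a minimal sketch, assuming a cl_device_id named 'device' has already been obtained via clGetDeviceIDs):

#include <CL/cl.h>
#include <stdio.h>

/* Print the number of compute units the device reports. */
void print_compute_units(cl_device_id device)
{
    cl_uint units = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(units), &units, NULL);
    printf("CL_DEVICE_MAX_COMPUTE_UNITS: %u\n", units); /* 24 on this device */
}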

Case 1:
-- local work size (1, 1, 1)
-- global work size (1, 1, 1)
-- duration: 107.375 ms (difference between 'CL_PROFILING_COMMAND_START' and 'CL_PROFILING_COMMAND_END')

Case 2:
-- local work size (1, 1, 1)
-- global work size (1, 1, 12)
-- duration: 109.577 ms

Case 3:
-- local work size (1, 1, 1)
-- global work size (1, 1, 13)
-- duration: 212.974 ms
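Each duration above was measured along these lines (a minimal sketch, assuming a command queue created with CL_QUEUE_PROFILING_ENABLE and an already-built kernel; error checking omitted):

#include <CL/cl.h>

/* Enqueue the kernel with global size (1, 1, depth) and local size
   (1, 1, 1), then return the elapsed device time in milliseconds. */
double run_and_time_ms(cl_command_queue queue, cl_kernel kernel, size_t depth)
{
    size_t global[3] = {1, 1, depth}; /* depth = 1, 12, 13 for the cases above */
    size_t local[3]  = {1, 1, 1};
    cl_event evt;

    clEnqueueNDRangeKernel(queue, kernel, 3, NULL, global, local, 0, NULL, &evt);
    clWaitForEvents(1, &evt);

    cl_ulong start = 0, end = 0;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
    clReleaseEvent(evt);

    return (double)(end - start) * 1e-6; /* profiling timestamps are in nanoseconds */
}

The device/build output and kernel info are below: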

1>OpenCL Intel(R) Graphics device was found!
1>Device name: Intel(R) HD Graphics 520
1>Device version: OpenCL 2.0
1>Device vendor: Intel(R) Corporation
1>Device profile: FULL_PROFILE
1>fcl build 1 succeeded.
1>bcl build succeeded.
1>
1>CNN_MNIST_Infer info:
1> Maximum work-group size: 256
1> Compiler work-group size: (1, 1, 1)
1> Local memory size: 12064
1> Preferred multiple of work-group size: 8
1> Minimum amount of private memory: 288
1> Amount of spill memory used by the kernel: 0
1>
1>Build succeeded!
 

1 Reply
Ben_A_Intel
Employee
To confirm: you're expecting to be able to run 24 concurrent work groups in the same time it takes to run one work group (case 1), correct? But you're only seeing scaling up to 12 concurrent work groups (case 2), and adding the 13th work group doubles the execution time (case 3)? If so, I believe the issue is the amount of shared local memory your kernel requires:
1> Local memory size: 12064
Here's what's happening. Page numbers refer to the Programmer's Reference Manual: https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-skl-vol04-configurations.pdf

-- Your kernel requires 12064 bytes of shared local memory, but due to shared local memory allocation granularities this is actually allocated as 16KB of shared local memory.
-- Your 24 EUs (AKA "compute units") are organized into three subslices of eight EUs each (page 2).
-- Shared local memory is a subslice resource, with a maximum of 64KB per subslice, for a total of 3 x 64KB = 192KB (page 7).
-- Because each work group requires 16KB of shared local memory, you can run no more than four work groups per subslice before running out of shared local memory. Said another way, even though there are still EU thread slots available, your GPU occupancy is limited by shared local memory requirements.

Three subslices times four work groups per subslice gives you a total of 12 work groups executing concurrently. If you can reduce your kernel's shared local memory requirements, then you'll be able to run more work groups concurrently, up to the device maximum of 24.
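If it helps, you can check this arithmetic against your own kernel at runtime. A minimal sketch, assuming an already-built cl_kernel named 'kernel' and its cl_device_id named 'device'; the 16KB allocation granularity and 64KB-per-subslice limit are the Gen9 figures quoted above, not values queried from the runtime:

#include <CL/cl.h>
#include <stdio.h>

/* Estimate how many work groups can run concurrently before shared
   local memory (SLM) runs out, using the Gen9 limits quoted above. */
void report_slm_occupancy(cl_kernel kernel, cl_device_id device)
{
    cl_ulong slm_used = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(slm_used), &slm_used, NULL);
    if (slm_used == 0) {
        printf("Kernel uses no SLM; occupancy is not SLM-limited.\n");
        return;
    }

    const cl_ulong granularity = 16 * 1024; /* assumed allocation granularity */
    cl_ulong slm_alloc = ((slm_used + granularity - 1) / granularity) * granularity;
    unsigned per_subslice = (unsigned)((64 * 1024) / slm_alloc); /* 64KB SLM per subslice */

    printf("SLM used: %llu bytes, allocated: %llu bytes\n",
           (unsigned long long)slm_used, (unsigned long long)slm_alloc);
    printf("~%u work groups per subslice, ~%u concurrent across 3 subslices\n",
           per_subslice, 3 * per_subslice);
}

For the kernel above, this works out to: 12064 bytes rounds up to 16KB, 64KB / 16KB = 4 work groups per subslice, and 3 x 4 = 12, which is exactly the point where case 3 doubles the execution time.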