OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU.

fail to verify number of compute units

JWong19
Beginner

I cannot verify the number of compute units of my GPU device (24 compute units, as reported via CL_DEVICE_MAX_COMPUTE_UNITS). Test results are as follows. What's wrong?
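(For reference, a minimal sketch of that query, assuming a cl_device_id is already in hand; error checking omitted:)

#include <CL/cl.h>
#include <stdio.h>

/* Sketch: print the compute-unit count of an already-obtained device. */
static void print_compute_units(cl_device_id device)
{
    cl_uint cu = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(cu), &cu, NULL);
    printf("CL_DEVICE_MAX_COMPUTE_UNITS: %u\n", cu);  /* reports 24 here */
}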

Case 1:
-- local work size (1, 1, 1)
-- global work size (1, 1, 1)
-- duration 107.375 ms (difference between CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END)

Case 2:
-- local work size (1, 1, 1)
-- global work size (1, 1, 12)
-- duration 109.577 ms

Case 3:
-- local work size (1, 1, 1)
-- global work size (1, 1, 13)
-- duration 212.974 ms
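(The durations above are event-profiling deltas; a minimal sketch of how such a measurement is typically taken, assuming the queue was created with CL_QUEUE_PROFILING_ENABLE and the kernel arguments are already set:)

#include <CL/cl.h>

/* Sketch: time one NDRange dispatch via event profiling. */
static cl_ulong time_dispatch(cl_command_queue queue, cl_kernel kernel,
                              const size_t gws[3], const size_t lws[3])
{
    cl_event evt;
    cl_ulong start = 0, end = 0;

    clEnqueueNDRangeKernel(queue, kernel, 3, NULL, gws, lws, 0, NULL, &evt);
    clWaitForEvents(1, &evt);

    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    clReleaseEvent(evt);

    return end - start;  /* nanoseconds */
}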

1>OpenCL Intel(R) Graphics device was found!
1>Device name: Intel(R) HD Graphics 520
1>Device version: OpenCL 2.0
1>Device vendor: Intel(R) Corporation
1>Device profile: FULL_PROFILE
1>fcl build 1 succeeded.
1>bcl build succeeded.
1>
1>CNN_MNIST_Infer info:
1> Maximum work-group size: 256
1> Compiler work-group size: (1, 1, 1)
1> Local memory size: 12064
1> Preferred multiple of work-group size: 8
1> Minimum amount of private memory: 288
1> Amount of spill memory used by the kernel: 0
1>
1>Build succeeded!
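(The "Local memory size: 12064" line above turns out to be the key figure in the answer below; the same value can also be queried at run time, roughly like this; error checking omitted:)

#include <CL/cl.h>

/* Sketch: query the kernel's shared-local-memory footprint,
   the figure reported as "Local memory size" in the build log. */
static cl_ulong kernel_slm_bytes(cl_kernel kernel, cl_device_id device)
{
    cl_ulong lmem = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(lmem), &lmem, NULL);
    return lmem;
}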
 

1 Reply
Ben_A_Intel
Employee
To confirm: You're expecting to be able to run 24 concurrent work groups in the same time as it takes to run one work group (case 1), correct? But you're only seeing scaling up to 12 concurrent work groups (case 2), and adding the 13th work group doubles execution time (case 3)? If so, I believe the issue is the amount of shared local memory your kernel requires:

1> Local memory size: 12064

Here's what's happening. Page numbers refer to the Programmer's Reference Manual: https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-skl-vol04-configurations.pdf

- Your kernel requires 12064 bytes of shared local memory, but due to shared local memory allocation granularities, this is actually allocated as 16KB of shared local memory.
- Your 24 EUs (AKA "compute units") are organized into three subslices of eight EUs each (page 2).
- Shared local memory is a subslice resource, with a maximum of 64KB per subslice (3 x 64KB = 192KB total) (page 7).
- Because each work group requires 16KB of shared local memory, you can run no more than four work groups per subslice before running out of shared local memory. Said another way, even though there are still EU thread slots available, your GPU occupancy is limited by shared local memory requirements.

Three subslices times four work groups per subslice gives you a total of 12 work groups executing concurrently. If you can reduce your kernel's shared local memory requirements, you'll be able to run more work groups concurrently, up to a device maximum of 24.