I cannot verify the number of compute units of my GPU device (24, as reported via 'CL_DEVICE_MAX_COMPUTE_UNITS'). Test results are as follows. What's wrong?
Case 1:
-- local work size (1, 1, 1)
-- global work size (1, 1, 1)
-- duration 107.375 ms (difference between 'CL_PROFILING_COMMAND_START' and 'CL_PROFILING_COMMAND_END')
Case 2:
-- local work size (1, 1, 1)
-- global work size (1, 1, 12)
-- duration 109.577 ms
Case 3:
-- local work size (1, 1, 1)
-- global work size (1, 1, 13)
-- duration 212.974 ms
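For reference, a minimal sketch of how these figures can be queried, assuming an already-created cl_device_id named device and a cl_event named evt returned by clEnqueueNDRangeKernel on a command queue created with CL_QUEUE_PROFILING_ENABLE (the variable and function names are just placeholders):

#include <stdio.h>
#include <CL/cl.h>

/* Prints the device's compute-unit count and the duration of one kernel
 * enqueue; clGetEventProfilingInfo timestamps are in nanoseconds. */
void report(cl_device_id device, cl_event evt)
{
    cl_uint cus = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(cus), &cus, NULL);

    cl_ulong start = 0, end = 0;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);

    printf("Compute units: %u\n", cus);
    printf("Kernel duration: %.3f ms\n", (double)(end - start) / 1.0e6);
}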
1>OpenCL Intel(R) Graphics device was found!
1>Device name: Intel(R) HD Graphics 520
1>Device version: OpenCL 2.0
1>Device vendor: Intel(R) Corporation
1>Device profile: FULL_PROFILE
1>fcl build 1 succeeded.
1>bcl build succeeded.
1>
1>CNN_MNIST_Infer info:
1> Maximum work-group size: 256
1> Compiler work-group size: (1, 1, 1)
1> Local memory size: 12064
1> Preferred multiple of work-group size: 8
1> Minimum amount of private memory: 288
1> Amount of spill memory used by the kernel: 0
1>
1>Build succeeded!
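The per-kernel figures in the build log above (maximum work-group size, local memory size, preferred work-group size multiple, private memory) can also be obtained at run time with clGetKernelWorkGroupInfo. A minimal sketch, assuming a built cl_kernel named kernel and the cl_device_id named device (placeholder names):

#include <stdio.h>
#include <CL/cl.h>

/* Queries the standard per-kernel work-group properties for one device. */
void print_kernel_info(cl_kernel kernel, cl_device_id device)
{
    size_t max_wg = 0, pref_multiple = 0;
    cl_ulong local_mem = 0, private_mem = 0;

    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(max_wg), &max_wg, NULL);
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(pref_multiple), &pref_multiple, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                             sizeof(local_mem), &local_mem, NULL);
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                             sizeof(private_mem), &private_mem, NULL);

    printf("Maximum work-group size: %zu\n", max_wg);
    printf("Preferred work-group size multiple: %zu\n", pref_multiple);
    printf("Local memory size (bytes): %lu\n", (unsigned long)local_mem);
    printf("Private memory size (bytes): %lu\n", (unsigned long)private_mem);
}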
1> Local memory size: 12064

Here's what's happening. Page numbers refer to the Programmer's Reference Manual: https://01.org/sites/default/files/documentation/intel-gfx-prm-osrc-skl-vol04-configurations.pdf
- Your kernel requires 12064 bytes of shared local memory, but due to shared local memory allocation granularity this is actually allocated as 16KB of shared local memory.
- Your 24 EUs (AKA "compute units") are organized into three subslices of eight EUs (page 2).
- Shared local memory is a subslice resource, with a maximum of 64KB per subslice (a total of 3 x 64KB = 192KB) (page 7).
- Because each of your work groups requires 16KB of shared local memory, you can run no more than four work groups per subslice before running out of shared local memory. Said another way, even though there are still EU thread slots available, your GPU occupancy is limited by shared local memory requirements. Three subslices times four work groups per subslice gives a total of 12 work groups executing concurrently (the arithmetic is sketched below).
If you can reduce your kernel's shared local memory requirements, you'll be able to run more work groups concurrently, up to a device maximum of 24.
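A minimal sketch of that arithmetic, using the figures from the reply (16KB allocation granularity, 64KB of shared local memory per subslice, three subslices, 24-work-group device maximum; the variable names are just illustrative):

#include <stdio.h>

int main(void)
{
    const unsigned kernel_slm_bytes = 12064;      /* from the build log      */
    const unsigned slm_granularity  = 16 * 1024;  /* SLM allocation unit     */
    const unsigned slm_per_subslice = 64 * 1024;  /* per-subslice maximum    */
    const unsigned subslices        = 3;          /* HD Graphics 520: 3 x 8 EUs */
    const unsigned max_work_groups  = 24;         /* device-wide maximum     */

    /* Round the kernel's SLM request up to the allocation granularity. */
    unsigned allocated = ((kernel_slm_bytes + slm_granularity - 1)
                          / slm_granularity) * slm_granularity;   /* 16384 */

    unsigned wg_per_subslice = slm_per_subslice / allocated;      /* 4  */
    unsigned concurrent      = wg_per_subslice * subslices;       /* 12 */
    if (concurrent > max_work_groups)
        concurrent = max_work_groups;

    printf("SLM allocated per work group: %u bytes\n", allocated);
    printf("Concurrent work groups:       %u\n", concurrent);
    return 0;
}

This reproduces the observed behavior: up to a global work size of (1, 1, 12) the work groups run concurrently, and the 13th work group has to wait for shared local memory to free up, roughly doubling the measured duration.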