I'm wondering how __private memory works on Intel HD Graphics. In the "Intel® SDK for OpenCL* Applications 2013 - Optimization Guide for Windows* OS" I read this recommendation about private memory:
Since each work-item has its own __private memory, there is no locality for __private memory accesses, and each work-item frequently accesses a unique cache line for every access to __private memory. For this reason, accesses to __private memory are very slow, and you should avoid indexed private memory if possible.
But I don't really understand why I should avoid indexed private memory. Can anyone tell me more about this, or just explain this recommendation?
Since __private memory is indeed "private" (not shared between work-items), each work-item might consume quite a lot of memory. And you have many work-items in flight: consider the number of execution units on the GPU (each unit also hosting multiple threads), the SIMD-ification within the threads, etc. Potentially almost 9000 work-items are in flight (refer to the presentation for the details). So say you request just 32 floats of private memory in the kernel; this would total up to ~1 MB. This wouldn't fit even in L1, not to mention the GPU register space. So in the worst case the performance of __private memory will be similar to the performance of __global memory.
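To make the arithmetic above concrete, here is a minimal sketch in plain C (host-side, not a kernel). The function name private_footprint_bytes is made up for illustration; the inputs are the figures quoted in the post (about 9000 work-items in flight, 32 floats each).

```c
#include <stddef.h>

/* Total private-memory footprint across all work-items in flight.
   Every work-item holds its own copy of the private array, so the
   per-item size is multiplied by the number of items in flight. */
size_t private_footprint_bytes(size_t work_items_in_flight,
                               size_t elements_per_item,
                               size_t element_size)
{
    return work_items_in_flight * elements_per_item * element_size;
}
```

Calling private_footprint_bytes(9000, 32, sizeof(float)) with 4-byte floats gives 1,152,000 bytes, which is the "~1 MB" mentioned above.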
I believe that using __private memory hurts performance for the reason you explained. What I don't really understand is this: take a specific computation, a histogram for example, that fits in the __private memory of a thread (I assume in the general register file). Why would this computation run much slower than (or the same as) when we put the data in other memory spaces? It seems obvious to me that registers are faster than other types of memory. Am I missing something?
Thank you very much for your answer.
a specified computation like a histogram for example that fits in __private memory of a thread
Each GPU thread doesn't consume much, but you might have hundreds of threads in flight, so the private arrays might not fit in the register space. In general, the performance depends on the amount of private memory the kernel requires.
If I understood correctly, there is no limitation on __private memory as long as we use an amount of data that fits in the registers of all the threads in flight. This makes sense to me.
If we restrict the number of threads in flight, we have more registers available, and we can do more complex computations on bigger chunks of data "in registers". I think I have a bug in this conclusion, but I don't know where. :)
Mohamed Amine BERGACH wrote:
In this presentation (slide number 42), __private memory is allocated in global memory. In which case can this happen?
As we speculated in this thread, this would eventually happen if the requested private memory (remember that each work-item requires its own copy) doesn't fit in the register space.
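A rough way to reason about "doesn't fit in the register space" is sketched below in plain C. This is a simplified model, not how the compiler actually decides: it assumes a hardware thread owns a fixed register file that is split evenly between the work-items it runs in SIMD fashion. The function names and both parameters (register bytes per thread, SIMD width) are hypothetical inputs you would have to take from the hardware documentation; nothing here comes from the thread itself, and real kernels also need registers for temporaries and addresses, so the usable budget is smaller.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: one hardware thread has register_bytes_per_thread
   bytes of registers and executes simd_width work-items at once, so each
   work-item gets an equal share. */
size_t register_budget_per_item(size_t register_bytes_per_thread,
                                size_t simd_width)
{
    return register_bytes_per_thread / simd_width;
}

/* Returns true when the private data of one work-item is larger than its
   share of the register file under the model above, i.e. when the private
   array would probably end up in memory instead of registers. */
bool likely_spills(size_t private_bytes_per_item,
                   size_t register_bytes_per_thread,
                   size_t simd_width)
{
    return private_bytes_per_item >
           register_budget_per_item(register_bytes_per_thread, simd_width);
}
```

For example, with an assumed 4096 bytes of registers per thread and an assumed SIMD width of 16, the share per work-item is 256 bytes, so the 32-float array (128 bytes) from the earlier post would fit under this model, while a 512-byte array would not.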
OK, if the size of the data used by each work-item fits in the register space, is this sufficient to guarantee that all the data will be processed in registers? If yes, is indexed __private memory still not recommended in that case?
I'm asking this question because when I call clGetKernelWorkGroupInfo (.....CL_KERNEL_PRIVATE_MEM_SIZE...) I always get 0, but when I use more than 256 bytes I get more understandable values. Does this mean that my kernel will spill when I allocate more than 256 bytes?