I am trying to work out precisely what I need to do to achieve best possible performance for accessing local memory in a DPC++ kernel.
I am working from the "Accessing Work-Group Local Memory" section in chapter 15 of Reinders et al.'s Data Parallel C++. However, this leaves a number of unknowns.
1) It talks about "elements" in local memory without saying what size those elements are. I would guess that they are either 4-byte or 8-byte, but how can I determine which for any given processor?
2) It talks about banks in local memory. How can I find out how many of these there are?
3) Clearly, if two work-items access different elements in the same bank then those accesses will have to be serialised. However, what happens if two work-items access the same element in local memory? (e.g. one reads the top half and the other reads the bottom half.) Do those accesses have to be serialised? Is the behaviour the same for read and write operations?
I am programming (in the first instance) for a Kaby Lake HD 610 (Device Id: 5906).
Thanks for reaching out to us.
>> I would guess that they are either 4-byte or 8-byte but how can I determine which for any given processor?
For systems based on the IA-32 architecture, classification is performed on 4 bytes. For systems based on other architectures, classification is performed on 8 bytes.
For DPC++, classification is performed on 8 bytes.
>> It talks about banks in local memory. How can I find out how many of these there are?
You can use the intel::numbanks() memory attribute in your source code to specify the number of banks.
For more information you can refer to the below link:
>>However what happens if two work-items access the same element in local memory. (e.g. One reads the top-half and the other reads the bottom-half).
Could you please elaborate more on this statement?
Could you please provide us with an example/usecase?
Thanks & Regards,
Thank you for the reply; that helps a lot. However, it raises a few more questions:-
How do I find out about features like intel::numbanks() and intel::bankwidth()?
Is there any reference documentation describing them properly? There are various tutorials, white-papers and examples, but I have yet to find any reference documentation.
What, for example, is the applicability of the bank control directives above? They appeared in a paper on optimising FPGA access, so it is safe to assume that they will be effective on FPGAs. I would be very surprised if they had any effect on a CPU, which leaves me doubting whether they actually work on GPUs. (I am trying to program a GPU.)
You can find all information about optimizing your kernel in the optimization guide:
You can find out about the local memory and all its attributes in the document. These attributes are specific to FPGA kernel optimization and not CPU or GPU ones.
For instance, bank control allows you to perform concurrent accesses to local memory without arbitration. You should know that FPGA local memories are made of M20K blocks with only 2 read/write ports (when single-pumped). If you have more simultaneous accesses than that, the compiler will infer an arbiter, which will stall the memory accesses. You can find out more about this in the documentation.
You say "These attributes are specific to FPGA kernel optimization and not CPU or GPU ones."
In that case, it is of no relevance at all to my question. My hardware has a GPU which I am trying to program.
It does not have an FPGA. Please would you go back to my original question and give me whatever information you have about GPU optimisation.
I did not notice your statement "(I am trying to program a GPU.)", and this forum is for FPGA high-level synthesis. That is where the confusion came from.
I do not use oneAPI on GPUs myself, but you can find the documentation about GPU optimization here: