I am implementing OpenCL-based kernels for the Intel Stratix 10 FPGA for a high-performance application.
I would like to know the best way to guarantee that a kernel's current write to global memory has completed before the next iteration of the kernel is executed.
I first thought of waiting in the kernel for a fixed number of cycles, but I don't see any defined way of achieving this in OpenCL.
I hope someone can guide me and suggest a way to achieve this.
If I understand your question correctly, this is not at all required. Kernel execution only finishes after all data is written to device memory; this is required by the OpenCL standard. Needless to say, kernel enqueue functions are non-blocking; hence, you need to use clFinish() or clWaitForEvents() to determine when the kernel execution has actually completed. If you enqueue two or more kernels back to back in the same queue, it is again guaranteed that each kernel starts only after the previous one finishes. Please note that the OpenCL standard does not guarantee global memory consistency during kernel execution.
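A minimal host-side sketch of what I mean (function and variable names are my own, and I assume a valid in-order command queue and a built kernel):

```c
/* Sketch only: assumes an existing OpenCL context, an in-order
 * command queue, and a compiled kernel. Names are illustrative,
 * not taken from your code. */
#include <CL/cl.h>

void run_two_iterations(cl_command_queue queue, cl_kernel kernel,
                        size_t gsize)
{
    cl_event first_done;

    /* First launch; the runtime guarantees all global memory writes
     * are visible once the kernel-completion event fires. */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsize, NULL,
                           0, NULL, &first_done);

    /* Option 1: block the host until the first launch finishes. */
    clWaitForEvents(1, &first_done);

    /* Option 2 (equivalent for an in-order queue): just enqueue the
     * next launch; it cannot start before the previous one ends. */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsize, NULL,
                           0, NULL, NULL);

    /* Drain the queue before reading results back on the host. */
    clFinish(queue);
    clReleaseEvent(first_done);
}
```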
Thanks for the response.
I do require synchronization of device memory during kernel execution, between kernels that execute simultaneously.
Can you suggest a way in which this can be achieved?
If you have multiple kernels running in parallel in different queues updating the same global buffer, this is always going to give you an undefined output because, as I said, the OpenCL standard ensures global memory consistency only after the kernel execution has finished. I tried doing something like what you want once by using channels between two kernels running in parallel and sending messages from one to the other to synchronize them, but that didn't work since channel operations and memory operations have different latency and there is no guarantee that by the time the message reaches the second kernel, the memory operation in the first kernel has finished. Intel also provides a global memory barrier that should supposedly help for such cases but didn't seem to make any difference in my case. You can try using channels in conjunction with the global memory barrier to see if it works for you but note that if it doesn't, this is completely normal since such functionality is not expected to be supported by the OpenCL standard. Needless to say, there will always be alternative designs which do not require sharing global memory buffers.
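For reference, the channel-plus-fence pattern I described looks roughly like the sketch below (channel and buffer names are my own). To be clear, this is the pattern that did NOT work reliably for me, and nothing in the OpenCL standard guarantees it will:

```c
// Sketch of the (unreliable) channel + fence handshake, using
// Intel's channel extension. Not guaranteed to work: the fence does
// not ensure the store is globally visible before the token arrives.
#pragma OPENCL EXTENSION cl_intel_channels : enable

channel int ready_flag;

__kernel void producer(__global volatile int *restrict buf)
{
    buf[0] = 42;                       // store to the shared buffer
    mem_fence(CLK_GLOBAL_MEM_FENCE);   // attempt to order store first
    write_channel_intel(ready_flag, 1);
}

__kernel void consumer(__global volatile int *restrict buf,
                       __global int *restrict out)
{
    int token = read_channel_intel(ready_flag);
    mem_fence(CLK_GLOBAL_MEM_FENCE);
    /* Even here, buf[0] may not yet reflect the producer's store. */
    out[0] = token ? buf[0] : -1;
}
```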
I have a producer-consumer relationship between the kernels running simultaneously. As you pointed out, I expected the store operation in the producer to have some latency, which is why I wanted to understand the best way to implement a synchronization method. The channels, as you said, didn't work for me either, but I have not tried the combination of channels and a memory fence.
I also found an interesting discussion regarding buffer management using volatile memory in the following post
I did try this and it is producing much better results, but there are still some errors and I need to debug further to be sure.
Also, I would like to understand if anyone can tell me the purpose of the write-ack LSU. It has higher latency than the burst-coalesced LSU. Does it guarantee memory updates while sacrificing cycles?
Volatile is certainly required to disable the private cache for global memory accesses and force all accesses to actually go to global memory, to make sure all updates by each kernel are propagated to the others.
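Concretely, the volatile qualifier goes on the global pointer in the kernel signature (kernel and argument names below are illustrative):

```c
/* Marking the __global pointer volatile stops the offline compiler
 * from generating a private cache for these accesses, so every load
 * actually reaches external memory. */
__kernel void poll_flag(__global volatile int *restrict flag,
                        __global int *restrict out)
{
    /* Without volatile, this load could be served from the private
     * cache forever and never observe the other kernel's update. */
    while (*flag == 0) { }
    out[0] = *flag;
}
```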
I am not sure what you mean by "write-ack LSU". I think Intel's OpenCL compiler also supports "atomic" memory operations which might solve your problem; however, performance will be very poor because that basically serializes memory accesses and stalls the pipeline until the memory operation has finished (maybe this is what you call write-ack LSU?).
I see, that has been added in the newer versions of the compiler; it didn't exist in the older versions. However, that seems to be something the compiler decides on based on the characteristics of the memory accesses, rather than something the programmer/user can explicitly control. Furthermore, the compiler will never analyze global memory access dependencies between two separate kernels and hence, such an LSU will never be created by the compiler for your case. Based on the example in the guide, this LSU is created for cases where a write-after-write dependency exists in the code; needless to say, such a dependency is a false dependency and any sane compiler will optimize out the first write and only keep the second one. I fail to see why Intel even needed to add support for this LSU type...
What you are looking for is likely the atomic memory read/write I mentioned earlier.
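A rough sketch of what that handshake could look like with OpenCL 1.x atomics (names are illustrative; this serializes memory accesses, so expect poor throughput, and even then OpenCL does not strictly guarantee inter-kernel ordering of the non-atomic store relative to the atomic flag):

```c
/* Producer/consumer handshake via atomics. Atomic operations bypass
 * the private cache, which is why they can work where plain loads
 * and stores do not. */
__kernel void producer(__global int *restrict data,
                       __global volatile int *restrict flag)
{
    data[0] = 42;
    mem_fence(CLK_GLOBAL_MEM_FENCE);   /* best-effort ordering */
    /* Publish: atomic exchange acts as the synchronization point. */
    atomic_xchg(flag, 1);
}

__kernel void consumer(__global int *restrict data,
                       __global volatile int *restrict flag,
                       __global int *restrict out)
{
    /* Spin until the producer publishes the flag. */
    while (atomic_cmpxchg(flag, 1, 1) != 1) { }
    mem_fence(CLK_GLOBAL_MEM_FENCE);
    out[0] = data[0];
}
```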