When we access memory in an OpenCL kernel like this:
for (int i = 0; i < N; i++) ... = A[i]
Are the loads executed in a non-blocking manner? That is, does the generated FSM wait for each memory load to complete before sending the next load request to memory, or does it send out multiple load requests one after another and then handle the responses in order as they come back?
In single work-item kernels, loops are pipelined, and this also applies to the memory accesses inside those loops. Access requests are therefore sent back to back, and after a certain delay the data is received in the same order. If the buffer between the kernel and memory becomes empty, the kernel stalls until new data arrives.
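To get a rough feel for the difference, here is a minimal back-of-the-envelope model in plain C (not kernel code). It compares a hypothetical blocking FSM, which waits for every load to return, against a pipelined loop that issues one request every `ii` cycles. The function names, the fixed `latency`, and the initiation interval `ii` are assumptions made for illustration; the model ignores stalls, bandwidth limits, and burst coalescing, so treat the numbers as a sketch and not as what the compiler actually reports.

```c
#include <assert.h>

/* Hypothetical model: every load waits for the previous one to return,
 * so n loads cost n full memory latencies. */
static long blocking_cycles(long n, long latency) {
    assert(n >= 0 && latency > 0);
    return n * latency;
}

/* Hypothetical model: a new load request is issued every `ii` cycles and
 * responses come back in order. The last request is issued at cycle
 * (n - 1) * ii and its data arrives `latency` cycles later. */
static long pipelined_cycles(long n, long latency, long ii) {
    assert(n >= 0 && latency > 0 && ii > 0);
    if (n == 0) {
        return 0;
    }
    return (n - 1) * ii + latency;
}
```

With an assumed latency of 100 cycles and 1000 loads, `blocking_cycles(1000, 100)` gives 100000 cycles, while `pipelined_cycles(1000, 100, 1)` gives 1099 cycles. The memory latency is paid roughly once in the pipelined case, as long as the kernel does not stall.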