I have a follow-up question to my old question here.
In the answer, an enqueueMapBuffer call was suggested for both buffers: the read-only input buffer and the output buffer. Is it really necessary to map the input buffer after the kernel execution has finished? It is read-only memory, and I do not care whether it is coherent with the CPU, because the GPU does not modify it anyway. However, I wonder if skipping the enqueueMapBuffer for read-only buffers could lead to the GPU not reading the host's data.
This question actually belongs to a more complicated issue I have. In that case, the enqueueMapBuffer call takes a lot of time. Allocating the buffers with 4K address alignment and 64-byte size alignment makes OpenCL do a "zero-copy", which is a lot faster. However, it is even faster when I skip the enqueueMapBuffer and make sure the buffer is not aligned. This is really strange, because as far as I know a "zero-copy" buffer should already be the fastest option, and unaligned memory should force OpenCL to copy the buffer's data, which should be slower than not copying any data at all.
PS: I am trying to create a minimal code example, but it is not that easy. The behavior is not reproducible with the simple copy kernel from my last question, so it seems the behavior also depends on the kernel itself.
Edit: I was able to create some kind of minimal code example: https://gist.github.com/lolxdfly/e6209ba776680fa6acc13ce18b7e36d5
The two important lines are lines 36 (64-byte size alignment) and 141 (enqueueMapBuffer).
Without alignment, but with MapBuffer (line 36 inactive and line 141 active) the GPU time is 17320455.
With alignment and with MapBuffer (line 36 active and line 141 active) the GPU time is 2264301.
With alignment, but without MapBuffer (line 36 active and line 141 inactive) the GPU time is 2200300.
Without alignment and without MapBuffer (line 36 inactive and line 141 inactive) the GPU time is 1971559.
Intel Corporation TigerLake-H GT1 [UHD Graphics] (rev 01)
Ubuntu 22.04 with 5.17.0-1004-oem kernel
intel-opencl-icd version: 22.14.22890-
Thanks for reaching out to us.
According to the letter of the spec, it is not valid to read from a host pointer used to create a USE_HOST_PTR buffer. The OpenCL spec isn’t as explicit about this as I would like, but the SYCL specification is:
“When using an SYCL buffer, the ownership of the pointer passed to the constructor of the class is, by default, passed to sycl-runtime, and that pointer cannot be used on the host side until the buffer or image is destroyed. An SYCL application can access the contents of the memory managed by an SYCL buffer by using a host_accessor as defined in section 4.7.6 ”
That being said, in practice, the odds of a problem occurring if neither the CPU nor GPU are writing to the host pointer or the buffer are small, so things will _probably_ work without mapping. I wouldn’t recommend doing this, and if it happened to break I’d call it an app bug, but I could certainly understand if an application does it in the cases where it is known to work.
Some other options to consider:
- Host USM can be read from the host and the device without mapping.
- Creating the buffer with COPY_HOST_PTR will pay the cost of a copy, but will likely perform better on discrete GPUs, and will allow simultaneous reading from both the buffer and the host pointer.
Thanks & Regards,
Thanks for the answer!
This answers most of my question. However, it is still unclear to me why my example code runs faster with a read-only non-zero-copy buffer without mapping than with a read-only zero-copy buffer. The non-zero-copy buffer should be copied and therefore be slower than the zero-copy buffer, even if there is no call to map the memory back to the host.
Hi, is the question why this scenario (I think this is "read-only non-zero-copy"?):
> Without alignment and without MapBuffer (line 36 inactive and line 141 inactive) the GPU time is 1971559.
performs differently than this scenario (I think this is "read-only zero-copy"?)?
> With alignment, but without MapBuffer (line 36 active and line 141 inactive) the GPU time is 2200300.
If so, I've been trying to reproduce these results, so far without much luck. The timings I am seeing are a little noisy, but it doesn't seem that either one of these two scenarios is consistently better than the other. I am running on a different GPU so it's admittedly not an apples-to-apples comparison.
Are you seeing consistently different results? If so, I'd be curious to see if it's a host API call or if the kernel execution (or perhaps the other call to clEnqueueMapBuffer) is taking longer in the slow case.
Note, you can use the OpenCL Intercept Layer with HostPerformanceTiming and DevicePerformanceTiming to figure this out without needing to instrument the reproducer.
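Assuming the Intercept Layer's usual `CLI_`-prefixed environment-variable controls (the control names come from the suggestion above; check the Intercept Layer documentation for your version), enabling both timing reports for a single run can look like:

```shell
# my_opencl_app is a placeholder; substitute the actual reproducer binary.
CLI_HostPerformanceTiming=1 CLI_DevicePerformanceTiming=1 ./my_opencl_app
```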
Sorry, I had a lot of other stuff going on recently.
Yes, Ben is right. My question is why the scenario with the non-zero-copy buffer is faster than the scenario with the zero-copy buffer.
The issue is hard to see with the example code I provided. It becomes clearer and more consistent in one of my other applications. I used the OpenCL Intercept Layer to record some numbers; you can find them in the attachments.
I am interested in the time from the clEnqueueNDRangeKernel call to the end of clFinish, because this is the time of the GPU task without any setup. In the non-zero-copy scenario this takes about 2 ms, but in the zero-copy scenario it takes over 6 ms. According to the trace, the execution of the kernel itself is slower in the zero-copy case: 1.469 ms vs. 5.805 ms.
I also included a trace with the MapBuffer call. It makes sense that this one is the slowest, at about 19 ms.
Would it be possible to run the kernel for a few more iterations to see if the difference is due to run-to-run variation or perhaps if there is only a difference on the first iteration? One thought is that the copy in the non-zero-copy case means that the GPU is already running at a higher frequency when it executes its kernel, versus the zero-copy case where it may take some time for the GPU to ramp up to full frequency.
Note: I'll be on vacation for a while but other Intel folks will monitor this thread while I am gone. Thanks!
I did more iterations. Without tracing, the times are very consistent and the non-zero-copy version is always faster than the zero-copy version. If I use cliloader to trace the application, the measured times become a bit slower and fluctuate more, but the outcome is still the same.
I also measured the times when I execute a completely different kernel with different data right before the kernel where I do my measurements, to check what this means for the execution times of the two scenarios. You can find the resulting traces in the attachments. The times are much more consistent now, even with tracing enabled. It had no impact on the non-zero-copy version, but the zero-copy version became faster: the execution time was reduced from about 5.805 ms to about 4.237 ms. This means the already ramped-up frequency in the non-zero-copy case explains some of the difference, but it is not the complete story. There is still a gap of about 2.768 ms between the two versions.
here is my complete Code:
The lineitem.tbl is part of the TPC-H benchmark and is unfortunately too big to be uploaded.
In line 72 of gpudb.cpp you will find the alignment of the size of the input buffer. Right now it is commented out, which makes the kernel run faster. If I uncomment that line, the buffer meets the zero-copy requirements, but the kernel runs slower.