I'd like to understand the excess overhead I'm measuring when submitting a SYCL command group in the attached modified vector add example. Some key points:
I'm finding that the overall execution time is about 7 times longer than the execution + submit time. I'm curious to know the source of the additional overhead when submitting the command group.
Even when the kernel does no work (with the contents of the parallel_for commented out) and array_size is set to 1, the overall overhead is still half a second, much larger than the kernel submit time or execution time.
Note: I'm using the latest Intel oneAPI DPC++ Compiler included in the Basekit_p_2021.1.0.2659 release.
Thanks for your help.
This is an interesting finding.
There are a couple of important overheads for each kernel:
These are not counted in the submit time or the execution time. The overall time you measured with the system clock includes all of them, so it is longer.
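To make the gap concrete, here is a minimal sketch of the arithmetic in plain C++. It assumes you already have three timestamps per kernel in nanoseconds (in SYCL these would typically come from the event profiling queries command_submit, command_start and command_end on a queue created with profiling enabled) plus a wall-clock duration measured around the whole submit-and-wait region. The names KernelTimestamps, Breakdown and break_down are hypothetical and only for illustration; they are not part of any SYCL API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for per-kernel timestamps, in nanoseconds.
// Assumption: the values were read from the runtime's event profiling info.
struct KernelTimestamps {
    std::uint64_t submit_ns;  // command group submitted to the queue
    std::uint64_t start_ns;   // kernel started executing on the device
    std::uint64_t end_ns;     // kernel finished executing on the device
};

// Hypothetical result type: all values in milliseconds.
struct Breakdown {
    double submit_ms;       // submit -> start
    double execution_ms;    // start -> end
    double unaccounted_ms;  // wall-clock time not covered by the two above
};

// Splits a wall-clock measurement into the part explained by the event
// timestamps and the remainder that the timestamps do not see.
Breakdown break_down(const KernelTimestamps& t, double wall_clock_ms) {
    assert(t.start_ns >= t.submit_ns && t.end_ns >= t.start_ns);
    const double submit_ms = static_cast<double>(t.start_ns - t.submit_ns) * 1e-6;
    const double execution_ms = static_cast<double>(t.end_ns - t.start_ns) * 1e-6;
    return {submit_ms, execution_ms, wall_clock_ms - submit_ms - execution_ms};
}
```

As an example with made-up numbers (not measurements): timestamps of 1 ms, 3 ms and 8 ms with a wall-clock time of 49 ms give 2 ms of submit time, 5 ms of execution time and 42 ms that the event timestamps do not account for. That remainder is the part a tracing tool can help attribute.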
There is an open-source tool, ze_tracer (https://github.com/intel/pti-gpu/tree/master/samples/ze_tracer), that can show in more detail where the time is spent. It may help.
Hope this answers your question.