I'd like to understand the excess overhead I'm measuring when submitting a SYCL command group in the attached modified vector add example. Some key points:
- The vector add operates on 100M complex elements and writes the result to the output vector sum
- Target: CPU with 2 cores
- Profiling the overall command group, and using SYCL profiling to track submit time and execution time
I'm finding that the overall elapsed time is about 7 times longer than the submit time plus the execution time. I'm curious to know the source of the additional overhead when submitting the command group.
Even when the kernel has no work (contents of the parallel_for commented out) and array_size = 1, the overall overhead is still about half a second, which is much larger than the kernel submit time or execution time.
Note: I'm using the latest Intel oneAPI DPC++ Compiler included in the Basekit_p_2021.1.0.2659 release.
Thanks for your help.
This is an interesting finding.
There are a couple of important overheads for each kernel:
- JIT compilation for each kernel: it happens once for each kernel
- Data transfer between CPU memory & GPU memory
Neither of these is counted in the submit time or the execution time. The overall system clock time you measured includes both, so it is larger.
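One way to check how much of the gap is one-time cost is to time the same submission several times and compare the first iteration with the later ones. Below is a minimal sketch in plain C++ with std::chrono, not SYCL code. FakeSubmit is a hypothetical stand-in that only sleeps; in the real program the callable would wrap q.submit(...) followed by a wait. The helper names time_repeated_ms and first_call_excess_ms are illustrative and not part of any library.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Wall-clock duration of a single call to `work`, in milliseconds.
double time_once_ms(const std::function<void()>& work) {
    auto t0 = std::chrono::steady_clock::now();
    work();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Calls `work` `runs` times and returns the wall-clock time of each call.
// The first entry includes any one-time cost; later entries should not.
std::vector<double> time_repeated_ms(const std::function<void()>& work, int runs) {
    assert(runs > 0);
    std::vector<double> times;
    times.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        times.push_back(time_once_ms(work));
    }
    return times;
}

// Rough estimate of the one-time cost: the first call minus the fastest later call.
double first_call_excess_ms(const std::vector<double>& times) {
    assert(times.size() >= 2);
    double steady = *std::min_element(times.begin() + 1, times.end());
    return times.front() - steady;
}

// HYPOTHETICAL stand-in for "submit the command group and wait".
// The first call sleeps longer to simulate a one-time setup cost such as JIT.
struct FakeSubmit {
    int calls = 0;
    void operator()() {
        if (calls == 0) {
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
        ++calls;
    }
};
```

Calling time_repeated_ms([&] { submit(); }, 4) on a FakeSubmit instance and passing the result to first_call_excess_ms gives the simulated setup cost. If, in the real program, the later iterations come out close to the submit time plus execution time, that would suggest the gap is mostly first-call setup.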
There is an open-source tool, https://github.com/intel/pti-gpu/tree/master/samples/ze_tracer, that can show in more detail where the time is spent. It may help.
Hope this answers your question.