I'm writing a ray tracer using SYCL and I've found using Intel VTune that `__opencl_emutls_get_address` takes a significant amount of CPU/GPU time (around 25%).
The function is called (under the hood) by various functions in my code. All the functions listed above are hand-written and not part of any library.
In this profiling session, `__opencl_emutls_get_address` took 156 s of CPU time out of 570 s total. It is currently the biggest bottleneck of the application (4 times as costly as triangle intersection...).
What is it and what does it do exactly?
My application was compiled in "Release with debug information" mode using ICPX from the Intel oneAPI Base Toolkit 2023.2.1.
Hi,
Thank you for posting in Intel Communities.
Could you please provide the following details so we can investigate the issue on our end?
1. Complete reproducer code with steps
2. Hardware details
Thanks and Regards,
Pendyala Sesha Srinivas
Hi @SeshaP_Intel,
Inlining the functions that were causing `__opencl_emutls_get_address` to be called seems to have solved the issue. `__opencl_emutls_get_address` isn't called by inlined functions.
The behavior I observed doesn't seem to be an issue with the SYCL implementation but rather a lack of understanding on my end, so I was mainly asking for technical details about how this works and what `__opencl_emutls_get_address` does for the application.
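A minimal sketch of the kind of change described above, assuming a hypothetical `Vec3` type and hypothetical helpers `dot3` and `squared_length` (none of these names come from the actual ray tracer). The helpers are defined `inline` in a header that the kernel code includes, so the compiler sees their bodies and can inline them instead of emitting out-of-line calls. The code is plain C++ so that it builds with any compiler; in a SYCL build the same helpers would be called from inside the kernel.

```cpp
// Hypothetical names for illustration; not taken from the original ray tracer.
struct Vec3 {
    float x, y, z;
};

// Defined in the header: every translation unit that includes it sees the
// body, so the compiler is free to inline the call at each use site.
inline float dot3(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// A caller standing in for device code such as an intersection routine.
inline float squared_length(const Vec3& v) {
    return dot3(v, v);
}
```

Whether the compiler actually inlines a given call is still its decision; `inline` here mainly guarantees that the definition is visible wherever the helper is used.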
Hi,
Thank you for the update.
To assist you further, please share the details below:
- OS and Hardware details
- VTune version
- Sample reproducer code and exact steps to reproduce the issue from our end
Thanks
Hi,
We have not heard back from you. Could you please give us an update?
Thanks
Hi,
I managed to reduce the overhead of `__opencl_emutls_get_address` by removing the `SYCL_EXTERNAL` attribute from the declarations of the functions that were showing up in the VTune report.
In the end, I'm not sure what `SYCL_EXTERNAL` is used for, considering my code compiles and executes fine without it, but removing it solved my issue.
Hi,
Glad to know that your issue is resolved.
`SYCL_EXTERNAL` is an optional macro that enables external linkage of SYCL functions and member functions so that they can be used inside a SYCL kernel. To call, from device code, a function that is defined in a different translation unit, that function needs to be declared with the `SYCL_EXTERNAL` macro.
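As a rough sketch of where the macro matters, assume a hypothetical helper `scale` whose definition lives in a different source file from the kernel that calls it. The declaration seen by the kernel carries `SYCL_EXTERNAL`, and so does the definition. The file names in the comments and the fallback `#define` are assumptions made only so that the snippet is self-contained and also builds with a non-SYCL compiler; they are not part of SYCL.

```cpp
// In a SYCL build, include <sycl/sycl.hpp> here; it provides SYCL_EXTERNAL
// when the implementation supports the macro.

// Fallback for this sketch only (assumption): lets the snippet compile
// with a compiler that has no SYCL support.
#ifndef SYCL_EXTERNAL
#define SYCL_EXTERNAL
#endif

// helper.hpp (hypothetical): the declaration that the kernel's
// translation unit includes.
SYCL_EXTERNAL float scale(float value, float factor);

// helper.cpp (hypothetical): the definition, which would normally be
// compiled separately from the file containing the kernel.
SYCL_EXTERNAL float scale(float value, float factor) {
    return value * factor;
}
```

If a function is instead defined in a header that the kernel's source file includes, its body is already visible to the device compiler and the macro is not required for the call to compile.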
Please refer to the following documentation for detailed information:
Since your issue is resolved, please post any additional questions in a new thread. This thread will no longer be monitored by Intel.
Thanks
