I recently installed the Intel oneAPI software on my Fedora 31 machine. It is a quad-core i7 laptop. I compared PSTL performance with regular STL performance using std::fill, which under PSTL runs in parallel unsequenced mode. The results are as expected when the test runs once over a dataset of the same size (an array of 5 million elements). But if I loop over the same code 10 or 20 times inside the program, repeatedly capturing and printing performance stats, the STL comes out faster.
I then wrote a shell script that runs the binary 10 times, where the C++ code runs the PSTL-vs-STL comparison only once per invocation. In that setup, PSTL runs faster than STL every time. I just don't understand this behavior.
Screenshots attached:
1. sample1.cpp.png: run.sh calls the sample1.cpp binary 10 times and the results are as expected; PSTL is faster than STL every time.
2. pstl.cpp.png: shows pstl_vs_stl_fill.cpp, which runs the same test in a loop (10, 20, or more iterations) within a single C++ process. This is where the problem appears: the STL run times are faster than PSTL.
3. Intel1.png: shows the profiled timing output.
I have a feeling something gets cached at run time after PSTL runs, and that helps the STL fill implementation run faster. What am I missing here?
Hi,
Thank you for posting in Intel Communities.
As per the system requirements, Fedora 31 is not supported.
Please refer to the link below for the supported system requirements:
Could you please try a supported version and, if you still face issues, let us know the compiler version you are using so that we can investigate the issue further on our end?
Thanks,
Priyanshu
Hi,
We haven't heard back from you. Could you please provide an update on your issue?
Thanks & Regards
Priyanshu.
It's the same behaviour on FC35; I tried it on my friend's Linux machine, a dual-core Intel i5 running FC35.
Hi,
We are working on this issue internally. We will get back to you soon.
Thanks,
Priyanshu.
Hi AnandKulkarniSG,
I am trying to reproduce the performance issue that you described, but I cannot reproduce it on several machines with Ubuntu 20.04 or CentOS 8.3. I compiled "pstl_vs_stl_fill.cpp" with `dpcpp -o pstl_vs_stl_fill pstl_vs_stl_fill.cpp -ltbb`. Running on CPU with `SYCL_DEVICE_FILTER=cpu ./pstl_vs_stl_fill 10`, my result shows that pstl is much faster than stl. For example, the result I got on Ubuntu 20.04:
pstl/tbb : duration in microsec = 614
std stl: duration in microsec= 712
pstl/tbb : duration in microsec = 216
std stl: duration in microsec= 513
pstl/tbb : duration in microsec = 205
std stl: duration in microsec= 512
pstl/tbb : duration in microsec = 216
std stl: duration in microsec= 500
pstl/tbb : duration in microsec = 200
std stl: duration in microsec= 495
pstl/tbb : duration in microsec = 192
std stl: duration in microsec= 500
pstl/tbb : duration in microsec = 156
std stl: duration in microsec= 516
pstl/tbb : duration in microsec = 92
std stl: duration in microsec= 537
pstl/tbb : duration in microsec = 83
std stl: duration in microsec= 526
pstl/tbb : duration in microsec = 126
std stl: duration in microsec= 516
At first glance at the code, this problem seems to be related to the DPC++ library, not necessarily to the OS. As I don't have a machine running Fedora 35 locally, I cannot identify your problem in a short time. To locate the problem, could you please provide the following information?
- Which version of oneAPI or dpcpp are you using? You can get this info with `dpcpp -v`.
- On which platform and device is your code running? You can get this info by adding SYCL_PI_TRACE=1 in front of your command, for example `SYCL_PI_TRACE=1 ./pstl_vs_stl_fill 10`.
Thanks,
Shu
All the answers are in the screenshot; I hope this helps. FC36, the latest Fedora distribution, with GCC 12.
It's a quad-core Intel i7-6500U 3.1GHz machine with 16GB RAM.
[anand@albatross intel]$ neofetch
OS: Fedora release 36 (Thirty Six) x86_64
Host: 80Q7 Lenovo ideapad 300-15ISK
Kernel: 5.17.5-300.fc36.x86_64
Uptime: 1 day, 2 hours, 39 mins
Packages: 3896 (rpm), 5 (flatpak)
Shell: bash 5.1.16
Resolution: 1920x1080
DE: GNOME 3.34.5
WM: Mutter
WM Theme: Adwaita
Theme: Adwaita [GTK2/3]
Icons: Adwaita [GTK2/3]
Terminal: gnome-terminal
CPU: Intel i7-6500U (4) @ 3.100GHz
GPU: Intel Skylake GT2 [HD Graphics 520]
GPU: AMD ATI Radeon HD 8670A/8670M/8690M / R5 M330 / M430 / Radeon 520 Mobile
Memory: 5631MiB / 15866MiB
Hi AnandKulkarniSG,
Thank you for the info. As you can see from the output of `neofetch`, you have two GPUs in your machine: one is the Skylake GT2 from Intel, the other is the ATI Radeon HD 8670A/8670M/8690M / R5 M330 / M430 / Radeon 520 Mobile from AMD.
To see which GPU your program is using, please add SYCL_PI_TRACE=1 to the environment. You can run `SYCL_PI_TRACE=1 ./pstl_vs_stl_fill.out 10` as I mentioned; you will then see more information in the output.
To list all the platforms supported on your machine, please use `sycl-ls`.
To manually specify which platform your program should run on, set SYCL_DEVICE_FILTER=${PLATFORM} in the environment. To avoid the influence of the GPUs, please test on the Intel CPU first by running `SYCL_DEVICE_FILTER=cpu SYCL_PI_TRACE=1 ./pstl_vs_stl_fill.out 10`.
Doesn't seem to make a difference.
Hi AnandKulkarniSG,
Thanks for the quick feedback. I have finally reproduced your issue on a NUC with Ubuntu, so this issue is not OS related. This performance issue is only reproducible on laptops or NUCs with iGPUs; on desktops or servers the performance is completely normal. I will work on it and will let you know when I have an update.
Thanks,
Shu
Hi AnandKulkarniSG,
For heterogeneous computing (you are calling oneapi::dpl::execution::par_unseq in this example), the first call to a kernel function takes much longer than later calls in the same program because of the cost of JIT compilation. The first run should therefore be excluded when evaluating performance, so the test code in your sample1.cpp should be changed to warm up both paths before timing:
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>

int main(int argc, char* argv[]) {
    std::vector<int> data1(5000000);
    std::vector<int> data2(5000000);

    // Warm-up calls: absorb the one-time JIT compilation cost before timing.
    std::fill(oneapi::dpl::execution::par_unseq, data1.begin(), data1.end(), -1);
    std::fill(data2.begin(), data2.end(), -1);

    // Intel Parallel STL (with TBB) performance
    auto startTime = std::chrono::high_resolution_clock::now();
    std::fill(oneapi::dpl::execution::par_unseq, data1.begin(), data1.end(), -1);
    auto endTime = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(endTime - startTime).count();
    std::cout << "pstl/tbb : duration in microsec = " << duration << std::endl;

    // Compare this with standard STL performance
    startTime = std::chrono::high_resolution_clock::now();
    std::fill(data2.begin(), data2.end(), -1);
    endTime = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::microseconds>(endTime - startTime).count();
    std::cout << "std stl: duration in microsec= " << duration << std::endl;

    return 0;
}
Then you will see performance on your laptop similar to pstl.cpp.
If you are interested in Just-In-Time compilation in DPC++, please find more information at https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top/compilation/jitting.html.
Please try this suggestion and let us know whether it solves your problem.
Thanks!
Shu
Has your issue been resolved? If I don't hear from you within 5 business days, I will assume your support request is resolved and will no longer monitor this thread.
Thanks!
Since I haven't heard from you for a while, I have to assume your support request is resolved. We will no longer monitor this thread. Hope the reply I provided earlier answers your question.
If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.
Thanks!
