Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Intel Parallel STL vs. STL performance: weird results

AnandKulkarniSG
Beginner

I have recently installed the Intel oneAPI software on my Fedora 31 machine, an i7 quad-core laptop. I tried comparing the PSTL performance with regular STL performance using the std::fill function, which under PSTL runs in parallel unsequenced mode. The results are as expected when the test is run once over a dataset of the same size (an array of 5 million elements). But if I loop over the same code, say 10/20 times, and repeatedly capture and print the performance stats, the STL comes out faster.

 

I then wrote a shell script that runs the binary (which performs the pstl vs. stl comparison only once in the C++ code) and repeats the same test 10 times by calling the binary. Now pstl runs faster than stl every time. I just don't understand this behavior.
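
For reference, the in-process loop version is essentially this shape (a simplified sketch, not the exact pstl_vs_stl_fill.cpp; the array size and iteration count are just the ones I described above):

#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <vector>
#include <chrono>
#include <iostream>

int main() {
    std::vector<int> data1(5000000);
    std::vector<int> data2(5000000);

    for (int i = 0; i < 10; ++i) {
        // time the parallel unsequenced fill (oneDPL / TBB)
        auto start = std::chrono::high_resolution_clock::now();
        std::fill(oneapi::dpl::execution::par_unseq, data1.begin(), data1.end(), -1);
        auto end = std::chrono::high_resolution_clock::now();
        std::cout << "pstl/tbb : duration in microsec = "
                  << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
                  << std::endl;

        // time the plain sequential std::fill
        start = std::chrono::high_resolution_clock::now();
        std::fill(data2.begin(), data2.end(), -1);
        end = std::chrono::high_resolution_clock::now();
        std::cout << "std stl: duration in microsec= "
                  << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count()
                  << std::endl;
    }
    return 0;
}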

 

Screenshots attached.

1. sample1.cpp.png: run.sh calls the sample1.cpp binary 10 times and the results are as expected; every time pstl is faster than stl.

2. pstl.cpp.png: This shows pstl_vs_stl_fill.cpp, which runs the same test within C++ by repeating it in a loop 10/20+ times. This is where the problem happens: the stl run times are faster than pstl.

3. Intel1.png: shows the profiled timing output as a screenshot.

I have a feeling something is getting cached at run time after pstl runs, and that it is helping the stl fill implementation run faster? What am I missing here?

PriyanshuK_Intel
Moderator

Hi,

Thank you for posting in Intel Communities.


As per the system requirements, Fedora 31 is not supported.

Please refer to the link below for the supported system requirements:

https://www.intel.com/content/www/us/en/developer/articles/system-requirements/intel-oneapi-base-toolkit-system-requirements.html


Could you please try with a supported version and let us know if you still face any issues, along with the compiler version being used, so that we can investigate the issue further from our end?


Thanks,

Priyanshu


PriyanshuK_Intel
Moderator

Hi,


We haven't heard back from you. Could you please provide an update on your issue?


Thanks & Regards

Priyanshu.



AnandKulkarniSG
Beginner

It's the same behaviour on FC35; I tried it on my friend's Linux machine, a dual-core Intel i5 running FC35.

PriyanshuK_Intel
Moderator

Hi,


We are working on this issue internally. We will get back to you soon.


Thanks,

Priyanshu.


sc_Intel
Employee

Hi AnandKulkarniSG,

I am trying to reproduce the performance issue that you described, but I cannot reproduce it on several machines with Ubuntu 20.04 or CentOS 8.3. I compiled "pstl_vs_stl_fill.cpp" with `dpcpp -o pstl_vs_stl_fill pstl_vs_stl_fill.cpp -ltbb`. Running on CPU with `SYCL_DEVICE_FILTER=cpu ./pstl_vs_stl_fill 10`, my result shows that pstl is much faster than stl. For example, the result I got on Ubuntu 20.04:

pstl/tbb : duration in microsec = 614
std stl: duration in microsec= 712
pstl/tbb : duration in microsec = 216
std stl: duration in microsec= 513
pstl/tbb : duration in microsec = 205
std stl: duration in microsec= 512
pstl/tbb : duration in microsec = 216
std stl: duration in microsec= 500
pstl/tbb : duration in microsec = 200
std stl: duration in microsec= 495
pstl/tbb : duration in microsec = 192
std stl: duration in microsec= 500
pstl/tbb : duration in microsec = 156
std stl: duration in microsec= 516
pstl/tbb : duration in microsec = 92
std stl: duration in microsec= 537
pstl/tbb : duration in microsec = 83
std stl: duration in microsec= 526
pstl/tbb : duration in microsec = 126
std stl: duration in microsec= 516


At first glance at the code, this problem seems to be related to the DPC++ library rather than the OS. As I don't have a machine with Fedora 35 locally, I cannot identify your problem in a short time. In order to locate the problem, could you please provide the following information?

  1. Which version of oneAPI or dpcpp are you using? You can get this info through `dpcpp -v`.
  2. On which platform and device is your code running? You can get this info by adding SYCL_PI_TRACE=1 in front of your command, for example `SYCL_PI_TRACE=1 ./pstl_vs_stl_fill 10`.


Thanks,

Shu


AnandKulkarniSG
Beginner

@sc_Intel 

 

All the answers are in the screenshot; I hope this helps. It's FC36, the latest Fedora distribution, with GCC 12.

It's a quad-core Intel i7-6500U 3.1 GHz machine with 16 GB RAM.

 

AnandKulkarniSG_0-1661767314939.png

 

[anand@albatross intel]$ neofetch
anand@albatross
---------------------------------------
OS: Fedora release 36 (Thirty Six) x86_64
Host: 80Q7 Lenovo ideapad 300-15ISK
Kernel: 5.17.5-300.fc36.x86_64
Uptime: 1 day, 2 hours, 39 mins
Packages: 3896 (rpm), 5 (flatpak)
Shell: bash 5.1.16
Resolution: 1920x1080
DE: GNOME 3.34.5
WM: Mutter
WM Theme: Adwaita
Theme: Adwaita [GTK2/3]
Icons: Adwaita [GTK2/3]
Terminal: gnome-terminal
CPU: Intel i7-6500U (4) @ 3.100GHz
GPU: Intel Skylake GT2 [HD Graphics 520]
GPU: AMD ATI Radeon HD 8670A/8670M/8690M / R5 M330 / M430 / Radeon 520 Mobile
Memory: 5631MiB / 15866MiB
[anand@albatross intel]$

 

 

 

sc_Intel
Employee

Hi AnandKulkarniSG,

Thank you for the info. As you can see from the output of "$neofetch", you have two GPUs in your machine: one is the Skylake GT2 from Intel, the other is the ATI Radeon HD 8670A/8670M/8690M / R5 M330 / M430 / Radeon 520 Mobile from AMD.

In order to see which device your program is using, please add "SYCL_PI_TRACE=1" to the environment. You can run "$SYCL_PI_TRACE=1 ./pstl_vs_stl_fill.out 10" as I mentioned; then you will see more information in the output.

To list all the platforms that are supported on your machine, please use "$sycl-ls".

To manually specify which platform your program should run on, please add "SYCL_DEVICE_FILTER=${PLATFORM}" to the environment. In order to avoid the influence of the GPUs, please test on the Intel CPU first by running "$SYCL_DEVICE_FILTER=cpu SYCL_PI_TRACE=1 ./pstl_vs_stl_fill.out 10".
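
If it is easier, you can also print the same information from a small SYCL program. This is just a minimal sketch compiled with dpcpp; depending on your oneAPI version the header may be <CL/sycl.hpp> instead of <sycl/sycl.hpp>:

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    // Enumerate every platform and the devices it exposes,
    // similar to what "sycl-ls" reports on the command line.
    for (const auto& platform : sycl::platform::get_platforms()) {
        std::cout << "Platform: "
                  << platform.get_info<sycl::info::platform::name>() << std::endl;
        for (const auto& device : platform.get_devices()) {
            std::cout << "  Device: "
                      << device.get_info<sycl::info::device::name>() << std::endl;
        }
    }
    return 0;
}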


AnandKulkarniSG
Beginner

It doesn't seem to make a difference.

AnandKulkarniSG_1-1661775315691.png

sc_Intel
Employee

Hi AnandKulkarniSG,


Thanks for the quick feedback. I have finally reproduced your issue on a NUC running Ubuntu, so the issue is not OS related. This performance issue can only be reproduced on laptops or NUCs with iGPUs, while on desktops and servers the performance is completely normal. I will work on it and let you know when I have an update.


Thanks,

Shu


sc_Intel
Employee

Hi AnandKulkarniSG,

For heterogeneous computing (you are calling oneapi::dpl::execution::par_unseq in this example), the first call to a kernel function takes much longer than the later calls in the program because of the cost of JIT compilation, so the first run should not be counted when evaluating performance. The test code in your sample1.cpp should therefore be changed as follows:

// oneDPL headers first, then the standard headers the program needs.
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <vector>
#include <chrono>
#include <iostream>

int main(int argc, char* argv[]) {
    std::vector<int> data1(5000000);
    std::vector<int> data2(5000000);

    // Untimed warm-up calls: the first parallel fill pays the one-time
    // JIT/initialization cost, so it is excluded from the measurement.
    std::fill(oneapi::dpl::execution::par_unseq, data1.begin(), data1.end(), -1);
    std::fill(data2.begin(), data2.end(), -1);

    // Intel Parallel STL (TBB backend) performance
    auto startTime = std::chrono::high_resolution_clock::now();
    std::fill(oneapi::dpl::execution::par_unseq, data1.begin(), data1.end(), -1);
    auto endTime = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(endTime - startTime).count();
    std::cout << "pstl/tbb : duration in microsec = " << duration << std::endl;

    // Compare this with the standard (sequential) STL performance
    startTime = std::chrono::high_resolution_clock::now();
    std::fill(data2.begin(), data2.end(), -1);
    endTime = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::microseconds>(endTime - startTime).count();
    std::cout << "std stl: duration in microsec= " << duration << std::endl;

    return 0;
}

Then you will see similar performance to pstl.cpp on your laptop.
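
If you want more stable numbers than a single timed call, one option (just a general benchmarking sketch on my side, not something the test requires) is to average several iterations after the warm-up, for example with a small helper like this:

#include <chrono>

// Returns the average duration, in microseconds, of `iterations` calls to
// the callable `f`. Call it only after the untimed warm-up fills above.
template <typename F>
long long average_microseconds(F&& f, int iterations) {
    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iterations; ++i) {
        f();
    }
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() / iterations;
}

For example, average_microseconds([&] { std::fill(oneapi::dpl::execution::par_unseq, data1.begin(), data1.end(), -1); }, 10) would report the average of 10 parallel fills.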


If you are interested in Just-In-Time Compilation in DPC++, please find more information in https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top/compilation/jitting.html.


Please try my suggestion and see if it solves your problem.


Thanks!

Shu


sc_Intel
Employee

@AnandKulkarniSG 

Has your issue been resolved? If I don't hear from you within 5 business days, I will assume your support request is resolved and will no longer monitor this thread.

Thanks!

sc_Intel
Employee

Since I haven't heard from you for a while, I have to assume your support request is resolved. We will no longer monitor this thread. Hope the reply I provided earlier answers your question.

If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

Thanks!

