Performance problem of MonteCarlo integration

SandeepKoranne · ‎05-08-2021

Hello

I am comparing the runtime performance of a simple sample/reject
Monte-Carlo integration scheme when built with DPC++ and with GCC.

The program is run on the following machine:
model name : Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

The code is attached to this report.
./M1.exe M (number of polynomials) N (number of trials) T (number of threads)
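
The attachment is not reproduced in this thread. For readers without it, here is a minimal sketch of what a sample/reject (hit-or-miss) integrator of this kind might look like. It is not the attached code: the names eval_poly and mc_integrate, the per-chunk std::mt19937_64 seeding, and the assumption that the integrand lies in [0, fmax] on [a, b] are all made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Evaluate c[0] + c[1]*x + c[2]*x^2 + ... by Horner's rule.
double eval_poly(const std::vector<double>& c, double x) {
    double y = 0.0;
    for (auto it = c.rbegin(); it != c.rend(); ++it) y = y * x + *it;
    return y;
}

// Hit-or-miss estimate of the integral of f over [a, b].
// Assumption: 0 <= f(x) <= fmax on [a, b]. Points are drawn uniformly in the
// box [a, b] x [0, fmax]; the accepted fraction times the box area
// approximates the integral.
template <class F>
double mc_integrate(F f, double a, double b, double fmax,
                    std::int64_t n_trials, int n_threads,
                    std::uint64_t seed = 12345) {
    assert(n_trials > 0 && n_threads > 0 && a < b && fmax > 0.0);
    std::int64_t hits = 0;
    // One chunk of trials per thread. Without OpenMP enabled the pragma is
    // ignored and the chunks simply run one after another.
    #pragma omp parallel for num_threads(n_threads) reduction(+ : hits)
    for (int t = 0; t < n_threads; ++t) {
        std::mt19937_64 rng(seed + static_cast<std::uint64_t>(t));  // hypothetical seeding scheme
        std::uniform_real_distribution<double> ux(a, b), uy(0.0, fmax);
        // Split the trials evenly; the first (n_trials % n_threads) chunks get one extra.
        std::int64_t n = n_trials / n_threads + (t < n_trials % n_threads ? 1 : 0);
        std::int64_t local = 0;
        for (std::int64_t i = 0; i < n; ++i) {
            double x = ux(rng);
            double y = uy(rng);
            if (y <= f(x)) ++local;  // accept the sample if it falls under the curve
        }
        hits += local;
    }
    return (b - a) * fmax * static_cast<double>(hits) / static_cast<double>(n_trials);
}
```

Calling mc_integrate with a lambda that wraps eval_poly for the coefficients {0, 0, 1} on [0, 1] with fmax = 1 should give a value close to 1/3. The lambda in the inner loop is the kind of construct whose inlining is asked about later in the thread.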

With 32 OpenMP threads, the DPCPP-compiled program is approximately 3 times slower
than the one compiled with GCC.

dpcpp -O3 -fopenmp -Wall -funroll-loops -ffast-math monte_carlo_integration.cpp -o MC_DPCPP.exe
time ./MC_DPCPP.exe 1000 10000000 32 > /dev/null

real 1m17.954s
user 36m22.515s
sys 0m0.401s

g++ -O3 -fopenmp -funroll-loops -ffast-math -fprofile-use monte_carlo_integration.cpp -o M1.exe
GCC 11.1
time ./M1_GCC111.exe 1000 10000000 32 > /dev/null

real 0m23.694s
user 11m8.420s
sys 0m0.019s

GCC 8.3.1
time ./MC_POLY 1000 10000000 32 > /dev/null

real 0m26.024s
user 12m21.249s
sys 0m0.020s

Running perf stat on the two binaries (M = 10, single thread) gives:

GCC 8.3.1
Performance counter stats for './MC_POLY 10 10000000 1':

5,619.33 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
167 page-faults:u # 0.030 K/sec
17,887,915,840 cycles:u # 3.183 GHz
30,678,358,310 instructions:u # 1.72 insn per cycle
4,101,348,797 branches:u # 729.864 M/sec
226,816,326 branch-misses:u # 5.53% of all branches

5.620014706 seconds time elapsed

5.609363000 seconds user
0.001990000 seconds sys

DPCPP
Performance counter stats for './MC_DPCPP.exe 10 10000000 1':

15,906.43 msec task-clock:u # 1.000 CPUs utilized
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
651 page-faults:u # 0.041 K/sec
49,488,407,800 cycles:u # 3.111 GHz
82,102,796,894 instructions:u # 1.66 insn per cycle
6,603,124,397 branches:u # 415.123 M/sec
3,099,931 branch-misses:u # 0.05% of all branches

15.911192960 seconds time elapsed
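
Putting the two counter sets side by side may help narrow the problem down. The sketch below only does arithmetic on the numbers printed above; the struct and function names are made up for illustration.

```cpp
#include <cstdint>

// Counter values copied from the two perf stat runs (M = 10, N = 10000000, 1 thread).
struct PerfSample {
    double task_clock_ms;
    std::uint64_t cycles, instructions, branches, branch_misses;
};

const PerfSample gcc_831{5619.33, 17887915840ULL, 30678358310ULL, 4101348797ULL, 226816326ULL};
const PerfSample dpcpp{15906.43, 49488407800ULL, 82102796894ULL, 6603124397ULL, 3099931ULL};

// How many times larger a is than b.
inline double ratio(double a, double b) { return a / b; }

// Instructions retired per cycle.
inline double ipc(const PerfSample& s) {
    return static_cast<double>(s.instructions) / static_cast<double>(s.cycles);
}

inline double time_ratio()  { return ratio(dpcpp.task_clock_ms, gcc_831.task_clock_ms); }  // about 2.83
inline double cycle_ratio() { return ratio(static_cast<double>(dpcpp.cycles),
                                           static_cast<double>(gcc_831.cycles)); }         // about 2.77
inline double insn_ratio()  { return ratio(static_cast<double>(dpcpp.instructions),
                                           static_cast<double>(gcc_831.instructions)); }   // about 2.68
```

The IPC of the two binaries is nearly the same (1.72 vs 1.66), and the DPCPP binary mispredicts far fewer branches, yet it retires about 2.7 times as many instructions. That suggests the gap comes mainly from the amount of code executed rather than from stalls or mispredictions, although the counters alone do not say why.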

VidyalathaB_Intel · ‎05-10-2021

Hi Sandeep,

Thanks for reaching out to us.

Could you please tell us which version of the DPC++ compiler you are using?

Meanwhile, we will look into this issue internally and get back to you soon.

Regards,

Vidya.

SandeepKoranne · ‎05-10-2021

Thanks Vidya

Intel(R) oneAPI DPC++ Compiler 2021.2.0 (2021.2.0.20210317)
Target: x86_64-unknown-linux-gnu

This is the version I am using.

Regards,

Sandeep

Viet_H_Intel · ‎05-11-2021

Hi Sandeep,

I've reported this problem to our development team.

Thanks,

SandeepKoranne · ‎05-22-2021

Hi

Is there any update on this issue?

Even single-threaded performance is much (about 3x) slower than with GCC. Is this because LLVM is not able to optimize lambda functions?

Sandeep

Viet_H_Intel · ‎05-25-2021

Sorry, we don't have any update yet on this issue.

Viet_H_Intel · ‎06-13-2022

Hi,

This issue has been addressed. The next update will show that icpx is much faster with -fiopenmp.

Thanks,

Viet_H_Intel · ‎10-03-2022

Please upgrade to oneAPI 2022.3, which addresses this issue.

I am going to close this thread.

Thanks,