I am experimenting with different implementations of a straightforward algorithm as well as with different compile options. I am running into very odd results:
- The -Qopenmp option does not provide a speed-up; it actually slows the program down.
- With some combinations I seem to get race conditions, resulting in an access violation.
- The -Qparallel option makes the program stop at about three-quarters of the calculation (around 7,000 iterations instead of 10,000).
- The two versions of the program I have put in the attached zip file work fine if I leave out these compile options.
For details see the readme.txt file.
I am not expecting miracles from these options, but this seems to hint at problems in the compiler.
!$omp parallel reduction(+:u)
!$omp workshare
pderiv = diffw * pwest + &
         diffe * peast + &
         diffn * pnorth + &
         diffs * psouth   &
       - (diffw + diffe + diffn + diffs) * pcentre + pforce
u = u + deltt * du
!$omp end workshare
!$omp end parallel
Did you forget the reduction clause?
Jim
I tried this, and also using workshare directives on the individual statements, but to no avail: the program remains unstable.
You are right, though, that there is a dependency within the original program that I overlooked. It causes no problems when the calculation is done sequentially, so that is a useful lesson to learn.
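To illustrate the kind of dependency involved, here is a minimal sketch with hypothetical names (u, unew, deltt, diff; not my actual code): updating the array in place while neighbouring values are still being read is harmless sequentially, but becomes a data race once the loop is split over threads. Writing the new values into a separate array removes the dependency.

subroutine step_diffusion(u, unew, nx, ny, deltt, diff)
    implicit none
    integer, intent(in)  :: nx, ny
    real, intent(in)     :: deltt, diff
    real, intent(inout)  :: u(nx,ny)
    real, intent(inout)  :: unew(nx,ny)
    integer :: i, j

    ! Updating u(i,j) in place would read neighbours that another thread
    ! may already have overwritten; writing into unew avoids that.
    !$omp parallel do private(i)
    do j = 2, ny - 1
        do i = 2, nx - 1
            unew(i,j) = u(i,j) + deltt * diff * &
                (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) - 4.0 * u(i,j))
        end do
    end do
    !$omp end parallel do

    u(2:nx-1,2:ny-1) = unew(2:nx-1,2:ny-1)
end subroutine step_diffusion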
The workshare method may not be efficient if the compiler chooses to allocate a temporary for the expression prior to the assignment.
Performance-wise, it would be better to use an explicit parallel do that iterates over the rows.
Better-performing code would tile the array with columns in multiples of 16. In the 300x300 example case, allocate it as (304,300) and initialize it to zeros.
While the code is more complex, if time is of the essence, the additional coding now will save you time later.
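Roughly along these lines (a minimal sketch with hypothetical names and placeholder coefficients, not your attached code): the leading dimension is padded to a multiple of 16, the arrays are initialized to zero, and an explicit parallel do runs over the outer index while the inner loop walks the contiguous first dimension.

program padded_stencil
    implicit none
    integer, parameter :: n = 300, ld = 304       ! leading dimension padded to a multiple of 16
    real, parameter    :: deltt = 0.1, diff = 0.25 ! placeholder values only
    real, allocatable  :: p(:,:), pnew(:,:)
    integer :: i, j

    allocate (p(ld,n), pnew(ld,n))
    p = 0.0
    pnew = 0.0

    !$omp parallel do private(i)
    do j = 2, n - 1
        do i = 2, n - 1                            ! inner loop walks the contiguous first dimension
            pnew(i,j) = p(i,j) + deltt * diff * &
                (p(i-1,j) + p(i+1,j) + p(i,j-1) + p(i,j+1) - 4.0 * p(i,j))
        end do
    end do
    !$omp end parallel do
end program padded_stencil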
In the book High Performance Parallelism Pearls (Morgan Kaufmann Publishers, (C) 2015), chapter 5, I showed and compared various methods of simulating the diffusion of a solute through a volume of liquid over time within a 3D container. It has very similar calculation requirements to your 2D sample code. The example case was targeted at a Xeon Phi 5110P coprocessor with 60 cores/240 threads. The small test model was 256x256x256, the large test model 512x512x512.
Jim Dempsey
Thanks for this suggestion. It is definitely worth examining. I am looking at the different possibilities for implementing this, and performance is one of the criteria I want to examine. I did not know of this book, I must admit, and I will try to get hold of a copy.
Volume 1 (the title does not actually say "Volume 1").
From the promo:
Attached are the code examples (Linux).
You will have to update it a bit; the Xeon variant should run as-is. The Xeon Phi code may not compile, as you would need an older coprocessor and an older version of Intel Parallel Studio.
I haven't revisited this code to update it for newer GPUs (I am waiting for the newer Battlemage GPUs to be released).
Jim Dempsey
Thanks for the code and the book's details.
Meanwhile I have run my unparallelised program with matrix sizes that are multiples of 100 and of 96 (so a clear multiple of 16 :)). The sizes ranged from 1×100 (and 1×96) up to 30×100 (and 30×96). At least with the compiler and the options I used, /O2 /Qxhost /heap-arrays, the differences between the two series are not systematic, but I did see a difference between two subsequent runs that made me pause. See the three attached output files that summarise the CPU and wall-clock timings. The circumstances were identical, with the very same executables, yet I have no clue why the first run is so much slower. The second and third runs give very similar results.
I should repeat this with some other compile options.
(Attaching files is not working, so I am copying their contents here.)
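(For reference, a minimal sketch of one way to collect such CPU and wall-clock timings in standard Fortran; this is only an illustration, not the actual test program.)

program time_one_case
    implicit none
    real    :: cpu0, cpu1
    integer :: count0, count1, count_rate

    call cpu_time(cpu0)                    ! processor time
    call system_clock(count0, count_rate)  ! wall-clock time

    call run_case(300)                     ! placeholder for the real calculation on one matrix size

    call cpu_time(cpu1)
    call system_clock(count1)

    print '(a,i6,2f12.6)', ' array ', 300, cpu1 - cpu0, &
        real(count1 - count0) / real(count_rate)
contains
    subroutine run_case(n)
        integer, intent(in) :: n
        ! stand-in for the actual computation
    end subroutine run_case
end program time_one_case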
First run (the word "array" indicates the version of the code; the other versions gave very similar timings). The two numeric columns are the CPU and wall-clock times in seconds:
array 100 0.137000 0.140625
array 200 0.332000 0.328125
array 300 0.773000 0.781250
array 400 1.36600 1.35938
array 500 2.36100 2.32812
array 600 4.20700 3.98438
array 700 11.1930 10.8438
array 800 15.9220 15.1094
array 900 35.3590 29.7344
array 1000 70.1730 61.8125
array 1200 80.9790 75.0469
array 1400 106.704 99.4531
array 1600 155.164 144.125
array 1800 187.110 177.688
array 2000 240.011 227.672
array 2300 323.845 306.297
array 2600 400.688 376.891
array 3000 545.338 513.516
array 96 0.149000 0.156250
array 192 0.357000 0.359375
array 288 0.712000 0.703125
array 384 1.30000 1.29688
array 480 2.12600 2.12500
array 576 2.69900 2.67188
array 672 7.98500 7.71875
array 768 17.2240 16.4062
array 864 41.2990 37.4062
array 960 39.6680 38.2812
array 1152 66.3870 63.9531
array 1344 100.152 94.9688
array 1536 165.978 157.328
array 1728 178.468 169.391
array 1920 227.571 211.391
array 2208 288.139 274.750
array 2496 375.992 356.562
array 2880 510.286 480.453
Second run:
array 100 0.144000 0.125000
array 200 0.387000 0.375000
array 300 0.717000 0.718750
array 400 1.45600 1.23438
array 500 2.06400 2.04688
array 600 2.81100 2.79688
array 700 3.81300 3.73438
array 800 5.18200 4.93750
array 900 6.44200 6.32812
array 1000 8.86500 8.82812
array 1200 14.4560 14.2031
array 1400 21.9720 21.4688
array 1600 31.1820 30.7969
array 1800 41.0080 40.0000
array 2000 51.1810 50.5469
array 2300 72.4830 70.8906
array 2600 104.267 102.000
array 3000 137.264 134.672
array 96 0.142000 0.140625
array 192 0.345000 0.328125
array 288 0.637000 0.625000
array 384 1.25700 1.23438
array 480 1.85400 1.85938
array 576 2.60000 2.57812
array 672 3.46500 3.43750
array 768 4.55000 4.54688
array 864 5.67200 5.65625
array 960 7.39900 7.31250
array 1152 12.8170 12.6094
array 1344 20.0050 19.8281
array 1536 33.0860 32.1562
array 1728 37.1120 36.3594
array 1920 46.9890 45.8438
array 2208 64.6610 63.5000
array 2496 87.2770 86.3281
array 2880 122.998 119.922
Third run:
array 100 0.148000 0.140625
array 200 0.384000 0.390625
array 300 0.746000 0.750000
array 400 1.35200 1.34375
array 500 2.25900 2.23438
array 600 2.99700 2.95312
array 700 3.96600 3.93750
array 800 5.01500 4.96875
array 900 6.55800 6.50000
array 1000 8.54400 8.32812
array 1200 14.8070 14.3906
array 1400 22.0480 21.6875
array 1600 32.2550 31.7969
array 1800 41.0190 40.2656
array 2000 51.4770 50.3750
array 2300 72.4580 70.8594
array 2600 97.2020 95.5000
array 3000 153.008 150.281
array 96 0.139000 0.140625
array 192 0.333000 0.312500
array 288 0.651000 0.656250
array 384 1.16500 1.12500
array 480 1.90200 1.90625
array 576 2.58000 2.54688
array 672 3.66600 3.45312
array 768 4.57100 4.56250
array 864 5.90600 5.85938
array 960 7.44500 7.43750
array 1152 12.9550 12.6250
array 1344 19.6700 19.4844
array 1536 28.3800 27.3125
array 1728 37.7050 36.1719
array 1920 46.6500 46.0312
array 2208 64.8380 63.3438
array 2496 87.9230 85.8750
array 2880 121.877 119.594
