I am experimenting with different implementations of a straightforward algorithm as well as with different compile options. I am running into very odd results:
- The -Qopenmp option does not provide a speed-up; it actually slows the program down.
- With some combinations I seem to get race conditions, resulting in an access violation.
- The -Qparallel option makes the program stop at about three-quarters of the calculation (around 7,000 iterations instead of 10,000).
- The two versions of the program I have put in the attached zip file work fine if I leave out these compile options.
For details see the readme.txt file.
I am not expecting miracles from these options, but this seems to hint at problems in the compiler.
!$omp parallel reduction(+:u)
!$omp workshare
pderiv = diffw * pwest + &
         diffe * peast + &
         diffn * pnorth + &
         diffs * psouth   &
       - (diffw + diffe + diffn + diffs) * pcentre + pforce
u = u + deltt * du
!$omp end workshare
!$omp end parallel
Did you forget the reduction clause?
Jim
I tried this, and also using workshare directives on the individual statements, but to no avail: the program remains unstable.
You are right, though, that there is a dependency within the original program that I overlooked. It causes no problems when the calculation is done sequentially, so that is a useful lesson to learn.
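To illustrate the kind of dependency involved, here is a minimal sketch with hypothetical names (u, unew, deltt, diff; not my actual code): updating the array in place while neighbouring values are still being read is harmless sequentially, but becomes a data race once the loop is split over threads. Writing the new values into a separate array removes the dependency.

subroutine step_diffusion(u, unew, nx, ny, deltt, diff)
    implicit none
    integer, intent(in)  :: nx, ny
    real, intent(in)     :: deltt, diff
    real, intent(inout)  :: u(nx,ny)
    real, intent(inout)  :: unew(nx,ny)
    integer :: i, j

    ! Updating u(i,j) in place would read neighbours that another thread
    ! may already have overwritten; writing into unew avoids that.
    !$omp parallel do private(i)
    do j = 2, ny - 1
        do i = 2, nx - 1
            unew(i,j) = u(i,j) + deltt * diff * &
                (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) - 4.0 * u(i,j))
        end do
    end do
    !$omp end parallel do

    u(2:nx-1,2:ny-1) = unew(2:nx-1,2:ny-1)
end subroutine step_diffusion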
The workshare method may not be efficient if the compiler chooses to allocate a temporary for the expression prior to the assignment.
Performance-wise, it would be better to use an explicit parallel do that iterates over the rows.
Better-performing code would tile the array with columns in multiples of 16. In the 300x300 example case, allocate it as (304,300) and initialize it to zeros.
While the code is more complex, if time is of the essence, the additional coding now will save you time later.
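Roughly along these lines (a minimal sketch with hypothetical names and placeholder coefficients, not your attached code): the leading dimension is padded to a multiple of 16, the arrays are initialized to zero, and an explicit parallel do runs over the outer index while the inner loop walks the contiguous first dimension.

program padded_stencil
    implicit none
    integer, parameter :: n = 300, ld = 304       ! leading dimension padded to a multiple of 16
    real, parameter    :: deltt = 0.1, diff = 0.25 ! placeholder values only
    real, allocatable  :: p(:,:), pnew(:,:)
    integer :: i, j

    allocate (p(ld,n), pnew(ld,n))
    p = 0.0
    pnew = 0.0

    !$omp parallel do private(i)
    do j = 2, n - 1
        do i = 2, n - 1                            ! inner loop walks the contiguous first dimension
            pnew(i,j) = p(i,j) + deltt * diff * &
                (p(i-1,j) + p(i+1,j) + p(i,j-1) + p(i,j+1) - 4.0 * p(i,j))
        end do
    end do
    !$omp end parallel do
end program padded_stencil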
In the book High Performance Parallelism Pearls (Morgan Kaufmann Publishers, (C) 2015), chapter 5, I showed and compared various methods of simulating the diffusion of a solute through a volume of liquid over time within a 3D container. It has very similar calculation requirements to your 2D sample code. The example case was targeted at a Xeon Phi 5110P coprocessor with 60 cores/240 threads. The small test model was 256x256x256, the large test model 512x512x512.
Jim Dempsey
Thanks for this suggestion. It is definitely worth examining. I am looking at the different possibilities for implementing this, and performance is one of the criteria I want to examine. I did not know of this book, I must admit, and I will try to get hold of a copy.
Volume 1 (the title does not actually say "Volume 1").
From the promo:
Attached are the code examples (Linux).
You will have to update it a bit; the Xeon variant should run as-is. The Xeon Phi code may not compile, as you would need an older coprocessor and an older version of Intel Parallel Studio.
I haven't revisited this code to update it for newer GPUs (I am waiting for the newer Battlemage GPUs to be released).
Jim Dempsey
Thanks for the code and the book's details.
Meanwhile I have run my unparallelised program with matrix sizes that are multiples of 100 and of 96 (so a clear multiple of 16 :)). The sizes ranged from 1×100 (and 1×96) up to 30×100 (and 30×96). At least with the compiler and the options I used, /O2 /Qxhost /heap-arrays, the differences between the two series are not systematic, but I did see a difference between two subsequent runs that made me pause. See the three attached output files that summarise the CPU and wall-clock timings. The circumstances were identical, with the very same executables, yet I have no clue why the first run is so much slower. The second and third runs give very similar results.
I should repeat this with some other compile options.
(Attaching files is not working, so I am copying their contents here.)
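(For reference, a minimal sketch of one way to collect such CPU and wall-clock timings in standard Fortran; this is only an illustration, not the actual test program.)

program time_one_case
    implicit none
    real    :: cpu0, cpu1
    integer :: count0, count1, count_rate

    call cpu_time(cpu0)                    ! processor time
    call system_clock(count0, count_rate)  ! wall-clock time

    call run_case(300)                     ! placeholder for the real calculation on one matrix size

    call cpu_time(cpu1)
    call system_clock(count1)

    print '(a,i6,2f12.6)', ' array ', 300, cpu1 - cpu0, &
        real(count1 - count0) / real(count_rate)
contains
    subroutine run_case(n)
        integer, intent(in) :: n
        ! stand-in for the actual computation
    end subroutine run_case
end program time_one_case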
First run (the word "array" indicates the version of the code; the other versions gave very similar timings). The two numeric columns are the CPU and wall-clock times in seconds:
array 100 0.137000 0.140625
array 200 0.332000 0.328125
array 300 0.773000 0.781250
array 400 1.36600 1.35938
array 500 2.36100 2.32812
array 600 4.20700 3.98438
array 700 11.1930 10.8438
array 800 15.9220 15.1094
array 900 35.3590 29.7344
array 1000 70.1730 61.8125
array 1200 80.9790 75.0469
array 1400 106.704 99.4531
array 1600 155.164 144.125
array 1800 187.110 177.688
array 2000 240.011 227.672
array 2300 323.845 306.297
array 2600 400.688 376.891
array 3000 545.338 513.516
array 96 0.149000 0.156250
array 192 0.357000 0.359375
array 288 0.712000 0.703125
array 384 1.30000 1.29688
array 480 2.12600 2.12500
array 576 2.69900 2.67188
array 672 7.98500 7.71875
array 768 17.2240 16.4062
array 864 41.2990 37.4062
array 960 39.6680 38.2812
array 1152 66.3870 63.9531
array 1344 100.152 94.9688
array 1536 165.978 157.328
array 1728 178.468 169.391
array 1920 227.571 211.391
array 2208 288.139 274.750
array 2496 375.992 356.562
array 2880 510.286 480.453
Second run:
array 100 0.144000 0.125000
array 200 0.387000 0.375000
array 300 0.717000 0.718750
array 400 1.45600 1.23438
array 500 2.06400 2.04688
array 600 2.81100 2.79688
array 700 3.81300 3.73438
array 800 5.18200 4.93750
array 900 6.44200 6.32812
array 1000 8.86500 8.82812
array 1200 14.4560 14.2031
array 1400 21.9720 21.4688
array 1600 31.1820 30.7969
array 1800 41.0080 40.0000
array 2000 51.1810 50.5469
array 2300 72.4830 70.8906
array 2600 104.267 102.000
array 3000 137.264 134.672
array 96 0.142000 0.140625
array 192 0.345000 0.328125
array 288 0.637000 0.625000
array 384 1.25700 1.23438
array 480 1.85400 1.85938
array 576 2.60000 2.57812
array 672 3.46500 3.43750
array 768 4.55000 4.54688
array 864 5.67200 5.65625
array 960 7.39900 7.31250
array 1152 12.8170 12.6094
array 1344 20.0050 19.8281
array 1536 33.0860 32.1562
array 1728 37.1120 36.3594
array 1920 46.9890 45.8438
array 2208 64.6610 63.5000
array 2496 87.2770 86.3281
array 2880 122.998 119.922
Third run:
array 100 0.148000 0.140625
array 200 0.384000 0.390625
array 300 0.746000 0.750000
array 400 1.35200 1.34375
array 500 2.25900 2.23438
array 600 2.99700 2.95312
array 700 3.96600 3.93750
array 800 5.01500 4.96875
array 900 6.55800 6.50000
array 1000 8.54400 8.32812
array 1200 14.8070 14.3906
array 1400 22.0480 21.6875
array 1600 32.2550 31.7969
array 1800 41.0190 40.2656
array 2000 51.4770 50.3750
array 2300 72.4580 70.8594
array 2600 97.2020 95.5000
array 3000 153.008 150.281
array 96 0.139000 0.140625
array 192 0.333000 0.312500
array 288 0.651000 0.656250
array 384 1.16500 1.12500
array 480 1.90200 1.90625
array 576 2.58000 2.54688
array 672 3.66600 3.45312
array 768 4.57100 4.56250
array 864 5.90600 5.85938
array 960 7.44500 7.43750
array 1152 12.9550 12.6250
array 1344 19.6700 19.4844
array 1536 28.3800 27.3125
array 1728 37.7050 36.1719
array 1920 46.6500 46.0312
array 2208 64.8380 63.3438
array 2496 87.9230 85.8750
array 2880 121.877 119.594
