Hello,
I wonder if anyone has the time and inclination to have a look at the code below for
any possible improvements.
The extract included here is the heaviest user of CPU time in a large-ish simulation code.
A typical run takes 6-9 months, running 24/7 with 6 threads on six cores.
The OpenMP part is working very well and there cannot be much improvement in the multithreading part.
The compiler call used for the whole code is:
ifort -O3 -r8 -openmp -fpp -parallel -mcmodel=medium -i-dynamic -shared-intel
Would there be a benefit if part or all of it were written in assembler?
Lots and lots of thanks for any suggestions.
--
! typical values
! N1 = 768
! N2 = N3 = 12
! M3 = M2 = 42
Do KEL = 1, N3
  Do JEL = 1, N2
    ...
    [address calculations]
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE( I, J, K, JA, KA, JJ, KK )
!$OMP DO
    Do K = 1, M3
      Do J = 1, M2
        JJ = (J-1)*NX32
        ! - copy into work arrays for later fft.
        Do I = 1, N1
          WK_1( JJ+I, K ) = U( J_Jump+J, K_Jump+K, I )
          WK_2( JJ+I, K ) = V( J_Jump+J, K_Jump+K, I )
          WK_3( JJ+I, K ) = W( J_Jump+J, K_Jump+K, I )
        End Do
        Do I = 1, N1, 2
          ! - du/dx
          WKX_1( JJ+I,   K ) = -Wv(I)*U( J_Jump+J, K_Jump+K, I+1 )
          WKX_1( JJ+I+1, K ) =  Wv(I)*U( J_Jump+J, K_Jump+K, I )
          ! - dv/dx
          WKX_2( JJ+I,   K ) = -Wv(I)*V( J_Jump+J, K_Jump+K, I+1 )
          WKX_2( JJ+I+1, K ) =  Wv(I)*V( J_Jump+J, K_Jump+K, I )
          ! - dw/dx
          WKX_3( JJ+I,   K ) = -Wv(I)*W( J_Jump+J, K_Jump+K, I+1 )
          WKX_3( JJ+I+1, K ) =  Wv(I)*W( J_Jump+J, K_Jump+K, I )
        End Do
        ! - Y derivatives.
        Do JA = 1, M2
          Do I = 1, N1
            WK_4( JJ+I, K ) = WK_4( JJ+I, K ) + RDY*DYGL(J,JA)*U( J_Jump+JA, K_Jump+K, I ) ! du/dy
            WK_5( JJ+I, K ) = WK_5( JJ+I, K ) + RDY*DYGL(J,JA)*V( J_Jump+JA, K_Jump+K, I ) ! dv/dy
            WK_6( JJ+I, K ) = WK_6( JJ+I, K ) + RDY*DYGL(J,JA)*W( J_Jump+JA, K_Jump+K, I ) ! dw/dy
          End Do
        End Do
        ! - Z derivatives.
        Do KA = 1, M3
          Do I = 1, N1
            WK_7( JJ+I, K ) = WK_7( JJ+I, K ) + RDZ*DZGL(K,KA)*U( J_Jump+J, K_Jump+KA, I ) ! du/dz
            WK_8( JJ+I, K ) = WK_8( JJ+I, K ) + RDZ*DZGL(K,KA)*V( J_Jump+J, K_Jump+KA, I ) ! dv/dz
            WK_9( JJ+I, K ) = WK_9( JJ+I, K ) + RDZ*DZGL(K,KA)*W( J_Jump+J, K_Jump+KA, I ) ! dw/dz
          End Do
        End Do
      End Do
    End Do ! eo single element loop.
!$OMP END DO
!$OMP END PARALLEL
    ...
    [other stuff]
  End Do
End Do
I suppose you would need to select SSE4.1 or AVX to see vectorization in the opt-report. If the compiler doesn't accept unroll-and-jam directives at -O3, you may need to write the unrolling explicitly and use an event profiler to see whether you are alleviating poor cache behavior.
My best results with unroll and jam came with specified number:
!dir$ unroll_and_jam = 2
You would put the directive ahead of the JA and KA loops. This should replace the unrolling which the compiler normally does on the inner loop only (which could be suppressed by putting !dir$ unroll(0) there). The compiler would tell you in the opt-report what unrolling was actually chosen.
If 2 works, you would then need to test whether a larger number (including whatever the compiler might choose on its own) is better, using a cut-down version of your application. For VTune profiling (with one of the L1/L2 cache analysis choices), you probably need a case which runs in just a few minutes without requiring multiple runs.
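For instance, applied to the Y-derivative loop from the original extract, the placement would look like this (a sketch only; the unroll factor of 2 is the starting point suggested above, and the !dir$ spellings are as accepted by ifort):

```fortran
! - Y derivatives: ask the compiler to unroll-and-jam the JA loop by 2,
!   and suppress its usual inner-loop unrolling on the I loop.
!dir$ unroll_and_jam = 2
Do JA = 1, M2
!dir$ unroll(0)
  Do I = 1, N1
    WK_4( JJ+I, K ) = WK_4( JJ+I, K ) + RDY*DYGL(J,JA)*U( J_Jump+JA, K_Jump+K, I ) ! du/dy
  End Do
End Do
```

The same pair of directives would go on the KA loop for the Z derivatives; the opt-report then confirms what the compiler actually did.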
Thank you.
With or without the unroll_and_jam directive, -opt-report gives nothing for
the lines concerned.
Anyway, I found that
!dir$ unroll_and_jam = 4
produces about a 4-5% reduction in CPU time for the whole code. Not a bad start.
I will follow the rest of your suggestions and see what comes up.
I am also trying to switch to FFTW, with weird things happening, but that
is for another forum - I know.
--
Can you reorganize your U, V, W arrays such that the last index is moved to be first?
Alternatively, if the U, V, W arrays are relatively fixed through long compute sections, then consider shadowing U, V, W with Uotherway, Votherway, Wotherway that have the I index first (you can pick a different notation for "otherway"). Only update the shadows when necessary.
Making the above change will improve vectorization opportunities (at the expense of making the copy).
Jim Dempsey
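A minimal sketch of the shadowing idea, using the placeholder names above (Uotherway etc., and the bounds NY, NZ, are illustrative, not from the original code):

```fortran
! Refresh the shadow copies with the I index first, so that the
! innermost loop over I becomes unit-stride and vectorizable.
Do K = 1, NZ
  Do J = 1, NY
    Do I = 1, N1
      Uotherway( I, J, K ) = U( J, K, I )
      Votherway( I, J, K ) = V( J, K, I )
      Wotherway( I, J, K ) = W( J, K, I )
    End Do
  End Do
End Do
```

The hot loops then read Uotherway( I, ... ) contiguously, and this copy is redone only when U, V, W change.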
JD,
The I subscript was intentionally put last because the problem
is decoupled with respect to this index for quite a lot of the code.
So there are a number of places elsewhere where the OMP loop is over
the last index I. This makes the whole calculation quite efficient elsewhere
in the code, and thus this bit now stands out. I may try some copying into
temporary work arrays to see if things improve.
I tried the -x option of ifort with SSE4.1 and AVX with no perceptible
change in the CPU timings.
Thanks.
--
I have tested JD's suggestion of carrying two copies of the big arrays
with reversed indexing, i.e. I now have both uri( i, j, k ) = u( j, k, i ) etc.
On an old Q9650 (4 cores, 4 threads) the overall reduction in CPU usage is 8%.
On the newer i7-3090k (6 cores & 6 threads), overall CPU usage dropped
by 33% (no kidding).
The !dir$ unroll_and_jam directive fiddling gives a consistent 4% reduction in CPU time.
These are quite good results, at the cost of a 33% increase in
RAM usage (+1.4GB for my current case).
There are a few more things to try.
Now, what stands out as the most CPU-intensive part is the FFTs.
I am trying to switch to FFTW, but so far this has completely
messed up the whole code.
Thank you.
--
33% improvement is a good return on a little bit of time invested in making this change.
In some cases you will want to organize your data paying particular attention to the dimension index with respect to vectorization. IOW, which order will improve performance by improving vectorization.
In other cases, part of the code may benefit from one dimension order while a different part of the code benefits from a different order. When the order is stable (one order is used multiple times in a row before the other is needed), I incorporate a flag indicating which array has the most recent representation of the data (0 = a same as b, 1 = a more recent than b, -1 = b more recent than a). When the most recent copy is not the one I wish to use, a copy is made and the flag set to 0 if the procedure is read-only, or 1/-1 for a modification, depending on which array is being modified. This cuts down on unnecessary copies/transformations.
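A sketch of the recency-flag bookkeeping described above (the names and the copy routine are illustrative, not from any actual code):

```fortran
! ab_flag:  0 = a and b hold the same data
!           1 = a is more recent than b
!          -1 = b is more recent than a
If ( ab_flag == 1 ) Then        ! b is stale but we need it now
  Call copy_a_to_b()            ! transpose/copy, hypothetical routine
  ab_flag = 0                   ! both copies now current
End If
Call read_only_work_on_b()      ! flag stays 0: nothing was modified
Call modify_b()
ab_flag = -1                    ! b is now the authoritative copy
```

The flag test replaces an unconditional copy before every section, so the transpose cost is paid only when the stale copy is actually needed.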
FWIW I also used this technique when an array could reside in CPU memory or GPU memory (or both).
Considering that SIMD vectors are getting wider and wider, more care is required in designing data layouts. Discount scatter/gather, as that will only reduce instruction count, not store/fetch cycles. Scatter and gather are beneficial for infrequent accesses of data vectors used in combination with other, non-gathered vectors.
You will tend to find that most gripes about poor vectorization are solvable by reorganizing the data.
Jim Dempsey