Intel® Fortran Compiler

OpenMP speedup not realized due to spin time

AONym
New Contributor II
2,438 Views

I am attempting to speed up a large program. I have identified the hot spots using profiling, but when I use OpenMP to parallelize the key loops, I get only a slight speed-up (about 30%), instead of the ideal factor of 8, for the key loops.

I created a smaller test program to figure out what is happening. This is the part containing the OpenMP loops (the complete program is attached).

!$OMP PARALLEL DEFAULT(SHARED)
!$OMP DO  PRIVATE(iRadius)
        DO iRadius = 1, nRadii
            Dradius(iRadius)=DiffCoeff(iSpecies, iRadius, concTotal(iRadius, 1:3)) ! diffusion coefficient at iRadius
            sRadius(iRadius)=SedCoeff(iSpecies, iRadius, concTotal(iRadius, 1:3)) ! if compression, this will account for it
        END DO
!$OMP END DO
!$OMP END PARALLEL

!$OMP PARALLEL DEFAULT(SHARED)
!$OMP DO PRIVATE(iRadius)

        L_R2_Parallel: &
            DO iRadius = 1, nRadii
                Z(iRadius)=ZCalc(iRadius, iSpecies)
                G(iRadius, 1)=Dradius(iRadius-1)*dt*A1(iRadius, 1)+B(iRadius, 1)-sRadius(iRadius-1)*omSqRun*dt*A2(iRadius, 1)
                G(iRadius, 2)=Dradius(iRadius)*dt*A1(iRadius, 2)+B(iRadius, 2)-sRadius(iRadius)*omSqRun*dt*A2(iRadius, 2)
                G(iRadius, 3)=Dradius(iRadius+1)*dt*A1(iRadius, 3)+B(iRadius, 3)-sRadius(iRadius+1)*omSqRun*dt*A2(iRadius, 3)
            END DO L_R2_Parallel
        nThreads=omp_get_num_threads()
!$OMP END PARALLEL

VTune shows a large amount of time spent in __kmp_fork_barrier and __kmpc_barrier. I don't understand why any significant time is being spent at barriers, since an even division of the workload for each of the loops should result in all threads finishing at about the same time. Task Manager shows 100% CPU usage while the program is running, as expected. I have attached the VTune summary; it also shows a large "spin time", which is mostly the sum of these two.

Compiled under Visual Studio as x64 release build, with option /Qopenmp.

3.4 GHz Haswell (8 logical CPUs), Windows 7 64-bit, Visual Studio 2017 15.6.7, Intel XE2018 update 1.

22 Replies
AONym
New Contributor II
222 Views

andrew_4619 wrote:

Within the second loop you in effect multiply two constants omSqRun and dt by each other 3*nRadii times when once would do!

This should be taken care of by the optimizer: it should recognize the common subexpression omSqRun*dt, notice that it does not change during the loop, and therefore do the multiplication only once. That was my intent, but it does not work here because both omSqRun and dt are in MODULE ParallelTestData, and so could potentially be changed by the call to function ZCALC.

To get the Intel compiler to treat the common subexpression as constant, I can make a local copy of each variable outside the loop. For the code as shown, without this change, the compiler does load omSqRun and dt from memory once outside the loop and keeps them in registers (xmm6 and xmm7), but it still performs both multiplications for each expression inside the loop.

Here is the relevant assembler code produced by the Fortran compiler XE 2018 u1 (I left out most of the computation for clarity):

        movsd     xmm6, QWORD PTR [PARALLELTESTDATA_mp_OMSQRUN] ;125.68
        lea       rax, QWORD PTR [rax+r15*8]                    ;125.68
        movsd     xmm7, QWORD PTR [PARALLELTESTDATA_mp_DT]      ;125.17
...
                                ; LOE rbx rdi r12 r13 r14 r15 esi xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
.B1.50::                        ; Preds .B1.107 .B1.49
                                ; Execution count [9.72e+008]
                ; optimization report
                ; OPENMP LOOP
                ; %s was not vectorized: vector dependence prevents vectorization%s
                ; VECTOR TRIP COUNT IS ESTIMATED CONSTANT
...
        call      ZCALC                                         ;123.28
...
        mulsd     xmm1, xmm6                                    ;125.110
        mulsd     xmm3, xmm6                                    ;126.106
        mulsd     xmm5, xmm6                                    ;127.110


...
        mulsd     xmm2, xmm7                                    ;125.17
        mulsd     xmm4, xmm7                                    ;126.17
        mulsd     xmm0, xmm7                                    ;127.17
...
        inc       rdi                                           ;119.7
...
        jb        .B1.50        ; Prob 99%                      ;119.7
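
For reference, this is a minimal sketch of the local-copy version I have in mind. dtLocal and dtOmSq are hypothetical local REAL(8) variables I introduce for illustration; everything else is from the test program:

        dtLocal = dt                ! local copies: the compiler can now treat these as
        dtOmSq  = omSqRun*dt        ! loop-invariant, even across the call to ZCalc
!$OMP PARALLEL DEFAULT(SHARED)
!$OMP DO PRIVATE(iRadius)
        DO iRadius = 1, nRadii
            Z(iRadius)=ZCalc(iRadius, iSpecies)
            G(iRadius, 1)=Dradius(iRadius-1)*dtLocal*A1(iRadius, 1)+B(iRadius, 1)-sRadius(iRadius-1)*dtOmSq*A2(iRadius, 1)
            G(iRadius, 2)=Dradius(iRadius)*dtLocal*A1(iRadius, 2)+B(iRadius, 2)-sRadius(iRadius)*dtOmSq*A2(iRadius, 2)
            G(iRadius, 3)=Dradius(iRadius+1)*dtLocal*A1(iRadius, 3)+B(iRadius, 3)-sRadius(iRadius+1)*dtOmSq*A2(iRadius, 3)
        END DO
!$OMP END DO
!$OMP END PARALLEL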


John_Campbell
New Contributor II
222 Views

I think one of the problems with your smaller test program is that there is not enough work in each loop iteration to justify !$OMP.
However, your splitting of !$OMP PARALLEL and !$OMP DO does appear to perform better for this test program. Implementing this approach does require more care. I used some counters to confirm the looping and varied between !$OMP SINGLE, !$OMP MASTER and "IF (iThread==0)" for limiting calculations to a single thread. Minimising barriers can have an effect; roughly, the structure I mean is sketched below.
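
This is only a sketch (loop bodies abbreviated, and the SINGLE section is just one of the variants I tried), built from the loops in your test program:

!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(iRadius)    ! one region enclosing both loops
!$OMP DO
        DO iRadius = 1, nRadii
            Dradius(iRadius)=DiffCoeff(iSpecies, iRadius, concTotal(iRadius, 1:3))
            sRadius(iRadius)=SedCoeff(iSpecies, iRadius, concTotal(iRadius, 1:3))
        END DO
!$OMP END DO                       ! keep this barrier: the next loop reads Dradius/sRadius

!$OMP DO
        DO iRadius = 1, nRadii
            Z(iRadius)=ZCalc(iRadius, iSpecies)
            ! ... same G(iRadius, 1:3) assignments as in the original loop ...
        END DO
!$OMP END DO NOWAIT                ! nothing after this depends on the loop, so skip the barrier

!$OMP SINGLE
        nThreads=omp_get_num_threads()   ! executed by one thread instead of all of them
!$OMP END SINGLE NOWAIT
!$OMP END PARALLEL                 ! one fork/join pair instead of two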

I experimented with a second-generation 4-core i5 and an eighth-generation 6-core i7 and, to my surprise, found that the i5 ran faster than the i7.
I then limited the i5 to 3 threads and the i7 to 6 (threads = number of cores) and found that the run times improved for both.
I concluded from this that using !$OMP PARALLEL with the maximum number of threads is not optimal and that fewer threads (no hyper-threading) work better, although this may be specific to my test configuration.
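
For the thread-count experiments I simply capped the count before entering the region, along these lines (the value 3 is the i5 setting mentioned above):

        USE omp_lib

        CALL omp_set_num_threads(3)        ! program-wide cap; setting OMP_NUM_THREADS in the environment also works
!$OMP PARALLEL DEFAULT(SHARED)             ! or per region: !$OMP PARALLEL NUM_THREADS(3)
        nThreads = omp_get_num_threads()
!$OMP END PARALLEL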

I have attached the changed test program.
I also moved the dot_product into the DO loop as a reduction, which appears to be better suited to the loop workload than a separate DOT_PRODUCT call.
I also included some more monitoring of elapsed times, which may help identify the relative performance.
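
The reduction looks roughly like this; sumC and weight are placeholder names standing in for whatever the original DOT_PRODUCT combined:

        sumC = 0.0D0
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(iRadius) REDUCTION(+:sumC)
        DO iRadius = 1, nRadii
            ! ... existing loop body ...
            sumC = sumC + weight(iRadius)*concTotal(iRadius, 1)   ! accumulated in-loop instead of a separate DOT_PRODUCT
        END DO
!$OMP END PARALLEL DO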

Thanks for the example and hopefully some of these results may transfer to your main program.
