I am using Intel Fortran for Windows, 11.1.046.
I have a Fortran program in which most of the run time is spent in the following subroutine, which is called tens of thousands of times or more:
      Subroutine DMMult5M(N,A1,A2,A3,A4,A5,A6,B,C1,C2,C3,C4,C5,
     $                    D1,D2,D3,D4,D5)
      Implicit Real*8 (A-H,O-Z)
      Dimension A1(*),A2(*),A3(*),A4(*),A5(*),A6(*),
     $          C1(*),C2(*),C3(*),C4(*),C5(*),
     $          D1(*),D2(*),D3(*),D4(*),D5(*)
      Do I=1,N
         A1(I)=A1(I)+B*C1(I)*D1(I)
         A2(I)=A2(I)+B*(C1(I)*D2(I)+D1(I)*C2(I))
         A3(I)=A3(I)+B*(C1(I)*D3(I)+D1(I)*C3(I))
         A4(I)=A4(I)+B*(C1(I)*D4(I)+D1(I)*C4(I))
         A5(I)=A5(I)+B*(C2(I)*D2(I)+C3(I)*D3(I)+C4(I)*D4(I))
         A6(I)=A6(I)+B*(C1(I)*D5(I)+D1(I)*C5(I))
      EndDo
      Return
      End
Here N typically ranges from 1 to a few hundred.
I am using the /O3 optimization flag.
I've tried to parallelize the loop in this subroutine with OpenMP, like this:
c$OMP PARALLEL DO
      Do I=1,N
         A1(I)=A1(I)+B*C1(I)*D1(I)
         A2(I)=A2(I)+B*(C1(I)*D2(I)+D1(I)*C2(I))
         A3(I)=A3(I)+B*(C1(I)*D3(I)+D1(I)*C3(I))
         A4(I)=A4(I)+B*(C1(I)*D4(I)+D1(I)*C4(I))
         A5(I)=A5(I)+B*(C2(I)*D2(I)+C3(I)*D3(I)+C4(I)*D4(I))
         A6(I)=A6(I)+B*(C1(I)*D5(I)+D1(I)*C5(I))
      EndDo
c$OMP END PARALLEL DO
On an Intel Q9300 @ 2.5 GHz, all four cores are being used, but I always get wall-clock timings that are about a factor of two slower than with no parallelization.
Why is that?
Is there any hope of taking advantage of multiple processors for this subroutine?
Thanks very much for any advice.
BBoston
>>I have a Fortran program in which most of the time boils down to the following subroutine, which is called tens of thousands of times or more
Maybe you should consider moving the parallel code to the outer level:
"parallel outer, vector inner"
Jim
Thanks tim18.
Yes, vectorization is on. Prefetch insertion doesn't make any difference.
I also made sure to check N and only do OMP when N > 30.
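For reference, OpenMP's IF clause can express that cutoff directly on the directive; a minimal sketch (30 being the threshold just mentioned):

c$OMP PARALLEL DO IF(N .GT. 30)
      Do I=1,N
c        ... same loop body as above ...
      EndDo
c$OMP END PARALLEL DO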
I messed around with OMP_NUM_THREADS and KMP_AFFINITY a bit and found that the best parallel timings came with KMP_AFFINITY=granularity=fine,compact and OMP_NUM_THREADS=2.
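(For anyone reproducing this: these are ordinary environment variables, set on Windows before the run in the console that launches the program, e.g.

set KMP_AFFINITY=granularity=fine,compact
set OMP_NUM_THREADS=2

)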
But even then the timings are still worse than with a single processor (about a 10% increase in run time).
The interesting thing is that if I run 4 separate instances of my program, i.e., 4 separate processes, at the same time, then all 4 finish in essentially the same time as one running alone.
I was hoping parallelization of my programs would be a lot more automatic :(
B,
Since N is passed in, try:
      Subroutine DMMult5M(N,A1,A2,A3,A4,A5,A6,B,C1,C2,C3,C4,C5,
     $                    D1,D2,D3,D4,D5)
      Implicit Real*8 (A-H,O-Z)
      Dimension A1(N),A2(N),A3(N),A4(N),A5(N),A6(N),
     $          C1(N),C2(N),C3(N),C4(N),C5(N),
     $          D1(N),D2(N),D3(N),D4(N),D5(N)
This might help the optimizer when computing the indexes.
Also try:
c$OMP PARALLEL SECTIONS PRIVATE(I)
c$OMP SECTION
      Do I=1,N
         A1(I)=A1(I)+B*C1(I)*D1(I)
      EndDo
c$OMP SECTION
      Do I=1,N
         A2(I)=A2(I)+B*(C1(I)*D2(I)+D1(I)*C2(I))
      EndDo
c$OMP SECTION
      Do I=1,N
         A3(I)=A3(I)+B*(C1(I)*D3(I)+D1(I)*C3(I))
      EndDo
c$OMP SECTION
      Do I=1,N
         A4(I)=A4(I)+B*(C1(I)*D4(I)+D1(I)*C4(I))
      EndDo
c$OMP SECTION
      Do I=1,N
         A5(I)=A5(I)+B*(C2(I)*D2(I)+C3(I)*D3(I)+C4(I)*D4(I))
      EndDo
c$OMP SECTION
      Do I=1,N
         A6(I)=A6(I)+B*(C1(I)*D5(I)+D1(I)*C5(I))
      EndDo
c$OMP END PARALLEL SECTIONS
If the above improves performance, then reorder the sections in decreasing order of estimated run time (do A5 first and A1 last).
Jim Dempsey
      Do I=1,N
         A1(I)=A1(I)+B*C1(I)*D1(I)
         A2(I)=A2(I)+B*(C1(I)*D2(I)+D1(I)*C2(I))
         A3(I)=A3(I)+B*(C1(I)*D3(I)+D1(I)*C3(I))
         A4(I)=A4(I)+B*(C1(I)*D4(I)+D1(I)*C4(I))
         A5(I)=A5(I)+B*(C2(I)*D2(I)+C3(I)*D3(I)+C4(I)*D4(I))
         A6(I)=A6(I)+B*(C1(I)*D5(I)+D1(I)*C5(I))
      EndDo
Many subscripted items appear multiple times within the loop. Speed would be improved by introducing explicit temporaries for these (do not rely on automatic optimization for this), e.g.:
DO i = 1, n
   c_1 = C1(i)
   d_1 = D1(i)
   A1(i) = A1(i) + B*c_1*d_1
   A2(i) = A2(i) + B*(c_1*D2(i) + d_1*C2(i))
   ... etc ...
END DO
Unfortunately, performance did not improve.
I think there is no hope of effectively parallelizing this subroutine.
As I mentioned earlier, if I run 4 identical (non-parallelized) jobs at the same time in separate processes, I don't see any significant slowdown for each compared to just running one.
Perhaps I will try to parallelize at a higher level than inside this subroutine... which will require some restructuring.
>>Many subscripted items appear multiple times within the loop. Speed would be improved by introducing explicit temporaries for these (do not rely on automatic optimization for this)...
Thanks Paul. I've heard arguments for and against this approach. I guess it depends on the compiler, so it's worth a shot. But I would expect that a decent compiler would create a temporary if it were advantageous to...
Hello there.
Actually tim and Paul got it right, but instead of passing all those arrays, pass only the arguments you need. I can't explain it exactly the way they do, but when you make the call with those arrays, part of them is loaded into the L2 cache and then into the L1 cache. Since L2 is filled by the first array, the program must go to L3 to load the second one from there. If you pass only the values needed, a smaller part will be loaded into L2, and some space will be left for the second, third, and so on. Did I make myself clear?
Well, try this:
      Do i = 1,n
         call my_routine(a(i),b(i),c(i),d(i)...)
      end do
      ....
      subroutine my_routine(a,b,c,d,...)
      a = b + c + d
      return
      end
Got it?
Hope it helped!
Roger
From a performance point of view, the biggest probable deficit is the impossibility of knowing the relative alignments of the arrays at compile time. MODULE arrays, or, if you are a traditionalist, COMMON block arrays, would fix that and allow optimization for CPUs prior to Core i7. We've also seen cases, back in the Nocona CPU days, where putting a large number of assumed-size array pointers on the stack for a subroutine call provoked Write Combine buffer thrashing. Even back then, this would be unlikely to happen in 32-bit mode for as few as 20 arguments in the list.
gfortran is incapable of vectorizing cases like this with multiple assignments in a loop; that compiler would do better if the loop were changed to individual array assignments. A great deal of work has gone into ifort and other commercial compilers, so they should not share this limitation, but you should look into your vectorization report to see that the optimization has been carried out, if your expected loop count is in the range where that is desirable. However, if your desire is to show threaded performance gains, without caring about overall performance, that goal is more likely to be achieved by minimizing single threaded performance, including avoidance of vectorization.
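For reference, with ifort on Windows the vectorization report can be requested with the /Qvec-report option, e.g.

ifort /O3 /Qvec-report2 dmmult5m.f

where level 2 also lists loops that did not vectorize (the source file name here is made up).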
The number of store streams shown here is right at the limit of what the Penryn CPUs can handle consistently without fill buffer thrashing; ifort may choose to split ("distribute") the loop automatically. As long as the loop is short enough for the data to remain in L1 cache, splitting the loop and requiring operands to be re-read will not aggravate the overall memory bandwidth limitation.
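To illustrate the kind of split meant here, a hand-distributed version of the loop would divide the six store streams between two loops (a sketch of the idea only, not necessarily the transformation ifort applies):

      Do I=1,N
         A1(I)=A1(I)+B*C1(I)*D1(I)
         A2(I)=A2(I)+B*(C1(I)*D2(I)+D1(I)*C2(I))
         A3(I)=A3(I)+B*(C1(I)*D3(I)+D1(I)*C3(I))
      EndDo
c     C1..C4 and D1..D4 are read again below, which is cheap
c     while the data is still in L1.
      Do I=1,N
         A4(I)=A4(I)+B*(C1(I)*D4(I)+D1(I)*C4(I))
         A5(I)=A5(I)+B*(C2(I)*D2(I)+C3(I)*D3(I)+C4(I)*D4(I))
         A6(I)=A6(I)+B*(C1(I)*D5(I)+D1(I)*C5(I))
      EndDo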
From a debugging point of view, few compilers would be able to check for range violations in this style of coding, and it is unnecessarily vulnerable to typos. This style is one of the reasons why the argument list checking is on by default with ifort in Visual Studio.
Sorry to be so long-winded, but others have, perhaps rightly, brought up issues which may or may not be central to the original question.
>>...instead of passing all those arrays, pass only the arguments you need... If you pass only the values needed, a smaller part will be loaded into L2, and some space will be left for the second, third, and so on...
Roger, thanks, but what you have said is not correct.
>>I have a Fortran program in which most of the time boils down to the following subroutine, which is called tens of thousands of times or more
Maybe you should consider moving the parallel code to the outer level:
"parallel outer, vector inner"
Jim
Yes, that is the lesson here. I have now done that, and while it required some restructuring, I at least got some overall performance gains, though not what I would consider good enough.
Thanks to all.
>>Yes, that is the lesson here. I have now done that, and while it required some restructuring, I at least got some overall performance gains, though not what I would consider good enough.
The execution speed for your loop is driven by the many individual array item references, as well as the overhead associated with setting up the arguments for each call. If possible, consider turning the data structure inside out:
TYPE bigtype
    REAL*8 a1, a2, a3, a4, a5, a6
    REAL*8 b
    REAL*8 c1, c2, c3, c4, c5
    REAL*8 d1, d2, d3, d4, d5
END TYPE bigtype

TYPE(bigtype), DIMENSION(n) :: bt

DO i = 1, n
    CALL DMMult5M (bt(i))
END DO

SUBROUTINE DMMult5M (bt)
TYPE(bigtype), INTENT(INOUT) :: bt
bt%a1 = bt%a1 + bt%b*bt%c1*bt%d1
bt%a2 = bt%a2 + bt%b*(bt%c1*bt%d2 + bt%d1*bt%c2)
...etc...
END SUBROUTINE DMMult5M
>>The execution speed for your loop is driven by the many individual array item references, as well as the overhead associated with setting up the arguments for each call. If possible, consider turning the data structure inside out...
Paul,
I'm dabbling in parallelisation too. Can you explain how your suggestion improves performance? To apply this concept to my project would be a major undertaking, so I would like to understand it first.
Thanks,
David
The OP's loop code looks awkward to me: too many array references, plus a large number of array arguments. Leaving aside what might be expected of a good optimizing compiler as far as dealing with loop invariants (I favor handling them explicitly), the stack work for each call is large. Using a derived type reduces the long argument list to a single item (i.e., a single address on the stack). I don't know whether internal reference calculations for type%component are more efficient than finding array(element); I also don't know whether this sort of structure favors parallelisation, but I suspect that execution would be faster than the original (and I infer that speed is the real issue here, and parallelisation only one possible route to improvement). I have done this sort of thing a lot in my own codes (without the burden of having to restructure huge amounts of pre-existing code), and it certainly makes for clean code that also executes well. If you can construct a test case without too much work, comparative results would be interesting.
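For concreteness, a minimal compilable arrangement of the derived-type sketch above, assuming the type lives in a module so that caller and callee agree on it (the module name is made up; REAL*8 matches the precision used elsewhere in this thread):

MODULE bigtype_mod
  TYPE bigtype
    REAL*8 a1, a2, a3, a4, a5, a6
    REAL*8 b
    REAL*8 c1, c2, c3, c4, c5
    REAL*8 d1, d2, d3, d4, d5
  END TYPE bigtype
END MODULE bigtype_mod

SUBROUTINE DMMult5M (bt)
  USE bigtype_mod    ! the caller needs USE bigtype_mod as well
  TYPE(bigtype), INTENT(INOUT) :: bt
  ! A single address on the stack replaces the original long argument list
  bt%a1 = bt%a1 + bt%b*bt%c1*bt%d1
  bt%a2 = bt%a2 + bt%b*(bt%c1*bt%d2 + bt%d1*bt%c2)
  bt%a3 = bt%a3 + bt%b*(bt%c1*bt%d3 + bt%d1*bt%c3)
  bt%a4 = bt%a4 + bt%b*(bt%c1*bt%d4 + bt%d1*bt%c4)
  bt%a5 = bt%a5 + bt%b*(bt%c2*bt%d2 + bt%c3*bt%d3 + bt%c4*bt%d4)
  bt%a6 = bt%a6 + bt%b*(bt%c1*bt%d5 + bt%d1*bt%c5)
END SUBROUTINE DMMult5M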
>>Roger, thanks, but what you have said is not correct.
BBoston,
Actually, what I was trying to say is that you should improve your code before you parallelize it. You may get better results even on a single core. If you use a tool like VTune, you will notice that your processor is almost stalled because it is waiting for data. Once your code is improved you will see the speedup, and the new code may be effectively parallelized. If you don't improve it, then instead of one core waiting for memory you will have several cores waiting... That's why you didn't get any improvement when you parallelized your code.
Notice that what I wrote in my post is almost the same thing that Paul Curtis wrote one day later. But he used the right phrase: "the stack work for each call is large". The only difference is that he is using structures. Whether that makes any difference is a matter of testing and then profiling.
And if you don't trust me, read D. Patterson and J. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann, 3rd rev. ed., 2006. You will see the same kind of optimization there.
Regards,
Roger
