I am using Intel Fortran for Windows, 11.1.046.
I have a Fortran program in which most of the run time is spent in the following subroutine, which is called tens of thousands of times or more:
      Subroutine DMMult5M(N,A1,A2,A3,A4,A5,A6,B,C1,C2,C3,C4,C5,
     $                    D1,D2,D3,D4,D5)
      Implicit Real*8 (A-H,O-Z)
      Dimension A1(*),A2(*),A3(*),A4(*),A5(*),A6(*),
     $          C1(*),C2(*),C3(*),C4(*),C5(*),
     $          D1(*),D2(*),D3(*),D4(*),D5(*)
      Do I=1,N
         A1(I)=A1(I)+B*C1(I)*D1(I)
         A2(I)=A2(I)+B*(C1(I)*D2(I)+D1(I)*C2(I))
         A3(I)=A3(I)+B*(C1(I)*D3(I)+D1(I)*C3(I))
         A4(I)=A4(I)+B*(C1(I)*D4(I)+D1(I)*C4(I))
         A5(I)=A5(I)+B*(C2(I)*D2(I)+C3(I)*D3(I)+C4(I)*D4(I))
         A6(I)=A6(I)+B*(C1(I)*D5(I)+D1(I)*C5(I))
      EndDo
      Return
      End
Here N typically ranges from 1 to a few hundred.
I am using the /O3 optimization flag.
I've tried to parallelize the loop in this subroutine with OpenMP, like this:
c$OMP PARALLEL DO
      Do I=1,N
         A1(I)=A1(I)+B*C1(I)*D1(I)
         A2(I)=A2(I)+B*(C1(I)*D2(I)+D1(I)*C2(I))
         A3(I)=A3(I)+B*(C1(I)*D3(I)+D1(I)*C3(I))
         A4(I)=A4(I)+B*(C1(I)*D4(I)+D1(I)*C4(I))
         A5(I)=A5(I)+B*(C2(I)*D2(I)+C3(I)*D3(I)+C4(I)*D4(I))
         A6(I)=A6(I)+B*(C1(I)*D5(I)+D1(I)*C5(I))
      EndDo
c$OMP END PARALLEL DO
On an Intel Q9300 @ 2.5 GHz, all four cores are being used, but I always get wall-clock timings that are about a factor of two slower than with no parallelization.
Why is that?
Is there any hope of taking advantage of multiple processors for this subroutine?
Thanks very much for any advice.
BBoston
>>I have a Fortran program in which most of the time boils down to the following subroutine, which is called tens of thousands of times or more
Maybe you should consider moving the parallel code to the outer level:
"parallel outer, vector inner"
Jim
Thanks tim18.
Yes, vectorization is on. Prefetch insertion doesn't make any difference.
I also made sure to check N and only do OMP when N > 30.
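For reference, OpenMP's IF clause can express that cutoff directly on the directive; a minimal sketch (30 being the threshold just mentioned):

c$OMP PARALLEL DO IF(N .GT. 30)
      Do I=1,N
c        ... same loop body as above ...
      EndDo
c$OMP END PARALLEL DO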
I messed around with OMP_NUM_THREADS and KMP_AFFINITY a bit and found that the best parallel timings came with KMP_AFFINITY=granularity=fine,compact and OMP_NUM_THREADS=2.
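(For anyone reproducing this: these are ordinary environment variables, set on Windows before the run in the console that launches the program, e.g.

set KMP_AFFINITY=granularity=fine,compact
set OMP_NUM_THREADS=2

)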
But even then the timings are still worse than with a single processor (about a 10% increase in run time).
The interesting thing is that if I run 4 separate instances of my program, i.e., 4 separate processes, at the same time, then all 4 finish in essentially the same time as one running alone.
I was hoping parallelization of my programs would be a lot more automatic :(
B,
Since N is passed in, try:
      Subroutine DMMult5M(N,A1,A2,A3,A4,A5,A6,B,C1,C2,C3,C4,C5,
     $                    D1,D2,D3,D4,D5)
      Implicit Real*8 (A-H,O-Z)
      Dimension A1(N),A2(N),A3(N),A4(N),A5(N),A6(N),
     $          C1(N),C2(N),C3(N),C4(N),C5(N),
     $          D1(N),D2(N),D3(N),D4(N),D5(N)
This might help the optimizer when computing the indexes.
Also try:
c$OMP PARALLEL SECTIONS PRIVATE(I)
c$OMP SECTION
      Do I=1,N
         A1(I)=A1(I)+B*C1(I)*D1(I)
      EndDo
c$OMP SECTION
      Do I=1,N
         A2(I)=A2(I)+B*(C1(I)*D2(I)+D1(I)*C2(I))
      EndDo
c$OMP SECTION
      Do I=1,N
         A3(I)=A3(I)+B*(C1(I)*D3(I)+D1(I)*C3(I))
      EndDo
c$OMP SECTION
      Do I=1,N
         A4(I)=A4(I)+B*(C1(I)*D4(I)+D1(I)*C4(I))
      EndDo
c$OMP SECTION
      Do I=1,N
         A5(I)=A5(I)+B*(C2(I)*D2(I)+C3(I)*D3(I)+C4(I)*D4(I))
      EndDo
c$OMP SECTION
      Do I=1,N
         A6(I)=A6(I)+B*(C1(I)*D5(I)+D1(I)*C5(I))
      EndDo
c$OMP END PARALLEL SECTIONS
If the above improves performance, then reorder the sections in decreasing order of estimated run time (do A5 first and A1 last).
Jim Dempsey
      Do I=1,N
         A1(I)=A1(I)+B*C1(I)*D1(I)
         A2(I)=A2(I)+B*(C1(I)*D2(I)+D1(I)*C2(I))
         A3(I)=A3(I)+B*(C1(I)*D3(I)+D1(I)*C3(I))
         A4(I)=A4(I)+B*(C1(I)*D4(I)+D1(I)*C4(I))
         A5(I)=A5(I)+B*(C2(I)*D2(I)+C3(I)*D3(I)+C4(I)*D4(I))
         A6(I)=A6(I)+B*(C1(I)*D5(I)+D1(I)*C5(I))
      EndDo
Many subscripted items appear multiple times within the loop. Speed would be improved by introducing explicit temporaries for these (do not rely on automatic optimization for this), e.g.:
DO i = 1, n
   c_1 = C1(i)
   d_1 = D1(i)
   A1(i) = A1(i) + B*c_1*d_1
   A2(i) = A2(i) + B*(c_1*D2(i) + d_1*C2(i))
   ... etc ...
END DO
Unfortunately, performance did not improve.
I think there is no hope of effectively parallelizing this subroutine.
As I mentioned earlier, if I run 4 identical (non-parallelized) jobs at the same time in separate processes, I don't see any significant slowdown for each compared to just running one.
Perhaps I will try to parallelize at a higher level than inside this subroutine... which will require some restructuring.
>>Many subscripted items appear multiple times within the loop. Speed would be improved by introducing explicit temporaries for these (do not rely on automatic optimization for this)...
Thanks Paul. I've heard arguments for and against this approach. I guess it depends on the compiler, so it's worth a shot. But I would expect that a decent compiler would create a temporary if it were advantageous to...
Hello there.
Actually tim and Paul got it right, but instead of passing all those arrays, pass only the arguments you need. I can't explain it exactly the way they do, but when you make the call with those arrays, part of them is loaded into the L2 cache and then into the L1 cache. Since L2 is filled by the first array, the program must go to L3 to load the second one from there. If you pass only the values needed, a smaller part will be loaded into L2, and some space will be left for the second, third, and so on. Did I make myself clear?
Well, try this:
      Do i = 1,n
         call my_routine(a(i),b(i),c(i),d(i)...)
      end do
      ....
      subroutine my_routine(a,b,c,d,...)
      a = b + c + d
      return
      end
Got it?
Hope it helped!
Roger
From a performance point of view, the biggest probable deficit is the impossibility of knowing the relative alignments of the arrays at compile time. MODULE arrays, or, if you are a traditionalist, COMMON block arrays, would fix that and allow optimization for CPUs prior to Core i7. We've also seen cases, back in the Nocona CPU days, where putting a large number of assumed-size array pointers on the stack for a subroutine call provoked Write Combine buffer thrashing. Even back then, this would be unlikely to happen in 32-bit mode for as few as 20 arguments in the list.
gfortran is incapable of vectorizing cases like this with multiple assignments in a loop; that compiler would do better if the loop were changed to individual array assignments. A great deal of work has gone into ifort and other commercial compilers, so they should not share this limitation, but you should look into your vectorization report to see that the optimization has been carried out, if your expected loop count is in the range where that is desirable. However, if your desire is to show threaded performance gains, without caring about overall performance, that goal is more likely to be achieved by minimizing single threaded performance, including avoidance of vectorization.
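For reference, with ifort on Windows the vectorization report can be requested with the /Qvec-report option, e.g.

ifort /O3 /Qvec-report2 dmmult5m.f

where level 2 also lists loops that did not vectorize (the source file name here is made up).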
The number of store streams shown here is right at the limit of what the Penryn CPUs can handle consistently without fill buffer thrashing; ifort may choose to split ("distribute") the loop automatically. As long as the loop is short enough for the data to remain in L1 cache, splitting the loop and requiring operands to be re-read will not aggravate the overall memory bandwidth limitation.
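To illustrate the kind of split meant here, a hand-distributed version of the loop would divide the six store streams between two loops (a sketch of the idea only, not necessarily the transformation ifort applies):

      Do I=1,N
         A1(I)=A1(I)+B*C1(I)*D1(I)
         A2(I)=A2(I)+B*(C1(I)*D2(I)+D1(I)*C2(I))
         A3(I)=A3(I)+B*(C1(I)*D3(I)+D1(I)*C3(I))
      EndDo
c     C1..C4 and D1..D4 are read again below, which is cheap
c     while the data is still in L1.
      Do I=1,N
         A4(I)=A4(I)+B*(C1(I)*D4(I)+D1(I)*C4(I))
         A5(I)=A5(I)+B*(C2(I)*D2(I)+C3(I)*D3(I)+C4(I)*D4(I))
         A6(I)=A6(I)+B*(C1(I)*D5(I)+D1(I)*C5(I))
      EndDo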
From a debugging point of view, few compilers would be able to check for range violations in this style of coding, and it is unnecessarily vulnerable to typos. This style is one of the reasons why the argument list checking is on by default with ifort in Visual Studio.
Sorry to be so long-winded, but others have, perhaps rightly, brought up issues which may or may not be central to the original question.
>>...instead of passing all those arrays, pass only the arguments you need... If you pass only the values needed, a smaller part will be loaded into L2, and some space will be left for the second, third, and so on...
Roger, thanks, but what you have said is not correct.
>>I have a Fortran program in which most of the time boils down to the following subroutine, which is called tens of thousands of times or more
Maybe you should consider moving the parallel code to the outer level:
"parallel outer, vector inner"
Jim
Yes, that is the lesson here. I have now done that, and while it required some restructuring, I at least got some overall performance gains, though not what I would consider good enough.
Thanks to all.
>>Yes, that is the lesson here. I have now done that, and while it required some restructuring, I at least got some overall performance gains, though not what I would consider good enough.
The execution speed for your loop is driven by the many individual array item references, as well as the overhead associated with setting up the arguments for each call. If possible, consider turning the data structure inside out:
TYPE bigtype
    REAL*8 a1, a2, a3, a4, a5, a6
    REAL*8 b
    REAL*8 c1, c2, c3, c4, c5
    REAL*8 d1, d2, d3, d4, d5
END TYPE bigtype

TYPE(bigtype), DIMENSION(n) :: bt

DO i = 1, n
    CALL DMMult5M (bt(i))
END DO

SUBROUTINE DMMult5M (bt)
TYPE(bigtype), INTENT(INOUT) :: bt
bt%a1 = bt%a1 + bt%b*bt%c1*bt%d1
bt%a2 = bt%a2 + bt%b*(bt%c1*bt%d2 + bt%d1*bt%c2)
...etc...
END SUBROUTINE DMMult5M
>>The execution speed for your loop is driven by the many individual array item references, as well as the overhead associated with setting up the arguments for each call. If possible, consider turning the data structure inside out...
Paul,
I'm dabbling in parallelisation too. Can you explain how your suggestion improves performance? To apply this concept to my project would be a major undertaking, so I would like to understand it first.
Thanks,
David
The OP's loop code looks awkward to me: too many array references, plus a large number of array arguments. Leaving aside what might be expected of a good optimizing compiler as far as dealing with loop invariants (I favor handling them explicitly), the stack work for each call is large. Using a derived type reduces the long argument list to a single item (i.e., a single address on the stack). I don't know whether internal reference calculations for type%component are more efficient than finding array(element); I also don't know whether this sort of structure favors parallelisation, but I suspect that execution would be faster than the original (and I infer that speed is the real issue here, and parallelisation only one possible route to improvement). I have done this sort of thing a lot in my own codes (without the burden of having to restructure huge amounts of pre-existing code), and it certainly makes for clean code that also executes well. If you can construct a test case without too much work, comparative results would be interesting.
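For concreteness, a minimal compilable arrangement of the derived-type sketch above, assuming the type lives in a module so that caller and callee agree on it (the module name is made up; REAL*8 matches the precision used elsewhere in this thread):

MODULE bigtype_mod
  TYPE bigtype
    REAL*8 a1, a2, a3, a4, a5, a6
    REAL*8 b
    REAL*8 c1, c2, c3, c4, c5
    REAL*8 d1, d2, d3, d4, d5
  END TYPE bigtype
END MODULE bigtype_mod

SUBROUTINE DMMult5M (bt)
  USE bigtype_mod    ! the caller needs USE bigtype_mod as well
  TYPE(bigtype), INTENT(INOUT) :: bt
  ! A single address on the stack replaces the original long argument list
  bt%a1 = bt%a1 + bt%b*bt%c1*bt%d1
  bt%a2 = bt%a2 + bt%b*(bt%c1*bt%d2 + bt%d1*bt%c2)
  bt%a3 = bt%a3 + bt%b*(bt%c1*bt%d3 + bt%d1*bt%c3)
  bt%a4 = bt%a4 + bt%b*(bt%c1*bt%d4 + bt%d1*bt%c4)
  bt%a5 = bt%a5 + bt%b*(bt%c2*bt%d2 + bt%c3*bt%d3 + bt%c4*bt%d4)
  bt%a6 = bt%a6 + bt%b*(bt%c1*bt%d5 + bt%d1*bt%c5)
END SUBROUTINE DMMult5M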
>>Roger, thanks, but what you have said is not correct.
BBoston,
Actually, what I was trying to say is that you should improve your code before you parallelize it. You may get better results even on a single core. If you use a tool like VTune, you will notice that your processor is almost stalled because it is waiting for data. Once your code is improved you will see the speedup, and the new code may be effectively parallelized. If you don't improve it, then instead of one core waiting for memory you will have several cores waiting... That's why you didn't get any improvement when you parallelized your code.
Notice that what I wrote in my post is almost the same thing that Paul Curtis wrote one day later. But he used the right phrase: "the stack work for each call is large". The only difference is that he is using structures. Whether that makes any difference is a matter of testing and then profiling.
And if you don't trust me, read D. Patterson and J. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann, 3rd rev. ed., 2006. You will see the same kind of optimization there.
Regards,
Roger
