Using /Qparallel with no effects

mfangmeyer · ‎02-03-2011

Hi!

I've set the compiler option Parallelization to "/Qparallel", but it has no effect. This is the code I tried. It is a DLL called by a .NET application:

[fortran]subroutine LRZERLEGUNG(AMatrix, DimensionN, AbsolutB, ResultX)
!DEC$ ATTRIBUTES DLLEXPORT::LRZERLEGUNG

  implicit none
  ! Variables
  real :: T1, T2

  integer, intent(in) :: DimensionN
  double precision, dimension(DimensionN,DimensionN) :: AMatrix
  double precision, dimension(DimensionN) :: AbsolutB
  double precision, intent(out), dimension(DimensionN) :: ResultX
  double precision, dimension(DimensionN)::y

  integer :: i,j,k

  ! Body of lrZerlegung
  CALL CPU_TIME(T1)
  ! Compute L (Ax = LRx = b)
  !dir$ parallel
  do i = 1,DimensionN-1
     do j = i+1,DimensionN
       ! Determine the i-th column of L
       AMatrix(j,i) = AMatrix(j,i)/AMatrix(i,i)
       do k = i+1,DimensionN
         ! Update the j-th row
         AMatrix(j,k) = AMatrix(j,k) - AMatrix(j,i)*AMatrix(i,k)
       end do
     end do
  end do

  ! Forward substitution
  !dir$ parallel
  !dir$ loop count min(4)
  do j = 1,DimensionN
    y(j) = AbsolutB(j)
    do k = 1,j-1
      y(j) = y(j) - AMatrix(j,k)*y(k)
    end do
  end do

  ! Back substitution
  !dir$ parallel
  !dir$ loop count min(4)
  do j = DimensionN,1,-1
    ResultX(j) = y(j)
    do k = j+1,DimensionN
      ResultX(j) = ResultX(j) - AMatrix(j,k)*ResultX(k)
    end do
    ResultX(j) = ResultX(j)/AMatrix(j,j)
  end do
  CALL CPU_TIME(T2)
  write( *, * ) T2-T1
end subroutine LRZERLEGUNG
[/fortran]

Best regards, Marc

psantos · ‎02-03-2011

Hello mfangmeyer,

In the example you gave us there is no value for the loop count, since we don't know the value of "DimensionN". If this value is too small, say fewer than 100 iterations (just as an example), you will not get any speed-up, since there is always some threading overhead. There is also a compiler switch, /Qpar-threshold[:n], where n (0 to 100) sets how confident the compiler must be that parallelizing a loop is profitable before it does so; a lower value makes it parallelize more readily. So, if you are seeing no benefit, perhaps you need to increase the number of loop iterations.
This advice is not related to this particular question, but looking at your code I noticed that you use "double precision" to declare some variables. It should be avoided since it is compiler dependent. Therefore, you should always use real(KIND=8) or just real(8) instead, assuming that your compiler takes single precision as real(4) (which is almost certain). I hope this helps.

Pedro

mfangmeyer · ‎02-03-2011

Hello Pedro,

The loop count is at least 1000. Task Manager shows that CPU usage for the process never exceeds 50%. I have a dual-core CPU.

OK, in future I will use real(8) instead of the double precision declaration.

Marc

psantos · ‎02-03-2011

mfangmeyer,

I have looked more carefully at your code and found some places where there is a data dependency in your loops, which prevents auto-parallelization. The compiler can detect these situations automatically. Under Visual Studio open your project properties, then FORTRAN>Diagnostics>Optimization Diagnostics>Optimization Diagnostics Level. Then select medium (should be enough). Recompile your code and you will see the diagnostic messages. Hope this helps.

EDIT: I forgot to say that you could also use Guided Auto-Parallelism, which emits advice. To use it, right-click your source file, then Intel Visual Fortran Composer XE 2011>Guided-Auto-Parallelism>Run Analysis. Give it a try. This is a very useful feature.
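
The kind of data dependency mentioned above can be illustrated with a minimal sketch. The program and array names are hypothetical and not taken from the original code. The first loop has independent iterations and is a candidate for auto-parallelization; the second carries a value from one iteration to the next, much as y(j) depends on earlier y(k) in the forward substitution, so its iterations cannot simply be split across threads.

[fortran]program dependence_demo
  ! Hypothetical example: names and sizes are assumptions for illustration
  implicit none
  integer, parameter :: n = 100000
  double precision :: a(n), b(n), c(n)
  integer :: i

  b = 1.0d0
  c = 2.0d0

  ! Independent iterations: a(i) depends only on b(i) and c(i),
  ! so the iterations may run in any order or on different threads.
  do i = 1, n
    a(i) = b(i) + c(i)
  end do

  ! Loop-carried dependence: a(i) needs a(i-1) from the previous
  ! iteration, so the iterations must run in order.
  do i = 2, n
    a(i) = a(i-1) + b(i)
  end do

  print *, a(n)
end program dependence_demo
[/fortran]

Whether the compiler actually parallelizes the first loop still depends on its own estimate of whether the work is large enough to pay for the threading overhead.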

Pedro

IanH · ‎02-03-2011

Quoting psantos

...
This advice is not related to this particular question, but looking at your code I noticed that you use "double precision" to declare some variables. It should be avoided since it is compiler dependent. Therefore, you should always use real(KIND=8) or just real(8) instead, assuming that your compiler takes single precision as real(4) (which is almost certain).

It is really the other way round: using a number (a literal integer constant) for a kind is more compiler dependent than using DOUBLE PRECISION. There's no guarantee that a kind of 8 even exists on other compilers, let alone that 8 means double precision (though for a number of compilers, perhaps the majority, that is exactly what a real kind of 8 means). But all Fortran compilers must support DOUBLE PRECISION, and it must be more precise than default REAL (as of F2008 there are other requirements on it which mean that it will be at least as good as what everyone typically regards "double precision" to be).

If you wanted to save some typing, you could define an integer parameter that represents the kind of real that you typically want to use - something like:

[fortran]MODULE MyKinds
  ...
  INTEGER, PARAMETER, PUBLIC :: rk = KIND(1.0D0)
  ...
END MODULE MyKinds



USE MyKinds
...
REAL(rk), dimension(DimensionN,DimensionN) :: AMatrix  
REAL(rk), dimension(DimensionN) :: AbsolutB  
REAL(rk), intent(out), dimension(DimensionN) :: ResultX  
REAL(rk), dimension(DimensionN) :: y  
[/fortran]

This is portable to different compilers and, if at some stage in the future you want to change the kind in use, you only have to edit the expression in the parameter declaration in one place.

(Edit to change the parameter name from dp to rk to avoid confusion if the expression that defines the parameter did get changed in future).

psantos · ‎02-03-2011

Hello IanH,

I agree with you about putting the kind value in a separate module.
When we use the precision keywords we don't really know what precision is being used, since there are compiler switches that change the behaviour of these keywords. So perhaps the best approach is to use the intrinsic SELECTED_REAL_KIND(), which returns the kind number needed to achieve a specified precision and range. With this, the compiler automatically selects the kind based on the programmer's request, which ensures portability. So I would change your code to:

[fortran]MODULE MyKinds
  ...
  INTEGER, PARAMETER, PUBLIC :: rk = SELECTED_REAL_KIND(p=13)
  ...
END MODULE MyKinds


USE MyKinds
...
REAL(rk), dimension(DimensionN,DimensionN) :: AMatrix
REAL(rk), dimension(DimensionN) :: AbsolutB
REAL(rk), intent(out), dimension(DimensionN) :: ResultX
REAL(rk), dimension(DimensionN) :: y
[/fortran]

Note that I have chosen 13 decimal digits for precision, which will return a KIND=8 under Intel Fortran.

Pedro

mfangmeyer · ‎02-04-2011

"I have looked more carefully to your code, and I have detected some situation where there is data dependency on your loops, which prevents auto-parallelization to occur. The compiler can automatically detect these situations. Under Visual Studio open your project properties, then FORTRAN>Diagnostics>Optimization Diagnostics>Optimization Diagnostics Level. Then select medium (should be enough). Recompile your code and you will see the diagnostics messages. Hope this helps you."

Indeed, there is not enough algorithmic independence in my code. By the way, it is an LU decomposition for solving linear equations.
So I tried a matrix multiplication with matmul(). Same behaviour as before: the CPU usage is at most 50%. Yet matrix multiplication is ideally suited to parallelization!

Additionally, I've set the compiler option "Use Intel Math Kernel Library" to Parallel (/Qmkl:parallel).

Can anybody give me a suitable code example?

psantos · ‎02-04-2011

Hello mfangmeyer,

I didn't understand what you meant by "Additionally, I've set the compiler option "Use Intel Math Kernel Library" to Parallel (/Qmkl:parallel)." If you are not calling MKL, why include it? It will not make any difference.

If this is an LU decomposition, perhaps you should consider using "getrf" from MKL. See the MKL library manual for more details. The MKL routines are highly optimized and will certainly give you better performance.
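
As a rough sketch of that suggestion, the example below factorizes and solves a small system with the standard LAPACK routines dgetrf and dgetrs, which MKL provides. The program name and the 3x3 test data are made up for illustration, and it assumes the program is linked against MKL (for instance with the /Qmkl option mentioned above). Note that dgetrf uses partial pivoting, unlike the hand-written loops in the original post.

[fortran]program lu_with_lapack
  ! Hypothetical example data; links against MKL/LAPACK
  implicit none
  integer, parameter :: n = 3
  double precision :: a(n,n), b(n)
  integer :: ipiv(n), info

  ! Small test system, stored column by column
  a = reshape([ 4.0d0, 2.0d0, 1.0d0,  &
                2.0d0, 5.0d0, 3.0d0,  &
                1.0d0, 3.0d0, 6.0d0 ], [n, n])
  b = [ 7.0d0, 10.0d0, 10.0d0 ]

  ! LU factorization with partial pivoting; a is overwritten by the factors
  call dgetrf(n, n, a, n, ipiv, info)
  if (info /= 0) then
    print *, 'dgetrf failed, info =', info
    stop
  end if

  ! Solve A*x = b using the factors; b is overwritten with the solution x
  call dgetrs('N', n, 1, a, n, ipiv, b, n, info)
  print *, 'info =', info
  print *, 'x    =', b
end program lu_with_lapack
[/fortran]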

Pedro

mfangmeyer · ‎02-07-2011

OK, I just want to test (auto) parallelization. Can anybody give me a code example to learn more about it? My examples seem to be unsuitable.

John_Campbell · ‎02-07-2011

I'm with you on this. I tried to get similar help 2 weeks ago and was not able to get a simple example of using /Qparallel. I was hoping to learn from this and better understand how to use this feature.

What we need is a simple case study of code and compiler options to demonstrate parallelization working. A few worked examples of different approaches to the same code could help us understand how it best works.

The code example I tried to use was the dot product loop, which is the inner loop of a LU Crout decomposition.

Steve Lionel suggested in an earlier post: You can also try /Qparallel and look at the new "Guided Auto Parallelization" feature to help you get the most out of it.
Unfortunately I could not find this feature (?)

I should add that I was given a lot of very good advice, but for a new user of Intel Visual Fortran and /Qparallel, a simple introductory case study would be more helpful to get us started.

John

bendel_boy1 · ‎02-07-2011

Of course, REAL (KIND = 8) is also compiler-dependent. DOUBLE PRECISION is easier to work out the intent.

You really want REAL (SELECTED_REAL_KIND(15, 307)), or, better, INTEGER, PARAMETER:: dp = SELECTED_REAL_KIND(15, 307); REAL (dp):: MyValue,

remembering that this will fail should the requirements not be possible.
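
A small sketch of how these kind choices can be inspected is given below. The program and parameter names are hypothetical. It prints the kind values together with the PRECISION and RANGE they provide, so that a requested (p, r) pair can be compared with what double precision actually offers, and it shows that a request no kind can satisfy yields a negative value rather than a usable kind.

[fortran]program kind_demo
  ! Hypothetical names; output values are compiler dependent
  implicit none
  integer, parameter :: dk = kind(1.0d0)                  ! kind of DOUBLE PRECISION
  integer, parameter :: sk = selected_real_kind(15, 307)  ! kind chosen by requirement
  real(sk) :: x

  x = 1.0_sk / 3.0_sk
  print *, 'kind(1.0d0)                 =', dk
  print *, 'selected_real_kind(15, 307) =', sk
  print *, 'precision, range of sk      =', precision(x), range(x)
  print *, 'precision, range of double  =', precision(1.0d0), range(1.0d0)

  ! A request that cannot be met returns a negative number;
  ! using that number as a kind would be rejected at compile time.
  print *, 'selected_real_kind(100)     =', selected_real_kind(100)
  print *, 'x =', x
end program kind_demo
[/fortran]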

mfangmeyer · ‎02-07-2011

Hi John Campbell! Good to know you "on my side"... :-)

psantos · ‎02-07-2011

Hello bendel boy,

when I used SELECTED_REAL_KIND I only specified the number of decimal digits I want. It is perfectly valid, since both arguments are optional (but you have to specify at least one). So, I really want what I have written and nothing more. Just for reference, in the F2008 standard, a new argument was introduced: the radix.

Pedro

anthonyrichards · ‎02-07-2011

If you want an example of what the compiler should do when left to itself: Running on a Core 2-duo workstation, this simple Dot-product routine, compiled using IVF 11.1.067 without /Qparallel and with no OpenMP directives, runs as a console program in Release configuration at approx 60% on both cores, according to Task Manager, and takes about 8.3 seconds to do 1,000,000 iterations of it.

With /Qparallel selected, it runs at 100% on both cores and takes on average about 5 seconds. 100*(5/8.3) is roughly 60%, so it is consistent.

Note that to use the more accurate timer function OMP_GET_WTIME(), you must have /Qopenmp selected (even though there are no OpenMP directives) in order for the library containing the function to be linked in.

[fortran]program timedotproduct
  implicit none
  integer, parameter :: N = 10000
  real(8) :: A(N), B(N), Y
  real(8) :: T1, T2, T3, OMP_GET_WTIME
  real(4) :: TDOTPROD, TGENERATELOOP
  integer(4) :: I, J, JMAX, K1

  ! Fill the arrays and time the generation loop
  T1 = OMP_GET_WTIME()
  do I = 1, N
    A(I) = dble(I)
    B(I) = 2.0D0*dble(I)
  end do
  T2 = OMP_GET_WTIME()
  TGENERATELOOP = T2 - T1
  print *, "TGENERATELOOP =", TGENERATELOOP

  ! DOTPROD returns the dot product of arrays A and B up
  ! to the Kth element. The dot product is returned in Y.
  ! The total should be 2N(N+1)(2N+1)/6
  K1 = N
  JMAX = 1000000
  T2 = OMP_GET_WTIME()   ! restart the clock so the PRINT above is not timed
  do J = 1, JMAX
    call DOTPROD(A, B, N, K1, Y)
  end do
  T3 = OMP_GET_WTIME()
  TDOTPROD = T3 - T2
  print *, "JMAX= ", JMAX, ", dotprod = ", Y, ", TDOTPROD =", TDOTPROD
  pause   ! keeps the console open; PAUSE was deleted in Fortran 95 but ifort accepts it
end program timedotproduct

subroutine DOTPROD(A, B, N, K, SUM)
  ! Simplest dot-product code to compute the dot product up to the Kth element
  implicit none
  integer(4) :: N, K
  real(8) :: A(N), B(N), SUM
  integer(4) :: I
  SUM = 0.0D+0
  do I = 1, K
    SUM = SUM + A(I)*B(I)
  end do
end subroutine DOTPROD
[/fortran]

timintel · ‎02-07-2011

Quoting John Campbell

Steve Lionel suggested in an earlier post: You can also try /Qparallel and look at the new "Guided Auto Parallelization" feature to help you get the most out of it.
Unfortunately I could not find this feature (?)

I think we've spent a lot of time here giving advice which has been ignored.

The GAP options are written up in the HTML docs. The command-line spelling was changed to "guide" some time after the "gap" terminology became widespread, as there are both auto-parallel (-guide-par) and auto-vector (-guide-vec) options, in case you want the categories separated. At compile time they write suggestions, e.g. about directives. It is worthwhile to put in loop count directives before generating GAP advice if loop counts differ significantly from the default assumptions (although GAP may advise you to do that if you haven't).
GAP is heavy on advice to use IVDEP directives even when there are superior alternatives.

mfangmeyer · ‎02-07-2011

OK, that's it! It works fine. But, as I must say, it is a trivial example. Such loops are easy to parallelize. Just take half of the loop count or N/c where c= Number of threads/cores.

What about nested loops? What happens when there are (weak) data dependencies? Does this automatic parallelization do its job only for such simple cases?

Furthermore, I want to set the maximum number of threads with "export OMP_NUM_THREADS=value". I don't know where to set it. In my code?

What is the difference between /Qopenmp and /Qparallel?

Many questions... Thanks for help!

Steven_L_Intel1 · ‎02-07-2011

/Qparallel is the "auto-parallel" option. The compiler looks at loops and decides for itself whether it can parallelize a loop. /Qopenmp says that you will be using OpenMP to parallelize your program - this requires you to add OpenMP directives naming specific loops to parallelize and providing information about variables. You can get better results with OpenMP, but it is more work on your part.

/Qparallel with the new guided-auto-parallelism (GAP) feature helps you get better results out of auto-parallelism without requiring the more extensive changes of OpenMP.
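
To make the contrast concrete, here is a minimal sketch of the OpenMP route for the dot product discussed earlier. The program name and array size are assumptions for illustration. It needs to be compiled with /Qopenmp; the reduction clause is the extra information about the variable s that the programmer has to supply, and which the auto-parallelizer would otherwise have to work out for itself.

[fortran]program omp_dot
  ! Hypothetical example; compile with /Qopenmp
  use omp_lib
  implicit none
  integer, parameter :: n = 1000000
  double precision, allocatable :: a(:), b(:)
  double precision :: s
  integer :: i

  allocate(a(n), b(n))
  do i = 1, n
    a(i) = dble(i)
    b(i) = 2.0d0
  end do

  ! Each thread accumulates a private partial sum;
  ! the partial sums are combined into s at the end of the loop.
  s = 0.0d0
  !$omp parallel do reduction(+:s)
  do i = 1, n
    s = s + a(i)*b(i)
  end do
  !$omp end parallel do

  print *, 'max threads =', omp_get_max_threads()
  print *, 'dot product =', s
end program omp_dot
[/fortran]

OMP_NUM_THREADS is an environment variable, so it is set outside the code before the program starts: "export" is the Unix shell form, and in a Windows command prompt the equivalent is "set OMP_NUM_THREADS=2". Within an OpenMP program the thread count can also be requested with a call to omp_set_num_threads before the parallel region.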

John_Campbell · ‎02-07-2011

To respond to timintel's comment, I have not ignored the advice, but as a new user of the Intel compiler I have found some of the advice difficult to understand.
What are the HTML docs? A file name would help.

Thank you anthonyrichards for your example.
I have used it to produce a modified program which does achieve parallel performance.
I have grouped the main loop in a routine "test_loop" and put the performance reporting in "report_time".
I have replaced the elapsed-time routine with System_Clock and also added a processor time via cpu_time. These avoid the use of /Qopenmp, which creates potential confusion about which kind of parallelization is being used.

My compilation command is : ifort test_dot_yes.f90 /Qparallel /Qpar-report

There are 3 calls to Test_Loop, all of which report "LOOP WAS AUTO-PARALLELIZED."
Is this report for the subroutine call and not for the do loop?
If I put the test_loop call into a do loop, then test_loop is no longer auto-parallelized. It's a fickle option!
Importantly, why has it stopped? I anticipated this would not be a big change to the program structure.
What are the criteria for Test_Loop to be auto-parallelized?
I'm a bit worried by this, as the call to DOTPROD returns the same value JMAX times. I'm not sure what /Qparallel is achieving.
In the case of my LU decomposition, where the J loop was changed to give a different value on each pass and the values depend on the previous J iteration, would the loop still have been auto-parallelized? Perhaps accumulating the Y error in the JMAX loop would be a more effective test.

If anyone wants to repeat these tests, I have attached 3 files.

test_dot_yes.f90 which does perform parallelization
test_dot_no.f90 which does not perform parallelization
test_dot.log which records the run time of the two alternatives.

It is my aim to better understand what can be achieved with /Qparallel, before contemplating /Qopenmp.

With regard to use of real(8), could I point out that real*8 is more portable and not excluded by the 95/03 standard.

John

John_Campbell · ‎02-07-2011

Marc,

Reviewing your original post, you should note the difference between CPU time (via call cpu_time) and elapsed time (via call system_clock). With parallelization, the CPU time actually increases, due to the thread initiation overhead, while the elapsed time hopefully decreases.
You also need to weigh the advantage of increased processor utilisation against the increased contention with other background processes that are running. My PC also runs multiple svchost.exe processes and a virus scanner, and at times it appears as if that is all my PC does.
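
The difference between the two timers can be seen with a short sketch like the one below. The program name and workload are made up for illustration. It measures the same region with both cpu_time and system_clock; in a single-threaded run the two figures should be close, while in a run where the loop is parallelized the CPU time is typically summed over the threads and so can exceed the elapsed time.

[fortran]program timing_demo
  ! Hypothetical workload; only the timing calls matter here
  implicit none
  integer, parameter :: n = 5000000
  double precision, allocatable :: a(:)
  double precision :: s
  real :: cpu0, cpu1
  integer :: c0, c1, rate, i

  allocate(a(n))

  call cpu_time(cpu0)            ! processor time used so far
  call system_clock(c0, rate)    ! wall-clock ticks and ticks per second

  do i = 1, n
    a(i) = sqrt(dble(i))
  end do
  s = sum(a)

  call cpu_time(cpu1)
  call system_clock(c1)

  print *, 'checksum         =', s
  print *, 'CPU time (s)     =', cpu1 - cpu0
  print *, 'elapsed time (s) =', real(c1 - c0) / real(rate)
end program timing_demo
[/fortran]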

John

IanH · ‎02-07-2011

Quoting John Campbell

With regard to use of real(8), could I point out that real*8 is more portable and not excluded by the 95/03 standard.

Is that a typo? The 8 might be processor specific, but specifying it via "REAL*8" is definitely not standard Fortran (but again, there are a number (and probably even the majority) of compilers that support it as an extension, but it's a bit of a stretch to call any sort of extension "more portable" than the standard syntax). From F2003:

R501: type-declaration-stmt is declaration-type-spec [ [, attr-spec ] ... :: ] entity-decl-list

R502: declaration-type-spec is intrinsic-type-spec
...

R403: intrinsic-type-spec is ...
REAL [ kind-selector ]

R404: kind-selector is ( [ KIND= ] scalar-int-initialization-expr )

So having a * after REAL (or INTEGER or LOGICAL) in the context of a type declaration is excluded by the syntax rules. If you apply the appropriate standards checking switches to ifort it will give you an almost appropriate whack around the ears for it too (the compiler calls it a length specification, which is not quite right...).

Not that this has the slightest thing to do with parallelisation...

Back on that topic: when I compile test_dot_no.f90 with "Fortran > Diagnostics > Guided Auto Parallelism Analysis" set to Extreme (/Qguide:4), it tells me (amongst some other things) that I should "Insert a "!dir$ loop count min(16)" statement right before the loop at line 96 to parallelize the loop. [VERIFY] Make sure that the loop has a minimum of 16 iterations". When I do that (and then remember to turn the GAP thing off...) I get pretty similar runtimes. I presume this means that the additional loop in the main program has sufficiently obfuscated the range of loop counts that will be used for the loop in test_loop.

(Note that, apart from the odd !$OMP thing in bleedingly obvious places, I don't play on the parallel swings very often, and the dependency tracking/constant folding/loop unrolling available in my head compiler is well and truly exceeded here...).

John_Campbell · ‎02-08-2011

IanH,

The use of REAL*8 is neither a deleted nor an obsolescent feature in the 1990, 1995 or 2003 Fortran standards. As such it is standard Fortran.

I know of no 95 or 03 compiler that does not support this syntax, and as such it is more portable than REAL(8). It is a concise and clearly understood definition of precision.

Having worked with code from many pre-90 compilers, I know that when you encountered the declaration "REAL A" you had little idea what precision was required. Since the introduction of KIND, it is still much better to read REAL*8 than to have to look for a KIND parameter value which is often hidden in another file that is difficult to find or not supplied. What would you expect the declaration "REAL(rf) A" to mean? The intrinsic SELECTED_REAL_KIND implies a flexibility of precision that is not available, with typically only 2 or 3 possible successful outcomes.

Back on the topic, is Guided Auto Parallelism Analysis available in Version 11 ?

The do loops presented in the example above are very simple. I certainly need to understand what complexity can be accommodated by /Qparallel

John