Intel® Fortran Compiler

Using /Qparallel with no effect

mfangmeyer
Beginner
Hi!

I've set the compiler option Parallelization to "/Qparallel", but it has no effect. I've tried it on this code; it's a DLL called by a .NET application:

[fortran]subroutine LRZERLEGUNG(AMatrix, DimensionN, AbsolutB, ResultX)
!DEC$ ATTRIBUTES DLLEXPORT::LRZERLEGUNG

  implicit none
  ! Variables
  real :: T1, T2

  integer, intent(in) :: DimensionN
  double precision, intent(inout), dimension(DimensionN,DimensionN) :: AMatrix
  double precision, intent(in), dimension(DimensionN) :: AbsolutB
  double precision, intent(out), dimension(DimensionN) :: ResultX
  double precision, dimension(DimensionN)::y

  integer :: i,j,k

  ! Body of lrZerlegung
  CALL CPU_TIME(T1)
  ! Compute L (Ax=LRx=b)
  !dir$ parallel
  do i = 1,DimensionN-1
     do j = i+1,DimensionN
       ! Determine the i-th column of L
       AMatrix(j,i) = AMatrix(j,i)/AMatrix(i,i)
       do k = i+1,DimensionN
          ! Update the j-th row
         AMatrix(j,k) = AMatrix(j,k) - AMatrix(j,i)*AMatrix(i,k)
       end do
     end do
  end do

  ! Forward substitution
  !dir$ parallel
  !dir$ loop count min(4)
  do j = 1,DimensionN
    y(j) = AbsolutB(j)
    do k = 1,j-1
      y(j) = y(j) - AMatrix(j,k)*y(k)
    end do
  end do

  ! Back substitution
  !dir$ parallel
  !dir$ loop count min(4)
  do j = DimensionN,1,-1
    ResultX(j) = y(j)
    do k = j+1,DimensionN
      ResultX(j) = ResultX(j) - AMatrix(j,k)*ResultX(k)
    end do
    ResultX(j) = ResultX(j)/AMatrix(j,j)
  end do
  CALL CPU_TIME(T2)
  write( *, * ) T2-T1
end subroutine LRZERLEGUNG
[/fortran]

Best regards, Marc
TimP
Honored Contributor III

The use of REAL*8 is neither a deleted nor an obsolete feature in the 1990, 1995 or 2003 Fortran standards. As such it is standard Fortran.

That's an interesting viewpoint; apparently, the fact that REAL*8 was never in the standard, and so was never explicitly obsoleted, makes it standard, even though for many years it hindered porting applications to machines other than those they were written on. I guess I'll have to agree that nearly all f95 and later compilers were tied to architectures which supported an 8-byte floating-point data type.

GAP (Guided Auto-Parallelization) is a highly advertised new feature of the "12.0" XE 2011 compilers. The loop count directive is an idea resurrected from compiler versions prior to 11.0.
mfangmeyer
Beginner
Why does this example not run in parallel?

[bash]program matrix
    implicit none
    real(8), allocatable :: a(:,:), b(:,:), c(:,:)
    integer :: N = 200, i
    real :: t0, t1

    allocate( a(N,N), b(N,N), c(N,N) )
    call random_seed()
    call random_number(a)
    call random_number(b)

    call cpu_time(t0)

    !dir$ parallel always
    do i = 1, 500
        c = matmul(a,b)
    end do
    call cpu_time(t1)
    write(*,'(a, f10.3)') ' Time for matmul        = ', (t1 - t0) / 5
    pause
end program matrix[/bash]
TimP
Honored Contributor III
The "12.0" compilers have a specific option for parallelized matmul (by library call):
-Qopt-matmul
Did you set that option?
If you're trying to find out how efficiently the threads are used, you will want to measure elapsed time as well as the total CPU time of all threads added up.
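
For anyone following along, here is a minimal timing sketch (the names are illustrative, not taken from the posts above): CPU_TIME sums the CPU time over all threads, while the standard SYSTEM_CLOCK intrinsic gives elapsed wall-clock time without needing the OpenMP library.

[fortran]! Sketch: wall-clock time via SYSTEM_CLOCK alongside CPU_TIME, which
! adds up the CPU time of all threads when the code runs in parallel.
program timing_sketch
  implicit none
  real(8), allocatable :: a(:,:), b(:,:), c(:,:)
  integer :: N = 200, i
  real    :: cpu0, cpu1
  integer :: count0, count1, count_rate

  allocate( a(N,N), b(N,N), c(N,N) )
  call random_number(a)
  call random_number(b)

  call cpu_time(cpu0)
  call system_clock(count0, count_rate)
  do i = 1, 500
     c = matmul(a, b)            ! the work being measured
  end do
  call system_clock(count1)
  call cpu_time(cpu1)

  write(*,'(a,f10.3)') ' CPU time (all threads) = ', cpu1 - cpu0
  write(*,'(a,f10.3)') ' Elapsed (wall) time    = ', real(count1 - count0) / real(count_rate)
end program timing_sketch[/fortran]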
abhimodak
New Contributor I
The documentation says that /Qopt-matmul is enabled by default when /Qparallel is specified. The diagnostic output also reports that matmul is parallelized, and I see that both cores are used 100%. Without /Qparallel, both cores' usage is about 60%.

To get a correct comparison I used the OpenMP timer; CPU_TIME does not reflect the correct (elapsed) values.

Most likely my test below (which is a little different in that I generate new random numbers in each iteration) does not use large enough matrices, but I don't see any significant performance improvement when using /Qopt-matmul.

Abhi

---

[fortran]      Program Test_Qmatmul

         Implicit None
         
         Real(8), Allocatable:: a(:,:), b(:,:), c(:,:)
         Integer :: N = 200, M = 500
         Integer :: i
         Real(8) :: ts, te, OMP_GET_WTIME

         Allocate( a(N,N), b(N,N), c(N,N) )
         Call random_seed()

         ts = OMP_GET_WTIME()
         !Call CPU_Time(ts)
         do i = 1, M
            Call random_number(a)
            Call random_number(b)     
            c = matmul(a,b)
         end do
         te = OMP_GET_WTIME()
         !Call CPU_Time(te)

         Write(*,'(A, F0.3)') 'Time for matmul = ', (te - ts)

      End Program Test_Qmatmul[/fortran]

John_Campbell
New Contributor II
TimP,

From the 2003 standard document linked in this forum, "Section 1.6.3 FORTRAN 77 compatibility" implies that REAL*8 is in the standard. I have converted many programs between compilers, including between 8-byte and non-8-byte architectures. Much public-domain numerical software was developed on Control Data machines in the '70s and '80s. With a program of unknown origin, it can be very difficult to know the required precision. KIND did not solve the problem, as often the KIND module listing is not included with the code listing. Have a look through some of the code examples in this forum.

John

TimP
Honored Contributor III
The Intel docs do, unfortunately, sometimes refer to f77 or f90 compatibility, meaning compatibility with extensions present in a predecessor compiler, not with the Fortran standard. Once or twice I asked for corrections and was turned down.
Posts on this forum often refer to legacy code containing many extensions as "f77", perhaps because the f77 standard didn't require a way of diagnosing extensions, as f90 and later standards do.
I've been around long enough to have used f77 compilers which didn't support the REAL*8 extension, including those for Honeywell 36-bit and Harris 24-bit platforms, as well as CDC compilers which certainly didn't accept REAL*8. You might argue that when we old-timers argue about such widely accepted extensions, we're pining in vain for the return of some of the old architectures.
Steven_L_Intel1
Employee
The *n syntax for REAL, INTEGER, LOGICAL and COMPLEX has never ever been part of the Fortran standard. It has always been an extension. I see nothing in section 1.6.3 of F2003 that relates to this - which text are you referring to?
John_Campbell
New Contributor II
You may be right!

Unfortunately I can't find my hard copy of the Fortran 77 standard.
On reflection, recalling the influence of CDC and IBM on that standard, it probably wasn't.
Was there never a reference to the REAL*8 syntax as an extension?

My apologies for assuming that REAL*8 was included in the standard.

As a long-time Fortran user, I maintain that *byte is the most concise and informative way to state the numerical precision required for the calculation being performed.
I recall the frustration, when converting programs to run on Prime/VAX or PC, of not knowing the required precision for the calculations.
I have always been disappointed with the KIND structure and the illusion of dialing up a precision in SELECTED_REAL_KIND. What do you do when someone selects SELECTED_REAL_KIND(7,38), or SELECTED_REAL_KIND(3,3), which is the example in an old Lahey manual?
My computer science background has been based more on numerical methods, and providing the ability to use SELECTED_REAL_KIND(3,3) is not a significant improvement in the use of Fortran. What would you think the original programmer was hoping to achieve?
If I see REAL*10 or REAL*16 in new code, at least I know I have a problem, while REAL*8 says the conversion is easy. The statement USE PRECISION is still an unknown. (Actually the use of REAL or REAL*4 was always the biggest worry.)

John
Steven_L_Intel1
Employee
I have a hardcopy of F77 (and F66) and can guarantee you that REAL*8 was never there.

The problem with using the byte size is that it tells you nothing about the precision or range of the data type. You were obviously never a VAX user, where there were two different representations of REAL*8, one of them with the same range as REAL*4, or a user of VMS on Alpha, where there were THREE REAL*8 representations (and two REAL*4), each with different precision and range. This is why the SELECTED_REAL_KIND intrinsic is so useful: it frees you from such issues, letting you specify the needed precision and range and letting the compiler decide which type best fits.

SELECTED_REAL_KIND(7,38) would get you REAL*4 on VAX. SELECTED_REAL_KIND(3,3) is easily satisfied by single precision - remember that these are minimums. The compiler must pick the smallest decimal-precision kind that meets the criteria for both range and precision. If there is more than one, the smallest kind value is chosen. If the implementation doesn't have a kind that meets the requirements, you'll get -1 for the kind, which will almost certainly result in a compile-time error.
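
To illustrate the point with a small sketch (the kind values noted in the comments are what a typical IEEE-based compiler such as ifort returns; they are not mandated by the standard):

[fortran]! Sketch: ask for precision/range and let the compiler pick the kind.
! On an IEEE machine, (3,3) is satisfied by single precision, while (7,38)
! forces double precision because IEEE single provides only 6 decimal digits.
program kind_sketch
  implicit none
  integer, parameter :: k_small = selected_real_kind(3, 3)    ! typically kind 4
  integer, parameter :: k_seven = selected_real_kind(7, 38)   ! typically kind 8
  real(k_small) :: x
  real(k_seven) :: y

  write(*,*) 'kind(x) =', kind(x), ' precision =', precision(x), ' range =', range(x)
  write(*,*) 'kind(y) =', kind(y), ' precision =', precision(y), ' range =', range(y)
end program kind_sketch[/fortran]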

Consider also traditional Cray systems, where single precision is 60 (!) bits. There are also 16-bit real types used in some graphics processors.

Of course there are times when you do need to know the byte size, such as C interoperability, which is why there are constants for C_FLOAT, etc.
John_Campbell
New Contributor II

Steve,
Thanks for your comments. I do now recall the longer REAL*8 precision. Was there an F90 VAX compiler that allowed the different REAL*4 and REAL*8 formats to be used in the same program? Our VAX went in the late '80s.
While the Lahey-documented precision of REAL*4 is about 7-8 digits with a 10^38 exponent, SELECTED_REAL_KIND(7,38) will get you REAL*8 on ifort and most Win32 compilers, which illustrates my problem: what did the original programmer require for his calculation, and what did he get, i.e. what accuracy did the calculation require?
The importance of byte size has not gone away, especially when using direct-access files, where it is still possible to legally mix integer, real and character, not to mention the illegal mixing via subroutine arguments that some of us still find convenient.
I never wrote code for a Cray, but CDC, IBM and ICL were a challenge. I think ICL had 48-bit reals, which gave a new level of uncertainty to REAL for code from the UK.
REAL*8 is a lot more certain than these past offerings. Now, listing code without the defining module takes me back to those annoying times.

KIND isn't likely to be changed. I do use it for portability when selecting higher precision than REAL*8.

John

TimP
Honored Contributor III
VAX single precision had the same precision (but slightly different range) as the later IEEE standard single precision, so I don't think selected_real_kind(7,38) should have produced it. We never saw f90 on the VAX. Our last VAX compiler (I didn't work in the computer industry then) did adopt the f77 subset of DATE_AND_TIME (which we then put into g77) to facilitate Y2K work.
I've never heard the CDC 60-bit format called "traditional Cray," although Seymour did have a hand in that. The Cray-branded systems started out with a 64-bit format (but not with IEEE precision).
Steven_L_Intel1
Employee
The VAX never had an F90 compiler, but Alpha did. DEC Fortran for Alpha VMS systems allowed you to select among the two VAX double formats or the IEEE double format, though you were permitted to use only one at a time. This was an arbitrary restriction made for simplicity and not based on the hardware's requirements.

F-float (VAX single precision) had a range from .29E-38 to 1.7E38 and 23 fraction bits, which qualifies for (7,38). IEEE single precision has effectively one less power of 2 of range due to the encodings for denormals, etc. and the range is more asymmetric than for VAX F-float (1.2E-38 to 3.4E38).

I don't think Lahey ever supported VAX floating, so their example is a bit strange.

There is also accommodation in Fortran 2008 for decimal floating point with the addition of an optional RADIX argument, so you might have more than one kind of a given byte size.
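
As a small sketch of how that reads in source (assuming an F2008-capable compiler; on binary-only hardware the decimal request simply returns a negative value, meaning no such kind exists):

[fortran]! Sketch of the F2008 RADIX argument to SELECTED_REAL_KIND.
program radix_sketch
  implicit none
  integer, parameter :: bin_dp = selected_real_kind(p=15, r=300, radix=2)
  integer            :: dec_dp

  dec_dp = selected_real_kind(p=16, r=300, radix=10)
  write(*,*) 'binary double kind  =', bin_dp
  write(*,*) 'decimal double kind =', dec_dp   ! negative if no decimal kind is available
end program radix_sketch[/fortran]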
mfangmeyer
Beginner
This code runs in parallel when built with /Qopenmp:

[bash]program matrix
    !use omp_lib
    Implicit None  
    
    Real(8), Allocatable:: a(:,:), b(:,:), c(:,:)  
    Integer :: N = 200, M = 500  
    Integer :: i  
    Real(8) :: ts, te, OMP_GET_WTIME   
    Allocate( a(N,N), b(N,N), c(N,N) )  
    Call random_seed()  
   
    ts = OMP_GET_WTIME()  
    !Call CPU_Time(ts)  
!$OMP PARALLEL DO
    do i = 1, M  
       Call random_number(a)  
       Call random_number(b)       
       c = matmul(a,b)  
    end do  
    te = OMP_GET_WTIME()  
    !Call CPU_Time(te)  
   
     Write(*,'(A, F0.3)') 'Time for matmul = ', (te - ts)  
     pause
end program matrix[/bash]

But /Qparallel has no effect! I thought /Qparallel would parallelize the code without any explicit directives?!
TimP
Honored Contributor III
OpenMP allows you to force parallelism with minimal checking for validity. As you didn't set up private copies of the arrays, it doesn't look like your OpenMP version could give well-defined results. Besides, even if there were a parallel random-number library, it could not give results consistent with serial execution, and /Qopenmp would not perform an automatic switch to a parallel random_number. It looks like the only opportunity for auto-parallelism here is the opt-matmul transformation.
If it were not for the requirement to trace a history in random_number, the compiler might start out with the shortcut of replacing do i = 1, M by do i = M, M when you don't set /Qopenmp. I don't think the compiler would perform the analysis needed to decide that your call to random_seed eliminates the requirement to follow the full random_number sequence.
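
Purely for illustration, a sketch of how that loop might be written if each iteration really is meant to be an independent matmul: private copies of a, b and c avoid the race (this relies on OpenMP 3.0 handling of allocatables in PRIVATE clauses, and the random sequence still won't match a serial run):

[fortran]! Sketch only: the private copies make each iteration independent; the
! final c is discarded, so this is a timing experiment, not a result.
program matrix_private
    use omp_lib
    implicit none
    real(8), allocatable :: a(:,:), b(:,:), c(:,:)
    integer :: N = 200, M = 500
    integer :: i
    real(8) :: ts, te

    allocate( a(N,N), b(N,N), c(N,N) )
    call random_seed()

    ts = omp_get_wtime()
!$OMP PARALLEL DO PRIVATE(a, b, c)
    do i = 1, M
       call random_number(a)
       call random_number(b)
       c = matmul(a, b)
    end do
!$OMP END PARALLEL DO
    te = omp_get_wtime()

    write(*,'(A, F0.3)') 'Time for matmul = ', te - ts
end program matrix_private[/fortran]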
mfangmeyer
Beginner
OK, but why does the analyzer not auto-parallelize the matrix multiplication? That is the point that surprises me.
Steven_L_Intel1
Employee
The code in the original post would auto-parallelize if the loop count directive said 8 instead of 4.
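For example, here is the forward-substitution loop from the original post with only the directive's minimum raised; whether a given loop is actually parallelized still depends on the compiler's dependence analysis:

[fortran]  ! Forward substitution, as posted above, with the loop count hint raised to 8
  !dir$ parallel
  !dir$ loop count min(8)
  do j = 1,DimensionN
    y(j) = AbsolutB(j)
    do k = 1,j-1
      y(j) = y(j) - AMatrix(j,k)*y(k)
    end do
  end do[/fortran]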

When I build the code with matmul and the version 12 compiler, it tells me it auto-parallelized the matmul:

c:\Projects\t.f90(16): (col. 13) remark: LOOP WAS AUTO-PARALLELIZED.
c:\Projects\t.f90(17): (col. 13) remark: LOOP WAS AUTO-PARALLELIZED.
c:\Projects\t.f90(18): (col. 12) remark: LOOP WAS AUTO-PARALLELIZED.
c:\Projects\t.f90(18): (col. 12) remark: PERMUTED LOOP WAS AUTO-PARALLELIZED.

If I add /Qguide I see this:


c:\Projects\t.f90(16): remark #30525: (PAR) Insert a "!dir$ loop count min(64)"
statement right before the loop at line 16 to parallelize the loop. [VERIFY] Make sure that the loop has a minimum of 64 iterations.
c:\Projects\t.f90(17): remark #30525: (PAR) Insert a "!dir$ loop count min(64)"
statement right before the loop at line 17 to parallelize the loop. [VERIFY] Make sure that the loop has a minimum of 64 iterations.
c:\Projects\t.f90(18): remark #30525: (PAR) Insert a "!dir$ loop count min(64)"
statement right before the loop at line 18 to parallelize the loop. [VERIFY] Make sure that the loop has a minimum of 64 iterations.

16 and 17 are the calls to RANDOM_NUMBER, so I would ignore those. I tried adding a directive before the MATMUL and it didn't change the output, though, as I noted, the compiler thinks it did parallelize it.
mfangmeyer
Beginner
Steve, many thanks! I will pause my testing for now and come back to parallelizing my algorithms later.