
New user of Intel Fortran on Win XP-64

John_Campbell
Hi,

I am a new user to the Intel Fortran Forum.
I have come here to gain access to a 64-bit compiler and also to manage multi-processor computation.
I have a lot of experience in the 32-bit Microsoft Windows environment, using the Lahey and Salford compilers, and am hoping to reach a new level of performance in this new environment.

To date I have managed to compile, link and run my 32-bit program using the Intel 64-bit compiler, with only 2 minor problems:
1) Is there a routine of the form Get_IOSTAT_Message (IOSTAT, Message_return)? I couldn't find one.
2) I notice RECL= is a 4-byte word count rather than the Fortran standard byte count. (See the note below.)
This program uses about 1.7 GB of memory.
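
A note on both items, with the caveat that the specifics depend on the compiler version: the standard Fortran 2003 IOMSG= specifier returns the runtime's error text alongside IOSTAT=, and ifort's /assume:byterecl option switches RECL= to byte units for unformatted files. A minimal sketch:

[fortran]character(len=256) :: msg
integer :: ios
! IOMSG= receives the runtime's error text when ios /= 0
open (unit=10, file='data.bin', form='unformatted', &
      access='direct', recl=80, iostat=ios, iomsg=msg)
if (ios /= 0) write (*,*) trim(msg)
[/fortran]

Compiled with ifort /assume:byterecl, recl=80 above counts bytes rather than 4-byte words.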

I have also run a simple test program which allocates large arrays (8 GB in size), which appears to confirm I am getting access to more than 2 GB of addressable memory. So far so good.

My next problem is to try and introduce some processor efficiency.
I tried "ifort /?" and was swamped with options about optimisation and hardware options. Should I hnow what this means or do I have to go through a learning curve to use these options !!
I have a dual core Xeon pc, but was lost as to what to select to:
- get better performance suited to this processor,
- be able to target multi-thread calculations.

My software is basically Finite Element calculations where, for large problems, there are two "3-line do loop" routines that do most of the work (dot_product: c = vec_a . vec_b, and vector subtraction: vec_a = vec_a - c . vec_b; see the sketch below). Do I compile just these routines with the special options, or also the main procedure, or all the code? Also, what are the options?
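
For concreteness, a minimal sketch of what these two routines presumably look like (the names follow the Vec_Sum / Vec_Sub naming used later in the thread):

[fortran]! sketch of the two kernels: dot product and vector subtraction
real*8 function vec_sum (a, b, n)          ! c = vec_a . vec_b
   integer*4, intent (in) :: n
   real*8, dimension(n), intent (in) :: a, b
   integer*4 :: i
   vec_sum = 0.0d0
   do i = 1, n
      vec_sum = vec_sum + a(i)*b(i)
   end do
end function vec_sum

subroutine vec_sub (a, b, c, n)            ! vec_a = vec_a - c . vec_b
   integer*4, intent (in) :: n
   real*8, dimension(n), intent (inout) :: a
   real*8, dimension(n), intent (in) :: b
   real*8, intent (in) :: c
   integer*4 :: i
   do i = 1, n
      a(i) = a(i) - c*b(i)
   end do
end subroutine vec_sub
[/fortran]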

Hopefully I can post this message on the forum and start to learn this new compiler environment which appears to have so much potential. I thank you for your assistance.

John
John_Campbell
Jim,

To clarify some of your assumptions:

>>> Are we talking the same N (loop iteration count for vec_sum and vec_product)?
What this tells me (correct me if I am wrong): you are using the Skyline solver to build sets (plural) of equation solutions, each set passed through steps 1-6, with step 3 containing vec_sum and dot_product. <<<

There is only one set of equations being built.
N is the array size in the dot_product calls.

For the problem I am taking statistics from, we are talking of only one set of equations, with one pass through steps 1-6.
The simultaneous equation set is 154,000 equations, with only half the coefficients stored, as the matrix is symmetric. As real*8, the half matrix holds 280 million non-zero values between the skyline profile and the diagonal. This takes about 2.2 GB of memory. It is a good example because, for the larger in-memory problem, there are significant paging delays.
To do the forward reduction of this matrix (step 3 in post #17), there are 271,679,207 calls to a dot_product routine. The statistics I quoted in post #17 are a histogram giving a count of N, the size of the vectors in each dot_product call. (I used a log scale for the bucket sizes, to better identify the frequency of small values of N.) Step 5 uses Vec_Sum calls, while step 6 uses Vec_Sub. The profile of the value of N there shows a higher average value, but still few > 10,000.
This reduction (step 3) now takes about 4 minutes on my PC, while steps 5-6 take 5 seconds.
By contrast, the Lahey and Salford 32-bit versions take about 6 minutes, but these do (buffered) disk I/O for the solution.
There are other solution strategies (non-linear or eigenvector calculations) which have multiple cycles of steps 4-6 (say 100 times) within multiple cycles of steps 2-3 (say 20 times), for these modified sets of 150,000 simultaneous equations.
I've chosen a single-pass solution to learn on, as multiple-cycle solutions change their iteration flow depending on the relative elapsed time of step 3 compared to steps 5 & 6.
{Back in the days when I studied computer science, there was a strong connection to numerical methods/calculations. Most of us were civil or mechanical engineers. Times have changed.}

Back to what I am trying to achieve: based on this definition of N (the array size in the Vec_Sum / dot_product call), is a size of 10,000 necessary before parallelism can be effectively implemented?

My next step is to take TimP's recommendation, test the /O1 option, and see the difference. It may be that the Xeon only supports 2 x real*8 computations in the vectorized approach, and I should see if there is a processor that supports 4 x real*8 computations.
Before I started this investigation, I was hoping to gain significantly from a quad-core PC, which implies parallelization. I anticipated this would be achieved by re-writing the vec_sum code to drive multiple processors when N > say 50. I was not then aware of the vector instruction set in each CPU.
At present my hope is diminishing.

John

John_Campbell
I have now carried out a number of tests, comparing the Salford and Lahey 32-bit compilers with the Intel 32-bit and 64-bit compilers, for the 2.2 GB matrix size problem.
The 32-bit solutions partition the matrix into 4 blocks of maximum size 630 MB, while the 64-bit solution is a single block. All runs were done on an XP-64 OS with a Xeon processor and 6 GB of memory.

The performance times were:

Compiler     St3_CPU  St3_IO  St56_CPU  St56_IO  Description
Lahey_32      359.75    1.02     11.16     0.03  Lahey 32-bit compiler
Salford_32    364.03    0.74      9.17     0.04  Salford 32-bit compiler
Intel_32      231.77    0.64      5.86     0.03  Intel 32-bit /O2 /QxHost
Intel_O1      358.06    0.62      7.30     0.00  Intel 64-bit /O1 /QxHost
Intel_O2      233.17    0.10      5.02     0.00  Intel 64-bit /O2 /QxHost
Intel_O3      234.66    0.45      5.06     0.09  Intel 64-bit /O3 /QxHost

The 4 times quoted for each test are:
1. CPU time for the step 3 matrix reduction, using Vec_Sum
2. Other time (elapsed - CPU) for step 3
3. CPU time for steps 5 & 6 (the load case solution), using Vec_Sum then Vec_Sub
4. Other time for steps 5 & 6
CPU time was taken from "call Cpu_Time", while elapsed time was taken from "call System_Clock".
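
For reference, a minimal sketch of that measurement pattern (illustrative only):

[fortran]! sketch: CPU time via Cpu_Time, elapsed (wall) time via System_Clock
real*8  :: cpu0, cpu1
integer :: c0, c1, rate
call cpu_time (cpu0)
call system_clock (c0, rate)
! ... step being timed ...
call cpu_time (cpu1)
call system_clock (c1)
write (*,*) 'CPU   time =', cpu1 - cpu0
write (*,*) 'Other time =', dble(c1-c0)/dble(rate) - (cpu1-cpu0)   ! elapsed - CPU
[/fortran]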

These times show there is a significant improvement from /O1 to /O2, indicating, I think, that I am getting a vectorizing improvement. Are there differences between processors in the size of the vector used for vectorised calculations? Would different processors (a newer Xeon, or a Core i5/i7) produce a significantly better solution time?
Also of note is the lack of an elapsed-time penalty for the out-of-memory storage in the 32-bit cases. This is due to the significant disk buffering provided by the 6 GB of physical memory, especially in steps 5-6 where there are 3.6 GB of virtual disk reads.
I also ran a problem with a matrix size of 5.5 GB, with unfortunately only 6 GB of physical memory. This gave very interesting results: all cases suffered from either reduced disk-buffering efficiency or, for the in-memory 64-bit solution, significant paging delays. The in-memory results were better, and my next test will use more memory.
These results show that the performance of 32-bit programs using moderate disk I/O compares favourably with the 64-bit versions, but also that there is not a significant penalty in going to 64-bit addressing of larger arrays.

The compiler options I have used for Vec_Sum (Dot_Product), attempting both vectorisation and parallelisation, have not produced the automatic result I had hoped for.

ifort /c /source:vec_new.f95 /free /O3 /QxHost /Qparallel /Qvec-report /Qpar-report /Qdiag-file:vec_new.lis

Is someone able to provide an example, or assist with a compatible change to the Vec_Sum code and compiler options (including reporting of the parallelization achieved), so I can test different loop sizes for a parallel implementation? Dot_Product looks like a relatively simple example of what I am trying to achieve; achieving it is not as simple.
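
One possible shape for such a change, as an untested sketch (the threshold is illustrative, and compiling with /Qopenmp is required, rather than relying on /Qparallel):

[fortran]REAL*8 FUNCTION VEC_SUM (A, B, N)
!  sketch: explicit OpenMP reduction, gated on N so short vectors stay serial
   integer*4, intent (in) :: n
   real*8, dimension(n), intent (in) :: a, b
   real*8    :: c
   integer*4 :: i
   c = 0.0d0
!$OMP PARALLEL DO REDUCTION(+:c) IF(n > 10000)
   do i = 1, n
      c = c + a(i)*b(i)
   end do
!$OMP END PARALLEL DO
   vec_sum = c
END
[/fortran]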

John

John_Campbell
As I have not been able to implement a parallelized version of my program, I have tried the MKL BLAS routine "ddot", which claims to support multi-threading.

The new Vec_Sum code is now:

REAL*8 FUNCTION VEC_SUM (A, B, N)
!
!   Performs a vector dot product VEC_SUM = (a.b)
!   account is NOT taken of the zero terms in the vectors
!   c = dot_product ( a, b )
!
   integer*4, intent (in) :: n
   real*8, dimension(n), intent (in) :: a
   real*8, dimension(n), intent (in) :: b
!
   real*8 c
   real*8 ddot
   external ddot                 ! MKL BLAS level-1 routine
!
   c = ddot (n, a, 1, b, 1)      ! dot product with unit strides
   vec_sum = c
   RETURN
!
END

I linked with:
ifort *.obj ..\lib\*.obj mkl_intel_lp64.lib mkl_intel_thread.lib mkl_...
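
(For comparison, MKL also ships a sequential threading layer; a link along these lines, with the exact library set depending on the MKL version, would show what ddot costs without any threading:)

ifort *.obj ..\lib\*.obj mkl_intel_lp64.lib mkl_sequential.lib mkl_core.lib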




CPU = 380.78 sec and I/O = -138.74 sec, for a total of 242.04 sec, compared to 233.3 sec for the vectorised version (/O2) or 358 sec for the non-vectorised version (/O1). (The negative I/O figure presumably arises because Cpu_Time sums CPU time across the MKL threads, so it can exceed the elapsed time.)

Total elapsed time for the run is now 277.4 sec compared with 267.4 seconds for /O2.

While it was good to see the CPU % climb above 50%, the results are not promising.
If anyone has advice on a better approach, possibly one which better tunes ddot based on the value of N, I'd appreciate your suggestions.
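
One way to tune on N would be a simple dispatch inside Vec_Sum, sketched below; the threshold value is a guess that has to be calibrated per machine:

[fortran]REAL*8 FUNCTION VEC_SUM (A, B, N)
!  sketch: call threaded MKL ddot only for long vectors,
!  otherwise use a plain loop that /O2 can vectorise
   integer*4, intent (in) :: n
   real*8, dimension(n), intent (in) :: a, b
   real*8, external :: ddot
   integer*4, parameter :: n_min = 1600   ! calibrate per machine
   integer*4 :: i
   if (n >= n_min) then
      vec_sum = ddot (n, a, 1, b, 1)
   else
      vec_sum = 0.0d0
      do i = 1, n
         vec_sum = vec_sum + a(i)*b(i)
      end do
   end if
END
[/fortran]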

John

jimdempseyatthecove

>> N is the array size in dot_product calls.
>> To do the forward reduction of this matrix (step 3 in post #17), there are 271,679,207 calls to a dot_product routine.

This indicates that you should concentrate your parallelization efforts on the caller of the dot_product, and not focus on parallelization inside the dot_product. IOW:

[fortran]!$OMP PARALLEL DO
DO I=1, 271679207
   CALL SHELL_FOR_DOT_PRODUCT(I)
END DO
!$OMP END PARALLEL DO

SUBROUTINE SHELL_FOR_DOT_PRODUCT(I)
   INTEGER :: I
   
   ! DETERMINE VECTORS AND RESULT LOCATION
   
   ! THEN CALL __serial__ version of DOT_PRODUCT
   CALL DOT_PRODUCT(VECTOR_A, VECTOR_B, N, RESULT_C) 
   
END SUBROUTINE SHELL_FOR_DOT_PRODUCT

SUBROUTINE DOT_PRODUCT(VECTOR_A, VECTOR_B, N, RESULT_C)
  INTEGER :: N
  REAL :: VECTOR_A(N), VECTOR_B(N), RESULT_C
  INTEGER :: I
  RESULT_C = 0.0
  DO I=1,N
     RESULT_C = RESULT_C + VECTOR_A(I) * VECTOR_B(I)
  END DO
END SUBROUTINE DOT_PRODUCT

[/fortran]

as opposed to

[fortran]DO I=1, 271679207
   CALL SHELL_FOR_DOT_PRODUCT(I)
END DO

SUBROUTINE SHELL_FOR_DOT_PRODUCT(I)
   INTEGER :: I
   
   ! DETERMINE VECTORS AND RESULT LOCATION
   
   ! THEN CALL DOT_PRODUCT
   CALL DOT_PRODUCT(VECTOR_A, VECTOR_B, N, RESULT_C)
   
END SUBROUTINE SHELL_FOR_DOT_PRODUCT
...
SUBROUTINE DOT_PRODUCT(VECTOR_A, VECTOR_B, N, RESULT_C)
  INTEGER :: N
  REAL :: VECTOR_A(N), VECTOR_B(N), RESULT_C
  INTEGER :: I
  RESULT_C = 0.0
  !$OMP PARALLEL DO REDUCTION(+:RESULT_C)
  DO I=1,N
     RESULT_C = RESULT_C + VECTOR_A(I) * VECTOR_B(I)
  END DO
  !$OMP END PARALLEL DO
END SUBROUTINE DOT_PRODUCT
[/fortran]

In the former case (recommended) you enter and exit the parallel region only once.
In the latter case (not recommended) you enter and exit the parallel region 271,679,207 times.

Note, if it is more convenient, the parallel loop over the 271,679,207 individual dot_products could be replaced with a parallel loop over the 154,000 equations.

*** Relying on MKL to parallelize the inner dot_products induces a 271,679,207-fold increase in the number of times you enter and exit a parallel region. ***


Jim Dempsey

John_Campbell
Jim,

Thanks for your comments. To address your point of going back one loop level, to reduce the thread initiations from 270 million elements of the matrix down to 150 thousand equations/columns, I will present an abbreviated description of the forward reduction of a set of linear equations using the Crout Skyline approach:

In its simplest form, the Crout Skyline approach is:

do i = 1, neq
   ! reduce column i by column j
   do j = 1, i-1
      a(i,j+1) = a(i,j+1) - dot_product ( a(i,1:j), a(j,1:j) )
   end do
   ! reduce column i by itself
   ...
end do

This basically shows that each element of the matrix is subject to a dot_product modification. dot_product is a PURE function.

To step back to the J loop (the column reduction): each element of column a(i,:) is sequentially modified by the previous columns. Through the loop, the columns 1:i-1 do not vary, but the values of column i do. This is not a PURE loop. As such, can it still be parallelized?

To include the column storage of the matrix, the main J loop becomes:

locate column_i
identify ti : the start of column i
do j = ti, i-1
   locate column_j
   identify t : the lower start of both columns (i:ti and j:tj)
   column_i(j+1) = column_i(j+1) - dot_product ( column_i(t:j), column_j(t:j) )
end do

My limited understanding of parallel calculations is that non-PURE loops are not suitable.
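
To make the dependence concrete, here is the same J loop again with the conflict annotated (a sketch of why a plain !$OMP PARALLEL DO on this loop is unsafe):

[fortran]do j = ti, i-1
   ! iteration j READS column_i(t:j), which includes elements
   ! column_i(ti+1:j) WRITTEN by earlier iterations of this loop,
   ! so the loop carries a dependence and a naive PARALLEL DO
   ! would give wrong answers
   column_i(j+1) = column_i(j+1) - dot_product ( column_i(t:j), column_j(t:j) )
end do
[/fortran]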

I have also done some tests, varying the minimum dot_product loop size "N" for use of "ddot".
For my PC, the best performance came from limiting ddot to N greater than 1600, giving a total elapsed time of 259.2 seconds, compared to 261.8 with vectorisation only (threshold N > 99999, so ddot is never called). Not an effective solution.
Using ddot does not show me how to write parallelized code.

John
jimdempseyatthecove
John,

Set aside parallel programming for the moment and let's look at your serial programming.

>>
In its simplest form, the Crout Skyline approach is:

do i = 1, neq
   ! reduce column i by column j
   do j = 1, i-1
      a(i,j+1) = a(i,j+1) - dot_product ( a(i,1:j), a(j,1:j) )
   end do
   ! reduce column i by itself
   ...
end do

This basically shows that each element of the matrix is subject to a dot_product modification. dot_product is a PURE function.

To step back to the J loop (the column reduction): each element of column a(i,:) is sequentially modified by the previous columns. Through the loop, the columns 1:i-1 do not vary, but the values of column i do.
<<

Then why are you doing all these redundant dot_products?

Try something along the lines of:

do i = 1, neq
   prior_dot_product(i) = 0.0
end do
do i = 1, neq
   ! reduce column i by column j
   do j = 1, i-1
      current_dot_product = dot_product ( a(i,j:j), a(j,j:j) )   ! or a(i,j) * a(j,j)
      a(i,j+1) = a(i,j+1) - current_dot_product - prior_dot_product(j)
      prior_dot_product(j) = current_dot_product + prior_dot_product(j)   ! j not i
   end do
   ! reduce column i by itself
   ...
end do

*** test the code, the above is untested ***

The basic concept is: eliminate redundant work first in your serial code, then look at parallelization.

dot products are accumulative

dot(all sections) == dot(first part) + dot(remaining part)

When dot(first part) does not change (other than for the next step containing the incremental accumulation of the prior step), then computing dot(first part) for the next step is redundant work (as you already know it from the prior step).

========================

A second problem with your current code is that it is formulated something like this:

A(iOuterLoop,jInnerLoop+1) = DOT(A(iOuterLoop,jRangeInnerLoop), A(jInnerLoop,jRangeInnerLoop))

IOW the data "stream" is propagating along the rightmost index.

In Fortran, adjacent data is held along the leftmost index (progressing in larger strides towards the rightmost index).

Therefore, if you were to take a little up-front time to convert A to Atranspose:

do i = 1, neq
   do j = 1, neq
      Atranspose(i,j) = A(j,i)
   end do
end do

Then perform your original code above, substituting Atranspose for A and transposing the indices, and you will have better cache utilization.

*** remember to re-transpose Atranspose back to A afterwards ***

Transposing A will improve vectorization as well as cache utilization.
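
As a sketch of what that substitution gives (with At standing for Atranspose), the earlier reduction loop becomes:

[fortran]do i = 1, neq
   do j = 1, i-1
      ! At(1:j,i) and At(1:j,j) are now unit-stride (contiguous) sections
      At(j+1,i) = At(j+1,i) - dot_product ( At(1:j,i), At(1:j,j) )
   end do
end do
[/fortran]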

(sorry I did not comment on your second loop)

Jim Dempsey
John_Campbell
Jim,

Thanks very much for your comments.
You stated "Then why are you doing all these redundant dot_products?"

I need to think about this, as my initial response is no, they are not redundant.

Loop J-1 is dot_product (column_i, column_j-1)
Loop J   is dot_product (column_i, column_j)
I don't see any redundant calculations, as these 2 dot_products are significantly different.

However, except for the last value a(i,j-1) of loop J changing, both dot_products could be calculated at the same time, as a dual calculation.

It is important to note that for the skyline solver, the dot_product size depends on the profile heights of the two columns j and j-1, so there is some variability in the dot_product sizes.

I'll review your comments some more and reply again.

John
jimdempseyatthecove
John, in re-examining the suggestion for saving prior dot products, I no longer think that there are appropriate products to be saved. Disregard the prior post (sorry for the red herring). Combining the loops may have benefits.

Jim Dempsey