Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

New user of Intel fortran on Win XP-64

John_Campbell
New Contributor II
2,623 Views
Hi,

I am a new user to the Intel Fortran Forum.
I have come here to gain access to a 64-bit compiler and also to manage multi-processor computation.
I have a lot of experience in the 32-bit Microsoft Windows environment, using Lahey and Salford compilers, and am hoping to reach a new level of performance in this new environment.

To date I have managed to compile, link and run my 32-bit program using the Intel 64-bit compiler, with only 2 minor problems:
1) Is there a routine of the form Get_IOSTAT_Message (IOSTAT, Message_return)? I couldn't find it.
2) I notice RECL= is a 4-byte word count rather than the Fortran standard byte count.
This program uses about 1.7 GB of memory.

I have also run a simple test program which allocates large arrays (8 GB in size), which appears to confirm I am getting access to more than 2 GB of addressable memory. So far so good.

My next problem is to try and introduce some processor efficiency.
I tried "ifort /?" and was swamped with options about optimisation and hardware. Should I know what these mean, or do I have to go through a learning curve to use these options?!
I have a dual core Xeon pc, but was lost as to what to select to:
- get better performance suited to this processor,
- be able to target multi-thread calculations.

My software is basically Finite Element calculations, where for large problems, two "3-line DO loop" routines do most of the work (dot_product: c = vec_a . vec_b, and vector subtraction: vec_a = vec_a - c . vec_b). Do I compile just these routines with the special options, or also the main procedure, or all the code? Also, what are the options?

Hopefully I can post this message on the forum and start to learn this new compiler environment which appears to have so much potential. I thank you for your assistance.

John
0 Kudos
28 Replies
DavidWhite
Valued Contributor II
1,963 Views
John,

You could try using the IOMSG keyword in the statements generating the IOSTAT value - this should return the error message.
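For example, a minimal sketch (the file name here is made up; IOMSG= works the same way on READ and WRITE statements):

      program iomsg_demo
      implicit none
      character(len=256) :: msg
      integer :: ios
!     attempt to open a non-existent file, then report the runtime's own text
      open (unit=10, file='no_such_file.dat', status='old', &
            iostat=ios, iomsg=msg)
      if (ios /= 0) print *, 'open failed (iostat=', ios, '): ', trim(msg)
      end program iomsg_demo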

david
0 Kudos
Steven_L_Intel1
Employee
1,963 Views
The Fortran standard does not specify the units of RECL= in OPEN. It wasn't until F2003 that it was "suggested" that the unit be 1 byte. Before that, it was just "storage unit", and many compilers took that as "numeric storage unit", including the DEC compilers of the 1970s which are ancestors of today's Intel Fortran. /assume:byterecl sets the unit to 1 byte.
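To illustrate: a direct-access record holding one REAL*8 needs RECL=2 in the default 4-byte units, or RECL=8 when compiled with /assume:byterecl. A sketch:

      program recl_demo
      implicit none
      real*8 :: x
!     default units: recl=2 (two 4-byte units per 8-byte record)
!     with /assume:byterecl this would be recl=8
      open (unit=20, file='scratch.dat', access='direct', &
            form='unformatted', recl=2, status='replace')
      x = 1.0d0
      write (20, rec=1) x
      close (20, status='delete')
      end program recl_demo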

For processor targeting options, I suggest you use /QxHost , which looks at the processor you compile on and chooses the appropriate setting for that processor. You can also try /Qparallel and look at the new "Guided Auto Parallelization" feature to help you get the most out of it.
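For example (an illustrative command line; substitute your own source files):

ifort /O3 /QxHost /Qparallel prog.f95

Adding /Qvec-report and /Qpar-report will show which loops the compiler vectorized or parallelized.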
0 Kudos
John_Campbell
New Contributor II
1,963 Views
Thank you for your prompt replies. I shall investigate these options.

Can I include /QxHost on all files, but /Qparallel on only the one file that contains the two routines? I have about 60 *.f95 files in 4 directories to compile.

As these routines can have vector sizes from 1 to 10,000, should I provide a covering routine that selects a single-threaded version for vectors of size less than (say) 6 elements?

John
0 Kudos
John_Campbell
New Contributor II
1,963 Views
I included the option /QxHost on all files. I observed a 2% reduction in run times.

The following code contains the two key routines, which I have now written in Fortran array syntax. I previously used a DO loop and F77-style syntax.
I tried /Qparallel for this file. None of the code options I have tried has achieved multi-processor performance. I fear I am misunderstanding what can be achieved. Any recommendations on where I am going wrong?

Also, excuse my ignorance, but where would I find the new "Guided Auto Parallelization" feature ?

[bash]      SUBROUTINE VEC_SUB (A, B, FAC, N)
!
!   Performs the vector operation  A = A - B * FAC
!
      integer*4,                 intent (in)    :: n
      real*8,    dimension(n),   intent (inout) :: a
      real*8,    dimension(n),   intent (in)    :: b
      real*8,                    intent (in)    :: fac
!
      integer*4 i
      real*8    c
!
      c = -fac
!      do i = 1,n
!         a(i) = a(i) + b(i) * c
!      end do
      a = a + b * c
      RETURN
      END

      REAL*8 FUNCTION VEC_SUM (A, B, N)
!
!   Performs a vector dot product  VEC_SUM = <A . B>
!   account is NOT taken of the zero terms in the vectors
!
      integer*4,                 intent (in)    :: n
      real*8,    dimension(n),   intent (in)    :: a
      real*8,    dimension(n),   intent (in)    :: b
!
      integer*4 i
      real*8    c
!
!      c = 0
!      do i = 1,n
!         c = c + a(i) * b(i)
!      end do
      c = dot_product ( a, b )
      vec_sum = c
      RETURN
      END[/bash]
0 Kudos
TimP
Honored Contributor III
1,963 Views
Assuming you have checked that you get vectorization of these functions, parallelization isn't likely to show additional speedup unless the loop counts are several thousand. You don't have any local information which would tell the compiler if this is the case, so it will probably optimize for loop count (100). Alternatively, you could use /Qpar-threshold to make the parallelization more aggressive, recognizing that many of the additional parallelizations will be unproductive. Even if the loop counts are long enough for combined vectorization and parallelization, parallelization will be more effective if it can be done outside these function calls.
The current compilers are unpredictable about whether the f77 or dot_product syntax is more likely to vectorize. If the f77 loop doesn't vectorize (e.g. supposing you set /fp:source), you should be able to get vectorization by directive:

!dir$ simd reduction(+: sum)
do i=1,n
   sum = sum + a(i)*b(i)
end do
If you find useful opportunities for parallelism, you may need OpenMP directives to specify in detail which parallelizations you want.
The guided auto-parallelization feature should make suggestions about vectorization and parallelization. It should help to set e.g.
!dir$ loop count(3000)
(substitute an appropriate value) on all loops in order to get better suggestions.
Documentation on GAP is here. Needless to say, the writer of the doc forgot about the existence of Fortran.
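If I recall correctly, GAP is driven by the /Qguide option, along the lines of:

ifort /c /O3 /Qparallel /Qguide vec_new.f95

With /Qguide the compiler emits advice messages about changes you could make, rather than applying any transformations itself.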

One of the differences between /QxHost (for Core i7) and the default /arch:sse2 for double precision is the compiler's choice between scalar and parallel loads for data of unknown alignment. While Core i7 and later CPUs are more likely to benefit from the parallel loads (almost certainly so if the data are actually aligned, although the compiler doesn't know it), it isn't always the best choice. If you do know that the data are always aligned, you can use
!dir$ vector aligned
to make these loops more efficient. This will cause run-time failure if your assertion is incorrect.
0 Kudos
John_Campbell
New Contributor II
1,963 Views

Tim,

Thank you very much for your comments. Unfortunately I do not yet fully understand what you have stated, as I'm a bit new to the Intel terminology.

You have discussed vectorisation and parallelization, and I need to better understand what they mean.
My initial expectation was to be able to compile the dot_product so as to use both processors to perform the calculation when the loop size is big enough, with the compiler managing all the multi-thread handshaking. I thought this would be parallelization. Is this a possibility for the explicit DO loop I provided? If so, I will be asking how later!
Do you have a simple description of vectorization?

I have had some success today.
1) I have managed to get my 64-bit program version running and have solved 2 problems, one with a 2.4 GB matrix size and one with a 5.8 GB matrix size, using a single allocated array. They both worked, maxing out at 50% CPU on my 2-processor machine and giving comparable results to the out-of-core 32-bit application. Unfortunately, I only have 6 GB of physical memory, so the elapsed time was not as good as could be hoped. More hardware required.
2) I also calculated a histogram of the loop size. For the smaller problem, there were 271,679,207 Dot_Product calls: 18% were in the range 404-665, 31% in the range 666-1096, 17% in the range 1097-1808 and 10% > 1808. Only 8.5% were < 150. (I could not insert the Excel chart.) It would appear that a substantial number of calls could benefit from multi-threading?

I'll try to learn more tomorrow.

John

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,963 Views
John,

>>My software is basically Finite Element calculations, where for large problems, two "3-line DO loop" routines do most of the work (dot_product: c = vec_a . vec_b, and vector subtraction: vec_a = vec_a - c . vec_b)

FE systems are often composed in one of two ways:

a) A single large mesh
b) Multiple smaller mesh objects

In the case of the single large mesh you can add OpenMP parallelization at the inner loop level:

c = -fac
!$omp parallel do
do i = 1,n
a(i) = a(i) + b(i) * c
end do
!$omp end parallel do

In the case of multiple smaller mesh objects you move the parallelization outwards in your program:

!$omp parallel do
do iObj = 1, nObjs
call VEC_SUB(Objects(iObj)%A, Objects(iObj)%B, aFAC, Objects(iObj)%nVerts)
end do
!$omp end parallel do
...

The above case assumes a sufficiently large number of objects with a similar number of vertices.
The above case partitions the iteration space 1:nObjs into a number of partitions equal to the number of available threads (typically the logical CPU count). When the objects have approximately the same number of vertices, this partitioning works OK. However, when the number of vertices varies, equal partitioning may not yield an equal amount of work. In this situation you might want to modify the above loop's !$omp directive:

!$omp parallel do schedule(static,1)

The above causes each thread, in turn, to pick the next object. There are other scheduling clauses for varying circumstances.

The compiler's auto-parallelization knows little about your program and would parallelize only the innermost statements.

A general rule to follow:

Parallelize outer
Vectorize inner

Parallelizing the inner loop of the DOT product requires a little care:

c = 0.0
!$omp parallel do reduction(+:c)
do i = 1, n
c = c + a(i) * b(i)
end do
!$omp end parallel do

With the reduction(+:c) clause, each thread starts with a private copy of c, initialized to 0, and produces a partial sum for its slice of the iteration space 1:n. Upon exit from the parallel region, the thread-private partial sums are atomically summed to produce the correct result.

Without the reduction clause, the end summation has a high probability of being incorrect (because the floating-point read-modify-write of c is not atomic, so threads can lose each other's updates).

Jim Dempsey
0 Kudos
Wendy_Doerner__Intel
Valued Contributor I
1,963 Views
0 Kudos
John_Campbell
New Contributor II
1,963 Views

Jim,

Thanks for your comments. Again I will try to understand what you have said.

My FE calculations are based on a Skyline direct solver, which was popular in the 80's. Finite Element systems are large collections of discrete elements, which, when converted to a system of linear equations, become a symmetric matrix with significant variation in bandwidth (profile). In the direct solution approach, the forward reduction of the stiffness matrix and the forward reduction of the load vectors are based on dot_product, while the backward substitution of load vectors is based on Vec_Sub.
My understanding to date is that vectorisation is based on parallel calculation of small vectors (probably size 3 to 6), as carried out in a graphics transformation within a single CPU, while parallelization is multi-threading across multiple CPUs.
Based on this, as a good place to start and an initial learning experience, I am trying to achieve some vectorisation or parallelisation of my basic dot_product routine. After reading some documentation, I have progressed to the following code idea, but was not able to achieve any change in run-time performance or an increase above 50% CPU in Task Manager. I am hoping someone could review the code and suggest an appropriate ifort compile option that might work.

The latest code "vec_new.f95" I tried was:

      REAL*8 FUNCTION VEC_SUM (A, B, N)
!
!     Performs a vector dot product  VEC_SUM = <A . B>
!     account is NOT taken of the zero terms in the vectors
!     c = dot_product ( a, b )
!
      integer*4,               intent (in) :: n
      real*8,    dimension(n), intent (in) :: a
      real*8,    dimension(n), intent (in) :: b
!
      real*8    c
      integer*4 i
!
      c = 0.0
      if ( n > 100 ) then
!DEC$ PARALLEL
         do i = 1,n
            c = c + a(i)*b(i)
         end do
      else
!DEC$ NOPARALLEL
         do i = 1,n
            c = c + a(i)*b(i)
         end do
      end if
      vec_sum = c
      END FUNCTION VEC_SUM

0 Kudos
John_Campbell
New Contributor II
1,963 Views
I must admit I am becoming confused! I have been searching through a number of articles on vectorizing and parallelizing the Fortran dot_product. While the discussions appear to be very complex and rich in terminology, I find a lot of confusing discussion which does not provide a clear and concise answer. It may be that a clear answer does not exist!!

However I will try !

My goal is to modify the following two code examples to produce improved performance, via either vectorizing or parallelizing. The code, as best I can summarise, is:
[bash]      SUBROUTINE VEC_SUB (A, B, FAC, N)
!
!     Performs the vector operation  A = A - B * FAC
!
      integer*4,                 intent (in)    :: n
      real*8,    dimension(n),   intent (inout) :: a
      real*8,    dimension(n),   intent (in)    :: b
      real*8,                    intent (in)    :: fac
!
      integer*4 i
      real*8    c
!
      c = -fac
      do i = 1,n
         a(i) = a(i) + b(i) * c
      end do
      RETURN
      END

      REAL*8 FUNCTION VEC_SUM (A, B, N)
!
!     Performs a vector dot product  VEC_SUM = <A . B>
!     account is NOT taken of the zero terms in the vectors
!
      integer*4,                 intent (in)    :: n
      real*8,    dimension(n),   intent (in)    :: a
      real*8,    dimension(n),   intent (in)    :: b
!
      integer*4 i
      real*8    c
!
      c = 0.0
      do i = 1,n
         c = c + a(i)*b(i)
      end do
      vec_sum = c
      RETURN
      END[/bash]

Can someone provide a suggestion of a combination of code modification and compatible compiler options that improve the performance ?

From my previous posts, the value of n varies considerably between 1 and 10,000, with less than 10% having n < 150. Based on my understanding of vectorisation, that it is related to instruction sets that suit graphics vector calculations, I'd expect this would be the better option, if the CPU supports vector calculations for real*8 in a 64-bit environment.

If a concise vectorized solution to dot_product does exist, where the inputs are rank-one vectors, why doesn't ifort use it when /fast is selected?

If this is not the case, then why not??

I'm not trying to be overly critical as a new user of this forum, but as a long-time Fortran programmer, my best advice has always been KISS: keep it simple, stupid. Our goal should be to provide a simple solution, as I'm sure I'm not the first one to travel down this path.

John
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,963 Views

>>My FE calculations are based on a Skyline direct solver, which was popular in the 80's. Finite Element systems are large collections of discrete elements, which, when converted to a system of linear equations, become a symmetric matrix with significant variation in bandwidth (profile).

Are the dot_products and Vec_Subs performed:

a) On each discrete element individually with additional code handling any element-to-element interaction.

or

b) On a single large matrix built from the concatenation/assemblage of the discrete elements

If a) then perform the parallelization on the element-by-element loop.

If b), then perform the parallelization inside the dot_product and vec_sub loops

>>My understanding to date of vectorisation is this is based on parallel calculation of small vectors ( probably size 3 to 6) that is carried out

The vector size depends on your system architecture. SSE vector instructions handle small vectors of 2 REAL(8)s or 4 REAL(4)s; AVX handles 4 REAL(8)s or 8 REAL(4)s.

When the data is organized favorably, and when the code can make use of small vectors, vectorization can significantly improve performance. On a multi-core system, parallelization also improves performance. So the ideal is to use both vectorization and parallelization.

>>Based on this a good place to start, as an initial learning experience, I am trying to achieve some vectorisation or parallelisation of my basic dot_product routine.

Good for you to take this initiative.


>>After reading some documentation, I have progressed to the following code idea, but was not able to achieve any change in run-time performance or an increase above 50% CPU in Task Manager.

This suggests the code is either not parallelized, not entering the parallel region, or the parallel loops are too small to warrant parallelization.

I suggest you turn off auto-parallelization and use OpenMP. OpenMP gives you better control over when and where you perform the parallelization.

!DEC$ PARALLEL
do i = 1,n
   c = c + a(i)*b(i)
end do

becomes (and, depending on the version of OpenMP, you can add a threshold clause):

!$OMP PARALLEL DO REDUCTION(+:c) IF(n > 10000)
do i = 1,n
   c = c + a(i)*b(i)
end do
!$OMP END PARALLEL DO

Starting up and shutting down parallel regions has some overhead. The loop body times the number of iterations has to be large enough to absorb that overhead. For a simple DOT product, 100 iterations is too low; 10,000 could be more than necessary to warrant parallelization.
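As a minimal sketch of the kind of covering routine John asked about earlier, dispatching on vector length (the threshold of 5000 is illustrative only; tune it by measurement):

      REAL*8 FUNCTION VEC_SUM_AUTO (A, B, N)
!
!     Dot product with a crude serial/parallel dispatch on vector length
!
      integer*4,               intent (in) :: n
      real*8,    dimension(n), intent (in) :: a
      real*8,    dimension(n), intent (in) :: b
      real*8    c
      integer*4 i
!
      c = 0.0d0
      if ( n > 5000 ) then          ! illustrative threshold
!$OMP PARALLEL DO REDUCTION(+:c)
         do i = 1,n
            c = c + a(i)*b(i)
         end do
!$OMP END PARALLEL DO
      else                          ! small n: serial, let the vectorizer work
         do i = 1,n
            c = c + a(i)*b(i)
         end do
      end if
      vec_sum_auto = c
      END FUNCTION VEC_SUM_AUTO

Compile with /Qopenmp so the !$OMP directives are honoured.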


>>I'm not trying to be overly critical as a new user of this forum, but as a long-time Fortran programmer, my best advice has always been KISS: keep it simple, stupid. Our goal should be to provide a simple solution, as I'm sure I'm not the first one to travel down this path.

This forum is the best place to seek advice.

Jim
0 Kudos
TimP
Honored Contributor III
1,963 Views
1) Maybe you should be interested in IOMSG=
2) /assume:byterecl switches the RECL units to bytes

You haven't responded to a fraction of the answers you've already received to your questions. As we've already pointed out, you should first check whether your important loops have been auto-vectorized, and get some idea of their typical length.
Profiling might soon become important (if you haven't already been doing it), and Windows severely limits the options there (Intel VTune, AMD CodeAnalyst).
0 Kudos
anthonyrichards
New Contributor III
1,963 Views
Forgive my butting in, but this appears a good point to ask Jim/Steve about one aspect of the sort of code that the poster is trying to maximally optimise for speed.

Given the code

do i = 1,n

c = c + a(i)*b(i)

end do

where the variable c is local to this particular patch of code, are all modern compilers intelligent enough to know that it is enough to accumulate 'c' in a local register, neither storing each incremented value of 'c' back to the address assigned to 'c' nor fetching the value required on the right of the '=' sign from that address each time an increment is applied? Thus one memory store and one memory fetch are eliminated from every iteration of the loop. Clearly, on exiting the loop, the 'final' value of 'c' must eventually be stored at the address of variable 'c'.


(My Assembler knowledge was only used many years ago on CDC machines, but even then, there were available a handful of floating-point registers useful for storing such incremented values without requiring memory calls and also a handful of integer registers useful for computing 2-D and 3-D array element locations by utilising increments by stride lengths to integer register values without having to get/fetch values from the addresses of array index variables. Thus, a standard compiled version of the Gauss-Jordan inversion of a symmetric real matrix was spectacularly improved by switching to assembler code making maximum use of local registers)

0 Kudos
TimP
Honored Contributor III
1,963 Views
Forgive my butting in, but this appears a good point to ask Jim/Steve about one aspect of the sort of code that the poster is trying to maximally optimise for speed.

Given the code

do i = 1,n

c = c + a(i)*b(i)

end do

where the variable c is local to this particular patch of code, are all modern compilers intelligent enough to know that it is enough to accumulate 'c' in a local register, neither storing each incremented value of 'c' back to the address assigned to 'c' nor fetching the value required on the right of the '=' sign from that address each time an increment is applied? Thus one memory store and one memory fetch are eliminated from every iteration of the loop. Clearly, on exiting the loop, the 'final' value of 'c' must eventually be stored at the address of variable 'c'.


(My Assembler knowledge was only used many years ago on CDC machines, but even then, there were available a handful of floating-point registers useful for storing such incremented values without requiring memory calls and also a handful of integer registers useful for computing 2-D and 3-D array element locations by utilising increments by stride lengths to integer register values without having to get/fetch values from the addresses of array index variables)

This is one of the more crucial points where it's important to check the optional compiler optimization reports. If /Qvec-report tells you this loop has been auto-vectorized, that includes parallel register accumulation of the result and final addition of the individual batched sums.
If you set an option which inhibits such optimization, such as /fp:source, you can still override that for the individual loop with !dir$ simd reduction(+: c) in the current ifort.
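For example (report levels vary by compiler version):

ifort /c /O3 /QxHost /Qvec-report2 vec_new.f95

Level 2 also reports the loops that did not vectorize and the reason why.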
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,963 Views

You are right to some extent (up to the last part)

do i = 1,n

c = c + a(i)*b(i)

end do

Assume for example a and b are REAL(4) arrays, and assume you compile using optimizations for P4 and later processors. Without parallelization, the above loop is vectorizable. Ignore for the moment that n could be very small (1, 2, 3, ...) and that the compiler will generate hidden tests for the minimal size of n and the alignments of a(1) and b(1).

The pseudo code for the above loop might look like

temp_c[1:4] = {0.0, 0.0, 0.0, 0.0}  ! one instruction "xorps xmm0,xmm0"
loop:
temp_a[1:4] = a(i:i+3)              ! one instruction "movaps xmm1,[reg_a+reg_i*4]"
temp_a[1:4] *= b(i:i+3)             ! one instruction "mulps xmm1,[reg_b+reg_i*4]"
temp_c[1:4] += temp_a[1:4]          ! one instruction "addps xmm0,xmm1"
reg_i += 4                          ! "add reg_i,4"
if(reg_i <= n) goto loop            ! two instructions "cmp reg_i,reg_n; jle loop"
temp_c[1] = temp_c[1] + temp_c[2] + temp_c[3] + temp_c[4]
! above in two instructions "haddps xmm0,xmm0; haddps xmm0,xmm0"
! finally
c += temp_c[1] ! *** either three instructions
movss xmm2,[reg_c]  ! xmm2 = c
addss xmm2,xmm0     ! xmm2 += temp_c[1]
movss [reg_c],xmm2  ! c = xmm2
! *** or two instructions
addss xmm0,[reg_c]  ! temp_c[1] += c
movss [reg_c],xmm0  ! c = temp_c[1] + c

For this example we also assume the loop count n is a multiple of 4 (the compiler adds code to test for and handle residual array elements).

In the above sketch you can observe that 4 elements of each array are processed at each step (REAL(8) would process two at a time).

At the exit of the loop you find two approximate representations of how the eventual result sum is added to c.

Note that you are unable to perform the final summation to c in a single floating-point instruction.
Even for integer summation direct to memory (add [c], temp_c), the single instruction would be performed as a Read/Modify/Write (Read / Add / Write).

In a multi-threaded environment, both floating-point and integer summation to memory have a time interval between the Read and the Write. In your case of SSE floating point, the instruction latencies up to the write could be anywhere between 25 and 250 clock ticks (approximately). It is probable that two threads might do the following:

C = 0.0
enter parallel

Thread A                                  | Thread B
Read old C (0.0) to register in A         |
                                          | Read old C (0.0) to register in B
Add A's partial sum (0.0 + Asum)          |
                                          | Add B's partial sum (0.0 + Bsum)
Write new C (0.0 + Asum)                  |
                                          | Write new C (0.0 + Bsum)
exit parallel

C now == 0.0 + Bsum, not 0.0 + Asum + Bsum

Note that you may find the correct result 99.99% of the time (the two threads not finishing at the same time). But the probability is real that you will at times end up with only one thread's sum.

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,963 Views
There are ways to handle the race condition

c = 0.0 ! return complete DOT

!$omp parallel do reduction(+:c)

do i = 1,n

c = c + a(i)*b(i)

end do

!$omp end parallel do

----------------

c_temp = 0.0 ! accumulate DOT

!$omp parallel do reduction(+:c_temp)

do i = 1,n

c_temp = c_temp + a(i)*b(i)

end do

!$omp end parallel do

c = c + c_temp ! accumulate to caller

-----------------

!$omp parallel private(c_temp) shared(c)

c_temp = 0.0 ! thread local accumulate DOT

!$omp do

do i = 1,n

c_temp = c_temp + a(i)*b(i)

end do

!$omp end do

!$omp critical ! or in this case !$omp atomic

c = c + c_temp ! accumulate to caller

!$omp end critical ! or omit for !$omp atomic
!$omp end parallel
----

You would choose the last method if you have more code to perform in the critical section.

Jim Dempsey

0 Kudos
John_Campbell
New Contributor II
1,963 Views

Tim,

>> You haven't responded to a fraction of the answers you've already received to your questions <<
Thanks for your advice on using this forum; I will try to answer the questions in each post and progress.

Your points 1) and 2) have been understood and are no longer an issue.

>> As we've already pointed out, you should first check whether your important loops have been auto-vectorized, <<
I have used the following command:
ifort /c /source:vec_new.f95 /free /O3 /QxHost /Qparallel /Qvec-report /Qpar-report /Qdiag-file:vec_new.lis
(I could not find the option to include these reports in the code listing, i.e. a /list:file option.)

I received the following report:

E:\Intel_Sap_64\test\vec_new.f95(18): (col. 7) remark: LOOP WAS VECTORIZED.

E:\Intel_Sap_64\test\vec_new.f95(40): (col. 7) remark: LOOP WAS VECTORIZED.

This appears to show that vectorisation was implemented, but no auto-parallelism. When I run this version, I do not see the benefits of the vectorisation in my run times. (Xeon 3505 not supporting it, or some other problem?)
Given that the code is vectorised, why do I not see any run-time improvement?
Is it a problem with the compile command for the main program or other routines that call Vec_Sum, or a problem with the link command? Are you able to comment on this observation?
My general compile command for other procedures is: ifort /c /O2 /source:%1.f95 /free /QxHost
My link command is basically: ifort *.obj ..\lib\*.obj /Feprog.exe /map:prog.map

To reply to Jim Dempsey's comments in #11:
>> Are the dot_products and Vec_Subs performed: ... <<
Basically b). In a skyline solver, the equation solution acts on the full equation matrix as a set of linear equations, with the element grouping of equations basically removed. "N" is the equation bandwidth. This differs from "frontal" solvers, which retain some of the element grouping of equations. A skyline solver has a number of distinct stages of solution, being:
1) minimisation and determination of equation profile (bandwidth minimisers)
2) assembly of complete "global" stiffness matrix in a profile form, from all element stiffness matrices, as a set of equations.
3) forward reduction of the "global" stiffness matrix. This used Vec_Sum (dot_Product) and can be up to 99% of total run time.
4) assembly of the load vectors
5) forward reduction of the load vectors.
6) backward substitution of the load vectors.

In a large system of equations, 3) is by far the greatest computation phase.
In a system where the global matrix is stored on disk, 3) does a read and a write, while 2) does a write and 5) and 6) each do a read of the matrix. In an eigenvector subspace iteration with multiple load sets, repeating steps 5) and 6), these reads can become the significant run time. Hence the approach of a 64-bit in-memory solution.

Step 5) is basically solving a large set of simultaneous linear equations, using a direct solution approach. This is what I am trying to get to run faster. I have introduced 64-bit addressing to store the equations in memory, and now I want the run time improved: both of which I thought I could achieve when I read the Intel Fortran glossy.

>> Starting up and shutting down parallel regions has some overhead. The loop code x number of iterations has to be sufficiently large enough to absorb the overhead. For a simple DOT product, 100 is too low, 10,000 could be more than necessary to warrant parallization. <<
I find Jim's comments about the loop size a bit of a worry. Is 10,000 really what we are looking for before multiple threads become effective? I did a histogram of the value of N in Vec_Sum for my medium-size problem of 154,000 equations, using a log scale for the N ranges. I was hoping for an effective N value of say 50 to 100.

Histogram of Vec_Sum N value

N Range       Count          %
1             155,375
2             165,126
3-4           310,705
5-7           463,258
8-12          912,718
13-20         1,311,197
21-33         2,157,660
34-54         3,605,227
55-90         5,579,890      2%
91-148        8,846,732      3%
149-244       14,766,515     5%
245-403       25,187,559     9%
404-665       49,307,473    18%
666-1096      84,614,467    31%
1097-1808     46,562,808    17%
1809-2980     23,469,101     9%
2981-4914     2,380,589      1%
4915-8103     1,722,149
8104-13359    160,658

I previously reported there were very few values below 150 (8.5%), but as the table shows, there are also very few > 8100 (0.06%).

I will try to take the OpenMP approach and see what can be done, although the threshold of 10,000 is not very promising. Any advice on the ifort command?

My focus of attention has now come down to first showing some improvement in the dot_product calculation and seeing what I can learn from that. Is someone able to provide a code revision and ifort command that may demonstrate this capability?

John
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,963 Views

>>>In a skyline solver, the equation solution acts on the full equation matrix as a set of linear equations, with element grouping of equations basically removed. "N" is the equation bandwidth.

3) forward reduction of the "global" stiffness matrix. This used Vec_Sum (dot_Product) and can be up to 99% of total run time.

I did a histogram of the value of N for Vec_Sum for my medium size problem of 154,000 equations, using a log scale for N range. I was hoping for an N value of say 50 to 100.

Histogram of Vec_Sum N value

N Range Count

1 155,375

8104-13359 160,658

<<<

Are we talking about the same N (the loop iteration count for vec_sum and dot_product)?

What this tells me (correct me if I am wrong):

You are using the skyline solver to build sets (plural) of equation solutions, each set passed through steps 1-6, with step 3) containing vec_sum and dot_product. Your histogram indicates this is so. IOW, if you only had one set, you would have had only one entry in your histogram, with one count.

If my assumption is correct, then consider

parallel do loop for each set of equation solution

1) minimization and determination of equation profile (bandwidth minimisers)

2) assembly of complete "global" stiffness matrix in a profile form, from all element stiffness matrices, as a set of equations.

3) forward reduction of the "global" stiffness matrix. This used Vec_Sum (dot_Product) and can be up to 99% of total run time.

4) assembly of the load vectors

5) forward reduction of the load vectors.

6) backward substitution of the load vectors.

end parallel do loop

>>>In a system where the global matrix is stored on disk, 3) does a read and a write.

???

With 99.04% of vector size <= 8100, how can there be any disk I/O?

So I have to assume we are talking apples and oranges.

If the sets (plural) of equations are on disk (millions of sets), then you will need to notch up your parallel programming skill set and learn about parallel pipelines.

I would suggest you begin without parallel pipelines, because the learning curve is less, and in the process you will assure your steps 1-6 are thread safe (and usable in the parallel pipeline).

Start with smaller set sizes such that all sets can be read into memory:

serial section-1
    read all sets into memory as RAW data, do no other work
serial section-2
    identify set demarcations in RAW data
parallel do i=1,nSetsFound
    using set demarcation i
    1) minimization
    2) assembly
    ...
    6) backward substitution
end parallel do

Once this is working, the above can easily be reworked into a parallel pipeline. Using parallel pipeline techniques you can make this a continuous operation without regard to the size of the RAW data.

Jim Dempsey

0 Kudos
TimP
Honored Contributor III
1,963 Views
According to the data you presented, the vectorization you achieved in the inner loops should be effective. Did you compare the performance with vectorization removed (e.g. -O1)? If it doesn't speed up by a factor of 2 or more with vectorization, we would suspect that your performance is limited by cache ineffectiveness. Your histogram shows insufficient use of loops long enough to benefit from combined vectorization and parallelization in the inner loops, consistent with the advice several of us have given.
You should get a significant advantage from parallelizing the outer loop of the factorization with OpenMP. Again, as several of us said, you can't expect the auto-parallelizer to do the job, even if you take advantage of the options. In general, in a skyline solver, it will be difficult to balance the work among the threads without losing cache locality, but it will be interesting to try. As you no doubt saw with your search engine, several organizations considered this worthy of a research project.
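A rough timing harness for such a comparison might look like the sketch below (lengths and repeat counts are illustrative; note the write into b each pass, to stop the optimizer hoisting the loop-invariant dot product):

      program time_dot
      implicit none
      integer*4, parameter :: n = 2000, reps = 100000
      real*8    :: a(n), b(n), c
      integer*8 :: t0, t1, rate
      integer*4 :: j
      call random_number(a)
      call random_number(b)
      c = 0.0d0
      call system_clock(t0, rate)
      do j = 1, reps
         b(1) = real(j,8)*1.0d-9       ! defeat invariant hoisting
         c = c + dot_product(a, b)     ! the kernel under test
      end do
      call system_clock(t1)
      print *, 'sum =', c, '  seconds =', real(t1-t0,8)/real(rate,8)
      end program time_dot

Build it once with /O1 and once with /O3 /QxHost and compare the reported times.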
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,868 Views
TimP,

>> In general, in a skyline solver, it will be difficult to balance the work among the threads without losing cache locality

However, under specific situations, a parallel pipeline can maintain cache locality...


Read, read, read, read, read, read,
minimization and determination of equation profile, minimization...
assembly of complete "global" stiffness matrix (for set), assembly...
forward reduction of the "global" stiffness matrix, forward ...
assembly of the load vectors, assembly ...
forward reduction of the load vectors, forward...
backward substitution of the load vectors, backward ...
Write, write, write

The above illustrating an 8-stage parallel pipeline
Stage 4 being John's 3) with vec_sum and dot_product

A Core i7 4-core with HT could do a reasonable job of running this pipeline
A Core i7 6-core with HT would do better (pipeline has 6 computational pipes and 2 I/O pipes)
An 8-core or with or without HT may be slightly better than 6-core w/HT
A 2 x 4 core might be slightly worse if single memory subsystem
A 2 x 4 core could be better if multiple memory subsystems (e.g. Nehalem)

Depending on computational loads, some of the pipes may be broken into multiple pipeline stages (more HW threads).

Jim Dempsey
0 Kudos