Well, people have been

mohammad_b_ · ‎11-09-2018

I am trying to create a pointer with a specific construction. INTEGER, DIMENSION (:, :), POINTER :: sp INTEGER, TARGET :: s(0: lx+1, 0: ly+1) this is what I need sp(1:lx, 1:ly) => s(1:lx, 1:ly) sp(0, 1:ly) => s(lx, 1:ly) sp(lx+1, 1:ly) => s(1, 1:ly) sp(1:lx, 0) => s(1:lx, ly) sp(1:lx, ly) => s(1:lx, 1) but when I try it, I have two errors! error #8527: If a bound remapping list is specified, data target must be simply contiguous or of rank one error #8524: The syntax of this data pointer assignment is incorrect: either 'bound spec' or 'bound remapping' is expected in this context is it possible in FORTRAN?

mecej4 · ‎11-09-2018

What you seem to want, although you did not state so explicitly, is to have different portions of a 2-D pointer variable point to different sections of a 2-D target, and to do this simultaneously. That is not possible. Each pointer assignment of sp effectively erases the previous one. In F2008, you can do the following, but this code probably does not do all the things you want:

program ptr2
implicit none
integer, parameter :: LX = 5, LY=7
INTEGER, DIMENSION (:, :), POINTER :: sp
INTEGER, TARGET :: s(0: lx+1, 0: ly+1)

sp                  => s(1:lx, 1:ly)
sp(0:0 , 1:ly)      => s(lx, 1:ly)
sp(lx+1:lx+1, 1:ly) => s(1, 1:ly)
sp(1:lx, 0:0)       => s(1:lx, ly)
sp(1:lx, ly:ly)     => s(1:lx, 1)

print *,'Lower bounds: ',lbound(sp)
print *,'Upper bounds: ',ubound(sp)

end

jimdempseyatthecove · ‎11-09-2018

Or an alternate (not fully tested):

!  slice.f90 
!
!  FUNCTIONS:
!  slice - Entry point of console application.
!

!****************************************************************************
!
program slice

    implicit none
    integer, parameter :: lx = 9
    integer, parameter :: ly = 9
    INTEGER, DIMENSION (:, :), POINTER :: sp
    INTEGER, TARGET :: s(0: lx+1, 0: ly+1)
    INTEGER :: ix, iy
    
    do ix = lbound(s,1),ubound(s,1)
        do iy = lbound(s,2),ubound(s,2)
            s(ix,iy) = ix * 100 + iy
        end do
    end do
    
    associate(section => s(1:lx, 1:ly))
        print *,'s(1:lx, 1:ly)'
        sp(1:lx, 1:ly) => section
        print *, sp
        print *,'sp(',lbound(sp,1),':',ubound(sp,1),',',lbound(sp,2),':',ubound(sp,2),')'
        print *
    end associate
    
    associate(section => s(lx, 1:ly))
        print *,'s(lx, 1:ly)'
        sp(0:0, 1:ly) => section
        print *, sp
        print *,'sp(',lbound(sp,1),':',ubound(sp,1),',',lbound(sp,2),':',ubound(sp,2),')'
        print *
    end associate
    
    associate(section => s(1, 1:ly))
        print *,'s(1, 1:ly)'
        sp(lx+1:lx+1, 1:ly) => section
        print *, sp
        print *,'sp(',lbound(sp,1),':',ubound(sp,1),',',lbound(sp,2),':',ubound(sp,2),')'
        print *
    end associate
    
    associate(section => s(1:lx, ly))
        print *,'s(1:lx, ly)'
        sp(1:lx, 0:0) => section
        print *, sp
        print *,'sp(',lbound(sp,1),':',ubound(sp,1),',',lbound(sp,2),':',ubound(sp,2),')'
        print *
    end associate
    
    associate(section => s(1:lx, 1))
        print *,'s(1:lx, 1)'
        sp(1:lx, ly:ly) => section
        print *, sp
        print *,'sp(',lbound(sp,1),':',ubound(sp,1),',',lbound(sp,2),':',ubound(sp,2),')'
        print *
    end associate

end program slice

Output:

 s(1:lx, 1:ly)
         101         201         301         401         501         601
         701         801         901        1001           2         102
         202         302         402         502         602         702
         802         902        1002           3         103         203
         303         403         503         603         703         803
         903        1003           4         104         204         304
         404         504         604         704         804         904
        1004           5         105         205         305         405
         505         605         705         805         905        1005
           6         106         206         306         406         506
         606         706         806         906        1006           7
         107         207         307         407         507         607
         707         807         907        1007           8         108
         208         308         408
 sp(           1 :           9 ,           1 :           9 )

 s(lx, 1:ly)
         901         902         903         904         905         906
         907         908         909
 sp(           0 :           0 ,           1 :           9 )

 s(1, 1:ly)
         101         102         103         104         105         106
         107         108         109
 sp(          10 :          10 ,           1 :           9 )

 s(1:lx, ly)
         109         209         309         409         509         609
         709         809         909
 sp(           1 :           9 ,           0 :           0 )

 s(1:lx, 1)
         101         201         301         401         501         601
         701         801         901
 sp(           1 :           9 ,           9 :           9 )

jimdempseyatthecove · ‎11-09-2018

*** Note that the 2D section did not skip over the 10th (0:10) cell as wished for. (Observe the 10nn values in the first printout).

The single column or row expressions did what you want.

mohammad_b_ · ‎11-09-2018

let me to restate my problem: I want to do a simulation on ISING model. in this model, you have a two dimensional array such as INTEGER :: s(1:10,1:10) !spins array The algorithm is based on integrating over neighbors of random selected spin such as s(i,j), i.e k = s(i+1,j) + s(i-1,j) + s(i, j+1) + s(i,j-1) in addition we use periodic boundary condition.there is no problem if the selected spin is from the core of array s(2:9,2:9) . but when the selected spin is from the boundary, you have to apply the periodic boundary condition. for example if the selected spin is s(1,4), so we have k= s(2,4) + s(10,4) + s(1,5) + s(1,3) then for applying periodic boundary condition we have to check each selected spin that is it on the boundary? and this is time consuming part because you have to check four conditions (due to four boundaries) for each selected spin. I thought for applying this periodic boundary condition, it's better to use pointer with larger length in such way sp(1:10,1:10) => s(1:10,1:10) sp(0,1:10) => s(10,1:10) sp(11,1:10) => s(1,1:10) sp(1:10,0) => s(1:10,10) sp(1:10,11) => s(1:10,1) if it is possible in FORTRAN, it is easy to compute the neighbor of each cell (whether it's on the boundary or not) do you have a better solution for apply the periodic boundary condition? Thanks in advance

mecej4 · ‎11-10-2018

Well, people have been solving 2-D and 3-D field problems on computers for about six decades, so the problem that you describe (how to manage periodic B.C.) has several solutions. Please think about the suggestions that Jim D. gave you in #3. Here is another way of handling the periodic B.C. that you may assess for your application (you have to adjust the array bounds to suit):

Maintain four auxiliary 1-D integer arrays iw, ie, js, jn (the second letter stands for West, East, etc.). Let iw(1:11) = [10, 1, ... 9,10], and ie(1:11) = [2,3,...,10,11,1]. Define js and jn similarly. In the body of your calculations, use s(iw(i),j) instead of s(i-1,j) , s(i,in(j)) instead of s(i,j+1), and similarly for s(i,j-1) and s(i,j+1).

The auxiliary arrays iw, ie, js, jn need only be computed once in the beginning, as soon as you have decided the size of your grid. On the other hand, it may be faster to compute the elements of these arrays on the fly, using AVX instructions to process blocks of elements instead of one at a time.

jimdempseyatthecove · ‎11-10-2018

Mohammad,

What you are asking for is optimization advise for a performance problem you have identified, but showing us too little of the code and also not describing the runtime environment. For example:

You have not stated if s(0, iy) is a copy of the value stored in s(lx, iy) (same for the 3 other boundaries), or if these cells are simply padd cells for simplifying your programming.

If your programming a completely scalar algorithm that cannot be vectorized and/or on an older generation CPU without scatter/gather instructions .AND. if the boundary cells are copies of the opposing "wall", then consider removing the encapsulation buffer cells (0th, lx+1th, ly+1th), and simply use a MOD function on the index... sp(mod(i,lx)+1,mod(j,ly)+1))...

On the other hand, if the code can be vectorized and if you run on a CPU with scatter/gather, then you might consider using a section subscript list form of indexing an array (see Array Sections). Note, you have not stated as to if your spin of (1,4), result shown as k, is to be stored into an output array at location (1,4).

Example code for system with AVX512 with scatter/gather

Note for inclusion on this forum I have restricted the s array to dimension (5,6). your actual implementation is assumed to be much larger.

module mod_Spinner
    type Spinner_t
        integer :: nX, nY
        real, allocatable :: s(:,:)
        integer, allocatable :: sNorth(:), sSouth(:), sEast(:), sWest(:)
    end type Spinner_t
    !
    contains
    
    subroutine Spinner_init(this, nX, nY)
        implicit none
        type(Spinner_t) :: this
        integer :: nX, nY
        integer :: iX, iY, iLinear, iNorth, iSouth, iEast, iWest
        this%nX = nX
        this%nY = nY
        allocate(this%s(nX,nY), this%sNorth(nX*nY), this%sSouth(nX*nY), this%sEast(nX*nY), this%sWest(nX*nY))
        do iX = 1, nX
            do iY = 1, nY
                if(iX == 1) then
                    iWest = nX
                else
                    iWest = iX-1
                endif
                if(iX == nX) then
                    iEast = 1
                else
                    iEast = iX + 1
                endif
                if(iY == 1) then
                    iNorth = nY
                else
                    iNorth = iY - 1
                endif
                if(iY == nY) then
                    iSouth = 1
                else
                    iSouth = iY + 1
                endif
                iLinear = (iY - 1) * nX + iX
                this%sNorth(iLinear) = (iNorth - 1) * nX + iX
                this%sSouth(iLinear) = (iSouth - 1) * nX + iX
                this%sEast(iLinear) = (iY - 1) * nX + iEast
                this%sWest(iLinear) = (iY - 1) * nX + iWest
            end do
        end do
    end subroutine Spinner_init
    
    subroutine Spinner_spin(this, toThat)
        implicit none
        type(Spinner_t) :: this, toThat
        if(size(this%s) /= size(toThat%s)) stop "Sizes do not match"
        call spin(this%s, toThat%s)
    contains
        subroutine spin(thisLinear, toThatLinear)
            implicit none
            real :: thisLinear(this%nX*this%nY), toThatLinear(toThat%nX*toThat%nY)
        
            toThatLinear = thisLinear(this%sNorth) + thisLinear(this%sSouth) + thisLinear(this%sEast) + thisLinear(this%sWest)
        end subroutine spin
    end subroutine Spinner_spin
end module mod_Spinner
program Spinner
    use mod_Spinner
    implicit none
    type(Spinner_t) :: A, B
    integer, parameter :: nX = 5
    integer, parameter :: nY = 6
    integer :: iX, iY
    call Spinner_init(A, nX, nY)
    call Spinner_init(B, nX, nY)
    
    do iY = 1, nY
        do iX = 1, nX
            A%s(iX, iY) = iX * 10000 + iY
        end do
    end do
    
    call Spinner_spin(A, B)

    print *,'A'
    do iY = 1, nY
        print *, A%s(:,iY)
    end do
    print *
    print *,'B'
    do iY = 1, nY
        print *, B%s(:,iY)
    end do

    
end program Spinner

And output

 A
 10001.00 20001.00 30001.00 40001.00 50001.00
 10002.00 20002.00 30002.00 40002.00 50002.00
 10003.00 20003.00 30003.00 40003.00 50003.00
 10004.00 20004.00 30004.00 40004.00 50004.00
 10005.00 20005.00 30005.00 40005.00 50005.00
 10006.00 20006.00 30006.00 40006.00 50006.00

 B
 90010.00 80010.00 120010.0 160010.0 150010.0
 90008.00 80008.00 120008.0 160008.0 150008.0
 90012.00 80012.00 120012.0 160012.0 150012.0
 90016.00 80016.00 120016.0 160016.0 150016.0
 90020.00 80020.00 120020.0 160020.0 150020.0
 90018.00 80018.00 120018.0 160018.0 150018.0

And for large arrays, the inner most loop processing 16 floats per iteration is:

        vpcmpgtq  k1, zmm0, zmm3                                ;67.14 c1
        vpcmpgtq  k0, zmm0, zmm2                                ;67.14 c1
        add       rdx, 16                                       ;67.14 c1
        kunpckbw  k6, k0, k1                                    ;67.14 c3
        vmovaps   zmm17, zmm1                                   ;67.14 c3
        vmovaps   zmm18, zmm1                                   ;67.14 c3
        vmovdqu32 zmm5{k6}{z}, ZMMWORD PTR [rcx+r14]            ;67.14 c5
        vmovdqu32 zmm16{k6}{z}, ZMMWORD PTR [rcx+r11]           ;67.14 c5
        kmovw     k2, k6                                        ;67.14 c5
        kmovw     k3, k6                                        ;67.14 c5
        vmovaps   zmm21, zmm1                                   ;67.14 c5
        vmovaps   zmm24, zmm1                                   ;67.14 c5
        kmovw     k4, k6                                        ;67.14 c7
        kmovw     k5, k6                                        ;67.14 c7
        vpaddq    zmm3, zmm3, zmm4                              ;67.14 c7
        vpaddq    zmm2, zmm2, zmm4                              ;67.14 c7
        vmovdqu32 zmm19{k6}{z}, ZMMWORD PTR [rcx+rax]           ;67.14 c11 stall 1
        vmovdqu32 zmm22{k6}{z}, ZMMWORD PTR [rcx+r8]            ;67.14 c11
        vgatherdps zmm17{k2}, DWORD PTR [-4+r10+zmm5*4]         ;67.14 c17 stall 2
        vgatherdps zmm18{k3}, DWORD PTR [-4+r10+zmm16*4]        ;67.14 c17
        vaddps    zmm20, zmm17, zmm18                           ;67.14 c23 stall 2
        vgatherdps zmm21{k4}, DWORD PTR [-4+r10+zmm19*4]        ;67.14 c23
        vgatherdps zmm24{k5}, DWORD PTR [-4+r10+zmm22*4]        ;67.14 c23
        vaddps    zmm23, zmm20, zmm21                           ;67.14 c29 stall 2
        vaddps    zmm25, zmm23, zmm24                           ;67.14 c35 stall 2
        vmovups   ZMMWORD PTR [rcx+rbx]{k6}, zmm25              ;67.14 c41 stall 2
        add       rcx, 64                                       ;67.14 c41
        cmp       rdx, rsi                                      ;67.14 c41
        jb        .B2.17        ; Prob 82%                      ;67.14 c43

Only 29 instructions, 6 8 reads, 1 write to output 16 results.

On CPU's that do not have AVX512, or for small arrays, this technique might not work as well. You will have to run tests to make this determination.

Jim Dempsey

jimdempseyatthecove · ‎11-10-2018

If you build for a CPU with AVX2

.B2.4::                         ; Preds .B2.4 .B2.3
                                ; Execution count [5.00e+000]
        movsxd    rdx, DWORD PTR [rbx+r9*4]                     ;67.14
        inc       r9                                            ;67.14
        movsxd    r15, DWORD PTR [r13+r8*4]                     ;67.14
        inc       r8                                            ;67.14
        vmovss    xmm0, DWORD PTR [-4+r12+rdx*4]                ;67.14
        movsxd    rdx, DWORD PTR [r14+rdi*4]                    ;67.14
        inc       rdi                                           ;67.14
        vaddss    xmm1, xmm0, DWORD PTR [-4+r12+r15*4]          ;67.14
        movsxd    r15, DWORD PTR [rax+rsi*4]                    ;67.14
        inc       rsi                                           ;67.14
        vaddss    xmm2, xmm1, DWORD PTR [-4+r12+rdx*4]          ;67.14
        vaddss    xmm3, xmm2, DWORD PTR [-4+r12+r15*4]          ;67.14
        vmovss    DWORD PTR [-4+r11+rcx*4], xmm3                ;67.14
        inc       rcx                                           ;67.14
        cmp       rcx, r10                                      ;67.14
        jle       .B2.4         ; Prob 82%                      ;67.14

The inner loop is 16 instructions, 7 reads, 1 write **** to output only 1 result.

When using this CPU it might be faster to use the MOD function discussed earlier.

Jim Dempsey

jimdempseyatthecove · ‎11-10-2018

Mecej4's suggestion is similar in that he suggest that the array section array represents a single row whereas my suggestion represents the entire array. At to which is faster, this will have to be tested.

The winner will depend on the array size, number of memory channels and memory streaming read capability as well as cache sizes.

Jim Dempsey

jimdempseyatthecove · ‎11-10-2018

Here is a MOD implimentation

    
    subroutine Spinner_spinMod(this, toThat)
        implicit none
        type(Spinner_t) :: this, toThat
        if(size(this%s) /= size(toThat%s)) stop "Sizes do not match"
        call spin(this%s, toThat%s)
    contains
        subroutine spin(thisLinear, toThatLinear)
            implicit none
            real :: thisLinear(0:this%nX*this%nY-1), toThatLinear(0:toThat%nX*toThat%nY-1)
            integer :: iX, iY, iNorth, iSouth, iEast, iWest
            do iY = 0, this%nY - 1
                iNorth = mod(iY + this%nY - 1, this%nY)
                iSouth = mod(iY + this%nY + 1, this%nY)
                do iX = 0, this%nX - 1
                    iEast = mod(iX + this%nX + 1, this%nX)
                    iWest = mod(iX + this%nX - 1, this%nX)
                    toThatLinear(iY * this%nX + iX) = thisLinear(iNorth * this%nX + iX) + thisLinear(iSouth * this%nX + iX) + thisLinear(iY * this%nX + iEast) + thisLinear(iY * this%nX + iWest)
                end do
            end do
        end subroutine spin
    end subroutine Spinner_spinMod

Code generated on AVX2 system for inner loop

        movss     xmm0, DWORD PTR [rsi+r10*4]                   ;82.14
        lea       eax, DWORD PTR [1+r12+r10]                    ;82.14
        cdq                                                     ;82.14
        addss     xmm0, DWORD PTR [rbx+r10*4]                   ;82.14
        idiv      r12d                                          ;82.14
        lea       eax, DWORD PTR [-1+r12+r10]                   ;82.14
        add       edx, r14d                                     ;82.14
        movsxd    rdx, edx                                      ;82.14
        addss     xmm0, DWORD PTR [r13+rdx*4]                   ;82.14
        cdq                                                     ;82.14
        idiv      r12d                                          ;82.14
        add       edx, r14d                                     ;82.14
        movsxd    rdx, edx                                      ;82.14
        addss     xmm0, DWORD PTR [r13+rdx*4]                   ;82.14
        movss     DWORD PTR [r11+r10*4], xmm0                   ;82.14
        inc       r10                                           ;82.14
        cmp       r10, rbp                                      ;82.14
        jle       .B3.6         ; Prob 82%                      ;82.14

18 instructions, 4 reads, 1 write per 1 output cell. This is likely faster on AVX2 system.

Jim Dempsey

jimdempseyatthecove · ‎11-11-2018

Is the #7 (AVX512) posting the best that can be done?

Most likely not, for larger problems the index arrays would not fit in L1 cache or possibly L2 cache, although they are candidates for streaming reads, and which the hardware pre-fetch might detect and engage. A slight modification of this technique would be to make three sets of indexing arrays: one set for first row, one set for middle rows, and one set for last row (essentially what mecej4 suggested).

A further improvement for AVX512, using more complex code, would be constructed using the knowledge that each interior SIMD vector would have the same relative offsets to the start of the vector. IOW you could eliminate the gather by using 4 base indices for input and one for output (and handle the perimeters separately).

An additional consideration is parallelization (useful for larger problem sets). However, when considering the whole problem (large problem, many threads available, many time steps in simulation, ...) an entirely different approach migh be useful as with:

High Performance Parallelism Pearls, Chapter 5, Plesiochronous Phasing Barriers

Jim Dempsey

mohammad_b_ · ‎11-12-2018

mecej4 wrote:

Well, people have been solving 2-D and 3-D field problems on computers for about six decades, so the problem that you describe (how to manage periodic B.C.) has several solutions. Please think about the suggestions that Jim D. gave you in #3. Here is another way of handling the periodic B.C. that you may assess for your application (you have to adjust the array bounds to suit):

Maintain four auxiliary 1-D integer arrays iw, ie, js, jn (the second letter stands for West, East, etc.). Let iw(1:11) = [10, 1, ... 9,10], and ie(1:11) = [2,3,...,10,11,1]. Define js and jn similarly. In the body of your calculations, use s(iw(i),j) instead of s(i-1,j) , s(i,in(j)) instead of s(i,j+1), and similarly for s(i,j-1) and s(i,j+1).

The auxiliary arrays iw, ie, js, jn need only be computed once in the beginning, as soon as you have decided the size of your grid. On the other hand, it may be faster to compute the elements of these arrays on the fly, using AVX instructions to process blocks of elements instead of one at a time.

thank you for fascinating suggestion iw, ie, .... I think this solution improve my code performance because as you mentioned it only computed once at the beginning. in addition you have mentioned that the P.B.C problem has several solutions. if it's possible for you, please introduce me some references.

mohammad_b_ · ‎11-12-2018

jimdempseyatthecove wrote:

Mohammad,

What you are asking for is optimization advise for a performance problem you have identified, but showing us too little of the code and also not describing the runtime environment. For example:

You have not stated if s(0, iy) is a copy of the value stored in s(lx, iy) (same for the 3 other boundaries), or if these cells are simply padd cells for simplifying your programming.

If your programming a completely scalar algorithm that cannot be vectorized and/or on an older generation CPU without scatter/gather instructions .AND. if the boundary cells are copies of the opposing "wall", then consider removing the encapsulation buffer cells (0th, lx+1th, ly+1th), and simply use a MOD function on the index... sp(mod(i,lx)+1,mod(j,ly)+1))...

On the other hand, if the code can be vectorized and if you run on a CPU with scatter/gather, then you might consider using a section subscript list form of indexing an array (see Array Sections). Note, you have not stated as to if your spin of (1,4), result shown as k, is to be stored into an output array at location (1,4).

Example code for system with AVX512 with scatter/gather

Note for inclusion on this forum I have restricted the s array to dimension (5,6). your actual implementation is assumed to be much larger.

module mod_Spinner
    type Spinner_t
        integer :: nX, nY
        real, allocatable :: s(:,:)
        integer, allocatable :: sNorth(:), sSouth(:), sEast(:), sWest(:)
    end type Spinner_t
    !
    contains
    
    subroutine Spinner_init(this, nX, nY)
        implicit none
        type(Spinner_t) :: this
        integer :: nX, nY
        integer :: iX, iY, iLinear, iNorth, iSouth, iEast, iWest
        this%nX = nX
        this%nY = nY
        allocate(this%s(nX,nY), this%sNorth(nX*nY), this%sSouth(nX*nY), this%sEast(nX*nY), this%sWest(nX*nY))
        do iX = 1, nX
            do iY = 1, nY
                if(iX == 1) then
                    iWest = nX
                else
                    iWest = iX-1
                endif
                if(iX == nX) then
                    iEast = 1
                else
                    iEast = iX + 1
                endif
                if(iY == 1) then
                    iNorth = nY
                else
                    iNorth = iY - 1
                endif
                if(iY == nY) then
                    iSouth = 1
                else
                    iSouth = iY + 1
                endif
                iLinear = (iY - 1) * nX + iX
                this%sNorth(iLinear) = (iNorth - 1) * nX + iX
                this%sSouth(iLinear) = (iSouth - 1) * nX + iX
                this%sEast(iLinear) = (iY - 1) * nX + iEast
                this%sWest(iLinear) = (iY - 1) * nX + iWest
            end do
        end do
    end subroutine Spinner_init
    
    subroutine Spinner_spin(this, toThat)
        implicit none
        type(Spinner_t) :: this, toThat
        if(size(this%s) /= size(toThat%s)) stop "Sizes do not match"
        call spin(this%s, toThat%s)
    contains
        subroutine spin(thisLinear, toThatLinear)
            implicit none
            real :: thisLinear(this%nX*this%nY), toThatLinear(toThat%nX*toThat%nY)
        
            toThatLinear = thisLinear(this%sNorth) + thisLinear(this%sSouth) + thisLinear(this%sEast) + thisLinear(this%sWest)
        end subroutine spin
    end subroutine Spinner_spin
end module mod_Spinner
program Spinner
    use mod_Spinner
    implicit none
    type(Spinner_t) :: A, B
    integer, parameter :: nX = 5
    integer, parameter :: nY = 6
    integer :: iX, iY
    call Spinner_init(A, nX, nY)
    call Spinner_init(B, nX, nY)
    
    do iY = 1, nY
        do iX = 1, nX
            A%s(iX, iY) = iX * 10000 + iY
        end do
    end do
    
    call Spinner_spin(A, B)

    print *,'A'
    do iY = 1, nY
        print *, A%s(:,iY)
    end do
    print *
    print *,'B'
    do iY = 1, nY
        print *, B%s(:,iY)
    end do

    
end program Spinner

And output

 A
 10001.00 20001.00 30001.00 40001.00 50001.00
 10002.00 20002.00 30002.00 40002.00 50002.00
 10003.00 20003.00 30003.00 40003.00 50003.00
 10004.00 20004.00 30004.00 40004.00 50004.00
 10005.00 20005.00 30005.00 40005.00 50005.00
 10006.00 20006.00 30006.00 40006.00 50006.00

 B
 90010.00 80010.00 120010.0 160010.0 150010.0
 90008.00 80008.00 120008.0 160008.0 150008.0
 90012.00 80012.00 120012.0 160012.0 150012.0
 90016.00 80016.00 120016.0 160016.0 150016.0
 90020.00 80020.00 120020.0 160020.0 150020.0
 90018.00 80018.00 120018.0 160018.0 150018.0

And for large arrays, the inner most loop processing 16 floats per iteration is:

        vpcmpgtq  k1, zmm0, zmm3                                ;67.14 c1
        vpcmpgtq  k0, zmm0, zmm2                                ;67.14 c1
        add       rdx, 16                                       ;67.14 c1
        kunpckbw  k6, k0, k1                                    ;67.14 c3
        vmovaps   zmm17, zmm1                                   ;67.14 c3
        vmovaps   zmm18, zmm1                                   ;67.14 c3
        vmovdqu32 zmm5{k6}{z}, ZMMWORD PTR [rcx+r14]            ;67.14 c5
        vmovdqu32 zmm16{k6}{z}, ZMMWORD PTR [rcx+r11]           ;67.14 c5
        kmovw     k2, k6                                        ;67.14 c5
        kmovw     k3, k6                                        ;67.14 c5
        vmovaps   zmm21, zmm1                                   ;67.14 c5
        vmovaps   zmm24, zmm1                                   ;67.14 c5
        kmovw     k4, k6                                        ;67.14 c7
        kmovw     k5, k6                                        ;67.14 c7
        vpaddq    zmm3, zmm3, zmm4                              ;67.14 c7
        vpaddq    zmm2, zmm2, zmm4                              ;67.14 c7
        vmovdqu32 zmm19{k6}{z}, ZMMWORD PTR [rcx+rax]           ;67.14 c11 stall 1
        vmovdqu32 zmm22{k6}{z}, ZMMWORD PTR [rcx+r8]            ;67.14 c11
        vgatherdps zmm17{k2}, DWORD PTR [-4+r10+zmm5*4]         ;67.14 c17 stall 2
        vgatherdps zmm18{k3}, DWORD PTR [-4+r10+zmm16*4]        ;67.14 c17
        vaddps    zmm20, zmm17, zmm18                           ;67.14 c23 stall 2
        vgatherdps zmm21{k4}, DWORD PTR [-4+r10+zmm19*4]        ;67.14 c23
        vgatherdps zmm24{k5}, DWORD PTR [-4+r10+zmm22*4]        ;67.14 c23
        vaddps    zmm23, zmm20, zmm21                           ;67.14 c29 stall 2
        vaddps    zmm25, zmm23, zmm24                           ;67.14 c35 stall 2
        vmovups   ZMMWORD PTR [rcx+rbx]{k6}, zmm25              ;67.14 c41 stall 2
        add       rcx, 64                                       ;67.14 c41
        cmp       rdx, rsi                                      ;67.14 c41
        jb        .B2.17        ; Prob 82%                      ;67.14 c43

Only 29 instructions, 6 8 reads, 1 write to output 16 results.

On CPU's that do not have AVX512, or for small arrays, this technique might not work as well. You will have to run tests to make this determination.

Jim Dempsey

First of all I should appreciate for your suggestion, but I am not professional in coding and I don't know what is AVX512. what I understand from your suggestion is that you analyses the computational price of your code. how you did this? I think this is high valuable to find which part of your code take more time. my compiler version is Intel 64 compiler Xe version 15.0.090 if it is possible for you please introduce me some intermediate book about professional programing in FORTRAN language. thank you in advance

jimdempseyatthecove · ‎11-12-2018

A lot of the optimal technique is dependent upon the CPU that will run the program. While one can write a generic algorithm for any x64 CPU, targeting a specific CPU, especially for a simulation program that may take hours, days, months to run can significantly reduce the time to completion.

Can you provide information about the system(s) that will run the simulation(s) (CPU model number, number of CPU sockets, number of cores per CPU, RAM and if applicable number of systems in cluster. Also, state the typical problem size. Your sample program has relatively small arrays, is this typical of your actual application?

Note, the strategy for optimal tuning of a small data set is quite different than that for a medium, and that for a large data set.

In looking at your ising.f90, you should be aware that RANDOM_NUMBER can return an array of random numbers more efficiently than returning the same number of random numbers individually.

For example:

Replace:

DO i= 1, lx
  DO j= 1, ly
    CALL RANDOM_NUMBER(x)
    s(i,j)= -1
    IF ( x .LT. 0.5) s(i,j)= 1
  ENDDO
ENDDO

With:

real :: rn(lx, ly)
...
call random_number(rn)
where(rn .lt. 0.5)
  s = 1.0
elsewhere
  s = -1.0
endwhere

You can use RANDOM_NUMBER in a similar manner to return an array of results in the subroutine monte_carlo. Then substitute the DO loop with WHERE syntax.

We've already covered how to convert the subroutine boundary into an array of indices for use in indexed form of subscripting.

The intent of the replies on this forum is to inform you of techniques to improve your programming, and is not intended to re-write your programs. You should have sufficient information by this time to think about how to improve your program.

Jim Dempsey

mohammad_b_ · ‎11-12-2018

my system information is cpu model = Intel(R) Core-TM i3-2348M 2.3 GHz 2 cores and 4 Logical process RAM= 4.00 GB (3.86 GB usable)

jimdempseyatthecove · ‎11-12-2018

Typing in i3-2348M to Google search yields an ark.intel.com posting about your CPU. It has the first generation AVX instructions. It does not have the scatter/gather nor masked move capabilities of later architectures.

Possible problems with the code you posted

            x= x*lx
            y= y*lx ! *** this should be y*ly lx might not always equal ly in later revisions

Also,

DO t= 3., 0.01, -0.1
DO loops of type real is an obsolete (deleted) feature of Fortran

I surmise you are taking a Computer Science test. What do you think you can use to replace the obsolete DO loop?

Jim Dempsey

mecej4 · ‎11-12-2018

If the arrays x and s have the same shape, namely, (lx, ly), the DO loops displayed in #14 can be replaced with

call random_number(x)
s = 2*x - 1

Pointer