- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
What you seem to want, although you did not state so explicitly, is to have different portions of a 2-D pointer variable point to different sections of a 2-D target, and to do this simultaneously. That is not possible. Each pointer assignment of sp effectively erases the previous one. In F2008, you can do the following, but this code probably does not do all the things you want:
program ptr2 implicit none integer, parameter :: LX = 5, LY=7 INTEGER, DIMENSION (:, :), POINTER :: sp INTEGER, TARGET :: s(0: lx+1, 0: ly+1) sp => s(1:lx, 1:ly) sp(0:0 , 1:ly) => s(lx, 1:ly) sp(lx+1:lx+1, 1:ly) => s(1, 1:ly) sp(1:lx, 0:0) => s(1:lx, ly) sp(1:lx, ly:ly) => s(1:lx, 1) print *,'Lower bounds: ',lbound(sp) print *,'Upper bounds: ',ubound(sp) end
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Or an alternate (not fully tested):
! slice.f90 ! ! FUNCTIONS: ! slice - Entry point of console application. ! !**************************************************************************** ! program slice implicit none integer, parameter :: lx = 9 integer, parameter :: ly = 9 INTEGER, DIMENSION (:, :), POINTER :: sp INTEGER, TARGET :: s(0: lx+1, 0: ly+1) INTEGER :: ix, iy do ix = lbound(s,1),ubound(s,1) do iy = lbound(s,2),ubound(s,2) s(ix,iy) = ix * 100 + iy end do end do associate(section => s(1:lx, 1:ly)) print *,'s(1:lx, 1:ly)' sp(1:lx, 1:ly) => section print *, sp print *,'sp(',lbound(sp,1),':',ubound(sp,1),',',lbound(sp,2),':',ubound(sp,2),')' print * end associate associate(section => s(lx, 1:ly)) print *,'s(lx, 1:ly)' sp(0:0, 1:ly) => section print *, sp print *,'sp(',lbound(sp,1),':',ubound(sp,1),',',lbound(sp,2),':',ubound(sp,2),')' print * end associate associate(section => s(1, 1:ly)) print *,'s(1, 1:ly)' sp(lx+1:lx+1, 1:ly) => section print *, sp print *,'sp(',lbound(sp,1),':',ubound(sp,1),',',lbound(sp,2),':',ubound(sp,2),')' print * end associate associate(section => s(1:lx, ly)) print *,'s(1:lx, ly)' sp(1:lx, 0:0) => section print *, sp print *,'sp(',lbound(sp,1),':',ubound(sp,1),',',lbound(sp,2),':',ubound(sp,2),')' print * end associate associate(section => s(1:lx, 1)) print *,'s(1:lx, 1)' sp(1:lx, ly:ly) => section print *, sp print *,'sp(',lbound(sp,1),':',ubound(sp,1),',',lbound(sp,2),':',ubound(sp,2),')' print * end associate end program slice
Output:
s(1:lx, 1:ly) 101 201 301 401 501 601 701 801 901 1001 2 102 202 302 402 502 602 702 802 902 1002 3 103 203 303 403 503 603 703 803 903 1003 4 104 204 304 404 504 604 704 804 904 1004 5 105 205 305 405 505 605 705 805 905 1005 6 106 206 306 406 506 606 706 806 906 1006 7 107 207 307 407 507 607 707 807 907 1007 8 108 208 308 408 sp( 1 : 9 , 1 : 9 ) s(lx, 1:ly) 901 902 903 904 905 906 907 908 909 sp( 0 : 0 , 1 : 9 ) s(1, 1:ly) 101 102 103 104 105 106 107 108 109 sp( 10 : 10 , 1 : 9 ) s(1:lx, ly) 109 209 309 409 509 609 709 809 909 sp( 1 : 9 , 0 : 0 ) s(1:lx, 1) 101 201 301 401 501 601 701 801 901 sp( 1 : 9 , 9 : 9 )
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
*** Note that the 2D section did not skip over the 10th (0:10) cell as wished for. (Observe the 10nn values in the first printout).
The single column or row expressions did what you want.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Well, people have been solving 2-D and 3-D field problems on computers for about six decades, so the problem that you describe (how to manage periodic B.C.) has several solutions. Please think about the suggestions that Jim D. gave you in #3. Here is another way of handling the periodic B.C. that you may assess for your application (you have to adjust the array bounds to suit):
Maintain four auxiliary 1-D integer arrays iw, ie, js, jn (the second letter stands for West, East, etc.). Let iw(1:11) = [10, 1, ... 9,10], and ie(1:11) = [2,3,...,10,11,1]. Define js and jn similarly. In the body of your calculations, use s(iw(i),j) instead of s(i-1,j) , s(i,in(j)) instead of s(i,j+1), and similarly for s(i,j-1) and s(i,j+1).
The auxiliary arrays iw, ie, js, jn need only be computed once in the beginning, as soon as you have decided the size of your grid. On the other hand, it may be faster to compute the elements of these arrays on the fly, using AVX instructions to process blocks of elements instead of one at a time.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Mohammad,
What you are asking for is optimization advise for a performance problem you have identified, but showing us too little of the code and also not describing the runtime environment. For example:
You have not stated if s(0, iy) is a copy of the value stored in s(lx, iy) (same for the 3 other boundaries), or if these cells are simply padd cells for simplifying your programming.
If your programming a completely scalar algorithm that cannot be vectorized and/or on an older generation CPU without scatter/gather instructions .AND. if the boundary cells are copies of the opposing "wall", then consider removing the encapsulation buffer cells (0th, lx+1th, ly+1th), and simply use a MOD function on the index... sp(mod(i,lx)+1,mod(j,ly)+1))...
On the other hand, if the code can be vectorized and if you run on a CPU with scatter/gather, then you might consider using a section subscript list form of indexing an array (see Array Sections). Note, you have not stated as to if your spin of (1,4), result shown as k, is to be stored into an output array at location (1,4).
Example code for system with AVX512 with scatter/gather
Note for inclusion on this forum I have restricted the s array to dimension (5,6). your actual implementation is assumed to be much larger.
module mod_Spinner type Spinner_t integer :: nX, nY real, allocatable :: s(:,:) integer, allocatable :: sNorth(:), sSouth(:), sEast(:), sWest(:) end type Spinner_t ! contains subroutine Spinner_init(this, nX, nY) implicit none type(Spinner_t) :: this integer :: nX, nY integer :: iX, iY, iLinear, iNorth, iSouth, iEast, iWest this%nX = nX this%nY = nY allocate(this%s(nX,nY), this%sNorth(nX*nY), this%sSouth(nX*nY), this%sEast(nX*nY), this%sWest(nX*nY)) do iX = 1, nX do iY = 1, nY if(iX == 1) then iWest = nX else iWest = iX-1 endif if(iX == nX) then iEast = 1 else iEast = iX + 1 endif if(iY == 1) then iNorth = nY else iNorth = iY - 1 endif if(iY == nY) then iSouth = 1 else iSouth = iY + 1 endif iLinear = (iY - 1) * nX + iX this%sNorth(iLinear) = (iNorth - 1) * nX + iX this%sSouth(iLinear) = (iSouth - 1) * nX + iX this%sEast(iLinear) = (iY - 1) * nX + iEast this%sWest(iLinear) = (iY - 1) * nX + iWest end do end do end subroutine Spinner_init subroutine Spinner_spin(this, toThat) implicit none type(Spinner_t) :: this, toThat if(size(this%s) /= size(toThat%s)) stop "Sizes do not match" call spin(this%s, toThat%s) contains subroutine spin(thisLinear, toThatLinear) implicit none real :: thisLinear(this%nX*this%nY), toThatLinear(toThat%nX*toThat%nY) toThatLinear = thisLinear(this%sNorth) + thisLinear(this%sSouth) + thisLinear(this%sEast) + thisLinear(this%sWest) end subroutine spin end subroutine Spinner_spin end module mod_Spinner program Spinner use mod_Spinner implicit none type(Spinner_t) :: A, B integer, parameter :: nX = 5 integer, parameter :: nY = 6 integer :: iX, iY call Spinner_init(A, nX, nY) call Spinner_init(B, nX, nY) do iY = 1, nY do iX = 1, nX A%s(iX, iY) = iX * 10000 + iY end do end do call Spinner_spin(A, B) print *,'A' do iY = 1, nY print *, A%s(:,iY) end do print * print *,'B' do iY = 1, nY print *, B%s(:,iY) end do end program Spinner
And output
A 10001.00 20001.00 30001.00 40001.00 50001.00 10002.00 20002.00 30002.00 40002.00 50002.00 10003.00 20003.00 30003.00 40003.00 50003.00 10004.00 20004.00 30004.00 40004.00 50004.00 10005.00 20005.00 30005.00 40005.00 50005.00 10006.00 20006.00 30006.00 40006.00 50006.00 B 90010.00 80010.00 120010.0 160010.0 150010.0 90008.00 80008.00 120008.0 160008.0 150008.0 90012.00 80012.00 120012.0 160012.0 150012.0 90016.00 80016.00 120016.0 160016.0 150016.0 90020.00 80020.00 120020.0 160020.0 150020.0 90018.00 80018.00 120018.0 160018.0 150018.0
And for large arrays, the inner most loop processing 16 floats per iteration is:
vpcmpgtq k1, zmm0, zmm3 ;67.14 c1 vpcmpgtq k0, zmm0, zmm2 ;67.14 c1 add rdx, 16 ;67.14 c1 kunpckbw k6, k0, k1 ;67.14 c3 vmovaps zmm17, zmm1 ;67.14 c3 vmovaps zmm18, zmm1 ;67.14 c3 vmovdqu32 zmm5{k6}{z}, ZMMWORD PTR [rcx+r14] ;67.14 c5 vmovdqu32 zmm16{k6}{z}, ZMMWORD PTR [rcx+r11] ;67.14 c5 kmovw k2, k6 ;67.14 c5 kmovw k3, k6 ;67.14 c5 vmovaps zmm21, zmm1 ;67.14 c5 vmovaps zmm24, zmm1 ;67.14 c5 kmovw k4, k6 ;67.14 c7 kmovw k5, k6 ;67.14 c7 vpaddq zmm3, zmm3, zmm4 ;67.14 c7 vpaddq zmm2, zmm2, zmm4 ;67.14 c7 vmovdqu32 zmm19{k6}{z}, ZMMWORD PTR [rcx+rax] ;67.14 c11 stall 1 vmovdqu32 zmm22{k6}{z}, ZMMWORD PTR [rcx+r8] ;67.14 c11 vgatherdps zmm17{k2}, DWORD PTR [-4+r10+zmm5*4] ;67.14 c17 stall 2 vgatherdps zmm18{k3}, DWORD PTR [-4+r10+zmm16*4] ;67.14 c17 vaddps zmm20, zmm17, zmm18 ;67.14 c23 stall 2 vgatherdps zmm21{k4}, DWORD PTR [-4+r10+zmm19*4] ;67.14 c23 vgatherdps zmm24{k5}, DWORD PTR [-4+r10+zmm22*4] ;67.14 c23 vaddps zmm23, zmm20, zmm21 ;67.14 c29 stall 2 vaddps zmm25, zmm23, zmm24 ;67.14 c35 stall 2 vmovups ZMMWORD PTR [rcx+rbx]{k6}, zmm25 ;67.14 c41 stall 2 add rcx, 64 ;67.14 c41 cmp rdx, rsi ;67.14 c41 jb .B2.17 ; Prob 82% ;67.14 c43
Only 29 instructions, 6 8 reads, 1 write to output 16 results.
On CPU's that do not have AVX512, or for small arrays, this technique might not work as well. You will have to run tests to make this determination.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
If you build for a CPU with AVX2
.B2.4:: ; Preds .B2.4 .B2.3 ; Execution count [5.00e+000] movsxd rdx, DWORD PTR [rbx+r9*4] ;67.14 inc r9 ;67.14 movsxd r15, DWORD PTR [r13+r8*4] ;67.14 inc r8 ;67.14 vmovss xmm0, DWORD PTR [-4+r12+rdx*4] ;67.14 movsxd rdx, DWORD PTR [r14+rdi*4] ;67.14 inc rdi ;67.14 vaddss xmm1, xmm0, DWORD PTR [-4+r12+r15*4] ;67.14 movsxd r15, DWORD PTR [rax+rsi*4] ;67.14 inc rsi ;67.14 vaddss xmm2, xmm1, DWORD PTR [-4+r12+rdx*4] ;67.14 vaddss xmm3, xmm2, DWORD PTR [-4+r12+r15*4] ;67.14 vmovss DWORD PTR [-4+r11+rcx*4], xmm3 ;67.14 inc rcx ;67.14 cmp rcx, r10 ;67.14 jle .B2.4 ; Prob 82% ;67.14
The inner loop is 16 instructions, 7 reads, 1 write **** to output only 1 result.
When using this CPU it might be faster to use the MOD function discussed earlier.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Mecej4's suggestion is similar in that he suggest that the array section array represents a single row whereas my suggestion represents the entire array. At to which is faster, this will have to be tested.
The winner will depend on the array size, number of memory channels and memory streaming read capability as well as cache sizes.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Here is a MOD implimentation
subroutine Spinner_spinMod(this, toThat) implicit none type(Spinner_t) :: this, toThat if(size(this%s) /= size(toThat%s)) stop "Sizes do not match" call spin(this%s, toThat%s) contains subroutine spin(thisLinear, toThatLinear) implicit none real :: thisLinear(0:this%nX*this%nY-1), toThatLinear(0:toThat%nX*toThat%nY-1) integer :: iX, iY, iNorth, iSouth, iEast, iWest do iY = 0, this%nY - 1 iNorth = mod(iY + this%nY - 1, this%nY) iSouth = mod(iY + this%nY + 1, this%nY) do iX = 0, this%nX - 1 iEast = mod(iX + this%nX + 1, this%nX) iWest = mod(iX + this%nX - 1, this%nX) toThatLinear(iY * this%nX + iX) = thisLinear(iNorth * this%nX + iX) + thisLinear(iSouth * this%nX + iX) + thisLinear(iY * this%nX + iEast) + thisLinear(iY * this%nX + iWest) end do end do end subroutine spin end subroutine Spinner_spinMod
Code generated on AVX2 system for inner loop
movss xmm0, DWORD PTR [rsi+r10*4] ;82.14 lea eax, DWORD PTR [1+r12+r10] ;82.14 cdq ;82.14 addss xmm0, DWORD PTR [rbx+r10*4] ;82.14 idiv r12d ;82.14 lea eax, DWORD PTR [-1+r12+r10] ;82.14 add edx, r14d ;82.14 movsxd rdx, edx ;82.14 addss xmm0, DWORD PTR [r13+rdx*4] ;82.14 cdq ;82.14 idiv r12d ;82.14 add edx, r14d ;82.14 movsxd rdx, edx ;82.14 addss xmm0, DWORD PTR [r13+rdx*4] ;82.14 movss DWORD PTR [r11+r10*4], xmm0 ;82.14 inc r10 ;82.14 cmp r10, rbp ;82.14 jle .B3.6 ; Prob 82% ;82.14
18 instructions, 4 reads, 1 write per 1 output cell. This is likely faster on AVX2 system.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Is the #7 (AVX512) posting the best that can be done?
Most likely not, for larger problems the index arrays would not fit in L1 cache or possibly L2 cache, although they are candidates for streaming reads, and which the hardware pre-fetch might detect and engage. A slight modification of this technique would be to make three sets of indexing arrays: one set for first row, one set for middle rows, and one set for last row (essentially what mecej4 suggested).
A further improvement for AVX512, using more complex code, would be constructed using the knowledge that each interior SIMD vector would have the same relative offsets to the start of the vector. IOW you could eliminate the gather by using 4 base indices for input and one for output (and handle the perimeters separately).
An additional consideration is parallelization (useful for larger problem sets). However, when considering the whole problem (large problem, many threads available, many time steps in simulation, ...) an entirely different approach migh be useful as with:
High Performance Parallelism Pearls, Chapter 5, Plesiochronous Phasing Barriers
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
mecej4 wrote:thank you for fascinating suggestion iw, ie, .... I think this solution improve my code performance because as you mentioned it only computed once at the beginning. in addition you have mentioned that the P.B.C problem has several solutions. if it's possible for you, please introduce me some references.
Well, people have been solving 2-D and 3-D field problems on computers for about six decades, so the problem that you describe (how to manage periodic B.C.) has several solutions. Please think about the suggestions that Jim D. gave you in #3. Here is another way of handling the periodic B.C. that you may assess for your application (you have to adjust the array bounds to suit):
Maintain four auxiliary 1-D integer arrays iw, ie, js, jn (the second letter stands for West, East, etc.). Let iw(1:11) = [10, 1, ... 9,10], and ie(1:11) = [2,3,...,10,11,1]. Define js and jn similarly. In the body of your calculations, use s(iw(i),j) instead of s(i-1,j) , s(i,in(j)) instead of s(i,j+1), and similarly for s(i,j-1) and s(i,j+1).
The auxiliary arrays iw, ie, js, jn need only be computed once in the beginning, as soon as you have decided the size of your grid. On the other hand, it may be faster to compute the elements of these arrays on the fly, using AVX instructions to process blocks of elements instead of one at a time.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
jimdempseyatthecove wrote:First of all I should appreciate for your suggestion, but I am not professional in coding and I don't know what is AVX512. what I understand from your suggestion is that you analyses the computational price of your code. how you did this? I think this is high valuable to find which part of your code take more time. my compiler version is Intel 64 compiler Xe version 15.0.090 if it is possible for you please introduce me some intermediate book about professional programing in FORTRAN language. thank you in advance
Mohammad,
What you are asking for is optimization advise for a performance problem you have identified, but showing us too little of the code and also not describing the runtime environment. For example:
You have not stated if s(0, iy) is a copy of the value stored in s(lx, iy) (same for the 3 other boundaries), or if these cells are simply padd cells for simplifying your programming.
If your programming a completely scalar algorithm that cannot be vectorized and/or on an older generation CPU without scatter/gather instructions .AND. if the boundary cells are copies of the opposing "wall", then consider removing the encapsulation buffer cells (0th, lx+1th, ly+1th), and simply use a MOD function on the index... sp(mod(i,lx)+1,mod(j,ly)+1))...
On the other hand, if the code can be vectorized and if you run on a CPU with scatter/gather, then you might consider using a section subscript list form of indexing an array (see Array Sections). Note, you have not stated as to if your spin of (1,4), result shown as k, is to be stored into an output array at location (1,4).
Example code for system with AVX512 with scatter/gather
Note for inclusion on this forum I have restricted the s array to dimension (5,6). your actual implementation is assumed to be much larger.
module mod_Spinner type Spinner_t integer :: nX, nY real, allocatable :: s(:,:) integer, allocatable :: sNorth(:), sSouth(:), sEast(:), sWest(:) end type Spinner_t ! contains subroutine Spinner_init(this, nX, nY) implicit none type(Spinner_t) :: this integer :: nX, nY integer :: iX, iY, iLinear, iNorth, iSouth, iEast, iWest this%nX = nX this%nY = nY allocate(this%s(nX,nY), this%sNorth(nX*nY), this%sSouth(nX*nY), this%sEast(nX*nY), this%sWest(nX*nY)) do iX = 1, nX do iY = 1, nY if(iX == 1) then iWest = nX else iWest = iX-1 endif if(iX == nX) then iEast = 1 else iEast = iX + 1 endif if(iY == 1) then iNorth = nY else iNorth = iY - 1 endif if(iY == nY) then iSouth = 1 else iSouth = iY + 1 endif iLinear = (iY - 1) * nX + iX this%sNorth(iLinear) = (iNorth - 1) * nX + iX this%sSouth(iLinear) = (iSouth - 1) * nX + iX this%sEast(iLinear) = (iY - 1) * nX + iEast this%sWest(iLinear) = (iY - 1) * nX + iWest end do end do end subroutine Spinner_init subroutine Spinner_spin(this, toThat) implicit none type(Spinner_t) :: this, toThat if(size(this%s) /= size(toThat%s)) stop "Sizes do not match" call spin(this%s, toThat%s) contains subroutine spin(thisLinear, toThatLinear) implicit none real :: thisLinear(this%nX*this%nY), toThatLinear(toThat%nX*toThat%nY) toThatLinear = thisLinear(this%sNorth) + thisLinear(this%sSouth) + thisLinear(this%sEast) + thisLinear(this%sWest) end subroutine spin end subroutine Spinner_spin end module mod_Spinner program Spinner use mod_Spinner implicit none type(Spinner_t) :: A, B integer, parameter :: nX = 5 integer, parameter :: nY = 6 integer :: iX, iY call Spinner_init(A, nX, nY) call Spinner_init(B, nX, nY) do iY = 1, nY do iX = 1, nX A%s(iX, iY) = iX * 10000 + iY end do end do call Spinner_spin(A, B) print *,'A' do iY = 1, nY print *, A%s(:,iY) end do print * print *,'B' do iY = 1, nY print *, B%s(:,iY) end do end program SpinnerAnd output
A 10001.00 20001.00 30001.00 40001.00 50001.00 10002.00 20002.00 30002.00 40002.00 50002.00 10003.00 20003.00 30003.00 40003.00 50003.00 10004.00 20004.00 30004.00 40004.00 50004.00 10005.00 20005.00 30005.00 40005.00 50005.00 10006.00 20006.00 30006.00 40006.00 50006.00 B 90010.00 80010.00 120010.0 160010.0 150010.0 90008.00 80008.00 120008.0 160008.0 150008.0 90012.00 80012.00 120012.0 160012.0 150012.0 90016.00 80016.00 120016.0 160016.0 150016.0 90020.00 80020.00 120020.0 160020.0 150020.0 90018.00 80018.00 120018.0 160018.0 150018.0And for large arrays, the inner most loop processing 16 floats per iteration is:
vpcmpgtq k1, zmm0, zmm3 ;67.14 c1 vpcmpgtq k0, zmm0, zmm2 ;67.14 c1 add rdx, 16 ;67.14 c1 kunpckbw k6, k0, k1 ;67.14 c3 vmovaps zmm17, zmm1 ;67.14 c3 vmovaps zmm18, zmm1 ;67.14 c3 vmovdqu32 zmm5{k6}{z}, ZMMWORD PTR [rcx+r14] ;67.14 c5 vmovdqu32 zmm16{k6}{z}, ZMMWORD PTR [rcx+r11] ;67.14 c5 kmovw k2, k6 ;67.14 c5 kmovw k3, k6 ;67.14 c5 vmovaps zmm21, zmm1 ;67.14 c5 vmovaps zmm24, zmm1 ;67.14 c5 kmovw k4, k6 ;67.14 c7 kmovw k5, k6 ;67.14 c7 vpaddq zmm3, zmm3, zmm4 ;67.14 c7 vpaddq zmm2, zmm2, zmm4 ;67.14 c7 vmovdqu32 zmm19{k6}{z}, ZMMWORD PTR [rcx+rax] ;67.14 c11 stall 1 vmovdqu32 zmm22{k6}{z}, ZMMWORD PTR [rcx+r8] ;67.14 c11 vgatherdps zmm17{k2}, DWORD PTR [-4+r10+zmm5*4] ;67.14 c17 stall 2 vgatherdps zmm18{k3}, DWORD PTR [-4+r10+zmm16*4] ;67.14 c17 vaddps zmm20, zmm17, zmm18 ;67.14 c23 stall 2 vgatherdps zmm21{k4}, DWORD PTR [-4+r10+zmm19*4] ;67.14 c23 vgatherdps zmm24{k5}, DWORD PTR [-4+r10+zmm22*4] ;67.14 c23 vaddps zmm23, zmm20, zmm21 ;67.14 c29 stall 2 vaddps zmm25, zmm23, zmm24 ;67.14 c35 stall 2 vmovups ZMMWORD PTR [rcx+rbx]{k6}, zmm25 ;67.14 c41 stall 2 add rcx, 64 ;67.14 c41 cmp rdx, rsi ;67.14 c41 jb .B2.17 ; Prob 82% ;67.14 c43Only 29 instructions,
68 reads, 1 write to output 16 results.On CPU's that do not have AVX512, or for small arrays, this technique might not work as well. You will have to run tests to make this determination.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
A lot of the optimal technique is dependent upon the CPU that will run the program. While one can write a generic algorithm for any x64 CPU, targeting a specific CPU, especially for a simulation program that may take hours, days, months to run can significantly reduce the time to completion.
Can you provide information about the system(s) that will run the simulation(s) (CPU model number, number of CPU sockets, number of cores per CPU, RAM and if applicable number of systems in cluster. Also, state the typical problem size. Your sample program has relatively small arrays, is this typical of your actual application?
Note, the strategy for optimal tuning of a small data set is quite different than that for a medium, and that for a large data set.
In looking at your ising.f90, you should be aware that RANDOM_NUMBER can return an array of random numbers more efficiently than returning the same number of random numbers individually.
For example:
Replace: DO i= 1, lx DO j= 1, ly CALL RANDOM_NUMBER(x) s(i,j)= -1 IF ( x .LT. 0.5) s(i,j)= 1 ENDDO ENDDO With: real :: rn(lx, ly) ... call random_number(rn) where(rn .lt. 0.5) s = 1.0 elsewhere s = -1.0 endwhere
You can use RANDOM_NUMBER in a similar manner to return an array of results in the subroutine monte_carlo. Then substitute the DO loop with WHERE syntax.
We've already covered how to convert the subroutine boundary into an array of indices for use in indexed form of subscripting.
The intent of the replies on this forum is to inform you of techniques to improve your programming, and is not intended to re-write your programs. You should have sufficient information by this time to think about how to improve your program.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Typing in i3-2348M to Google search yields an ark.intel.com posting about your CPU. It has the first generation AVX instructions. It does not have the scatter/gather nor masked move capabilities of later architectures.
Possible problems with the code you posted
x= x*lx y= y*lx ! *** this should be y*ly lx might not always equal ly in later revisions
Also,
DO t= 3., 0.01, -0.1
DO loops of type real is an obsolete (deleted) feature of Fortran
I surmise you are taking a Computer Science test. What do you think you can use to replace the obsolete DO loop?
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
If the arrays x and s have the same shape, namely, (lx, ly), the DO loops displayed in #14 can be replaced with
call random_number(x) s = 2*x - 1
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page