
Hi, everyone! I am writing a scientific computing program which contains a part that looks as follows:

```
! derived type
type, public :: block_arrayRR_type
   real(8), allocatable :: matA(:,:)
end type
type(block_arrayRR_type), allocatable :: Dx(:), Dy(:)
! temporary arrays
real(8) invM_R(8,maxNp)   ! maxNp is the biggest vector size
real(8) invM_Sx_f(8)
real(8) invM_Sy_f(8)
...
! main loop
do i = 1, ELE_num
   ...
   ! fetch sub-vector and matrix size Np
   associate( Np => ELE_Ept_num(i) )
      do Ept = 1, Np
         invM_Sx_f = 0.0
         invM_Sy_f = 0.0
         ! fetch sub-vector index; pt varies with Ept in unit stride,
         ! e.g. Ept = 1,2,3,...  pt = 10,11,12,...
         associate( pt => ELE_EP2P(Ept,i) )
            do Ept2 = 1, Np
               ! fetch sub-vector index; pt2 varies with Ept2 in unit stride
               associate( pt2 => ELE_EP2P(Ept2,i) )
                  invM_Sx_f(1) = invM_Sx_f(1) + Dx(i)%matA(Ept2,Ept)*Pt_f(1,pt2)
                  invM_Sx_f(3) = invM_Sx_f(3) + Dx(i)%matA(Ept2,Ept)*Pt_f(3,pt2)
                  invM_Sx_f(5) = invM_Sx_f(5) + Dx(i)%matA(Ept2,Ept)*Pt_f(5,pt2)
                  invM_Sx_f(6) = invM_Sx_f(6) + Dx(i)%matA(Ept2,Ept)*Pt_f(6,pt2)
                  invM_Sx_f(7) = invM_Sx_f(7) + Dx(i)%matA(Ept2,Ept)*Pt_f(7,pt2)
                  invM_Sx_f(8) = invM_Sx_f(8) + Dx(i)%matA(Ept2,Ept)*Pt_f(8,pt2)
                  invM_Sy_f(2) = invM_Sy_f(2) + Dy(i)%matA(Ept2,Ept)*Pt_f(2,pt2)
                  invM_Sy_f(4) = invM_Sy_f(4) + Dy(i)%matA(Ept2,Ept)*Pt_f(4,pt2)
                  invM_Sy_f(5) = invM_Sy_f(5) + Dy(i)%matA(Ept2,Ept)*Pt_f(5,pt2)
                  invM_Sy_f(6) = invM_Sy_f(6) + Dy(i)%matA(Ept2,Ept)*Pt_f(6,pt2)
                  invM_Sy_f(7) = invM_Sy_f(7) + Dy(i)%matA(Ept2,Ept)*Pt_f(7,pt2)
                  invM_Sy_f(8) = invM_Sy_f(8) + Dy(i)%matA(Ept2,Ept)*Pt_f(8,pt2)
               end associate
            end do ! Ept2
            Pt_df(1,pt) = RK_a*Pt_df(1,pt) + dt*( invM_R(1,Ept) - ea(1,1)*invM_Sx_f(1) )
            Pt_df(2,pt) = RK_a*Pt_df(2,pt) + dt*( invM_R(2,Ept) - ea(2,2)*invM_Sy_f(2) )
            Pt_df(3,pt) = RK_a*Pt_df(3,pt) + dt*( invM_R(3,Ept) - ea(1,3)*invM_Sx_f(3) )
            Pt_df(4,pt) = RK_a*Pt_df(4,pt) + dt*( invM_R(4,Ept) - ea(2,4)*invM_Sy_f(4) )
            Pt_df(5,pt) = RK_a*Pt_df(5,pt) + dt*( invM_R(5,Ept) - ea(1,5)*invM_Sx_f(5) &
                                                                - ea(2,5)*invM_Sy_f(5) )
            Pt_df(6,pt) = RK_a*Pt_df(6,pt) + dt*( invM_R(6,Ept) - ea(1,6)*invM_Sx_f(6) &
                                                                - ea(2,6)*invM_Sy_f(6) )
            Pt_df(7,pt) = RK_a*Pt_df(7,pt) + dt*( invM_R(7,Ept) - ea(1,7)*invM_Sx_f(7) &
                                                                - ea(2,7)*invM_Sy_f(7) )
            Pt_df(8,pt) = RK_a*Pt_df(8,pt) + dt*( invM_R(8,Ept) - ea(1,8)*invM_Sx_f(8) &
                                                                - ea(2,8)*invM_Sy_f(8) )
         end associate
      end do ! Ept
   end associate
   ...
end do ! i
```

All variables appearing in the code above are intrinsic arrays or scalars (except Dx and Dy, which are of the derived type shown).

This is the fastest version I have so far, but the speed is still not satisfactory.

Can anyone provide help? Thanks a lot!



Patiently awaiting a response from Jim, Steve, or mecej4.

I would suggest that repeating the same operation is not necessarily optimal.

I use multiple threads, but your problem is small in terms of steps to thread.

Once you start, run timing profiles to see if it actually helps.

Good luck.


You forgot Fortran Fan - it all depends on who is watching on a Sunday afternoon when there is no football.


Hello, everyone! I am the original poster. Thanks for all the responses!

I am new to the forum, and I carelessly posted this thread before I finished writing it.

I forgot to provide some additional info which may help you better understand the situation:

The core of the code performs matrix multiplications (Dx * Pt_f, Dy * Pt_f) and updates Pt_df.

It looks a bit weird and lengthy for the following reasons:

1. Not all 'alpha' components in Pt_f(alpha,pt) need to be evaluated, because some ea(:,alpha) are exactly zero.

So I have already omitted them and manually unrolled the alpha loop, as can be seen in the code above.

It makes the code lengthy but also faster. I guess manually unrolling the loop helps the compiler vectorize the code.

2. For different elements 'i', the matrix and vector size 'Np' may differ.

To reduce memory consumption, I pack the vectors into one long vector (such as Pt_f(alpha,pt))

and arrange the matrices in a structured type (such as Dx(i)%matA(Np,Np)).

As you can see, ELE_EP2P(Ept,i) is used to fetch the sub-vector index 'pt'.

The matrix and vector size Np ranges from about 5 to 200, depending on the specific application.

The main loop count ELE_num is a large number.

I have already tried many modifications to this code, but the above seems to be the fastest I can get.

By VTune profiling, I found that the percentage of vectorized ops is low (<20%) and the FLOPS rate is also not ideal.

I do believe there must be a much better way to perform this task, because it is indeed matrix multiplication!

I have tried the intrinsic function 'matmul' and the MKL BLAS subroutine 'gemv', but I could not get performance gains.

Can anyone provide some useful suggestions?
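Since the inner kernel really is a matrix product, one option worth testing is to gather the element's columns of Pt_f into a contiguous buffer and issue a single MKL dgemm per element, so the library sees one 8 x Np by Np x Np product instead of Np short dot products. This is only a hedged sketch, not a drop-in: Pf_loc and S_loc are hypothetical workspaces introduced here, only the Dx part is shown, and whether dgemm wins depends on Np.

```
! Sketch (untested): per-element gathered GEMM.
! Gather the Np columns of Pt_f used by element i into a contiguous
! 8 x Np workspace, do one matrix-matrix product, then use/scatter.
real(8), allocatable :: Pf_loc(:,:), S_loc(:,:)   ! hypothetical buffers
allocate( Pf_loc(8,maxNp), S_loc(8,maxNp) )
do i = 1, ELE_num
   associate( Np => ELE_Ept_num(i) )
      do Ept2 = 1, Np
         Pf_loc(:,Ept2) = Pt_f(:, ELE_EP2P(Ept2,i))   ! gather once per element
      end do
      ! S_loc(alpha,Ept) = sum over Ept2 of Pf_loc(alpha,Ept2)*matA(Ept2,Ept)
      call dgemm('N', 'N', 8, Np, Np, 1.0d0, Pf_loc, 8, &
                 Dx(i)%matA, size(Dx(i)%matA,1), 0.0d0, S_loc, 8)
      ! ... use S_loc(alpha,Ept) in place of invM_Sx_f, likewise for Dy ...
   end associate
end do
```

The gather costs extra copies, but it trades Np indirect column loads for one contiguous buffer that dgemm can stream through.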


1) I would suggest you change the Ept2 loop

from constructing two non-unit-stride temporary arrays (two 8-element arrays containing 6 values, not compacted)

to constructing three four-element contiguous arrays.

2) Prior to the DO i loop, construct three four-element copies of the ea array in the selection order used after the Ept2 loop.

These two changes should permit better AVX2 optimization.

Jim Dempsey


Jim, is there no way to break up the operations and use more threads?


Hi Jim

I am not sure whether I got your ideas exactly, but I tried the following version:

```
! derived type
type, public :: block_arrayRR_type
   real(8), allocatable :: matA(:,:)
end type
type(block_arrayRR_type), allocatable :: Dx(:), Dy(:)
! temporary arrays
real(8) invM_R(8,maxNp)   ! maxNp is the biggest vector size
real(8) invM_Sx_f(4)
real(8) invM_Sy_f(4)
real(8) invM_S1_f(4)
...
! main loop
do i = 1, ELE_num
   ...
   ! fetch sub-vector and matrix size Np
   associate( Np => ELE_Ept_num(i) )
      do Ept = 1, Np
         invM_Sx_f = 0.0
         invM_Sy_f = 0.0
         invM_S1_f = 0.0
         ! fetch sub-vector index; pt varies with Ept in unit stride,
         ! e.g. Ept = 1,2,3,...  pt = 10,11,12,...
         associate( pt => ELE_EP2P(Ept,i) )
            do Ept2 = 1, Np
               ! fetch sub-vector index; pt2 varies with Ept2 in unit stride
               associate( pt2 => ELE_EP2P(Ept2,i) )
                  invM_S1_f(1) = invM_S1_f(1) + Dx(i)%matA(Ept2,Ept)*Pt_f(1,pt2)
                  invM_S1_f(2) = invM_S1_f(2) + Dy(i)%matA(Ept2,Ept)*Pt_f(2,pt2)
                  invM_S1_f(3) = invM_S1_f(3) + Dx(i)%matA(Ept2,Ept)*Pt_f(3,pt2)
                  invM_S1_f(4) = invM_S1_f(4) + Dy(i)%matA(Ept2,Ept)*Pt_f(4,pt2)
                  invM_Sx_f(1) = invM_Sx_f(1) + Dx(i)%matA(Ept2,Ept)*Pt_f(5,pt2)
                  invM_Sx_f(2) = invM_Sx_f(2) + Dx(i)%matA(Ept2,Ept)*Pt_f(6,pt2)
                  invM_Sx_f(3) = invM_Sx_f(3) + Dx(i)%matA(Ept2,Ept)*Pt_f(7,pt2)
                  invM_Sx_f(4) = invM_Sx_f(4) + Dx(i)%matA(Ept2,Ept)*Pt_f(8,pt2)
                  invM_Sy_f(1) = invM_Sy_f(1) + Dy(i)%matA(Ept2,Ept)*Pt_f(5,pt2)
                  invM_Sy_f(2) = invM_Sy_f(2) + Dy(i)%matA(Ept2,Ept)*Pt_f(6,pt2)
                  invM_Sy_f(3) = invM_Sy_f(3) + Dy(i)%matA(Ept2,Ept)*Pt_f(7,pt2)
                  invM_Sy_f(4) = invM_Sy_f(4) + Dy(i)%matA(Ept2,Ept)*Pt_f(8,pt2)
               end associate
            end do ! Ept2
            ! ea1 = [eaT(1,1), eaT(2,2), eaT(3,1), eaT(4,2)]
            Pt_df(1,pt) = RK_a*Pt_df(1,pt) + dt*( invM_R(1,Ept) - ea1(1)*invM_S1_f(1) )
            Pt_df(2,pt) = RK_a*Pt_df(2,pt) + dt*( invM_R(2,Ept) - ea1(2)*invM_S1_f(2) )
            Pt_df(3,pt) = RK_a*Pt_df(3,pt) + dt*( invM_R(3,Ept) - ea1(3)*invM_S1_f(3) )
            Pt_df(4,pt) = RK_a*Pt_df(4,pt) + dt*( invM_R(4,Ept) - ea1(4)*invM_S1_f(4) )
            Pt_df(5,pt) = RK_a*Pt_df(5,pt) + dt*( invM_R(5,Ept) - eaT(5,1)*invM_Sx_f(1) &
                                                                - eaT(5,2)*invM_Sy_f(1) )
            Pt_df(6,pt) = RK_a*Pt_df(6,pt) + dt*( invM_R(6,Ept) - eaT(6,1)*invM_Sx_f(2) &
                                                                - eaT(6,2)*invM_Sy_f(2) )
            Pt_df(7,pt) = RK_a*Pt_df(7,pt) + dt*( invM_R(7,Ept) - eaT(7,1)*invM_Sx_f(3) &
                                                                - eaT(7,2)*invM_Sy_f(3) )
            Pt_df(8,pt) = RK_a*Pt_df(8,pt) + dt*( invM_R(8,Ept) - eaT(8,1)*invM_Sx_f(4) &
                                                                - eaT(8,2)*invM_Sy_f(4) )
         end associate
      end do ! Ept
   end associate
   ...
end do ! i
```

In standalone tests, the above version did work! So did I make the correct change?

It saves about 30% of the time compared with the original one (both compiled with /QxCORE-AVX2 and run on an i7-7400).

But when I integrated it into my whole program, I found this part gives little gain... It is weird. (Maybe still my fault!)

So I wonder :

1. Is there anything else that I should pay attention to when targeting vectorization?

2. Apart from the current modification, is it possible to use 'matmul' or BLAS routines to achieve better performance, and how?

Thanks


Good job on converting hints to code.

If the post-Ept2-loop code is not vectorizing completely, then switch from

real(8) :: eaT(5:8,2)

to

real(8) :: eaT1(5:8), eaT2(5:8)

Additional hint:

I find it best to use VTune and then look at the code bottlenecks in the Disassembly view.

You do not have to fully understand assembly code to assess whether your temporaries are scalar or vector, and whether they live in registers or RAM.

Jim Dempsey


I don't really understand why switching from a 2D array to two 1D arrays will help.

Memory access should be contiguous in both cases, I guess?

Unfortunately, I know almost nothing about assembly code.

When I use VTune, I find its help manual not detailed enough for me (a beginner).

I guess I'll have to spend a long time learning the necessary background before I can pinpoint the problem in my code using VTune...


Are you using OpenMP to parallelize your code? That makes a huge performance difference, and is not difficult to do.
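For reference, a minimal sketch of applying OpenMP to the outermost element loop of the code posted earlier. This assumes each element writes a disjoint set of pt indices in Pt_df (as the ELE_EP2P mapping in the thread suggests); the private list would have to match the real variable set:

```
! Sketch only: parallelize the outer element loop with OpenMP.
! Assumes iterations write disjoint slices of Pt_df, so no
! synchronization is needed on the updates.
!$omp parallel do schedule(dynamic) &
!$omp&  private(Ept, Ept2, invM_Sx_f, invM_Sy_f)
do i = 1, ELE_num
   ! ... body of the element loop as in the original code ...
end do
!$omp end parallel do
```

schedule(dynamic) helps here because Np, and hence the work per element, varies.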


Of course - OpenMP has been used on the outermost loop, and I have already reached fairly good parallel efficiency, even on NUMA architectures.

What I want to discuss here is vectorization and cache.

I recently read an article, also written by Jim:

Superscalar programming 101 (Matrix Multiply) Part 3 of 5

and I realized there is still a lot of potential in single-core performance.

However, the tricks in that article still seem difficult for me to understand and use so far.


If you use OpenMP, and Dx and Dy are large, then the bottleneck might well be (relatively slow) RAM access due to cache misses. You might want to experiment with the number of parallel threads to empirically find the optimum for your application.


I agree with Ulrich: memory access may be the bottleneck. Just try REAL(4) instead of REAL(8). It may be twice as fast.


Yes, Dx is really a large block matrix, and in parallel processing each thread tries to access a different part of Dx at the same time.

This creates a large bandwidth requirement.

Some story behind the code:

On an old machine with single-channel DDR4, RAM access was once a real problem. To deal with it, I tried building each Dx(i) matrix at runtime (whenever it is used) instead of pre-computing and storing the whole thing. That needs fewer data accesses but clearly increases the computing effort, so it is no more than a trade-off.

Later, I upgraded the computer and added multi-channel RAM, so the bandwidth improved. The present version then seemed more profitable, and the old version was abandoned.

Even with real(8), the code so far is still unlikely to make full use of AVX2 (256-bit) / AVX-512 all the time.

Plus, since the memory access issue is not so severe now, using real(4) only gives about a 15% speedup, depending on the application.


>>I don't really understand why switching from a 2D array to two 1D arrays will help

The suggestion was: *if* the...

While in abstract terms the two methods should be equivalent, the compiler's optimizer may (potentially) miss the optimization opportunity. If that is the case, then simplifying the source code (2D array to two 1D arrays) may yield the desired results.

VTune is an invaluable tool for identifying bottlenecks in your code. You do not need to learn "all about VTune" to get started. Start with:

Collect Hot Spots

View Bottom-Up (lists the hottest procedures in descending order)

Double-click the top (hottest) procedure to view its source

If interested, click the Assembly window to view assembly side by side.

While learning everything about VTune is complicated, the basics are quite easy.

Jim Dempsey


It seems switching from 2D to 1D does not make a difference in my case.

I will try VTune some more to see if I can learn something new, and I'll post an update if I have problems.

Thanks for your reply! It does help a lot!


While the compiler optimization should have caught this, replace the associate pt and pt2 / end associate constructs with explicit fetches of the indices into ipt and ipt2. This might save a little overhead.

```
! derived type
type, public :: block_arrayRR_type
   real(8), allocatable :: matA(:,:)
end type
type(block_arrayRR_type), allocatable :: Dx(:), Dy(:)
! temporary arrays
real(8) invM_R(8,maxNp)   ! maxNp is the biggest vector size
real(8) invM_Sx_f(4)
real(8) invM_Sy_f(4)
real(8) invM_S1_f(4)
integer iNp, ipt, ipt2
...
! main loop
do i = 1, ELE_num
   ...
   ! fetch sub-vector and matrix size Np
   iNp = ELE_Ept_num(i)
   do Ept = 1, iNp
      invM_Sx_f = 0.0
      invM_Sy_f = 0.0
      invM_S1_f = 0.0
      do Ept2 = 1, iNp
         ! fetch sub-vector index; ipt2 varies with Ept2 in unit stride
         ipt2 = ELE_EP2P(Ept2,i)
         block
            real(8) rDxDyDxDy(4)
            rDxDyDxDy(1) = Dx(i)%matA(Ept2,Ept)
            rDxDyDxDy(2) = Dy(i)%matA(Ept2,Ept)
            rDxDyDxDy(3:4) = rDxDyDxDy(1:2)
            invM_S1_f = invM_S1_f + rDxDyDxDy*Pt_f(1:4,ipt2)
            invM_Sx_f = invM_Sx_f + rDxDyDxDy(1)*Pt_f(5:8,ipt2)
            invM_Sy_f = invM_Sy_f + rDxDyDxDy(2)*Pt_f(5:8,ipt2)
         end block
      end do ! Ept2
      ! ea1 = [eaT(1,1), eaT(2,2), eaT(3,1), eaT(4,2)]
      ! fetch sub-vector index; ipt varies with Ept in unit stride,
      ! e.g. Ept = 1,2,3,...  ipt = 10,11,12,...
      ipt = ELE_EP2P(Ept,i)
      Pt_df(1:4,ipt) = RK_a*Pt_df(1:4,ipt) &
                     + dt*( invM_R(1:4,Ept) - ea1(1:4)*invM_S1_f(1:4) )
      Pt_df(5:8,ipt) = RK_a*Pt_df(5:8,ipt) &
                     + dt*( invM_R(5:8,Ept) - eaT(5:8,1)*invM_Sx_f(1:4) - eaT(5:8,2)*invM_Sy_f(1:4) )
   end do ! Ept
   ...
end do ! i
```

Something like the above (untested) code.

Jim Dempsey


Hi Jim

Defining a temporary array within a block construct... emmm... it is really new to me!

I have tested your code, but it shows no improvement and is even a little slower than the previous version...

```
invM_S1_f(1) = invM_S1_f(1) + Dx(i)%matA(Ept2,Ept)*Pt_f(1,pt2)
invM_S1_f(2) = invM_S1_f(2) + Dy(i)%matA(Ept2,Ept)*Pt_f(2,pt2)
invM_S1_f(3) = invM_S1_f(3) + Dx(i)%matA(Ept2,Ept)*Pt_f(3,pt2)
invM_S1_f(4) = invM_S1_f(4) + Dy(i)%matA(Ept2,Ept)*Pt_f(4,pt2)
invM_Sx_f(1) = invM_Sx_f(1) + Dx(i)%matA(Ept2,Ept)*Pt_f(5,pt2)
invM_Sx_f(2) = invM_Sx_f(2) + Dx(i)%matA(Ept2,Ept)*Pt_f(6,pt2)
invM_Sx_f(3) = invM_Sx_f(3) + Dx(i)%matA(Ept2,Ept)*Pt_f(7,pt2)
invM_Sx_f(4) = invM_Sx_f(4) + Dx(i)%matA(Ept2,Ept)*Pt_f(8,pt2)
```

Adding the block and the temporary storage of Dx and Dy in the innermost loop seems harmful...

By the way, I always compile with the O3 option.

I guess the compiler should detect the repeated reference and fetch it from cache rather than RAM after the first reference,

so there is no need to do these things manually.


RE: O3

I assume you also included /QxCORE-AVX2

RE: Block

The intention of the block is to declare a local temporary four-element array in the hope that the compiler will keep it in registers. From your test, it appears that it did not.

RE: fetch it from cache

The compiler has little to do with the cache; that is a memory-hierarchy issue. In the attempt using block rDxDyDxDy end block, the compiler could have constructed the 4-wide array in a register, but apparently did not (resulting in store, store, store, store, load). I am not faulting the compiler writers, as they do an excellent job at optimization; this specific case may be new to them and/or of low priority.

Jim Dempsey


So far, manually unrolling the innermost, non-contiguous 'alpha' loop gives fairly good performance,

even though it looks redundant.

But it's hard to apply this pattern to other, similar tasks in my application where the 'alpha' list varies with the element 'i'.

For example, the most direct way to implement a variable 'alpha' list is to use an array 'alphaList' and an innermost 'ia' loop to access 'alpha':

```
!(1) indirect access to alpha: very slow
do Fpt = 1, Nfp
   do ia = 1, alphaListSize
      FLux(ia) = ...
   end do
   do Ept = 1, Np
      do ia = 1, alphaListSize
         invM_R(alphaList(ia),Ept) = invM_R(alphaList(ia),Ept) + Array(Ept,Fpt)*FLux(ia)
      end do
   end do
end do

!(2) statically unrolled: fastest, but not general
do Fpt = 1, Nfp
   FLux(1) = ...
   FLux(2) = ...
   FLux(3) = ...
   FLux(4) = ...
   do Ept = 1, Np
      invM_R(1,Ept) = invM_R(1,Ept) + Array(Ept,Fpt)*FLux(1)
      invM_R(2,Ept) = invM_R(2,Ept) + Array(Ept,Fpt)*FLux(2)
      invM_R(5,Ept) = invM_R(5,Ept) + Array(Ept,Fpt)*FLux(3)
      invM_R(6,Ept) = invM_R(6,Ept) + Array(Ept,Fpt)*FLux(4)
   end do
end do

!(3) evaluate all alpha even though some are unnecessary: slower than (2)
do Fpt = 1, Nfp
   FLux(1:8) = ...
   do Ept = 1, Np
      invM_R(1:8,Ept) = invM_R(1:8,Ept) + Array(Ept,Fpt)*FLux(1:8)
   end do
end do
```

The above is only a sample; in the real task the alphaList may be much more complicated.

Is there any solution, or am I stuck going in the wrong direction?
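One pattern worth trying for the variable-alpha case (an untested sketch): hoist the indirect 'alpha' access out of the hot loop entirely. Gather the invM_R rows named in alphaList into a hypothetical compact workspace invM_R_c once, run a dense unit-stride update inside the Fpt loop, and scatter back once afterwards:

```
! Sketch: move the indirect alpha indexing out of the inner loops.
! invM_R_c is a hypothetical compact copy holding only the rows
! named in alphaList, so the hot update is dense and unit-stride.
real(8) :: invM_R_c(alphaListSize, maxNp)
do ia = 1, alphaListSize
   invM_R_c(ia,1:Np) = invM_R(alphaList(ia),1:Np)   ! gather once
end do
do Fpt = 1, Nfp
   do ia = 1, alphaListSize
      FLux(ia) = ...                                 ! as before
   end do
   do Ept = 1, Np
      invM_R_c(1:alphaListSize,Ept) = invM_R_c(1:alphaListSize,Ept) &
                                    + Array(Ept,Fpt)*FLux(1:alphaListSize)
   end do
end do
do ia = 1, alphaListSize
   invM_R(alphaList(ia),1:Np) = invM_R_c(ia,1:Np)   ! scatter once
end do
```

The gather/scatter cost is paid once per element rather than once per (Fpt, Ept) iteration, which is what makes version (1) above so slow.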


In your !(2), I would suggest you experiment with exchanging the indices of invM_R (unless that storage order is important for performance elsewhere).

The order you use there results in strided read/modify/write.

Jim Dempsey
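Concretely, exchanging the indices means storing the accumulator as invM_RT(Ept,alpha) instead of invM_R(alpha,Ept), so the hot Ept loop walks each column with unit stride. A sketch, assuming the transposed layout (invM_RT is a hypothetical renamed array) is acceptable wherever else invM_R is used:

```
! Sketch: transpose invM_R so the Ept loop is unit-stride per column.
real(8) :: invM_RT(maxNp,8)        ! was invM_R(8,maxNp)
do Fpt = 1, Nfp
   FLux(1) = ...
   FLux(2) = ...
   FLux(3) = ...
   FLux(4) = ...
   do Ept = 1, Np
      invM_RT(Ept,1) = invM_RT(Ept,1) + Array(Ept,Fpt)*FLux(1)
      invM_RT(Ept,2) = invM_RT(Ept,2) + Array(Ept,Fpt)*FLux(2)
      invM_RT(Ept,5) = invM_RT(Ept,5) + Array(Ept,Fpt)*FLux(3)
      invM_RT(Ept,6) = invM_RT(Ept,6) + Array(Ept,Fpt)*FLux(4)
   end do
end do
```

Each statement now reads and writes consecutive memory as Ept advances, matching the unit-stride access of Array(Ept,Fpt).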
