Software Archive
Read-only legacy content

Poor speedup on Phi

Masrul
Beginner

I am trying to speed up a matrix multiplication code on the Intel coprocessor. I get the expected speedup on the host, but on the coprocessor the performance is very poor compared to the host. I compiled the code with the -O3 flag to enable auto-vectorization, and I am using the following environment variables:

export MIC_ENV_PREFIX=XEONPHI
export XEONPHI_KMP_PLACE_THREADS=60C,4t/30C,2t......
export XEONPHI_KMP_AFFINITY=compact,granularity=fine

Code:

program matmulti
    implicit none
    double precision,allocatable::A(:,:),B(:,:),C(:,:)
    integer::i,j,k,s,m,n,p
    m=5000
    n=m
    p=m
    allocate(A(m,n),B(n,p),C(m,p))
    A=1
    B=1
    C=0
    !$omp target device(0) map(to: A,B,m,n,p) map(from: C)
    !$omp parallel shared(A,B,C,m,n,p) private(i,j,k)
        !$omp do schedule(static)
            do i=1,m
                do j=1,p
                    C(i,j)=0.d0                
                    do k=1,n
                        C(i,j)=C(i,j)+A(i,k)*B(k,j)
                    end do
                end do
            end do
        !$omp end do  nowait       
    !$omp end parallel
    !$omp end target
end program matmulti

 

9 Replies
Sunny_G_Intel
Employee

Hi Masrul,

I would like to encourage you to read the following articles on Optimization and Performance tuning for Intel Xeon Phi Coprocessors:

https://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization

https://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-2-understanding

To get better performance on Intel Xeon Phi coprocessors, you will have to pay attention to vectorization and to scaling up to 240+ threads.

In addition, you can consider proper cache use and loop-ordering techniques to get optimized performance.

 

Andrey_Vladimirov
New Contributor III

If you need to run matrix-matrix multiplication fast, don't write your own code; use the DGEMM routine in MKL.
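For concreteness, a minimal sketch of what that could look like (illustrative only, not part of Andrey's reply), keeping the same 5000x5000 matrices and assuming the program is linked against MKL (e.g. with ifort -mkl):

program matmulti_mkl
    implicit none
    double precision, allocatable :: A(:,:), B(:,:), C(:,:)
    integer :: m, n, p
    m = 5000
    n = m
    p = m
    allocate(A(m,n), B(n,p), C(m,p))
    A = 1
    B = 1
    C = 0
    ! C := 1.0*A*B + 0.0*C; the leading dimensions are m, n and m
    call dgemm('N', 'N', m, p, n, 1.d0, A, m, B, n, 0.d0, C, m)
end program matmulti_mkl

With MKL's Automatic Offload enabled (MKL_MIC_ENABLE=1), the same call can transparently use the coprocessor for sufficiently large matrices.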

If you are trying to learn, then in your specific code you need to:

1) Change the order of loop nesting: jki order will work better than ijk in your code (points 1 and 2 are sketched after this list)

2) Implement loop tiling to improve data re-use in caches

3) Take care of data alignment and alignment hints

4) Do not benchmark the first offload to Xeon Phi. Perform several offloads and measure sustained performance of the 2nd offload and later.

5) Retain buffers for matrices on Xeon Phi between offloads to avoid memory allocation overhead
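To illustrate points 1 and 2 only, here is a rough sketch (illustrative, not from Andrey's reply) of the reordered and tiled loop nest; A, B, C, m, n, p are as in the original program, with integer :: ii, jj, kk, T added to the declarations, and the dimensions assumed divisible by T:

    T = 20      ! assumed tile size: tune it, but keep p/T large enough to feed all threads
    C = 0
    !$omp parallel do private(i, j, k, ii, kk) schedule(static)
    do jj = 1, p, T
        do kk = 1, n, T
            do ii = 1, m, T
                do j = jj, jj + T - 1
                    do k = kk, kk + T - 1
                        ! the innermost loop now runs over the first (unit-stride) index
                        do i = ii, ii + T - 1
                            C(i,j) = C(i,j) + A(i,k) * B(k,j)
                        end do
                    end do
                end do
            end do
        end do
    end do
    !$omp end parallel do

Points 3-5 are not shown here; the materials linked below cover them.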

This video lecture is a good place to start: https://www.youtube.com/watch?v=kKhMRWVkXT8

More information is available here: http://colfaxresearch.com/cdt-v02/

 

TimP
Honored Contributor III

MATMUL at -O3 (with -qopt-matmul for large cases) makes more sense than specifying a bad loop nesting order. jik or jki order is evidently better than what you show; OpenMP tends to require literal adherence to what you write.  -qopt-report would tell you whether any loop interchanges were performed, and possibly which alternatives should be considered.
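As an illustration (a sketch, not part of the reply above), the whole kernel then collapses to the intrinsic, built with something like ifort -O3 -qopt-matmul -qopt-report=2:

    ! The hand-written triple loop replaced by the intrinsic; at -O3 with
    ! -qopt-matmul the compiler may substitute an optimized matmul library call.
    C = matmul(A, B)

The -qopt-report output then shows whether that substitution, or any loop interchange, actually took place.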

You would need to check whether the compiler has recognized your implicit dot_product, if it follows jik order.  In ijk order, you would be asking for each element of each destination cache line to belong to a different thread, rather than simply spreading the destination cache line across the lanes of a single logical processor.

As Andrey points out, a case such as yours which is large enough to benefit from MIC will need tiling for cache locality, once you permit threads to have some locality of data references.

DGEMM offers the possibility of taking advantage of Compiler Assisted Offload, etc.; at the least, -qopt-matmul lets the compiler engage the optimized MKL library.

Masrul
Beginner

Thanks, both of you. I will work on these issues. Actually, I am trying to learn optimization on the MIC architecture.

James_C_Intel2
Employee

Actually, I am trying to learn optimization on the MIC architecture.

In which case it would certainly be worth your time to read through Tutorial – Real World Examples For Vectorization (Manel Fernandez, Chief Consultant, Bayncore Ltd.), which spends a while optimizing matrix multiplication (and makes me realize why reaching for MKL is a good thing to do when you can!).

Other presentations from the London HPC Developers' conference may also be interesting. (There are a few 404s there which I have reported; I hope they'll get fixed!)

Masrul
Beginner

Thanks, all. I am very glad to report that changing the loop order alone boosted the performance on the Phi by a factor of 50. I will work on the other issues.

I would also like to ask some questions only loosely related to this topic. We are planning to develop a general-purpose computational chemistry code from scratch that will be well optimized specifically for the MIC architecture. Which would be the better option for offloading: OpenMP 4.0 directives (!$omp target) or Intel LEO (!dir$ offload)? Do both offer equal flexibility, or does LEO have advantages over OpenMP 4.0? We are also planning to use MPI for communication between nodes in a cluster (master-thread communication on the host side only), with the expensive calculations offloaded to the Phi. Will this approach work on the second-generation Phi, Knights Landing (given that one Knights Landing variant will be a standalone processor rather than a coprocessor)?

TimP
Honored Contributor III

I think there's a lot of guesswork here as I don't see much public information on KNL beyond generalities.

Apparently, the stand-alone KNL processors will come before PCI bus ones, and it's not clear whether "offload" support for the stand-alone processors would come immediately (unless I've missed some announcement).  But I think you're asking about the older LEO vs. the OpenMP 4 offload directives, which still may not be as widely used.  OpenMP 4.5 appears to offer some new facilities for the future, while until now there may be more capability in LEO, which seems widely enough used that it must remain supported.

As new platforms are better optimized for MPI, the advantage for some applications of combining MPI between nodes with offload to coprocessor may not persist.  There must be a strong push to enable existing MPI/OpenMP hybrid applications to take better advantage of Xeon Phi(tm).
 

Masrul
Beginner

Thanks, Tim. Then how will an application run on a cluster of nodes? Is the Phi only for a single node, or should the native paradigm be used?

jimdempseyatthecove
Honored Contributor III

A (potential) disadvantage of using !$omp target is that the serial portions of the application are required to run on the host, so the program will generally enter and exit offload regions more often. When that is a design goal of your application, this is acceptable. When the serial portions of the application can be run advantageously inside the offload region, you might consider using !dir$ offload instead. Mixing MPI and OpenMP is doable but can be difficult. One of the threads on this forum addresses this issue in greater detail; I do not have the link handy, but you should be able to search for it.

Jim Dempsey
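To make the contrast concrete, here is a rough, simplified sketch (not from Jim's reply; names and sizes are placeholders) of the same trivially parallel loop written both ways; A, B and C are assumed to be host-allocated double precision arrays of length n, and i and n are integers:

    ! (a) OpenMP 4.0 offload directives, as in the original post:
    !$omp target device(0) map(to: A, B, n) map(from: C)
    !$omp parallel do private(i)
    do i = 1, n
        C(i) = A(i) + B(i)
    end do
    !$omp end parallel do
    !$omp end target

    ! (b) Intel LEO: in/out clauses take the place of the map clauses.
    !dir$ offload begin target(mic:0) in(A, B) out(C)
    !$omp parallel do private(i)
    do i = 1, n
        C(i) = A(i) + B(i)
    end do
    !$omp end parallel do
    !dir$ end offload

LEO's alloc_if/free_if modifiers are one way to get the buffer retention Andrey mentioned in point 5; OpenMP 4.0 expresses the same idea with an enclosing target data region.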
