
Deciding Between Automatic Offload and the Automatic/Compiler Assisted Offload Combination

Brandon_P_
Beginner

Hi all, I'm using mpiifort to compile a set of Fortran source files that make a few LAPACK and BLAS calls. I've been able to use automatic offload for the ZGETRF LAPACK routine (LU factorization) by increasing my problem size so that the matrices passed to ZGETRF are larger than 8192 x 8192. However, the code also calls other LAPACK and BLAS routines that automatic offload does not support. I'm wondering whether explicitly marking some of those routines for offload would be worth it, because in the past explicit offload has only increased my computation time.

It's worth noting I'm using mpiifort to keep the MPICH2 calls inside the code intact. If that is the source of the slowdown, let me know.

Thanks.

Zhang_Z_Intel
Employee

When deciding whether to explicitly offload the other LAPACK and BLAS routines, the biggest factor to consider is the overhead of transferring data back and forth between the host and the MIC. It's hard to give a general answer; you would have to benchmark your code to find where faster computation on the MIC outweighs the cost of the data transfer. But there are a few guidelines:

  1. Small problem sizes are typically not good candidates for offloading. The matrix size you needed for your LU factorization gives a hint of the scale involved.
  2. Minimizing data transfer helps performance. For example, consider restructuring your code so that you gather all the input data for those other BLAS/LAPACK routines into one large block and ship it to the MIC at once; then do as much computation as possible on the MIC before shipping the results back to the host.
  3. Similarly, offloading individual routines one by one is expensive. Consider offloading a "region of code" which contains many BLAS/LAPACK computations.
  4. Pay attention to input/output data alignment on the MIC. Make sure the arrays are aligned on 64-byte boundaries.

Hope this helps.
