Performance: offload vs native mode and Fortran module

Minh_H_ · 12-19-2014

Hi,

System: CentOS 7.0; compiler: parallel_studio_xe_2015; MPSS 3.4.2.

I have a Fortran code. Compiled in native mode, the running time is about 2.4 seconds. Compiled in offload mode, the running time is about 5.2 seconds. The data transferred back to the CPU in offload mode is about 144 MB, which I estimate takes about 0.024 seconds at 6 GB/s over PCI Express. The data offloaded from the CPU to the Xeon Phi is about 3.6 MB. My question is: why is the offload code so much slower than the native-mode code? To compile the offload code I run

source /opt/intel/composer_xe_2015/bin/compilervars.sh intel64

and then use the ifort command.
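To make the transfer estimate above concrete, here is a minimal sketch of the arithmetic. It assumes 1 MB = 10**6 bytes and a sustained 6 GB/s in both directions, and it ignores latency, device initialization, and buffer allocation; the program and variable names are only illustrative.

```fortran
program transfer_estimate
  implicit none
  ! Assumptions: 1 MB = 10**6 bytes and a sustained 6 GB/s link,
  ! i.e. 6000 bytes per microsecond. Latency and setup are ignored.
  integer, parameter :: bytes_per_us    = 6000
  integer, parameter :: to_host_bytes   = 144000000   ! about 144 MB back to the CPU
  integer, parameter :: to_device_bytes = 3600000     ! about 3.6 MB to the Xeon Phi

  ! Integer division is exact for these values.
  print '(a,i0,a)', 'device -> host: ', to_host_bytes / bytes_per_us, ' us'
  print '(a,i0,a)', 'host -> device: ', to_device_bytes / bytes_per_us, ' us'
end program transfer_estimate
```

Under these assumptions the two transfers add up to roughly 0.025 seconds, which is small compared with the 2.8-second gap between the offload and native timings.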

I have other questions regarding offload in Fortran:

1.1) Using data in a module:

MODULE shared_data
  REAL global_x(1000)
END MODULE shared_data

subroutine fun1(y)
  use shared_data
  real y(10)
  ! access global_x here
  ! however, global_x = 0 here?
end subroutine

subroutine fun2(y)
  real y(10)
  call fun1(y)
end subroutine

program MAIN
  use shared_data
  real y(10)
  global_x = 1.0
  call fun2(y)
end

I am trying to use this module in subroutines that are offloaded to the Xeon Phi. I have tried various Intel-specific directives and OpenMP 4.0 directives, but I could not transfer the values of global_x from the CPU to the Xeon Phi. Could anyone give a simple example of how this can be done? Note that the code works fine in native mode.
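For anyone reproducing this, here is a minimal sketch of the pattern written with the OpenMP 4.0 `declare target` and `target update` directives as the standard describes them. Whether ifort 15.0 with MPSS 3.4.2 actually transfers the module data for this form is exactly what is in question, so treat that as an assumption to test. The module name `work` and the body of fun1 are placeholders added for illustration. Without an offload-capable compiler the directives are ignored or fall back to the host, so the program still runs.

```fortran
module shared_data
  implicit none
  real :: global_x(1000)
  ! Ask for a device copy of the module variable (OpenMP 4.0).
  !$omp declare target (global_x)
end module shared_data

module work            ! hypothetical module name, not in the original post
  implicit none
contains
  subroutine fun1(y)
    use shared_data
    !$omp declare target
    real, intent(out) :: y(10)
    y = global_x(1:10) + 1.0     ! placeholder use of global_x
  end subroutine fun1

  subroutine fun2(y)
    !$omp declare target
    real, intent(out) :: y(10)
    call fun1(y)
  end subroutine fun2
end module work

program main
  use shared_data
  use work
  implicit none
  real :: y(10)

  global_x = 1.0
  ! Copy the host values of global_x into the device copy.
  !$omp target update to(global_x)

  ! Run fun2 on the target and bring y back to the host.
  !$omp target map(from: y)
  call fun2(y)
  !$omp end target

  print '(a,i0)', 'elements of y equal to 2.0: ', count(y == 2.0)
end program main
```

If the device copy of global_x were still zero, y would come back as 1.0 instead of 2.0, so the printed count gives a quick check of whether the update reached the target.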

1.2) Transferring a subarray of a 2-D/N-D array to the Xeon Phi in offload mode:

I have the array

real x(1000,5)

I would like to offload x(:,i), i=1:4, to four Xeon Phi cards, one column per card, to distribute the work. But when I put that in offload data directives, the compiler reports a non-contiguous array error. How can this be done?
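One possible workaround, sketched below under stated assumptions: copy each column into a contiguous 1-D buffer and map that buffer instead of the array section. Since Fortran stores arrays in column-major order, x(:,i) is itself contiguous in memory, so the error may come from how the section is written in the directive rather than from the data layout; that is a guess, not something confirmed here. The sketch uses OpenMP 4.0 `target device(...)`, not the Intel-specific offload directives, and the buffer `col`, the scaling "work", and the device numbering are illustrative choices.

```fortran
program distribute_columns
  !$ use omp_lib
  implicit none
  integer, parameter :: n = 1000, ncols = 4
  real :: x(n, 5), col(n)
  integer :: i, dev, ndev

  ! Number of target devices; stays 1 when compiled without OpenMP.
  ndev = 1
  !$ ndev = max(omp_get_num_devices(), 1)

  x = 1.0
  do i = 1, ncols
     dev = mod(i - 1, ndev)      ! one column per card when four cards exist
     col = x(:, i)               ! contiguous 1-D copy of column i
     !$omp target device(dev) map(tofrom: col)
     col = col * real(i)         ! placeholder for the real work
     !$omp end target
     x(:, i) = col               ! store the result back into the 2-D array
  end do

  print '(a,4(1x,i0))', 'column sums:', (nint(sum(x(:, i))), i = 1, ncols)
end program distribute_columns
```

As written, the loop visits the devices one after another, so the cards do not work concurrently. Running them at the same time would need something extra, for example one host thread per card, which this sketch leaves out.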

Many thanks in advance for your help.

M