I have an OpenMP Fortran application to be run on a machine with two Xeon E5-2620 processors, two Xeon Phi boards, 64 GB RAM, CentOS 7.2, and mpss-3.7. The core of the application goes something like:
allocate (matrix_1(0:511,0:511)); allocate (matrix_2(0:511,0:511))
allocate (vet_1(0:511)); allocate (vet_2(0:511))
!$omp do private(vet_1, vet_2)
do ii = 0, 511
   vet_1(0:511) = matrix_1(0:511,ii); vet_2(0:511) = matrix_2(0:511,ii)
   ! ... work on vet_1 and vet_2 ...
   matrix_1(0:511,ii) = vet_1(0:511); matrix_2(0:511,ii) = vet_2(0:511)
end do
!$omp end do
Without any further optimization, the peak performance I get on the Xeon E5 processors is a reasonable 7.4x speed-up with 11 threads (~19 s execution time). With an offload version of this code, the peak on the Xeon Phi coprocessors occurs at 12 threads with a 4.8x speed-up (~336 s) relative to a single Xeon Phi thread. Despite the similar speed-ups, the execution time on the Xeon processors is roughly 18 times shorter than on the Xeon Phi coprocessors. Considering there are 2*(57-1)*4 = 448 useful threads in my system, how can this piece of code be improved in the offload model?
Any advice will be much welcomed,
Insufficient detail was provided.
You might want to consider having the computationally intensive data (intended for computation on the Xeon Phis) reside on the respective Xeon Phi(s). That eliminates transport of the input data into the offload region on every offload and, if possible, transport of the output data back to the host. IOW, return data to the host only when the results are actually required there.
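As a minimal sketch of that idea, using Intel's Fortran offload directives (the `alloc_if`/`free_if` clauses control persistence of coprocessor buffers; `matrix_1` is borrowed from the snippet above, and the loop body is a placeholder):

```fortran
! One-time transfer: allocate matrix_1 on coprocessor 0, copy it in,
! and keep it resident (do not free at the end of the transfer)
!dir$ offload_transfer target(mic:0) in(matrix_1: alloc_if(.true.) free_if(.false.))

! Repeated offloads reuse the resident copy -- nocopy means no
! per-offload host<->coprocessor traffic for matrix_1
!dir$ offload target(mic:0) nocopy(matrix_1: alloc_if(.false.) free_if(.false.))
!$omp parallel do
do ii = 0, 511
   ! ... compute on matrix_1 in place on the coprocessor ...
end do

! Only when the host needs the results: copy out and free the buffer
!dir$ offload_transfer target(mic:0) out(matrix_1: alloc_if(.false.) free_if(.true.))
```

With ~336 s spent on a 512x512 working set, transfer time alone is unlikely to explain the whole gap, but removing it is the first cheap win before looking at the compute itself.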
When your offloaded code is mostly scalar, you will experience a significant slowdown. The Xeon Phi (KNC) requires a significant portion of the code to use its 512-bit vector units in order to be effective; its cores are much slower than the host's for scalar work.
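As a sketch (assuming the per-column work is an elementwise update, which the snippet does not show; `s` is a hypothetical scalar), keeping the inner loop unit-stride and dependence-free lets the compiler emit 512-bit vector code, and `!$omp simd` asserts that it is safe to do so:

```fortran
! Hypothetical inner kernel over a contiguous column: unit stride,
! no loop-carried dependences, so KNC can process 8 doubles per
! 512-bit vector instruction
!$omp simd
do jj = 0, 511
   vet_1(jj) = vet_1(jj) * s + vet_2(jj)
end do
```

Checking the compiler's optimization report (e.g. `-qopt-report -qopt-report-phase=vec` with ifort) will tell you whether the hot loops actually vectorized for the offload target.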