Asynchronous offloading problem on the Intel Xeon Phi 7120P coprocessor

Coronado__Edoardo · ‎03-22-2018

Hello,

I coded the Conjugate Gradient (CG) algorithm using MKL library functions on an Intel Xeon family product.
This version of CG runs fine on the Intel Xeon processor (without offloading); the problem arises when I
try to offload some operations (the sparse matrix-vector products) to the Intel Xeon Phi 7120P
coprocessor.

In line 209 of cg_mkl_csr_intel.c (attached) I initiate an asynchronous transfer of the matrix's
arrays and perform other operations until line 237 of the same file, where execution waits for the
data in order to perform the A * x product. From cg_execution.txt, which contains the output of the
cg_mkl_csr_intel executable (also attached to this post), I observe that starting the asynchronous data
transfer causes no problem, but when the data is needed to perform the product on line 238 of cg_mkl_csr_intel.c,
the following error is generated: "offload error: process on the device 0 was terminated by signal 11 (SIGSEGV)".
I have been unable to identify the cause of this error, hence this post.

I compile the cg_mkl_csr_intel.c file with the following command line:
icc -O3 -qopenmp cg_mkl_csr_intel.c -lm -mkl -o cg_mkl_csr_intel

I run the executable with:
./cg_mkl_csr_intel msym8.txt 8 1e-12

where msym8.txt is a text file containing a sparse symmetric matrix in CSR format (which I am also attaching
to this post).
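
Because the matrix arrays are read from a text file, one check worth running on the host before any offload is whether the CSR index arrays are internally consistent, since an out-of-range index is a common way for a sparse kernel to fault. The following is only a minimal sketch under stated assumptions: it assumes zero-based indexing (which the mkl_cspblas_ routines expect), and the helper name csr_is_valid and its parameters are hypothetical, not part of MKL or of the attached source.

```c
#include <stddef.h>

/* Hypothetical helper (not an MKL function): returns 1 if the CSR index
 * arrays describe a consistent zero-based nrows x ncols matrix with nnz
 * stored entries, and 0 otherwise. */
int csr_is_valid(int nrows, int ncols, int nnz,
                 const int *row, const int *col)
{
    if (nrows < 0 || ncols < 0 || nnz < 0 || row == NULL) return 0;
    if (nnz > 0 && col == NULL) return 0;

    /* Zero-based CSR: the row pointer starts at 0 and ends at nnz. */
    if (row[0] != 0 || row[nrows] != nnz) return 0;

    /* Row pointers must be non-decreasing. */
    for (int i = 0; i < nrows; ++i)
        if (row[i] > row[i + 1]) return 0;

    /* Every column index must address an existing column. */
    for (int k = 0; k < nnz; ++k)
        if (col[k] < 0 || col[k] >= ncols) return 0;

    return 1;
}
```

A call such as csr_is_valid(nrows, nrows, nnz, mat.row, mat.col) right after reading the file would flag one-based or truncated input. Passing this check does not by itself rule out other causes of the SIGSEGV.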

I appreciate any help you can provide to solve this issue.

Kind regards,
Edoardo

Ying_H_Intel · ‎03-26-2018

Hi Edoardo,

A: MKL provides MIC offload samples under the MKL install folder. Could you please try them first and check whether they work?

B: There is a simple sample in the MKL user guide:

/* Upload A and B to the card, and do not deallocate them after the pragma.
 * C is uploaded and downloaded back, but the allocated memory is retained. */
#pragma offload target(mic:0) \
    in(A: length(matrix_elements) alloc_if(1) free_if(0)) \
    in(B: length(matrix_elements) alloc_if(1) free_if(0)) \
    in(transa, transb, N, alpha, beta) \
    inout(C:length(matrix_elements) alloc_if(1) free_if(0))
{
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N,
          &beta, C, &N);
}
/* Change C here */
/* Reuse A and B on the card, and upload the new C. Free all the memory on
 * the card. */
#pragma offload target(mic:0) \
    nocopy(A: length(matrix_elements) alloc_if(0) free_if(1)) \
    nocopy(B: length(matrix_elements) alloc_if(0) free_if(1)) \
    in(transa, transb, N, alpha, beta) \
    inout(C:length(matrix_elements) alloc_if(0) free_if(1))
{
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N,
          &beta, C, &N);
}

Also, in your code it seems that the input array x and the output array Ax have not been transferred to or allocated on the coprocessor. Please consider something like this:

#pragma offload_transfer target(mic:0) signal(mat.val) in(nrows, nnz) in(mat.row:length(nrows+1) ALLOC RETAIN) in(mat.col:length(nnz) ALLOC RETAIN) in(mat.val:length(nnz) ALLOC RETAIN)
{}

#pragma offload target(mic:0) wait(mat.val) in(transa, nrows) nocopy(mat.row:length(nrows+1) REUSE RETAIN) nocopy(mat.col:length(nnz) REUSE RETAIN) nocopy(mat.val:length(nnz) REUSE RETAIN) in(x:length(nrows)) out(Ax:length(nrows)) num_threads( numThrds )

Best Regards,
Ying

Coronado__Edoardo · ‎04-13-2018

A) The sgemm.c example that is in the MKL install folders runs fine.

B) I removed the instruction where I started the asynchronous transfer. Now I start all transfers at the first sparse matrix-vector product (outside the loop):

#pragma offload target( mic: 0 ) \
                in( transa, nrows ) \
                in( mat.val: length(nnz)     ALLOC RETAIN ) \
                in( mat.row: length(nrows+1) ALLOC RETAIN ) \
                in( mat.col: length(nnz)     ALLOC RETAIN ) \
                in(       x: length(nrows)   ALLOC FREE   ) \
                out(     Ax: length(nrows)   ALLOC FREE   ) \
                num_threads( numThrds )
{
     mkl_cspblas_dcsrgemv( &transa, &nrows, mat.val, mat.row, mat.col, x, Ax );     //     Ax = A * x
}

On the second product (inside the loop) I have:

#pragma offload target( mic: 0 ) \
                in( transa, nrows ) \
                nocopy( mat.val: length(nnz)     REUSE RETAIN ) \
                nocopy( mat.row: length(nrows+1) REUSE RETAIN ) \
                nocopy( mat.col: length(nnz)     REUSE RETAIN ) \
                in(           p: length(nrows)   ALLOC FREE   ) \
                out(          v: length(nrows)   ALLOC FREE   ) \
                num_threads( numThrds )
{
     mkl_cspblas_dcsrgemv( &transa, &nrows, mat.val, mat.row, mat.col, p, v ); //      v = A * p
}
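
To help separate a problem in the offload clauses from a problem in the matrix data, the product can also be computed on the host with a plain loop and compared against the MKL result. The following is a minimal sketch, assuming the zero-based 3-array CSR layout that mkl_cspblas_dcsrgemv works with and the non-transposed case; csr_matvec_ref is a hypothetical name, not an MKL routine.

```c
/* Hypothetical host-side reference (not an MKL function):
 * y = A * x for a matrix stored in zero-based 3-array CSR.
 *   val[k]  value of the k-th stored entry
 *   row[i]  index in val/col where row i starts; row[nrows] == nnz
 *   col[k]  column index of the k-th stored entry */
void csr_matvec_ref(int nrows, const double *val, const int *row,
                    const int *col, const double *x, double *y)
{
    for (int i = 0; i < nrows; ++i) {
        double sum = 0.0;
        /* Accumulate the contributions of the stored entries in row i. */
        for (int k = row[i]; k < row[i + 1]; ++k)
            sum += val[k] * x[col[k]];
        y[i] = sum;
    }
}
```

If this loop also faults on the host with the same arrays, the input data is the more likely culprit; if it runs cleanly, attention can shift to how the arrays are passed to the coprocessor.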

I free the allocated memory when the convergence condition is fulfilled and after the loop completes, with:

#pragma offload_transfer target( mic: 0) \
                         nocopy( mat.val: length(nnz)     REUSE FREE ) \
                         nocopy( mat.row: length(nrows+1) REUSE FREE ) \
                         nocopy( mat.col: length(nnz)     REUSE FREE )

As you can see, I am allocating and deallocating all requested memory on the device, yet I still get the same error message:

offload error: process on the device 0 was terminated by signal 11 (SIGSEGV)

Again, I am compiling the source file with:

icc -O3 cg_mkl_csr_intel.c -lm -mkl -o cg_mkl_csr_intel

and running the executable with:

./cg_mkl_csr_intel msym8.txt 10 1e-12

I am attaching the source file (cg_mkl_csr_intel.c) and the matrix file (msym8.txt).

Regards

Edoardo