Joao_Paulo_Z_
Beginner
49 Views

Asynchronous Data Transfer

Hello all,

I am implementing a linear solver in which some kernels, such as SpMV, will run on the MIC coprocessor. I am using the three-array version of the BSR format, and I am trying to test an asynchronous data transfer as follows:

//Initialize the Coprocessor with an empty transfer
#pragma offload_transfer target(mic:0)

//Passing the rowIndex array asynchronous
#pragma offload_transfer target(mic:0) in(rowIndex : length(size + 1)) signal(rowIndex)

double *x = (double*) mkl_malloc(size*blockSize*sizeof(double),64);
double *y = (double*) mkl_malloc(size*blockSize*sizeof(double),64);

for (int i = 0; i < size*blockSize; i++) x[i] = i;

//Passing the other arguments and waiting for the rowIndex vector
#pragma offload target(mic:0) wait(rowIndex) in(columns:length(nnz)) in(values:length(nnz*blockSize*blockSize)) in(x:length(size*blockSize)) inout(x:length(size*blockSize))
{
       spmv(size, blockSize, rowIndex, columns, values, x, y);
} 

But when I run the code I receive the following error message:

error: pointer variable "rowIndex" in this offload region must be specified in an in/out/inout/nocopy clause

I can't see what I am doing wrong. Can someone help me?

 

Thanks!

0 Kudos
4 Replies
Kevin_D_Intel
Employee
49 Views

Our apologies for the delayed reply.

I cannot reproduce the error with the snippet provided, and I'm not sure how you are getting a run-time failure from it. The snippet will not compile because variable 'x' appears in two data movement clauses, in and inout, on line 13. Even after correcting this, I do not get the run-time error you noted.

$  icpc -V
Intel(R) C++ Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 16.0.3.210 Build 20160415
Copyright (C) 1985-2016 Intel Corporation.  All rights reserved.

$  icpc u656695.cpp
u656695.cpp(24): error: variable "x" is listed in incompatible input/output clauses
  #pragma offload target(mic:0) wait(rowIndex) in(columns:length(nnz)) in(values:length(nnz*blockSize*blockSize)) in(x:length(size*blockSize)) inout(x:length(size*blockSize))
  ^

compilation aborted for u656695.cpp (code 2)

If you are suffering a run-time error then try setting the environment variable OFFLOAD_REPORT=3, re-run, and then capture the output and provide that in a reply.

Even better would be if you can provide a complete reproducer along with the version of the compiler you are using (icpc -V).

Kevin_D_Intel
Employee
49 Views

Now I see the issue, but you are receiving that error at compile time, not at run time.

Also, I believe the earlier error that I noted came from a typo: the instance of 'x' in the inout clause was likely meant to be 'y'. With that change, and after some consultation with Development, what you received was a compile-time error like this:

$  icpc u656695.cpp
u656695.cpp(26): error: pointer variable "rowIndex" in this offload region must be specified in an in/out/inout/nocopy clause
  #pragma offload target(mic:0) wait(rowIndex) in(columns:length(nnz)) in(values:length(nnz*blockSize*blockSize)) in(x:length(size*blockSize)) inout(y:length(size*blockSize))
  ^

compilation aborted for u656695.cpp (code 2)

That occurs because rowIndex is completely absent from your #pragma offload on line 13. When adding it on line 13, use length(0) and add appropriate free_if/alloc_if modifiers on lines 5 and 13 to manage the allocation and ensure it is reusable.

Here's your snippet with all the necessary changes:

//Initialize the Coprocessor with an empty transfer
#pragma offload_transfer target(mic:0)

//Passing the rowIndex array asynchronous
#pragma offload_transfer target(mic:0) in(rowIndex : length(size + 1) free_if(0)) signal(rowIndex)

double *x = (double*) mkl_malloc(size*blockSize*sizeof(double),64);
double *y = (double*) mkl_malloc(size*blockSize*sizeof(double),64);

for (int i = 0; i < size*blockSize; i++) x[i] = i;

//Passing the other arguments and waiting for the rowIndex vector
#pragma offload target(mic:0) wait(rowIndex) in(columns:length(nnz)) in(values:length(nnz*blockSize*blockSize)) in(x:length(size*blockSize)) inout(y:length(size*blockSize)) in(rowIndex:length(0) alloc_if(0))
{
       spmv(size, blockSize, rowIndex, columns, values, x, y);
}
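
The spmv routine itself is not shown in this thread. For anyone who wants to compile the snippet, here is a minimal host-side sketch that matches the call spmv(size, blockSize, rowIndex, columns, values, x, y). It is an illustration, not the original poster's kernel, and it assumes zero-based indices, row-major storage inside each blockSize x blockSize block, and that the result is y = A*x. To call it inside the offload region it would presumably also need to be compiled for the coprocessor (for example with __attribute__((target(mic)))).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical BSR (three-array) sparse matrix-vector product, y = A * x.
// Assumptions (not confirmed by the thread): zero-based rowIndex/columns,
// row-major blocks, `size` block rows, square blocks of order `blockSize`.
void spmv(int size, int blockSize, const int *rowIndex, const int *columns,
          const double *values, const double *x, double *y)
{
    assert(size >= 0 && blockSize > 0);

    // Clear the output vector before accumulating block contributions.
    for (int i = 0; i < size * blockSize; ++i) y[i] = 0.0;

    for (int br = 0; br < size; ++br) {                        // block row
        for (int k = rowIndex[br]; k < rowIndex[br + 1]; ++k) { // blocks in row
            const double *block =
                values + static_cast<std::size_t>(k) * blockSize * blockSize;
            const int bc = columns[k];                          // block column
            for (int r = 0; r < blockSize; ++r) {
                double sum = 0.0;
                for (int c = 0; c < blockSize; ++c)
                    sum += block[r * blockSize + c] * x[bc * blockSize + c];
                y[br * blockSize + r] += sum;
            }
        }
    }
}
```

The array lengths used here line up with the length() clauses in the pragma above: rowIndex has size + 1 entries, columns has nnz entries, and values has nnz*blockSize*blockSize entries.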

 

Joao_Paulo_Z_
Beginner
49 Views

Hi, Kevin!

Thank you very much, the offload works now. One more question: how can I measure the data transfer time? I used OFFLOAD_REPORT=2, but it is not quite what I am looking for. For now I am using timers, one before and one after the transfer, but I don't think this is the right way.

Thanks =)

Kevin_D_Intel
Employee
49 Views

Glad to hear that helped.

I haven't spent much time on benchmarking/timing aspects. I believe timing of the kind you described would also include execution time, not just data transfer.

I believe you should be able to derive the data transfer time from the OFFLOAD_REPORT=2 output by calculating:

[data transfer time] = [CPU-Time] - [MIC-Time]

The overall data bandwidth for a particular offload can also be found from the OFFLOAD_REPORT=2 output by calculating:

[data bandwidth] = [total bytes of data transferred] / [data transfer time]
    -OR-
[data bandwidth] = ([CPU->MIC Data] + [MIC->CPU Data]) / ([CPU-Time] - [MIC-Time])
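
As a rough illustration of the arithmetic above, here is a small sketch that takes the four values read off an OFFLOAD_REPORT=2 report and applies the two formulas. The struct and function names are made up for this example, and the numbers have to be copied in by hand (nothing here parses the report). It assumes the times are in seconds and the data sizes in bytes, and it inherits the caveat that CPU-Time minus MIC-Time is only an estimate of the transfer time.

```cpp
#include <cassert>

// Hypothetical container for values copied by hand from an
// OFFLOAD_REPORT=2 report for a single offload.
struct OffloadTiming {
    double cpuTimeSec;     // [CPU-Time], assumed to be in seconds
    double micTimeSec;     // [MIC-Time], assumed to be in seconds
    double cpuToMicBytes;  // [CPU->MIC Data], assumed to be in bytes
    double micToCpuBytes;  // [MIC->CPU Data], assumed to be in bytes
};

// [data transfer time] = [CPU-Time] - [MIC-Time]
double transferTimeSec(const OffloadTiming &t)
{
    assert(t.cpuTimeSec >= t.micTimeSec);
    return t.cpuTimeSec - t.micTimeSec;
}

// [data bandwidth] = ([CPU->MIC Data] + [MIC->CPU Data]) / [data transfer time]
double bandwidthBytesPerSec(const OffloadTiming &t)
{
    const double dt = transferTimeSec(t);
    assert(dt > 0.0);  // avoid dividing by zero for a degenerate report
    return (t.cpuToMicBytes + t.micToCpuBytes) / dt;
}
```

For instance, with CPU-Time = 0.5 s, MIC-Time = 0.25 s, 1,000,000 bytes sent and 500,000 bytes received, the estimated transfer time is 0.25 s and the bandwidth is 6,000,000 bytes per second.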

Since your code includes one, perhaps you are already aware that when timing offloads, certain BKMs are used to "warm up" the coprocessor first, either with an empty offload (as you have) or by setting export OFFLOAD_INIT=on_start. The initial buffer allocations on the coprocessor take a bit of time, and you do not want them to skew the timings.

There are other posts in this forum on the same topic, so (if you haven't already) you might search for those. Hopefully others who do more in-depth timing can correct me if I'm wrong anywhere here.
