Asynchronous data transfer does not work

Lukas_E_ · 06-12-2015

I am trying to perform an asynchronous data transfer to an Intel Xeon Phi. Note that asynchronous computation works as expected. If I combine data transfer and computation in an offload statement, timing indicates that the data transfer is done synchronously while the subsequent computation is done asynchronously.

A test example that illustrates the point is given below. The output is
0.928997 0.288048
which indicates that almost a second is spent in the asynchronous call, while only 0.28 seconds are spent waiting for that call to complete.

Any help would be appreciated.

#include <stdlib.h>
#include <unistd.h>   // usleep(), used in the commented-out variant below
#include <iostream>
using namespace std;

#include "timer.hpp"

#define ALLOC   alloc_if(1)
#define FREE    free_if(1)
#define RETAIN  free_if(0)
#define REUSE   alloc_if(0)

int main() {
    int n = 1000*1000*100;
    double *p = (double*)malloc(sizeof(double)*n);
    int rep = 10;

    #pragma offload target(mic:0) in(p:length(n) ALLOC RETAIN)
    {}

    timer t1, t2;
    for(int i=0;i<rep;i++) {
        t1.start();
        #pragma offload_transfer target(mic:0) out(p:length(n) REUSE RETAIN) signal(p)
        /* This works as expected 
        #pragma offload_transfer target(mic:0) signal(p)
        { usleep(2e6); }
        */
        t1.stop();

        t2.start();
        #pragma offload_transfer target(mic:0) wait(p)
        t2.stop();

    }

    cout << t1.total() << " " << t2.total() << endl;

    #pragma offload target(mic:0) nocopy(p:length(n) REUSE FREE)
    {}

    free(p);  // release the host buffer
}
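The header timer.hpp is not shown in the post. A minimal sketch of what it might look like is given below, assuming it accumulates wall-clock time over repeated start()/stop() pairs and reports the total in seconds. The class layout and the use of std::chrono are assumptions; only the names timer, start, stop and total come from the code above.

```cpp
// Hypothetical stand-in for timer.hpp (the original header is not shown).
#include <chrono>

class timer {
public:
    // Begin a timed interval.
    void start() { begin_ = clock_type::now(); }

    // End the current interval and add its length to the running total.
    void stop() {
        total_ += std::chrono::duration<double>(clock_type::now() - begin_).count();
    }

    // Accumulated wall-clock time over all start()/stop() pairs, in seconds.
    double total() const { return total_; }

private:
    using clock_type = std::chrono::steady_clock;  // monotonic, unaffected by clock adjustments
    clock_type::time_point begin_{};
    double total_ = 0.0;
};
```

With a timer of this kind, t1.total() is the summed time of all 10 launches and t2.total() the summed time of all 10 waits.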

The hardware seems to work properly:

MicCheck 3.4.3-r1
Copyright 2013 Intel Corporation All Rights Reserved

Executing default tests for host
  Test 0: Check number of devices the OS sees in the system ... pass
  Test 1: Check mic driver is loaded ... pass
  Test 2: Check number of devices driver sees in the system ... pass
  Test 3: Check mpssd daemon is running ... pass
Executing default tests for device: 0
  Test 4 (mic0): Check device is in online state and its postcode is FF ... pass
  Test 5 (mic0): Check ras daemon is available in device ... pass
  Test 6 (mic0): Check running flash version is correct ... pass
  Test 7 (mic0): Check running SMC firmware version is correct ... pass

Status: OK

Frances_R_Intel · 06-13-2015

I'm not sure I understand exactly what you are saying when you say: "If I try to combine data transfer and computation (in an offload statement) timing indicates that the data transfer is done synchronously while the following computation is done asynchronously." The example you give doesn't show any computations being done.

In the section between t1.start() and t1.stop(), the first offload_transfer directive tells the compiler that you want to asynchronously transfer the array p from the coprocessor to the host. The second, commented-out offload_transfer tells the compiler that you want to transfer nothing. The offload_transfer directive does not cause any computation to be done on the coprocessor.

Is the point you are making that the entire transfer of the array p seems to be occurring between t1.start() and t1.stop()?
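To illustrate the distinction between the two directives, here is a rough sketch that starts a data-only transfer with offload_transfer and then runs a code block with offload. The function name transfer_then_compute and the summation loop are made up for illustration; the clauses follow the style used in the original post, and the code has not been checked on the poster's system. A compiler without offload support ignores the pragmas and runs everything on the host.

```cpp
#include <cstdlib>

// Hypothetical example: fill a buffer on the host, ship it to the coprocessor
// asynchronously, then sum it there. Returns the sum.
double transfer_then_compute(int n) {
    double *p = static_cast<double*>(std::malloc(sizeof(double) * n));
    if (p == nullptr) return 0.0;
    for (int i = 0; i < n; ++i) p[i] = 1.0;
    double sum = 0.0;

    // Data only: request an asynchronous copy of p to the card. No code runs there.
    #pragma offload_transfer target(mic:0) in(p : length(n) alloc_if(1) free_if(0)) signal(p)

    // Independent host work could overlap with the transfer here.

    // Code block: wait for the transfer tagged p, then run the loop on the card,
    // reusing the buffer that is already allocated there and freeing it afterwards.
    #pragma offload target(mic:0) wait(p) nocopy(p : length(n) alloc_if(0) free_if(1)) inout(sum)
    {
        for (int i = 0; i < n; ++i) sum += p[i];
    }

    std::free(p);
    return sum;
}
```

The first directive only moves data; only the second one executes anything on the coprocessor.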

Lukas_E_ · 06-13-2015

Yes, the point is that the entire transfer of the array p occurs between t1.start() and t1.stop().

My second observation is the following (sorry for the confusion: I meant to use offload, not offload_transfer, in the commented-out part of my code). If I use

t1.start();
#pragma offload target(mic:0) signal(p)
{ usleep(2e6); }
t1.stop();

The output of the program is
0.000119681 20.104
which means that indeed the computation is done asynchronously (this is the behavior I would expect).

If I use

t1.start();
#pragma offload target(mic:0) out(p:length(n) REUSE RETAIN) signal(p)
{ usleep(2e6); }
t1.stop();

The output of the program is
0.000212038 21.2975
which again means that the computation is done asynchronously (this is the behavior I would expect).

However, if I use

t1.start();
#pragma offload target(mic:0) in(p:length(n) REUSE RETAIN) signal(p)
{ usleep(2e6); }
t1.stop();

The output of the program is
0.897727 20.3981
which I believe means that the transfer of p is first done synchronously (between t1.start() and t1.stop()), and only the computation that follows is done asynchronously. The behavior I would expect is that the entire offload block runs asynchronously.
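The reasoning behind these measurements can be reproduced with a host-only model that does not involve the offload runtime at all. The sketch below uses std::async and sleeps as stand-ins for the transfer and the computation. The names Timing and run_once are hypothetical, and the model only shows how the two timers separate a blocking launch from a non-blocking one; it says nothing about why the real transfer blocks.

```cpp
#include <chrono>
#include <future>
#include <thread>

struct Timing {
    double launch;  // seconds until the launch returns (plays the role of t1)
    double wait;    // seconds spent waiting for completion (plays the role of t2)
};

// Host-only model of one offload: a "transfer" phase followed by a "compute" phase.
// If transfer_is_async is false, the transfer runs on the calling thread before
// the launch returns; otherwise both phases run on a background thread.
Timing run_once(bool transfer_is_async,
                std::chrono::milliseconds transfer,
                std::chrono::milliseconds compute) {
    using steady = std::chrono::steady_clock;
    auto seconds = [](steady::duration d) {
        return std::chrono::duration<double>(d).count();
    };

    auto t0 = steady::now();
    if (!transfer_is_async) {
        std::this_thread::sleep_for(transfer);  // blocks the caller
    }
    std::future<void> signal = std::async(std::launch::async, [=] {
        if (transfer_is_async) {
            std::this_thread::sleep_for(transfer);
        }
        std::this_thread::sleep_for(compute);
    });
    auto t1 = steady::now();  // launch has returned
    signal.wait();            // analogous to wait(p)
    auto t2 = steady::now();

    return {seconds(t1 - t0), seconds(t2 - t1)};
}
```

In this model, run_once(true, ...) puts almost all of the time into wait, matching the first two variants above, while run_once(false, ...) puts the transfer time into launch and only the compute time into wait, matching the pattern seen with the in() clause.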