I am trying to perform an asynchronous data transfer to an Intel Xeon Phi. Asynchronous computation works as expected. However, when I combine data transfer and computation in an offload statement, the timing indicates that the data transfer is done synchronously while the subsequent computation is done asynchronously.
A test example that illustrates the point is given below. The output is
0.928997 0.288048
which indicates that almost a second is spent in the asynchronous call, while only 0.28 seconds are spent waiting for that asynchronous call to complete.
Any help would be appreciated.
#include <stdlib.h>
#include <iostream>
using namespace std;
#include "timer.hpp"

#define ALLOC alloc_if(1)
#define FREE free_if(1)
#define RETAIN free_if(0)
#define REUSE alloc_if(0)

int main() {
    int n = 1000*1000*100;
    double *p = (double*)malloc(sizeof(double)*n);
    int rep = 10;

    #pragma offload target(mic:0) in(p:length(n) ALLOC RETAIN)
    {}

    timer t1, t2;
    for(int i=0;i<rep;i++) {
        t1.start();
        #pragma offload_transfer target(mic:0) out(p:length(n) REUSE RETAIN) signal(p)
        /* This works as expected
        #pragma offload_transfer target(mic:0) signal(p)
        { usleep(2e6); }
        */
        t1.stop();

        t2.start();
        #pragma offload_transfer target(mic:0) wait(p)
        t2.stop();
    }
    cout << t1.total() << " " << t2.total() << endl;

    #pragma offload target(mic:0) nocopy(p:length(n) REUSE FREE)
    {}
}
The hardware seems to work properly:
MicCheck 3.4.3-r1
Copyright 2013 Intel Corporation All Rights Reserved

Executing default tests for host
  Test 0: Check number of devices the OS sees in the system ... pass
  Test 1: Check mic driver is loaded ... pass
  Test 2: Check number of devices driver sees in the system ... pass
  Test 3: Check mpssd daemon is running ... pass
Executing default tests for device: 0
  Test 4 (mic0): Check device is in online state and its postcode is FF ... pass
  Test 5 (mic0): Check ras daemon is available in device ... pass
  Test 6 (mic0): Check running flash version is correct ... pass
  Test 7 (mic0): Check running SMC firmware version is correct ... pass

Status: OK
I'm not sure I understand exactly what you are saying when you say: "If I try to combine data transfer and computation (in an offload statement) timing indicates that the data transfer is done synchronously while the following computation is done asynchronously." The example you give doesn't show any computations being done.
In the section between t1.start() and t1.stop(), the first offload_transfer directive tells the compiler that you want to asynchronously transfer the array p from the coprocessor to the host. The second, commented out offload_transfer, tells the compiler that you want to transfer nothing. The offload_transfer directive does not cause any computation to be done on the coprocessor.
Is the point you are making that the entire transfer of the array p seems to be occurring between t1.start() and t1.stop()?
Yes, the point is that the entire transfer of the array p occurs between t1.start() and t1.stop().
The second observation concerns the commented-out block (sorry for the confusion: I meant to use offload, not offload_transfer, in the comment in my code). If I use
t1.start();
#pragma offload target(mic:0) signal(p)
{ usleep(2e6); }
t1.stop();
The output of the program is
0.000119681 20.104
which means that indeed the computation is done asynchronously (this is the behavior I would expect).
If I use
t1.start();
#pragma offload target(mic:0) out(p:length(n) REUSE RETAIN) signal(p)
{ usleep(2e6); }
t1.stop();
The output of the program is
0.000212038 21.2975
which again means that the computation is done asynchronously (this is the behavior I would expect).
However, if I use
t1.start();
#pragma offload target(mic:0) in(p:length(n) REUSE RETAIN) signal(p)
{ usleep(2e6); }
t1.stop();
The output of the program is
0.897727 20.3981
which I believe means that the transfer of p is first done synchronously (between t1.start() and t1.stop()), and only the computation that follows is done asynchronously. The behavior I would expect is that the entire offload block, including the transfer, runs asynchronously.
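The reasoning behind these measurements can be illustrated with a host-only analogy that needs no coprocessor. This is a sketch under stated assumptions, not the offload runtime: the function measure and the struct launch_wait_times are hypothetical names, std::async stands in for a signalled offload, and sleeps stand in for the transfer and for usleep. Any work that blocks before the launch returns shows up in the first number, and the remaining background work shows up in the second.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Seconds spent issuing the "asynchronous" call and seconds spent waiting
// for it afterwards (the roles of t1 and t2 in the example program).
struct launch_wait_times {
    double launch;
    double wait;
};

// Hypothetical host-side model of one loop iteration:
//   sync_ms  - work that blocks the caller before the launch returns
//              (stands in for a transfer performed synchronously)
//   async_ms - work that runs in the background after the launch
//              (stands in for the usleep inside the offload block)
launch_wait_times measure(int sync_ms, int async_ms) {
    using clock_type = std::chrono::steady_clock;
    using seconds = std::chrono::duration<double>;

    const auto t0 = clock_type::now();

    // Blocking phase: the caller cannot proceed until this is finished.
    std::this_thread::sleep_for(std::chrono::milliseconds(sync_ms));

    // Background phase: runs on another thread while the caller continues.
    std::future<void> done = std::async(std::launch::async, [async_ms] {
        std::this_thread::sleep_for(std::chrono::milliseconds(async_ms));
    });

    const auto t1 = clock_type::now();  // the "launch" has returned
    done.wait();                        // analogous to wait(p)
    const auto t2 = clock_type::now();

    return {seconds(t1 - t0).count(), seconds(t2 - t1).count()};
}
```

Calling measure(0, 200) models a fully asynchronous offload: the launch time is small and nearly all of the time lands in the wait. Calling measure(100, 200) models the in(...) case reported above: roughly 0.1 s appears in the launch time even though the background work is unchanged, which is the same signature as 0.897727 appearing in the first column.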