Offload transfer question

H__Kamil · ‎01-21-2016

Hi, I have a question about transferring data from the host to the coprocessor. Look at the sample code below. Is the data transferred to the coprocessor asynchronously? I would like to overlap the transfer and the computation performed on the Intel Xeon Phi with the computation carried out by the CPU. When I use a combination of offload_transfer with signal() and offload with wait(), the computation performs worse than with the code presented below.

/*...*/

char* offload;
const int n = 1000;

for(int i=0; i<2000; i++)
{
	#pragma offload target(mic : 0) \
		in( tab1 : length(n) alloc_if(0) free_if(0) ) \
		in( tab2 : length(n) alloc_if(0) free_if(0) ) \
		in( tab3 : length(n) alloc_if(0) free_if(0) ) \
		signal(offload)
	{
		// Computation on Intel Xeon Phi
	}

	// Computation on Host

	// Wait on the same tag that was passed to signal() above
	#pragma offload_wait target(mic : 0) wait(offload)
}	
/*...*/

jimdempseyatthecove · ‎01-21-2016

What happens when you increase n and/or increase the computation on the Xeon Phi?

Meaning, there is a cost to setting up the signal, managing it, and waiting on it. When the run time on the Xeon Phi (per offload) is less than about 2x the additional overhead, asynchronous offload may not be worth using. You can build a test from the sketch code above, varying n and/or the computation load per n on the Xeon Phi, to measure when asynchronous offloading becomes advantageous for your application.
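A test of this kind could be built around a small timing harness. The sketch below is only one possible structure, not a reference implementation: `step_fn`, `average_step_seconds`, `placeholder_step`, and `print_sweep` are made-up names, and the placeholder runs on the host only. In a real measurement each step function would contain one iteration of the offload loop (pragmas included) and would be built with the Intel compiler.

```c
#include <stdio.h>
#include <time.h>

/* One timestep of the loop under test; n is the array length. */
typedef void (*step_fn)(int n);

/* Wall-clock time in seconds, using C11 timespec_get. */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Average wall-clock seconds per call of step(n).
   iterations must be greater than zero. */
double average_step_seconds(step_fn step, int n, int iterations)
{
    double start = now_seconds();
    for (int i = 0; i < iterations; i++)
        step(n);
    return (now_seconds() - start) / iterations;
}

/* Keeps the compiler from discarding the placeholder work. */
static volatile double sink;

/* Hypothetical stand-in for one loop body. Replace it with the
   synchronous or asynchronous offload variant you want to time. */
void placeholder_step(int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += (double)i * 0.5;
    sink = sum;
}

/* Time the same step for several problem sizes. */
void print_sweep(step_fn step, int iterations)
{
    const int sizes[] = { 1000, 10000, 100000 };
    for (int k = 0; k < 3; k++) {
        double t = average_step_seconds(step, sizes[k], iterations);
        printf("n = %6d  avg step = %.3f us\n", sizes[k], t * 1e6);
    }
}
```

Calling `print_sweep` once per loop variant, with the same iteration count, gives per-step averages that can be compared directly for each n.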

Jim Dempsey

H__Kamil · ‎01-21-2016

I have reduced communication between the host and the coprocessor to a minimum. I copy to the coprocessor only the data necessary for the computation. Every timestep I transfer three arrays that each contain close to 1000 elements (as shown in the source code in my previous post).

/*...*/
 
char* transfer;
char* offload;
const int n = 1000;
 
for(int i=0; i<2000; i++)
{
    #pragma offload_transfer target(mic : 0) \
        in( tab1 : length(n) alloc_if(0) free_if(0) ) \
        in( tab2 : length(n) alloc_if(0) free_if(0) ) \
        in( tab3 : length(n) alloc_if(0) free_if(0) ) \
        signal(transfer)
    
    // First part of computation on Host

    // A pragma followed by a code block must be offload, not offload_transfer
    #pragma offload target(mic : 0) wait(transfer) \
        nocopy(tab1) nocopy(tab2) nocopy(tab3) signal(offload)
    {
        // Computation on Intel Xeon Phi
    }
 
    // Second part of computation on Host
	
    #pragma offload_wait target(mic : 0) wait(offload)
}  

/*...*/

This code shows my idea of asynchronous transfer. The first part of the computation on the CPU takes more time than the asynchronous transfer. I think that the combined cost of issuing the first pragma (offload_transfer) and the second pragma (offload wait() ... signal()) is higher than the cost of the data transfer in the code from my previous post.
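This reasoning can be made concrete with a rough cost model. The sketch below rests on assumptions that are not measurements: every pragma costs the host the same fixed `pragma_us`, host and coprocessor work overlap perfectly, and all names (`step_cost`, `single_offload_step_us`, `split_transfer_step_us`) are invented for illustration.

```c
/* Hypothetical per-timestep costs in microseconds. */
typedef struct {
    double transfer_us;  /* copy of tab1..tab3 to the coprocessor */
    double phi_us;       /* computation on the coprocessor */
    double host1_us;     /* first part of the host computation */
    double host2_us;     /* second part of the host computation */
    double pragma_us;    /* assumed host-side cost of issuing one pragma */
} step_cost;

static double max2(double a, double b)
{
    return a > b ? a : b;
}

/* Variant from the first post: one asynchronous offload carries the
   transfer, the host does all of its work, then offload_wait.
   Two pragmas per step. */
double single_offload_step_us(step_cost c)
{
    double device = c.transfer_us + c.phi_us;
    double host = c.host1_us + c.host2_us;
    return c.pragma_us + max2(device, host) + c.pragma_us;
}

/* Variant from this post: offload_transfer, first host part, offload
   with wait(transfer), second host part, then offload_wait.
   Three pragmas per step, and the coprocessor computation cannot start
   before the first host part has finished. */
double split_transfer_step_us(step_cost c)
{
    double phase1 = c.pragma_us + max2(c.transfer_us, c.host1_us);
    double phase2 = c.pragma_us + max2(c.phi_us, c.host2_us);
    return phase1 + phase2 + c.pragma_us;
}
```

Under these assumptions, whenever the first host part takes longer than the transfer, the split variant comes out slower than the single offload by at least one `pragma_us`. Whether real hardware behaves this way has to be checked by measurement.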

Thanks for your reply, Mr. Dempsey. :)