
Offload transfer question

H__Kamil
Beginner

Hi, I have a question about transferring data from the host to the coprocessor. Look at the sample code below. Is the data transferred asynchronously to the coprocessor? I would like to overlap the transfer and the computation performed on the Intel Xeon Phi with the computation carried out by the CPU. When I use a combination of offload_transfer with signal() and offload with wait(), performance is lower than with the code presented below.

/*...*/

char* offload;
const int n = 1000;

for(int i=0; i<2000; i++)
{
	#pragma offload target(mic : 0) \
		in( tab1 : length(n) alloc_if(0) free_if(0) ) \
		in( tab2 : length(n) alloc_if(0) free_if(0) ) \
		in( tab3 : length(n) alloc_if(0) free_if(0) ) \
		signal(offload)
	{
		// Computation on Intel Xeon Phi
	}

	// Computation on Host

	// Wait on the same tag that was passed to signal() above
	#pragma offload_wait target(mic : 0) wait(offload)
}	
/*...*/

 

jimdempseyatthecove
Honored Contributor III

What happens when you increase n and/or increase the computation on the Xeon Phi?

Meaning, there is a cost for setting up the signal, managing the signal, and waiting on the signal. When the runtime on the Xeon Phi (per offload) is less than about 2x that additional overhead, it may not be worth using the asynchronous offload. You can create a test from the sketch code above, varying n and/or varying the computation load per n on the Xeon Phi, to get some metrics as to when it is advantageous to use asynchronous offloading for your application.

Jim Dempsey
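
A minimal sketch of the kind of test suggested above, assuming plain C11 on the host. All names here (`step_fn`, `avg_step_time`, `async_probably_worth_it`, `host_stub_step`) are hypothetical and not part of any Intel API. The stub runs on the host only; the real synchronous and signal/wait offload loop bodies would be substituted for it. The 2x threshold is just the rule of thumb from the reply, not a measured constant.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical callback type: performs one timestep for problem size n. */
typedef void (*step_fn)(int n);

/* Wall-clock time in seconds (C11 timespec_get; not a monotonic clock). */
static double now_sec(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Average wall-clock seconds per call of step(n) over iters calls. */
static double avg_step_time(step_fn step, int n, int iters)
{
    assert(iters > 0);
    double t0 = now_sec();
    for (int i = 0; i < iters; i++)
        step(n);
    return (now_sec() - t0) / iters;
}

/* Rule of thumb from the reply: asynchronous offload is probably worth it
 * only if the coprocessor runtime per offload is at least about twice the
 * extra signal/wait overhead. Both inputs are values the reader measures. */
static int async_probably_worth_it(double phi_runtime, double async_overhead)
{
    return phi_runtime >= 2.0 * async_overhead;
}

/* Host-only stand-in for one timestep (assumption: replace with the real
 * offload loop body). The volatile sink keeps the loop from being removed. */
static volatile double sink;

static void host_stub_step(int n)
{
    double acc = 0.0;
    for (int k = 0; k < n; k++)
        acc += (double)k * 0.5;
    sink = acc;
}
```

Calling `avg_step_time(host_stub_step, n, 2000)` for a range of `n` values gives the per-timestep cost of the stand-in. Doing the same with the synchronous and the asynchronous versions of the timestep, and feeding the measured runtime and overhead into `async_probably_worth_it`, shows roughly where the crossover lies for a given workload.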

H__Kamil
Beginner

I have reduced communication between the host and the coprocessor to a minimum: I copy only the data needed for the computation. Every timestep I transfer three arrays, each containing close to 1000 elements (as shown in the source code in my previous post).

/*...*/
 
char* transfer;
char* offload;
const int n = 1000;
 
for(int i=0; i<2000; i++)
{
    #pragma offload_transfer target(mic : 0) \
        in( tab1 : length(n) alloc_if(0) free_if(0) ) \
        in( tab2 : length(n) alloc_if(0) free_if(0) ) \
        in( tab3 : length(n) alloc_if(0) free_if(0) ) \
        signal(transfer)
    
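    // First part of computation on Host

    // Wait for the transfer, then start the computation asynchronously.
    // This is "offload" rather than "offload_transfer" because it has a body.
    #pragma offload target(mic : 0) wait(transfer) \
        nocopy(tab1) nocopy(tab2) nocopy(tab3) signal(offload)
    {
        // Computation on Intel Xeon Phi
    }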
 
    // Second part of computation on Host
	
    #pragma offload_wait target(mic : 0) wait(offload)
}  

/*...*/

This code shows my idea of an asynchronous transfer. The first part of the computation on the CPU takes more time than the asynchronous transfer. I think that the combined cost of issuing the first pragma (offload_transfer) and the second pragma (offload with wait() and signal()) is greater than the cost of the data transfer in my previous post.

Thanks for your reply, Mr. Dempsey. :)

 
