Unexpected Performance for Separate Process of offload

Jiawen_L_ · ‎01-11-2016

Hi everyone,

I tried to split the offload of axpy (y = a * x + y) into separate steps: allocate memory for x and y on the coprocessor (Xeon Phi) and copy them in -> run the kernel on the coprocessor -> copy the result back from the coprocessor to the host (CPU) -> free the memory on the coprocessor.

I found that the allocate/copy step for x and y alone takes longer than the whole combined process (a single offload with the inout clause for x and y).

Could anyone explain why this happens? Is there a better way to separate the offload steps? (The purpose of separating them is to collect the time of each step, not just for axpy but for other applications as well.)

The timings are shown below, and axpy.c is attached.

Thanks,

Jiawen

[liu@fornax Test_offomp]$ ./a.out

Checking for Intel(R) Xeon Phi(TM) (Target CPU) devices...

Number of Target devices installed: 2

Offload sections will execute on: Target CPU (offload mode)

Copy back to host successfully!

PASS axpy

Copy time = 0.01594615 sec

Kernel time = 0.00443697 sec

Free time = 0.00104403 sec

Total time for separate process = 0.02142906 sec

Total time for inout combined = 0.01055193 sec

Ravi_N_Intel · ‎01-11-2016

For the combined in/out/compute version, there is a single invocation on the card, which updates the pointers x and y and executes the code.
For the three separate steps (transfer_in/compute/transfer_out), there are three invocations on the card. In transfer_in, the pointers x and y need to be updated on the card because you used alloc_if(1); the compute invocation executes the code; and transfer_out frees the bookkeeping on the card for the memory created for x and y because you used free_if(1).
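For reference, the combined form described above might look like the following minimal sketch. The function name axpy_combined and its signature are made up for illustration and are not taken from the attached axpy.c. It relies on the default behaviour of the in/inout clauses (allocate on entry, free on exit), so allocation, copy-in, compute, copy-out and free all happen in one invocation. A compiler without Intel offload support ignores the pragma and runs the loop on the host.

```c
/* Hypothetical helper, not from the original axpy.c. */
void axpy_combined(double *x, double *y, double a, int n)
{
    /* One invocation on the card: x is copied in, y is copied in and back
       out, and both buffers are allocated and freed around the block. */
#pragma offload target(mic:0) in(x : length(n)) inout(y : length(n))
    {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }
}
```

Calling axpy_combined(x, y, a, SIZE) and timing the call gives the "inout combined" number to compare against the sum of the separate steps.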

Ideally, when you separate the ins and outs into different pragmas, you would transfer data and compute many times between allocation and freeing.
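A rough sketch of that allocate-once, reuse-many-times pattern is below. The function name axpy_repeated and the iters parameter are assumptions for illustration. The clause combinations (length(0) with alloc_if(0) free_if(0) to re-associate the pointers, nocopy with free_if(1) to release x) are written from memory of the Intel offload documentation and should be checked against your compiler version. Without offload support the pragmas are ignored and everything runs on the host.

```c
/* Hypothetical helper, not from the original axpy.c. */
void axpy_repeated(double *x, double *y, double a, int n, int iters)
{
    /* Pay the allocation cost once: allocate x and y on the card, copy in. */
#pragma offload_transfer target(mic:0) \
    in(x : length(n) alloc_if(1) free_if(0)) \
    in(y : length(n) alloc_if(1) free_if(0))

    for (int it = 0; it < iters; it++) {
        /* Reuse the resident buffers: length(0) moves no data and only
           updates the card-side pointers; nothing is allocated or freed. */
#pragma offload target(mic:0) \
    in(x : length(0) alloc_if(0) free_if(0)) \
    in(y : length(0) alloc_if(0) free_if(0))
        {
            for (int i = 0; i < n; i++)
                y[i] += a * x[i];
        }
    }

    /* Copy the result back and release both card buffers once. */
#pragma offload_transfer target(mic:0) \
    out(y : length(n) alloc_if(0) free_if(1)) \
    nocopy(x : length(n) alloc_if(0) free_if(1))
}
```

With this layout, the cost of alloc_if(1) and free_if(1) is spread over iters kernel launches, so timing each pragma separately shows how much of the first transfer is allocation rather than data movement.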

Time the following two pragmas and compare the results:

/* First transfer: allocates x and y on the card (alloc_if(1)), then copies. */
double copy_time = omp_get_wtime();
#pragma offload_transfer target(mic:0) in (x: length(SIZE) alloc_if(1) free_if(0)) \
in (y: length(SIZE) alloc_if(1) free_if(0))

copy_time = omp_get_wtime() - copy_time;

/* Second transfer: reuses the existing allocation (alloc_if(0)), copy only. */
double copy_time1 = omp_get_wtime();
#pragma offload_transfer target(mic:0) in (x: length(SIZE) alloc_if(0) free_if(0)) \
in (y: length(SIZE) alloc_if(0) free_if(0))

copy_time1 = omp_get_wtime() - copy_time1;
printf("Copy time1 = %.8f sec\n\n", copy_time1);

double kernel_time = omp_get_wtime();