Hi everyone,
I tried to separate the offload process for axpy (y = a * x + y) into stages: allocate memory on the coprocessor (Xeon Phi) and copy x and y to it -> run the kernel on the coprocessor -> copy the result back from the coprocessor to the host (CPU) -> free the memory on the coprocessor.
I found that the allocate/copy stage for x and y alone takes longer than the whole process run as a single offload with the inout pragma for x and y.
Could anyone explain why this happens? Is there a better way to separate the offload process? (The purpose of separating it is to collect the time of every stage, not just for axpy but for other applications too.)
The performance output follows. The attached file is axpy.c.
Thanks,
Jiawen
[liu@fornax Test_offomp]$ ./a.out
Checking for Intel(R) Xeon Phi(TM) (Target CPU) devices...
Number of Target devices installed: 2
Offload sections will execute on: Target CPU (offload mode)
Copy back to host successfully!
PASS axpy
Copy time = 0.01594615 sec
Kernel time = 0.00443697 sec
Free time = 0.00104403 sec
Total time for separate process = 0.02142906 sec
Total time for inout combined = 0.01055193 sec
With the combined in/out/compute version, there is a single invocation on the card, which updates the pointers x and y and executes the code.
With the three separate steps (transfer_in, compute, transfer_out), there are three invocations on the card. In transfer_in, the pointers x and y need to be updated on the card because you used alloc_if(1); the compute invocation executes the code; and transfer_out frees the bookkeeping on the card for the memory created for x and y because you used free_if(1).
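For illustration, the three invocations described above might look roughly like the sketch below. It is not the attached axpy.c: the names axpy_three_phases, phase_times and wall_seconds are made up for this example, and C11 timespec_get is swapped in for omp_get_wtime so the sketch also builds without OpenMP. A compiler without Intel offload support ignores the pragmas and runs everything on the host.

```c
#include <stddef.h>
#include <time.h>

/* Wall-clock seconds. Assumption: C11 timespec_get stands in for omp_get_wtime. */
static double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical container for the per-phase timings. */
typedef struct {
    double copy_in;        /* allocate on the card + copy x, y in */
    double kernel;         /* compute on resident data */
    double copy_out_free;  /* copy y back + free card buffers */
} phase_times;

/* y = a * x + y, split into three separate offload invocations. */
phase_times axpy_three_phases(double a, double *x, double *y, int n)
{
    phase_times t;
    double start;

    /* Invocation 1: alloc_if(1) creates the card buffers, then x and y are copied in. */
    start = wall_seconds();
    #pragma offload_transfer target(mic:0) \
        in(x : length(n) alloc_if(1) free_if(0)) \
        in(y : length(n) alloc_if(1) free_if(0))
    t.copy_in = wall_seconds() - start;

    /* Invocation 2: run the kernel on the buffers already on the card (no copies). */
    start = wall_seconds();
    #pragma offload target(mic:0) \
        nocopy(x : length(n) alloc_if(0) free_if(0)) \
        nocopy(y : length(n) alloc_if(0) free_if(0))
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }
    t.kernel = wall_seconds() - start;

    /* Invocation 3: copy y back to the host and release both card buffers. */
    start = wall_seconds();
    #pragma offload_transfer target(mic:0) \
        nocopy(x : length(n) alloc_if(0) free_if(1)) \
        out(y : length(n) alloc_if(0) free_if(1))
    t.copy_out_free = wall_seconds() - start;

    return t;
}
```

Calling axpy_three_phases(a, x, y, SIZE) returns the three timings separately; copy_in includes the one-time allocation cost, which is the part the alloc_if(1) explanation above refers to.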
Ideally, when you separate the ins and outs into different pragmas, you transfer data and compute many times between allocation and freeing, so the one-time allocation cost is spread over many uses.
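A minimal sketch of that usage pattern, assuming the same Intel offload pragmas as in this thread: the buffers are allocated once, reused for several transfer and compute steps, and freed once at the end. The function name axpy_repeated and the steps parameter are hypothetical, chosen only for this example. Without Intel offload support the pragmas are ignored and the loop runs on the host.

```c
#include <stddef.h>

/* Apply y = a * x + y `steps` times while allocating card memory only once. */
void axpy_repeated(double a, double *x, double *y, int n, int steps)
{
    /* One-time setup: allocate card buffers and send the initial x and y. */
    #pragma offload_transfer target(mic:0) \
        in(x : length(n) alloc_if(1) free_if(0)) \
        in(y : length(n) alloc_if(1) free_if(0))

    for (int s = 0; s < steps; s++) {
        /* Refresh x on the card; alloc_if(0) reuses the existing buffer. */
        #pragma offload_transfer target(mic:0) \
            in(x : length(n) alloc_if(0) free_if(0))

        /* Compute on the resident buffers; nothing is allocated or copied here. */
        #pragma offload target(mic:0) \
            nocopy(x : length(n) alloc_if(0) free_if(0)) \
            nocopy(y : length(n) alloc_if(0) free_if(0))
        {
            for (int i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }
    }

    /* One-time teardown: bring y back and free both card buffers. */
    #pragma offload_transfer target(mic:0) \
        nocopy(x : length(n) alloc_if(0) free_if(1)) \
        out(y : length(n) alloc_if(0) free_if(1))
}
```

In a real application the host would typically modify x between iterations; here it stays constant only to keep the example short.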
Time the following two pragmas and compare the results:
/* First transfer: alloc_if(1) allocates the card buffers, then x and y are copied. */
double copy_time = omp_get_wtime();
#pragma offload_transfer target(mic:0) in (x: length(SIZE) alloc_if(1) free_if(0)) \
in (y: length(SIZE) alloc_if(1) free_if(0))
copy_time = omp_get_wtime() - copy_time;
/* Second transfer: alloc_if(0) reuses the existing buffers, so only the copy is timed. */
double copy_time1 = omp_get_wtime();
#pragma offload_transfer target(mic:0) in (x: length(SIZE) alloc_if(0) free_if(0)) \
in (y: length(SIZE) alloc_if(0) free_if(0))
copy_time1 = omp_get_wtime() - copy_time1;
printf("Copy time1 = %.8f sec\n\n", copy_time1);
double kernel_time = omp_get_wtime();