offload overhead

Guangming_T_
Beginner

Without switching to native mode, is there a way to stop the runtime from creating memory buffers for an offload region? The host-side (CPU) time per offload is so high that my accelerated program cannot achieve any speedup. Note that all the IN variables are scalars. The offload report for one region is:

[Offload] [MIC 0] [Line]            144
[Offload] [MIC 0] [Tag]             Tag 1598
[Offload] [HOST]  [Tag 1598] [State]   Start Offload
[Offload] [HOST]  [Tag 1598] [State]   Initialize function __offload_entry_AcceleratorUtilitiesOp_C_144doArrayDa_cfaca3494cc6212aae7ad712694b42c4
[Offload] [HOST]  [Tag 1598] [State]   Create buffer from Host memory
[Offload] [HOST]  [Tag 1598] [State]   Create buffer from MIC memory
[Offload] [HOST]  [Tag 1598] [State]   Send pointer data
[Offload] [HOST]  [Tag 1598] [State]   CPU->MIC pointer data 1
[Offload] [HOST]  [Tag 1598] [State]   Gather copyin data
[Offload] [HOST]  [Tag 1598] [State]   CPU->MIC copyin data 68
[Offload] [HOST]  [Tag 1598] [State]   Compute task on MIC
[Offload] [HOST]  [Tag 1598] [State]   Receive pointer data
[Offload] [HOST]  [Tag 1598] [State]   MIC->CPU pointer data 0
[Offload] [MIC 0] [Tag 1598] [State]   Start target function __offload_entry_AcceleratorUtilitiesOp_C_144doArrayDa_cfaca3494cc6212aae7ad712694b42c4
[Offload] [MIC 0] [Tag 1598] [Var]     dst_begin  IN
[Offload] [MIC 0] [Tag 1598] [Var]     src_begin  IN
[Offload] [MIC 0] [Tag 1598] [Var]     num_depth  IN
[Offload] [MIC 0] [Tag 1598] [Var]     ngroups  IN
[Offload] [MIC 0] [Tag 1598] [Var]     dst_offset  IN
[Offload] [MIC 0] [Tag 1598] [Var]     src_offset  IN
[Offload] [MIC 0] [Tag 1598] [Var]     dst_addr  IN
[Offload] [MIC 0] [Tag 1598] [Var]     src_addr  IN
[Offload] [MIC 0] [Tag 1598] [Var]     box_inc0  IN
[Offload] [MIC 0] [Tag 1598] [Var]     op  IN
[Offload] [MIC 0] [Tag 1598] [Var]     box_inc1  IN
[Offload] [MIC 0] [Tag 1598] [Var]     dst_inc0  IN
[Offload] [MIC 0] [Tag 1598] [Var]     src_inc0  IN
[Offload] [MIC 0] [Tag 1598] [Var]     box_inc2  IN
[Offload] [MIC 0] [Tag 1598] [Var]     dst_inc1  IN
[Offload] [MIC 0] [Tag 1598] [Var]     src_inc1  IN
[Offload] [MIC 0] [Tag 1598] [State]   Scatter copyin data
[Offload] [MIC 0] [Tag 1598] [State]   Gather copyout data
[Offload] [MIC 0] [Tag 1598] [State]   MIC->CPU copyout data   0
[Offload] [HOST]  [Tag 1598] [State]   Scatter copyout data
[Offload] [HOST]  [Tag 1598] [CPU Time]        0.002393(seconds)
[Offload] [MIC 0] [Tag 1598] [CPU->MIC Data]   69 (bytes)
[Offload] [MIC 0] [Tag 1598] [MIC Time]        0.000994(seconds)
[Offload] [MIC 0] [Tag 1598] [MIC->CPU Data]   0 (bytes)
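
To put a number on the overhead, the two timings in the report can be compared directly. The following is a minimal C sketch, assuming that the host-reported CPU Time covers the whole offload and that subtracting the MIC Time approximates the cost of buffer setup, transfer, and synchronization. The function names are illustrative and not part of any Intel API.

```c
#include <assert.h>

/* Hypothetical helper: host-observed time that is not spent
 * computing on the coprocessor (assumed to be offload overhead). */
double offload_overhead_seconds(double host_time, double mic_time) {
    assert(host_time >= mic_time);
    return host_time - mic_time;
}

/* Hypothetical helper: the same overhead as a fraction of the
 * host-observed time. */
double offload_overhead_fraction(double host_time, double mic_time) {
    assert(host_time > 0.0);
    return offload_overhead_seconds(host_time, mic_time) / host_time;
}
```

With the values reported for Tag 1598 (0.002393 s on the host, 0.000994 s on MIC 0), this gives about 0.0014 s of overhead, roughly 58% of the host-observed time.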

Kevin_D_Intel
Employee

It would help to see some code to better understand what you are trying to do. From the report, it appears there is pointer data within the scope of the offload, although no data transfer is associated with it.
The alloc_if, free_if, and length modifiers can influence the allocation. From some experimenting, nocopy() with length(0) or alloc_if(0) appears to avoid the allocation, but it also makes the pointer data unusable within the scope of the offload, so I don't know whether that's helpful.
You might refer to the sections "Minimize Coprocessor Memory Allocation Overhead" and "Explicitly managed Heap-allocated Data" of the article Effective Use of the Intel Compiler's Offload Features to see whether anything there helps.
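
As an illustration of how these modifiers are typically combined, here is a minimal sketch that allocates a buffer on the coprocessor once, reuses it across offloads without reallocating or retransferring, and frees it at the end. It is based on my understanding of the offload clauses and should be checked against the compiler documentation. The function repeated_sum and its arguments are hypothetical and are not taken from the code under discussion. The pattern relies on the runtime associating a host pointer with a coprocessor buffer, so it may not apply directly to a raw coprocessor address held in a void *. With a compiler that lacks offload support, the pragmas are ignored and the code simply runs on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: sum an array `reps` times while transferring
 * it to the coprocessor only once. */
double repeated_sum(double *data, int n, int reps) {
    assert(data != NULL && n > 0 && reps > 0);
    double total = 0.0;

    /* One-time setup: allocate on the coprocessor, copy the data in,
     * and keep the buffer alive after the transfer (free_if(0)). */
    #pragma offload_transfer target(mic:0) in(data : length(n) alloc_if(1) free_if(0))

    for (int r = 0; r < reps; ++r) {
        double partial = 0.0;
        /* Reuse the retained buffer: no new allocation (alloc_if(0)),
         * no array transfer (nocopy); only scalars are moved. */
        #pragma offload target(mic:0) nocopy(data : length(n) alloc_if(0) free_if(0)) in(n) inout(partial)
        {
            for (int i = 0; i < n; ++i) {
                partial += data[i];
            }
        }
        total += partial;
    }

    /* Final cleanup: release the coprocessor buffer (free_if(1)). */
    #pragma offload_transfer target(mic:0) nocopy(data : length(n) alloc_if(0) free_if(1))

    return total;
}
```

Calling repeated_sum on a four-element array {1, 2, 3, 4} with reps = 3 returns 30, whether the loop runs on the coprocessor or falls back to the host.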

 

Guangming_T_
Beginner

Hi, Kevin,

The code snippet is posted below. Thanks!

        void *dst_addr = (void *)dst; // a persistent pointer on the MIC
        void *src_addr = (void *)src; // a persistent pointer on the MIC

        int dst_inc0, dst_inc1, dst_inc2;
        int src_inc0, src_inc1, src_inc2;
        int box_inc0, box_inc1, box_inc2;

        // Copy the array elements into scalars so that only scalars enter the offload
        dst_inc0 = dst_inc[0];
        src_inc0 = src_inc[0];
        box_inc0 = box_inc[0];
        dst_inc1 = dst_inc[1];
        src_inc1 = src_inc[1];
        box_inc1 = box_inc[1];
        dst_inc2 = dst_inc[2];
        src_inc2 = src_inc[2];
        box_inc2 = box_inc[2];

#pragma offload target(mic)
        {
            TYPE *dst_ptr = (TYPE *) dst_addr;
            TYPE *src_ptr = (TYPE *) src_addr;

            int dst_begin_tmp = dst_begin;
            int src_begin_tmp = src_begin;

            for (int d = 0; d < num_depth; ++d) {
                for (int nb = 0; nb < box_inc1*box_inc2; nb++) {
                    int dst_counter = dst_begin_tmp + (nb/box_inc1)*dst_inc1*dst_inc0 + (nb%box_inc1) * dst_inc0;
                    int src_counter = src_begin_tmp + (nb/box_inc1)*src_inc1*src_inc0 + (nb%box_inc1) * src_inc0;
                    for (int i = 0; i < box_inc0; i++) {
                        op(dst_ptr[dst_counter+i], src_ptr[src_counter+i]); // op is copy/max/min/sum
                    }
                }
                // Advance both start indices to the next depth component
                dst_begin_tmp += dst_offset;
                src_begin_tmp += src_offset;
            }
        }
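
For checking the index arithmetic independently of the offload, the traversal can be sketched as a plain host function. This is only an illustration under several assumptions: TYPE is taken to be double, op is fixed to a copy, the start indices are assumed to advance once per depth iteration, and the name box_copy and its parameter layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-only version of the box traversal with op = copy.
 * box_inc holds the box extents; dst_inc and src_inc hold the first two
 * extents of the destination and source arrays. */
void box_copy(double *dst, const double *src,
              int dst_begin, int src_begin,
              int num_depth, int dst_offset, int src_offset,
              const int box_inc[3], const int dst_inc[2], const int src_inc[2]) {
    assert(dst != NULL && src != NULL);
    for (int d = 0; d < num_depth; ++d) {
        for (int nb = 0; nb < box_inc[1] * box_inc[2]; ++nb) {
            int row = nb % box_inc[1];   /* index along dimension 1 */
            int plane = nb / box_inc[1]; /* index along dimension 2 */
            int dst_counter = dst_begin + plane * dst_inc[1] * dst_inc[0] + row * dst_inc[0];
            int src_counter = src_begin + plane * src_inc[1] * src_inc[0] + row * src_inc[0];
            for (int i = 0; i < box_inc[0]; ++i) {
                dst[dst_counter + i] = src[src_counter + i];
            }
        }
        /* Move on to the next depth component. */
        dst_begin += dst_offset;
        src_begin += src_offset;
    }
}
```

For example, copying a 2x2 box that starts at index 5 of a 4x4 source holding the values 0 to 15 into a 2x2 destination yields 5, 6, 9, 10.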
