If we don't use native mode, is there a way to disable the creation of memory buffers for the offload region? The host CPU time is so high that my accelerated program cannot achieve a speedup. Note that all the IN variables are scalars. The offload report for one such region follows.
[Offload] [MIC 0] [Line] 144
[Offload] [MIC 0] [Tag] Tag 1598
[Offload] [HOST] [Tag 1598] [State] Start Offload
[Offload] [HOST] [Tag 1598] [State] Initialize function __offload_entry_AcceleratorUtilitiesOp_C_144doArrayDa_cfaca3494cc6212aae7ad712694b42c4
[Offload] [HOST] [Tag 1598] [State] Create buffer from Host memory
[Offload] [HOST] [Tag 1598] [State] Create buffer from MIC memory
[Offload] [HOST] [Tag 1598] [State] Send pointer data
[Offload] [HOST] [Tag 1598] [State] CPU->MIC pointer data 1
[Offload] [HOST] [Tag 1598] [State] Gather copyin data
[Offload] [HOST] [Tag 1598] [State] CPU->MIC copyin data 68
[Offload] [HOST] [Tag 1598] [State] Compute task on MIC
[Offload] [HOST] [Tag 1598] [State] Receive pointer data
[Offload] [HOST] [Tag 1598] [State] MIC->CPU pointer data 0
[Offload] [MIC 0] [Tag 1598] [State] Start target function __offload_entry_AcceleratorUtilitiesOp_C_144doArrayDa_cfaca3494cc6212aae7ad712694b42c4
[Offload] [MIC 0] [Tag 1598] [Var] dst_begin IN
[Offload] [MIC 0] [Tag 1598] [Var] src_begin IN
[Offload] [MIC 0] [Tag 1598] [Var] num_depth IN
[Offload] [MIC 0] [Tag 1598] [Var] ngroups IN
[Offload] [MIC 0] [Tag 1598] [Var] dst_offset IN
[Offload] [MIC 0] [Tag 1598] [Var] src_offset IN
[Offload] [MIC 0] [Tag 1598] [Var] dst_addr IN
[Offload] [MIC 0] [Tag 1598] [Var] src_addr IN
[Offload] [MIC 0] [Tag 1598] [Var] box_inc0 IN
[Offload] [MIC 0] [Tag 1598] [Var] op IN
[Offload] [MIC 0] [Tag 1598] [Var] box_inc1 IN
[Offload] [MIC 0] [Tag 1598] [Var] dst_inc0 IN
[Offload] [MIC 0] [Tag 1598] [Var] src_inc0 IN
[Offload] [MIC 0] [Tag 1598] [Var] box_inc2 IN
[Offload] [MIC 0] [Tag 1598] [Var] dst_inc1 IN
[Offload] [MIC 0] [Tag 1598] [Var] src_inc1 IN
[Offload] [MIC 0] [Tag 1598] [State] Scatter copyin data
[Offload] [MIC 0] [Tag 1598] [State] Gather copyout data
[Offload] [MIC 0] [Tag 1598] [State] MIC->CPU copyout data 0
[Offload] [HOST] [Tag 1598] [State] Scatter copyout data
[Offload] [HOST] [Tag 1598] [CPU Time] 0.002393(seconds)
[Offload] [MIC 0] [Tag 1598] [CPU->MIC Data] 69 (bytes)
[Offload] [MIC 0] [Tag 1598] [MIC Time] 0.000994(seconds)
[Offload] [MIC 0] [Tag 1598] [MIC->CPU Data] 0 (bytes)
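To put the report in numbers: assuming [CPU Time] is the total time the host measures for the offload and [MIC Time] is the time spent executing on the coprocessor, their difference is the overhead in question. A minimal sketch of that arithmetic (the function names are hypothetical, not part of any offload API):

```c
#include <assert.h>

/* Host-side overhead of one offload: host-measured time minus coprocessor
 * execution time. Assumes cpu_time_s already includes mic_time_s. */
double offload_overhead(double cpu_time_s, double mic_time_s)
{
    assert(cpu_time_s > 0.0 && cpu_time_s >= mic_time_s);
    return cpu_time_s - mic_time_s;
}

/* Fraction of the host-measured time that is overhead rather than compute. */
double overhead_fraction(double cpu_time_s, double mic_time_s)
{
    return offload_overhead(cpu_time_s, mic_time_s) / cpu_time_s;
}
```

For Tag 1598, offload_overhead(0.002393, 0.000994) is about 0.0014 seconds, which is roughly 58% of the host-measured time, for only 69 bytes of input.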
It would help to see some code to better understand what you are after. From the report, it appears you have pointer data within the scope of the offload, although there is no associated data transfer for it.
The alloc_if, free_if, and length modifiers can influence the allocation. From some quick experimenting, nocopy() with length(0) or alloc_if(0) appears to avoid the allocation, but it also makes the pointer data unusable within the scope of the offload, so I don't know whether that's helpful.
You might refer to the sections "Minimize Coprocessor Memory Allocation Overhead" and "Explicitly Managed Heap-Allocated Data" of the article "Effective Use of the Intel Compiler's Offload Features" to see whether any of the information there helps.
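As an illustration of how those modifiers combine, here is a minimal sketch of the allocate-once, reuse-many-times pattern. It assumes the buffer is first allocated on the coprocessor with alloc_if(1) free_if(0), so that later offloads using nocopy with alloc_if(0) find valid data. The function name scale_persistent and the ALLOC/FREE/RETAIN/REUSE macros are illustrative shorthands only, and a compiler without offload support will ignore the pragmas and run everything on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative shorthands for the allocation modifiers. */
#define ALLOC  alloc_if(1)
#define FREE   free_if(1)
#define RETAIN free_if(0)
#define REUSE  alloc_if(0)

/* Scales buf in place `steps` times. The coprocessor buffer is allocated
 * once, reused by every offload in the loop, and freed at the end. */
void scale_persistent(double *buf, int n, int steps, double factor)
{
    assert(buf != NULL && n > 0 && steps >= 0);

    /* Allocate on the coprocessor, copy the data in, and keep the buffer. */
    #pragma offload_transfer target(mic:0) in(buf : length(n) ALLOC RETAIN)

    for (int s = 0; s < steps; ++s) {
        /* Reuse the existing buffer: no allocation and no data copy.
         * The scalars n and factor are copied in by default. */
        #pragma offload target(mic:0) nocopy(buf : length(n) REUSE RETAIN)
        {
            for (int i = 0; i < n; ++i)
                buf[i] *= factor;
        }
    }

    /* Copy the result back and release the coprocessor buffer. */
    #pragma offload_transfer target(mic:0) out(buf : length(n) REUSE FREE)
}
```

Whether this removes the "Create buffer" steps from the report depends on how the pointers are passed, so it is worth checking against the offload report for your own code.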
Hi, Kevin,
The code snippet is posted below. Thanks!
void *dst_addr = (void *)dst; // persistent pointer on the MIC
void *src_addr = (void *)src; // persistent pointer on the MIC
int dst_inc0,dst_inc1,dst_inc2;
int src_inc0,src_inc1,src_inc2;
int box_inc0,box_inc1,box_inc2;
dst_inc0 = dst_inc[0];
src_inc0 = src_inc[0];
box_inc0 = box_inc[0];
dst_inc1 = dst_inc[1];
src_inc1 = src_inc[1];
box_inc1 = box_inc[1];
dst_inc2 = dst_inc[2];
src_inc2 = src_inc[2];
box_inc2 = box_inc[2];
#pragma offload target(mic)
{
    TYPE *dst_ptr = (TYPE *) dst_addr;
    TYPE *src_ptr = (TYPE *) src_addr;
    int dst_begin_tmp = dst_begin;
    int src_begin_tmp = src_begin;
    for (int d = 0; d < num_depth; ++d) {
        // nb enumerates the rows of the box: nb%box_inc1 is the row, nb/box_inc1 the plane
        for (int nb = 0; nb < box_inc1*box_inc2; nb++) {
            int dst_counter = dst_begin_tmp + (nb/box_inc1)*dst_inc1*dst_inc0 + (nb%box_inc1) * dst_inc0;
            int src_counter = src_begin_tmp + (nb/box_inc1)*src_inc1*src_inc0 + (nb%box_inc1) * src_inc0;
            for (int i = 0; i < box_inc0; i++) {
                op(dst_ptr[dst_counter+i], src_ptr[src_counter+i]); // op is copy/max/min/sum
            }
        }
        // advance to the next depth component
        dst_begin_tmp += dst_offset;
        src_begin_tmp += src_offset;
    }
}
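To check the index arithmetic independently of the offload, the same loop nest can be run on the host. This is a minimal sketch under stated assumptions: TYPE is taken to be double, op is taken to be a plain copy, and the per-depth offsets are assumed to advance once per iteration of the depth loop. The name box_op_copy is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef double TYPE; /* assumed element type */

/* Host-only version of the loop nest with op fixed to "copy".
 * inc[0] is the row length and inc[1] the number of rows per plane. */
void box_op_copy(TYPE *dst, const TYPE *src,
                 int dst_begin, int src_begin,
                 int dst_offset, int src_offset, int num_depth,
                 const int dst_inc[3], const int src_inc[3],
                 const int box_inc[3])
{
    assert(dst != NULL && src != NULL);

    for (int d = 0; d < num_depth; ++d) {
        for (int nb = 0; nb < box_inc[1] * box_inc[2]; ++nb) {
            int row   = nb % box_inc[1]; /* row within the current plane */
            int plane = nb / box_inc[1]; /* plane within the box */
            int dst_counter = dst_begin + plane * dst_inc[1] * dst_inc[0]
                                        + row * dst_inc[0];
            int src_counter = src_begin + plane * src_inc[1] * src_inc[0]
                                        + row * src_inc[0];
            for (int i = 0; i < box_inc[0]; ++i)
                dst[dst_counter + i] = src[src_counter + i];
        }
        /* Assumed intent: move to the next depth component. */
        dst_begin += dst_offset;
        src_begin += src_offset;
    }
}
```

For example, copying a 2x2 box that starts at element 1 of a source with rows of length 4 into a 2x2 destination picks source elements 1, 2, 5, and 6.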