OpenCL* for CPU
Ask questions and share information on Intel® SDK for OpenCL™ Applications and OpenCL™ implementations for Intel® CPU

Some issues about zero copy.

Xiaoying__Y
Beginner

Hello, I have some doubts about the zero-copy mechanism.

Suppose the code is something like below.

What I want to do is zero-copy a pre-aligned memory block M, pass it to an OpenCL function OpenCL_Foo, and, after the calculation, have the result saved back to M.

Do I still have access to M (with the correct calculation result from the OpenCL function) even after OpenCL_Foo returns?

Or do I need to memcpy it?

Thank you in advance for your kind help.

 

////////////////////////////////////////////
void OpenCL_Foo(float *array_as_output, int SIZE, float *array_as_output_cpy)
{
    // Some OpenCL setup and processing...
    ...

    // Use the parameter array as the output buffer of the calculation.
    cl_mem buffer_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR,
                                       SIZE * sizeof(cl_float), array_as_output, &err);
    ...

    // Is the part below necessary, or can I just access array_as_output
    // outside this function?
    /*
    void *ptr1 = clEnqueueMapBuffer(queue, buffer_out, CL_TRUE, CL_MAP_READ, 0,
                                    SIZE * sizeof(cl_float), 0, NULL, NULL, NULL);
    memcpy(array_as_output_cpy, ptr1, SIZE * sizeof(cl_float));
    err = clEnqueueUnmapMemObject(queue, buffer_out, ptr1, 0, NULL, NULL);
    */

}  // end of OpenCL_Foo

////////////////////////////////////////////
int main()
{
    int SIZE = 1024;

    float *array_as_output;
    float *array_as_output_cpy;

    posix_memalign((void **)&array_as_output,     4096, SIZE * sizeof(cl_float));
    posix_memalign((void **)&array_as_output_cpy, 4096, SIZE * sizeof(cl_float));

    OpenCL_Foo(array_as_output, SIZE, array_as_output_cpy);

    ...
}  // end of main

 

1 Solution
Michael_C_Intel1
Moderator

Hello XiaoyingY,

Thanks for the question and the discussion about performance critical development.

I recommend reading the notes section of clEnqueueMapBuffer(...) in the Khronos documentation. Following the standard and observing the mapping requirement is key: it maintains cross-platform portability (including on future and derivative platforms) and it avoids undefined behavior. Per the standard, the intended bits should be available through a pointer dereference after the map command has completed.

A memcpy should not be required just to see the memory, but developers should consider when and where they unmap in their program. Per the standard, output data must be read through the mapped pointer prior to unmapping. If the pointer is unmapped before it is used, any mechanism for accessing that data relies on undefined, potentially inconsistent behavior.

Recommendation: by default, operate on the pointer after the map operation in your application logic, and unmap only once the data is no longer needed. In many cases a memcpy will not be needed.

An example of when a memcpy may be needed is when interfacing with the memory requirements of another host library.

Reference on zero copy

Sidebar:

You may wish to convince yourself, using varying buffer sizes and some time.h timespec or std::chrono timers, of the performance delta of zero copy versus other memory mechanisms on your platforms of interest.
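As a sketch only, the recommended map / operate / unmap ordering might look like the following in host code. This assumes an existing context (ctx), an in-order queue, and a kernel that writes into the buffer; names are hypothetical and error handling is trimmed.

```c
#include <CL/cl.h>
#include <stdio.h>

void read_result_zero_copy(cl_context ctx, cl_command_queue queue,
                           float *host_array, size_t n)
{
    cl_int err;
    cl_mem buffer_out = clCreateBuffer(
        ctx, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR,
        n * sizeof(float), host_array, &err);

    /* ... set kernel args and clEnqueueNDRangeKernel(...) here ... */

    /* Blocking map: when this returns, the results are guaranteed to be
     * visible through 'mapped' (on a zero-copy path this is typically
     * host_array itself, so no copy occurs). */
    float *mapped = (float *)clEnqueueMapBuffer(
        queue, buffer_out, CL_TRUE, CL_MAP_READ,
        0, n * sizeof(float), 0, NULL, NULL, &err);

    /* Operate on the data HERE, while the buffer is still mapped. */
    printf("first element: %f\n", mapped[0]);

    /* Unmap only after the data is no longer needed, then release. */
    clEnqueueUnmapMemObject(queue, buffer_out, mapped, 0, NULL, NULL);
    clFinish(queue);
    clReleaseMemObject(buffer_out);
    /* Reading host_array after unmap + release relies on behavior the
     * standard does not guarantee; do the work while mapped. */
}
```

This requires an OpenCL runtime and device to actually run; it is intended only to show where the "operate" step belongs relative to map and unmap.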

 

Thanks for the interest,

-MichaelC

 

 


Replies

Xiaoying__Y
Beginner

Hi MichaelC.

Thank you for your reply.

>A memcpy to see the memory should not necessarily be required... But developers may wish to consider when and where they wish to unmap in their program.

I understand that, and please allow me to rephrase my problem as below.

The cl_mem for the calculation result is defined locally inside a function [opencl_function1].

In this case, the local_out_buffer has to be released before the end of the function, otherwise it will be destroyed, right? (as it is a local variable)

So I guess I can only unmap it inside the function?

////////////////////////////////////////

Input
  ↓
opencl_function1
{
    cl_mem local_out_buffer   // created here

    // some calculation

    Map local_out_buffer
    // some memcpy
    Unmap local_out_buffer
}
  ↓  (output1 as input2)
non_opencl_function2

Regards,

Xiaoying.Y

Michael_C_Intel1
Moderator

XiaoyingY,

Thanks for the question; I think it can help a lot of forum viewers.

I think a full program is needed to give a circumspect answer. For example, without seeing a full program it's not clear why two buffers are created. But here are some general comments.

  • Unmapping requires a valid cl_mem handle. The cl_mem variable is no longer valid after its scope ends, as it is a stack variable. However, no destructor is called as in C++. In this C API case, without a clReleaseMemObject(...) the cl_mem object will not be released; it will effectively be leaked.
  • For C++, the cl2.hpp wrapper cl::Buffer has a destructor that releases the cl_mem reference (see line 1540). For your goals it may be useful to consider the C++ wrapper offered in cl2.hpp from Khronos.
  • In this C API case, it may be advantageous to operate directly on the pointer returned by the map, if possible:
    • using the void* ptr1
      • before it is unmapped,
      • before the cl_mem object is released,
      • avoiding a memcpy,
      • perhaps passing it back to the caller, unmapping and releasing later; this means the cl_mem handle needs to still be available later.
  • An alternative could be to let the OpenCL API allocate the buffer with CL_MEM_ALLOC_HOST_PTR instead of using the posix alignment routines.
  • A typical program keeps the cl_mem variable in more accessible storage so it can be unmapped before some other command-queue operation and released toward the end of the program.
  • This thread from the forum may also be helpful.
  • It may also help to review the mapping behavior in the conformance tests from Khronos.
  • A full program can be attached to this forum as long as it is not privileged or proprietary code.
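The "pass it back to the caller, unmap and release later" idea above could be sketched like this (hypothetical names; the kernel enqueue and error handling are elided):

```c
#include <CL/cl.h>
#include <stddef.h>

/* Bundle the cl_mem handle with its mapped pointer so both outlive the
 * creating function; the caller unmaps and releases when done. */
typedef struct {
    cl_mem  buf;
    void   *mapped;   /* valid only until unmapped */
    size_t  bytes;
} mapped_result;

mapped_result gpu_stage(cl_context ctx, cl_command_queue queue, size_t n)
{
    mapped_result r;
    cl_int err;
    r.bytes = n * sizeof(float);
    /* Let the runtime allocate suitably aligned host-visible memory. */
    r.buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR,
                           r.bytes, NULL, &err);

    /* ... enqueue a kernel that writes into r.buf ... */

    r.mapped = clEnqueueMapBuffer(queue, r.buf, CL_TRUE, CL_MAP_READ,
                                  0, r.bytes, 0, NULL, NULL, &err);
    return r;   /* cl_mem is just a handle; it stays valid after return */
}

void release_result(cl_command_queue queue, mapped_result *r)
{
    clEnqueueUnmapMemObject(queue, r->buf, r->mapped, 0, NULL, NULL);
    clFinish(queue);
    clReleaseMemObject(r->buf);
    r->mapped = NULL;
}
```

The next (CPU) stage would read r.mapped directly, and release_result would be called only once that stage no longer needs the data.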

-MichaelC

Xiaoying__Y
Beginner

Hi MichaelC.

Again, thank you for the reply. Sorry for the long post.

I cannot post the whole code to the forum (company proprietary code, as you mentioned).

It is a C-based program. The structure of the whole program is as below (the input memory is broken into serial bands so as to form a pipelined process). There are two main concerns:
    ❶ band-unit processing;
    ❷ whether data can be shared between the different processes without a memcpy.
    
///////begin of program

                Input data
                ↓
                func_CPU_Process1(indata1 == Input data, outdata1)
                func_GPU_Process2(indata2 == outdata1,   outdata2)
                func_CPU_Process3(indata3 == outdata2,   outdata3)
                func_GPU_Process4(indata4 == outdata3,   outdata4)
                ...
                func_GPU_Process_end(indata_end == outdata_end-1, outdata)
                ↓
                Output data

///////end of program

Not all the processes are suitable for GPU handling; some functions are on the GPU and some on the CPU, in a tangled way.

Since the cl_mem for each GPU_process is defined locally inside that GPU_process, at the end of the GPU_process the cl_mem is released.

Still, the data in the cl_mem has to be accessed (as input) during the next CPU_process.

>>In this C-api case, it may be advantageous to operate on the pointer result from the mapping directly if possible.
>>A typical program may see the cl_mem variable in more accessible storage so it can be unmapped before some other command queue operation and perhaps released more toward the end of the program.... 

As described above, the whole program is like a band-pipelined program, and to my understanding the cl_mem variable cannot be defined globally, since the band data differs each time.
Is there a way to replace the actual data inside a cl_mem variable (same process, different bands), so that I can define the cl_mem variable globally and release it at the end of the whole program? That way the actual data could be accessed throughout the program by the different processes.

Thank you for your time.

Regards,

Xiaoying.Y

Michael_C_Intel1
Moderator

XiaoyingY,

Source that shows the full behavior and your current understanding of the constraints may still be needed to derive the answers you seek. However:

Per my understanding of the high-level workflow, it may be possible to use one large contiguous block of memory instead of multiple buffers. Kernel enqueues have parameters that support offsetting. I don't know exactly what you mean by "band", but it is possible that your memory could be composed into one large block, with your accesses strided by offsets.

Of course, extra care should always be taken with pointer arithmetic, especially in a case such as this.

I'm hoping this simple solution fits your goals.

 

-MichaelC
