Hi,
I am developing OpenCL code on an Intel CPU, and I have a question about memory copies in OpenCL.
Does OpenCL on the CPU have an efficient way to copy a subsection of data from a large array into a new buffer?
For example, given an array holding 1000x1000 image data, I want to copy a 19x19 section of the image into a new array and do some computation on that section. I could not find an efficient way to do that. Copying the data one element at a time is extremely inefficient, and because of alignment problems I cannot use vector types for the copy. Does anyone know the good practice for memcpy in OpenCL?
Thanks
zhuzxy
2 Replies
Hi,
Consider clEnqueueReadBufferRect, clEnqueueWriteBufferRect, or clEnqueueCopyBufferRect for host-side copying.
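As a rough illustration of the parameters those rect calls take, here is a minimal plain-C sketch for the 19x19 patch from the question. The `rect_desc` struct and both helper functions are hypothetical names made up for this example, not part of OpenCL. The one detail taken from the API itself is that the x origin and the region width are given in bytes, while y is given in rows. The reference function only mimics on the CPU what a 2D rect copy into a tightly packed destination does; the actual enqueue call is shown in a comment because it needs a real queue and buffers.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper type: a 2D sub-rectangle described the way the
   cl*BufferRect calls expect it (x and width in BYTES, y and height in rows). */
typedef struct {
    size_t origin[3];  /* {x in bytes, y in rows, z in slices} */
    size_t region[3];  /* {width in bytes, height in rows, depth} */
    size_t row_pitch;  /* bytes per row of the full source image */
} rect_desc;

/* Build the descriptor for a w x h patch at element position (x, y). */
rect_desc make_rect(size_t x, size_t y, size_t w, size_t h,
                    size_t image_width, size_t elem_size)
{
    rect_desc r;
    r.origin[0] = x * elem_size;  r.origin[1] = y;  r.origin[2] = 0;
    r.region[0] = w * elem_size;  r.region[1] = h;  r.region[2] = 1;
    r.row_pitch = image_width * elem_size;
    return r;
}

/* CPU reference of a 2D rect copy: one memcpy per row,
   destination rows are tightly packed. */
void copy_rect_reference(const unsigned char *src, unsigned char *dst,
                         const rect_desc *r)
{
    for (size_t row = 0; row < r->region[1]; ++row) {
        const unsigned char *s =
            src + (r->origin[1] + row) * r->row_pitch + r->origin[0];
        memcpy(dst + row * r->region[0], s, r->region[0]);
    }
}

/* With an existing queue and buffers (assumed, not created here), the
   same descriptor would be passed roughly like this:

   size_t dst_origin[3] = {0, 0, 0};
   clEnqueueCopyBufferRect(queue, src_buf, dst_buf,
                           r.origin, dst_origin, r.region,
                           r.row_pitch, 0,    // source row / slice pitch
                           r.region[0], 0,    // destination row / slice pitch
                           0, NULL, NULL);
*/
```

For a float image, `make_rect(100, 200, 19, 19, 1000, sizeof(float))` describes the 19x19 patch whose top-left element is at column 100, row 200.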
Thanks, but I am afraid my algorithm cannot do that. Let me explain how it works.
There are multiple kernels, and each later kernel depends on the previous kernel's output:
data --> kernel1 --> output1 (input for kernel2) --> kernel2 --> output2 (input for kernel3) --> kernel3 --> finished
To minimize latency (the whole algorithm is part of a real-time application), I do not call clWaitForEvents until the last kernel has been enqueued. The memcpy happens in kernel3, and the copy positions come from the output of kernel2. I need to copy thousands of small blocks of data into a new array so that I can make good use of the cache. But the memcpy has turned out to be a problem: the performance is really bad. Can anyone suggest a good way to do the memcpy?
Thanks
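To make the in-kernel copy described above concrete, here is a minimal sketch of one way to gather the patches on the device, so the positions written by kernel2 never have to go back to the host. Everything in it is an assumption for illustration: the kernel name `gather_patches`, the layout of `positions` (x, y pairs of the top-left corner), float pixels, and the mapping of one work-item to one patch row. It is not claimed to be the fastest option. The `#ifndef __OPENCL_VERSION__` shim at the top is only there so the same source also compiles as plain C for checking the indexing; an OpenCL compiler skips it.

```c
#ifndef __OPENCL_VERSION__
/* Host-side shim (assumption, only for compiling this sketch as plain C):
   stubs out the OpenCL qualifiers and the work-item query. */
#include <stddef.h>
#define __kernel
#define __global
static size_t shim_gid[2];
static size_t get_global_id(unsigned dim) { return shim_gid[dim]; }
#endif

#define PATCH 19  /* patch edge length from the question */

/* One work-item copies one row of one patch.
   Intended global size: {PATCH, num_patches}.
   positions[2*p] and positions[2*p + 1] hold the top-left (x, y) of
   patch p; this layout is assumed, standing in for kernel2's output.
   patches receives the tightly packed PATCH x PATCH blocks. */
__kernel void gather_patches(__global const float *image,
                             int image_width,
                             __global const int *positions,
                             __global float *patches)
{
    size_t row = get_global_id(0);
    size_t p   = get_global_id(1);

    size_t x0 = (size_t)positions[2 * p];
    size_t y0 = (size_t)positions[2 * p + 1];

    __global const float *src = image + (y0 + row) * (size_t)image_width + x0;
    __global float *dst = patches + (p * PATCH + row) * PATCH;

    /* Plain scalar loop: no alignment requirement on src or dst. */
    for (int i = 0; i < PATCH; ++i)
        dst[i] = src[i];
}
```

Enqueued between kernel2 and kernel3 on the same in-order queue, a kernel like this would keep the chain free of host synchronization. Whether it beats copying inside kernel3 would need to be measured on the target CPU.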