OpenCL* for CPU

fastest way to pass an image to the GPU



I am using Intel's VME extension to calculate motion estimation (ME), and the time it takes to pass the image to the GPU is very long: about 1 ms.

I have tried 2 methods: 

#1 Map/Unmap - about 0.4 ms for a 1280*720 image

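// Map the image for writing; the call returns a host-accessible pointer.
void *prefImageMemory = queue.enqueueMapImage(*pRefImage, CL_TRUE, CL_MAP_WRITE_INVALIDATE_REGION, origin, region, &row_pitch, NULL, NULL, NULL);
memcpy(prefImageMemory, pRefBuf, arraySizeImageBytes); // plain host-memory copy; assumes row_pitch matches the source pitch
queue.enqueueUnmapMemObject(*pRefImage, prefImageMemory); // unmap so the device sees the new data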

#2 enqueueWriteImage - about 0.7 ms for a 1280*720 image

queue.enqueueWriteImage(srcImage, CL_TRUE, origin, region, currImage->PitchY, 0, currImage->Y);

Why does it take so long?

How can I improve this?

Can I call map once and then unmap after each change to the image memory, to save the "map" time?



2 Replies

This discussion has continued in email.  For other readers on this forum, no timelines yet but I've requested some updates to the VME sample in the future.


Raz N. wrote:

Why does it take so long?

How can I improve this?


First, make sure you didn't include file I/O in the measurements. Ideally, I would suggest measuring the routine several times without updating the frame; this excludes general I/O, paging, etc. I'm sure that the first iteration will be considerably slower than the rest.

Even for cache-cold data, updating a 1280x720 1-channel image in 0.7 milliseconds translates to ~1.3 GB/sec, which is indeed slow. We will try to reproduce and investigate.
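For readers who want to check the bandwidth figure, here is a minimal sketch of the arithmetic. It assumes 1 byte per pixel (an 8-bit single-channel image, which the thread does not state explicitly) and 1 GB = 1e9 bytes; the function name is made up for illustration.

```cpp
#include <cstddef>

// Hypothetical helper: effective upload bandwidth in GB/s (1 GB = 1e9 bytes).
// bytesPerPixel = 1 is an assumption for an 8-bit, 1-channel image.
double uploadBandwidthGBps(std::size_t width, std::size_t height,
                           std::size_t bytesPerPixel, double seconds) {
    const double bytes =
        static_cast<double>(width) * static_cast<double>(height) *
        static_cast<double>(bytesPerPixel);
    return bytes / seconds / 1e9;
}
```

Calling uploadBandwidthGBps(1280, 720, 1, 0.7e-3) gives roughly 1.3, matching the number above; the 0.4 ms map/memcpy timing from the question works out to roughly 2.3 GB/sec under the same assumption.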

In general, to alleviate the bottleneck you can employ a simple async scheme, where the next frame is read from the file (with pCapture->GetSample) and uploaded to the OpenCL image (e.g. with enqueueWriteImage) without waiting for the current frame to be processed. Maybe the easiest way (without introducing multi-threading) would be to create a dedicated OpenCL queue just for image updates. You would need to create an additional instance of the image (recall the regular double-buffering technique in graphics) and juggle the two accordingly.
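The juggling order described above can be sketched without any OpenCL calls. The snippet below is only an illustration of the ping-pong schedule with two image slots: the function name and the strings are hypothetical, and each string stands in for an enqueue on either the upload queue or the compute queue.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: order of operations for `frames` frames with two
// image slots (0 and 1). "upload" stands for a non-blocking
// enqueueWriteImage on a dedicated upload queue; "process" stands for the
// kernel launch on the compute queue.
std::vector<std::string> pingPongSchedule(int frames) {
    std::vector<std::string> ops;
    if (frames <= 0) return ops;
    // Prime the pipeline: the first frame has nothing to overlap with.
    ops.push_back("upload frame 0 -> slot 0");
    for (int f = 0; f < frames; ++f) {
        const int cur = f % 2;         // slot holding the frame to process
        const int next = (f + 1) % 2;  // slot that is free for the next upload
        if (f + 1 < frames) {
            // In real code this upload would have to wait (e.g. on an event)
            // for the kernel that last read slot `next`.
            ops.push_back("upload frame " + std::to_string(f + 1) +
                          " -> slot " + std::to_string(next));
        }
        ops.push_back("process frame " + std::to_string(f) +
                      " in slot " + std::to_string(cur));
    }
    return ops;
}
```

For three frames this yields: upload 0, upload 1, process 0, upload 2, process 1, process 2. The point is that every upload after the first is issued before the processing of the previous frame, so the two can overlap if the queues allow it.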

As a proof of concept, I added a loop around enqueueWriteImage, made the call async (by changing the second argument to CL_FALSE), and added a single queue.finish() after the loop. Now when I do the image write, say, 10 times, my average time per write is significantly better than that of a single (synchronous) call.
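A measurement along those lines might look roughly like the following. This is a generic sketch rather than the code actually used: averageMsPerCall is a made-up helper that takes the enqueue and finish steps as callables, so it compiles without OpenCL headers.

```cpp
#include <chrono>

// Hypothetical helper: issues `enqueue` n times, calls `finish` once, and
// returns the average wall-clock time per call in milliseconds.
template <typename Enqueue, typename Finish>
double averageMsPerCall(int n, Enqueue enqueue, Finish finish) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < n; ++i) {
        enqueue();  // e.g. a non-blocking enqueueWriteImage
    }
    finish();       // e.g. queue.finish(), once after the loop
    const std::chrono::duration<double, std::milli> elapsed =
        clock::now() - start;
    return elapsed.count() / n;
}
```

With the names from the question, the first callable would wrap queue.enqueueWriteImage(srcImage, CL_FALSE, origin, region, currImage->PitchY, 0, currImage->Y) and the second would wrap queue.finish(). Running it once with n = 1 and CL_TRUE, and once with n = 10 and CL_FALSE, gives the two numbers being compared.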