fastest way to pass an image to the GPU

rnitz · ‎08-28-2014

Hi,

I am using Intel's VME extension to compute motion estimation (ME), and the time it takes to pass the image to the GPU is very long, about 1 ms.

I have tried 2 methods:

#1 Map/Unmap - about 0.4 ms for a 1280x720 image

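// Map the image for writing; the returned pointer is host-accessible.
void* pRefImageMemory = queue.enqueueMapImage(*pRefImage, CL_TRUE, CL_MAP_WRITE_INVALIDATE_REGION, origin, region, &row_pitch, NULL, NULL, NULL);
memcpy(pRefImageMemory, pRefBuf, arraySizeImageBytes); // copy from host memory; only valid if row_pitch equals the source row size
queue.enqueueUnmapMemObject(*pRefImage, pRefImageMemory, NULL);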

#2 enqueueWriteImage - about 0.7 ms for a 1280x720 image

queue.enqueueWriteImage(srcImage, CL_TRUE, origin, region, currImage->PitchY, 0, currImage->Y);

Why does it take so long?

How can I improve this?

Can I call map once and then unmap after each change to the image memory, to save the "map" time?

Regards,

Raz

Jeffrey_M_Intel1 · ‎09-02-2014

This discussion has continued in email. For other readers on this forum, no timelines yet but I've requested some updates to the VME sample in the future.

Maxim_S_Intel · ‎09-09-2014

Raz N. wrote:

Why does it take so long?

How can I improve this?

Hi,

First, make sure you didn't include file I/O in the measurements. Ideally, I would suggest measuring the routine several times without updating the frame; this excludes general I/O, paging, etc. I'm sure that the first iteration will be exceptionally slow compared to the rest.
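To illustrate the measurement advice above, here is a minimal sketch of a timing helper. The name average_ms and its parameters are made up for this example and are not part of the VME sample; the callable passed in is assumed to block until the transfer has completed.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `work` a few times untimed (warm-up), then
// returns the average wall-clock time of the timed runs in milliseconds.
// `work` must block until the transfer is done (e.g. a blocking
// enqueueWriteImage, or an async one followed by queue.finish()).
template <class Work>
double average_ms(Work work, int warmup_runs, int timed_runs) {
    assert(warmup_runs >= 0 && timed_runs > 0);
    for (int i = 0; i < warmup_runs; ++i) work();  // excluded from the result: first-touch cost, paging

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < timed_runs; ++i) work();
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / timed_runs;
}
```

Calling it with a lambda that wraps only the upload (no file read, no frame decode) keeps I/O out of the number you report.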

Even for cache-cold data, updating a 1280x720 single-channel image in 0.7 ms translates to ~1.3 GB/s, which is indeed slow. We will try to reproduce and investigate.
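For reference, the arithmetic behind that figure can be written out as below. This is only a sketch: the function name is made up, and it assumes 1 byte per pixel (an 8-bit single-channel image) and 1 GB = 1e9 bytes.

```cpp
#include <cassert>
#include <cstddef>

// Effective transfer rate in GB/s (1 GB = 1e9 bytes).
double transfer_gb_per_s(std::size_t width, std::size_t height,
                         std::size_t bytes_per_pixel, double milliseconds) {
    assert(milliseconds > 0.0);
    const double bytes = static_cast<double>(width * height * bytes_per_pixel);
    return bytes / (milliseconds * 1e-3) / 1e9;
}

// transfer_gb_per_s(1280, 720, 1, 0.7) is about 1.32 (the enqueueWriteImage timing)
// transfer_gb_per_s(1280, 720, 1, 0.4) is about 2.30 (the map/unmap timing)
```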

In general, to alleviate the bottleneck, you can employ a simple async scheme, where the next frame is read from the file (with pCapture->GetSample) and uploaded to the OpenCL image (e.g. with enqueueWriteImage) without waiting for the current frame to be processed. Perhaps the easiest way (without introducing multi-threading) is to create a dedicated OpenCL queue just for image updates. You would need to create an additional instance of the image (recall the regular double-buffering technique in graphics) and juggle the two accordingly.
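The juggling could look roughly like the sketch below. It is deliberately written against three callbacks rather than real OpenCL objects, so every name in it is hypothetical; in a real program the callbacks would wrap a non-blocking enqueueWriteImage on the dedicated upload queue, a wait on that upload, and the VME kernel on the main queue. It also assumes that process() returns only once the frame has been fully processed.

```cpp
#include <cassert>

// Hypothetical driver for a two-image (double-buffered) upload scheme.
//   upload_async(slot, frame): start a non-blocking upload of `frame` into image `slot`
//                              (e.g. enqueueWriteImage with CL_FALSE on the upload queue)
//   wait_upload(slot):         block until that upload has finished
//                              (e.g. wait on its event, or finish() the upload queue)
//   process(slot, frame):      run motion estimation on image `slot` and return only
//                              when it is done, so the slot can safely be overwritten
template <class UploadAsync, class WaitUpload, class Process>
void run_double_buffered(int frame_count, UploadAsync upload_async,
                         WaitUpload wait_upload, Process process) {
    assert(frame_count >= 0);
    if (frame_count == 0) return;

    upload_async(0, 0);  // prime the pipeline with the first frame
    for (int frame = 0; frame < frame_count; ++frame) {
        const int slot = frame % 2;
        wait_upload(slot);  // `frame` is now resident in image `slot`
        if (frame + 1 < frame_count) {
            upload_async(1 - slot, frame + 1);  // overlaps with process() below
        }
        process(slot, frame);
    }
}
```

The other slot is always free when the next upload starts, because the frame it held was processed in the previous iteration.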

As a proof of concept, I just added a loop around enqueueWriteImage, made the call async (by changing the second arg to CL_FALSE), and added a single queue.finish() after the loop. Now when I do the image write, say, 10 times, my average time per write is significantly better than that of a single (synchronous) call.
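A rough sketch of that experiment, not the exact code I ran: the function name is made up, and it is templated so it does not depend on the OpenCL headers. Queue is assumed to behave like cl::CommandQueue, and Image, Origin and Region like the types you already pass to enqueueWriteImage.

```cpp
#include <cassert>
#include <cstddef>

// Same value as CL_FALSE; spelled out so the sketch compiles without the OpenCL headers.
const unsigned int kNonBlocking = 0;

// Enqueue `repeats` non-blocking writes of the same frame, then wait once.
template <class Queue, class Image, class Origin, class Region>
void write_image_batched(Queue& queue, const Image& image,
                         const Origin& origin, const Region& region,
                         std::size_t row_pitch, void* pixels, int repeats) {
    assert(repeats > 0 && pixels != nullptr);
    for (int i = 0; i < repeats; ++i) {
        // Returns right after enqueueing; `pixels` must stay valid and unchanged until finish().
        queue.enqueueWriteImage(image, kNonBlocking, origin, region, row_pitch, 0, pixels);
    }
    queue.finish();  // single synchronization point for all enqueued writes
}
```

Time the whole call and divide by repeats to get the per-write average.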

-Max