If I need to handle a quite big image (for example, 4K), could it be efficient to split the input image into tiles (tile size could be based on the number of GMA cores, to provide full GPU utilization) and execute the whole sequence of image processing operations tile by tile, e.g. compute Sobel on tile 0, then on tile 1, ...?
It's not clear whether you are going to use IPP Async functionality for this purpose or some implementation of your own. If you are talking about IPP Async, you should not care about tiling: it is done internally. Almost all Async functions work with 16x8 pixel blocks for the best Gen EU utilization.
Igor, thanks for response!
I'm going to use IPP Async, but I want to achieve an additional speedup from data locality, e.g. execute the whole set of image processing operations through IPP Async on the first image tile (not the GPU tile), then on the next tile. So the pipeline is: divide the image into slices and, for each slice, call the same sequence of hpp* image processing functions. Could this be efficient in comparison with the regular pipeline of passing the whole image through the sequence of hpp* functions?
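To make the slicing idea concrete, here is a minimal sketch of the "whole pipeline per slice" loop. The operation type and names are hypothetical stand-ins, not the actual hpp* API; in practice each `Op` would wrap an IPP Async call on the slice:

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// A "slice" is a contiguous band of rows of an 8-bit grayscale image.
struct Slice { uint8_t* data; int width; int rows; };

// Hypothetical stand-in for one stage of an hpp*-style pipeline.
using Op = std::function<void(Slice&)>;

// Run the FULL operation chain on one slice before moving to the next,
// so each slice stays hot in cache (the data-locality idea from the post).
void processBySlices(uint8_t* img, int width, int height,
                     int sliceRows, const std::vector<Op>& ops) {
    for (int y = 0; y < height; y += sliceRows) {
        Slice s{ img + static_cast<size_t>(y) * width,
                 width, std::min(sliceRows, height - y) };
        for (const auto& op : ops)   // whole pipeline on this slice
            op(s);
    }
}
```

This is the CPU-cache-friendly ordering; as noted in the reply below, with IPP Async each per-slice call becomes a GPU enqueue, which changes the trade-off.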
It depends on the image size: Gen video memory can hold a surface of at most 8Kx8K bytes, so if your image has greater dimensions it is better to apply tiling on the application side. Too small tiles will lead to a corresponding number of enqueues (one per tile), and each enqueue adds a large constant overhead to the processing as it goes through the video driver. The Async library has some internal logic/analyzer that will be extended in the future into a full graph analyzer like DMIP. So the "slices" approach can be effective for classic synchronous IPP, while for Async it's better to use the regular pipeline if the image fits in 8Kx8K.
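The advice above boils down to: tile only when the image exceeds the surface limit, and then use as few tiles as possible, since each one costs an enqueue through the video driver. A small sketch of that decision (the 8192-pixel limit per dimension is my reading of the "8Kx8K" figure quoted above):

```cpp
// Assumed Gen video-memory surface limit per dimension (from "8Kx8K" above).
constexpr int kMaxSurface = 8192;

// Minimal number of tiles needed: 1 means no tiling (regular pipeline).
// We split as coarsely as possible because every extra tile adds one
// enqueue, and each enqueue carries a large constant driver overhead.
int tilesNeeded(int width, int height) {
    int tx = (width  + kMaxSurface - 1) / kMaxSurface;  // ceil(width / limit)
    int ty = (height + kMaxSurface - 1) / kMaxSurface;  // ceil(height / limit)
    return tx * ty;
}
```

For a 4K frame (e.g. 4096x2160) this returns 1, i.e. the regular whole-image pipeline is preferable.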