OpenVX Performance and Speed

Kosh__Stephen · 05-07-2018

Hello everyone.

I'm trying to speed up my study module by using OpenVX.

The module is written in C++.

I have rewritten it as an OpenVX graph.

I expected OpenVX to give me roughly a 4x speedup.

But OpenVX is not that fast.

Sometimes OpenVX is slower than the old code.

Sometimes it is faster, but not by much.

I've compared the speed of the C++ code and the OpenVX version using both tiled and un-tiled user kernels.

It made no difference.

1. What are the key points for getting a speedup with OpenVX?

2. Are there any tricks to speed it up?

3. What should be considered when using OpenVX?

4. Does OpenVX use the GPU?

Thanks.

Best Regards.

Ryan_M_Intel1 · 05-08-2018

Hi Stephen,

I'd like to understand in a bit more detail the graph / pipeline that you are running with OpenVX. Can you give a brief description of the graph structure? If you are primarily processing images, what are the sizes and formats of the images being processed? Finally, which Intel platform are you using? Answers to these questions will help us to provide better advice for your specific use case.

For now, I can try to answer your questions more generally:

1. What are the key points for getting a speedup with OpenVX? What should be considered when using OpenVX?

The fundamental difference between OpenVX and more traditional image processing frameworks is its deferred execution model: the user describes all of the work that needs to happen ahead of time by assembling a pipeline / graph… and later on (after a verification stage) executes all of that work at once. During the verification stage, the OpenVX runtime is able to observe the workload at a high level, and is thus able to perform some really neat optimizations. For example, where there are virtual images, the OpenVX engine is free to use more efficient memory allocation schemes, or to eliminate these buffers altogether (through node fusion), because the user has acknowledged that they do not need access to these images outside of graph execution.
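To make the node fusion idea concrete, here is a minimal plain C++ sketch. It is not OpenVX code and says nothing about how any particular runtime implements fusion; the two per-pixel stages (`brighten`, `threshold`) and both function names are hypothetical, chosen only for illustration. The first function behaves like two independent calls with a full-size intermediate image; the second produces the same output without that intermediate buffer, which is the kind of rewrite a runtime is free to make when the intermediate image is virtual.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-pixel stages standing in for two graph nodes.
inline std::uint8_t brighten(std::uint8_t p) {
    // Add 10, saturating at 255.
    return static_cast<std::uint8_t>(p > 245 ? 255 : p + 10);
}

inline std::uint8_t threshold(std::uint8_t p) {
    return p >= 128 ? 255 : 0;
}

// Unfused: each stage walks the whole image, and the result of the first
// stage is stored in a full-size intermediate buffer.
void run_unfused(const std::vector<std::uint8_t>& in,
                 std::vector<std::uint8_t>& out) {
    assert(out.size() == in.size());
    std::vector<std::uint8_t> tmp(in.size());  // intermediate image
    for (std::size_t i = 0; i < in.size(); ++i) tmp[i] = brighten(in[i]);
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = threshold(tmp[i]);
}

// Fused: both stages are applied per pixel in a single pass, so the
// intermediate buffer is never allocated or written.
void run_fused(const std::vector<std::uint8_t>& in,
               std::vector<std::uint8_t>& out) {
    assert(out.size() == in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = threshold(brighten(in[i]));
    }
}
```

Both functions return identical results; the fused version simply touches memory once per pixel instead of three times, which is where the saving comes from.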

Usually, imaging / traditional computer vision workloads that can be represented as a pipeline (or graph) are good candidates for OpenVX.

Intel’s implementation of OpenVX achieves many of its performance benefits through image tiling. To name a few:

- CPU multithreading is achieved by dispatching image tiles to the CPU worker threads.
- Using tile buffers between nodes in a graph, together with some intelligent scheduling, can drastically reduce cache misses.
- Heterogeneity: the graph processing can be efficiently pipelined between CPU & GPU, making good use of device concurrency.

2. Are there any tricks to speed it up?

Certainly. This one is harder to answer without more knowledge of your workload… but one general piece of advice would be to try changing the default tile size used by a graph. This can be done by setting the graph attributes VX_GRAPH_TILE_WIDTH_INTEL and VX_GRAPH_TILE_HEIGHT_INTEL (take a look at the color copy pipeline sample for example usage). If the tile size is too big, the work is not granular enough to keep the executors busy. If it's too small, there is runtime overhead from dispatching so many tasks.
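As a rough way to reason about that trade-off, the sketch below counts how many tiles a given tile size produces for an image. It is plain C++ and does not call OpenVX; the `TileGrid` struct, both function names, and the candidate tile sizes are assumptions made for illustration, and the real scheduler may split work differently.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper type: how many tiles cover the image.
struct TileGrid {
    int cols;
    int rows;
    int count;
};

// Tiles at the right and bottom edges may be partial, so round up.
TileGrid tile_grid(int image_w, int image_h, int tile_w, int tile_h) {
    assert(image_w > 0 && image_h > 0 && tile_w > 0 && tile_h > 0);
    const int cols = (image_w + tile_w - 1) / tile_w;
    const int rows = (image_h + tile_h - 1) / tile_h;
    return TileGrid{cols, rows, cols * rows};
}

// Print the tile count for a few candidate tile sizes (assumed values) so
// they can be compared against the number of worker threads.
void print_tile_counts(int image_w, int image_h, int workers) {
    assert(workers > 0);
    const int sizes[][2] = {{1920, 1080}, {256, 256}, {128, 64}, {16, 16}};
    for (const auto& s : sizes) {
        const TileGrid g = tile_grid(image_w, image_h, s[0], s[1]);
        std::printf("tile %4dx%-4d -> %5d tiles (%.1f per worker)\n",
                    s[0], s[1], g.count,
                    static_cast<double>(g.count) / workers);
    }
}
```

For a 1920x1080 image, a single 1920x1080 tile leaves every worker but one idle, 256x256 gives 40 tiles, and 16x16 gives 8160 tiles, each of which carries some dispatch cost. A sensible starting point is usually somewhere in between, then measure.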

3. Does OpenVX use the GPU?

It can, but by default all nodes in a graph are executed on the CPU. For many of the built-in nodes that are provided with Intel's implementation, a user can set the 'target affinity' of these nodes to the GPU using the vxSetNodeTarget API. I recommend taking a look at the OpenVX samples for example usage of this API. Of course, tiled and non-tiled user nodes will use the CPU. There is an OpenCL variant of this user kernel API for importing kernels intended to run on the GPU.

Looking forward to hearing more details!

Best Regards,

Ryan