streaming video thru Phi

Andres_G_1 · ‎03-12-2015

I am currently developing a real-time video processing application that runs on a dedicated 2-CPU Xeon Linux box. The application supports multiple video inputs and multiple video outputs with standard image processing such as picture-in-picture, graphics, and language-specific text overlays. It is basically a pipeline-based architecture: a given input video stream is overlaid with language-specific text, and each language-specific stream is then sent to a separate output.

I currently know nothing about Phi or GPU programming, only that it suits applications that can be structured for parallel processing, such as vector processing. I do not know whether it is a good choice for my particular application, so I thought I would ask a high-level newbie question.

Q: Is the memory transfer bandwidth between the host Xeon CPU memory and the Phi memory sufficient to support multiple video streams?

The Phi seems appropriate for image processing once the image is in the Phi memory, but since this is obviously not what I would call a massively parallel application I am not sure the Phi is a good choice for this particular application.

I am looking for an excuse to start the learning curve for Phi/GPU programming but probably should not go down that path if this application is not a viable match.

-Andres

Charles_C_Intel1 · ‎03-27-2015

Andres:

I think PCI-Express 2.0 x16 runs at about 8 GB/s in each direction at its theoretical peak, with sustained transfer rates somewhat lower. That might be enough for a few uncompressed video streams, or a fair number of compressed streams. But making meaningful use of that capacity requires considerable care: overlapping compute with transfer, choosing appropriate buffer sizes, making sure neither the host nor the coprocessor sits waiting on the other, and so on. In short, it's not trivial.
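
To put rough numbers on that budget, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration, not a measurement: the 8 GB/s figure is the theoretical per-direction peak of PCIe 2.0 x16, the 0.75 efficiency factor is a hypothetical derating for protocol overhead and imperfect overlap, and the stream formats (2 bytes per pixel, as in 8-bit 4:2:2) are just examples. The helper names are made up for this sketch.

```python
# Back-of-the-envelope PCIe budget for uncompressed video.
# All figures below are illustrative assumptions, not measurements.

def stream_bytes_per_second(width, height, bytes_per_pixel, fps):
    """Raw data rate of one uncompressed video stream, in bytes per second."""
    return width * height * bytes_per_pixel * fps


def max_streams(link_bytes_per_second, stream_rate, efficiency=0.75):
    """Whole streams that fit in one direction of the link.

    `efficiency` is a hypothetical derating factor for protocol overhead
    and imperfect compute/transfer overlap; replace it with a measured value.
    """
    usable = link_bytes_per_second * efficiency
    return int(usable // stream_rate)


# Assumed theoretical peak of PCIe 2.0 x16, per direction, in bytes/s.
PCIE2_X16_PER_DIRECTION = 8e9

# Assumed example formats: 8-bit 4:2:2 video at 2 bytes per pixel.
hd = stream_bytes_per_second(1920, 1080, 2, 30)
uhd = stream_bytes_per_second(3840, 2160, 2, 60)

for name, rate in (("1080p30", hd), ("2160p60", uhd)):
    count = max_streams(PCIE2_X16_PER_DIRECTION, rate)
    print(f"{name}: {rate / 1e6:.1f} MB/s per stream, "
          f"up to {count} streams per direction")
```

Keep in mind that every processed frame has to travel back to the host as well, and that one input fanned out into several language-specific outputs puts more traffic on the return direction than on the way in.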

What concerns me is that you say "this is obviously not what I would call a massively parallel application". If you can't max out the Xeon host most of the time with highly parallel work that is well-vectorized and makes optimal use of the host's cache architecture, then it's not looking good for a coprocessor. The Intel Xeon Phi coprocessor is a pretty slow serial machine that really needs a high degree of parallelism, vectorization, and good memory use to shine. And by "high degree" I mean using all threads well north of 90% of the time (look up Amdahl's law to see why). Anything you do to optimize the Intel Xeon code will benefit an Intel Xeon Phi implementation if you eventually go that way.
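
For reference, Amdahl's law bounds the speedup at 1 / ((1 - p) + p / n), where p is the fraction of run time that parallelizes and n is the number of workers. The short Python sketch below evaluates it for a few values of p. The thread count of 240 is an assumption (a 60-core card with 4 hardware threads per core), and the function name is made up for this illustration.

```python
# Amdahl's law: upper bound on speedup for a partly parallel workload.

def amdahl_speedup(parallel_fraction, workers):
    """Best-case speedup when `parallel_fraction` of the run time scales
    perfectly across `workers` and the remainder stays serial."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Assumption: 60 cores x 4 hardware threads per core.
THREADS = 240

for p in (0.50, 0.90, 0.99, 0.999):
    bound = amdahl_speedup(p, THREADS)
    print(f"parallel fraction {p:.3f}: at most {bound:.1f}x over one thread")
```

Even at 90% parallel the bound is below 10x on 240 threads, and the formula is optimistic here because it assumes the serial part runs at the same speed everywhere, which is not the case on a coprocessor with slow serial performance.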

Sorry, Charles