I'd like to be able to decode a real-time video stream (currently I'm using a video file for testing) that is 1920x1080 progressive H.264, for video analysis. I have a custom video analysis DirectShow filter that takes a few milliseconds per frame. Just using QuickSync to decode, VPP to convert to RGB32, and then storing the frame takes 8 to 10 milliseconds. After the analysis, writing to the frame and more stores of the modified frames into system memory take up another 5-8 msec (I'm running on a Sandy Bridge quad-core i7 CPU). As a result I can't maintain the 60 FPS analysis rate that real time will require (I'm closer to about 30 frames, as there is also overhead for a splitter etc.). Switching to a vanilla Haswell chip might gain 20%, but that would not be sufficient; I'm looking for a 100% or more speedup.

Since a fair amount of time is taken up by the memory transfers/copies (i.e. I believe I'm primarily memory bound), it occurred to me that an Iris Pro Graphics Haswell chip, with its much higher memory bandwidth from the EDRAM, may be the solution. However, QuickSync's QueryIOSurf recommends 32 surfaces, and those would not fit into the 128 MByte EDRAM.
Is there a way to get the decoder to use the EDRAM for as many surfaces as will fit in the 128 MBytes, and put the rest in system memory? Would this speed up the decode/VPP?
In general, is there any documentation on how to target the EDRAM from QuickSync, and from general CPU code? All I could find searching the Internet was a reference at the last IDF about using the Intel drivers, and some vague comments about sharing it with the CPU. Surely there must be some info I missed that explains how to do this in detail.
Any help would be appreciated.
I'm not aware of any API that allows an application to control use of the EDRAM.
I'm not sure your needs require it, however. There have been several improvements since the 2nd Generation Intel® Core™ i7 processors (Sandy Bridge) were launched.
Using 1080p -> Decode (NV12 VidMem) -> VPP (RGB32 SysMem), I ran a quick test and see the following:
Intel(R) Core(TM) i7-2600 (Intel(R) HD Graphics): ~100 fps, latency ~10.0 ms avg (23.1 ms max)
Intel(R) Core(TM) i7-4770S (Intel(R) HD Graphics 4600): ~568 fps, latency ~1.7 ms avg (3.2 ms max)
I did not have access to Intel(R) Iris(TM) Pro Graphics 5200, for comparison, but it seems like you might not need that much power to get the performance you seek.
Also, please be aware that Intel Graphics features "Dynamic Frequency" and can run slow to save power if you do not provide a sufficient workload. For example, the graphics clocks in the Intel(R) Core(TM) i7-4770S can run as slow as 350 MHz or as fast as 1.2 GHz. In this case I believe you are probably submitting workloads frequently enough to cause the graphics engine to run fast, but I wanted to make sure this was known in case you create any 'focused' tests.
Thanks for the quick reply. The speedup in decode is encouraging, but I don't think it will be sufficient. At the current 30 fps of execution, an 8 msec improvement will still leave me with a 25 msec/frame time, considerably short of the less-than-16 msec required (about 10 msec/frame would be desirable, to leave some leeway for the inevitable deviation from average execution times). The preferred solution would be for the decode/VPP to leave the result in the EDRAM, so that the copies/analysis could execute out of it and take advantage of the fast EDRAM memory access.
I didn't expect that there would be an API to control the use of the EDRAM; however, there must be strategies/memory assignments that would allow predictable and deterministic use of the EDRAM. After all, Intel would not have invested semiconductor real estate in, and charged a price premium for, the Iris Pro Graphics feature only to leave its potential to random (non-deterministic) use. Perhaps your contacts within Intel can shed some light on this, or point me to existing documents that explain how to take advantage of the EDRAM. Any help you could provide in this regard would be greatly appreciated.
I want to point out that the speedup is seen in VPP's memory and format conversion.
For example, if I run VPP on 1080p NV12 in video memory -> RGB32 in system memory, I see ~5.75 ms on the i7-2600 vs. ~0.64 ms on the i7-4770S.
(This is direct use of the Media SDK API with blank NV12 frames; no DShow overhead.)
Thanks for continuing to track this issue. However, it doesn't really answer the question raised in my previous post. I suspect the higher memory speed is due to the second port on Haswell, and to the fact that the memory copies, being sequential, are coming out of cache. In my case, however, the analysis software needs to access the memory in a somewhat random order (depending on what it detects), and to do so multiple times. Hence the bandwidth imposed by the memory bus is the dominant factor, and the EDRAM (with a memory bandwidth at least an order of magnitude higher) is the solution.
Is there some proprietary issue here? Otherwise I don't understand why it is so difficult to get useful information about how to use this functionality, especially given that Intel has touted the feature extensively. So, if possible, could you do some checking into how I can get some answers to the question in my previous post?