I've been developing a solution with a full in-GPU pipeline for a customer. It decodes multiple streams, blits the frames into a common RGB32 surface, uses VPP to convert it to NV12, and encodes it into a new stream. The pipeline looks like this:
H.264 streams -> decoders -> NV12 surfaces -> blit -> RGB32 surface -> VPP conversion -> NV12 surface -> encoder
On Ivy Bridge i3-3225 (HD4000) I got a result of 183fps (1080p, lowest quality, every parameter is equal between tests).
On a new Haswell i5 (HD4600), I got a very disappointing result of 139fps, much lower than before.
Investigating further, I've found the culprit to be the VPP colour space conversion. If I remove just this operation from the pipeline, the old HD4000's performance moves up modestly to 201fps, while the HD4600 jumps to about twice its former performance, 263fps.
Is this a known issue? Are there plans to fix it? Is there a way around it?
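To put the gap in per-frame terms, here is a small back-of-the-envelope sketch using only the four fps figures above. It assumes the VPP conversion simply adds to the time spent on each frame, which is a simplification for a pipelined GPU workload, so treat the results as rough estimates rather than measurements. The function name `vpp_cost_ms` is just something made up for this illustration.

```python
# Measured throughput from the tests above (frames per second).
fps = {
    "HD4000 (Ivy Bridge)": {"with_vpp": 183.0, "without_vpp": 201.0},
    "HD4600 (Haswell)": {"with_vpp": 139.0, "without_vpp": 263.0},
}


def vpp_cost_ms(with_vpp, without_vpp):
    """Extra milliseconds per frame attributable to the VPP step.

    Assumption: the conversion adds serially to the frame time.
    """
    return 1000.0 / with_vpp - 1000.0 / without_vpp


# Per-GPU estimate of what the colour space conversion costs per frame.
costs = {gpu: vpp_cost_ms(**m) for gpu, m in fps.items()}

for gpu, cost in costs.items():
    print(f"{gpu}: VPP adds about {cost:.2f} ms per frame")
```

Under that assumption the conversion costs roughly 0.5 ms per frame on the HD4000 and roughly 3.4 ms per frame on the HD4600, so the same step appears to be several times more expensive on the newer part.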
Did you see my response to question here? http://software.intel.com/en-us/forums/topic/476401
Also, there are some processing options that might be getting applied without your intending to use them, and these would affect performance. (See section 4.11 of the Developers Guide for a discussion of "Hint-based VPP filters".)
I've seen your response. However, currently I've lost access to the Haswell machine I was testing on, so it'll take another week until I can test it again.
Your comment here, that Haswell adds extra filters I didn't ask for, makes more sense than the comment there. I'll check that, thanks.
I have tried to disable all hint-based processing. However, only 3 types (the first 3 specified below) actually accepted the "do not use" setting, while the rest caused initialization failure.
Running with these 3 options didn't help at all, I'm afraid, leaving the performance as bad as it was, far inferior to the Ivy Bridge i3-3225.
We are still looking into this, and I believe there are a few factors at work here. One issue may be a bug in the drivers (please watch for updates soon), but another issue may be the architecture of the application. Please be sure that your code is written to allow full asynchronous use of available resources, as there are some implementation differences between the two platforms, and there are some known cases where synchronous workloads might be slower on the newer platform.
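As a rough illustration of what "asynchronous use" means here, the sketch below keeps several frames in flight and only waits on the oldest one, instead of waiting after every submission. It is a generic Python model, not Media SDK code: `process_frame`, `run_pipeline`, and the 5 ms delay are hypothetical stand-ins, and `async_depth` is only meant to be analogous in spirit to the SDK's `AsyncDepth` setting and the point at which an application calls `SyncOperation`.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def process_frame(index):
    """Hypothetical stand-in for one submitted frame operation."""
    time.sleep(0.005)  # pretend the hardware is busy for 5 ms
    return index


def run_pipeline(num_frames, async_depth):
    """Keep up to async_depth frames in flight; wait only on the oldest."""
    done = []
    in_flight = deque()
    with ThreadPoolExecutor(max_workers=async_depth) as pool:
        for i in range(num_frames):
            in_flight.append(pool.submit(process_frame, i))
            if len(in_flight) >= async_depth:
                # Sync point: block on the oldest outstanding frame only.
                done.append(in_flight.popleft().result())
        while in_flight:  # drain whatever is still queued at the end
            done.append(in_flight.popleft().result())
    return done


for depth in (1, 4):
    start = time.perf_counter()
    frames = run_pipeline(40, depth)
    elapsed = time.perf_counter() - start
    print(f"async_depth={depth}: {len(frames)} frames in {elapsed:.3f} s")
```

With `async_depth=1` every submission is followed immediately by a wait, which is the synchronous pattern described above; with a larger depth the waits overlap with work that is already queued. Output order is preserved in both cases because frames are always retired oldest first.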