"I belong to the FFDShow Tryout development team and we are trying to reproduce your work.
For your information, we recently imported the MPC-HC DXVA implementation into our project.
The goal is to decode the frames with DXVA 1 & 2, copy the frames back into system memory to process them, and then write them back.
But we have a problem: we don't get the same speed results as you did (I have a Q9450 with a Radeon 5750 in a PCI Express x16 slot).
With memcpy or the SSE4.1 optimized copy method, it takes 80 ms to copy one frame.
Do you have any idea what is wrong?"
"Either we are doing something wrong (but I am beginning to doubt it), or else transfers in the GPU=>CPU direction are slow by design.
I hope someone will be able to get in touch with the person at Intel who wrote this article (though I guess he only tried low-res videos).
...Also note that we are talking about reading (GPU=>CPU); writing is very fast."
I apologize for not responding sooner - somehow I missed seeing notification of your post.
My own testing involved copying hardware-decoded high-def video frames back to conventional Read/Write memory from Intel integrated graphics USWC memory on an Intel Core i5 processor.
I don't believe the MOVNTDQA instruction provides much, if any, benefit for discrete graphics processors with USWC memory mapped over PCI Express. The benefits mainly apply to system memory mapped as USWC, including system memory mapped for integrated graphics use.
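
For anyone following along, here is a minimal sketch of what a MOVNTDQA-based copy loop can look like in C, using the `_mm_stream_load_si128` intrinsic. This is a simplified illustration, not necessarily the routine from the article or the one FFDShow uses. The function name `copy_from_uswc` is hypothetical, and the sketch assumes both pointers are 16-byte aligned and the size is a multiple of 64 bytes. On non-x86 builds it simply falls back to `memcpy`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <smmintrin.h> /* SSE4.1: _mm_stream_load_si128 (MOVNTDQA) */
#define HAVE_X86_SSE41_PATH 1
#if defined(__GNUC__)
/* Lets GCC/Clang compile this one function for SSE4.1 without -msse4.1. */
#define SSE41_FN __attribute__((target("sse4.1")))
#else
#define SSE41_FN
#endif
#endif

/* Hypothetical helper: copy `bytes` from a (presumed USWC) source buffer to
 * ordinary cached memory.
 * Assumptions: src and dst are 16-byte aligned, bytes is a multiple of 64. */
#ifdef HAVE_X86_SSE41_PATH
SSE41_FN
void copy_from_uswc(void *dst, const void *src, size_t bytes)
{
    __m128i *d = (__m128i *)dst;
    __m128i *s = (__m128i *)src; /* some headers declare a non-const parameter */
    size_t i;

    assert(((uintptr_t)src & 15) == 0);
    assert(((uintptr_t)dst & 15) == 0);
    assert((bytes & 63) == 0);

    _mm_mfence(); /* order the streaming loads after earlier memory traffic */
    for (i = 0; i < bytes / 64; ++i) {
        /* Four 16-byte streaming loads cover one 64-byte line. */
        __m128i x0 = _mm_stream_load_si128(s + 0);
        __m128i x1 = _mm_stream_load_si128(s + 1);
        __m128i x2 = _mm_stream_load_si128(s + 2);
        __m128i x3 = _mm_stream_load_si128(s + 3);
        _mm_store_si128(d + 0, x0);
        _mm_store_si128(d + 1, x1);
        _mm_store_si128(d + 2, x2);
        _mm_store_si128(d + 3, x3);
        s += 4;
        d += 4;
    }
    _mm_mfence();
}
#else
void copy_from_uswc(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes); /* portable fallback, no streaming loads */
}
#endif
```

Note that on ordinary write-back memory the streaming load behaves like a normal load, so a loop like this only has a chance of helping when the source really is mapped as USWC. That is why timing it against a plain system-memory buffer will not show a difference.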