Sorry for the delay. I was actually writing the response to the original as you posted the 2nd message. To answer your question: yes, the decoder must copy the data from the GPU back to system memory. There is an inherent performance penalty when not using D3D surfaces.
I have been testing memcpy() from video memory to system memory after H.264 decoding (1080p).
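For context, a per-frame copy of this kind can be sketched roughly as below. This is only an illustration under assumptions: the `Nv12View` struct and `copy_nv12_to_system` function are hypothetical names, not Media SDK types, and the sketch assumes an NV12 surface that has already been locked so that its plane pointers and pitch are valid. It copies row by row because the surface pitch is usually larger than the visible width.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical view of a locked NV12 surface (not a Media SDK type).
struct Nv12View {
    const std::uint8_t* y;   // start of the luma (Y) plane
    const std::uint8_t* uv;  // start of the interleaved chroma (UV) plane
    std::size_t pitch;       // bytes per row in the source, may exceed width
    std::size_t width;       // visible width in pixels
    std::size_t height;      // visible height in pixels (assumed even)
};

// Copies the visible pixels into a tightly packed system-memory buffer.
// Layout of the result: height rows of Y, then height/2 rows of UV.
std::vector<std::uint8_t> copy_nv12_to_system(const Nv12View& src) {
    assert(src.height % 2 == 0);
    std::vector<std::uint8_t> dst(src.width * src.height * 3 / 2);
    std::uint8_t* out = dst.data();

    // Luma plane: one memcpy per row, skipping the padding after each row.
    for (std::size_t row = 0; row < src.height; ++row) {
        std::memcpy(out, src.y + row * src.pitch, src.width);
        out += src.width;
    }
    // Chroma plane: half as many rows, each still `width` bytes (U and V interleaved).
    for (std::size_t row = 0; row < src.height / 2; ++row) {
        std::memcpy(out, src.uv + row * src.pitch, src.width);
        out += src.width;
    }
    return dst;
}
```

For a 1080p frame this moves about 3 MB per call, so the cost grows quickly with the number of streams and the frame rate.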
However, CPU usage varies greatly with the number of streams (Windows 7 32-bit, i5-2400).
With 1 to 5 streams, CPU usage is 1~2%.
With 6 or more streams, CPU usage is 90%. (I think HW decoding is switching to SW decoding.)
I saw in your earlier answers that CopyFrame and CopyBuffer use less CPU than memcpy().
CopyFrame and CopyBuffer belong to the mfxCoreInterface, which is used with mfxPlugin.
How can I use CopyFrame and CopyBuffer without mfxPlugin?