video memory lock, memory type

ME · ‎09-14-2014

I've got a very slow user plugin when using video memory. Is the memory address returned from

sts = m_pAlloc->Lock(m_pAlloc->pthis, frame->Data.MemId, &frame->Data);

system memory (as in, copy from video to system on lock, copy back on unlock), or is it simply video memory being mapped to user space?

With video memory, the plugin Process takes 100+ ms (measured between the lock and the unlock; the lock/unlock themselves contribute no measurable time). With system memory (same lib*HW) it is practically 0 ms. For this test I am syncing after dec, after the user plugin, and after enc. I ran again with only a single sync after enc (async depth = 1) and got the same (slow) result.

Is this normal? Is it recommended to stick to system memory for user-plugin in/out surfaces?

-----

(Barely related, and for posterity: in an older driver (3346, I think) I was getting MFX_ERR_INVALID_HANDLE doing a lock on video memory when using the lib*HW; doing the same video memory lock when using lib*SW was okay.)

Sravanthi_K_Intel · ‎09-17-2014

Hello there,

If the plugin being used is a software plugin, system surfaces should be used. If it is HW, then video surfaces should be used. Mixing them can cause performance issues and some errors as well.

If you can provide us with some information on the plugin being used (is it your own plugin, SW or HW?), whether you are using an MSDK sample/tutorial, and some more details, that would surely be helpful.

ME · ‎09-18-2014

Thanks, I think I've got it under control. Doing nothing else different but copying from video memory to my own system memory got me a 5x (500%) gain, and that was a naive set of memcpy calls. That was with dx9 video memory; I am switching to d3d11 memory. I also took a look at the Thomas C white paper (fast copy FROM video memory to system memory), and I expect a large speed increase with that, at least from video memory to system.

When a Vpp resize is also involved, using video memory (for dec/vpp-resize/(.)/enc) is faster than all system memory, even at this stage (naive memcpys). If no resize is in the xcode, all system memory is a bit faster. I think I can get it to do better using video memory with only a little effort.

BTW, the CopyBuffer() on my June 2014 HW driver is a stub. Tracing it shows it won't do anything useful in any possible path. I take it this code is similar to the non-temporal copy from USWC memory in the white paper. Simple enough to do. That should move things along a lot faster, though I still have to see what the D3D11 Locks do (still looking at it).

Q: Do you know why the opposite direction, from system memory to USWC memory, is not discussed? There is not even a mention of why it's not discussed, anywhere, other than "not covered here".

ME · ‎09-20-2014

Exec summary: use D3D11 memory.

Good news. D3D11 surfaces made all this simple (no extra work needed). Before, it was 100+ ms just for my plugin (max MHz?), and now:

ImSdk::Transcode calls= 7050 accum secs= 51.113 ms per call= 7.250

Async depth = 1. The CPU is speed-stepped down to 498 MHz all the time now. All things relative, that is maybe 20x faster than the first go (well, more, given the MHz difference), without much work, and with no more code than what is already in the simple (tutorial) samples as far as using D3D11 memory goes.

I settled on:

Dec -(v)-> VppResize -(v)-> VppPlugin -(s)-> Enc

Putting the plugin-out/encoder-in surfaces into system memory (s) was about 2 ms per call faster than using all video memory (v). Without the VppResize, all system memory (s) is the way to go. The plugin code is not optimized, just memcpys (a debug build, even). I might shave even more off, but that's another problem.

ME · ‎09-20-2014

FTR: Here it is with all system memory (much slower if vpp-resize is used):

Dec -(s)-> VppResize -(s)-> VppPlugin -(s)-> Enc

ImSdk::Transcode calls= 7050 accum secs= 72.296 ms per call= 10.255

If not using the vpp-resize, but using the plugin (dec->plugin->enc), then using all system memory is faster by about 3 ms per call:

Dec -(s)-> VppPlugin -(s)-> Enc <=== this is 3 ms faster per call than below
Dec -(v)-> VppPlugin -(s)-> Enc <=== slower than above by 3 ms

One tricky thing is getting the memory types correct. Be sure that WILL_READ (you should not need WILL_WRITE) is seen in the lock, and that desc.BindFlags is not mistakenly set to the wrong type.