Analyzers
Talk to fellow users of Intel Analyzer tools (Intel VTune™ Profiler, Intel Advisor)

NVD3DUM Eating CPU Cycles

LeeBamber
Beginner

I've been doing a little profiling work on my games engine and have noticed that an inordinate amount of time can be spent inside the NVD3DUM.DLL module, and VTune reports up to 50% of my CPU workload happens inside.  For now I have resigned myself to accepting that this is probably the driver handling the preparation and submission of my graphics data to the GPU, but that's just a guess, and information about this module is very hard to find.  There also do not appear to be any VS symbol (PDB) files available for it, so I can't see exactly what it's doing when I drill down into it, which is another shroud of mystery to tackle.

Does anyone have any links or information on this module: what it is doing, how it can be optimized, and perhaps how to get more symbol-style information on it, at least enough to give me a clue what general area I should be optimizing?  If anyone posts something useful, I promise to add it to my daily blog and lavish you with praise for digging out information on what seems to be a black box library that is stealing more than half of my processing cycles!

All I can guess is that NVD3DUM stands for NVIDIA DirectX 3D Unmanaged Driver, but beyond that, it's a mystery module!

Bernard
Valued Contributor I

>>>I've been doing a little profiling work on my games engine and have noticed that an inordinate amount of time can be spent inside the NVD3DUM.DLL module, and VTune reports up to 50% of my CPU workload happens inside.>>>

It is recommended to double-check the results with a different profiling tool. For that purpose you can use Windows Performance Recorder (part of the Windows Performance Toolkit, like Xperf). Start profiling your 3D engine and pay attention to the CPU hotspots and to the driver's DPC routines.

Regarding the VTune screenshot: unfortunately it does not contain relevant information about the various CPU metrics, such as retired instructions, CPI, CPU time, etc.

For a clearer picture, please run at least a Hotspot analysis and post the screenshots. You can also run an Advanced Hotspot analysis and try the DirectX frame analysis (if appropriate in your case).

>>>If anyone posts something useful, I promise to add it to my daily blog and lavish you with praise for digging out information on what seems to be a black box library that is stealing more than half of my processing cycles!>>>

I would advise using one of the freely available tools that can parse the PE sections of a DLL or SYS module. You can use the free version of IDA Pro to quickly parse the export and import sections of the aforementioned driver and try to understand, at a very shallow level, what the purpose of that driver is. If you need more help, please send me a private message.
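
For readers without IDA at hand, the exported names can also be read straight from the PE headers. The following is a minimal sketch in Python using only the standard library. The function names are invented for this example, it reads only the export name table, and it assumes a well-formed PE32 or PE32+ image.

```python
import struct

def _sections(data, pe_off):
    # The COFF header follows the 4-byte "PE\0\0" signature.
    num_sections, = struct.unpack_from("<H", data, pe_off + 6)
    opt_size, = struct.unpack_from("<H", data, pe_off + 20)
    table = pe_off + 24 + opt_size  # section table follows the optional header
    out = []
    for i in range(num_sections):
        off = table + 40 * i  # each section header is 40 bytes
        name = data[off:off + 8].rstrip(b"\0").decode("ascii", "replace")
        vsize, vaddr, rsize, rptr = struct.unpack_from("<4I", data, off + 8)
        out.append((name, vaddr, max(vsize, rsize), rptr))
    return out

def _rva_to_offset(sections, rva):
    # Map a relative virtual address to a file offset via the section table.
    for _name, vaddr, size, rptr in sections:
        if vaddr <= rva < vaddr + size:
            return rptr + (rva - vaddr)
    raise ValueError("RVA 0x%x is not inside any section" % rva)

def export_names(data):
    """Return the exported function names of a PE image given as bytes."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    pe_off, = struct.unpack_from("<I", data, 0x3C)
    if data[pe_off:pe_off + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    opt = pe_off + 24
    magic, = struct.unpack_from("<H", data, opt)
    # Data directories start at offset 96 (PE32) or 112 (PE32+) of the
    # optional header; entry 0 is the export directory.
    dirs = opt + (112 if magic == 0x20B else 96)
    export_rva, _size = struct.unpack_from("<II", data, dirs)
    if export_rva == 0:
        return []
    secs = _sections(data, pe_off)
    exp = _rva_to_offset(secs, export_rva)
    num_names, = struct.unpack_from("<I", data, exp + 24)  # NumberOfNames
    names_rva, = struct.unpack_from("<I", data, exp + 32)  # AddressOfNames
    names_off = _rva_to_offset(secs, names_rva)
    result = []
    for i in range(num_names):
        str_rva, = struct.unpack_from("<I", data, names_off + 4 * i)
        start = _rva_to_offset(secs, str_rva)
        end = data.index(b"\0", start)
        result.append(data[start:end].decode("ascii", "replace"))
    return result
```

Reading the DLL with `open(path, "rb").read()` and passing the bytes to `export_names` gives the list of exported names. Imports would need a similar walk over the import directory (data directory entry 1), which this sketch leaves out.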

 

LeeBamber
Beginner

Tried to install Windows Performance Recorder, but like most MS installers it found a nice ambiguous way to fail with some suitably unhelpful generic error message:

WPR-crashes.png

Much more luck with the Hotspot Analysis, but it did take ten minutes to find the way into the tool. It's assumed the button is obvious, so no-one mentions it in any help documentation except in passing. Here is the summary hotspot shot as requested:

BasicHotspotAnalysis.png

Aside from the fact that my app is terribly bad at concurrency (to be solved later on), I am assuming the 10 second spin wait score is due to my application being GPU bound, with the driver having to lock the CPU while it communicates all the rendering instructions to the graphics card. If this may be the case, what are the steps to confirming this?

 

Bernard
Valued Contributor I

>>>Tried to install Windows Performance Recorder, but like most MS installers it found a nice ambiguous way to fail with some suitably unhelpful generic error message>>>

If you still want to use Xperf, you can try downloading it from this link: http://www.microsoft.com/en-us/download/details.aspx?id=3138

A short description of the installation process can be found here: http://blogs.msdn.com/b/jimmymay/archive/2009/11/24/xperf-install-windows-performance-toolkit-wpt-with-242mb-download-not-2-5gb-windows-7-sdk-part-2.aspx
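
Once the toolkit is installed, a CPU sampling capture around a run of the engine could be scripted roughly as follows. This is only a sketch: it assumes xperf.exe is on the PATH and that the script is started from an elevated prompt, and the helper names and the output file name are made up for illustration.

```python
import subprocess

# Kernel flags: process/thread and image-load events (needed to attribute
# samples to modules), the sampling profiler, and DPC/interrupt events.
KERNEL_FLAGS = "PROC_THREAD+LOADER+PROFILE+DPC+INTERRUPT"

def start_cmd():
    # -stackwalk Profile records a call stack with every CPU sample.
    return ["xperf", "-on", KERNEL_FLAGS, "-stackwalk", "Profile"]

def stop_cmd(etl_path):
    # -d stops the kernel session and merges the trace into one .etl file.
    return ["xperf", "-d", etl_path]

def capture(etl_path, run_workload):
    """Start tracing, run the workload callable, then always stop tracing."""
    subprocess.run(start_cmd(), check=True)
    try:
        run_workload()
    finally:
        subprocess.run(stop_cmd(etl_path), check=True)
```

For example, `capture("engine.etl", lambda: subprocess.run(["MyGame.exe"]))` would trace one run of a hypothetical MyGame.exe; the resulting file can then be opened in the trace viewer and grouped by module.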

 

I have a few questions related to the VTune output:

It looks like the main hotspot belongs to the Render module. Can you explain whether that module calls into DirectX directly? If it does, there is a specific VTune analysis which can track and profile the DirectX frames that hog the CPU. You should also review the additional tabs at the top of the summary window. You need to see exactly which portion of the Render module takes a lot of CPU time. I can see that spin time is also high, and it seems that your engine spawns only 3 threads, as seen in the attached screenshots. Do you use any synchronization between those threads? Are those threads spinning while waiting on resources held by another thread? I suppose that the high spin time could be due to the system rendering stage (DirectX->DxgKrnl->Nvidia display driver). In such a case you should perform additional analyses, such as Memory Bandwidth analysis, System-Wide analysis and Advanced Hotspot analysis. You should also review the results of the Front-End and Back-End analysis.

There is another very useful tool, GPUView.exe, which can help troubleshoot the GPU portion of your code.

http://graphics.stanford.edu/~mdfisher/GPUView.html

Bernard
Valued Contributor I

>>>I am assuming the 10 second spin wait score is due to my application being GPU bound, with the driver having to lock the CPU while it communicates all the rendering instructions to the graphics card. If this may be the case, what are the steps to confirming this?>>>

In order to confirm your theory about the application being GPU bound, please perform extensive testing with the GPUView.exe tool. If your code is GPU bound, you will see the frame rate drop well below the VSync rate of 60 fps. In the GPUView output you will see a large accumulation of DMA packets on the CPU graph, and prolonged processing which crosses more than one VSync line.
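
To make the "crosses more than one VSync line" check concrete, here is a small sketch that counts how many VSync ticks fall inside the lifetime of one packet. The timestamps are hypothetical values in milliseconds, the function name is invented, and the VSync ticks are assumed to sit at exact multiples of the refresh interval starting at time zero.

```python
import math

def vsync_lines_crossed(start_ms, end_ms, refresh_hz=60.0):
    """Count VSync ticks in the half-open interval (start_ms, end_ms]."""
    interval = 1000.0 / refresh_hz  # about 16.67 ms at 60 Hz
    return math.floor(end_ms / interval) - math.floor(start_ms / interval)

# A packet that is submitted at 2 ms and retires at 40 ms spans two ticks.
print(vsync_lines_crossed(2.0, 40.0))  # prints 2
```

A packet that regularly spans two or more ticks cannot have been presented within a single refresh, which is the pattern to look for in the trace.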

Bernard
Valued Contributor I

@Lee Bamber

Do you have any updates regarding your issue?

LeeBamber
Beginner

Alas the GPUView is also bundled with the same installer as the Windows Performance Recorder, which due to the almighty wisdom of Microsoft, allows precisely no fall back if the installer fails with a generic and unhelpful error message.  I remember the days when a good old fashioned executable was available, and if you was a lucky chappy, it was wrapped up in a nice ZIP file. Now you have to register a hundred GUIDs and do a triple back-flip before you can run anything these days!  I plan to resurrect a new PC system I have been slowly building this past year, which is a fresh Windows 8 installation so will probably have better luck with that.  I will post here again once I have moved to the stage of running GPUView as part of my main development kit.  I also have an Ultrabook here so will probably try the installer again on there (which has Windows 8) as I might get a little further than 33% before the installer crashes :)

LeeBamber
Beginner

Okay, so by installing on a fresh Windows 8 machine, then copying the whole WPT folder over to my Windows 7 dev machine, and changing all files to non-read-only, I was able to trigger the event capture and produce the ETL file and view it in GPUView.  It's a screaming mess of values and charts, and would probably take a few days to completely absorb. I have attached it to this in case anyone with ETL file experience can shed light on whether I am CPU bound or GPU bound.

From reading the article "Matt's Webcorner" I was able to get an idea what the various charts mean, but for some reason I could not find my main thread work to show the CPU workload (just the NVD3DUM again), my D3D Present command seems to spend 16 milliseconds on the CPU queue side, which seems an eternity (as the app itself was running at around 200fps without vsync), and despite my instinct this chart is telling me very critical things, I don't know enough to know what I am seeing.

If someone can tell me CPU or GPU bound, and why, and where in the chart it is illustrated, that would be a great first start!  Thanks again!
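
One way to read the 16 millisecond Present figure is as a number of frames queued ahead of the GPU. The sketch below uses only the two numbers quoted above; treating queue time divided by frame time as the number of buffered frames is an assumption, not something the trace states directly.

```python
def frames_queued(queue_ms, fps):
    """Approximate how many frames' worth of work is waiting in the queue."""
    frame_ms = 1000.0 / fps  # 5 ms per frame at 200 fps
    return queue_ms / frame_ms

print(round(frames_queued(16.0, 200.0), 1))  # prints 3.2
```

Roughly three frames waiting in the queue would be consistent with the CPU running ahead of the GPU, which points toward GPU bound, but that is an inference that needs confirming in the chart.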

Bernard
Valued Contributor I

>>>Alas the GPUView is also bundled with the same installer as the Windows Performance Recorder, which due to the almighty wisdom of Microsoft, allows precisely no fall back if the installer fails with a generic and unhelpful error message>>>

I completely agree with your opinion. It seems that GPUView was removed from the Windows 8.1 SDK but is still present in the Windows 8 SDK, so I needed to download WPT twice in order to install GPUView.

>>>I will post here again once I have moved to the stage of running GPUView as part of my main development kit. >>>

Since you were able to install WPT, you can now also use Windows Performance Recorder for system-wide profiling.

Bernard
Valued Contributor I

>>>If someone can tell me CPU or GPU bound, and why, and where in the chart it is illustrated, that would be a great first start!  Thanks again!>>>

Today I will try to analyze the GPUView files. I presume that half a day will be spent on that. I will post the results later.

Bernard
Valued Contributor I

After spending more than 3 hours analyzing the GPUView output, I think that your game is partially CPU bound and partially GPU bound. The trend is almost the same during the whole time interval: there is no CPU downtime at the beginning or at the end of the VSync interval. A CPU DMA packet on average spends ~19 ms waiting in the SW queue, and in the GPU HW queue it spends ~9ms(processing time), which means that the GPU is somehow blocked waiting for the DMA packets to process. Your game could easily reach ~100 fps, but it is blocked by the long wait time in the CPU queue. I really do not know the reason for that prolonged wait time. I suppose that WPR can shed some light on it.

Tomorrow I will continue GPUView analysis.

Bernard
Valued Contributor I

>>>GPU HW queue it spends ~9ms(processing time)>>>

Small correction of the quoted sentence.

The waiting time in the GPU HW queue is ~9 ms and the processing time is ~4 ms. My further advice is to localize the hotspots inside NVD3DUM.DLL, which is called by D3D.DLL, and try to understand what is really happening. If you want, you can post screenshots here.
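
Taking the corrected figures at face value, the arithmetic can be laid out as below. The function name is invented, and the frame-rate ceiling assumes one DMA packet per frame with the GPU doing nothing else, which the trace does not guarantee.

```python
def latency_breakdown(sw_wait_ms, hw_wait_ms, gpu_busy_ms):
    """Split one packet's lifetime into waiting and processing time."""
    total = sw_wait_ms + hw_wait_ms + gpu_busy_ms
    return {
        "total_ms": total,
        # Fraction of the lifetime spent waiting in either queue.
        "waiting_share": (sw_wait_ms + hw_wait_ms) / total,
        # Frame rate if only GPU processing time limited throughput.
        "gpu_fps_ceiling": 1000.0 / gpu_busy_ms,
    }

print(latency_breakdown(19.0, 9.0, 4.0))
# prints {'total_ms': 32.0, 'waiting_share': 0.875, 'gpu_fps_ceiling': 250.0}
```

On these numbers a packet spends most of its lifetime waiting rather than being processed, so the queue wait is the part worth explaining first.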

Bernard
Valued Contributor I

Did you run any additional Xperf and VTune analyses?

LeeBamber
Beginner

I checked out Xperf but it was sensory overload for me to absorb. Way too many charts to choose from!  I know of it now, so next time I want to monitor overall CPU spikes I have a tool :)   Do not know how to "localize Hotspot inside the NVD3DUM.DLL" at present.  Currently doing some tests with Intel Frame Analyser on the Haswell Ultrabook as my researches so far suggest that I am pretty GPU bound, or at least I am sending WAY too much to the graphics side of the machine.  Currently having a hoot performance tuning an engine that has never seen VTune in two years; plenty of targets to choose from!

LeeBamber
Beginner

Anyone know why GPA Frame Analyser would cause my standalone executable to produce a crash, failing inside module "shimd3d32.dll" where it runs fine when not using the Analyse Application feature?  Precious little is documented about shimd3d32.dll but I am guessing it's a shim that is inserted into the DirectX API call process to intercept and report all D3D activity, etc.  Because there is NO stack information when I try to debug the release build of the standalone executable, I am proceeding to install my entire dev kit onto the Ultrabook (Haswell) to re-run it but with a debug build in the hopes I see where it came from to identify a possible DirectX 9.0c call that GPA does not like.

Bernard
Valued Contributor I

>>>Currently doing some tests with Intel Frame Analyser on the Haswell Ultrabook as my researches so far suggest that I am pretty GPU bound>>>

Yes, I agree with you. While analyzing the merge.etl file I saw a large build-up of DMA packets in the CPU queue; their average wait was ~19 ms. The GPU was constantly busy processing data without any downtime. Fortunately your engine was able to achieve ~60 fps (judging by the GPUView output). Have you tried testing your code on a more powerful GPU?

Bernard
Valued Contributor I

>>>I checked out Xperf but it was sensory overload for me to absorb. Way too many charts to choose from!  I know of it now, so next time I want to monitor overall CPU spikes I have a tool :)   >>>

Do not worry. If you want, simply upload the Xperf ETL capture and I will try to analyze it. I really like to dig deep into that mess of code :)

Bernard
Valued Contributor I

>>>  Do not know how to "localize Hotspot inside the NVD3DUM.DLL" at present>>>

Xperf analysis should probably be able to locate the NVD3DUM.DLL hotspot by tracking CPU load and the time spent executing that DLL. The problem here may be proper caller-callee identification, because of the lack of public symbols.

>>>Anyone know why GPA Frame Analyser would cause my standalone executable to produce a crash, failing inside module "shimd3d32.dll" where it runs fine when not using the Analyse Application feature? >>>

Did you use WinDbg to troubleshoot the crash? Is the crash related to an access violation exception?
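
Even without symbols, the samples that land inside the module can be grouped by address to see whether the time is concentrated in a few regions. The sketch below assumes the profiler can give you the sampled instruction addresses relative to the module base (RVAs); the input list and the function name are hypothetical, and no particular export format of Xperf or VTune is implied.

```python
from collections import Counter

def hot_regions(sample_rvas, bucket_size=0x1000, top=3):
    """Bucket sample addresses into fixed-size regions, hottest first."""
    counts = Counter(rva // bucket_size * bucket_size for rva in sample_rvas)
    return counts.most_common(top)

# Hypothetical sample addresses inside the module.
samples = [0x1010, 0x1FF0, 0x2004, 0x1ABC, 0x9000]
for start, hits in hot_regions(samples, top=1):
    print(hex(start), hits)  # prints 0x1000 3
```

A region that dominates the histogram can then be disassembled, or matched against the nearest exported name, to get at least a rough idea of what the driver is doing there.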

 

LeeBamber
Beginner

Turns out the Frame Analyser has something to do with the amount of remaining system memory. When I run an app with a small system memory footprint (less than 700MB) the analyser works fine. But if I try my larger app (1.2GB of system memory) the frame analyser seems to want to reserve a lot of its own system memory and as a result, the app fails to create new memory allocations at random points. Is there any way to make the frame analyser use LESS system memory for its operations? I only really want to study ONE frame of the application, so hopefully that should not require a large additional system memory overhead?  Any insights would be helpful!  For now I am going to use a much smaller game level which can FIT inside the system memory limitations of the frame analyser as that should at least give me a small scale version of the problems that might be plaguing the large scale version!

Bernard
Valued Contributor I

I cannot speak for Frame Analyzer itself because I have never used it. Regarding the GPU used in your tests, did you try a more powerful GPU with more SP units?

LeeBamber
Beginner

GPU Processing was not what caused it to crash, it was running out of SYSTEM memory when the frame analyser was actively monitoring the application.  I have run on two integrated systems (Ultrabook Haswell and HD4000 Desktop) and both show the same crash due to system memory shortfall.  Thanks for your help though!
