Intel® Graphics Performance Analyzers (Intel® GPA)
Improve your game's performance by quickly identifying problem areas

Need help interpreting GPA Platform Analyzer results

steve_ee
Beginner
832 Views
I recently started using Intel GPA Platform Analyzer to see the relationship between the CPU and GPU tasks in my application.
I am running this test on Windows 7 with a non-Intel graphics card on an Intel Xeon CPU.
Under the DX CPU and DX GPU tracks, I find that the CPU frames are far ahead of the GPU frames, by almost 107 frames.
That seems abnormal. Any ideas, or is it because I'm using a non-Intel graphics card?
7 Replies
Neal_Pierman
Valued Contributor I
Hello,

Does this happen for all apps, or just this one app/game in particular? Try something like gpasample.exe (you can find this in the folder where the rest of the Intel GPA binaries are installed) to see whether you get similar results.

Also, if you have access to another GPU, try running your app on it as well, and post the results of both experiments here.

Also, it would help to have you right-click on the Intel GPA Monitor icon in the notification tray and post the "About..." info here as well.

Regards,

Neal

steve_ee
Beginner

1. OK, so I ran the gpasample.exe test on the same machine.

The machine has the following Intel GPA details:

Windows 7, 64-bit DEP enabled

Num Processors: 4

Memory: 12271MB

System BIOS: Hewlett-Packard 786G3 v03.15 (10/29/2010)

Video BIOS: Version 70.04.2E.00.A0 (11/23/10)

Driver 0:

Device: NVIDIA GeForce GTX 460 SE

Provider: NVIDIA

Date: 8-3-2011

Version: 8.17.12.8026

VendorId: 10de

ProductId: e23

Stepping: a1

No support for GPA Instrumentation

GPA install directory: C:\Program Files\Intel\GPA\2012 R1\

GPA version: 12.1.166792

Current user is in Administrators group: YES

Current GPA 2012 R1 (12.1.166792)

The DX CPU and DX GPU tracks seem OK, with the CPU roughly 2 frames ahead of the GPU.

2. Then I ran my original app on a computer that has integrated Intel graphics.

The machine has the following Intel GPA details:

Windows 7, 64-bit DEP enabled

Num Processors: 8

Memory: 3982MB

System BIOS: American Megatrends Inc. ASNBCPT1.86C.0045.P00.1010071143 (10/07/2010)

Video BIOS: Hardware Version 0.0 (10/03/20)

Driver 0:

Device: Intel HD Graphics Family

Provider: Intel Corporation

Date: 12-15-2011

Version: 8.15.10.2598

VendorId: 8086

ProductId: 126 (Intel HD Graphics 3000)

Stepping: 8

Supports GPA Instrumentation

GPA install directory: C:\Program Files\Intel\GPA\2012 R1\

GPA version: 12.1.166792

Current user is in Administrators group: NO

Current GPA 2012 R1 (12.1.166792)

The DX CPU and DX GPU tracks seem OK, with the CPU roughly 2 frames ahead of the GPU again.

So the question is: why does my app's test result look weird on the first computer, where the CPU is around 107 frames ahead? Is there some problem with the way I did the instrumentation? There shouldn't be, since the logical tracks (DX CPU and DX GPU) do not depend on the code instrumentation, correct?

Additional questions

On a side note, I have some additional questions as well.

1. I am trying to find out whether my app is GPU or CPU bound on an NVIDIA graphics card. Currently there is no clear way to do that, since the metrics do not include the GPU busy stats and the "NULL hardware" override is missing on machines running a non-Intel graphics card. All I can do is disable draw calls, which does increase the FPS by a lot, but that does not tell me whether the bottleneck is in the GPU or the CPU (driver). I can also see from the Task Manager that my CPU is not fully utilised. Is there any other way to do it?

2. What is the difference between using "__itt_frame_begin_v3" and "__itt_task_begin" to instrument your code? I know that tasks can appear in different threads, but what about frames? Where do they appear? Also, can you give me examples where I would prefer one over the other?

3. The last question is about relations. We can add tasks to a task group. We can also add a task to a parent task. What is the main difference between them? I understand that task groups can be useful if we want to find the total cost of related tasks. But what do we usually use a parent task for?

Thanks for taking the time to reply to my queries!


Neal_Pierman
Valued Contributor I
Hello,

I'm doing some checking on the original question/issue, and hope to get back to you soon on this.

As to your other questions:

  1. Am I CPU or GPU bound?
    You are correct that some options available on Intel Processor Graphics are not available with NVIDIA hardware; we don't have access to some overrides on that hardware platform. With only "disable draw calls" showing a large improvement, you are correct that you can't say for sure which piece of the system is the bottleneck. If your code treats both Intel HD Graphics and NVIDIA the same, you could take the game over to the Intel HD Graphics platform and run "null hardware" tests to help nail this one down (since you're just trying to see the impact of an infinitely fast GPU, it doesn't really matter what GPU you use). If you're using DX11, you might want to find an Ivy Bridge system: Ivy Bridge supports DX11 in the GPU, so the code path should be similar. Try this and see if it helps.
  2. "frame begin" vs. "task begin"?
    The advantage of frames is that they will appear in a separate track in the Intel GPA Platform Analyzer application. See this link for an example that shows this feature. Let me know if this isn't clear, and I can gather some more info for you.
  3. task groups vs. parent tasks
    I believe that the two methods allow you to see the tasks in different ways within Intel GPA Platform Analyzer. In other words, it's a logical grouping that you're creating, and in some cases you may want to see things collected in one way versus another -- we give you the flexibility of seeing this information in various ways within the tool. Also, have you seen the "relations" sample? This may provide some additional guidance on the topic. Again, if I've not fully answered your question, please let me know and I'll find some other info that might help.
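
To make the frame/task distinction more concrete, here is a minimal sketch, not a definitive recipe. It assumes the ITT header `ittnotify.h` that ships with Intel GPA is on the include path (and its library is linked); when the header is not found, the markers compile to no-ops. The domain name, the task names, and the `run_frames` helper are hypothetical. One frame region wraps each iteration of the render loop, while tasks mark individual units of work on whichever thread executes them.

```cpp
#include <cassert>

// Detect the ITT header; fall back to no-op markers when it is absent.
#if defined(__has_include)
#  if __has_include(<ittnotify.h>)
#    include <ittnotify.h>
#    define HAVE_ITT 1
#  endif
#endif

#ifdef HAVE_ITT
// Hypothetical names; pick ones that are meaningful for your app.
static __itt_domain* g_domain = __itt_domain_create("MyApp.Render");
static __itt_string_handle* g_update = __itt_string_handle_create("UpdateScene");
static __itt_string_handle* g_submit = __itt_string_handle_create("SubmitDrawCalls");
#  define FRAME_BEGIN()    __itt_frame_begin_v3(g_domain, nullptr)
#  define FRAME_END()      __itt_frame_end_v3(g_domain, nullptr)
#  define TASK_BEGIN(name) __itt_task_begin(g_domain, __itt_null, __itt_null, (name))
#  define TASK_END()       __itt_task_end(g_domain)
#else
#  define FRAME_BEGIN()    ((void)0)
#  define FRAME_END()      ((void)0)
#  define TASK_BEGIN(name) ((void)0)
#  define TASK_END()       ((void)0)
#endif

static void update_scene() {}       // stand-in for real CPU work
static void submit_draw_calls() {}  // stand-in for real draw-call submission

// Runs `frame_count` loop iterations and returns the number of tasks executed.
int run_frames(int frame_count) {
    assert(frame_count >= 0);
    int tasks = 0;
    for (int i = 0; i < frame_count; ++i) {
        FRAME_BEGIN();            // one frame region per loop iteration

        TASK_BEGIN(g_update);     // task: recorded on the calling thread
        update_scene();
        TASK_END();
        ++tasks;

        TASK_BEGIN(g_submit);
        submit_draw_calls();
        TASK_END();
        ++tasks;

        FRAME_END();
    }
    return tasks;
}
```

In this sketch both tasks pass `__itt_null` for the task id and the parent id. As far as I understand the API, passing a real id as the third argument of `__itt_task_begin` is how a task is attached to a parent task.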

So I still owe you an answer on your first question -- I'll let you know what I find out.

Regards,

Neal

Neal_Pierman
Valued Contributor I
Hello,

Regarding your original question about why there is a delay of as many as 100 buffers between processing on the CPU and the GPU, one of the developers has suggested that you run the BasicHLSL sample from the DX SDK on both platforms. He indicated that this may be better than gpasample for tracking down this issue.

Also, how "big" is each of the buffers? Are you sending a single triangle in each, or many more primitives?

Thanks!

Neal
Neal_Pierman
Valued Contributor I
Hello,

The reason I'm asking about the buffer size is that if you are putting just a small amount of data into each buffer, the NVIDIA driver may be holding onto that data until it needs to process it. In other words, the NVIDIA internal buffer may be much larger than what you see with Intel HD Graphics, which may explain why you are so far "behind" on processing the GPU buffers.

You can force the buffer to flush (and be processed) by introducing locks or by other techniques, but it's not clear whether this improves your overall throughput or not. In other words, if the delay doesn't cause an issue, then don't worry about it.
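
As a rough illustration of the buffering hypothesis above, here is a toy model. It does not describe how any real driver works; the function name, parameters, and flush rule are all assumptions made for the sketch. The "driver" accumulates each frame's submitted bytes and hands everything to the GPU only once the pending amount reaches its internal buffer capacity, so small per-frame submissions let the CPU run many frames ahead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Toy model (hypothetical): returns the largest number of frames that are
// queued in the driver at any one time before being handed to the GPU.
std::size_t max_frames_pending(std::size_t frames,
                               std::size_t bytes_per_frame,
                               std::size_t driver_buffer_bytes) {
    assert(bytes_per_frame > 0);
    std::size_t pending_bytes = 0;
    std::size_t pending_frames = 0;
    std::size_t worst = 0;
    for (std::size_t f = 0; f < frames; ++f) {
        pending_bytes += bytes_per_frame;   // CPU submits one frame of data
        ++pending_frames;
        worst = std::max(worst, pending_frames);
        if (pending_bytes >= driver_buffer_bytes) {
            // Buffer is full: the driver flushes everything to the GPU.
            pending_bytes = 0;
            pending_frames = 0;
        }
    }
    return worst;
}
```

With a capacity of 100 bytes, 10-byte frames queue up 10 deep, while 200-byte frames are flushed one at a time. The same idea, at a larger scale, is one possible reason for a large CPU lead on one platform and a small one on another.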

Hope this helps.

Regards,

Neal
steve_ee
Beginner
Hi,

Well, I could run the SDK HLSL example on the test machines and get back to you on this. I believe I am pushing quite a lot of data.

Actually, the main reason I am asking about the CPU frames being 100+ frames ahead of the GPU is to find out whether this application is GPU or CPU bound on my machine.

Interestingly, when I run PIX on this application on the same machine, it is able to display the GPU and CPU timing graph (just choose to collect statistics for each frame). The two seem to be in sync, and the app doesn't seem to be GPU bound.

I believe PIX is accessing the hardware counters on the graphics card. So how is Intel GPA Platform Analyzer different from PIX when collecting CPU/GPU timing information?
Neal_Pierman
Valued Contributor I
Hello,

As you indicated you're really trying to find out whether you are CPU or GPU bound with your app on a specific NVIDIA system.

Unfortunately Intel GPA Platform Analyzer won't necessarily provide this information to you -- this specific tool is usually most useful when you already know that you have a CPU-bound app, and are trying to visually understand the relationships between different tasks running on the CPU/GPU. Usually Intel GPA System Analyzer is the tool that can provide insight into the CPU vs. GPU issue.

Also, without access to your source code, it will be hard to figure out why Intel GPA shows the NVIDIA GPU processing "old" buffers, especially since the buffers seem to be synchronized when using Intel HD Graphics. Similarly, I can't guess why PIX and Intel GPA would show different results without access to your code. As I suggested, running the HLSL example might provide some hints as to whether being 100 buffers behind is specific to your app, or whether this is just the way things work on the NVIDIA platform.

So back to your original question... One of the experiments you already tried is to use the "disable draw calls" override on NVIDIA; the results showed that you possibly have a bottleneck in the driver and/or the GPU. You really wanted to use "null hardware", but this is not available on your specific graphics device (it is available on devices such as Intel HD Graphics). Did you try other overrides, such as "simple pixel shader" or "2x2 textures"? Also, you can create a frame capture file and look at the frame in detail to see if there are specific regions or individual ergs that are bottlenecks. In other words, assume that you are GPU bound, and dig deep into a frame to see whether this hypothesis is correct. You might find something that is an obvious issue on your NVIDIA platform for that frame, and/or try using overrides for some more experiments. Also, you mentioned that you have access to an Intel HD Graphics device. Did you try to run "null hardware" on this platform to see if it provides any useful info? I know it's a different graphics device, but when using "null hardware" you won't really be using the device at all, so it shouldn't matter. This experiment should provide some useful info, especially if the CPU specs are similar.
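
The reasoning behind these override experiments can be written down as a small decision helper. This is only a sketch under stated assumptions: the `classify` function, the `Bottleneck` names, and the 1.2x speedup threshold are hypothetical and not part of Intel GPA. You would feed it FPS numbers read by hand from the tool, and the outcome is a hint about where to look next rather than a verdict.

```cpp
#include <cassert>

enum class Bottleneck {
    ApplicationCpu,  // disabling draw calls barely helps
    DriverOrGpu,     // draw calls matter, but "null hardware" was not measured
    DriverCpu,       // draw calls matter, yet an "infinitely fast" GPU does not help
    Gpu              // an "infinitely fast" GPU gives a clear speedup
};

// All FPS values are measured by hand with the corresponding override enabled.
// Pass null_hardware_fps <= 0 when that override is unavailable on the device.
// min_speedup is an assumed threshold for calling an improvement "large".
Bottleneck classify(double baseline_fps,
                    double disable_draw_calls_fps,
                    double null_hardware_fps,
                    double min_speedup = 1.2) {
    assert(baseline_fps > 0.0);
    const double target = baseline_fps * min_speedup;
    if (disable_draw_calls_fps < target) {
        return Bottleneck::ApplicationCpu;
    }
    if (null_hardware_fps <= 0.0) {
        return Bottleneck::DriverOrGpu;  // cannot separate the two further
    }
    return (null_hardware_fps >= target) ? Bottleneck::Gpu
                                         : Bottleneck::DriverCpu;
}
```

For the situation described in this thread (a large gain from "disable draw calls", no "null hardware" on the NVIDIA card), the helper returns `DriverOrGpu`, which is why repeating the run on the Intel HD Graphics machine with "null hardware" is the natural next experiment.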

Hope this helps.

Regards,

Neal