srimks
New Contributor II

CPU Utilization & Events Analysis.

Hi.

I used VTune (v9.1) to run event-based sampling (EBS) on a large exe (1180972 in size). The "Activity Summary by Core" shows the information below:

CPU ID Idle time
0 0.09%
1 68.92%
2 72.51%
3 88.89%
4 70.33%
5 33.56%
6 66.47%
7 35.92%

The processor is a Quad-Core Intel Xeon Processor 5300 Series, and the architecture is IA-32.

I have the following queries:
(a) Why do I get a different CPU utilization on each of the 8 cores?

(b) How do I interpret the information for each of the events below?
Event                            Scale    SAV         Total Samples   Duration
RS_UOPS_DISPATCHED.CYCLES_NONE   1e-10    2,000,000   75139           10.397
CPU_CLK_UNHALTED.CORE            1e-09    2,666,000   11673           10.343
MEM_LOAD_RETIRED.L2_LINE_MISS    0.0001   100,000     8               10.327
MEM_LOAD_RETIRED.L1D_LINE_MISS   1e-06    100,000     973             10.269
MEM_LOAD_RETIRED.DTLB_MISS       1e-05    10,000      189             10.372
DTLB_MISSES_ANY                  1e-06    10,000      2133            10.372
BR_INST_RETIRED.ANY              1e-09    1,000,000   5987            10.343
INST_RETIRED                     1e-09    1,000,000   12837           10.343

Please suggest what the optimal SAV (sample-after value) would be for each event, so that I can get better performance with the above values.
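For reference, a sampled table like the one in (b) is usually read by multiplying each sample count by its SAV to estimate how often the event occurred. The sketch below shows that arithmetic and the cycles-per-instruction ratio that follows from it. The function names are made up for illustration and are not part of VTune.

```c
#include <assert.h>

/* Estimated number of occurrences of a sampled event:
   each sample stands for roughly `sav` occurrences. */
long long estimated_events(long long samples, long long sav)
{
    assert(samples >= 0 && sav > 0);
    return samples * sav;
}

/* Cycles per instruction (CPI) from two estimated totals, e.g.
   CPU_CLK_UNHALTED.CORE divided by INST_RETIRED. */
double cycles_per_instruction(long long cycles, long long instructions)
{
    assert(cycles >= 0 && instructions > 0);
    return (double)cycles / (double)instructions;
}
```

With the values above, CPU_CLK_UNHALTED.CORE gives 11673 x 2,666,000, about 3.11e10 cycles, and INST_RETIRED gives 12837 x 1,000,000, about 1.28e10 instructions, so the CPI is roughly 2.42. These are statistical estimates: an event with only 8 samples, such as MEM_LOAD_RETIRED.L2_LINE_MISS here, has too few samples to say much, and a lower SAV for that event would collect more samples at the cost of more sampling overhead.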

(c) My code makes heavy use of the double datatype. Which event gives the component of stalls associated with FLOAT/DOUBLE operations, such as assists for denormals?

(d) What do the different values for Ring 0 and Ring 3 mean?

(e) I see high BUS UTILIZATION when the compiler does auto-vectorization. How do I minimize this, along with CPI?

Note: I have already auto-vectorized the code using pragmas.

~BR
6 Replies
TimP
Black Belt

You don't give any information to begin to answer most of your questions.
With regard to (e), auto-vectorization can be expected to be associated with high bus utilization and higher CPI than equivalent non-vector code. This points up the fact that those measurements don't correlate with the efficiency of your code. It's almost in the same category with the idea of avoiding vectorization in order to inflate your threaded parallel scaling ratings. The most efficient way to lower CPI is to add useless instructions (the opposite approach to vectorization).
As to (c), if you turned on gradual underflow (e.g. by a compilation option such as /fp:source), you might get more information by comparing it with a run where you set abrupt underflow (e.g. by putting /Qftz after /fp:source for your main(), or by executing the SSE intrinsic).
Shooting in the dark on (a), does it make a difference when you use a good affinity mapping? Presumably, you are using some kind of multi-threading, but I can't imagine how you expect us to guess what it is.
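A minimal sketch of the "SSE intrinsic" route for (c), assuming an x86 target where double arithmetic is done in SSE registers (the default on x86-64). `_MM_SET_FLUSH_ZERO_MODE` is the standard macro from `<xmmintrin.h>`; the two helper functions are hypothetical names used only to show the effect.

```c
#include <assert.h>
#include <float.h>
#include <xmmintrin.h>  /* _MM_SET_FLUSH_ZERO_MODE (x86 SSE only) */

/* Switch flush-to-zero on (abrupt underflow) or off (gradual underflow).
   The setting lives in the MXCSR register of the calling thread only. */
void set_flush_to_zero(int enable)
{
    assert(enable == 0 || enable == 1);
    _MM_SET_FLUSH_ZERO_MODE(enable ? _MM_FLUSH_ZERO_ON : _MM_FLUSH_ZERO_OFF);
}

/* Halve the smallest normal double; the exact result is a denormal.
   volatile keeps the compiler from folding the division at build time. */
double half_of_smallest_normal(void)
{
    volatile double x = DBL_MIN;
    volatile double r = x / 2.0;
    return r;
}
```

Calling `set_flush_to_zero(1)` at the start of each worker thread and timing the run against one with gradual underflow gives the comparison described above: a large difference suggests denormal assists are costing time.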
srimks
New Contributor II

Hi All.

Can anyone help further with queries (a) - (d) above? It's very urgent.

~BR
Thomas_W_Intel
Employee

BR,

The numbers in (a) suggest that you have a threading issue and you are leaving a lot of CPU time on the table. The best way to get some more insights is to use the "Thread Profiler" and/or the "Sampling Over Time View" in VTune. They should help you to see how and when the threads are working.

Kind regards
Thomas
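To put a number on "CPU time left on the table", the per-core idle percentages from (a) can be averaged. This is only a small sketch of that arithmetic; the function name is hypothetical and nothing here comes from VTune itself.

```c
#include <assert.h>
#include <stddef.h>

/* Mean utilization in percent, given the idle percentage of each core. */
double mean_utilization(const double *idle_pct, size_t n)
{
    assert(idle_pct != NULL && n > 0);
    double idle_sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        idle_sum += idle_pct[i];
    return 100.0 - idle_sum / (double)n;
}
```

For the eight idle values posted in (a) this comes to about 45.4% average utilization. Core 0 is almost never idle (0.09%) while most other cores idle for two thirds of the run or more, which is consistent with one thread carrying most of the work.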
srimks
New Contributor II

Hi Thomas.

Got the clue for (a) above, thanks.

Out of curiosity, I wanted to ask: is this a load-balancing problem (load balancing being the equal division of work among threads), where the work is not being divided evenly across all 8 cores of this system (two Quad-Core Xeon 5300 processors, 4 cores in each package)?

Do you suggest using OpenMP pragmas in that section of code to obtain efficient load balancing?

~BR
Mukkaysh
TimP
Black Belt

If you have OpenMP load balancing issues, you should see them in the summary you get when you link with -openmp_profile and run your tests. You might find it interesting to run those tests with various settings of KMP_AFFINITY or GOMP_CPU_AFFINITY. After that, the simplest solution, and often the appropriate one, is schedule(guided).
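As a rough illustration of schedule(guided), here is a sketch of a loop whose iterations get progressively more expensive, which is the kind of case where a plain static split leaves some threads idle. The workload and function name are invented for the example and are not taken from the original code.

```c
#include <assert.h>

/* Hypothetical uneven workload: iteration i costs about i inner steps,
   so late iterations are far more expensive than early ones. */
long long triangular_work(int n)
{
    assert(n >= 0);
    long long total = 0;

    /* guided: threads take large chunks first, then progressively smaller
       ones, which evens out the tail of the loop. If OpenMP is not enabled
       at compile time the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for schedule(guided) reduction(+:total)
    for (int i = 0; i < n; ++i) {
        long long local = 0;
        for (int j = 0; j < i; ++j)
            local += 1;  /* stand-in for real per-item work */
        total += local;
    }
    return total;
}
```

The result is the same with any schedule; only the distribution of iterations across threads changes. Build with the compiler's OpenMP option (for example -fopenmp with GCC) and compare timings for schedule(static), schedule(dynamic) and schedule(guided) under different affinity settings.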
srimks
New Contributor II

The original code currently doesn't contain any OpenMP constructs.

I have not worked before on query (a), on load balancing across all the cores of an SMP system, or on OpenMP, but while going through Thomas W's response I thought to ask him: is this a load-balancing problem (the equal division of work among threads)?

What would be the appropriate approach to resolve query (a), as asked earlier?

~BR