Community
cancel
Showing results for 
Search instead for 
Did you mean: 
McCalpinJohn
Black Belt
1,275 Views

Meaning of "Running", "CPU Time", and "Effective Time" in timelines

Stupid newby question?

I am trying to understand the meaning of some of the colors associated with the "Thread" timelines presented in the bottom section of the "Bottom-up" tab from an "HPC Performance Characterization" run on a large code.

(Tried and failed to upload screenshots, so reverting to words....)

The bottom part of the "Bottom-Up" tab has several timelines to choose among.   The top one is the most detailed, and defaults to "Thread".   Depending on the "Band Height" option, the key for the top timeline shows either three or five options.  When "Band Height" is "Super Tiny", there are three options under "Thread":

  • "Effective Time"
  • "Spin and Overhead Time"
  • "MPI Busy Wait Time"

The latter two categories are self-explanatory, but "Effective Time" is a bit vague.  It is described in the user guide as "CPU time spent in user code", I can't understand how this relates to the green to black color scale or to what I think the application is doing....  

If I switch the "Band Height" to "normal", the key changes to show five options:

  • "Running"
  • "CPU Time"
  • "Spin and Overhead Time"
  • "MPI Busy Wait Time"
  • "Clocktick Sample"

I can't figure out what it means to report that the thread is "Running" but not report that it is using "CPU Time".   This sounds like it must be related to the definition of "Effective Time" in the "Super Tiny" version, but I can't figure out what these terms and displays actually mean...

What am I missing in the definitions here?

0 Kudos
13 Replies
McCalpinJohn
Black Belt
1,256 Views

Zooming in to very small timescales, the option "Clocktick Sample" becomes effective, and I can look at specific values of the "CPU Time" for each sample.   It is clear that the samples are unaligned across threads, and that the samples are associated with very different interval lengths.  (The latter is not surprising if some of the sampling is done based on the overflow of hardware performance counters.)

It *looks like* the "CPU Time" reported is a percentage relative to 100% utilization for some nominal sample period?  (The specific values I see are consistent with a "nominal sample period" of 110 milliseconds.)

  • If the actual sample period is shorter than the nominal, the "CPU Time" is reported at greater than 100% -- but is shown in the strip chart as the 100% height brown bar.  
  • If the actual sample period is longer than the nominal, the actual CPU Time cannot must be under 100%, so only part of the bar is brown and the rest is filled in with green.

So if I am understanding this correctly, there is no useful information about performance in the split  between brown ("CPU Time") and green ("Running") in my use case.  (I know HyperThreading is disabled, I know the threads are bound and cannot move, and I know that the threads are going to be running during the entire period.).  The only information that the brown/green split provides is that more brown means more uniform sampling intervals, and more green means less uniform sampling intervals.  That seems like a waste of a color range. 

I still don't understand how this relates to the "Effective Time" color scale (green to black) in the "Super Tiny" timeline view -- is it just a mapping of the heights of the brown and green bars into a range of black to green colors?  

I hope there is something I am missing?

 

Dmitry_R_Intel1
Employee
1,230 Views

- "Running" is just a bar from thread start to thread end. There is nothing more in it. 

- "CPU Time" is calculated from samples , e.g. for PMU event based collection the formula is CPU_CLK_UNHALTED.REF_TSC / {reference frequency} - this gives us time in seconds which you can see in grid or summary views. When CPU Time is shown in timeline it is calculated for each time bin (time bin correspond to pixel in GUI) and divided by bin time length - this gives us percentage. For threads this percentage value should be in range from 0 to 100%, the only case when it can exceed 100% is threads migration - e.g. a thread may be switched out from a logical processor just before counter overflow and whatever it accumulates will be assigned to another thread which comes to this logical processor.  If threads are pinned to logical cores you should not see such effects.

- "Effective Time" is the same CPU Time but with certain samples filtered out. The samples that we filter out here are the ones happening in code we know relates to spinning, thread scheduling and such. E.g. if sample happened inside function  matching <regex>\b_?_kmpc_barrier</regex> we know this busy wait on barrier inside OMP runtime and filter-out it. We have a lot of such patterns covering OS API and threading runtimes.

 

McCalpinJohn
Black Belt
1,219 Views

Thanks!  That description makes sense -- now I need to understand why the results I have been looking at are so completely inconsistent.....

In the attached screenshot, the majority of the reported "CPU Time" is in narrow (10 millisecond) bands, each reporting "CPU Time" as numbers in the range of 1095%.   This sets the vertical scale at approximately 1100%, so intervals with "CPU Time" of 100% show as 90% green.

These high values occur with "CPU Time", "Spin and Overhead Time", and "MPI Busy Wait Time".  In the attached, there is an interval of high "MPI Busy Wait Time" starting at about 287450 ms.  Package 0, Core 2 show 1093.5% for each of those three metrics, and the aggregated "CPU Time" strip chart below shows 1184.7% "MPI Busy Wait Time" -- consistent with Package 0, Core 0 at 91.2% plus Package 0, Core 2 at 1093.5% (plus zero contribution from other packages and cores).

These values were collected on a single node 2s Xeon Platinum 8280 system running CentOS 7.8, with Amplifier 2019 Update 6 (build 602217), and visualized on a Mac running a local viewer (amplxe-2019.8.0.604197).  The application (version 4.2 of the WRF weather model) was run with 28 MPI tasks and 2 OpenMP threads/task -- but results are qualitatively similar with other task/thread splits.

During collection, amplxe-cl was given a data-limit of 10,000MB, but exceeded that limit near the end of the run.  After finalization on the compute node, the data set size that I downloaded to my Mac was 16GiB.   I have not been able fully bridge the timestamps from the application with the timeline reported by VTune, but it looks like the VTune timelines end after time step 111 of 120 executed.  Could running into the raw data limit cause trouble?

Dmitry_R_Intel1
Employee
1,215 Views

With default sample interval setting VTune sets Sample After value for clockticks events so that sample should happen each 5ms on fully loaded core. This means that distance between sample points (green marks) can not be less than 5ms on a logical core level (and if you have HT disabled on physical core level as well). In the screenshot the distance is smaller sometimes and this means we have a problem. Most likely it is due to threads migration so this is the first thing I would check - make sure your have correct affinity settings and your processes/threads don't really migrate. One way to do this is to use grouping like 'Thread / Physical Core' in the grid.

Yes data limit is probably the reason for incomplete data. You can turn it off completely by specifying 0 value.

McCalpinJohn
Black Belt
1,208 Views

The screenshot I posted came from a case that appears to have correct binding.   With I_MPI_DEBUG=5, there are exactly 56 messages binding threads 0 and 1 from each of the 28 MPI tasks to a unique "OS proc set" consisting of one core.

Switching the VTune display to "Thread / H/W Context" shows the 28 master threads bound to two cores, but only executing on the lower-numbered one, while the 28 worker threads are each bound to the upper-numbered core of the process's pair.

Switching to the "H/W Context / Thread" display shows most of the cores with only one process, and 13 with two TIDs.  In each case when there are two TIDs, the second one is the master thread for the active worker thread on the core, and the master thread shows zero execution time.

These results were collected with "-collect Hpc-performance" -- I will try some other collections to see if the results make more sense....

McCalpinJohn
Black Belt
1,168 Views

No fix with switching "-collect hpc-performance" to "-collect uarch-exploration" -- the "CPU Time" values don't get as high as 1100%, but they still show values as high as 600%.  

The reason I think this is a problem is that it exaggerates the magnitude of the signal in the aggregated CPU Time plot.  In my earlier example, there was a clear section of the CPU Time plot showing 1200% MPI "Busy Wait Time" on a single 56-core node.   This seems to imply that 12 cores were busy-waiting, which would be a very serious performance issue.  But what is really (?) happening is that only 2 cores are busy-waiting, with one reporting 100% utilization and the other reporting 1100% utilization.  (Presumably the latter is actually waiting 100% of the time, but for a period that is 11x longer than the interval to which it was attributed).   

It is certainly possible that this is a "user error", or perhaps a misconfiguration of the sep driver on the system, or ????, but it is turning into a very negative experience....

Dmitry_R_Intel1
Employee
1,152 Views

Could you please try collecting in driver-less mode and see if problem still exist? For this just remove VTune driver and make sure /proc/sys/kernel/perf_event_paranoid has value of 1 or less.

if you are using driver-less mode already - do vice-versa by installing Vtune drivers.

McCalpinJohn
Black Belt
1,135 Views

I will have to get one of the system folks to unload the VTune driver for me....

In the meantime, I decided to switch to an application that I understand much better: "STREAM".

Using the same methodology (same scripts, but running an MPI version of STREAM instead of running WRF), the VTune results don't show any of the anomalies that I see in the WRF results.

With the STREAM benchmark and "-collect hpc-performance", the "CPU Time" in the thread (or core) timeline view is almost always very close 100%, and the samples are almost always very close to 5 milliseconds apart.  Every once in a while a sample appears to have been dropped, and those show up as 50% "CPU Time" and 50% "Running".

With the STREAM benchmark and "-collect threading", the "CPU Time" in the thread (or core) timeline view is almost always very close to 100%, and the samples are almost always 10 milliseconds apart.  Every once in a while some samples are delayed by a few milliseconds, and these show up as (for example) 70% "CPU Time" and 30% "Running" for a 14 millisecond interval.   Unlike what I see in WRF, the "CPU Time" is never reported as over 100%, even if the neighboring sample points have been delayed.   I have to say that I really like the OpenMP Barrier display on the timeline view -- for my codes that makes it really easy to see how to line up the cores and to identify where stalls due to load imbalance are occurring....  It looks like I can use API calls to mark spots in the timeline that I think are important....

With the STREAM benchmark and "-collect hotspots", the results look very similar to those that I get with "-collect threading".  In both the "threading" and "hotspots" views, when the timeline shows OpenMP "Spin and Overhead Time", it always shows 100% for the duration of the sample.   Is that just the way time is attributed?   (With OpenMP, more precise spinning time could be measured by the OpenMP runtime library and communicated to VTune -- I guess I was incorrectly assuming it would work that way.  I have a manually instrumented version of the code that gives much more precise attribution of spin time, so I should compare the two approaches....)

So this is a mixed update -- VTune is performing as expected on a code for which I already know the answers, but not on the (1000x larger and more complex) code that I am trying to understand....

Dmitry_R_Intel1
Employee
1,126 Views

It looks like some bug on our collector side. This means switching to driver-less mode should help since it uses completely different event collection mechanism.

Is it possible to setup this WRF workload on our side? Can you provide some instructions?

McCalpinJohn
Black Belt
1,121 Views

I figured out how to unload (and reload) the sep/socperf/vtsspp/pax drivers....

To reduce the file sizes, the collection command was "-collect hpc-performance -knob sampling-interval=100".

With the sep drivers, the results remained very bad, but now that I am starting to learn what this is supposed to look like, I am starting to see some patterns.  

Starting with the VTune "bottom-up" view of the full run, with grouping by "Thread" and "Band Height" of "normal":

  • "WRF_120steps_sep_normal_band_height.png" shows the results using the sep driver.
  • "WRF_120steps_driverless_normal_band_height.png" shows the results with driverless collection.
  • The driverless case looks like one would expect for a properly bound HPC job running on all cores.
  • With sep, the large amount of "green" obscures the overall pattern.
  • Looking at the aggregated "CPU Time" timeline below the "Thread" timeline, we would expect full utilization to correspond to all threads being active (brown).  This is mostly the case with the driverless collection, and is very definitely not the case for the sep collection.

Zooming in enough to see the sample collection markers, I picked the first two time steps of the second group of 40 steps -- starting at about 695 seconds with the sep collector and about 735 seconds with the driverless collector.

  • "WRF_2steps_sep.png" is clearly a mess, with sample intervals varying between the requested 100 milliseconds and 15000 milliseconds.
    • Curiously, the intervals close to the requested 100 milliseconds all report about 200% "CPU Time".  This corresponds to most of the full-height "brown" samples.
    • Also curious, the vertical height scaling of the "CPU Time" in the "Thread" bars remains close to 100% -- so the 200% bars are truncated and the few cases reporting 100% are just a tiny bit shorter.  This was not the case in some earlier tests that reported 1100% "CPU Time".
  • "WRF_2steps_driverless.png" is not perfect, but clearly displays a much more uniform sampling.
    • Intervals close to the requested 100 milliseconds report about 100% "CPU Time".
    • The longest intervals for active threads are about 200 milliseconds, reporting about 50% "CPU Time". These are more common than I would prefer, but they don't obscure the overall pattern.
    • There are a few short intervals for active threads -- e.g., 40 milliseconds, reported >200% "CPU Time".  The vertical height scaling is consistent with 100% "CPU Time".
    • Note that the wider sample intervals in the "green" area between time steps are due to threads that have gone idle after 200 milliseconds of spinning, so these wide intervals are not a concern.  (The idle cores are a concern, but that is a performance problem with the application, not a VTune problem!)

One reason that I have been forced to deal with such large VTune data collections is that I have not been able to get the "-start-paused -resume-after=300" options to work (with sep).  MPI ranks other than rank zero always crash right when the sampling is resuming....  Guess I ought to try that with driverless collection as well....

CoreDump
Beginner
211 Views

I'm not following.

Could you interpret the attached picture for me?vtune windowvtune window

ArunJ_Intel
Moderator
1,236 Views

Hi McCalpinJohn,


We are forwarding your case to SME. They will soon get back to you on this.



Thanks

Arun Jose


IntelSupport
Community Manager
758 Views

This issue has been resolved and support will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only


Reply