Analyzers
Talk to fellow users of Intel Analyzer tools (Intel VTune™ Profiler, Intel Advisor)

VTUNE ITT counter strange behavior

CFR
New Contributor II
1,457 Views

Lenovo T570, Ubuntu 18.04.4, Intel(R) oneAPI VTune(TM) Profiler 2021.1.0 beta06 (build 608503) Command Line Tool

Given the following toy code:

#include <unistd.h>
#include <ittnotify.h>
int 
main(int argc, char *argv[])
{
  auto itt_d = __itt_domain_create("Domain");
  auto itt_str1 = __itt_string_handle_create("Task 1");
  unsigned __int64 counter = 5;
  auto itt_counter = __itt_counter_create("acounter", "cDomain");
  __itt_task_begin(itt_d, __itt_null, __itt_null, itt_str1);
  sleep(1);
  __itt_counter_set_value(itt_counter, &counter);
  sleep(1);
  __itt_counter_inc(itt_counter);
  sleep(1);
  __itt_counter_inc(itt_counter);
  sleep(1);
  __itt_counter_dec(itt_counter);
  sleep(1);
  __itt_counter_destroy(itt_counter);
  __itt_task_end(itt_d);
}

Compiled as:

dpcpp -g -o itt itt.cpp -I/local/opt/intel/inteloneapi/vtune/2021.1-beta06/include/pkgconfig/lib64/../../ -L/local/opt/intel/inteloneapi/vtune/2021.1-beta06/include/pkgconfig/lib64/../../../lib64 -littnotify -ldl   -lOpenCL -lsycl

I then run itt in vtune-gui and get the attached (I don't know how to put it inline).

The issue is that, based on the code, I would expect the counter to be undefined/default from 0..1 seconds, 5 from 1..2, 6 from 2..3, 7, from 3..4 and 6 from 4..5 seconds.  That's not what shows up.  Instead the five regions show the counter to be 0, 6, 7, 6, 0.

There's not a lot of examples or discussion in the VTUNE manual concerning use of the ITT API, so what am I missing?

Extra credit: Eventually I'm going to want to do this :^). Is there a way to instrument SYCL kernels with the ITT API?

0 Kudos
6 Replies
AthiraM_Intel
Moderator
1,457 Views

Hi,

We are forwarding your case for SME to check

Thanks

0 Kudos
Kirill_U_Intel
Employee
1,457 Views

Hi, Craig.

Could you give more info what would you like to instrument? just kernels or something inside them?

VTune instruments SYCL kernels. Could you look at this cookbook example

https://software.intel.com/content/www/us/en/develop/documentation/vtune-cookbook/top/configuration-recipes/profiling-dpc-application.html ?

Kirill

0 Kudos
Kirill_U_Intel
Employee
1,457 Views

about counter issue. I've reproduced issue on our side. We are going to investigate that.

Sorry for inconvenience, Kirill

0 Kudos
CFR
New Contributor II
1,457 Views

Generally I'm interested in any VTune capabilities to see what's going on in my SYCL kernels.  I know that VTune is supposed to let me do some things.  I've been able to get the overview/hotspot type information, but looking inside a kernel still eludes me.  I think I need to rebuild the i915 module but I've had problems doing that: see https://software.intel.com/en-us/forums/vtune/topic/856805.  I'm trying the cookbook, but not seeing the same thing that's in the documentation so I may be doing something wrong.

My specific question here was whether I could use ITT calls in a SYCL kernel.  (I guessing there are problems with how kernels execute and when exactly individual items run etc... but I thought I'd ask anyway.)  The kind of thing I most want to do is insert events (either single points or ranges) so that I can observe how long parts of of my kernel are taking.  It's not as important, but I can see using counters might be handy as well.  I'm not sure I really need all the more esoteric aspects of ITT though.

As for reproducing the counter issue, great.  Glad to help make VTune better.

0 Kudos
Alexander_T_Intel1
1,457 Views

Hi Craig!

Have you already tried Source Analysis in GPU Compute/Media Hotspots? You should pick "basic blocks latency" profiling mode to see the execution time of all basic blocks. To do this you just need to open Graphics tab, then click twice on the kernel you are interested in and drill-down to the source-assembly view. Please try this and tell me if smth is still unclear.

0 Kudos
CFR
New Contributor II
1,457 Views

This response is based on some _OTHER_ toy code not the example at the beginning of the thread.

If I take a code, start up VTUNE select the builtin "GPU Compute/Media Hotspots", Then I customize and only change "GPU profiling mode" to "basic blocks latency" I get the warning (and nothing in the output that looks like kernel/GPU code):

     [Instrumentation Engine]: GTPin: GTPin ERROR: apiTracingEnabled

     [Instrumentation Engine]: GTPin: GTPin didn't find any kernels... Exiting without doing anything.

It I uncheck "Trace GPU Programming APIs" then I get rid of the above error, but I still don't see any kernels listed that I can click on.  (I've attached a screen cap of VTune).  I can find my CPU kernels burried in some mangled subroutines, but not my GPU kernel.

I'm obviously missing something (wouldn't be the first time ;^)) so if you can be a bit more specific and point me in the right direction I'd appreciate it.

I will note that I'm running with "SYCL_BE=PI_OPENCL" otherwise all my GPU kernels intermittently crash with the ZE_RESULT_ERROR_INVALID_ENUMERATION problem (https://software.intel.com/en-us/forums/oneapi-data-parallel-c-compiler/topic/856864). (Surprisingly VTune silently halts without acknowledging the ZE_RESULT... fault).

For completeness...

Environment: Lenovo T570, Ubuntu 18.04.4, OneAPI 2021.1.0 Beta06

Build:

dpcpp -O0 -g -o ovhd2 ovhd2.cpp -gline-tables-only -fdebug-info-for-profiling -I/local/opt/intel/inteloneapi/vtune/2021.1-beta06/include/pkgconfig/lib64/../../ -L/local/opt/intel/inteloneapi/vtune/2021.1-beta06/include/pkgconfig/lib64/../../../lib64 -littnotify -ldl   -lOpenCL -lsycl
 

Code (It's just a toy so don't be too critical of it):

#include <CL/sycl.hpp>
#include <cstdio>
#include <unistd.h>
#include <algorithm>

#include <ittnotify.h>

/** timer **/
#include <x86intrin.h>

unsigned long tick()
{
  unsigned int tmp;
  return __rdtscp(&tmp);
}

double tick_calibrate()
{
  unsigned long t1 = tick();
  sleep(1);
  unsigned long t2 = tick();
  return (double)(t2-t1);
}

namespace sycl = cl::sycl;

const int Nproc=8;
const int Nelem=5;
const int Niter=3;
const int Nsteps=10000000;

class LCG48 {
  public :
    void seed(unsigned long x) { state = (x<<16) + 0x330E; }
    void step() {
      state = ((0x5DEECE66D * (state) + 0xB) % (1L<<48));
    }
    unsigned long get() {
      return ((0x7FFF8000 & (state >> 17)) + (0x7FFF & (state >> 17)));
    }
  private:
      unsigned long state;
};

int 
main(int argc, char *argv[])
{
  unsigned long t1, t2;
  int xdata[Nproc];
  for (int j=0; j<Nproc; j++) xdata = 0;

  auto itt_d = __itt_domain_create("Domain");
  auto itt_cpu = __itt_string_handle_create("CPU Task");
  auto itt_gpu = __itt_string_handle_create("GPU Task");

  double tickperseconds = tick_calibrate();

  /** 
   ** Choose a device 
   **/
  //sycl::device dev = sycl::default_selector().select_device();
  //sycl::device dev = sycl::host_selector().select_device();
  //sycl::device dev = sycl::gpu_selector().select_device();
  //sycl::device dev = sycl::cpu_selector().select_device();

  /**
   **
   **/
#if 1
  __itt_event itt_event1 = __itt_event_create("Event1",6);
  __itt_event_start(itt_event1);
  {
    sycl::device dev = sycl::cpu_selector().select_device();
    std::cout << "Device: " 
        << "name: " << dev.get_info<sycl::info::device::name>() << std::endl
        << "vendor: " << dev.get_info<sycl::info::device::vendor>() << std::endl;
    sycl::queue q(dev);

    t1 = tick();
    for (int i=0; i<Niter; i++) {
      sycl::buffer<int, 1> xbuffer((int *)xdata, sycl::range<1> {Nproc});
      q.submit([&](sycl::handler& cgh) {
          auto xaccessor = xbuffer.get_access<sycl::access::mode::read_write, sycl::access::target::global_buffer>(cgh);
          cgh.parallel_for<class mykernel0>(
              sycl::range<1> {Nproc}, 
              [=] (sycl::item<1> item) {
                int idx = item.get_linear_id();
                LCG48 lcg;
                lcg.seed(idx);
                for (int i=0; i<Nsteps; i++) lcg.step();
                xaccessor[idx] = 0xFFFFFFFF & lcg.get();
              }
              );
          }
          ); 
      q.wait();
    }
    t2 = tick();
    for (int i=0; i<std::min(Nproc,8); i++) printf("%08X ", xdata); printf("\n");
    printf("%d %ld %f\n", Niter, (t2-t1), (double)(t2-t1)/(double)(Niter));
  }
  __itt_event_end(itt_event1);
#endif
#if 1
  __itt_event itt_event2 = __itt_event_create("Event2",6);
  __itt_event_start(itt_event2);
  {
    sycl::device dev = sycl::cpu_selector().select_device();
    std::cout << "Device: " 
        << "name: " << dev.get_info<sycl::info::device::name>() << std::endl
        << "vendor: " << dev.get_info<sycl::info::device::vendor>() << std::endl;
    sycl::queue q(dev);

    t1 = tick();
    for (int i=0; i<Niter; i++) {
      sycl::buffer<int, 1> xbuffer((int *)xdata, sycl::range<1> {Nproc});
      q.submit([&](sycl::handler& cgh) {
          auto xaccessor = xbuffer.get_access<sycl::access::mode::read_write, sycl::access::target::global_buffer>(cgh);
          cgh.parallel_for<class mykernel1>(
              sycl::range<1> {Nproc}, 
              [=] (sycl::item<1> item) {
                int idx = item.get_linear_id();
                LCG48 lcg;
                lcg.seed(idx);
                for (int i=0; i<Nsteps; i++) lcg.step();
                xaccessor[idx] = 0xFFFFFFFF & lcg.get();
              }
              );
          }
          ); 
      q.wait();
    }
    t2 = tick();
    for (int i=0; i<std::min(Nproc,8); i++) printf("%08X ", xdata); printf("\n");
    printf("%d %ld %f\n", Niter, (t2-t1), (double)(t2-t1)/(double)(Niter));
  }
  __itt_event_end(itt_event2);
#endif

  /**
   **
   **/
#if 1
  __itt_event itt_event3 = __itt_event_create("Event3",6);
  __itt_event_start(itt_event3);
  {
    sycl::device dev = sycl::gpu_selector().select_device();
    std::cout << "Device: " 
        << "name: " << dev.get_info<sycl::info::device::name>() << std::endl
        << "vendor: " << dev.get_info<sycl::info::device::vendor>() << std::endl;
    sycl::queue q(dev);

    t1 = tick();
    for (int i=0; i<Niter; i++) {
      sycl::buffer<int, 1> xbuffer((int *)xdata, sycl::range<1> {Nproc});
      q.submit([&](sycl::handler& cgh) {
          auto xaccessor = xbuffer.get_access<sycl::access::mode::read_write, sycl::access::target::global_buffer>(cgh);
          cgh.parallel_for<class mykernel2>(
              sycl::range<1> {Nproc}, 
              [=] (sycl::item<1> item) {
                int idx = item.get_linear_id();
                LCG48 lcg;
                lcg.seed(idx);
                for (int i=0; i<Nsteps; i++) lcg.step();
                xaccessor[idx] = 0xFFFFFFFF & lcg.get();
                }
              );
          }
          ); 
      q.wait();
    }
    t2 = tick();
    for (int i=0; i<std::min(Nproc,8); i++) printf("%08X ", xdata); printf("\n");
    printf("%d %ld %f\n", Niter, (t2-t1), (double)(t2-t1)/(double)(Niter));
  }
  __itt_event_end(itt_event3);
#endif
}

 

0 Kudos
Reply