Best tools for analyzing and profiling Intel Processor Graphics OpenCL kernels?

allanmac1 · ‎07-21-2016

I want to deeply profile and analyze my Intel Graphics OpenCL kernels and am not entirely clear on which tools are available and their advantages.

I assume VTune is the very best? What specifically does it offer for OpenCL developers that's useful?

What about GPA? Last I checked it didn't provide much detail.

Is GT-Pin something that's documented or is it what's actually powering the kernel analysis in Code Builder and Visual Studio?

I'd like to understand:

achieved EU occupancy
where any local memory bank conflicts might be occurring
instruction hot spots

I may be missing some features that already exist in the Visual Studio (2013) CodeBuilder because the Kernel Analysis screen crashes when analyzing my application which uses pre-compiled binary kernels (with "-cl-kernel-arg-info" enabled).

Also, I'd like to be able to dump the GEN assembly from an existing .IR binary. Is there a way to do this?

Jeffrey_M_Intel1 · ‎07-22-2016

It sounds like you want Code Builder and possibly VTune.

Code Builder has a lot of specialized views to help you understand OpenCL application and kernel performance. In case you haven't seen them, the videos on the OpenCL page (https://software.intel.com/en-us/intel-opencl) go into detail on what is included and how to use the tools.

VTune does great hotspot analysis, and it also is a great system efficiency visualizer with a lot of details on exactly what is happening in the EUs as well as video codec fixed function.

For more info: VTune for Media Apps, https://software.intel.com/en-us/articles/intel-vtune-amplifier-xe-getting-started-with-opencl-performance-analysis-on-intel-hd-graphics

GPA is custom-made for game developers, with lots of 3D graphics detail in its frame analyzer. You can see info useful for OpenCL performance analysis in the system analyzer, but you're right: the focus there is real-time feedback, and you'll get much more detailed optimization info in VTune.

GT-Pin can provide a lot of details too, but you're probably best starting with the other tools. In many cases they are getting the same info but in a more immediately accessible, consolidated, more extensively documented way.

I suspect that there is a way to get Gen assembly from .ir via the ioc64 tool -- I'll research and get back to you.

allanmac1 · ‎07-22-2016

Thanks, those links gave me a better understanding of VTune's features! I guess I'll try the free trial.

Using the "-asm=<file_path>" option, I can get iocXX to spit out an assembly file from an OpenCL compilation, but it looks like when there is a multi-kernel OpenCL file, only one of the kernels is written to the assembly file.

I'm going to guess the iocXX compiler overwrites the assembly output each time it finishes building a kernel. It would be nice if all the kernels appeared in the file and there was a leading textual header of some sort (kernel name, properties and some other meta info) before each assembly dump.

So there is probably a bug in the iocXX "-asm" feature or else the workaround is to simply generate assembly for one kernel at a time.

RFE: I would like to see a command-line profiler that lets me launch my app and by default reports the min/max/avg kernel runtimes and CL API timings. Additional switches would have the profiler focus on a single kernel by name and report back whatever counters are available in the GEN architecture. This is basically what CUDA's nvprof tool provides, and for some GPU devs a tool like that, which doesn't involve HTML or a GUI, is far more useful than a full-blown visual tool. It sort of sounds like GT-Pin already has these capabilities and maybe a more user-friendly wrapper could expose them.
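Until such a tool exists, the min/max/avg part can be approximated inside the application itself. The sketch below is only an illustration: the type and function names (kernel_stats, kernel_stats_add, and so on) are hypothetical, and it assumes the start/end timestamps come from clGetEventProfilingInfo (CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END) on a queue created with CL_QUEUE_PROFILING_ENABLE. The OpenCL calls themselves are left out so the block compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Running min/max/avg of kernel durations in nanoseconds.
 * All names here are hypothetical, for illustration only. */
typedef struct {
    uint64_t min_ns;
    uint64_t max_ns;
    uint64_t total_ns;
    size_t   count;
} kernel_stats;

void kernel_stats_init(kernel_stats *s)
{
    s->min_ns   = UINT64_MAX;
    s->max_ns   = 0;
    s->total_ns = 0;
    s->count    = 0;
}

/* start_ns / end_ns: assumed to be the values an application reads
 * with clGetEventProfilingInfo() for CL_PROFILING_COMMAND_START and
 * CL_PROFILING_COMMAND_END on the kernel's event. */
void kernel_stats_add(kernel_stats *s, uint64_t start_ns, uint64_t end_ns)
{
    assert(end_ns >= start_ns);
    uint64_t const elapsed = end_ns - start_ns;

    if (elapsed < s->min_ns) s->min_ns = elapsed;
    if (elapsed > s->max_ns) s->max_ns = elapsed;
    s->total_ns += elapsed;
    s->count    += 1;
}

/* Average duration, or 0.0 if nothing has been recorded yet. */
double kernel_stats_avg_ns(kernel_stats const *s)
{
    return s->count ? (double)s->total_ns / (double)s->count : 0.0;
}
```

Keeping one kernel_stats per kernel name and calling kernel_stats_add() after each clWaitForEvents() would give a rough per-kernel summary at exit.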

Yuval_E_Intel · ‎07-24-2016

Hi,

The Code Builder tool provides EU occupancy and instruction latency (for read instructions) reports. Currently there is no report for local memory bank conflicts in Code Builder.

There are two ways to get these reports:

1. OpenCL Application Analysis - start from an existing application and get a report for any kernel executed by the application. With this option, your application must build its kernels from source (with clCreateProgramWithSource) to get the latency report. Please read more here: https://software.intel.com/en-us/node/539437

2. Run analysis from OpenCL Kernel Development Framework - using the KDF you can start from a cl file, assign variables for the kernel, build, run and analyze the kernel without the need to write the host code. Of course you need to specify the kernel code to be analyzed. This is the recommended way to analyze the kernel if you want to try different implementations, different variables, or different locals and globals. Please read more here: https://software.intel.com/en-us/node/539301.

As for the Gen assembly: generating GEN assembly from IR is currently not exposed externally. We will consider adding this capability in the next release.

If you have the source code you can use the Intel Graphics Disassembly Source Mapping as part of the Code Builder to see the mapping between the cl code and the GEN assembly. Please read more here: https://software.intel.com/en-us/node/600446

Code Builder also has a command line interface. With the CLI you can launch your application and get min/max/avg kernel runtimes and CL API timings as well as HW counters for each kernel. You can then name a specific kernel and run a deep analysis on it to get occupancy and latency reports. Please read more here: https://software.intel.com/en-us/node/558502

Thanks,

Yuval

allanmac1 · ‎07-25-2016

Ah, OK!

I am currently creating a binary from the IR and converting it to a C array with bin2c.

That explains why I can't use the Deep Analysis features.

I'll load the kernel from source.

Thanks for your and Jeffrey's detailed responses.

allanmac1 · ‎07-25-2016

One more question...

Using an option string in clBuildProgram(), I can create and build the program with include files and successfully capture the basic Analysis info.

But the deeper Kernel Analysis panel will consistently fail even if I also declare the analysis working directory path to be the directory containing the include files. It's worth noting one of my include files has a ".inl" suffix.

I finally got the Analysis panel to work but I had to simplify the program source file so that there were no include files at all.

Is there any way to run iocXX in preprocessor mode and spit out the fully preprocessed file?

allanmac1 · ‎07-26-2016

Allan M. wrote:
Is there any way to run iocXX in preprocessor mode and spit out the fully preprocessed file?

Answering my own question...

On Windows, this will preprocess your OpenCL file and all its includes:

     cl /I . /I "%INTELOCLSDKROOT%\include" /D __OPENCL_VERSION__ /X /EP foo.cl | grep . - > foo.i

Notes:

The /EP option emits to stdout and removes all #line directives which seem to break the Kernel Latency panel.
The grep removes all empty lines making it easier to inspect in the Kernel Latency panel.
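If grep isn't available on the Windows machine, the same cleanup can be done with a few lines of C. This is a sketch under assumptions, not part of any Intel tool: the function name strip_blank_and_line_directives is made up here, it drops only truly empty lines (like "grep ."), and it additionally drops lines beginning with "#line" in case the preprocessor was run without /EP.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Compacts `text` in place, dropping empty lines and lines that start
 * with "#line". Returns the new length. Hypothetical helper. */
size_t strip_blank_and_line_directives(char *text)
{
    assert(text != NULL);

    char *src = text;   /* next line to examine   */
    char *dst = text;   /* where kept lines go    */

    while (*src != '\0') {
        char const *eol = strchr(src, '\n');
        /* len covers the whole line including '\n' when present */
        size_t len  = eol ? (size_t)(eol - src) + 1 : strlen(src);
        size_t body = eol ? len - 1 : len;

        /* treat "\r\n" line endings like "\n" */
        if (body > 0 && src[body - 1] == '\r')
            body--;

        int const drop = (body == 0) ||
                         (body >= 5 && strncmp(src, "#line", 5) == 0);

        if (!drop) {
            memmove(dst, src, len);  /* dst <= src, so this is safe */
            dst += len;
        }
        src += len;
    }

    *dst = '\0';
    return (size_t)(dst - text);
}
```

Whitespace-only lines are kept, matching what "grep ." does.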

Next, use the "bin2c" utility to convert the source file into a char[] array that can be compiled directly into your C application so you don't have to fiddle with opening/loading/closing files:

     bin2c -c -p 0 -n foo_source foo.i > foo.source
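For readers without a bin2c utility on hand, the conversion is simple enough to sketch. The block below is an assumption about the general shape of such output, not a reimplementation of bin2c or its flags: emit_c_array is a hypothetical name, and the trailing 0x00 byte is added so that strlen() works on the resulting array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Writes "const unsigned char NAME[] = {0x..,...,0x00};\n" into `out`.
 * Returns the number of characters written (excluding the NUL), or -1
 * if `out` is too small. Hypothetical helper for illustration only. */
int emit_c_array(char *out, size_t out_size, char const *name,
                 unsigned char const *data, size_t data_len)
{
    assert(out != NULL && name != NULL);

    int n = snprintf(out, out_size, "const unsigned char %s[] = {", name);
    if (n < 0 || (size_t)n >= out_size)
        return -1;
    size_t used = (size_t)n;

    /* i == data_len emits the extra 0x00 terminator byte */
    for (size_t i = 0; i <= data_len; i++) {
        unsigned const byte = (i < data_len) ? data[i] : 0u;
        n = snprintf(out + used, out_size - used, "%s0x%02x",
                     i ? "," : "", byte);
        if (n < 0 || (size_t)n >= out_size - used)
            return -1;
        used += (size_t)n;
    }

    n = snprintf(out + used, out_size - used, "};\n");
    if (n < 0 || (size_t)n >= out_size - used)
        return -1;
    used += (size_t)n;

    return (int)used;
}
```

The generated text can be written to a file and #include'd, after which the array is usable with strlen() and clCreateProgramWithSource().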

The char[] array can be loaded like this:

_____________________________________________________________

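  /* Needs <string.h> for strlen(); foo_source must be NUL-terminated. */
  /* cl_ok() and cl() are the poster's own error-checking helpers (not shown). */
  cl_int err;

  size_t const   strings_sizeof[] = { strlen((char const *)foo_source) };
  char   const * strings[]        = { (char const *)foo_source  };

  /* Build from source so the Kernel Analysis panels can map back to it. */
  cl_program program = clCreateProgramWithSource(context,
                                                 1,
                                                 strings,
                                                 strings_sizeof,
                                                 &err);
  cl_ok(err);

  char const * const options = " < your CL options here > ";

  cl(BuildProgram(program,
                  1,
                  &device_id,
                  options,
                  NULL,
                  NULL));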

_____________________________________________________________

I have the preprocessing and "bin2c" steps in a makefile and the snippet directly above is neat and compact.

allanmac1 · ‎07-26-2016

Now that I'm building kernels from source, I can report that everything works well on a Core i7-5775C + HD6200 using VS2013:

Session Analysis
Kernels Overview
Kernel Latency
Kernel Analysis
GEN-to-source mapping in the Session Explorer

Thanks for your help.