Profiling Code To Check For Utilization of Intel® AMX Instructions

Aaron_Gubrud · ‎09-14-2023

Introduction

Intel recently launched the 4th Generation Intel® Xeon® Scalable Processors, which provide an array of integrated hardware accelerators to help customers realize efficiency gains across various use cases. Intel Advanced Matrix Extensions (AMX) is one of these accelerators and enables users to get the best possible performance out of their artificial intelligence workloads. The capabilities made possible by Intel AMX are tremendous, but today's layered software stacks can make it difficult to tell whether a given application actually executes Intel AMX instructions. This guide provides a simple procedure to evaluate whether a given software project uses Intel AMX.

Processwatch

Intel's processwatch is a utility that "displays per-process instruction mix in real-time, organizing these instructions into categories." This utility allows the user to monitor the instructions executed in a given sampling window, separated by process. Reporting each executed instruction can result in an overwhelming report, so the utility buckets instructions into groups and provides filtering options to report only the interesting buckets. In this case, we can focus on the vectorized instruction buckets: SSE, AVX, AVX2, AVX512, and AMX_TILE.

OneDNN

To provide a lightweight and straightforward workload to test processwatch, we look to Intel's oneDNN project. oneDNN is "an open-source, cross-platform performance library of basic building blocks for deep learning applications." Various AI/ML frameworks integrate oneDNN to provide optimizations when running on Intel hardware, including PyTorch and TensorFlow, both of which have extensions that enable Intel GPU support and provide additional optimization for Intel architecture.

BenchDNN

benchdnn offers "an extended and robust correctness verification and performance benchmarking tool for the primitives provided by oneDNN." With these capabilities, benchdnn will allow us to evaluate the presence of Intel AMX instructions and other levels of SIMD instructions.

Changing The Max CPU ISA Within oneDNN

Software built on oneDNN can cap the level of SIMD acceleration with the DNNL_MAX_CPU_ISA environment variable, which sets the highest instruction set that oneDNN is allowed to dispatch to. This resource provides documentation on the different possible ISA levels; for this guide, we'll focus on the following levels: SSE41, AVX, AVX2, AVX512_CORE, and AVX512_CORE_AMX.
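As a minimal sketch of how the variable is scoped, the hypothetical helper below (run_with_isa is an illustrative name, not part of oneDNN) sets DNNL_MAX_CPU_ISA for a single command only, so the rest of the shell session is unaffected. The printenv call is a stand-in for a real oneDNN-based workload.

```shell
#!/bin/bash
# Hypothetical helper: run one command with the oneDNN ISA cap set
# only in that command's environment.
run_with_isa() {
  local isa="$1"
  shift
  env DNNL_MAX_CPU_ISA="$isa" "$@"
}

# Stand-in workload: print the value the child process actually sees.
run_with_isa AVX2 printenv DNNL_MAX_CPU_ISA
```

Replacing printenv with your application's launch command applies the same cap to it. The variable only lowers the ceiling; it cannot enable instructions that the processor does not support.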

oneDNN also provides a verbose mode controlled by the ONEDNN_VERBOSE environment variable. Enabling this verbose mode provides primitive-level statistics such as the primitive name, the data precision of its inputs and outputs, the name of the primitive implementation, and its execution time. For more details about ONEDNN_VERBOSE, you can visit this resource.

Monitoring Benchdnn with Processwatch

Note: this guide uses Ubuntu 22.04, so instructions will describe the process within that operating system. Adaptation to other operating systems may require some modifications.
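Before building anything, it can be worth confirming that the kernel reports AMX support at all. The sketch below assumes a Linux system where AMX features appear in /proc/cpuinfo as flags beginning with amx_ (for example amx_tile); amx_flags is a hypothetical helper name, not an existing tool.

```shell
#!/bin/bash
# Hypothetical helper: list the amx_* CPU feature flags in a cpuinfo-style file.
# Defaults to /proc/cpuinfo; the optional path argument makes it easy to test.
amx_flags() {
  local cpuinfo="${1:-/proc/cpuinfo}"
  # Take the first "flags" line, split it into one flag per line, keep amx_* entries.
  grep -m1 '^flags' "$cpuinfo" | tr ' \t' '\n\n' | grep '^amx_' | sort -u
}

if [ -n "$(amx_flags 2>/dev/null)" ]; then
  echo "AMX flags reported by the kernel:"
  amx_flags
else
  echo "No amx_* flags found; AMX_TILE counts are expected to stay at 0 here."
fi
```

If no flags are listed, the AVX512_CORE_AMX run later in this guide would be expected to fall back to a lower ISA level.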

Configure Processwatch

The steps for cloning and building processwatch are below. Use the pinned v1.1 tag for the best chance of replicating the results in this guide.

Edited August 2024

cd ~
mkdir simd_experiment && cd $_
git clone --recursive https://github.com/intel/processwatch.git -b v1.1
cd processwatch/
sudo apt-get install libelf-dev cmake clang llvm llvm-dev libomp-dev build-essential binutils-dev libcapstone-dev libbpf-dev -y
./build.sh
sudo ln -sf "$(realpath processwatch)" /usr/local/bin/

Build OneDNN

The process for cloning and configuring oneDNN/benchdnn is below. Please follow the specific git commits for the best chances of replicating the steps in this guide.

cd ~/simd_experiment/
git clone https://github.com/oneapi-src/oneDNN.git
cd oneDNN/
git checkout 7f83042c33a183b48e867d36e41fdee6c4524b83
mkdir build && cd $_
cmake ..
make -j"$(nproc)"
sudo ln -sf "$(realpath tests/benchdnn/benchdnn)" /usr/local/bin/
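With both symlinks in place, a quick check that the tools resolve on PATH can save a failed run later. This is a small sketch; missing_tools is a hypothetical helper name introduced here for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print each named tool that cannot be found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" > /dev/null 2>&1 || echo "$tool"
  done
}

missing=$(missing_tools processwatch benchdnn)
if [ -z "$missing" ]; then
  echo "processwatch and benchdnn are both on PATH"
else
  echo "not found on PATH:" $missing
fi
```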

Testing Using Varying SIMD Levels

With processwatch and oneDNN/benchdnn configured, we're ready to profile a test workload. In the script below, we focus on a particular convolution kernel, chosen because it is large enough to show a significant number of SIMD instructions and has oneDNN implementations at the AVX2, AVX512_CORE, and AVX512_CORE_AMX levels. As we cycle through the available levels, we can observe how the mix of AVX2, AVX512, and AMX instructions changes.

Setting Up The Test Script

Place the contents below into a file on your test system. We'll refer to it as "run_simd_test.sh".

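#!/bin/bash

# ISA caps to test, from highest to lowest.
isa_arr=("AVX512_CORE_AMX" "AVX512_CORE" "AVX2")
for i in "${isa_arr[@]}"; do
  echo "$(date) - DNNL_MAX_CPU_ISA=$i"
  echo "$(date) - launching benchmark..."
  # Run the convolution in the background with the ISA cap and verbose logging enabled.
  DNNL_MAX_CPU_ISA=$i ONEDNN_VERBOSE=1 benchdnn --conv --dt=u8:s8:u8 mb32ic3ih300oc64oh300kh3ph1n"ssd_300_voc0712:conv1_1" &> "$i-benchmark.log" &
  bmpid=$!
  # Watch the benchmark process, reporting only the SIMD and AMX categories.
  sudo processwatch -n 1 -p "$bmpid" -f SSE -f AVX -f AVX2 -f AVX512 -f AMX_TILE &> "$i-processwatch.tsv" &
  pwpid=$!
  # Block until the benchmark finishes before moving to the next ISA level.
  wait "$bmpid"
  echo "$(date) - DNNL_MAX_CPU_ISA=$i complete"
done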

Running The Test Script

Execution of the script is simple; just make it executable and call it directly.

chmod +x run_simd_test.sh
./run_simd_test.sh

Evaluating Results

Examining Benchdnn Logs

Generally speaking, if your application relies on oneDNN for all of its primitives, the easiest way to determine which level of SIMD instructions it uses is the ONEDNN_VERBOSE environment variable. The reference bash script above sets this variable on the benchdnn command line. For the AMX case, you will see output similar to the log below. The isa info line reports that AMX is available, and the convolution line names the brgconv:avx512_core_amx implementation. The last field of each exec line is the execution time in milliseconds (8.78296 for the convolution).

cat AVX512_CORE_AMX-benchmark.log

onednn_verbose,info,oneDNN v3.2.0 (commit 7f83042c33a183b48e867d36e41fdee6c4524b83)
onednn_verbose,info,cpu,runtime:OpenMP,nthr:16
onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
onednn_verbose,info,gpu,runtime:none
onednn_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
onednn_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:a::f0 dst_f32::blocked:a::f0,,,64,0.00585938
onednn_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd::f0 dst_s8::blocked:Acdb16a::f0,,,64x3x3x3,0.167969
onednn_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd::f0 dst_u8::blocked:acdb::f0,,,32x3x300x300,0.533936
onednn_verbose,exec,cpu,convolution,brgconv:avx512_core_amx,forward_training,src_u8:a:blocked:acdb::f0 wei_s8:a:blocked:Acdb16a::f0 bia_f32:a:blocked:a::f0 dst_u8:a:blocked:acdb::f0,,alg:convolution_direct,mb32_ic3oc64_ih300oh300kh3sh1dh0ph1_iw300ow300kw3sw1dw0pw1,8.78296
onednn_verbose,exec,cpu,reorder,jit:uni,undef,src_u8::blocked:acdb::f0 dst_f32::blocked:abcd::f0,,,32x64x300x300,17.3601
0:PASSED __REPRO: --conv --dt=u8:s8:u8 mb32ic3ih300oc64oh300kh3ph1nssd_300_voc0712:conv1_1
tests:1 passed:1 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
total compute_ref: sum(s):1.88

Examining the other execution logs will show how the implementation field changes as the DNNL_MAX_CPU_ISA environment variable changes. Execution time changes in response to the SIMD level as well.
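To compare the logs side by side, a short sketch like the one below pulls the primitive name, implementation, and execution time out of each exec line. It assumes the comma-separated layout shown in the log above (primitive in field 4, implementation in field 5, execution time last), which may differ in other oneDNN versions; summarize_verbose is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: print "primitive implementation exec_time" for each
# exec line of the given primitive in a ONEDNN_VERBOSE log.
summarize_verbose() {
  local primitive="$1" log="$2"
  awk -F, -v prim="$primitive" \
    '$1 == "onednn_verbose" && $2 == "exec" && $4 == prim { print $4, $5, $NF }' "$log"
}

# Summarize the convolution line of every log produced by run_simd_test.sh.
for log in *-benchmark.log; do
  [ -e "$log" ] || continue   # skip the unexpanded pattern when no logs exist
  echo "== $log"
  summarize_verbose convolution "$log"
done
```

Run from the directory that holds the three benchmark logs, this prints one convolution line per ISA level, which makes the change in implementation name and execution time easy to see.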

Examining Processwatch Logs

As stated in the previous section, examining the logs when running with ONEDNN_VERBOSE is the most straightforward approach to check if your application uses AMX implementations for its primitives. However, not all frameworks implement their primitives with oneDNN. Processwatch can help bridge that gap, watching for specific instructions to be issued and grouping them as they relate to SSE, AVX, AVX2, AVX512, AMX, and others.

Starting with the AVX512_CORE_AMX processwatch log, we see a nonzero value (0.29) in the AMX_TILE category provided by processwatch, which confirms that the benchmark executed AMX instructions.

cat AVX512_CORE_AMX-processwatch.tsv

PID	NAME	SSE	AVX	AVX2	AVX512	AMX_TILE	%TOTAL
ALL	ALL	17.56	0	0	0.18	0.29	100
12175	benchdnn	17.56	0	0	0.18	0.29	100

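Moving down in SIMD levels, the instruction mix reported by processwatch shifts accordingly. With AVX512_CORE, AMX_TILE drops to 0 while AVX512 remains nonzero; with AVX2, both AVX512 and AMX_TILE are 0, and the AVX and AVX2 categories appear instead.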

cat AVX512_CORE-processwatch.tsv

PID	NAME	SSE	AVX	AVX2	AVX512	AMX_TILE	%TOTAL
ALL	ALL	17.35	0	0	0.67	0	100
12198	benchdnn	17.35	0	0	0.67	0	100

cat AVX2-processwatch.tsv

PID	NAME	SSE	AVX	AVX2	AVX512	AMX_TILE	%TOTAL
ALL	ALL	17.38	0.11	0.68	0	0	100
12222	benchdnn	17.38	0.11	0.68	0	0	100
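The same check can be scripted. The sketch below assumes the tab-separated layout shown in the logs above (a header row starting with PID and a summary row starting with ALL); pw_column is a hypothetical helper name, and real logs may contain extra lines that this sketch simply ignores.

```shell
#!/bin/bash
# Hypothetical helper: print the value of one named column from the "ALL" row
# of a processwatch TSV log.
pw_column() {
  local column="$1" file="$2"
  awk -F'\t' -v col="$column" '
    $1 == "PID" { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
    $1 == "ALL" && idx { print $idx; exit }
  ' "$file"
}

# Report whether each run shows any AMX_TILE activity.
for tsv in *-processwatch.tsv; do
  [ -e "$tsv" ] || continue   # skip the unexpanded pattern when no logs exist
  share=$(pw_column AMX_TILE "$tsv")
  if awk -v s="$share" 'BEGIN { exit !(s > 0) }'; then
    echo "$tsv: AMX_TILE=$share (AMX instructions observed)"
  else
    echo "$tsv: no AMX_TILE activity observed"
  fi
done
```

Looking the column up by name rather than by position keeps the check working if a different set of -f filters changes the column order.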

Conclusion

This guide should help you evaluate whether your AI/DL application uses the highest level of acceleration possible for your 4th Generation Intel® Xeon® Scalable Processor. While this guide shows the process for profiling the execution of a particular convolution kernel in oneDNN's benchdnn, the procedure is extendable to any software project by running processwatch alongside your project's execution.

OttoChow · ‎11-04-2024

Is there any similar tool for Windows environment?