
How to verify Intel AMX usage with processor counters

Thomas_W_Intel
Employee

Intel® Advanced Matrix Extensions (Intel® AMX) is a built-in accelerator that improves the performance of deep-learning training and inference on the current generation of Intel Xeon processors. Optimized frameworks like Intel® Extension for PyTorch* or Intel® Extension for TensorFlow* leverage Intel AMX for best performance. However, it is not always obvious whether the optimized stack and code path are actually used in a given setup.

The good news is that you can use the performance monitoring units (PMUs) of the processor to verify whether AMX instructions are executed. The tool that we will use is Intel Performance Counter Monitor (Intel PCM). When following the instructions to build Intel PCM, please make sure that you clone the repository recursively to include the JSON parser:

 

git clone --recursive https://github.com/intel/pcm 
cd pcm
mkdir build
cd build
cmake ..
cmake --build . --parallel
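
The later commands call pcm-raw by name, so the freshly built binary has to be on the PATH. The following is a minimal sketch of a helper for that. The function name use_pcm_build is hypothetical and not part of Intel PCM, and the assumption that the binaries land in the "bin" subdirectory of the build directory is taken from my reading of the PCM README, so please check where your build actually puts them.

```shell
# Hypothetical helper (not part of Intel PCM): prepend a directory to PATH
# if it contains an executable pcm-raw, otherwise report an error.
use_pcm_build() {
    bindir=$1
    if [ -x "$bindir/pcm-raw" ]; then
        PATH="$bindir:$PATH"
        export PATH
        echo "using $bindir/pcm-raw"
    else
        echo "pcm-raw not found in $bindir" >&2
        return 1
    fi
}
```

From inside the build directory, this would be called as use_pcm_build "$PWD/bin" (assuming that output location).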

 

At first glance, Intel PCM comes with many tools, but none of them addresses AMX. Fortunately, there is also a tool, “pcm-raw”, that can collect arbitrary raw counters. This is what we will use to determine whether AMX instructions are used!

The long list of available events for your processor can be found in the perfmon repository. In particular, the event EXE.AMX_BUSY is available on the current 4th and 5th gen Intel® Xeon® Scalable Processors (code-named “Sapphire Rapids” and “Emerald Rapids”) as well as the next-gen Intel® Xeon® 6 Processor with P-cores (code-named “Granite Rapids”). As the name suggests, EXE.AMX_BUSY reports the cycles where the AMX unit is busy. To use the event files with pcm-raw, the simplest approach might be to clone the whole repository:

 

git clone https://github.com/intel/perfmon
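
Once the repository is cloned, it can be reassuring to check which event files actually mention the event you are after. Here is a small sketch for that. The function name find_event_files is made up for illustration, and it only assumes that the event definitions are stored in JSON files somewhere below the checkout.

```shell
# Hypothetical helper: list the JSON files below a perfmon checkout
# that mention a given event name.
find_event_files() {
    repo=$1
    event=$2
    # grep -l prints only file names, -F treats the event name as a fixed string
    # (so the dot in EXE.AMX_BUSY is not interpreted as a wildcard).
    find "$repo" -name '*.json' -exec grep -lF -- "$event" {} + | sort
}
```

For example, find_event_files perfmon EXE.AMX_BUSY lists the matching files. If the list does not include a file for your architecture, the event is probably not available on your processor.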

 

When executed in the perfmon directory, pcm-raw detects the architecture and chooses the correct event file. (Alternatively, you can provide the path to the event files to pcm-raw with the "-ep" option.) Run it like this:

 

pcm-raw -tr -e EXE.AMX_BUSY

 

On an idle system, the output looks something like this:

2024-08-06,17:45:06.981,EXE.AMX_BUSY,1000,1600240982,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

The output can be deciphered as follows: after the date and time, the event name is listed. The fourth column reports the time in milliseconds between outputs, and the fifth column the number of TSC (the CPU's Time Stamp Counter) cycles during that interval. The remaining numbers are the AMX_BUSY counts for each core. Since the AMX units weren’t used on the idle system, all of them are zero.
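
With several hundred columns per line, counting zeros by eye gets tedious. The column layout described above can be turned into a short awk filter. This is only a sketch: the function name summarize_amx is hypothetical, and it assumes that every data line follows exactly the layout described here (date, time, event name, interval, TSC cycles, then one counter per core).

```shell
# Hypothetical helper (not part of Intel PCM): summarize pcm-raw CSV lines from stdin.
# Assumed columns: 1=date, 2=time, 3=event name, 4=interval in ms,
#                  5=TSC cycles, 6 and up = one counter per core.
summarize_amx() {
    awk -F, '$3 == "EXE.AMX_BUSY" {
        active = 0
        for (i = 6; i <= NF; i++)      # per-core counters start at column 6
            if ($i + 0 > 0) active++
        printf "%s %s active_cores=%d total_cores=%d\n", $1, $2, active, NF - 5
    }'
}
```

It can be fed directly from the tool, for example pcm-raw -tr -e EXE.AMX_BUSY | summarize_amx. Lines whose third column is not the event name are skipped.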

As soon as you start running software that uses Intel AMX, these numbers become non-zero (I was using the examples from LLM Modeling):

2024-08-06,17:56:57.775,EXE.AMX_BUSY,1000,1601515708,84184,0,0,76791932,0,0,76798833,0,76800706,65885840,76805883,76777056,76755105,76781423,76763623,76777146,76728379,70888877,76788855,0,76787351,76756889,0,76720079,76755198,76759664,76772279,76715953,59366815,83625791,83685711,76766046,76755934,76750846,30142249,76708832,76723637,76747039,76759419,83626252,0,76765427,76752597,76733974,77408719,76739454,76762390,76760832,76729625,0,0,83637903,76691024,76747312,76747200,76745047,76754927,76716791,76722641,83590723,0,76718627,0,76758112,83616019,76719278,83588628,19246934,76728645,76729446,77335509,76738508,76709315,76734313,83577106,83571601,77341093,76717084,83589470,83599106,83590694,77366776,76712027,76720818,77375604,83577789,83511037,77325126,76723867,76721705,83580629,76650471,76691904,77317858,0,76662931,37624863,30419982,0,0,80051054,83568169,0,0,83681004,83535270,83551644,0,59638197,0,0,17775964,0,0,83493948,77330475,0,0,77444887,76652810,42721327,0,0,34039845,0,0,0,83544697,83775111,76695455,76647095,0,76643141,83610939,0,77324884,0,77425520,0,0,0,0,0,0,0,17890105,0,76681828,0,0,83619596,0,0,0,0,0,0,0,0,0,0,0,0,0,47932,0,0,0,77331679,0,0,0,27767,0,0,0,0,9464,0,0,0,0,0,0,0,0,0,0,0,0,0,0,83320,83288,47086,76756698,48378,49640,0,51840,65728,0,83260,60842,76402,0,0,0,0,83270,0,0,0,0,0,0,0,0,0,0,0,0,83535083,0,83276,77394585,83576163,83597771,83623096,0,83577033,83568777,83661090,30250,0,83553172,78297,83578503,83541855,83585311,83594580,83597174,0,0,83566172,77404463,77430797,41670,83569292,83629111,83601728,83661236,83602887,83528878,76734756,0
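
To put such counts into perspective, you can relate each non-zero counter to the TSC cycles in the fifth column. The sketch below does that. The function name amx_busy_ratio is hypothetical, and the ratio is only a rough indicator: I am assuming that the event counts core cycles, which need not tick at the same rate as the TSC. The reported column number is simply the position in the output, not necessarily the OS CPU number.

```shell
# Hypothetical helper: for each non-zero per-core counter, print its position
# in the output and its share of the TSC cycles of that interval.
# Assumed columns: 3=event name, 5=TSC cycles, 6 and up = one counter per core.
amx_busy_ratio() {
    awk -F, '$3 == "EXE.AMX_BUSY" && $5 + 0 > 0 {
        for (i = 6; i <= NF; i++)
            if ($i + 0 > 0)
                printf "column=%d busy=%d ratio=%.2f%%\n", i - 5, $i, 100 * $i / $5
    }'
}
```

As with the raw output, this reads from stdin, so it can be placed at the end of a pipe after pcm-raw.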

These events are reported directly by the respective core itself and are therefore a direct indication of Intel AMX usage in your system. Have fun!
