Solved: Finding arithmetic intensity of application

Londhe__Ashutosh · ‎12-02-2015

Dear sir/mam,

I wanted to find arithmetic intensity of the code i developed. I want to know whether Intel has any tool which can calculate arithmetic intensity of any application. Also i wanted to know that if arithmetic intensity is known then how can i predict that the execution time i achieved on the architecture is have is good or not . Also is there way i can predict its performance in future hardware.

Regards,

Ashutosh S. Londhe

Project Engineer I

Seismic Data Processing Team,

Computational Earth Sciences Group,

Center for Development of Advanced Computing, CDAC Pune, India

McCalpinJohn · ‎12-20-2015

For a code that you have developed yourself, it is generally relatively easy to count "operations" (defined in any way that you find convenient) and have the code keep track of the total number of "operations" performed. Then you can add code to read the memory controller performance counters (which appear to be accurate on all the systems I have tested). For any interval of the program execution, you can define the "arithmetic intensity" as the number of "operations" that your program counts divided by the amount of memory traffic (64 Bytes times the number of DRAM read and write CAS operations).

For the Xeon E5 processors the intervals can be relatively large, since each memory controller has its own 48-bit counters. For the Core i5/i7 processors with 2 DRAM channels, the counters are harder to use and have to be sampled much more quickly since the memory controller has only one 32-bit wide counter for the sum of the traffic on the two channels. (https://software.intel.com/en-us/articles/monitoring-integrated-memory-controller-requests-in-the-2nd-3rd-and-4th-generation-intel)

This sort of ratio may or may not be helpful -- it depends on what it is you are trying to do or trying to understand....

View solution in original post

McCalpinJohn · ‎12-03-2015

I think that you will need a precise definition of "arithmetic intensity" before anyone can discuss whether it is possible to build a tool to measure it.

"Arithmetic Intensity" is typically defined as some measure of arithmetic required divided by some measure of memory access required. Unfortunately there are a great many variations possible.

Do you want to count "arithmetic" by number of instructions or number of "operations"? With the SIMD vector instruction sets, these can differ by a large factor. E.g., on a Haswell core, 1 instruction might correspond to 1, 2, 4, 8, or 16 arithmetic operations.
- Some arithmetic operations (e.g., divide and square root) are intrinsically slow. Do you want to count them as single operations? If the compiler splits these up into an approximation followed by a set of corrections, do you want to count all of the operations required? (Note that the number of operations will depend on the accuracy requested, with more iterations required to obtain full accuracy.)
- Compilers generally work very hard to eliminate redundant computations. Do you want to count the operations specified by the source code or the operations remaining after the compiler has eliminated common expressions?
Do you want to count "memory accesses" by instructions, by data words, by bytes, or by cache lines moved at some level of the memory hierarchy?
- Much of the memory traffic on modern hardware is actually initiated by hardware prefetch engines, rather than directly by loads or stores. Do you want to count all of this traffic? What about the cases where the prefetchers fetch data that is never used? What about data that is brought into the core because it is in the same cache line as some other data (but is never used)?
- Sometimes a significant portion of the memory traffic is generated by cache conflicts. This extra traffic can certainly slow down the application, but it is very unlikely that a static analysis tool would be able to predict it. You could measure the traffic, but then the results would only apply to the particular hardware you ran on, with the particular combination of pseudo-random page mappings that the OS provided for that particular run.

Sometimes there are easy answers to all these questions, but most of the time there are at least a few ambiguities that require special treatment. This makes it very difficult to create a "tool" to provide such an analysis.

dkokron · ‎12-09-2015

Arithmetic Intensity (AI) is definitely a squishy term. How about a tool that measures an AI of the type used in a standard roofline analysis? Such a tool would help a developer figure out if their implementation of an algorithm is close to optimal for that algorithm.

McCalpinJohn · ‎12-09-2015

I forgot to add that you can get a quick indication of the "arithmetic intensity" of a code on some Intel processors by running the code with the uncore frequency set to "maximum" and with varying core frequencies. (This is supported on Xeon E5 v3, for example.)

If the execution time is very close to linear in the CPU core frequency, then the code is strongly compute-bound. If the execution time is very close to constant in the CPU core frequency, then the code is (probably) strongly memory-access-bound. Intermediate results can be modeled in a variety of different ways, depending on what you think is happening with overlap of computation and memory access.

Londhe__Ashutosh · ‎12-09-2015

Did you mean that if my processor core frequency is 2.3 GHz and if my code runs for lets suppose 10 sec. then

AI => 10 * 2.3 => 23

Is i am understanding right. And what if i am using processor as thread (Like in OpenMP®) then same formula can be used.

McCalpinJohn · ‎12-10-2015

The sensitivity to processor core frequency can be used to define "intensity" ratios, but can only be used to compute an "arithmetic intensity" in very limited cases where you know that the portion of the code that executes at the core frequency (instructions, L1 cache accesses, and L2 cache accesses) is dominated by the time required to perform arithmetic.

I have had good luck with modeling the execution time as the sum of two non-overlapping components. One is "T_cpu", which is assumed to be linearly proportional to the core clock period. The other is "T_bw", which is assumed to be inversely proportional to the maximum sustainable memory bandwidth per core. The model is

T_total = W_cpu/R_cpu + W_bw/R_bw

Where:

T_total is the measured execution time
- The experiment should be run with several (many) different core frequencies and (if possible) DRAM frequencies.
W_cpu is the unknown amount of work done by the core+L1+L2.
- The units are "billions of cycles of work".
- This cannot be directly converted to instructions or arithmetic -- it is simply the number of cycles that this particular core would required to perform this work if there were no stalls related to memory access beyond the LLC.
R_cpu is the processor frequency in GHz for the experiment.
W_bw is the unknown amount of data moved to/from memory by this processor while executing this job.
- Units are billions of bytes. (1e9 bytes)
R_bw is the bandwidth in GB/s for each core when running the STREAM benchmark.
- This should be run with the same core and DRAM frequencies as each application performance test.

The sensitivity-based technique works best when using all the cores on a chip.

Once a number of experiments have been run, the W_cpu and W_bw parameters are estimated using least-squares fitting to the assumed model equation.

Given these results, a "System Balance" can be defined as "R_cpu/R_bw", and an "Application Balance" can be defined as "W_cpu/W_bw". For the System Balance, higher values correspond to lots of CPU throughput and little memory bandwidth.

If the Application Balance is numerically higher than the System Balance, then the job is typically considered to be CPU-bound.
If the Application Balance is numerically lower than the System Balance, then the job is typically considered to be memory-bandwidth-bound.

In some cases these two terms are not enough to account for the total execution time and I have had to add a memory latency term to the equation. This one is more difficult to estimate because it is much more difficult to vary the memory latency of the system than it is to vary the processor frequency, the number of processors in use, or the memory frequency.

SB17 · ‎12-19-2015

Hello all

Dear John McCalpin can you explain me ?

Question one. How to determine Application Balance for program If the program is not over yet (W_cpu/W_bw)/sec) ? Or Application Balance can be determine only if collect W_cpu and W_bw until the end of the application.

Question two. Sorry for my incompetence in this topic but may be:

- memory bound can be determine easier by queue depth in memory controller (or in any other unit) ?

- calculation bound can be determine easier by queue depth in any CPU instruction fetch block ?

In my opinion If I see that queue depth is increasing dramatically than is the indicator that device can not process more.

What I am wrong ?

Sorry for my English

Bernard · ‎12-20-2015

-Answering first question:

I think that "calculation bound" can depend on many factors like: actual type of machine code instructions, interdependency between the stream of instructions, exploitation of ILP and execution reordering in order to increment ILP.

For example think about the following scenario when two Hardware threads are executed at the same time on of the Haswell cores. Both threads have branches in their code and decoded stream of tagged and fused cmp/jmp uops will be scheduled for execution on Port0 branch unit and on Port6 branch unit. I suppose that both of branch units implement in hardware various branch prediction algorithms.

SB17 · ‎12-20-2015

i.e. silver bullet still not exist ? It means that even Intel employees difficult to identify clear indicators of "calculation bound" ? :)

but it may be easier with "memory bound" ?

Thanks for answer

Bernard · ‎12-20-2015

@Black

Wait for the Dr. McCalpin answer.

McCalpinJohn · ‎12-20-2015

For a code that you have developed yourself, it is generally relatively easy to count "operations" (defined in any way that you find convenient) and have the code keep track of the total number of "operations" performed. Then you can add code to read the memory controller performance counters (which appear to be accurate on all the systems I have tested). For any interval of the program execution, you can define the "arithmetic intensity" as the number of "operations" that your program counts divided by the amount of memory traffic (64 Bytes times the number of DRAM read and write CAS operations).

For the Xeon E5 processors the intervals can be relatively large, since each memory controller has its own 48-bit counters. For the Core i5/i7 processors with 2 DRAM channels, the counters are harder to use and have to be sampled much more quickly since the memory controller has only one 32-bit wide counter for the sum of the traffic on the two channels. (https://software.intel.com/en-us/articles/monitoring-integrated-memory-controller-requests-in-the-2nd-3rd-and-4th-generation-intel)

This sort of ratio may or may not be helpful -- it depends on what it is you are trying to do or trying to understand....

SB17 · ‎12-21-2015

Thank you for the interesting link

Londhe__Ashutosh · ‎12-21-2015

Dear John D. McCalpin sir,

Thank you for your help. The formula you mention is the best way to calculate arithmetic intensity and i am aware of that. I was asking for tool because the code is not the small (approx. contain 5-6 thousand lines), that's why i am searching for tool.

McCalpinJohn · ‎12-22-2015

You might find the "-profile-functions" and "-profile-loops=<arg>" compiler options useful for helping with manual counting of the operations in a more complex code. Not automatic, but may be helpful....