Software Archive
Read-only legacy content

How to correctly calculate the FLOPS on Xeon Phi?

Sunwoo_L_
Beginner

Hello,

I am calculating the FLOPS of an application. The application is really simple and short; it is just a matrix multiplication. The code is shown below.

void myMult(int size, float (*restrict A)[size], float (*restrict B)[size], float (*restrict C)[size])
{
    #pragma omp parallel for shared(A,B,C,size)
    for(int i = 0; i < size; ++i){
        for(int k = 0; k < size; ++k){
            for(int j = 0; j < size; ++j){
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}

I know two ways of calculating FLOPS. The first is dividing the total FLOP count by the elapsed seconds, and the second is dividing the VPU_ELEMENTS_ACTIVE count by the elapsed seconds. I expected the two methods to give almost the same value, but they didn't.

In the first way, the total FLOP count is 2*size*size*size. The size was 3000 and the run took 0.22 seconds, so the rate is (2*3000*3000*3000)/0.22/10^9 = 245.45 GFLOPS.
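
For reference, this is roughly how the first number is obtained. Below is a minimal sketch of a timing harness (the main() scaffolding, the omp_get_wtime() timing, and the initialization values are illustrative assumptions, not my exact test program):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void myMult(int size, float (*restrict A)[size], float (*restrict B)[size], float (*restrict C)[size]);

int main(void)
{
    int size = 3000;
    /* allocate and initialize the matrices (values are arbitrary) */
    float (*A)[size] = malloc(sizeof(float[size][size]));
    float (*B)[size] = malloc(sizeof(float[size][size]));
    float (*C)[size] = calloc(1, sizeof(float[size][size]));
    for (int i = 0; i < size; ++i)
        for (int j = 0; j < size; ++j) { A[i][j] = 1.0f; B[i][j] = 2.0f; }

    double t0 = omp_get_wtime();
    myMult(size, A, B, C);
    double t1 = omp_get_wtime();

    /* nominal FLOP count: one multiply and one add per innermost iteration */
    double flop = 2.0 * size * size * size;
    printf("time = %.3f s, GFLOPS = %.2f\n", t1 - t0, flop / (t1 - t0) / 1e9);

    free(A); free(B); free(C);
    return 0;
}

(Built natively for the coprocessor with something like icc -mmic -openmp -std=c99.)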

In the second way, I collected the VPU_ELEMENTS_ACTIVE event count using the command below.

/opt/intel/vtune_amplifier_xe/bin64/amplxe-cl -collect-with runsa-knc -knob event-config=CPU_CLK_UNHALTED,VPU_ELEMENTS_ACTIVE -target-system=mic-host-launch:0 -- ./test

The count was 595,239,785,714. From this number, I get 270.56 GFLOPS.

The two numbers are not so different... but I found out that VPU_ELEMENTS_ACTIVE counts an FMA operation as a single operation! So, in the second way, the FLOPS should be doubled to 541.12 GFLOPS (in this application there are no other floating-point operations).

Which way would be the correct one to calculate the FLOPS?

8 Replies
McCalpinJohn
Honored Contributor III

The VPU_ELEMENTS_ACTIVE event counts all vector instructions, not just the arithmetic instructions, so it is not a reliable way to count arithmetic operations.

Xeon Phi does not have a counter for arithmetic operations, so counting by inspection of the source code is typically the best option.

Sunwoo_L_
Beginner

Okay, then I have to use the first way.

Thank you for your reply, John D. McCalpin!

jimdempseyatthecove
Honored Contributor III

Then you have the "issue" where the source code indicates N flops but the compiler produces fewer operations. I am not talking about FMA; rather, many pieces of code benefit from common sub-expression elimination. Does that factor into the FLOPS calculation?
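
For example (this snippet is my own illustration, not from the code above): the source below nominally contains four floating-point operations, but a compiler that reuses the common sub-expression a*b only needs to issue three.

void example(float a, float b, float c, float d, float *x, float *y)
{
    /* Nominal count: 4 flops (2 multiplies + 2 adds).  With common
       sub-expression elimination the compiler computes a*b once and
       reuses it, so only 3 floating-point operations are executed. */
    *x = a*b + c;
    *y = a*b + d;
}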

Then, different ops have different timings. The generic term FLOPS assumes all operations are equal (including sqrt, sin, cos, ...).

And other concerns (cache hit/miss affecting FLOPS).

Your main concern should be wall clock time (overall run time).

Jim Dempsey

McCalpinJohn
Honored Contributor III

The main purpose of counting floating-point operations is to have a reasonable numerator for the "operations per second" calculation.   Although you could certainly use any number at all (including 1), it is usually helpful to come up with an operation count that scales properly with the actual work done by the code.   This allows performance comparisons between different problem sizes -- something that is typically not intuitive when the only data is raw execution time.

Of course the most important characteristic of an operation count is that it be fixed when using it as the numerator in "operations/second".   The exact value seldom matters very much, so the details of counting are not typically important -- as long as they are applied consistently.   When I am counting by hand, I sometimes exclude operations when it appears obvious that they don't need to be recomputed, but this is usually a very small correction.  

Once upon a time I was involved in a large government procurement in which vendors were allowed to use either the "nominal" floating point operation count of the code or the actual floating-point operation count.   In this case one vendor's actual floating-point operation count was much higher than the nominal count because floating-point divides were implemented in software.   If I recall correctly, the vendor could take credit for something like 27 "actual" floating-point operations for a divide rather than the 4 "nominal" operations from the operation count formula.  The end result was that this vendor was able to claim a higher MFLOPS rate despite taking longer to solve the same problem.    The other vendor's product had slightly different characteristics and was not able to exploit this flawed rule in a comparable way.

For performance monitoring based on interval instrumentation, compiler counts of floating-point operations (or any other kinds of operations) could usually be managed with very low overhead.  For sampling-based performance monitoring, there is no reasonable substitute for a hardware counter for floating-point operations.  (If you really, really wanted to do it, you could reserve a GPR across the whole program to hold the compiler-generated accumulated operation count.  Then this could be read at the random sampling points.  All it takes is a new ABI and a full set of libraries and compilers to support.)    Hardware counters cannot compensate for common sub-expression elimination or conversion of hardware divides into software divides, but they are close enough for sampling-based instrumentation (which is approximate by its very nature anyway).
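
As a rough illustration of the interval-instrumentation idea (a sketch only -- the flop_count variable and the per-call bookkeeping below are stand-ins for what a compiler-generated count would provide, not an existing tool or API):

#include <stdio.h>
#include <omp.h>

/* Hand-maintained nominal operation count, standing in for a
   compiler-generated counter. */
static unsigned long long flop_count = 0;

static void axpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];                  /* 2 flops per element (nominal) */
    flop_count += 2ULL * (unsigned long long)n;  /* accumulate the nominal count */
}

int main(void)
{
    enum { N = 1000000 };
    static float x[N], y[N];

    unsigned long long f0 = flop_count;
    double t0 = omp_get_wtime();
    for (int rep = 0; rep < 100; ++rep)
        axpy(N, 2.0f, x, y);
    double t1 = omp_get_wtime();

    /* report the interval using the accumulated nominal count */
    printf("%.3f s, %.3f GFLOPS (nominal)\n",
           t1 - t0, (double)(flop_count - f0) / (t1 - t0) / 1e9);
    return 0;
}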

jimdempseyatthecove
Honored Contributor III

On one hand you have 27 flops per abstract divide; on the other hand you have 1 flop per abstract multiply or add. Combine this with compiler optimizations (common sub-expression elimination, and elimination of code that generates values that are never used), not to mention that processors can perform overlapped operations such as multiply and add (non-FMA), and you get differing FLOPS values depending on the order of execution. Therefore, the term FLOPS is somewhat meaningless when taken out of the context of the problem to be solved. Standardized benchmark timing is a better measure of relative performance... but those relative performance figures may yield little insight into how a machine will affect your application.

Jim Dempsey

Sunwoo_L_
Beginner

Hello, Jim Dempsey.

Thank you for your explanation, but I am not following you. How is the figure of 27 flops calculated?

McCalpinJohn
Honored Contributor III

I never did see the details, but my recollection is that the software implementation of the divide instruction included one reciprocal approximation (counting as one "actual" FP operation) followed by 13 floating-point multiply-add operations (counting as 26 "actual" FP operations).

An example of a similar algorithm is provided in http://gec.di.uminho.pt/discip/minf/ac0203/icca03/ia64fpbf1.pdf for the IA-64 architecture (which also does not have a floating-point divide instruction).   The algorithm in the left column of page 7 appears to have 22 floating point operations -- one reciprocal approximation, nine multiply-add operations, and two isolated multiply operations.

Hardware dividers typically manage to get full precision in slightly fewer iterations by using some special tricks not available to software.  A very nice example from the AMD K7 architecture is http://www.acsel-lab.com/arithmetic/arith14/papers/ARITH14_Oberman.pdf

The details of the algorithm will depend on the accuracy of the initial reciprocal approximation, the specific division algorithm used, the accuracy requirement of the result, and whether the code is optimized for latency or throughput.
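
For anyone curious, here is a minimal sketch of the general idea only (not the specific IA-64 or K7 algorithms linked above): a crude initial reciprocal estimate refined by Newton-Raphson steps. Every multiply and subtract below counts as an "actual" floating-point operation even though the source expression a/b is nominally a single divide.

#include <stdio.h>

/* Software divide a/b via reciprocal refinement.  Each Newton-Raphson
   step  r = r*(2 - b*r)  roughly doubles the number of correct bits in
   the reciprocal approximation; the final multiply forms the quotient. */
static double soft_div(double a, double b, double r0)
{
    double r = r0;                 /* initial reciprocal estimate for 1/b */
    for (int i = 0; i < 5; ++i)
        r = r * (2.0 - b * r);     /* Newton-Raphson refinement step */
    return a * r;                  /* quotient = a * (1/b) */
}

int main(void)
{
    /* r0 must be a reasonable starting guess for 1/b for the iteration to
       converge; real hardware supplies this with a reciprocal-estimate op. */
    printf("%.15f (exact %.15f)\n", soft_div(1.0, 3.0, 0.3), 1.0 / 3.0);
    return 0;
}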

jimdempseyatthecove
Honored Contributor III

To paraphrase "Dr. Bandwidth":

Where you see the expression "A / B" in your source code, the floating-point operator "/" is implemented on some architectures with one machine op code. On the architectures he referenced it was performed using more machine floating-point op codes (27/26 and 22 op codes). In the one case, the machine vendor chose to credit each "/" with 27 floating-point operations (because 27 floating-point op codes were issued).

From the user (customer) viewpoint, they do not care how the "/" is implemented by the compiler or internally to the machine architecture (1 op, 1/2 op, 27 ops, ...), rather the customer will view "A / B" as one FLOP (of type divide).

Due to this mischief by the hardware vendor, it behooves the user community to rely on independent testers (testing organizations) that clearly publish their definition of FLOP performance, or how the performance of each FLOP type and each combination of FLOP types is determined.

IMHO, with today's machines, it is ineffective to reduce floating-point performance to a single value "FLOP/s" without relating it to a specific algorithm, benchmark, or well-established application.

From your original post, I will assume you are interested in "what is the FLOPS for the following code?" (run as listed).

To this you would answer "In the first way... 2*size*size*size/time" ***

*** Please note that compiler optimizations may recognize this as a matrix multiply, and substitute the expressed code loops with a call to a multi-threaded library such as MKL, which in turn may elect to offload to a faster device. Considering all this, the term FLOPS by itself becomes meaningless.
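
For what it is worth, if the goal is to measure the machine rather than this particular loop nest, the library can also be called directly. A rough sketch, assuming MKL's CBLAS interface is available (the sizes and initialization values are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <mkl.h>

int main(void)
{
    int n = 3000;
    float *A = malloc((size_t)n * n * sizeof(float));
    float *B = malloc((size_t)n * n * sizeof(float));
    float *C = calloc((size_t)n * n, sizeof(float));
    for (long i = 0; i < (long)n * n; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    double t0 = omp_get_wtime();
    /* C = 1.0*A*B + 0.0*C, row-major, no transposes */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0f, A, n, B, n, 0.0f, C, n);
    double t1 = omp_get_wtime();

    printf("sgemm: %.3f s, %.2f GFLOPS (nominal 2*n^3 count)\n",
           t1 - t0, 2.0 * (double)n * n * n / (t1 - t0) / 1e9);

    free(A); free(B); free(C);
    return 0;
}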

Jim Dempsey
