Intel® oneAPI Math Kernel Library

Benchmarking MKL GEMV

Richard_S_7
Beginner

Hello,

I am trying to compare my own implementation of GEMV with the MKL. For benchmarking I use the following code:

size_t M = 64; // rows
size_t N = 2; // columns

// allocate memory
float *matrix = (float*) mkl_malloc(M*N * sizeof(float), 64);
float *vector = (float*) mkl_malloc(N   * sizeof(float), 64);
float *result = (float*) mkl_malloc(M   * sizeof(float), 64);

// execute warm up calls
for (size_t i = 0; i < NUM_WARMUPS; ++i) {
    cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N, 1.0f,
                matrix, N,
                vector, 1,
                0.0f,
                result, 1);
}

// measure runtime
float avg_runtime = 0;
for (size_t i = 0; i < NUM_EVALUATIONS; ++i) {
    auto start = dsecnd();
    cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N, 1.0f,
                matrix, N,
                vector, 1,
                0.0f,
                result, 1);
    auto end = dsecnd();
    float runtime = (end - start) * 1000;

    avg_runtime += runtime;
}
avg_runtime /= NUM_EVALUATIONS;
std::cout << "avg_runtime: " << avg_runtime << std::endl;

// free buffers
mkl_free(matrix);
mkl_free(vector);
mkl_free(result);

On my system this gives an average runtime of around 0.0003 ms, with the first evaluation taking around 0.002 ms. Because the average seemed really fast, even for this small input size, I printed the runtimes of all 200 evaluations to make sure my calculation of the average value was correct. If I add a

std::cout << runtime << std::endl;

right after the runtime calculation (line 29 of the listing above), the measured runtimes are much higher and every one of the 200 evaluations takes around 0.002 ms. This seems more plausible compared to other libraries and to my own implementation.

It seems like the compiler applies some optimization because it notices that I call the routine with the exact same input multiple times. Can anyone confirm this? What is the suggested way of benchmarking MKL routines?

Thanks in advance!

TimP
Honored Contributor III

Yes, with optimization passes enabled, the compiler may skip operations whose results you don't keep or examine.

SergeyKostrov
Valued Contributor II
You can use the volatile keyword to work around some optimization issues, especially when the -O3 command-line option is used. Another thing related to the test case you've posted is that the pointers to the matrix, vector, and result arrays are not 64-byte aligned.
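
For illustration, a minimal sketch of the volatile idea (an illustration reusing the buffers and constants from the original post, not code from this thread): a volatile sink that consumes part of the output forces the compiler to keep every timed call.

volatile float sink = 0.0f;
for (size_t i = 0; i < NUM_EVALUATIONS; ++i) {
    cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N, 1.0f,
                matrix, N, vector, 1, 0.0f, result, 1);
    sink = sink + result[0]; // read through a volatile object; cannot be optimized away
}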
McCalpinJohn
Honored Contributor III

I recommend including a "dummy" variable that uses some of the output of the routine under test.

  • Initialize to zero before the tests.
  • Add the first element of the output array to the dummy variable after each call to "cblas_sgemv".
  • Print the dummy variable at the end of the program.

You should also initialize the contents of "matrix", "vector", and "result" before using them.  If you don't explicitly initialize the input arrays, you may get undesirable/incomprehensible behaviors.  The output array will be initialized by the first "warm-up" call, but it is a good habit to explicitly initialize every array that you are going to use anyway. 
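
A minimal sketch of this approach (an illustration only, reusing M, N, matrix, vector, result, and NUM_EVALUATIONS from the original post):

// initialize the inputs and the output once, after allocation
for (size_t i = 0; i < M * N; ++i) matrix[i] = 1.0f;
for (size_t i = 0; i < N; ++i)     vector[i] = 2.0f;
for (size_t i = 0; i < M; ++i)     result[i] = 0.0f;

float dummy = 0.0f;        // initialized to zero before the tests
float avg_runtime = 0.0f;
for (size_t i = 0; i < NUM_EVALUATIONS; ++i) {
    double start = dsecnd();
    cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N, 1.0f,
                matrix, N, vector, 1, 0.0f, result, 1);
    double end = dsecnd();
    avg_runtime += (end - start) * 1000;

    dummy += result[0];    // use some of the output after each call
}
avg_runtime /= NUM_EVALUATIONS;
std::cout << "avg_runtime: " << avg_runtime << " dummy: " << dummy << std::endl; // print at the end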

Richard_S_7
Beginner

Thank you for your feedback.

Sergey Kostrov wrote:

Another thing related to the test case you've posted is that the pointers to the matrix, vector, and result arrays are not 64-byte aligned.

Can you elaborate on this? I assumed that allocating the arrays by using

mkl_malloc(M*N * sizeof(float), 64)

for example made the pointers aligned. Am I wrong?
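
For what it is worth, a quick way to check the alignment of a returned pointer (a sketch, not from the original post):

#include <cstdint>
// prints 0 if the pointer is 64-byte aligned
std::cout << (reinterpret_cast<std::uintptr_t>(matrix) % 64) << std::endl;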

McCalpin, John wrote:

I recommend including a "dummy" variable that uses some of the output of the routine under test.

  • Initialize to zero before the tests.
  • Add the first element of the output array to the dummy variable after each call to "cblas_sgemv".
  • Print the dummy variable at the end of the program.

You should also initialize the contents of "matrix", "vector", and "result" before using them.  If you don't explicitly initialize the input arrays, you may get undesirable/incomprehensible behaviors.  The output array will be initialized by the first "warm-up" call, but it is a good habit to explicitly initialize every array that you are going to use anyway. 

I changed my code a bit to accommodate your suggestions. Now, I

  • fill the input matrix and vector with random values for each evaluation.
  • fill the output vector with zeros for each evaluation.
  • compare the result of the MKL call to a sequential implementation of GEMV.

These changes did not affect the benchmarking results. Even when comparing the result to the sequential implementation in every evaluation, I still get an average runtime of around 0.0003 ms. Only adding the printing of the runtime changes the measured times, and that is what I find hard to understand: why does adding console output slow down the benchmark, even though it happens outside the timed code lines, while adding a result check does not?

McCalpinJohn
Honored Contributor III

You may still be running into a dead code elimination issue, though this can be hard to diagnose without seeing the full code.  The compiler can be very clever about seeing whether you are using results or not, and can remove code in unexpected places.

My suggestion about the "dummy" variable was intended to force a true data dependence across all of the iterations of the loop.  This usually works, but it is a bit of a hack and is not guaranteed to produce the desired results.   

  • With beta=0, the code automatically overwrites the previous values of the output array, so there is no need to zero it after the one-time (post-allocation) initialization. 
  • You also don't want to re-initialize the inputs -- this just tells the compiler that only the result of the final iteration of the loop is needed.
  • If you want to ensure that all of the operations requested are actually done, you need to modify the code so that output of each loop iteration feeds into the input of the next loop iteration. 
    • This would happen automatically if the "beta" parameter were non-zero, but it is also easy to do by simply copying the output vector to the input vector after each iteration, and then remembering to use the final output vector to compute something that is printed before you exit.
    • A sum of the values of the output vector is a convenient thing to print.

You mentioned that you added a comparison of the result of the MKL call to the result of a sequential implementation, but preventing dead code elimination still requires that each of the NUM_EVALUATIONS iterations feeds into the next, and that the comparison of the results be used in some way that has a side effect -- e.g., printing, writing to a file, or use in a logical test that feeds into a print or I/O write.
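
A minimal sketch of the feedback pattern described above (an illustration only; it reuses the buffers from the original post and assumes they have been initialized):

float avg_runtime = 0.0f;
for (size_t i = 0; i < NUM_EVALUATIONS; ++i) {
    double start = dsecnd();
    cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N, 1.0f,
                matrix, N, vector, 1, 0.0f, result, 1);
    double end = dsecnd();
    avg_runtime += (end - start) * 1000;

    // the output of this iteration becomes part of the input of the next one
    for (size_t k = 0; k < N; ++k)
        vector[k] = result[k];
}
avg_runtime /= NUM_EVALUATIONS;

// checksum with a side effect (printing), so none of the calls can be removed
float checksum = 0.0f;
for (size_t i = 0; i < M; ++i) checksum += result[i];
std::cout << "avg_runtime: " << avg_runtime << " checksum: " << checksum << std::endl;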

 

SergeyKostrov
Valued Contributor II
>>...Can you elaborate on this?

Use __attribute__( ( aligned( 64 ) ) ) in the case of a GCC-compatible compiler and __declspec( align( 64 ) ) in the case of an MS-compatible compiler for a pointer.
Richard_S_7
Beginner

This is my current code:

size_t M = 64; // rows
size_t N = 2; // columns

// fill arrays with random values
float *matrix __attribute__( ( aligned( 64 ) ) );
float *vector __attribute__( ( aligned( 64 ) ) );
float *result __attribute__( ( aligned( 64 ) ) );
matrix = (float *) mkl_malloc(M * N * sizeof(float), 64); fill(matrix, M * N);
vector = (float *) mkl_malloc(N     * sizeof(float), 64); fill(vector, N);
result = (float *) mkl_malloc(M     * sizeof(float), 64); zero(result, M);

// execute warm up MKL calls
for (size_t i = 0; i < NUM_WARMUPS; ++i) {
    cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N, 1.0f,
                matrix, N,
                vector, 1,
                0.0f,
                result, 1);
}

// execute MKL call and measure execution time
float avg_runtime = 0;
std::vector<float> runtimes;
for (size_t i = 0; i < NUM_EVALUATIONS; ++i) {
    
    // measure execution time of MKL call
    auto start = std::chrono::high_resolution_clock::now();
    cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N, 1.0f,
                matrix, N,
                vector, 1,
                0.0f,
                result, 1);
    auto end = std::chrono::high_resolution_clock::now();
    auto elapsed_nanoseconds = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    float runtime = elapsed_nanoseconds / 1000000.0f;

    // compare result to a sequential implementation of GEMV
    std::vector<float> res_glb_gold(M); zero(res_glb_gold.data(), M);
    seq_gemv(matrix, vector, M, N, res_glb_gold.data());
    float max_error_abs, max_error_rel;
    check_rel_error(true, max_error_abs, max_error_rel, result, res_glb_gold.data(), M);
    std::cout << "abs error: " << max_error_abs << ", rel error: " << max_error_rel << std::endl;

    // use result as input of next evaluation to prevent compiler optimizations
    if (i < NUM_EVALUATIONS - 1) {
        for (size_t k = 0; k < std::max(M, N); ++k) {
            vector[k % N] = result[k % M];
            matrix[k % (M * N)] = result[k % M];
        }
    }

    runtimes.push_back(runtime);
    avg_runtime += runtime;
}
avg_runtime /= NUM_EVALUATIONS;

for (size_t i = 0; i < NUM_EVALUATIONS; ++i) {
    std::cout << runtimes[i] << std::endl;
}
std::cout << "avg_runtime: " << avg_runtime << std::endl;
float sum = 0;
for (size_t i = 0; i < M; ++i) sum += result[i];
std::cout << "sum: " << sum << std::endl;

// free buffers
mkl_free(matrix);
mkl_free(vector);
mkl_free(result);

I tried to incorporate your suggestions. With this code I get the slower runtimes of around 0.002 ms. That, however, only seems to depend on printing the error values (the std::cout of the error values in the result check). Without this output, even when I store the error values in a buffer and print them after all evaluations, I get a runtime that is almost 10 times faster.

Is this the recommended way of benchmarking an MKL routine? Maybe the faster values are correct; I am just not sure which values to use for a fair comparison.

McCalpinJohn
Honored Contributor III

It is always a good idea to have some estimate of the amount of work required and of the time (or number of processor cycles) expected to complete that work.

For this computation with these array sizes, the minimum execution time depends on the SIMD width.   For 64 rows and 2 columns, computing the 64 output values should only take 64 cycles for scalar code, 16 cycles with 128-bit SIMD, 8 cycles with 256-bit SIMD, and 4 cycles with 512-bit SIMD.   I don't know how well the compiler manages the optimization of this code, but 0.002 milliseconds is 2 microseconds or about 6000 cycles at 3 GHz.   This suggests that you are mostly measuring overhead.

 

Ying_H_Intel
Employee

Hi Richard,

Right, small-matrix performance is volatile, so we recommend adding warm-up calls and taking the average time, as described in this article:

https://software.intel.com/en-us/articles/a-simple-example-to-measure-the-performance-of-an-intel-mkl-function

The Intel® Math Kernel Library (Intel® MKL) is multi-threaded and employs internal buffers for fast memory allocation. Typically the first subroutine call initializes the threads and internal buffers. Therefore, the first function call may take more time compared to the subsequent calls with the same arguments. Although the initialization time is usually insignificant compared to the execution time of DGEMM for large matrices, it can be substantial when timing DGEMM for small matrices.

In your test case, do you mean that the runtime of one of the calls is 10x slower than the average time? Could you please attach your complete code and tell us your processor, compiler options, etc.?

I did a test on an Intel(R) Core(TM) i5-7600 CPU @ 3.50GHz (4 cores), Ubuntu 16.10, icc version 18.0.0 (gcc version 6.0.0 compatibility), and MKL 2018 beta, as below. I ran it several times and, overall, the average time looks fine.

Intel(R) MKL 2018.0 Product build 20170601 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 3.50GHz lp64 intel_thread NMICDev:0

>icc GEMV_d.cpp -mkl

>export OMP_NUM_THREADS=1

>export KMP_AFFINITY=granularity=fine,compact,1,0
>a.out

Best Regards,

Ying

#include <stdio.h>
#include <stdlib.h>
#include<iostream>
#include <time.h>
#include <mkl.h>
#include <omp.h>
#include <vector>
#include <chrono>
#include <algorithm>
#define NUM_EVALUATIONS 1000
#define NUM_WARMUPS 10

int main(int argc, char* argv[])
{
 size_t M = 64; // rows
 size_t N = 2; // columns

      // allocate memory
 float *matrix = (float*)mkl_malloc(M*N * sizeof(float), 64);
 float *vector = (float*)mkl_malloc(N * sizeof(float), 64);
 float *result = (float*)mkl_malloc(M * sizeof(float), 64);

 // execute warm up calls
 for (size_t i = 0; i < NUM_WARMUPS; ++i) {
  cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N, 1.0f,
   matrix, N,
   vector, 1,
   0.0f,
   result, 1);
 }

 // measure runtime
 float avg_runtime = 0;
 for (size_t i = 0; i < NUM_EVALUATIONS; ++i) {
  auto start = dsecnd();
  cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N, 1.0f,
   matrix, N,
   vector, 1,
   0.0f,
   result, 1);
  auto end = dsecnd();
  float runtime = (end - start) * 1000;
  std::cout << runtime << std::endl;
  avg_runtime += runtime;
  // use result as input of next evaluation to prevent compiler optimizations
  for (size_t k = 0; k < std::max(M, N); ++k) {
   vector[k % N] = result[k % M];
   matrix[k % (M * N)] = result[k % M];
  }
  // flag iterations that take noticeably longer than usual
  if (runtime > 0.001) std::cout << "------------num iteration ------------" << i << std::endl;
 }
 avg_runtime /= NUM_EVALUATIONS;
 std::cout << "avg_runtime: " << avg_runtime << std::endl;
 // free buffers
 mkl_free(matrix);
 mkl_free(vector);
 mkl_free(result);

 
}


> icc GEMV_d.cpp -mkl

yhu5@kbl01-ub:~/GEMV$ ./a.out
0.00280188
------------num iteration = ==--------0
0.000328291
0.000223052
0.000256579
0.000276603
0.000254717
0.000272412
0.000256579
0.000273809
0.000254717
0.000275671
0.000255182
0.000279862
0.000257976
0.00027474
0.000254251
0.000278
0.000256114
0.000276137
0.000257045
0.000278931
0.000257045
0.000276137
0.000257511
0.000272878
0.000257045
0.00027474
0.000257511
0.000271481
0.000255648
0.000275206
0.000257976
0.000277068
0.000256579
0.000275671
0.000258442
0.000277534
0.000256114
0.000275206
0.000256579
0.000275206
0.000254717
0.000274274
0.000256579
0.000275671
0.000256579
0.000276137
0.000257976
0.000272878
0.000257045
0.000272878
0.000257511
0.000276603
0.000257511
0.000275671
0.000257045
0.000276603
0.000255648
0.000276603
0.000257976
0.000278
0.000259373
0.000279862
0.000258442
0.000279862
0.000259373
0.000279862
0.000257976
0.000278
0.000258442
0.000279397
0.000256114
0.000280794
0.000259373
0.000280328
0.000257045
0.000280794
0.000250991
0.000275671
0.000255648
0.000272412
0.000258908
0.000274274
0.000257511
0.000276603
0.000255182
0.000277534
0.000258908
0.000277068
0.000257976
0.000278465
0.000258908
0.000275206
0.00026077
0.000275671
0.000258908
0.000280328
0.000258442
0.000272412
0.000256579
0.000276137
0.000257045
0.00027474
0.000256114
0.000275206
0.000256114
0.000274274
0.000256114
0.000275671
0.000256579
0.000275671
0.000255182
0.000277068
0.000255648
0.000275671
0.000256114
0.000276137
0.000256579
0.00027474
0.000256114
0.000272412
0.000256114
0.000276137
0.000256579
0.000276603
0.000257045
0.000277068
0.000251923
0.000275671
0.000255648
0.000275671
0.000255648
0.000275671
0.000255648
0.000275671
0.000256579
0.000277534
0.000252854
0.000277068
0.000257045
0.000276137
0.000252854
0.000275206
0.000257045
0.000270549
0.000258908
0.000271946
0.000254717
0.000274274
0.000252854
0.000275206
0.000256114
0.000278
0.000257511
0.000272878
0.000256114
0.000275206
0.000256114
0.000275671
0.000250526
0.000277534
0.000255648
0.000275206
0.000257511
0.000275206
0.000257511
0.000277534
0.000257045
0.000277068
0.000256114
0.000276137
0.000255648
0.000275206
0.000256114
0.000273343
0.000257045
0.000270549
0.000256579
0.000275671
0.000252854
0.000275206
0.000252854
0.000276137
0.000254717
0.000277068
0.000256114
0.00027474
0.000255182
0.000275206
0.000256579
0.000275671
0.000252854
0.000278
0.000256579
0.000275671
0.000256579
0.000271481
0.000255182
0.000270549
0.000255648
0.000271015
0.000256114
0.000277068
0.000256579
0.000275671
0.000257976
0.000278
0.000256114
0.000275671
0.000256579
0.000276603
0.000255182
0.000274274
0.000256579
0.000275206
0.000256114
0.000272412
0.000257511
0.000272412
0.000257045
0.000272878
0.000257045
0.000278931
0.000255648
0.000275671
0.000255648
0.000275206
0.000255648
0.000274274
0.000256579
0.000273809
0.000256579
0.000276137
0.000254717
0.000276603
0.000252388
0.000276137
0.000257511
0.000275671
0.000256579
0.000276137
0.000256114
0.000271481
0.000256114
0.00027474
0.000256114
0.000275671
0.000257511
0.000273809
0.000257045
0.000273809
0.000256579
0.000275671
0.000257976
0.000273343
0.000255648
0.000278
0.000255648
0.000275206
0.000257976
0.000278
0.000256114
0.000276603
0.000256579
0.000276603
0.000257976
0.000275671
0.000252388
0.000275671
0.000257045
0.000275671
0.000257045
0.000271946
0.000257976
0.000277068
0.000256579
0.000275671
0.000256114
0.000276603
0.000254717
0.000274274
0.000252388
0.000276137
0.000252854
0.000276603
0.000257045
0.000278
0.000257045
0.000271481
0.000256114
0.000272878
0.000257045
0.000272412
0.000256114
0.000276137
0.000252854
0.000276603
0.000255648
0.000272878
0.000257045
0.000277068
0.000255648
0.000277068
0.000256579
0.000271481
0.000257045
0.000271015
0.000256114
0.000271481
0.000258442
0.000276137
0.000255182
0.000273809
0.000257976
0.000271946
0.000256114
0.000276137
0.000256579
0.000276603
0.000251923
0.000275206
0.000252854
0.000277068
0.00025332
0.000277068
0.000259373
0.000278
0.00025332
0.00027474
0.000256114
0.000272878
0.000257976
0.000275206
0.000252388
0.000276603
0.000257976
0.000277534
0.000252854
0.000276137
0.00025332
0.000277068
0.000257045
0.000271946
0.000257045
0.000276137
0.000257045
0.000271481
0.000256114
0.000271946
0.000256114
0.000274274
0.000257045
0.000273343
0.000257045
0.000272412
0.000256579
0.000271015
0.000256579
0.000275206
0.000255648
0.000275206
0.000257045
0.000275671
0.000252388
0.000276603
0.000252388
0.000277534
0.000251923
0.000275671
0.00025332
0.000277534
0.000253785
0.000276137
0.00025332
0.000274274
0.000256114
0.000271481

...

avg_runtime: 0.000268196
 

 

Richard_S_7
Beginner

Hello Ying,

Thank you for your input. I am executing my code on an Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz; the OS is CentOS Linux release 7.3.1611 and the compiler is clang. The compiler call is:

clang++ main.cpp -I/path/to/mkl/include/ -L/path/to/mkl/lib/intel64/ -lmkl -liomp5

The complete code is as follows:

#include <cstddef>
#include <mkl.h>
#include <cstdio>
#include <iostream>
#include <vector>
#include <unistd.h>
#include <algorithm> // for std::max

#define NUM_EVALUATIONS 200
#define NUM_WARMUPS 3

int main() {
    size_t M = 64; // rows
    size_t N = 2; // columns

    // fill arrays with random values
    float *matrix __attribute__( ( aligned( 64 ) ) );
    float *vector __attribute__( ( aligned( 64 ) ) );
    float *result __attribute__( ( aligned( 64 ) ) );
    matrix = (float *) mkl_malloc(M * N * sizeof(float), 64); fill(matrix, M * N);
    vector = (float *) mkl_malloc(N     * sizeof(float), 64); fill(vector, N);
    result = (float *) mkl_malloc(M     * sizeof(float), 64); zero(result, M);

    // execute warm up MKL calls
    for (size_t i = 0; i < NUM_WARMUPS; ++i) {
        cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N, 1.0f,
                    matrix, N,
                    vector, 1,
                    0.0f,
                    result, 1);
    }

    // execute MKL call and measure execution time
    double avg_runtime = 0;
    std::vector<float> runtimes;
    for (size_t i = 0; i < NUM_EVALUATIONS; ++i) {

        auto start = dsecnd();
        cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N, 1.0f,
                    matrix, N,
                    vector, 1,
                    0.0f,
                    result, 1);
        auto end = dsecnd();
        double runtime = (end - start) * 1000;

        // use result as input of next evaluation to prevent compiler optimizations
        if (i < NUM_EVALUATIONS - 1) {
            for (size_t k = 0; k < std::max(M, N); ++k) {
                vector[k % N] = result[k % M];
                matrix[k % (M * N)] = result[k % M];
            }
        }

//        std::cout << runtime << std::endl;
        avg_runtime += runtime;
    }
    avg_runtime /= NUM_EVALUATIONS;

    float sum = 0;
    for (size_t i = 0; i < M; ++i) sum += result[i];
    std::cout << "sum: " << sum << std::endl;
    std::cout << "avg_runtime: " << avg_runtime << std::endl;
    double gflop = (M * (2 * N - 1)) * 1E-9;
    std::cout << "GFlop/s: " << gflop / avg_runtime * 1000.0 << std::endl;

    // free buffers
    mkl_free(matrix);
    mkl_free(vector);
    mkl_free(result);
}

Ying H. (Intel) wrote:

In your test case, do you mean that the runtime of one of the calls is 10x slower than the average time?

Yes, the 10x difference occurs only in the first call. However, when I uncomment the std::cout << runtime << std::endl; line (line 54 of the listing above), I get a runtime that is approximately 2x slower. So when I don't print all the runtimes I get around 0.46 GFlop/s, but with printing I get 0.26 GFlop/s. That is the only remaining issue. I am not sure which values are "correct" and should be used in a comparison.

Ying_H_Intel
Employee

Hi Richard,

Got it. Do you have Hyper-Threading (HT) enabled on the Xeon machine?

Anyway, please try

>export OMP_NUM_THREADS=1

>export KMP_AFFINITY=granularity=fine,compact,1,0
>a.out

As the matrix size is small, it is better to pin the threads to fixed CPUs so that you get stable results.

https://software.intel.com/en-us/mkl-linux-developer-guide-using-intel-hyper-threading-technology
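
An alternative to the environment variables above (a sketch, not from the original reply) is to force single-threaded MKL from inside the program with the MKL service function mkl_set_num_threads:

#include <mkl.h>
// limit MKL to one thread for the small-matrix benchmark
mkl_set_num_threads(1);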

Best Regards,

Ying
