Hello,
I am trying to compare my own GEMV implementation with MKL's. For benchmarking I use the following code:
size_t M = 64; // rows
size_t N = 2;  // columns

// allocate memory
float *matrix = (float*) mkl_malloc(M*N * sizeof(float), 64);
float *vector = (float*) mkl_malloc(N * sizeof(float), 64);
float *result = (float*) mkl_malloc(M * sizeof(float), 64);

// execute warm up calls
for (size_t i = 0; i < NUM_WARMUPS; ++i) {
    cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N, 1.0f, matrix, N, vector, 1, 0.0f, result, 1);
}

// measure runtime
float avg_runtime = 0;
for (size_t i = 0; i < NUM_EVALUATIONS; ++i) {
    auto start = dsecnd();
    cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N, 1.0f, matrix, N, vector, 1, 0.0f, result, 1);
    auto end = dsecnd();
    float runtime = (end - start) * 1000;
    avg_runtime += runtime;
}
avg_runtime /= NUM_EVALUATIONS;
std::cout << "avg_runtime: " << avg_runtime << std::endl;

// free buffers
mkl_free(matrix);
mkl_free(vector);
mkl_free(result);
On my system this gives me an average runtime of around 0.0003ms with the first evaluation taking around 0.002ms. Because the average seemed really fast, even for the small input size, I printed the runtimes of all 200 evaluations to make sure my calculation of the average value was correct. If I add a
std::cout << runtime << std::endl;
in line 29, the measured runtimes are much higher, and every one of the 200 evaluations takes around 0.002 ms. This seems more plausible compared to other libraries and my own implementation.
It seems like the compiler applies some optimization to my code because it notices that I call the routine with the exact same input multiple times. Can anyone confirm this? What is the suggested way of benchmarking MKL routines?
Thanks in advance!
Yes, with optimization passes enabled, the compiler may skip operations whose results you don't keep or examine.
I recommend including a "dummy" variable that uses some of the output of the routine under test.
- Initialize to zero before the tests.
- Add the first element of the output array to the dummy variable after each call to "cblas_sgemv".
- Print the dummy variable at the end of the program.
You should also initialize the contents of "matrix", "vector", and "result" before using them. If you don't explicitly initialize the input arrays, you may get undesirable/incomprehensible behaviors. The output array will be initialized by the first "warm-up" call, but it is a good habit to explicitly initialize every array that you are going to use anyway.
Thank you for your feedback.
Sergey Kostrov wrote:
Another thing related to the test case you've posted is that pointers to matrix, vector and results arrays are not 64-byte aligned.
Can you elaborate on this? I assumed that allocating the arrays by using
mkl_malloc(M*N * sizeof(float), 64)
for example made the pointers aligned. Am I wrong?
McCalpin, John wrote:
I recommend including a "dummy" variable that uses some of the output of the routine under test.
- Initialize to zero before the tests.
- Add the first element of the output array to the dummy variable after each call to "cblas_sgemv".
- Print the dummy variable at the end of the program.
You should also initialize the contents of "matrix", "vector", and "result" before using them. If you don't explicitly initialize the input arrays, you may get undesirable/incomprehensible behaviors. The output array will be initialized by the first "warm-up" call, but it is a good habit to explicitly initialize every array that you are going to use anyway.
I changed my code a bit to accommodate your suggestions. Now, I
- fill the input matrix and vector with random values for each evaluation.
- fill the output vector with zeros for each evaluation.
- compare the result of the MKL call to a sequential implementation of GEMV.
These changes did not affect the benchmarking results. Even when comparing the results to the sequential implementation for every evaluation, I still get an average runtime of around 0.0003 ms. Only adding the printing of the runtime changes the measured runtimes, and that is what I find so hard to understand: why does adding console output slow down the benchmark, although it is done outside the profiled code lines, while adding a result check does not?
You may still be running into a dead code elimination issue, though this is hard to diagnose without seeing the full code. The compiler can be very clever about seeing whether you are using results or not, and can remove code in unexpected places.
My suggestion about the "dummy" variable was intended to force a true data dependence across all of the iterations of the loop. This usually works, but it is a bit of a hack and is not guaranteed to produce the desired results.
- With beta=0, the code automatically overwrites the previous values of the output array, so there is no need to zero it after the one-time (post-allocation) initialization.
- You also don't want to re-initialize the inputs -- this just tells the compiler that only the result of the final iteration of the loop is needed.
- If you want to ensure that all of the operations requested are actually done, you need to modify the code so that output of each loop iteration feeds into the input of the next loop iteration.
- This would happen automatically if the "beta" parameter were non-zero, but it is also easy to do by simply copying the output vector to the input vector after each iteration, and then remembering to use the final output vector to compute something that is printed before you exit.
- A sum of the values of the output vector is a convenient thing to print.
You mentioned that you added a comparison of the result of the MKL call to the result of a sequential implementation, but this still requires that each of the NUM_EVALUATIONS iterations feeds into the next, and still requires that the comparison of the results be used in some way that has a side effect -- e.g., printing, writing to a file, use in a logical test that feeds into a print or IO write, etc.
This is my current code:
size_t M = 64; // rows
size_t N = 2;  // columns

// fill arrays with random values (fill/zero are user-defined helpers)
float *matrix __attribute__( ( aligned( 64 ) ) );
float *vector __attribute__( ( aligned( 64 ) ) );
float *result __attribute__( ( aligned( 64 ) ) );
matrix = (float *) mkl_malloc(M * N * sizeof(float), 64);
fill(matrix, M * N);
vector = (float *) mkl_malloc(N * sizeof(float), 64);
fill(vector, N);
result = (float *) mkl_malloc(M * sizeof(float), 64);
zero(result, M);

// execute warm up MKL calls
for (size_t i = 0; i < NUM_WARMUPS; ++i) {
    cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N, 1.0f, matrix, N, vector, 1, 0.0f, result, 1);
}

// execute MKL call and measure execution time
float avg_runtime = 0;
std::vector<float> runtimes;
for (size_t i = 0; i < NUM_EVALUATIONS; ++i) {
    // measure execution time of MKL call
    auto start = std::chrono::high_resolution_clock::now();
    cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N, 1.0f, matrix, N, vector, 1, 0.0f, result, 1);
    auto end = std::chrono::high_resolution_clock::now();
    auto elapsed_nanoseconds = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    float runtime = elapsed_nanoseconds / 1000000.0f;

    // compare result to a sequential implementation of GEMV
    std::vector<float> res_glb_gold(M);
    zero(res_glb_gold.data(), M);
    seq_gemv(matrix, vector, M, N, res_glb_gold.data());
    float max_error_abs, max_error_rel;
    check_rel_error(true, max_error_abs, max_error_rel, result, res_glb_gold.data(), M);
    std::cout << "abs error: " << max_error_abs << ", rel error: " << max_error_rel << std::endl;

    // use result as input of next evaluation to prevent compiler optimizations
    if (i < NUM_EVALUATIONS - 1) {
        for (size_t k = 0; k < std::max(M, N); ++k) {
            vector[k % N] = result[k % M];
            matrix[k] = result[k % M];
        }
    }
    runtimes.push_back(runtime);
    avg_runtime += runtime;
}
avg_runtime /= NUM_EVALUATIONS;

for (size_t i = 0; i < NUM_EVALUATIONS; ++i) {
    std::cout << runtimes[i] << std::endl;
}
std::cout << "avg_runtime: " << avg_runtime << std::endl;

float sum = 0;
for (size_t i = 0; i < M; ++i) sum += result[i];
std::cout << "sum: " << sum << std::endl;

// free buffers
mkl_free(matrix);
mkl_free(vector);
mkl_free(result);
I tried to incorporate your suggestions. With this code I get the slower runtimes of around 0.002 ms. However, that only seems to depend on printing the error values in line 41. Without this output (even when I store the error values in a buffer and print them after all evaluations), I get a runtime that is almost 10 times faster.
Is this the recommended way of benchmarking an MKL routine? Maybe the faster values are correct; I am just not sure which values to use for a fair comparison.
It is always a good idea to have some estimate of the amount of work required and the amount of time (or number of processor cycles) that are expected to be required to complete that work.
For this computation with these array sizes, the minimum execution time depends on the SIMD width. For 64 rows and 2 columns, computing the 64 output values should only take 64 cycles for scalar code, 16 cycles with 128-bit SIMD, 8 cycles with 256-bit SIMD, and 4 cycles with 512-bit SIMD. I don't know how well the compiler manages the optimization of this code, but 0.002 milliseconds is 2 microseconds or about 6000 cycles at 3 GHz. This suggests that you are mostly measuring overhead.
Hi Richard,
Right, the small-matrix performance is volatile, so we recommend adding warm-up calls and taking the average time, as described in the article:
The Intel® Math Kernel Library (Intel® MKL) is multi-threaded and employs internal buffers for fast memory allocation. Typically the first subroutine call initializes the threads and internal buffers. Therefore, the first function call may take more time compared to the subsequent calls with the same arguments. Although the initialization time is usually insignificant compared to the execution time of DGEMM for large matrices, it can be substantial when timing DGEMM for small matrices.
In your test case, do you mean that one of the runtimes is 10x slower than the average time? Could you please attach your complete code, your processor model, your compiler options, etc.?
I tested on an Intel(R) Core(TM) i5-7600 CPU @ 3.50GHz (4 cores), Ubuntu 16.10, icc version 18.0.0 (gcc version 6.0.0 compatibility), MKL 2018 beta, as below. I ran it several times, and overall the average time looks fine.
Intel(R) MKL 2018.0 Product build 20170601 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 3.50GHz lp64 intel_thread NMICDev:0
>icc GEMV_d.cpp -mkl
>export OMP_NUM_THREADS=1
>export KMP_AFFINITY=granularity=fine,compact,1,0
>a.out
Best Regards,
Ying
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <time.h>
#include <mkl.h>
#include <omp.h>
#include <vector>
#include <chrono>
#include <algorithm>

#define NUM_EVALUATIONS 1000
#define NUM_WARMUPS 10

int main(int argc, char* argv[])
{
    size_t M = 64; // rows
    size_t N = 2;  // columns

    // allocate memory
    float *matrix = (float*)mkl_malloc(M*N * sizeof(float), 64);
    float *vector = (float*)mkl_malloc(N * sizeof(float), 64);
    float *result = (float*)mkl_malloc(M * sizeof(float), 64);

    // execute warm up calls
    for (size_t i = 0; i < NUM_WARMUPS; ++i) {
        cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N, 1.0f, matrix, N, vector, 1, 0.0f, result, 1);
    }

    // measure runtime
    float avg_runtime = 0;
    for (size_t i = 0; i < NUM_EVALUATIONS; ++i) {
        auto start = dsecnd();
        cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N, 1.0f, matrix, N, vector, 1, 0.0f, result, 1);
        auto end = dsecnd();
        float runtime = (end - start) * 1000;
        std::cout << runtime << std::endl;
        avg_runtime += runtime;
        if (runtime > 0.001)
            std::cout << "------------num iteration ------------" << i << std::endl;

        // use result as input of next evaluation to prevent compiler optimizations
        for (size_t k = 0; k < std::max(M, N); ++k) {
            vector[k % N] = result[k % M];
            matrix[k] = result[k % M];
        }
    }
    avg_runtime /= NUM_EVALUATIONS;
    std::cout << "avg_runtime: " << avg_runtime << std::endl;

    // free buffers
    mkl_free(matrix);
    mkl_free(vector);
    mkl_free(result);
}
> icc GEMV_d.cpp -mkl
yhu5@kbl01-ub:~/GEMV$ ./a.out
0.00280188
------------num iteration ------------0
0.000328291
0.000223052
0.000256579
0.000276603
0.000254717
0.000272412
0.000256579
0.000273809
0.000254717
0.000275671
0.000255182
0.000279862
0.000257976
0.00027474
0.000254251
0.000278
0.000256114
0.000276137
...
avg_runtime: 0.000268196
Hello Ying,
thank you for your input. I am executing my code on an Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz, the OS is CentOS Linux release 7.3.1611 and the compiler is clang. The compiler call is
clang++ main.cpp -I/path/to/mkl/include/ -L/path/to/mkl/lib/intel64/ -lmkl -liomp5
The complete code is as follows:
#include <cstddef>
#include <mkl.h>
#include <cstdio>
#include <iostream>
#include <vector>
#include <unistd.h>
#include <algorithm>

#define NUM_EVALUATIONS 200
#define NUM_WARMUPS 3

int main()
{
    size_t M = 64; // rows
    size_t N = 2;  // columns

    // fill arrays with random values (fill/zero are user-defined helpers)
    float *matrix __attribute__( ( aligned( 64 ) ) );
    float *vector __attribute__( ( aligned( 64 ) ) );
    float *result __attribute__( ( aligned( 64 ) ) );
    matrix = (float *) mkl_malloc(M * N * sizeof(float), 64);
    fill(matrix, M * N);
    vector = (float *) mkl_malloc(N * sizeof(float), 64);
    fill(vector, N);
    result = (float *) mkl_malloc(M * sizeof(float), 64);
    zero(result, M);

    // execute warm up MKL calls
    for (size_t i = 0; i < NUM_WARMUPS; ++i) {
        cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N, 1.0f, matrix, N, vector, 1, 0.0f, result, 1);
    }

    // execute MKL call and measure execution time
    double avg_runtime = 0;
    std::vector<float> runtimes;
    for (size_t i = 0; i < NUM_EVALUATIONS; ++i) {
        auto start = dsecnd();
        cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N, 1.0f, matrix, N, vector, 1, 0.0f, result, 1);
        auto end = dsecnd();
        double runtime = (end - start) * 1000;

        // use result as input of next evaluation to prevent compiler optimizations
        if (i < NUM_EVALUATIONS - 1) {
            for (size_t k = 0; k < std::max(M, N); ++k) {
                vector[k % N] = result[k % M];
                matrix[k] = result[k % M];
            }
        }
        // std::cout << runtime << std::endl;
        avg_runtime += runtime;
    }
    avg_runtime /= NUM_EVALUATIONS;

    float sum = 0;
    for (size_t i = 0; i < M; ++i) sum += result[i];
    std::cout << "sum: " << sum << std::endl;
    std::cout << "avg_runtime: " << avg_runtime << std::endl;
    double gflop = (M * (2 * N - 1)) * 1E-9;
    std::cout << "GFlop/s: " << gflop / avg_runtime * 1000.0 << std::endl;

    // free buffers
    mkl_free(matrix);
    mkl_free(vector);
    mkl_free(result);
}
Ying H. (Intel) wrote:
In your test case, do you mean, the runtime one times is 10x slower than average time, right?
Yes, the 10x difference occurs only in the first call. However, when uncommenting line 54 of the listing above I get a runtime that is approximately 2x slower. So when I don't print all the runtimes I get around 0.46 GFlop/s, but with printing I get 0.26 GFlop/s. That is the only issue remaining. I am not sure which values are "correct" and should be used in a comparison.
Hi Richard,
Got it. Is Hyper-Threading (HT) switched on on your Xeon machine?
Anyway, please try
>export OMP_NUM_THREADS=1
>export KMP_AFFINITY=granularity=fine,compact,1,0
>a.out
As the matrix size is small, it is better to pin threads to fixed CPUs to get stable results.
https://software.intel.com/en-us/mkl-linux-developer-guide-using-intel-hyper-threading-technology
Best Regards,
Ying