Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Cache, row major and column major

Francois_F_
Beginner
877 Views

I've been testing the difference in the time it takes to sum the elements of a matrix in row-major order

std::vector<double> v( n * n );

// Timing begins
double sum{ 0.0 };
for (std::size_t i = 0; i < n; i++) {
    for (std::size_t j = 0; j < n; j++) {
        sum += v[i * n + j];
    }
}
// Timing ends

and in column-major order

std::vector<double> v( n * n );

// Timing begins
double sum{ 0.0 };
for (std::size_t j = 0; j < n; j++) {
    for (std::size_t i = 0; i < n; i++) {
        sum += v[i * n + j];
    }
}
// Timing ends

The code has been compiled with

g++ -std=c++11 -Ofast -fno-tree-vectorize -DNDEBUG main.cpp -o main

and also with similar settings with icpc (this is more of a hardware question than a compiler question). We expect the row-major traversal (blue) to be significantly faster than the column-major traversal (yellow). If I plot the time it takes to run this algorithm (in nanoseconds) divided by the size in bytes of the array, I get the following graph on my computer, which has a Core i7.

[Graph: time per byte (ns/byte) vs. n on the Core i7; row-major in blue, column-major in yellow]

The x-axis displays n, and the y-axis displays the time in nanoseconds for the summation divided by the size (in bytes) of v. Everything seems normal. The huge difference between the two starts around n = 850, where the matrix is about 6 MB, which is exactly the size of my L3 cache. The column-major traversal is 10 times slower than the row-major one for large n. I am pleased with the results.
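
For those who want to check the arithmetic, here is a quick sanity check of that crossover (a minimal sketch; 8 bytes is sizeof(double), and the sizes other than 850 are just reference points):

#include <cstdio>

int main() {
    // Working-set size of an n x n matrix of doubles, in MB.
    // At n = 850: 850 * 850 * 8 = 5,780,000 bytes, i.e. just under
    // the 6 MB L3 cache, which is where the two curves diverge.
    const unsigned long sizes[] = { 512, 850, 1024, 2048 };
    for (unsigned long n : sizes) {
        std::printf("n = %4lu -> %.2f MB\n", n, 8.0 * n * n / 1e6);
    }
    return 0;
}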

Next, I ran the same program on Amazon Web Services, where they have a Xeon E5-2670. Here are the results of the same program.

[Graph: time per byte (ns/byte) vs. n on the E5-2670; row-major in blue, column-major in yellow]

The column-major traversal is about 10 times slower than the row-major one for 700 <= n <= 2000, but for n >= 2100, the cost per byte of the column-major traversal suddenly drops and it is just 2 times slower than the row-major one!!! Does anyone have an explanation for this strange behaviour?

PS: For those who are interested, the full code is available here: https://www.dropbox.com/s/778hwpuriwqbi6o/InsideLoop.zip?dl=0

0 Kudos
8 Replies
pbkenned1
Employee
877 Views

Thank you for submitting the issue.

>>>and also with similar settings with icpc

What exactly was your compilation command line, and what version of icpc? 

Using the row-major and column-major kernels you provided, I was unable to replicate the results you obtained on your core-i7 box.  In other words, I don't see big increases in the time for the column-major version; it performs about the same as the row-major version.

kernel_cm-rd-n.cpp is my column-major version using your kernel

kernel_rm-rd-n.cpp is my row-major version using your kernel

At -O2/-O3, icpc-15.0.1 will perform loop interchange on the column-major version, and then vectorize the permuted loop, effectively generating code equivalent to the row-major version:

< LOOP BEGIN at kernel_cm-rd-n.cpp(38,8)
<    remark #25444: Loopnest Interchanged: ( 1 2 ) --> ( 2 1 )
<    remark #15542: loop was not vectorized: inner loop was already vectorized   [ kernel_cm-rd-n.cpp(38,8) ]
---
> LOOP BEGIN at kernel_rm-rd-n.cpp(37,4)
>    remark #15542: loop was not vectorized: inner loop was already vectorized
164c160
<    LOOP BEGIN at kernel_cm-rd-n.cpp(37,4)
---
>    LOOP BEGIN at kernel_rm-rd-n.cpp(38,7)
168,169c164,165
<    LOOP BEGIN at kernel_cm-rd-n.cpp(37,4)
<       remark #15301: PERMUTED LOOP WAS VECTORIZED
---
>    LOOP BEGIN at kernel_rm-rd-n.cpp(38,7)
>       remark #15300: LOOP WAS VECTORIZED
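
In other words, the interchange effectively turns the column-major nest into something like the following (an illustration of the transformation, not the actual generated code):

#include <cstddef>
#include <vector>

// Sketch of what the ( 1 2 ) --> ( 2 1 ) interchange amounts to: the i
// loop is hoisted outward, so the inner loop walks memory contiguously
// and can be vectorized ("PERMUTED LOOP WAS VECTORIZED").
double sum_after_interchange(const std::vector<double> &v, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; i++) {      // was the inner loop
        for (std::size_t j = 0; j < n; j++) {  // was the outer loop
            sum += v[i * n + j];               // now unit-stride
        }
    }
    return sum;
}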
 

I tested on Intel Xeon E5-2650 Sandy Bridge-EP 2.0GHz, 20MB L3 Cache.  sizeof(std::vector<double>) == 24 bytes on my machine, so for n==933, the array just fits into the L3 cache at 20,891,736 bytes, and for n==935, it is slightly larger at 20,981,400 bytes (20MB == 20,971,520 bytes).

Testing at n == {100, 500, 1000}

Column-major at -O3:

[U539679]$ ./kernel_cm-rd-n.cpp-O3.x

 enter n for v(n * n) >100
 &v[n*n-1] - &v[0] = 9999
 Size of array (KB) = 234
 L3 cache size (KB) = 20480
 kernel loop took = 6.91414e-06 s
 kernel loop (ns) = 6914.14
 kernel loop (ns)/array size = 0.0288089
 foo(100) = 2.55025e+07


[U539679]$ ./kernel_cm-rd-n.cpp-O3.x

 enter n for v(n * n) >500
 &v[n*n-1] - &v[0] = 249999
 Size of array (KB) = 5859
 L3 cache size (KB) = 20480
 kernel loop took = 7.20024e-05 s
 kernel loop (ns) = 72002.4
 kernel loop (ns)/array size = 0.0120004
 foo(500) = 1.56876e+10


[U539679]$ ./kernel_cm-rd-n.cpp-O3.x

 enter n for v(n * n) >1000
 &v[n*n-1] - &v[0] = 999999
 Size of array (KB) = 23437
 L3 cache size (KB) = 20480
 kernel loop took = 0.00031805 s
 kernel loop (ns) = 318050
 kernel loop (ns)/array size = 0.0132521
 foo(1000) = 2.505e+11

Row-major at -O3:

[U539679]$ ./kernel_rm-rd-n.cpp-O3.x

 enter n for v(n * n) >100
 &v[n*n-1] - &v[0] = 9999
 Size of array (KB) = 234
 L3 cache size (KB) = 20480
 kernel loop took = 6.91414e-06 s
 kernel loop (ns) = 6914.14
 kernel loop (ns)/array size = 0.0288089
 foo(100) = 2.55025e+07


[U539679]$ ./kernel_rm-rd-n.cpp-O3.x

 enter n for v(n * n) >500
 &v[n*n-1] - &v[0] = 249999
 Size of array (KB) = 5859
 L3 cache size (KB) = 20480
 kernel loop took = 7.10487e-05 s
 kernel loop (ns) = 71048.7
 kernel loop (ns)/array size = 0.0118415
 foo(500) = 1.56876e+10


[U539679]$ ./kernel_rm-rd-n.cpp-O3.x

 enter n for v(n * n) >1000
 &v[n*n-1] - &v[0] = 999999
 Size of array (KB) = 23437
 L3 cache size (KB) = 20480
 kernel loop took = 0.00027585 s
 kernel loop (ns) = 275850
 kernel loop (ns)/array size = 0.0114938
 foo(1000) = 2.505e+11
[U539679]$

I didn't try testing the full Amazon code, since I couldn't verify your initial results.

Patrick

0 Kudos
velvia
Beginner
877 Views

Hi Patrick,

Thanks for your answer. I am Francois, who originally posted the message, but I have mixed up my accounts. Here are a few comments:

- The original case is closed. I was using my own version of std::vector and did not want to bother you with that, which is why I switched to std::vector in my post. As my own version was configured to use 32-bit integers, it is easy to understand what was going on: for n > 2048, n * n > 2^32 and we get a nice integer overflow, which is undefined behaviour (my version of std::vector uses signed integers). I have yet to understand why it was not showing up on my machine :-) A sketch of this failure mode follows after these notes.

- Your comment about the loop interchange is bugging me. When I compile with -qopt-report2, it says that the loops for the column-major traversal have been interchanged, but the timings seem to say that they have not been. My feeling is that you don't see the difference in your code because the compiler is optimizing away the loops, since their results are not used. So it seems to me that the optimizer says it exchanges the loops but it does not!
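
To illustrate the first point, here is a minimal sketch of that failure mode (not my actual container, and the exact n at which things break depends on how the index arithmetic is written):

#include <cstdint>
#include <cstdio>

int main() {
    // With a 32-bit signed index type, n * n eventually exceeds INT32_MAX,
    // and signed integer overflow is undefined behaviour.
    std::int32_t n = 50000;        // 50000 * 50000 = 2.5e9 > INT32_MAX
    std::int32_t count = n * n;    // signed overflow: undefined behaviour
    std::printf("n * n = %d\n", static_cast<int>(count)); // typically garbage
    return 0;
}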

By the way, here is the full code using std::vector, which shows a clear difference between row-major and column-major traversal.

#include <iostream>
#include <fstream>
#include <chrono>
#include <vector>

void save_row_column();

typedef int index_t;

int main(int argc, const char * argv[]) {
    save_row_column();

    return 0;
}

double sum_matrix_row_major(size_t n, double total_work, double& somme) {
    const size_t n_loops{ 1 + static_cast<size_t>(total_work / (8 * n * n)) };

    std::vector<double> v( n * n );
    double sum{ 0.0 };

    for (size_t k = 0; k < n * n; ++k) {
        v[k] = static_cast<double>(k) / static_cast<double>(n * n);
    }

    auto start_time = std::chrono::high_resolution_clock::now();
    for (size_t k = 0; k < n_loops; ++k) {
        for (size_t i = 0; i < n; ++i) {
            for (size_t j = 0; j < n; ++j) {
                sum += v[n * i + j];
            }
        }
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    long long time{ std::chrono::duration_cast<std::chrono::nanoseconds>(
            end_time - start_time).count() };

    somme += sum;

    return static_cast<double>(time) / n_loops;
}

double sum_matrix_column_major(size_t n, double total_work, double& somme) {
    const size_t n_loops{ 1 + static_cast<size_t>(total_work / (8 * n * n)) };

    std::vector<double> v( n * n );
    double sum{ 0.0 };

    for (size_t k = 0; k < n * n; ++k) {
        v[k] = static_cast<double>(k) / static_cast<double>(n * n);
    }

    auto start_time = std::chrono::high_resolution_clock::now();
    for (size_t k = 0; k < n_loops; ++k) {
        for (size_t j = 0; j < n; ++j) {
            for (size_t i = 0; i < n; ++i) {
                sum += v[n * i + j];
            }
        }
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    long long time{ std::chrono::duration_cast<std::chrono::nanoseconds>(
            end_time - start_time).count() };

    somme += sum;

    return static_cast<double>(time) / n_loops;
}

void save_row_column() {
    const double total_work{ 1.0 * 1024 * 1024 };
    double somme{ 0.0 };
    int n_begin{ 1 };
    int n_end{ 2000 };

    std::vector<int> size( n_end - n_begin + 1 );
    for (index_t k = 0; k <= n_end - n_begin; ++k) {
        size[k] = n_begin + k;
    }

    std::vector<double> time_per_byte_row( size.size() );
    std::vector<double> time_per_byte_column( size.size() );

    // Warm up the processor so it runs at full speed for the measurements
    sum_matrix_row_major(10000, total_work, somme);

    for (index_t i = 0; i < static_cast<index_t>(size.size()); ++i) {
        double time{ sum_matrix_row_major(size[i], total_work, somme) };
        double time_per_byte{ time / (8 * size[i] * size[i]) };
        time_per_byte_row[i] = time_per_byte;
    }

    std::cout << std::endl;

    for (index_t i = 0; i < static_cast<index_t>(size.size()); ++i) {
        double time{ sum_matrix_column_major(size[i], total_work, somme) };
        double time_per_byte{ time / (8 * size[i] * size[i]) };
        time_per_byte_column[i] = time_per_byte;
    }

    std::fstream file_row{
            "/Users/fayard/Desktop/time_per_byte_row.txt", 
            std::ios_base::out };
    std::fstream file_column{ 
            "/Users/fayard/Desktop/time_per_byte_column.txt",
            std::ios_base::out };

    for (index_t i = 0; i < static_cast<index_t>(size.size()); ++i) {
        file_row << size[i] << " " << time_per_byte_row[i] << std::endl;
        file_column << size[i] << " " << time_per_byte_column[i] << std::endl;
    }

    file_row.close();
    file_column.close();

    std::cout << somme << std::endl;
}

 

0 Kudos
pbkenned1
Employee
877 Views

Hello Francois,

Thanks for posting the full code.  After inspecting the generated assembly code, I do think the optimizer interchanged the loops in the column-major example.  However, as you say, the loops are not doing much -- just a simple accumulation -- so perhaps your full code will be a better test.  I'll investigate with the 15.0 compiler and let you know my results.  If you are not using the 15.0 compiler, please let us know.

Patrick

0 Kudos
velvia
Beginner
877 Views

Hi Patrick,

Here is the full information.

Mac OS X 10.10.2

fayard@pc-rdc:~$ icpc --version
icpc (ICC) 15.0.1 20141022
Copyright (C) 1985-2014 Intel Corporation.  All rights reserved.

fayard@pc-rdc:~$ icpc -std=c++11 -Ofast -xHost -ansi-alias main.cpp -o main

 

0 Kudos
pbkenned1
Employee
877 Views

Thanks for the command line, compiler, and host information.  I can reproduce your initial results, so now I'll look at the Amazon code.

Patrick

 

0 Kudos
velvia
Beginner
877 Views

Hi Patrick,

So you get better performance with the row-major order even though the compiler says that it exchanges the loops in the column-major code? Is that right? Don't you find that surprising?

0 Kudos
pbkenned1
Employee
877 Views

With -Ofast, I get the same graph you showed initially, i.e., the one where the column-major version performs poorly once the L3 cache size is exceeded.  But here's a surprise -- at -O2, the column-major and row-major results are essentially the same -- basically flat lines until n > 1400.

I'll need to go through the optimization reports in more detail to understand why, but three loops get interchanged at -O2, whereas only one does at -O3 (implied by -Ofast).

It's unusual, but not unheard of, that -O3 might have worse performance than -O2 on some codes.
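
For reference, the reports I'm comparing come from compiling the same source twice with the loop-report options enabled, something like this (option spelling for the 15.0 compiler; adjust for your version):

icpc -std=c++11 -O2 -qopt-report=2 -qopt-report-phase=loop,vec main.cpp -o main_O2
icpc -std=c++11 -Ofast -qopt-report=2 -qopt-report-phase=loop,vec main.cpp -o main_Ofast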

Patrick

0 Kudos
velvia
Beginner
877 Views

Hi Patrick,

I've checked the optimization report, and I had been fooled by the fact that the compiler generated two versions of the code for sum_matrix_column_major: one is inlined directly into save_row_column, and the other, standalone one is never used (one way to check this is sketched below). The compiler generating different code for two versions of the same function is what misled me. So, finally, we have:

- With -O2, the inlined version of sum_matrix_column_major has its inner loops interchanged. Therefore the function is as fast as sum_matrix_row_major.

- With -Ofast, these loops do not get interchanged. Therefore, the traversal is not cache-friendly and we can see the performance drop.
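
(If you want to see the two versions yourself, one way is to dump the assembly and look for the standalone copy; this is a sketch, and the exact mangled names depend on your toolchain:)

icpc -std=c++11 -Ofast -xHost -ansi-alias -S main.cpp -o main.s
grep sum_matrix_column_major main.s   # the standalone definition; the inlined copy lives inside save_row_column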

As for Amazon Web Services, I always get the same results with the same program, but as soon as I change the maximum matrix size, the results change. If I warm up the CPU, the results also change. It must be some quirk at Amazon that I won't get to the bottom of anyway.

0 Kudos
Reply