Thank you for submitting the issue.

>>>and also with similar settings with icpc

What exactly was your compilation command line, and what version of icpc?

Using the row-major and column-major kernels you provided, I was unable to reproduce the results you obtained on your Core i7 box. In other words, I don't see a large increase in run time for the column-major version; it performs about the same as the row-major version.

kernel_cm-rd-n.cpp is my column-major version using your kernel

kernel_rm-rd-n.cpp is my row-major version using your kernel

At -O2/-O3, icpc-15.0.1 will perform loop interchange on the column-major version, and then vectorize the permuted loop, effectively generating code equivalent to the row-major version:

< LOOP BEGIN at kernel_cm-rd-n.cpp(38,8)

< **remark #25444: Loopnest Interchanged: ( 1 2 ) --> ( 2 1 )**

< remark #15542: loop was not vectorized: inner loop was already vectorized [ kernel_cm-rd-n.cpp(38,8) ]

---

> LOOP BEGIN at kernel_rm-rd-n.cpp(37,4)

> remark #15542: loop was not vectorized: inner loop was already vectorized

164c160

< LOOP BEGIN at kernel_cm-rd-n.cpp(37,4)

---

> LOOP BEGIN at kernel_rm-rd-n.cpp(38,7)

168,169c164,165

< LOOP BEGIN at kernel_cm-rd-n.cpp(37,4)

< **remark #15301: PERMUTED LOOP WAS VECTORIZED**

---

> LOOP BEGIN at kernel_rm-rd-n.cpp(38,7)

> remark #15300: LOOP WAS VECTORIZED
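As a rough illustration of what the interchange in the report amounts to, here is a minimal sketch. The two functions are hypothetical stand-ins, not your actual kernels: I am assuming an n x n array stored row by row in a flat std::vector<double> and a simple sum as the loop body.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the column-major kernel: the inner loop walks
// down a column, so consecutive iterations are n elements apart in memory.
double sum_column_major(const std::vector<double>& v, std::size_t n) {
    assert(v.size() == n * n);
    double sum = 0.0;
    for (std::size_t j = 0; j < n; ++j) {      // loop 1: columns
        for (std::size_t i = 0; i < n; ++i) {  // loop 2: rows, stride n
            sum += v[i * n + j];
        }
    }
    return sum;
}

// The same nest after a ( 1 2 ) --> ( 2 1 ) interchange: the inner loop now
// walks along a row with unit stride, which is the easy case to vectorize.
double sum_row_major(const std::vector<double>& v, std::size_t n) {
    assert(v.size() == n * n);
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {      // former loop 2
        for (std::size_t j = 0; j < n; ++j) {  // former loop 1, stride 1
            sum += v[i * n + j];
        }
    }
    return sum;
}
```

Both functions visit the same elements, only in a different order. For a floating-point sum the reordering can change rounding in general, so the two results are only guaranteed to match exactly when every partial sum is exactly representable.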

I tested on an Intel Xeon E5-2650 (Sandy Bridge-EP, 2.0 GHz, 20 MB L3 cache). sizeof(std::vector<double>) == 24 bytes on my machine, so for n == 933 the array just fits into the L3 cache at 933 × 933 × 24 = 20,891,736 bytes, and for n == 935 it is slightly larger than the cache at 935 × 935 × 24 = 20,981,400 bytes (20 MB == 20,971,520 bytes).
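The size arithmetic can be checked with a few lines of C++. This is only a sketch: the names are mine, and it takes the 24 bytes per element and the 20 MB L3 size quoted above as given.

```cpp
#include <cstddef>

// Assumption: 24 bytes per element, the sizeof value reported in the post.
constexpr std::size_t kBytesPerElement = 24;
// 20 MB L3 cache, i.e. 20 * 1024 * 1024 = 20,971,520 bytes.
constexpr std::size_t kL3Bytes = 20u * 1024u * 1024u;

// Footprint of an n x n array in bytes.
constexpr std::size_t footprint_bytes(std::size_t n) {
    return n * n * kBytesPerElement;
}

// True when the whole array fits into the L3 cache.
constexpr bool fits_in_l3(std::size_t n) {
    return footprint_bytes(n) <= kL3Bytes;
}

static_assert(footprint_bytes(933) == 20891736, "n == 933 footprint");
static_assert(footprint_bytes(935) == 20981400, "n == 935 footprint");
static_assert(fits_in_l3(933) && !fits_in_l3(935), "cache boundary");
```

With the same per-element size, footprint_bytes(n) / 1024 gives 234, 5859 and 23437 for n == 100, 500 and 1000, which are the "Size of array (KB)" values the test program prints.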

Results at n == {100, 500, 1000}:

**Column-major at -O3:**

[U539679]$ ./kernel_cm-rd-n.cpp-O3.x

enter n for v(n * n) >100

&v[n*n-1] - &v[0] = 9999

Size of array (KB) = 234

L3 cache size (KB) = 20480

kernel loop took = 6.91414e-06 s

kernel loop (ns) = 6914.14

**kernel loop (ns)/array size = 0.0288089**

foo(100) = 2.55025e+07

[U539679]$ ./kernel_cm-rd-n.cpp-O3.x

enter n for v(n * n) >500

&v[n*n-1] - &v[0] = 249999

Size of array (KB) = 5859

L3 cache size (KB) = 20480

kernel loop took = 7.20024e-05 s

kernel loop (ns) = 72002.4

**kernel loop (ns)/array size = 0.0120004**

foo(500) = 1.56876e+10

[U539679]$ ./kernel_cm-rd-n.cpp-O3.x

enter n for v(n * n) >1000

&v[n*n-1] - &v[0] = 999999

Size of array (KB) = 23437

L3 cache size (KB) = 20480

kernel loop took = 0.00031805 s

kernel loop (ns) = 318050

**kernel loop (ns)/array size = 0.0132521**

foo(1000) = 2.505e+11

**Row-major at -O3:**

[U539679]$ ./kernel_rm-rd-n.cpp-O3.x

enter n for v(n * n) >100

&v[n*n-1] - &v[0] = 9999

Size of array (KB) = 234

L3 cache size (KB) = 20480

kernel loop took = 6.91414e-06 s

kernel loop (ns) = 6914.14

**kernel loop (ns)/array size = 0.0288089**

foo(100) = 2.55025e+07

[U539679]$ ./kernel_rm-rd-n.cpp-O3.x

enter n for v(n * n) >500

&v[n*n-1] - &v[0] = 249999

Size of array (KB) = 5859

L3 cache size (KB) = 20480

kernel loop took = 7.10487e-05 s

kernel loop (ns) = 71048.7

**kernel loop (ns)/array size = 0.0118415**

foo(500) = 1.56876e+10

[U539679]$ ./kernel_rm-rd-n.cpp-O3.x

enter n for v(n * n) >1000

&v[n*n-1] - &v[0] = 999999

Size of array (KB) = 23437

L3 cache size (KB) = 20480

kernel loop took = 0.00027585 s

kernel loop (ns) = 275850

**kernel loop (ns)/array size = 0.0114938**

foo(1000) = 2.505e+11

[U539679]$

I didn't test the full Amazon code, since I couldn't reproduce your initial results.

Patrick