Thank you for submitting the issue.
>>>and also with similar settings with icpc
What exactly was your compilation command line, and what version of icpc?
Using the row-major and column-major kernels you provided, I was unable to reproduce the results you obtained on your Core i7 box: I don't see a large increase in run time for the column-major version, which performs about the same as the row-major version.
kernel_cm-rd-n.cpp is my column-major version using your kernel
kernel_rm-rd-n.cpp is my row-major version using your kernel
At -O2/-O3, icpc-15.0.1 performs loop interchange on the column-major version and then vectorizes the permuted loop, effectively generating code equivalent to the row-major version. A diff of the two optimization reports shows this:
< LOOP BEGIN at kernel_cm-rd-n.cpp(38,8)
< remark #25444: Loopnest Interchanged: ( 1 2 ) --> ( 2 1 )
< remark #15542: loop was not vectorized: inner loop was already vectorized [ kernel_cm-rd-n.cpp(38,8) ]
---
> LOOP BEGIN at kernel_rm-rd-n.cpp(37,4)
> remark #15542: loop was not vectorized: inner loop was already vectorized
164c160
< LOOP BEGIN at kernel_cm-rd-n.cpp(37,4)
---
> LOOP BEGIN at kernel_rm-rd-n.cpp(38,7)
168,169c164,165
< LOOP BEGIN at kernel_cm-rd-n.cpp(37,4)
< remark #15301: PERMUTED LOOP WAS VECTORIZED
---
> LOOP BEGIN at kernel_rm-rd-n.cpp(38,7)
> remark #15300: LOOP WAS VECTORIZED
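To illustrate what the interchange does, here is a minimal sketch. It is not your kernel: the function names and the flat std::vector<double> layout are assumptions made for illustration only. The column-major loop nest walks memory with a stride of n elements; swapping the two loops, which is what remark #25444 reports, gives the unit-stride row-major form that the vectorizer handles well.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernel, for illustration only: sums a flat n*n array whose
// element (i, j) is stored at index i*n + j.

// Column-major traversal: the inner loop over i jumps n elements per step.
double sum_column_major(const std::vector<double>& v, std::size_t n) {
    assert(v.size() == n * n);
    double s = 0.0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            s += v[i * n + j];
    return s;
}

// Row-major traversal: the same loop nest with the two loops interchanged,
// so the inner loop over j touches consecutive elements.
double sum_row_major(const std::vector<double>& v, std::size_t n) {
    assert(v.size() == n * n);
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            s += v[i * n + j];
    return s;
}
```

Both functions add the same terms, only in a different order, so for general floating-point data the results can differ in the last bits.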
I tested on an Intel Xeon E5-2650 (Sandy Bridge-EP, 2.0GHz, 20MB L3 cache). sizeof(std::vector<double>) == 24 bytes on my machine, so for n==933 the array just fits into the L3 cache at 20,891,736 bytes, and for n==935 it is slightly larger than the cache at 20,981,400 bytes (20MB == 20,971,520 bytes).
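The size arithmetic above can be checked with a short sketch. The constant and function names are made up for illustration, and the 24-byte element size is hard-coded from the figure quoted above, since sizeof(std::vector<double>) depends on the platform and standard library.

```cpp
#include <cassert>

// Assumed values, taken from the figures quoted in this post.
constexpr unsigned long long kElementBytes = 24;                // sizeof(std::vector<double>) as measured here
constexpr unsigned long long kL3Bytes = 20ULL * 1024 * 1024;    // 20MB L3 == 20,971,520 bytes

// Bytes occupied by an n*n array of 24-byte elements.
constexpr unsigned long long array_bytes(unsigned long long n) {
    return n * n * kElementBytes;
}

// True when the whole array fits into the L3 cache.
constexpr bool fits_in_l3(unsigned long long n) {
    return array_bytes(n) <= kL3Bytes;
}

static_assert(array_bytes(933) == 20891736ULL, "n == 933 footprint");
static_assert(array_bytes(935) == 20981400ULL, "n == 935 footprint");
static_assert(fits_in_l3(933) && !fits_in_l3(935), "933 fits, 935 does not");
```

The checks run at compile time, so the file only builds if the byte counts match.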
Testing at n == {100, 500, 1000}
Column-major at -O3:
[U539679]$ ./kernel_cm-rd-n.cpp-O3.x
enter n for v(n * n) >100
&v[n*n-1] - &v[0] = 9999
Size of array (KB) = 234
L3 cache size (KB) = 20480
kernel loop took = 6.91414e-06 s
kernel loop (ns) = 6914.14
kernel loop (ns)/array size = 0.0288089
foo(100) = 2.55025e+07
[U539679]$ ./kernel_cm-rd-n.cpp-O3.x
enter n for v(n * n) >500
&v[n*n-1] - &v[0] = 249999
Size of array (KB) = 5859
L3 cache size (KB) = 20480
kernel loop took = 7.20024e-05 s
kernel loop (ns) = 72002.4
kernel loop (ns)/array size = 0.0120004
foo(500) = 1.56876e+10
[U539679]$ ./kernel_cm-rd-n.cpp-O3.x
enter n for v(n * n) >1000
&v[n*n-1] - &v[0] = 999999
Size of array (KB) = 23437
L3 cache size (KB) = 20480
kernel loop took = 0.00031805 s
kernel loop (ns) = 318050
kernel loop (ns)/array size = 0.0132521
foo(1000) = 2.505e+11
Row-major at -O3:
[U539679]$ ./kernel_rm-rd-n.cpp-O3.x
enter n for v(n * n) >100
&v[n*n-1] - &v[0] = 9999
Size of array (KB) = 234
L3 cache size (KB) = 20480
kernel loop took = 6.91414e-06 s
kernel loop (ns) = 6914.14
kernel loop (ns)/array size = 0.0288089
foo(100) = 2.55025e+07
[U539679]$ ./kernel_rm-rd-n.cpp-O3.x
enter n for v(n * n) >500
&v[n*n-1] - &v[0] = 249999
Size of array (KB) = 5859
L3 cache size (KB) = 20480
kernel loop took = 7.10487e-05 s
kernel loop (ns) = 71048.7
kernel loop (ns)/array size = 0.0118415
foo(500) = 1.56876e+10
[U539679]$ ./kernel_rm-rd-n.cpp-O3.x
enter n for v(n * n) >1000
&v[n*n-1] - &v[0] = 999999
Size of array (KB) = 23437
L3 cache size (KB) = 20480
kernel loop took = 0.00027585 s
kernel loop (ns) = 275850
kernel loop (ns)/array size = 0.0114938
foo(1000) = 2.505e+11
[U539679]$
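For reference, the "kernel loop (ns)/array size" figures above are consistent with dividing the elapsed time in nanoseconds by the array size in bytes (for n == 100: 6914.14 ns / 240,000 bytes = 0.0288089). A minimal sketch of that calculation follows. The helper names are hypothetical, and the use of std::chrono::steady_clock is an assumption, not necessarily the timer used in the test harness.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: run a callable once and return elapsed seconds.
template <typename F>
double time_seconds(F&& kernel) {
    const auto t0 = std::chrono::steady_clock::now();
    kernel();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Nanoseconds per byte for an n*n array; 24 bytes per element is the
// sizeof(std::vector<double>) value reported in this post (an assumption
// on any other platform).
inline double ns_per_byte(double seconds, std::size_t n,
                          std::size_t element_bytes = 24) {
    assert(n > 0 && element_bytes > 0);
    const double bytes = static_cast<double>(n) * n * element_bytes;
    return seconds * 1e9 / bytes;
}
```

For example, ns_per_byte(0.00031805, 1000) corresponds to the column-major run at n == 1000 and ns_per_byte(0.00027585, 1000) to the row-major run.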
I didn't try testing the full Amazon code, since I couldn't verify your initial results.
Patrick