Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Runtime comparison: AoS vs sdlt-generated SoA

_Thomas_
Novice

Hi,

I recently learned that data stored in contiguous memory, i.e. arrays, can be processed much faster than data that is scattered randomly in memory. Using sdlt containers looked very interesting to me, so I started experimenting with them.

First I wanted to do a small experiment and compare a std::vector of structs to an sdlt soa1d_container. My expectation was that I would see the same performance or a small improvement. To my surprise, the part of the code that uses sdlt is about 20 times slower.

I am using one of the sdlt examples provided in the documentation. The generated assembly also does not look like the assembly shown in the example.
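For reference, a minimal sketch of the declarations the snippet below assumes (the struct and the SDLT_PRIMITIVE registration follow the documentation example; the float members r, g, b are my assumption, and the mangled names in the assembly show the type as RGBs while I call it RGBTy here):

#include <sdlt/sdlt.h>

struct RGBTy {
    float r;
    float g;
    float b;
};
// Register the struct with SDLT so soa1d_container can store each member in its own array.
SDLT_PRIMITIVE(RGBTy, r, g, b)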

I am using VS2019 in Debug mode; otherwise the loop is optimized away, and I am mainly interested in the relative performance.

 

Compiler Options:

/GS /Qiopenmp /W3 /ZI /Od /D "_DEBUG" /D "_CONSOLE" /D "__INTEL_CXX11_MODE__" /D "_UNICODE" /D "UNICODE" /Zc:forScope /arch:CORE-AVX2 /Oi /MDd /std:c++17 /Fa"x64\Debug\" /EHsc /nologo /Fo"x64\Debug\" /Qstd=c++17 //fprofile-instr-use "x64\Debug\" /Fp"x64\Debug\Cpp_Intel_SDLT.pch"

 

Compiler Version:

2021.3.0.3221

 

sdlt::soa1d_container<RGBTy> aContainer(N); // SDLT container to get SoA data layout
auto a = aContainer.access();               // accessor used inside the SIMD loop
std::vector<RGBTy> bContainer(N);           // plain AoS baseline

// AoS loop: std::vector of structs
start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < COUNTER_TIME_LOOP; i++) {
    for (int k = 0; k < N; k++) {
        bContainer[k].r = k * 1.5;
        bContainer[k].g = k * 2.5;
        bContainer[k].b = k * 3.5;
    }
}
end = std::chrono::high_resolution_clock::now();

// Calculating total time taken by the loop.
time_taken = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
time_taken *= 1e-9;
std::cout << "Time taken by loop is : " << std::fixed << time_taken / COUNTER_TIME_LOOP << std::setprecision(9) << " sec" << std::endl;

// SoA loop: sdlt accessor with the member interface r()/g()/b()
start = std::chrono::high_resolution_clock::now();
for (int i = 0; i < COUNTER_TIME_LOOP; ++i) {
    #pragma omp simd
    for (int k = 0; k < N; ++k) {
        a[k].r() = k * 1.5;
        a[k].g() = k * 2.5;
        a[k].b() = k * 3.5;
    }
}
end = std::chrono::high_resolution_clock::now();

Output (first line: AoS vector loop, second line: sdlt SoA loop):
Time taken by loop is : 0.001831 sec
Time taken by loop is : 0.035508027 sec

 

Disassembly of the sdlt loop (Debug build):

for (int i = 0; i < COUNTER_TIME_LOOP; ++i)
00007FF6803B1E6E add dword ptr [rax],eax
00007FF6803B1E70 add bh,al
{
#pragma omp simd
00007FF6803B1E72 js main+344h (07FF6803B1E74h)
00007FF6803B1E75 pop qword ptr [rcx]
00007FF6803B1E77 add byte ptr [rbx+4D89784Dh],cl
00007FF6803B1E7D in al,dx
00007FF6803B1E7E xor eax,eax
00007FF6803B1E80 cmp eax,ecx
00007FF6803B1E82 jg main+4B0h (07FF6803B1FE0h)
00007FF6803B1E88 xor eax,eax
00007FF6803B1E8A mov dword ptr [rbp-18h],eax
00007FF6803B1E8D jmp main+362h (07FF6803B1E92h)
00007FF6803B1E92 mov eax,dword ptr [rbp-18h]
00007FF6803B1E95 mov dword ptr [rbp-24h],eax
00007FF6803B1E98 mov dword ptr [rbp+204h],eax
for (int k = 0; k < N; ++k) {
a[k].r() = k * 1.5;
00007FF6803B1E9E vcvtsi2sd xmm0,xmm0,dword ptr [rbp+204h]
00007FF6803B1EA6 vmovsd xmm1,qword ptr [__real@3ff8000000000000 (07FF6803BED00h)]
00007FF6803B1EAE vmulsd xmm1,xmm0,xmm1
00007FF6803B1EB2 vcvtsd2ss xmm0,xmm0,xmm1
00007FF6803B1EB6 vmovss dword ptr [rbp+15Ch],xmm0
00007FF6803B1EBE mov r8d,dword ptr [rbp+204h]
00007FF6803B1EC5 lea rcx,[a]
00007FF6803B1ECC lea rdx,[rbp+178h]
00007FF6803B1ED3 call sdlt::v2_6::internal::soa1d::accessor<RGBs,0,sdlt::v2_6::no_offset>::operator[]<int,std::integral_constant<bool,1> > (07FF6803B1267h)
00007FF6803B1ED8 lea rcx,[rbp+178h]
00007FF6803B1EDF lea rdx,[rbp+160h]
00007FF6803B1EE6 call sdlt::v2_6::primitive<RGBs>::member_interface<sdlt::v2_6::internal::soa1d::element<RGBs,0,int>,0,sdlt::v2_6::internal::soa1d::member<RGBs,0,int>::proxy>::r (07FF6803B11BDh)
00007FF6803B1EEB lea rcx,[rbp+160h]
00007FF6803B1EF2 lea rdx,[rbp+15Ch]
00007FF6803B1EF9 call sdlt::v2_6::internal::soa1d::member<RGBs,0,int>::proxy<float,0>::operator= (07FF6803B1361h)
a[k].g() = k * 2.5;
00007FF6803B1EFE vcvtsi2sd xmm0,xmm0,dword ptr [rbp+204h]
00007FF6803B1F06 vmovsd xmm1,qword ptr [__real@4004000000000000 (07FF6803BECF0h)]
00007FF6803B1F0E vmulsd xmm1,xmm0,xmm1
00007FF6803B1F12 vcvtsd2ss xmm0,xmm0,xmm1
00007FF6803B1F16 vmovss dword ptr [rbp+194h],xmm0
00007FF6803B1F1E mov r8d,dword ptr [rbp+204h]
00007FF6803B1F25 lea rcx,[a]
00007FF6803B1F2C lea rdx,[rbp+1B0h]
00007FF6803B1F33 call sdlt::v2_6::internal::soa1d::accessor<RGBs,0,sdlt::v2_6::no_offset>::operator[]<int,std::integral_constant<bool,1> > (07FF6803B1267h)
00007FF6803B1F38 lea rcx,[rbp+1B0h]
00007FF6803B1F3F lea rdx,[rbp+198h]
00007FF6803B1F46 call sdlt::v2_6::primitive<RGBs>::member_interface<sdlt::v2_6::internal::soa1d::element<RGBs,0,int>,0,sdlt::v2_6::internal::soa1d::member<RGBs,0,int>::proxy>::g (07FF6803B121Ch)
00007FF6803B1F4B lea rcx,[rbp+198h]
00007FF6803B1F52 lea rdx,[rbp+194h]
00007FF6803B1F59 call sdlt::v2_6::internal::soa1d::member<RGBs,0,int>::proxy<float,4>::operator= (07FF6803B119Fh)
a[k].b() = k * 3.5;
00007FF6803B1F5E vcvtsi2sd xmm0,xmm0,dword ptr [rbp+204h]
00007FF6803B1F66 vmovsd xmm1,qword ptr [__real@400c000000000000 (07FF6803BECE0h)]
00007FF6803B1F6E vmulsd xmm1,xmm0,xmm1
00007FF6803B1F72 vcvtsd2ss xmm0,xmm0,xmm1
00007FF6803B1F76 vmovss dword ptr [rbp+1CCh],xmm0
00007FF6803B1F7E mov r8d,dword ptr [rbp+204h]
00007FF6803B1F85 lea rcx,[a]
00007FF6803B1F8C lea rdx,[rbp+1E8h]
00007FF6803B1F93 call sdlt::v2_6::internal::soa1d::accessor<RGBs,0,sdlt::v2_6::no_offset>::operator[]<int,std::integral_constant<bool,1> > (07FF6803B1267h)
00007FF6803B1F98 lea rcx,[rbp+1E8h]
00007FF6803B1F9F lea rdx,[rbp+1D0h]
00007FF6803B1FA6 call sdlt::v2_6::primitive<RGBs>::member_interface<sdlt::v2_6::internal::soa1d::element<RGBs,0,int>,0,sdlt::v2_6::internal::soa1d::member<RGBs,0,int>::proxy>::b (07FF6803B13C0h)
00007FF6803B1FAB lea rcx,[rbp+1D0h]
00007FF6803B1FB2 lea rdx,[rbp+1CCh]
00007FF6803B1FB9 call sdlt::v2_6::internal::soa1d::member<RGBs,0,int>::proxy<float,8>::operator= (07FF6803B14C4h)
00007FF6803B1FBE mov ecx,dword ptr [rbp-14h]
00007FF6803B1FC1 mov eax,dword ptr [rbp-24h]
{
#pragma omp simd
00007FF6803B1FC4 add eax,1
00007FF6803B1FC7 mov dword ptr [rbp-20h],eax
00007FF6803B1FCA add ecx,1
00007FF6803B1FCD mov dword ptr [rbp-1Ch],ecx
00007FF6803B1FD0 cmp ecx,eax
00007FF6803B1FD2 mov dword ptr [rbp-18h],eax
00007FF6803B1FD5 jg main+362h (07FF6803B1E92h)
00007FF6803B1FDB jmp main+4B0h (07FF6803B1FE0h)
// use SDLT Data Member Interface to access struct members r, g, and b.
// achieve unit-stride access after vectorization

for (int i = 0; i < COUNTER_TIME_LOOP; ++i)
00007FF6803B1FE0 mov eax,dword ptr [rbp+7Ch]
00007FF6803B1FE3 add eax,1
00007FF6803B1FE6 mov dword ptr [rbp+7Ch],eax
00007FF6803B1FE9 jmp main+334h (07FF6803B1E64h)
}
}
end = std::chrono::high_resolution_clock::now();

 

It would be great if someone could explain why I am seeing this performance decrease.

 

Thank you!

Thomas

VidyalathaB_Intel
Moderator

Hi,

Thanks for reaching out to us.

>>I am using one of the sdlt examples provided in the documentation

https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/libraries/introduction-to-the-simd-data-layout-templates/examples/example-1.html

Are you referring to the above document as the source of the sdlt example you are working with?

The issue you raised is reproducible.

We are working on it and will get back to you soon.

Regards,

Vidya.


_Thomas_
Novice

Yes, that's the example I used.

VidyalathaB_Intel
Moderator

Hi,

 

>>why I am seeing this performance decrease

 

The performance difference is because of the flags you are using while compiling the program (/ZI and /Od turn off all optimizations).

If you compile the code without the flags that disable optimizations, you will see that the sdlt-generated SoA takes less time than the AoS version.
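For example, compiling from the command line with optimizations enabled looks like this (a sketch; the exact set of flags depends on your project, and the compiler defaults to /O2 when no /Od or other /O level is given):

icl /O2 /Qiopenmp /arch:CORE-AVX2 /Qstd=c++17 /EHsc Cpp_Intel_SDLT.cpp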

 

>>My expectation was that I would see the same performance or a small improvement. To my surprise, the part of the code that uses sdlt is about 20 times slower.

 

We tried it on our end and the results are as follows:

 

Time taken by loop is : 0.000276 sec

Time taken by loop is : 0.000180907 sec

 

Hope the provided information helps.

 

Regards,

Vidya.

 

 

_Thomas_
Novice

Hi Vidya,

I see, that makes more sense now, thanks! I can reproduce your results.
I disabled all the optimizations because some compilers simply remove the "useless" code in the loop, and then the comparison is of course pointless. When no /O option is given, it seems to default to /O2. The command I am using to compile the code is:

icl /openmp /FAcs /Faoutput.asm /arch:CORE-AVX2 /Qstd=c++11 Cpp_Intel_SDLT.cpp
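A sketch of another way I could have kept the loops from being optimized away at /O2, instead of building the whole program with /Od (the volatile sink is my own workaround, not something from the documentation example):

volatile float sink = 0.0f;       // observable side effect the optimizer must preserve
// ... run the two timed loops as above ...
sink += bContainer[N - 1].r;      // consume the AoS results
float soa_r = a[N - 1].r();       // read back through the sdlt proxy
sink += soa_r;                    // consume the SoA results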

 

Another point that surprised me is that putting #pragma omp simd before the loop that doesn't use the sdlt struct increases the runtime.

VidyalathaB_Intel
Moderator

Hi,

>>I see, that makes more sense now, thanks! I can reproduce your results

Glad to know that the provided information helped you.

>>#pragma omp simd before the loop that doesn't use the sdlt struct increases the runtime.

A possible reason is that in an array of structs the values of each member are not contiguous in memory (r, g, and b are interleaved), so the loop cannot be vectorized efficiently: the strided, gather-like loads and stores take extra time, which can make the #pragma omp simd version slower.
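As a rough illustration of the two layouts (a sketch, not code from the example):

// AoS: memory holds r g b | r g b | r g b ...
// reading all r values means a stride of 3 floats (gather-like access)
std::vector<RGBTy> aos(N);

// SoA: memory holds r r r ... | g g g ... | b b b ...
// each member array is contiguous, so the loop gets unit-stride loads and stores
sdlt::soa1d_container<RGBTy> soa(N);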

Regards,

Vidya.

 

VidyalathaB_Intel
Moderator

Hi,

Reminder:

Has the provided information helped?

If so, could you please confirm whether we can close this thread from our end?

Regards,

Vidya.



VidyalathaB_Intel
Moderator

Hi,

Thanks for accepting our solution.

As the issue is resolved, we are going ahead and closing this thread. If you need any additional information, please post a new question.

Have a good day!

Regards,

Vidya.

