Integer SSE performance slower than traditional scalar code: why?

midnight_coder · 09-25-2022

I don't understand why this is the case.
I have two functions that add two vectors of 32-bit integers. Strangely, the SSE implementation is slower. I would expect the opposite. Why?

The code:

#include <emmintrin.h>

#pragma pack(1)
typedef union { 
  struct { fixed_t x, y, z, r; };
  fixed_t array[4];
} vector_t __attribute__ ((aligned(16)));
#pragma pack()

vector_t *
vector_add_nosse(vector_t * const pOut,
                 const vector_t * const pLhs,
                 const vector_t * const pRhs) {
  pOut->x = (pLhs->x + pRhs->x);
  pOut->y = (pLhs->y + pRhs->y);
  pOut->z = (pLhs->z + pRhs->z);
  pOut->r = (pLhs->r + pRhs->r);
  return pOut;
}

vector_t *
vector_add_sse(vector_t * const pOut,
               const vector_t * const pLhs,
               const vector_t * const pRhs) {
  __m128i lhs = _mm_load_si128((__m128i*)pLhs);
  __m128i rhs = _mm_load_si128((__m128i*)pRhs);
  __m128i ret = _mm_add_epi32(lhs, rhs);
  _mm_store_si128((__m128i*)pOut, ret);
  return pOut;
}

The timing results are below (i7-2620M CPU, 2.7 GHz). The columns are wall-clock time, CPU time, and iteration count, as reported by Google Benchmark:

Benchmark_Vector_Add_nosse         5.99 ns         5.98 ns    116186952
Benchmark_Vector_Add_sse           8.78 ns         8.78 ns     79138862

And just to make sure, here is the assembly output from the compiler for the two functions:

0000000000000000 <vector_add_nosse>:
   0:   f3 0f 1e fa             endbr64 
   4:   8b 0a                   mov    (%rdx),%ecx
   6:   03 0e                   add    (%rsi),%ecx
   8:   48 89 f8                mov    %rdi,%rax
   b:   89 0f                   mov    %ecx,(%rdi)
   d:   8b 4a 04                mov    0x4(%rdx),%ecx
  10:   03 4e 04                add    0x4(%rsi),%ecx
  13:   89 4f 04                mov    %ecx,0x4(%rdi)
  16:   8b 4a 08                mov    0x8(%rdx),%ecx
  19:   8b 52 0c                mov    0xc(%rdx),%edx
  1c:   03 4e 08                add    0x8(%rsi),%ecx
  1f:   03 56 0c                add    0xc(%rsi),%edx
  22:   89 4f 08                mov    %ecx,0x8(%rdi)
  25:   89 57 0c                mov    %edx,0xc(%rdi)
  28:   c3                      ret    

Disassembly of section .text.vector_add_sse:

0000000000000000 <vector_add_sse>:
   0:   f3 0f 1e fa             endbr64 
   4:   66 0f 6f 02             movdqa (%rdx),%xmm0
   8:   66 0f fe 06             paddd  (%rsi),%xmm0
   c:   48 89 f8                mov    %rdi,%rax
   f:   0f 29 07                movaps %xmm0,(%rdi)
  12:   c3                      ret

Does anyone have any idea why the non-SSE integer calculation is so much faster?

Thanks

SeshaP_Intel · 09-27-2022

Hi,

Thank you for posting in Intel Communities.

Could you please share the sample reproducer code (and steps, if any) so that we can reproduce the issue from our end?

Also please let us know the OS details and compiler version being used.

Thanks and Regards,

Pendyala Sesha Srinivas

midnight_coder · 09-28-2022

Thank you for the reply. I don't know what you mean by "reproducer code". The code in question is already in the initial post. The only missing pieces are the "Google Benchmark" library, which is too large to post here but can be found here, and the benchmark test code, which is below:

#include <string.h>
#include <random>
#include <limits>
#include "benchmark/benchmark.h" 

static void
Benchmark_Vector_Add_nosse(benchmark::State & rState) {
  RandomList<fixed_t, 64> rand(std::numeric_limits<fixed_t>::min(),
                               std::numeric_limits<fixed_t>::max());
  vector_t rslt;
  for (auto _ : rState) {
    vector_t vec[] = {
      vector_initdeclare(*rand++, *rand++, *rand++),
      vector_initdeclare(*rand++, *rand++, *rand++)
    };
    vector_add_nosse(&rslt, &vec[0], &vec[1]);
  }
}

static void
Benchmark_Vector_Add_sse(benchmark::State & rState) {
  RandomList<fixed_t, 64> rand(std::numeric_limits<fixed_t>::min(),
                               std::numeric_limits<fixed_t>::max());
  vector_t rslt;
  for (auto _ : rState) {
    vector_t vec[] = {
      vector_initdeclare(*rand++, *rand++, *rand++),
      vector_initdeclare(*rand++, *rand++, *rand++)
    };
    vector_add_sse(&rslt, &vec[0], &vec[1]);
  }
}

BENCHMARK(Benchmark_Vector_Add_nosse); 
BENCHMARK(Benchmark_Vector_Add_sse); 
BENCHMARK_MAIN();

Also, vector_initdeclare is a macro, defined as follows:

#define vector_initdeclare(X, Y, Z) {{X, Y, Z, 0}}
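
To make the expansion concrete, here is a minimal sketch of what the macro produces. It assumes fixed_t is int32_t (the typedef given later in this thread) and a compiler that accepts the GCC-style aligned attribute. check_initdeclare is a hypothetical helper added only for illustration; it is not part of the original code.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_t;  /* assumption: same typedef as given later in the thread */

#pragma pack(1)
typedef union {
  struct { fixed_t x, y, z, r; };
  fixed_t array[4];
} vector_t __attribute__ ((aligned(16)));
#pragma pack()

#define vector_initdeclare(X, Y, Z) {{X, Y, Z, 0}}

/* Hypothetical self-check: shows which lanes the macro fills. */
static void check_initdeclare(void) {
  /* Expands to: vector_t v = {{1, 2, 3, 0}}; */
  vector_t v = vector_initdeclare(1, 2, 3);
  assert(v.x == 1 && v.y == 2 && v.z == 3);
  assert(v.r == 0);         /* the fourth lane is always zeroed */
  assert(v.array[2] == 3);  /* named fields alias the array lanes */
}
```

The outer braces initialize the union and the inner braces initialize its first member, the anonymous struct, so the three arguments land in x, y, and z.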

As for the OS: Linux 5.15.0-48-generic, Ubuntu 22.04.
As for the compiler, it doesn't matter, because the issue isn't the compiled output (that is pasted in the original post); it's the performance difference between using SSE and not. Using SSE is slower, and that is counterintuitive. It can be compiled with GCC or ICC, with the same result. I tested with both ICC 2021.6.0 and gcc 11.2.1.

PLEASE NOTE: to prevent the compiler from generating SSE code for the '*_nosse(..)' function, please prefix the function signature with this:

__attribute__((optimize("no-tree-vectorize")))

If that's not added, the compiler will likely generate the same output for both methods.
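
A minimal sketch of where the attribute goes, assuming GCC (the optimize attribute is GCC-specific, and other compilers may ignore it) and assuming fixed_t is int32_t, as given later in this thread. check_same_result is a hypothetical helper added only to confirm that both variants compute the same lanes; it says nothing about their speed.

```c
#include <assert.h>
#include <emmintrin.h>
#include <stdint.h>
#include <string.h>

typedef int32_t fixed_t;  /* assumption: same typedef as given later in the thread */

#pragma pack(1)
typedef union {
  struct { fixed_t x, y, z, r; };
  fixed_t array[4];
} vector_t __attribute__ ((aligned(16)));
#pragma pack()

/* The attribute is placed in front of the scalar function's signature. */
__attribute__((optimize("no-tree-vectorize")))
vector_t *
vector_add_nosse(vector_t * const pOut,
                 const vector_t * const pLhs,
                 const vector_t * const pRhs) {
  pOut->x = (pLhs->x + pRhs->x);
  pOut->y = (pLhs->y + pRhs->y);
  pOut->z = (pLhs->z + pRhs->z);
  pOut->r = (pLhs->r + pRhs->r);
  return pOut;
}

vector_t *
vector_add_sse(vector_t * const pOut,
               const vector_t * const pLhs,
               const vector_t * const pRhs) {
  /* Aligned loads: both inputs must sit on 16-byte boundaries. */
  __m128i lhs = _mm_load_si128((const __m128i *)pLhs);
  __m128i rhs = _mm_load_si128((const __m128i *)pRhs);
  __m128i ret = _mm_add_epi32(lhs, rhs);
  _mm_store_si128((__m128i *)pOut, ret);
  return pOut;
}

/* Hypothetical self-check: both variants must produce identical lanes. */
static void check_same_result(const vector_t *a, const vector_t *b) {
  vector_t scalar_out, sse_out;
  vector_add_nosse(&scalar_out, a, b);
  vector_add_sse(&sse_out, a, b);
  assert(memcmp(scalar_out.array, sse_out.array, sizeof scalar_out.array) == 0);
}
```

Comparing the disassembly of the two functions, as in the original post, is still the way to confirm that the scalar version was really left unvectorized.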

midnight_coder · 09-28-2022

Oops, sorry. Here is the link to Google Benchmark: https://github.com/google/benchmark

SeshaP_Intel · 10-06-2022

Hi,

We are facing some issues while trying to run the source file. Please find the attachment which has the output of errors.

Could you please help us to resolve this so that we can reproduce your issue from our end? And also please provide the steps you have followed.

Thanks and Regards,

Pendyala Sesha Srinivas

SeshaP_Intel · 10-13-2022

Hi,

We haven't heard back from you. Could you please provide an update on your issue?

Thanks and Regards,

Pendyala Sesha Srinivas

midnight_coder · 10-13-2022

Yeah, so for each of the errors in your log file:
1) "error: identifier "fixed_t" is undefined": This can be solved by adding the following lines of code:

#include <stdint.h>
typedef int32_t fixed_t;

2) The "error: expression must be a pointer to a complete object type" and the other errors are a result of the type 'fixed_t' not being defined, so they should be resolved by step 1 above.
3) The "/usr/local/include/c++/12.1.0/bits/random.h(104): error: expected a declaration { __extension__ using type = unsigned __int128; };" error looks like an issue with your C++ compiler setup or installation.
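
With the typedef from step 1 in place, a quick sanity check along these lines can confirm that vector_t has the 16-byte size and alignment that the aligned SSE loads and stores rely on. This is a minimal sketch, assuming GCC or a compatible compiler; check_layout is a hypothetical helper and not part of the original code.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_t;

#pragma pack(1)
typedef union {
  struct { fixed_t x, y, z, r; };
  fixed_t array[4];
} vector_t __attribute__ ((aligned(16)));
#pragma pack()

/* Four 32-bit lanes must fill exactly one 128-bit SSE register. */
_Static_assert(sizeof(fixed_t) == 4, "fixed_t must be 32 bits wide");
_Static_assert(sizeof(vector_t) == 16, "vector_t must be 16 bytes");

/* Hypothetical runtime check of alignment and field aliasing. */
static void check_layout(void) {
  vector_t v = {{1, 2, 3, 4}};
  /* _mm_load_si128 and _mm_store_si128 need a 16-byte-aligned address. */
  assert(((uintptr_t)&v % 16) == 0);
  /* x, y, z, r overlay array[0] through array[3]. */
  assert(v.array[0] == v.x && v.array[3] == v.r);
}
```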

SeshaP_Intel · 11-02-2022

Hi,

We are unable to reproduce your issue. We have sent a mail with detailed steps and issues we are facing while trying to run the source file.

Could you please reply to the mail or send the complete source code and reproducible steps so that we can reproduce the issue and investigate more from our end?

Thanks and Regards,

Pendyala Sesha Srinivas

SeshaP_Intel · 11-09-2022

Hi,

We haven't heard back from you. Could you please provide an update on your issue?

Thanks and Regards,

Pendyala Sesha Srinivas

SeshaP_Intel · 11-15-2022

Hi,

We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.

Thanks and Regards,

Pendyala Sesha Srinivas