I don't understand why this is the case.
I have two functions that add two vectors of 32-bit integers. Strangely, the SSE implementation is slower; I would have expected the opposite. Why?
The code:
#include <emmintrin.h>

#pragma pack(1)
typedef union {
    struct { fixed_t x, y, z, r; };
    fixed_t array[4];
} vector_t __attribute__ ((aligned(16)));
#pragma pack()

vector_t *
vector_add_nosse(vector_t * const pOut,
                 const vector_t * const pLhs,
                 const vector_t * const pRhs) {
    pOut->x = (pLhs->x + pRhs->x);
    pOut->y = (pLhs->y + pRhs->y);
    pOut->z = (pLhs->z + pRhs->z);
    pOut->r = (pLhs->r + pRhs->r);
    return pOut;
}

vector_t *
vector_add_sse(vector_t * const pOut,
               const vector_t * const pLhs,
               const vector_t * const pRhs) {
    __m128i lhs = _mm_load_si128((__m128i*)pLhs);
    __m128i rhs = _mm_load_si128((__m128i*)pRhs);
    __m128i ret = _mm_add_epi32(lhs, rhs);
    _mm_store_si128((__m128i*)pOut, ret);
    return pOut;
}
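Before comparing timings, it can help to confirm that the two functions compute the same thing. The following is a minimal sketch, assuming fixed_t is a 32-bit signed integer and an x86 target with SSE2. The helper vector_add_agrees is hypothetical (it is not part of the posted code), and the packing pragma is left out on the assumption that it does not change the layout of this union.

```c
#include <assert.h>
#include <emmintrin.h>
#include <stdint.h>

typedef int32_t fixed_t; /* assumption: 32-bit signed integer */

/* Mirrors the union from the post (packing pragma omitted). */
typedef union {
    struct { fixed_t x, y, z, r; };
    fixed_t array[4];
} vector_t __attribute__((aligned(16)));

/* Scalar version: four independent 32-bit additions. */
vector_t *vector_add_nosse(vector_t *const pOut,
                           const vector_t *const pLhs,
                           const vector_t *const pRhs) {
    pOut->x = pLhs->x + pRhs->x;
    pOut->y = pLhs->y + pRhs->y;
    pOut->z = pLhs->z + pRhs->z;
    pOut->r = pLhs->r + pRhs->r;
    return pOut;
}

/* SSE2 version: one aligned 128-bit load per operand, one packed add,
   one aligned 128-bit store. */
vector_t *vector_add_sse(vector_t *const pOut,
                         const vector_t *const pLhs,
                         const vector_t *const pRhs) {
    __m128i lhs = _mm_load_si128((const __m128i *)pLhs);
    __m128i rhs = _mm_load_si128((const __m128i *)pRhs);
    _mm_store_si128((__m128i *)pOut, _mm_add_epi32(lhs, rhs));
    return pOut;
}

/* Hypothetical helper: returns 1 when both versions agree in every lane. */
int vector_add_agrees(const vector_t *a, const vector_t *b) {
    vector_t scalar, simd;
    vector_add_nosse(&scalar, a, b);
    vector_add_sse(&simd, a, b);
    for (int i = 0; i < 4; ++i) {
        if (scalar.array[i] != simd.array[i]) {
            return 0;
        }
    }
    return 1;
}
```

The inputs to such a check should stay small enough that no lane overflows: signed overflow is undefined behavior in the scalar C version, whereas the packed add wraps around.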
The timing results are below (i7-2620M CPU, 2.7 GHz):
Benchmark                       Time        CPU   Iterations
Benchmark_Vector_Add_nosse   5.99 ns    5.98 ns    116186952
Benchmark_Vector_Add_sse     8.78 ns    8.78 ns     79138862
To be sure, here is the compiler's assembly output for the two functions:
0000000000000000 <vector_add_nosse>:
0: f3 0f 1e fa endbr64
4: 8b 0a mov (%rdx),%ecx
6: 03 0e add (%rsi),%ecx
8: 48 89 f8 mov %rdi,%rax
b: 89 0f mov %ecx,(%rdi)
d: 8b 4a 04 mov 0x4(%rdx),%ecx
10: 03 4e 04 add 0x4(%rsi),%ecx
13: 89 4f 04 mov %ecx,0x4(%rdi)
16: 8b 4a 08 mov 0x8(%rdx),%ecx
19: 8b 52 0c mov 0xc(%rdx),%edx
1c: 03 4e 08 add 0x8(%rsi),%ecx
1f: 03 56 0c add 0xc(%rsi),%edx
22: 89 4f 08 mov %ecx,0x8(%rdi)
25: 89 57 0c mov %edx,0xc(%rdi)
28: c3 ret
Disassembly of section .text.vector_add_sse:
0000000000000000 <vector_add_sse>:
0: f3 0f 1e fa endbr64
4: 66 0f 6f 02 movdqa (%rdx),%xmm0
8: 66 0f fe 06 paddd (%rsi),%xmm0
c: 48 89 f8 mov %rdi,%rax
f: 0f 29 07 movaps %xmm0,(%rdi)
12: c3 ret
Does anyone have any idea why the non-SSE integer calculation is so much faster?
Thanks
Hi,
Thank you for posting in Intel Communities.
Could you please share sample reproducer code (and steps, if any) so that we can reproduce the issue on our end?
Also, please let us know the OS details and the compiler version being used.
Thanks and Regards,
Pendyala Sesha Srinivas
Thank you for the reply. I don't know what you mean by "reproducer code"; the code in question is already in the initial post. The only missing pieces are the "Google Benchmark" library, which is too large to post here but can be found here, and the benchmark test code, which is below:
#include <string.h>
#include <random>
#include <limits>
#include "benchmark/benchmark.h"

static void
Benchmark_Vector_Add_nosse(benchmark::State & rState) {
    RandomList<fixed_t, 64> rand(std::numeric_limits<fixed_t>::min(),
                                 std::numeric_limits<fixed_t>::max());
    vector_t rslt;
    for (auto _ : rState) {
        vector_t vec[] = {
            vector_initdeclare(*rand++, *rand++, *rand++),
            vector_initdeclare(*rand++, *rand++, *rand++)
        };
        vector_add_nosse(&rslt, &vec[0], &vec[1]);
    }
}

static void
Benchmark_Vector_Add_sse(benchmark::State & rState) {
    RandomList<fixed_t, 64> rand(std::numeric_limits<fixed_t>::min(),
                                 std::numeric_limits<fixed_t>::max());
    vector_t rslt;
    for (auto _ : rState) {
        vector_t vec[] = {
            vector_initdeclare(*rand++, *rand++, *rand++),
            vector_initdeclare(*rand++, *rand++, *rand++)
        };
        vector_add_sse(&rslt, &vec[0], &vec[1]);
    }
}

BENCHMARK(Benchmark_Vector_Add_nosse);
BENCHMARK(Benchmark_Vector_Add_sse);
BENCHMARK_MAIN();
Also, vector_initdeclare is the following macro:
#define vector_initdeclare(X, Y, Z) {{X, Y, Z, 0}}
As for the OS: Linux 5.15.0-48-generic, Ubuntu 22.04.
The compiler doesn't matter, because the issue isn't the compiled output (that is pasted in the original post); it's the performance difference between using SSE and not using it. Using SSE is slower, which is counterintuitive. The code can be compiled with GCC or ICC, with the same result. I tested with both ICC 2021.6.0 and gcc 11.2.1.
PLEASE NOTE: to prevent the compiler from generating SSE code for the '*_nosse(..)' function, please prefix the function signature with this:
__attribute__((optimize("no-tree-vectorize")))
If that's not added, the compiler will likely generate the same output for both functions.
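For reference, here is a minimal sketch of where that attribute goes, assuming GCC and assuming fixed_t is a 32-bit signed integer. The optimize attribute is GCC-specific; other compilers may ignore it with a warning, and the GCC manual describes it as intended for debugging rather than production code.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_t; /* assumption: 32-bit signed integer */

/* Mirrors the union from the original post (packing pragma omitted). */
typedef union {
    struct { fixed_t x, y, z, r; };
    fixed_t array[4];
} vector_t __attribute__((aligned(16)));

/* The attribute precedes the return type; GCC treats the string as
   -fno-tree-vectorize for this one function only. */
__attribute__((optimize("no-tree-vectorize")))
vector_t *vector_add_nosse(vector_t *const pOut,
                           const vector_t *const pLhs,
                           const vector_t *const pRhs) {
    pOut->x = pLhs->x + pRhs->x;
    pOut->y = pLhs->y + pRhs->y;
    pOut->z = pLhs->z + pRhs->z;
    pOut->r = pLhs->r + pRhs->r;
    return pOut;
}
```

Whether the attribute took effect can be checked by disassembling the object file and confirming that the function contains four scalar add instructions rather than a single paddd.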
Oops, sorry, here is the link to Google Benchmark: https://github.com/google/benchmark
Hi,
We are facing some issues while trying to run the source file. Please see the attachment, which contains the error output.
Could you please help us resolve this so that we can reproduce your issue on our end? Please also provide the steps you followed.
Thanks and Regards,
Pendyala Sesha Srinivas
Hi,
We haven't heard back from you. Could you please provide an update on your issue?
Thanks and Regards,
Pendyala Sesha Srinivas
Yeah, so for each of the errors in your log file:
1) "error: identifier "fixed_t" is undefined": This can be solved by adding the following lines of code.
#include <stdint.h>
typedef int32_t fixed_t;
2) The "error: expression must be a pointer to a complete object type" error and the other errors follow from the type 'fixed_t' not being defined, so step 1 above should resolve them.
3) The "/usr/local/include/c++/12.1.0/bits/random.h(104): error: expected a declaration { __extension__ using type = unsigned __int128; };" error looks like an issue with your C++ compiler setup or installation.
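Put together, the type definitions with the fix from step 1 applied might look like the sketch below. The compile-time checks are an addition for illustration, not part of the posted code; they assume a GCC-compatible compiler in C11 mode or later.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_t; /* the definition missing from the first post */

#pragma pack(1)
typedef union {
    struct { fixed_t x, y, z, r; };
    fixed_t array[4];
} vector_t __attribute__((aligned(16)));
#pragma pack()

/* Illustrative checks: aligned 128-bit loads and stores need a
   16-byte object on a 16-byte boundary. */
_Static_assert(sizeof(fixed_t) == 4, "fixed_t must be 32 bits");
_Static_assert(sizeof(vector_t) == 16, "vector_t must be 16 bytes");
_Static_assert(_Alignof(vector_t) == 16, "vector_t must be 16-byte aligned");
```

If any of these assertions fails at compile time, the layout no longer matches what the aligned SSE load and store intrinsics expect.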
Hi,
We are unable to reproduce your issue. We have sent an email with the detailed steps and the issues we are facing while trying to run the source file.
Could you please reply to the email, or send the complete source code and reproduction steps, so that we can reproduce the issue and investigate further on our end?
Thanks and Regards,
Pendyala Sesha Srinivas
Hi,
We haven't heard back from you. Could you please provide an update on your issue?
Thanks and Regards,
Pendyala Sesha Srinivas
Hi,
We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.
Thanks and Regards,
Pendyala Sesha Srinivas
