Intel® ISA Extensions

## How to multiply __m128 by a scalar?
957 Views
I am just starting to work with the SSE intrinsic functions. Is there a better way to multiply a vector V by a scalar A than what I am doing below?

I would like to do the following u = u + (v*1.5 - vold*0.5)*delta_t;

where u, v, and vold are a vector with x, y, z, coordinates represented in a __m128.

Is there a better way to do this than to create the a, b and c values as I do in the below code? I am running on an Intel i7 computer so any options would be appreciated.

#include "stdafx.h"
#include <iostream>
#include <xmmintrin.h>

using namespace std;

class __declspec(align(16)) Element {
public:
    Element( float ux, float uy, float uz, float vx, float vy, float vz, float vxold, float vyold, float vzold) {
        u.m128_f32[0] = ux;
        u.m128_f32[1] = uy;
        u.m128_f32[2] = uz;
        v.m128_f32[0] = vx;
        v.m128_f32[1] = vy;
        v.m128_f32[2] = vz;
        vold.m128_f32[0] = vxold;
        vold.m128_f32[1] = vyold;
        vold.m128_f32[2] = vzold;
    }

    void Move() {
        // u = u + (v*1.5 - vold*0.5)*delta_t;
        float delta_t = 0.01f;
        __m128 a, b, c;
        a.m128_f32[0] = 1.5f;
        a.m128_f32[1] = 1.5f;
        a.m128_f32[2] = 1.5f;

        b.m128_f32[0] = 0.5f;
        b.m128_f32[1] = 0.5f;
        b.m128_f32[2] = 0.5f;

        c.m128_f32[0] = delta_t;
        c.m128_f32[1] = delta_t;
        c.m128_f32[2] = delta_t;

        u = _mm_add_ps(u, _mm_mul_ps(_mm_sub_ps(_mm_mul_ps(v,a), _mm_mul_ps(vold,b)), c));
    }

__m128 u, v, vold;
};

int _tmain(int argc, _TCHAR* argv[])
{
Element A( 1,1,1, 1,2,3, 4,5,6);
A.Move();

cout << A.u.m128_f32[0] << " " << A.u.m128_f32[1] << " " << A.u.m128_f32[2] << endl;

char c;
cin >> c;
return 0;
}
1 Solution
Using the _mm_set1_ps and _mm_set_ps intrinsics will clean up your code. E.g., you could rewrite your Move method like:

void Move() {
    // u = u + (v*1.5 - vold*0.5)*delta_t;
    float delta_t = 0.01f;

    u = _mm_add_ps(u, _mm_mul_ps(_mm_sub_ps(_mm_mul_ps(v,_mm_set1_ps(1.5f)), _mm_mul_ps(vold,_mm_set1_ps(0.5f))), _mm_set1_ps(delta_t)));
}

You could similarly use _mm_set_ps to clean up your constructor.

In terms of performance, if you are going to put the Move() method into a loop, you would probably want to create __m128 variables for 0.5, 1.5 and delta_t outside the loop. That way the compiler can keep these values in registers instead of loading them at each iteration of the loop. Hard to say much more about performance without knowing what you intend to do with this object/method.
11 Replies
Quoting - areid

You could similarly use the _mm_set_ps to clean up your constructor.

On both gcc and icc (but not MSVC), _mm_set_ps will optimize with either SSE2 or SSE4 instructions, according to the architecture set on the command line.
Quoting - lascondes
I am just starting to work with the SSE intrinsic functions. Is there a better way to multiply a vector V by a scalar A than what I am doing below?

I would like to do the following u = u + (v*1.5 - vold*0.5)*delta_t;

where u, v, and vold are a vector with x, y, z, coordinates represented in a __m128.

Is there a better way to do this than to create the a, b and c values as I do in the below code? I am running on an Intel i7 computer so any options would be appreciated.


u +=(v*1.5 - vold*0.5)*delta_t;
3 add + 3 sub + 9 mul = 15 scalar flops

rewrite it as :
u +=0.5*delta_t*(v*3.0 - vold);
3 add + 3 sub +7mul=13 scalar flops

vectorize :
process N elements, stored in SoA, together (N=4 for SSE, 8 for AVX, 16 for LRB, ...)

pseudo code below, hope it's clear

out of loop :

const c0_5 = 0.5 (N times)
const c3_0 = 3.0 (N times)

inner loop :

scale = c0_5 * delta_t
ux += scale * (vx*c3_0 - voldx)
uy += scale * (vy*c3_0 - voldy)
uz += scale * (vz*c3_0 - voldz)

with SSE: c0_5, scale, ux, etc. are vectors of 4 elements
for 4 results you need 7 mulps + 3 addps + 3 subps instead of 12 mulps + 4 addps + 4 subps with the AoS version above
also the mul vs add/sub ratio is a bit better balanced with my proposal; it's faster this way (i7 can issue mul and add in parallel) and future-proof for FMA
Some follow-up questions:

1. tim8: "On both gcc and icc (but not MSVC), _mm_set_ps will optimize with either SSE2 or SSE4 instructions, according to the architecture set on command line. "

What changes between SSE2 and SSE4 for the _mm_set_ps command?

2. bronxzv: "the mul vs add/sub ratio is a bit better balanced with my proposal, it's faster this way (i7 can issue mul and add in parallel)"

What, if anything, do I need to do so the i7 processor issues the mul and add in parallel? Do I need to call a special function, does the compiler take care of this, or does the processor do it all by itself? Appreciate your post - I think I have captured your intent in the code below.

3. Any additional comments on the following code are appreciated. Particularly if someone can help get the parallel_for loop working. I'm not sure if it is something in my coding or if the threads are accessing items at cache boundaries and this is causing a problem. Either way, what needs to be changed to get this to work?

My current example code (can also be run without tbb):

// SSE Second Test.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <iostream>
#include <cmath>
#include <ctime>
#include <xmmintrin.h>
#include "tbb/tbb.h"
using namespace tbb;

using namespace std;

class __declspec(align(16)) Element {
public:
__m128 u, v, vold;

Element( float ux, float uy, float uz, float vx, float vy, float vz, float vxold, float vyold, float vzold) {
u =_mm_set_ps( 0, uz, uy, ux);
v =_mm_set_ps( 0, vz, vy, vx);
vold =_mm_set_ps( 0, vzold, vyold, vxold);
}

void MoveAoS(float delta_t ) {
// from areid
// u = u + (v*1.5 - vold*0.5)*delta_t;
// 3 add + 3 sub + 9 mul = 15 scalar flops
// u = _mm_add_ps(u, _mm_mul_ps(_mm_sub_ps(_mm_mul_ps(v,_mm_set1_ps(1.5f)), _mm_mul_ps(vold,_mm_set1_ps(0.5f))), _mm_set1_ps(delta_t)));

// bronxzv pointed out that the equation can be rewritten saving 2 multiplies
// u += 0.5*delta_t*(v*3.0 - vold);
// 3 add + 3 sub + 7 mul = 13 scalar flops
u = _mm_add_ps(u, _mm_mul_ps(_mm_set1_ps(0.5f * delta_t), _mm_sub_ps(_mm_mul_ps(v,_mm_set1_ps(3.0f)), vold)));
}
};

class __declspec(align(16)) ElementsSoA {
public:
float *ux, *uy, *uz, *vx, *vy, *vz, *vxold, *vyold, *vzold;
__m128 *ux128, *uy128, *uz128, *vx128, *vy128, *vz128, *vxold128, *vyold128, *vzold128;
static const int N = 4; // number of floats in each __m128
int M; // number of elements
int Mm128; // number of __m128 structures over elements

ElementsSoA(int M_) { // Initialize structure for M Elements

// I think I need to allocate more memory than just M*sizeof(float)
int alignment = 16;
M = M_;
Mm128 = (int)std::ceil(M*1.0/N); // ceiling function

ux = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);
uy = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);
uz = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);
vx = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);
vy = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);
vz = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);
vxold = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);
vyold = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);
vzold = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);

ux128 = (__m128*)ux;
uy128 = (__m128*)uy;
uz128 = (__m128*)uz;
vx128 = (__m128*)vx;
vy128 = (__m128*)vy;
vz128 = (__m128*)vz;
vxold128 = (__m128*)vxold;
vyold128 = (__m128*)vyold;
vzold128 = (__m128*)vzold;
}

~ElementsSoA() {
_aligned_free(ux);
_aligned_free(uy);
_aligned_free(uz);
_aligned_free(vx);
_aligned_free(vy);
_aligned_free(vz);
_aligned_free(vxold);
_aligned_free(vyold);
_aligned_free(vzold);

}

void MoveAoS(float delta_t ) {
// from areid
// u = u + (v*1.5 - vold*0.5)*delta_t;
// 3 add + 3 sub + 9 mul = 15 scalar flops
// u = _mm_add_ps(u, _mm_mul_ps(_mm_sub_ps(_mm_mul_ps(v,_mm_set1_ps(1.5f)), _mm_mul_ps(vold,_mm_set1_ps(0.5f))), _mm_set1_ps(delta_t)));

// bronxzv pointed out that the equation can be rewritten saving 2 multiplies
// u += 0.5*delta_t*(v*3.0 - vold);
// 3 add + 3 sub + 7 mul = 13 scalar flops

__m128 c5dt = _mm_set1_ps(0.5f * delta_t);
__m128 c3 = _mm_set1_ps(3.0f);
for (int j=0; j<Mm128; j++) {
    ux128[j] = _mm_add_ps(ux128[j], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(vx128[j], c3), vxold128[j])));
    uy128[j] = _mm_add_ps(uy128[j], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(vy128[j], c3), vyold128[j])));
    uz128[j] = _mm_add_ps(uz128[j], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(vz128[j], c3), vzold128[j])));
}
}
};

void OptionA(int Iterations, int M, float delta_t) {
Element *A;

int alignment = 16;

A = (Element*)_aligned_malloc(sizeof(Element)*M, alignment);
for (int i=0;i<M;i++) {
    A[i] = Element( 1+i,1,1, 1,2,3, 4,5,6);
}

clock_t start = clock();
for (int j=0;j<Iterations;j++) {
    for (int i=0;i<M;i++) {
        A[i].MoveAoS(delta_t);
    }
}
double time = double(clock() - start)/CLOCKS_PER_SEC;

cout << A[0].u.m128_f32[0] << " " << A[0].u.m128_f32[1] << " " << A[0].u.m128_f32[2] << " " << time << endl;
_aligned_free(A);
}

void OptionB(int Iterations, int M, float delta_t) {
Element *A;

int alignment = 16;

A = (Element*)_aligned_malloc(sizeof(Element)*M, alignment);
for (int i=0;i<M;i++) {
    A[i] = Element( 1+i,1,1, 1,2,3, 4,5,6);
}

clock_t start = clock();

for (int j=0;j<Iterations;j++) {
    for (int i=0;i<M;i++) {
        A[i].u = _mm_add_ps(A[i].u, _mm_mul_ps(_mm_set1_ps(0.5f * delta_t), _mm_sub_ps(_mm_mul_ps(A[i].v,_mm_set1_ps(3.0f)), A[i].vold)));
    }
}

double time = double(clock() - start)/CLOCKS_PER_SEC;

cout << A[0].u.m128_f32[0] << " " << A[0].u.m128_f32[1] << " " << A[0].u.m128_f32[2] << " " << time << endl;
_aligned_free(A);
}

void OptionC( int Iterations, int M, float delta_t) {
ElementsSoA A(M);

int alignment = 16;

for (int i=0;i<M;i++) {
    A.ux[i] = 1+i;
    A.uy[i] = 1;
    A.uz[i] = 1;

    A.vx[i] = 1;
    A.vy[i] = 2;
    A.vz[i] = 3;

    A.vxold[i] = 4;
    A.vyold[i] = 5;
    A.vzold[i] = 6;
}

clock_t start = clock();

for (int j=0;j<Iterations;j++) {
    A.MoveAoS( delta_t);
}

double time = double(clock() - start)/CLOCKS_PER_SEC;

cout << A.ux[0] << " " << A.uy[0] << " " << A.uz[0] << " " << time << endl;
}

void OptionD( int Iterations, int M, float delta_t) {
ElementsSoA A(M);

int alignment = 16;

for (int i=0;i<M;i++) {
    A.ux[i] = 1+i;
    A.uy[i] = 1;
    A.uz[i] = 1;

    A.vx[i] = 1;
    A.vy[i] = 2;
    A.vz[i] = 3;

    A.vxold[i] = 4;
    A.vyold[i] = 5;
    A.vzold[i] = 6;
}

clock_t start = clock();

for (int j=0;j<Iterations;j++) {
    __m128 c5dt = _mm_set1_ps(0.5f * delta_t);
    __m128 c3 = _mm_set1_ps(3.0f);

    // The parallel_for does not work. Getting an Unhandled exception at 0x00f91a75 in SSE Third Test.exe: 0xC0000005: Access violation reading location 0x0071fa60.
    // parallel_for( blocked_range<size_t>(0,A.Mm128),
    //     [=](const blocked_range<size_t>& r) {
    //         for(size_t i=r.begin(); i!=r.end(); ++i) {
    //             A.ux128[i] = _mm_add_ps(A.ux128[i], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(A.vx128[i], c3), A.vxold128[i])));
    //             A.uy128[i] = _mm_add_ps(A.uy128[i], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(A.vy128[i], c3), A.vyold128[i])));
    //             A.uz128[i] = _mm_add_ps(A.uz128[i], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(A.vz128[i], c3), A.vzold128[i])));
    //         }
    //     });

    for (int i=0; i<A.Mm128; i++) {
        A.ux128[i] = _mm_add_ps(A.ux128[i], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(A.vx128[i], c3), A.vxold128[i])));
        A.uy128[i] = _mm_add_ps(A.uy128[i], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(A.vy128[i], c3), A.vyold128[i])));
        A.uz128[i] = _mm_add_ps(A.uz128[i], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(A.vz128[i], c3), A.vzold128[i])));
    }
}

double time = double(clock() - start)/CLOCKS_PER_SEC;

cout << A.ux[0] << " " << A.uy[0] << " " << A.uz[0] << " " << time << endl;
}

int _tmain(int argc, _TCHAR* argv[])
{
int Iterations = 10000;
int M = 125000; // Number of elements
float delta_t = 0.01f;

OptionA( Iterations, M, delta_t);
OptionB( Iterations, M, delta_t);
OptionC( Iterations, M, delta_t);
OptionD( Iterations, M, delta_t);

char c;
cin >> c;
return 0;
}
Quoting - lascondes

1. tim8: "On both gcc and icc (but not MSVC), _mm_set_ps will optimize with either SSE2 or SSE4 instructions, according to the architecture set on command line. "

What changes between SSE2 and SSE4 for the _mm_set_ps command?

There is an SSE4 intrinsic for insertps, equivalent to set_ps. gcc and icc make the switch-over automatic.

>What, if anything, do I need to do so the i7 processor issues the mul and add in parallel?

You have nothing special to do; by the way, your 'delta_t' is a loop invariant which I missed, so it's in fact perfectly mul/add balanced

an interesting experiment will be to watch your timings by replacing the mulps with addps or the other way around (i.e. all addps/subps or all mulps); even if the computations will have no practical meaning, you will get an idea of the gain with balanced add/mul
Could the problem with the parallel_for be that you forgot to create a task_scheduler_init object in main()?
Quoting - areid
Could the problem with the parallel_for be that you forgot to create a task_scheduler_init object in main()?

Hi areid,

My understanding is the task_scheduler_init is created automatically the first time I do anything that would require the threads. In this case it should be called when the first parallel_for() loop is hit. If I want to change to something other than the default behavior - i.e. more or fewer threads than I have processors - then I need to call this function before any of the parallel_for loops.

The loop appears to run for a short while before throwing the error.

Can the SSE functions be called in parallel? On an i7 chip how many SSE functions can be called at the same time? I am running on an i7 with 4 cores each hyperthreaded. Does this mean I can handle 1 SSE call, 4 SSE calls or 8 SSE calls at the same time? Any links to somewhere that this is discussed would be appreciated.

Regards,

Las Condes
Quoting - lascondes

My understanding is the task_scheduler_init is created automatically the first time I do anything that would require the threads.

That is the behavior in TBB 2.2, but not in previous versions. I don't know what version you are using.

Using as many threads as there are logical CPUs is a good general rule. If you really care about a few percent difference in performance you can experiment with fewer threads, but don't expect a big difference. The disadvantage of hard-coding the number of threads is that if you run your app on a different machine, you might need to recompile to adjust the number of threads.
Quoting - areid

That is the behavior in TBB 2.2, but not in previous versions. I don't know what version you are using. 