Intel® ISA Extensions

How to multiply __m128 by a scalar?

lascondes
Beginner
I am just starting to work with the SSE intrinsic functions. Is there a better way to multiply a vector V by a scalar A than what I am doing below?

I would like to do the following: u = u + (v*1.5 - vold*0.5)*delta_t;

where u, v, and vold are vectors with x, y, z coordinates, each represented as a __m128.

Is there a better way to do this than creating the a, b, and c values as I do in the code below? I am running on an Intel i7 computer, so any options would be appreciated.

#include "stdafx.h"
#include <iostream>
#include <xmmintrin.h>
#include <tchar.h>

using namespace std;

class __declspec(align(16)) Element {
public:
Element( float ux, float uy, float uz, float vx, float vy, float vz, float vxold, float vyold, float vzold) {
u.m128_f32[0] = ux;
u.m128_f32[1] = uy;
u.m128_f32[2] = uz;
v.m128_f32[0] = vx;
v.m128_f32[1] = vy;
v.m128_f32[2] = vz;
vold.m128_f32[0] = vxold;
vold.m128_f32[1] = vyold;
vold.m128_f32[2] = vzold;
}

void Move() {
// u = u + (v*1.5 - vold*0.5)*delta_t;
float delta_t = 0.01;
__m128 a, b, c;
a.m128_f32[0] = 1.5;
a.m128_f32[1] = 1.5;
a.m128_f32[2] = 1.5;

b.m128_f32[0] = 0.5;
b.m128_f32[1] = 0.5;
b.m128_f32[2] = 0.5;

c.m128_f32[0] = delta_t;
c.m128_f32[1] = delta_t;
c.m128_f32[2] = delta_t;

u = _mm_add_ps(u, _mm_mul_ps(_mm_sub_ps(_mm_mul_ps(v,a), _mm_mul_ps(vold,b)), c));
}

__m128 u, v, vold;
};

int _tmain(int argc, _TCHAR* argv[])
{
Element A( 1,1,1, 1,2,3, 4,5,6);
A.Move();

cout << A.u.m128_f32[0] << " " << A.u.m128_f32[1] << " " << A.u.m128_f32[2] << endl;

char c;
cin >> c;
return 0;
}
areid2
New Contributor I
Using the _mm_set1_ps and _mm_set_ps intrinsics will clean up your code. E.g., you could rewrite your Move method like:

void Move() {
// u = u + (v*1.5 - vold*0.5)*delta_t;
float delta_t = 0.01;

u = _mm_add_ps(u, _mm_mul_ps(_mm_sub_ps(_mm_mul_ps(v,_mm_set1_ps(1.5f)), _mm_mul_ps(vold,_mm_set1_ps(0.5f))), _mm_set1_ps(delta_t)));
}


You could similarly use _mm_set_ps to clean up your constructor.
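For instance, a constructor along those lines might look like the sketch below (an illustration, not from the original post; note that _mm_set_ps lists its arguments from the highest lane down to the lowest):

Element( float ux, float uy, float uz, float vx, float vy, float vz, float vxold, float vyold, float vzold) {
    // _mm_set_ps takes lanes in (e3, e2, e1, e0) order, so the x component goes last
    u    = _mm_set_ps(0.0f, uz, uy, ux);
    v    = _mm_set_ps(0.0f, vz, vy, vx);
    vold = _mm_set_ps(0.0f, vzold, vyold, vxold);
}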

In terms of performance, if you are going to put the Move() method into a loop, you would probably want to create __m128 variables for 0.5, 1.5, and delta_t outside the loop. That way the compiler can keep these values in registers instead of loading them at each iteration of the loop. It's hard to say much more about performance without knowing what you intend to do with this object/method.
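A rough sketch of what that hoisting could look like, assuming a hypothetical MoveAll helper that loops over an array of Element objects (the helper name and loop are illustrative, not from the original post):

void MoveAll(Element* elems, int count, float delta_t) {
    // Broadcast the constants once, outside the loop, so the compiler can
    // keep them in xmm registers across iterations.
    const __m128 a  = _mm_set1_ps(1.5f);
    const __m128 b  = _mm_set1_ps(0.5f);
    const __m128 dt = _mm_set1_ps(delta_t);

    for (int i = 0; i < count; ++i) {
        Element& e = elems[i];
        // u = u + (v*1.5 - vold*0.5)*delta_t
        e.u = _mm_add_ps(e.u, _mm_mul_ps(_mm_sub_ps(_mm_mul_ps(e.v, a), _mm_mul_ps(e.vold, b)), dt));
    }
}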
lascondes
Beginner
Very helpful. Thank you.
TimP
Honored Contributor III
Quoting - areid

You could similarly use the _mm_set_ps to clean up your constructor.

On both gcc and icc (but not MSVC), _mm_set_ps will optimize with either SSE2 or SSE4 instructions, according to the architecture set on command line.
bronxzv
New Contributor II
Quoting - lascondes

u += (v*1.5 - vold*0.5)*delta_t;
3 add + 3 sub + 9 mul = 15 scalar flops

Rewrite it as:
u += 0.5*delta_t*(v*3.0 - vold);
3 add + 3 sub + 7 mul = 13 scalar flops

Vectorize:
process N elements *stored in SoA* together (N=4 for SSE, 8 for AVX, 16 for LRB, ...)

Pseudo code below, hope it's clear.

Out of the loop:

const c0_5 = 0.5 (N times)
const c3_0 = 3.0 (N times)

Inner loop:

scale = c0_5*delta_t
ux += scale * (vx*c3_0 - voldx)
uy += scale * (vy*c3_0 - voldy)
uz += scale * (vz*c3_0 - voldz)


With SSE, c0_5, scale, ux, etc. are vectors of 4 elements.
For 4 results you need 7 mulps + 3 addps + 3 subps instead of 12 mulps + 4 addps + 4 subps with the AoS version above.
Also, the mul vs add/sub ratio is a bit better balanced with my proposal; it's faster this way (the i7 can issue a mul and an add in parallel) and it's future-proof for FMA.
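For reference, here is a minimal SSE sketch of that inner loop over SoA arrays (an illustration only: it assumes the float arrays are 16-byte aligned and their length n is padded to a multiple of 4; the function and parameter names are made up):

#include <xmmintrin.h>

void MoveSoA(float* ux, float* uy, float* uz,
             const float* vx, const float* vy, const float* vz,
             const float* voldx, const float* voldy, const float* voldz,
             int n, float delta_t)
{
    const __m128 c3_0  = _mm_set1_ps(3.0f);           // out of the loop
    const __m128 scale = _mm_set1_ps(0.5f * delta_t); // c0_5 * delta_t

    for (int i = 0; i < n; i += 4) {
        // ux += scale * (vx*c3_0 - voldx), four elements at a time
        _mm_store_ps(ux + i, _mm_add_ps(_mm_load_ps(ux + i),
            _mm_mul_ps(scale, _mm_sub_ps(_mm_mul_ps(_mm_load_ps(vx + i), c3_0), _mm_load_ps(voldx + i)))));
        _mm_store_ps(uy + i, _mm_add_ps(_mm_load_ps(uy + i),
            _mm_mul_ps(scale, _mm_sub_ps(_mm_mul_ps(_mm_load_ps(vy + i), c3_0), _mm_load_ps(voldy + i)))));
        _mm_store_ps(uz + i, _mm_add_ps(_mm_load_ps(uz + i),
            _mm_mul_ps(scale, _mm_sub_ps(_mm_mul_ps(_mm_load_ps(vz + i), c3_0), _mm_load_ps(voldz + i)))));
    }
}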

lascondes
Beginner
Some followup questions:

1. tim8: "On both gcc and icc (but not MSVC), _mm_set_ps will optimize with either SSE2 or SSE4 instructions, according to the architecture set on command line. "

What changes between SSE2 and SSE4 for the _mm_set_ps command?

2. bronxzv: "the mul vs add/sub ratio is a bit better balanced with my proposal, it's faster this way (i7 can issue mul and add in parallel)"

What, if anything, do I need to do so the i7 processor issues the mul and add in parallel? Do I need to call a special function, does the compiler take care of this, or does the processor do it all by itself? Appreciate your post - I think I have captured your intent in the code below.

3. Any additional comments on the following code are appreciated, particularly if someone can help get the parallel_for loop working. I'm not sure if it is something in my coding, or if the threads are accessing items at cache boundaries and this is causing a problem. Either way, what needs to be changed to get this to work?

My current example code (it can also be run without TBB):

// SSE Second Test.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <iostream>
#include <cmath>
#include <ctime>
#include <xmmintrin.h>
#include "tbb/tbb.h"
using namespace tbb;

using namespace std;

class __declspec(align(16)) Element {
public:
__m128 u, v, vold;

Element( float ux, float uy, float uz, float vx, float vy, float vz, float vxold, float vyold, float vzold) {
u =_mm_set_ps( 0, uz, uy, ux);
v =_mm_set_ps( 0, vz, vy, vx);
vold =_mm_set_ps( 0, vzold, vyold, vxold);
}

void MoveAoS(float delta_t ) {
// from areid
// u = u + (v*1.5 - vold*0.5)*delta_t;
// 3 add + 3 sub + 9 mul = 15 scalar flops
// u = _mm_add_ps(u, _mm_mul_ps(_mm_sub_ps(_mm_mul_ps(v,_mm_set1_ps(1.5f)), _mm_mul_ps(vold,_mm_set1_ps(0.5f))), _mm_set1_ps(delta_t)));

// bronxzv pointed out that the equation can be rewritten saving 2 multiplies
// u += 0.5*delta_t*(v*3.0 - vold);
// 3 add + 3 sub + 7 mul = 13 scalar flops
u = _mm_add_ps(u, _mm_mul_ps(_mm_set1_ps(0.5f * delta_t), _mm_sub_ps(_mm_mul_ps(v,_mm_set1_ps(3.0f)), vold)));
}
};

class __declspec(align(16)) ElementsSoA {
public:
float *ux, *uy, *uz, *vx, *vy, *vz, *vxold, *vyold, *vzold;
__m128 *ux128, *uy128, *uz128, *vx128, *vy128, *vz128, *vxold128, *vyold128, *vzold128;
static const int N = 4; // number of floats in each __m128
int M; // number of elements
int Mm128; // number of __m128 structures over the elements

ElementsSoA(int M_) { // Initialize structure for M Elements

// I think I need to allocate more memory than just M*sizeof(float)
int alignment = 16;
M = M_;
Mm128 = std::ceil(M*1.0/N); // ceiling function

ux = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);
uy = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);
uz = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);
vx = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);
vy = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);
vz = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);
vxold = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);
vyold = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);
vzold = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);

ux128 = (__m128*)ux;
uy128 = (__m128*)uy;
uz128 = (__m128*)uz;
vx128 = (__m128*)vx;
vy128 = (__m128*)vy;
vz128 = (__m128*)vz;
vxold128 = (__m128*)vxold;
vyold128 = (__m128*)vyold;
vzold128 = (__m128*)vzold;
}

~ElementsSoA() {
_aligned_free(ux);
_aligned_free(uy);
_aligned_free(uz);
_aligned_free(vx);
_aligned_free(vy);
_aligned_free(vz);
_aligned_free(vxold);
_aligned_free(vyold);
_aligned_free(vzold);

}

void MoveAoS(float delta_t ) {
// from areid
// u = u + (v*1.5 - vold*0.5)*delta_t;
// 3 add + 3 sub + 9 mul = 15 scalar flops
// u = _mm_add_ps(u, _mm_mul_ps(_mm_sub_ps(_mm_mul_ps(v,_mm_set1_ps(1.5f)), _mm_mul_ps(vold,_mm_set1_ps(0.5f))), _mm_set1_ps(delta_t)));

// bronxzv pointed out that the equation can be rewritten saving 2 multiplies
// u += 0.5*delta_t*(v*3.0 - vold);
// 3 add + 3 sub + 7 mul = 13 scalar flops

__m128 c5dt = _mm_set1_ps(0.5f * delta_t);
__m128 c3 = _mm_set1_ps(3.0f);
for (int j=0; j<Mm128; j++) {
ux128[j] = _mm_add_ps(ux128[j], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(vx128[j], c3), vxold128[j])));
uy128[j] = _mm_add_ps(uy128[j], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(vy128[j], c3), vyold128[j])));
uz128[j] = _mm_add_ps(uz128[j], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(vz128[j], c3), vzold128[j])));
}
}
};

void OptionA(int Iterations, int M, float delta_t) {
Element *A;

int alignment = 16;

A = (Element*)_aligned_malloc(sizeof(Element)*M, alignment);
for (int i=0;i<M;i++) {
A[i] = Element( 1+i,1,1, 1,2,3, 4,5,6);
}

clock_t start = clock();
for (int j=0;j<Iterations;j++) {
for (int i=0;i<M;i++) {
A[i].MoveAoS(delta_t);
}
}
double time = double(clock() - start)/CLOCKS_PER_SEC;

cout << A[0].u.m128_f32[0] << " " << A[0].u.m128_f32[1] << " " << A[0].u.m128_f32[2] << " " << time << endl;
_aligned_free(A);
}

void OptionB(int Iterations, int M, float delta_t) {
Element *A;

int alignment = 16;

A = (Element*)_aligned_malloc(sizeof(Element)*M, alignment);
for (int i=0;i<M;i++) {
A[i] = Element( 1+i,1,1, 1,2,3, 4,5,6);
}

clock_t start = clock();

for (int j=0;j<Iterations;j++) {
for (int i=0;i<M;i++) {
A[i].u = _mm_add_ps(A[i].u, _mm_mul_ps(_mm_set1_ps(0.5f * delta_t), _mm_sub_ps(_mm_mul_ps(A[i].v,_mm_set1_ps(3.0f)), A[i].vold)));
}
}

double time = double(clock() - start)/CLOCKS_PER_SEC;

cout << A[0].u.m128_f32[0] << " " << A[0].u.m128_f32[1] << " " << A[0].u.m128_f32[2] << " " << time << endl;
_aligned_free(A);
}

void OptionC( int Iterations, int M, float delta_t) {
ElementsSoA A(M);

int alignment = 16;

for (int i=0;i<M;i++) {
A.ux[i] = 1+i;
A.uy[i] = 1;
A.uz[i] = 1;

A.vx[i] = 1;
A.vy[i] = 2;
A.vz[i] = 3;

A.vxold[i] = 4;
A.vyold[i] = 5;
A.vzold[i] = 6;
}

clock_t start = clock();

for (int j=0;j<Iterations;j++) {
A.MoveAoS( delta_t);
}

double time = double(clock() - start)/CLOCKS_PER_SEC;

cout << A.ux[0] << " " << A.uy[0] << " " << A.uz[0] << " " << time << endl;
}

void OptionD( int Iterations, int M, float delta_t) {
ElementsSoA A(M);

int alignment = 16;

for (int i=0;i<M;i++) {
A.ux[i] = 1+i;
A.uy[i] = 1;
A.uz[i] = 1;

A.vx[i] = 1;
A.vy[i] = 2;
A.vz[i] = 3;

A.vxold[i] = 4;
A.vyold[i] = 5;
A.vzold[i] = 6;
}

clock_t start = clock();

for (int j=0;j<Iterations;j++) {
__m128 c5dt = _mm_set1_ps(0.5f * delta_t);
__m128 c3 = _mm_set1_ps(3.0f);

// The parallel_for does not work. Getting an Unhandled exception at 0x00f91a75 in SSE Third Test.exe: 0xC0000005: Access violation reading location 0x0071fa60.
// parallel_for( blocked_range<size_t>(0,A.Mm128),
// [=](const blocked_range<size_t>& r) {
// for(size_t i=r.begin(); i!=r.end(); ++i) {
// A.ux128[i] = _mm_add_ps(A.ux128[i], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(A.vx128[i], c3), A.vxold128[i])));
// A.uy128[i] = _mm_add_ps(A.uy128[i], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(A.vy128[i], c3), A.vyold128[i])));
// A.uz128[i] = _mm_add_ps(A.uz128[i], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(A.vz128[i], c3), A.vzold128[i])));
// }
// });

for (int i=0; i<A.Mm128; i++) {
A.ux128[i] = _mm_add_ps(A.ux128[i], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(A.vx128[i], c3), A.vxold128[i])));
A.uy128[i] = _mm_add_ps(A.uy128[i], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(A.vy128[i], c3), A.vyold128[i])));
A.uz128[i] = _mm_add_ps(A.uz128[i], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(A.vz128[i], c3), A.vzold128[i])));
}
}

double time = double(clock() - start)/CLOCKS_PER_SEC;

cout << A.ux[0] << " " << A.uy[0] << " " << A.uz[0] << " " << time << endl;
}

int _tmain(int argc, _TCHAR* argv[])
{
int Iterations = 10000;
int M = 125000; // Number of elements
float delta_t = 0.01f;

OptionA( Iterations, M, delta_t);
OptionB( Iterations, M, delta_t);
OptionC( Iterations, M, delta_t);
OptionD( Iterations, M, delta_t);

char c;
cin >> c;
return 0;
}
TimP
Honored Contributor III
Quoting - lascondes

1. tim8: "On both gcc and icc (but not MSVC), _mm_set_ps will optimize with either SSE2 or SSE4 instructions, according to the architecture set on command line. "

What changes between SSE2 and SSE4 for the _mm_set_ps command?

There is an SSE4 intrinsic for insertps, equivalent to set_ps. gcc and icc make the switch-over automatic.
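As an illustration (not from the original post), this is roughly the kind of sequence an SSE4.1-enabled compiler can emit when building a vector; bits 5:4 of the insertps immediate select the destination lane:

#include <smmintrin.h> // SSE4.1

// Roughly what _mm_set_ps(0, z, y, x) may lower to when SSE4.1 is enabled:
__m128 make_xyz(float x, float y, float z) {
    __m128 r = _mm_set_ss(x);                   // { x, 0, 0, 0 }
    r = _mm_insert_ps(r, _mm_set_ss(y), 0x10);  // insert y into lane 1
    r = _mm_insert_ps(r, _mm_set_ss(z), 0x20);  // insert z into lane 2
    return r;
}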
bronxzv
New Contributor II

>What, if anything, do I need to do so the i7 processor issues the mul and add in parallel?

You have nothing special to do. By the way, your 'delta_t' is in fact a loop invariant, which I missed, so it is actually perfectly mul/add balanced.

An interesting experiment would be to watch your timings after replacing the mulps with addps, or the other way around (i.e. all addps/subps or all mulps). Even if the computations have no practical meaning, you will get an idea of the gain from balanced add/mul.
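For instance, a timing-only variant of the per-element update could swap every mulps for an addps (the results are meaningless; the point is just to compare throughput against an all-mulps variant with the same instruction count; the helper names are made up):

// Real update shape:  u + scale*(v*c3 - vold)
// All-add variant: same number of instructions, every multiply replaced by an add
static inline __m128 StepAllAdd(__m128 u, __m128 v, __m128 vold, __m128 scale, __m128 c3) {
    return _mm_add_ps(u, _mm_add_ps(scale, _mm_sub_ps(_mm_add_ps(v, c3), vold)));
}

// All-mul variant: every add/sub replaced by a multiply
static inline __m128 StepAllMul(__m128 u, __m128 v, __m128 vold, __m128 scale, __m128 c3) {
    return _mm_mul_ps(u, _mm_mul_ps(scale, _mm_mul_ps(_mm_mul_ps(v, c3), vold)));
}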




areid2
New Contributor I
Could the problem with the parallel_for be that you forgot to create a task_scheduler_init object in main()?
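For reference, a minimal sketch of OptionD's loop body with explicit initialization and a parallel_for (assumptions: TBB 2.x headers and a compiler with lambda support; note the lambda captures A by reference, because capturing the ElementsSoA object by value copies it, and each copy's destructor frees the shared arrays, which by itself could produce this kind of access violation):

#include "tbb/task_scheduler_init.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

// Somewhere before the timing loop (optional in TBB 2.2, required in earlier versions):
tbb::task_scheduler_init init;

__m128 c5dt = _mm_set1_ps(0.5f * delta_t);
__m128 c3   = _mm_set1_ps(3.0f);

tbb::parallel_for(tbb::blocked_range<size_t>(0, A.Mm128),
    [&A, c5dt, c3](const tbb::blocked_range<size_t>& r) { // A captured by reference
        for (size_t i = r.begin(); i != r.end(); ++i) {
            A.ux128[i] = _mm_add_ps(A.ux128[i], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(A.vx128[i], c3), A.vxold128[i])));
            A.uy128[i] = _mm_add_ps(A.uy128[i], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(A.vy128[i], c3), A.vyold128[i])));
            A.uz128[i] = _mm_add_ps(A.uz128[i], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(A.vz128[i], c3), A.vzold128[i])));
        }
    });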

lascondes
Beginner
Quoting - areid
Could the problem with the parallel_for be that you forgot to create a task_scheduler_init object in main()?

Hi areid,

My understanding is that the task_scheduler_init is created automatically the first time I do anything that requires the threads. In this case it should happen when the first parallel_for() loop is hit. If I want something other than the default behavior - i.e. more or fewer threads than I have processors - then I need to create it myself before any of the parallel_for loops.

The loop appears to run for a short while before throwing the error.

Can the SSE functions be called in parallel? On an i7 chip, how many SSE functions can be called at the same time? I am running on an i7 with 4 cores, each hyperthreaded. Does this mean I can handle 1 SSE call, 4 SSE calls, or 8 SSE calls at the same time? Any links to somewhere this is discussed would be appreciated.

Regards,

Las Condes
areid2
New Contributor I
Quoting - lascondes

My understanding is that the task_scheduler_init is created automatically the first time I do anything that requires the threads.

That is the behavior in TBB 2.2, but not in previous versions. I don't know what version you are using.

There is no problem CALLING multiple SSE functions at the same time in different threads; in fact, that is recommended to get the best performance. You can call as many SSE functions as you like in your program. If you had 100 threads, they could all be calling SSE instructions at the same time (although I don't recommend this for performance). The Core i7 processor will only be able to EXECUTE between 4 and 8 SSE instructions in parallel. The reason I say at least 4 is that each core can process SSE instructions independently (just like any other instruction). The reason it can be up to 8 is that each core is hyperthreaded. That means it can process 2 streams of instructions (hardware threads) at the same time. If the 2 threads on an HT core are using different processing units (integer, floating point, etc.), then it can actually carry out both instructions in parallel. In your case, since you are using mostly floating point instructions, you probably won't see a big gain when using 8 threads over 4, but you will probably still see a small gain. This is likely because the load/store and memory address calculations can occur in one thread on an HT core while the other thread is doing some floating point.

Using as many threads as there are logical CPUs is a good general rule. If you really care about a few percent difference in performance you can experiment with fewer threads, but don't expect a big difference. The disadvantage of hard coding the number of threads is that if you run your app on a different machine, you might need to recompile to adjust the number of threads.
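A small sketch of how the thread count could be varied at run time instead of recompiling (task_scheduler_init and its default_num_threads() are part of TBB; halving is just an example of trying one thread per physical core):

#include "tbb/task_scheduler_init.h"

int logical = tbb::task_scheduler_init::default_num_threads(); // number of logical CPUs
int threads = (logical >= 2) ? logical / 2 : 1;                // e.g. ignore HT for a test run
tbb::task_scheduler_init init(threads);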
lascondes
Beginner
Quoting - areid

That is the behavior in TBB 2.2, but not in previous versions. I don't know what version you are using.


Hi areid,

Good news in your post. I am using TBB 2.2. I'll spend some more time looking at my code (the parallel_for loop). It runs for a few cycles and then generates the read error on a Core 2 laptop. I'll try to work on it some more tonight on my i7 computer.

lascondes