
- Intel Community
- Software
- Software Development Technologies
- Intel® ISA Extensions
- How to multiply __m128 by a scalar?


lascondes

Beginner

09-30-2009 11:30 AM

How to multiply __m128 by a scalar?

I would like to do the following: u = u + (v*1.5 - vold*0.5)*delta_t;

where u, v, and vold are vectors with x, y, z coordinates, each represented in a __m128.

Is there a better way to do this than creating the a, b, and c values as I do in the code below? I am running on an Intel i7 computer, so any options would be appreciated.

#include "stdafx.h"

#include <iostream>
#include <xmmintrin.h>
#include <tchar.h>

using namespace std;

class __declspec(align(16)) Element {
public:
    Element(float ux, float uy, float uz, float vx, float vy, float vz,
            float vxold, float vyold, float vzold) {
        // also clear the unused w lane so it never holds garbage
        u.m128_f32[0] = ux;       u.m128_f32[1] = uy;       u.m128_f32[2] = uz;       u.m128_f32[3] = 0;
        v.m128_f32[0] = vx;       v.m128_f32[1] = vy;       v.m128_f32[2] = vz;       v.m128_f32[3] = 0;
        vold.m128_f32[0] = vxold; vold.m128_f32[1] = vyold; vold.m128_f32[2] = vzold; vold.m128_f32[3] = 0;
    }

    void Move() {
        // u = u + (v*1.5 - vold*0.5)*delta_t;
        float delta_t = 0.01f;
        __m128 a, b, c;
        a.m128_f32[0] = a.m128_f32[1] = a.m128_f32[2] = a.m128_f32[3] = 1.5f;
        b.m128_f32[0] = b.m128_f32[1] = b.m128_f32[2] = b.m128_f32[3] = 0.5f;
        c.m128_f32[0] = c.m128_f32[1] = c.m128_f32[2] = c.m128_f32[3] = delta_t;
        u = _mm_add_ps(u, _mm_mul_ps(_mm_sub_ps(_mm_mul_ps(v, a), _mm_mul_ps(vold, b)), c));
    }

    __m128 u, v, vold;
};

int _tmain(int argc, _TCHAR* argv[])
{
    Element A(1,1,1, 1,2,3, 4,5,6);
    A.Move();
    cout << A.u.m128_f32[0] << " " << A.u.m128_f32[1] << " " << A.u.m128_f32[2] << endl;
    char c;
    cin >> c;
    return 0;
}

1 Solution

areid2

New Contributor I

09-30-2009 11:58 AM

void Move() {
    // u = u + (v*1.5 - vold*0.5)*delta_t;
    float delta_t = 0.01f;
    u = _mm_add_ps(u, _mm_mul_ps(_mm_sub_ps(_mm_mul_ps(v, _mm_set1_ps(1.5f)),
                                            _mm_mul_ps(vold, _mm_set1_ps(0.5f))),
                                 _mm_set1_ps(delta_t)));
}

You could similarly use _mm_set_ps to clean up your constructor.

In terms of performance, if you are going to call the Move() method in a loop, you would probably want to create __m128 variables for 0.5, 1.5, and delta_t outside the loop. That way the compiler can keep these values in registers instead of loading them on each iteration of the loop. Hard to say much more about performance without knowing what you intend to do with this object/method.


11 Replies


lascondes

Beginner

09-30-2009 12:52 PM

Very helpful. Thank you.

TimP

Black Belt

09-30-2009 01:55 PM

Quoting - areid

You could similarly use _mm_set_ps to clean up your constructor.

On both gcc and icc (but not MSVC), _mm_set_ps will optimize with either SSE2 or SSE4 instructions, according to the architecture set on the command line.

bronxzv

New Contributor II

10-01-2009 09:19 AM

Quoting - lascondes

I would like to do the following: u = u + (v*1.5 - vold*0.5)*delta_t; [original post quoted in full above]

u += (v*1.5 - vold*0.5)*delta_t;

3 add + 3 sub + 9 mul = 15 scalar flops

Rewrite it as:

u += 0.5*delta_t*(v*3.0 - vold);

3 add + 3 sub + 7 mul = 13 scalar flops

Vectorize: process N elements *stored in SoA* together (N = 4 for SSE, 8 for AVX, 16 for LRB, ...).

Pseudo code below, hope it's clear.

Out of the loop:

const c0_5 = 0.5 (N times)

const c3_0 = 3.0 (N times)

Inner loop:

scale = c0_5*delta_t

ux += scale*(c3_0*vx - vxold)

uy += scale*(c3_0*vy - vyold)

uz += scale*(c3_0*vz - vzold)

With SSE, c0_5, scale, ux, and so on are packets of 4 floats.

For 4 results you need 7 mulps + 3 addps + 3 subps, instead of 12 mulps + 4 addps + 4 subps with the AoS version above.

Also, the mul vs add/sub ratio is a bit better balanced with my proposal; it's faster this way (i7 can issue mul and add in parallel) and future-proof for FMA.

lascondes

Beginner

10-05-2009 12:31 PM

1. tim8: "On both gcc and icc (but not MSVC), _mm_set_ps will optimize with either SSE2 or SSE4 instructions, according to the architecture set on command line. "

What changes between SSE2 and SSE4 for the _mm_set_ps command?

2. bronxzv: "the mul vs add/sub ratio is a bit better balanced with my proposal, it's faster this way (i7 can issue mul and add in parallel)"

What, if anything, do I need to do so that the i7 processor issues the mul and add in parallel? Do I need to call a special function, does the compiler take care of this, or does the processor do it all by itself? I appreciate your post - I think I have captured your intent in the code below.

3. Any additional comments on the following code are appreciated, particularly if someone can help get the parallel_for loop working. I'm not sure if it is something in my coding, or if the threads are accessing items at cache boundaries and this is causing a problem. Either way, what needs to be changed to get this to work?

My current example code (it can also be run without TBB):

// SSE Second Test.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <iostream>
#include <cmath>
#include <ctime>
#include <malloc.h>
#include <xmmintrin.h>
#include "tbb/tbb.h"

using namespace tbb;
using namespace std;

class __declspec(align(16)) Element {
public:
    __m128 u, v, vold;

    Element(float ux, float uy, float uz, float vx, float vy, float vz,
            float vxold, float vyold, float vzold) {
        u = _mm_set_ps(0, uz, uy, ux);
        v = _mm_set_ps(0, vz, vy, vx);
        vold = _mm_set_ps(0, vzold, vyold, vxold);
    }

    void MoveAoS(float delta_t) {
        // from areid
        // u = u + (v*1.5 - vold*0.5)*delta_t;
        // 3 add + 3 sub + 9 mul = 15 scalar flops
        // u = _mm_add_ps(u, _mm_mul_ps(_mm_sub_ps(_mm_mul_ps(v,_mm_set1_ps(1.5f)), _mm_mul_ps(vold,_mm_set1_ps(0.5f))), _mm_set1_ps(delta_t)));
        // bronxzv pointed out that the equation can be rewritten saving 2 multiplies
        // u += 0.5*delta_t*(v*3.0 - vold);
        // 3 add + 3 sub + 7 mul = 13 scalar flops
        u = _mm_add_ps(u, _mm_mul_ps(_mm_set1_ps(0.5f * delta_t),
                                     _mm_sub_ps(_mm_mul_ps(v, _mm_set1_ps(3.0f)), vold)));
    }
};

class __declspec(align(16)) ElementsSoA {
public:
    float *ux, *uy, *uz, *vx, *vy, *vz, *vxold, *vyold, *vzold;
    __m128 *ux128, *uy128, *uz128, *vx128, *vy128, *vz128, *vxold128, *vyold128, *vzold128;
    static const int N = 4; // number of floats in each __m128
    int M;                  // number of elements
    int Mm128;              // number of __m128 structures over the elements

    ElementsSoA(int M_) { // initialize structure for M elements
        // Allocate whole __m128 packets, i.e. more than just M*sizeof(float)
        int alignment = 16;
        M = M_;
        Mm128 = (int)std::ceil(M * 1.0 / N); // ceiling function
        ux = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);
        uy = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);
        uz = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);
        vx = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);
        vy = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);
        vz = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);
        vxold = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);
        vyold = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);
        vzold = (float*)_aligned_malloc(sizeof(float)*Mm128*N, alignment);
        ux128 = (__m128*)ux;
        uy128 = (__m128*)uy;
        uz128 = (__m128*)uz;
        vx128 = (__m128*)vx;
        vy128 = (__m128*)vy;
        vz128 = (__m128*)vz;
        vxold128 = (__m128*)vxold;
        vyold128 = (__m128*)vyold;
        vzold128 = (__m128*)vzold;
    }

    ~ElementsSoA() {
        _aligned_free(ux);
        _aligned_free(uy);
        _aligned_free(uz);
        _aligned_free(vx);
        _aligned_free(vy);
        _aligned_free(vz);
        _aligned_free(vxold);
        _aligned_free(vyold);
        _aligned_free(vzold);
    }

    void MoveAoS(float delta_t) {
        // u += 0.5*delta_t*(v*3.0 - vold); as above, but over SoA packets
        __m128 c5dt = _mm_set1_ps(0.5f * delta_t);
        __m128 c3 = _mm_set1_ps(3.0f);
        for (int j = 0; j < Mm128; j++) {
            ux128[j] = _mm_add_ps(ux128[j], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(vx128[j], c3), vxold128[j])));
            uy128[j] = _mm_add_ps(uy128[j], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(vy128[j], c3), vyold128[j])));
            uz128[j] = _mm_add_ps(uz128[j], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(vz128[j], c3), vzold128[j])));
        }
    }
};

void OptionA(int Iterations, int M, float delta_t) {
    Element *A;
    int alignment = 16;
    A = (Element*)_aligned_malloc(sizeof(Element)*M, alignment);
    for (int i = 0; i < M; i++) {
        A[i] = Element(1,1,1, 1,2,3, 4,5,6);
    }
    clock_t start = clock();
    for (int j = 0; j < Iterations; j++) {
        for (int i = 0; i < M; i++) {
            A[i].MoveAoS(delta_t);
        }
    }
    double time = double(clock() - start)/CLOCKS_PER_SEC;
    cout << A[0].u.m128_f32[0] << " " << A[0].u.m128_f32[1] << " " << A[0].u.m128_f32[2] << " " << time << endl;
    _aligned_free(A);
}

void OptionB(int Iterations, int M, float delta_t) {
    Element *A;
    int alignment = 16;
    A = (Element*)_aligned_malloc(sizeof(Element)*M, alignment);
    for (int i = 0; i < M; i++) {
        A[i] = Element(1,1,1, 1,2,3, 4,5,6);
    }
    clock_t start = clock();
    for (int j = 0; j < Iterations; j++) {
        for (int i = 0; i < M; i++) {
            A[i].MoveAoS(delta_t);
        }
    }
    double time = double(clock() - start)/CLOCKS_PER_SEC;
    cout << A[0].u.m128_f32[0] << " " << A[0].u.m128_f32[1] << " " << A[0].u.m128_f32[2] << " " << time << endl;
    _aligned_free(A);
}

void OptionC(int Iterations, int M, float delta_t) {
    ElementsSoA A(M);
    int alignment = 16;
    for (int i = 0; i < M; i++) {
        A.ux[i] = 1;
        A.uy[i] = 1;
        A.uz[i] = 1;
        A.vx[i] = 1;
        A.vy[i] = 2;
        A.vz[i] = 3;
        A.vxold[i] = 4;
        A.vyold[i] = 5;
        A.vzold[i] = 6;
    }
    clock_t start = clock();
    for (int j = 0; j < Iterations; j++) {
        A.MoveAoS(delta_t);
    }
    double time = double(clock() - start)/CLOCKS_PER_SEC;
    cout << A.ux[0] << " " << A.uy[0] << " " << A.uz[0] << " " << time << endl;
}

void OptionD(int Iterations, int M, float delta_t) {
    ElementsSoA A(M);
    int alignment = 16;
    for (int i = 0; i < M; i++) {
        A.ux[i] = 1;
        A.uy[i] = 1;
        A.uz[i] = 1;
        A.vx[i] = 1;
        A.vy[i] = 2;
        A.vz[i] = 3;
        A.vxold[i] = 4;
        A.vyold[i] = 5;
        A.vzold[i] = 6;
    }
    clock_t start = clock();
    for (int j = 0; j < Iterations; j++) {
        __m128 c5dt = _mm_set1_ps(0.5f * delta_t);
        __m128 c3 = _mm_set1_ps(3.0f);
        // The parallel_for does not work. Getting an unhandled exception at 0x00f91a75 in
        // SSE Third Test.exe: 0xC0000005: Access violation reading location 0x0071fa60.
        // parallel_for( blocked_range<size_t>(0, A.Mm128),
        //     [=](const blocked_range<size_t>& r) {
        //         for (size_t i = r.begin(); i != r.end(); ++i) {
        //             A.ux128[i] = _mm_add_ps(A.ux128[i], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(A.vx128[i], c3), A.vxold128[i])));
        //             A.uy128[i] = _mm_add_ps(A.uy128[i], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(A.vy128[i], c3), A.vyold128[i])));
        //             A.uz128[i] = _mm_add_ps(A.uz128[i], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(A.vz128[i], c3), A.vzold128[i])));
        //         }
        //     });
        for (int i = 0; i < A.Mm128; i++) {
            A.ux128[i] = _mm_add_ps(A.ux128[i], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(A.vx128[i], c3), A.vxold128[i])));
            A.uy128[i] = _mm_add_ps(A.uy128[i], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(A.vy128[i], c3), A.vyold128[i])));
            A.uz128[i] = _mm_add_ps(A.uz128[i], _mm_mul_ps(c5dt, _mm_sub_ps(_mm_mul_ps(A.vz128[i], c3), A.vzold128[i])));
        }
    }
    double time = double(clock() - start)/CLOCKS_PER_SEC;
    cout << A.ux[0] << " " << A.uy[0] << " " << A.uz[0] << " " << time << endl;
}

int _tmain(int argc, _TCHAR* argv[])
{
    int Iterations = 10000;
    int M = 125000; // number of elements
    float delta_t = 0.01f;
    OptionA(Iterations, M, delta_t);
    OptionB(Iterations, M, delta_t);
    OptionC(Iterations, M, delta_t);
    OptionD(Iterations, M, delta_t);
    char c;
    cin >> c;
    return 0;
}

TimP

Black Belt

10-05-2009 03:57 PM

Quoting - lascondes

1. tim8: "On both gcc and icc (but not MSVC), _mm_set_ps will optimize with either SSE2 or SSE4 instructions, according to the architecture set on command line. "

What changes between SSE2 and SSE4 for the _mm_set_ps command?

bronxzv

New Contributor II

10-05-2009 05:11 PM

>What, if anything, do I need to do so the i7 processor issues the mul and add in parallel?

You have nothing special to do. By the way, your 'delta_t' is in fact a loop invariant, which I missed, so your version is in fact perfectly mul/add balanced.

An interesting experiment would be to watch your timings after replacing the mulps with addps, or the other way around (i.e., all addps/subps or all mulps). Even though the computations will have no practical meaning, you will get an idea of the gain from balanced add/mul.

areid2

New Contributor I

10-06-2009 09:13 PM

lascondes

Beginner

10-06-2009 03:20 AM

Quoting - areid

My understanding is that task_scheduler_init should be called automatically the first time I do anything that requires the threads. In this case it should be called when the first parallel_for() loop is hit. If I want something other than the default behavior - i.e., more or fewer threads than I have processors - then I need to call the function explicitly before any of the parallel_for loops.

The loop appears to run for a short while before throwing the error.

Can the SSE functions be called in parallel? On an i7 chip, how many SSE functions can be called at the same time? I am running on an i7 with 4 cores, each hyperthreaded. Does this mean I can handle 1, 4, or 8 SSE calls at the same time? Any links to somewhere this is discussed would be appreciated.

Regards,

Las Condes

areid2

New Contributor I

10-06-2009 11:24 AM

Quoting - lascondes

My understanding is the task_scheduler_init should call automatically the first time I do anything that would require the threads.

That is the behavior in TBB 2.2, but not in previous versions. I don't know what version you are using.

There is no problem CALLING multiple SSE functions at the same time in different threads; in fact, that is recommended to get the best performance. You can call as many SSE functions as you like in your program. If you had 100 threads, they could all be calling SSE instructions at the same time (although I don't recommend this for performance). The Core i7 processor will only be able to EXECUTE between 4 and 8 SSE instructions in parallel. The reason I say at least 4 is that each core can process SSE instructions independently (just like any other instruction). The reason it can be up to 8 is that each core is hyperthreaded. That means it can process 2 streams of instructions (hardware threads) at the same time. If the 2 threads on an HT core are using different processing units (integer, floating point, etc.), then it can actually carry out both instructions in parallel. In your case, since you are using mostly floating-point instructions, you probably won't see a big gain when using 8 threads over 4, but you will probably still see a small gain. This is probably because the load/store and memory address calculations can occur in one thread on an HT core while the other thread is doing some floating point.

Using as many threads as there are logical CPUs is a good general rule. If you really care about a few percent difference in performance, you can experiment with fewer threads, but don't expect a big difference. The disadvantage of hard-coding the number of threads is that if you run your app on a different machine, you might need to recompile to adjust the number of threads.

lascondes

Beginner

10-06-2009 02:18 PM

Quoting - areid

That is the behavior in TBB 2.2, but not in previous versions. [...]

Hi areid,

Good news in your post. I am using TBB 2.2. I'll spend some more time looking at my code (the parallel_for loop). It runs for a few cycles and then generates the read error on a Core 2 laptop. I'll try to work on it some tonight on my i7 computer.

lascondes


For more complete information about compiler optimizations, see our Optimization Notice.