intelbenz
Beginner

Does TBB parallel_for work with SSE2 subroutines?

As I understand it, parallel_for simply decomposes the loop iterations across several threads.

My question is: can I run SSE2-coded subroutines in parallel?

A dual processor has only two sets of 128-bit SSE2 registers. Does that mean I can only run two threads at a time?

I have written a simple parallel_for with an SSE2 function embedded. So far, the program runs nicely.

I wonder if my code is unsafe (a race condition), since the number of SSE registers is limited.

#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/task_scheduler_init.h"
#include <stdio.h>
#include <stdlib.h>
//#include "mm_malloc.h"
#include "emmintrin.h"
#include "xmmintrin.h"

//functor

using namespace tbb;

struct test
{
int *const my_a;
int *const my_b;
int * my_c;

public:

void Add_SSE2(int *a, int *b, int *c) const
{
_mm_store_si128(
(__m128i *) ((int *) &c[0]),
_mm_add_epi32(*(__m128i *)((int*) &a[0]), *(__m128i *)((int*) &b[0]))
);
}

void foo(int i) const
{
printf("%d\n", i);
}


void operator()(const blocked_range<int>& range) const
{
int *a = my_a;
int *b = my_b;
int *c = my_c;

for(int i=range.begin(); i<range.end(); i+=4)
{
Add_SSE2(&a[i], &b[i], &c[i]);
//foo(i);
}
}

test(int *a, int *b, int *c) :
my_a(a),
my_b(b),
my_c(c)
{}
};


#define GRAINSIZE 10000

void ParallelFor_test(int *a, int *b, int *c, size_t n)
{
parallel_for(blocked_range<int>(0, (int)n, GRAINSIZE), test(a, b, c));
}

void
ChkResult(int *c, size_t n)
{
for(int i=0; i<(int)n; i++)
{
printf("%d> %d\n", i, c[i]);
}
}


#define ASIZE 1024

int
main(int argc, char **argv)
{
int *a = (int *) _mm_malloc(ASIZE*sizeof(int), 16);
int *b = (int *) _mm_malloc(ASIZE*sizeof(int), 16);
int *c = (int *) _mm_malloc(ASIZE*sizeof(int), 16);

for(int i=0; i<ASIZE; i++)
{
a[i] = i;
b[i] = i;
c[i] = 0;
}

task_scheduler_init init;
ParallelFor_test(a, b, c, ASIZE);
ChkResult(c, ASIZE);


_mm_free(a);
_mm_free(b);
_mm_free(c);

return 0;
}

11 Replies
Dmitry_Vyukov
Valued Contributor I

Quoting - intelbenz

As I understand it, parallel_for simply decomposes the loop iterations across several threads.

My question is: can I run SSE2-coded subroutines in parallel?

A dual processor has only two sets of 128-bit SSE2 registers. Does that mean I can only run two threads at a time?

I have written a simple parallel_for with an SSE2 function embedded. So far, the program runs nicely.

I wonder if my code is unsafe (a race condition), since the number of SSE registers is limited.

A dual processor also has only two sets of general-purpose registers, but it's OK to have more than two threads in a system. So it's OK to have more than two threads using SSE2 as well.

A thread context switch saves and restores all general-purpose registers as well as the SSE registers, so to the end user the system looks as if it has N completely independent processors (where N is the number of user threads), each with its own register file. You should base your reasoning on this model, not on the hardware model.

AJ13
New Contributor I

Hi,

You mentioned that parallel_for will break the work into threads; this is not accurate. The work is broken into tasks, which are then mapped to worker threads. There is a single worker thread per core; these are created for you when you initialize TBB's scheduler.

Here's how I understand things working: your tasks are executed one at a time by worker threads, and each task has the thread, and hence the processor, to itself while it executes. So yes, using SSE should be quite safe (I intend to do something like this very soon myself).

This assumes, of course, that there is a 1-1 mapping between worker threads and cores (which should be the case unless you override the default), and that the operating system schedules each thread on its own core.

AJ

Dmitry_Vyukov
Valued Contributor I

This assumes, of course, that there is a 1-1 mapping between worker threads and cores (which should be the case unless you override the default), and that the operating system schedules each thread on its own core.

A dual-core processor also has only two sets of general-purpose registers. Why would SSE be different?

Alexey_K_Intel3
Employee

I need to say in advance that I am not experienced with using SSE.
However, from my general knowledge I agree with Dmitriy. There might be more software threads than available cores, but at any given moment just one thread executes on a core, so there is no fight over SSE or any other processor resources. Well, if your processor is hyper-threaded, it appears as two or more logical processors to the OS (and thus to applications) and some processor resources are shared; but sharing between hyper-threads is managed by the processor itself, and in particular each hyper-thread has its own set of registers, I believe.
To keep it short and not bother you with low-level details, I believe your code is safe and has no races. You might as well check this: just extend ChkResult to perform the same computation serially with scalar instructions and compare the results. Something like:
[cpp]void
ChkResult(const int *c, const int a[], const int , size_t n)
{
    for(int i=0; i<(int)n; i++) {
        if(c[i] != a[i]+b[i])
            printf("ERROR: %d> result is %d, should be %d\n", i, c[i], a[i]+b[i]);
    }
}
[/cpp]
Alexey_K_Intel3
Employee

My code above contains at least one syntax error, and it seems the new forum does not allow editing posts, so I can't fix it there :(. Let me rewrite it a little:
[cpp]void
ChkResult(const int *c, const int* a, const int* b, size_t n)
{
    for(int i=0; i<(int)n; i++) {
        if(c[i] != a[i]+b[i])
            printf("ERROR: %d> result is %d, should be %d\n", i, c[i], a[i]+b[i]);
    }
}
[/cpp]
RafSchietekat
Black Belt

Just a quick thought (from memory; no time to go and check myself now): SSE instructions may behave differently with regard to memory semantics (something to do with temporal vs. non-temporal stores), so you may have to bring your own fences when working with tasks. For example, a spawned task may write something to memory, and a parent task may read it again; but if stealing is involved, the read may see an old value.

Dmitry_Vyukov
Valued Contributor I

Quoting - raf_schietekat

Just a quick thought (from memory; no time to go and check myself now): SSE instructions may behave differently with regard to memory semantics (something to do with temporal vs. non-temporal stores), so you may have to bring your own fences when working with tasks. For example, a spawned task may write something to memory, and a parent task may read it again; but if stealing is involved, the read may see an old value.

Here is that topic:

http://software.intel.com/en-us/forums/showthread.php?t=58670

(search by 'non-temporal')

At least for now, no fences are needed. I rechecked the scheduler code one more time. The mailbox functionality also 'respects' non-temporal stores, i.e. it executes a locked instruction before making a task available to another thread.

RafSchietekat
Black Belt

Quoting - Dmitriy Vyukov

At least for now, no fences are needed. I rechecked the scheduler code one more time. The mailbox functionality also 'respects' non-temporal stores, i.e. it executes a locked instruction before making a task available to another thread.

Well, threads have to provide fences because they are so impolite as to barge in on another thread's use of the processor without so much as a warning. Task boundaries are visible to the user, and because tasks are supposed to be as light as possible, it seems appropriate to leave the responsibility for non-C++ fences, properly documented of course, with the user. No program should rely on the current implementation of TBB in this regard, and should at most disable its own fences with a preprocessor switch or something, ready to be reactivated at the slightest sign of trouble. (BTW, I would like to repeat an earlier question asking for more clarity from TBB on memory-semantics issues regarding tasks and concurrent data structures etc., as Java has done.)

Dmitry_Vyukov
Valued Contributor I

Quoting - raf_schietekat

Well, threads have to provide fences because they are so impolite as to barge in on another thread's use of the processor without so much as a warning. Task boundaries are visible to the user, and because tasks are supposed to be as light as possible, it seems appropriate to leave the responsibility for non-C++ fences, properly documented of course, with the user. No program should rely on the current implementation of TBB in this regard

This is exactly what I was proposing:

Is the use of non-temporal stores prohibited in tbb::task::execute() by the TBB documentation? ;)
Are you considering the case where I use movntq to store data for a child task? It seems that for now it will work as expected.

But I didn't fully understand Arch's answer:

Non-temporal stores can safely be used in task::execute, but the reasoning is subtle. As Dmitriy points out, the store that releases a task pool is not a strong enough fence to stop a non-temporal store from migrating from before to after it. However, when TBB publishes a task, it first does a "lock cmpxchg" instruction to acquire the task pool. A non-temporal store cannot migrate past a locked instruction, hence we are safe. The above argument is peculiar to x86 semantics, but so are non-temporal stores, so the argument seems fair.

The question is whether non-temporal stores merely happen to be safe in tasks for now, or whether this is guaranteed behavior that all developers are aware of and will not break.

I've examined the latest source, and the mailboxes don't break that guarantee. But some optimizations and corner-cuts could conflict with it. So I also think it's better to officially reject that guarantee.

ARCH_R_Intel
Employee

Quoting - Dmitriy Vyukov

This is exactly what I was proposing:

Is the use of non-temporal stores prohibited in tbb::task::execute() by the TBB documentation? ;)
Are you considering the case where I use movntq to store data for a child task? It seems that for now it will work as expected.
The question is whether non-temporal stores merely happen to be safe in tasks for now, or whether this is guaranteed behavior that all developers are aware of and will not break.

I've examined the latest source, and the mailboxes don't break that guarantee. But some optimizations and corner-cuts could conflict with it. So I also think it's better to officially reject that guarantee.


What sort of corner-cutting or optimization do you have in mind? I think TBB has to guarantee that when a task is stolen, the consumer of the task reads a valid snapshot of the task as the producer wrote it. So at a minimum, the producer has to execute a release fence (in theory delayable to the point where the task is stolen) and the thief has to execute an acquire fence. Likewise for "implicit steals", where a task decrements a reference count and, if it reaches zero, picks up the successor task.

Dmitry_Vyukov
Valued Contributor I

Quoting - Arch Robison

What sort of corner-cutting or optimization do you have in mind? I think TBB has to guarantee that when a task is stolen, the consumer of the task reads a valid snapshot of the task as the producer wrote it. So at a minimum, the producer has to execute a release fence (in theory delayable to the point where the task is stolen) and the thief has to execute an acquire fence.

Undoubtedly. But a release fence doesn't affect non-temporal stores. So the question is whether spawn() has to execute only a release fence (a plain store on x86) or a full-fledged locked instruction (or mfence or sfence).

I mean optimizations like the work-stealing deque that executes only a release fence in push() (no atomic RMW in push):

http://www.cs.bgu.ac.il/~hendlerd/papers/dynamic-size-deque.pdf

Or task passing based on single-producer/single-consumer queues (which also don't execute an atomic RMW in push()).
