Affinity

H__Kamil · ‎02-07-2015

Hello. I have started working with Intel TBB. Is it possible to set thread affinity in Intel TBB? I want to run one thread per core. I wrote some sample test code, but it doesn't work correctly.

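#include <iostream>
#include <sched.h>   // sched_getcpu() (Linux/glibc)
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
#include "tbb/partitioner.h"
#include "tbb/task_scheduler_init.h"

using namespace std;

class Test {
  public:
    // Prints the CPU that runs this chunk and the bounds of the chunk.
    void operator()(const tbb::blocked_range<int>& range) const {
      cerr << "CPU: " << sched_getcpu() << " this: " << this
           << ", begin: " << range.begin() << ", end: " << range.end() << " " << endl;
    }
};

int main()
{
    cout << endl << endl;
    cout << "***** Intel TBB *****" << endl;

    tbb::affinity_partitioner ap;
    tbb::task_scheduler_init init(8);  // request 8 threads
    Test test;

    // Grainsize 100 over [0, 800) is meant to yield 8 chunks.
    tbb::parallel_for(tbb::blocked_range<int>(0, 800, 100), test, ap);

    return 0;
}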

When I run it, I get the following result:

CPU: 16 this: 0x7f2ff6d1fd58, begin: 0, end: 100 
CPU: 16 this: 0x7f2ff6d1f358, begin: 100, end: 200 
CPU: 16 this: 0x7f2ff6d1f658, begin: 200, end: 300 
CPU: 16 this: 0x7f2ff6d1f358, begin: 300, end: 400 
CPU: 16 this: 0x7f2ff6d1f958, begin: 400, end: 500 
CPU: 16 this: 0x7f2ff6d1f158, begin: 500, end: 600 
CPU: 16 this: 0x7f2ff6d1f458, begin: 600, end: 700 
CPU: 16 this: 0x7f2ff6d1f158, begin: 700, end: 800

What is the problem? Thanks a lot.

RafSchietekat · ‎02-07-2015

Don't try to create one chunk per hardware thread (by abusing grainsize); that's the OpenMP way. TBB likes parallel slack, so what you should probably do here is not specify a grainsize (use the default of 1) and not specify a partitioner (use the default auto_partitioner).

affinity_partitioner is relevant only if you have multiple loops over the same Range, and you want it to be divided the same way each time, with corresponding chunks executed preferentially on the same hardware thread.

H__Kamil · ‎02-08-2015

OK, thanks for the reply. So I understand that I should do it this way:

tbb::parallel_for(tbb::blocked_range<int>(0, 800), test, tbb::auto_partitioner());

What if I would like to unroll a loop? For example, I implemented an algorithm with two loops, and I would like to unroll the outer loop (unroll(4)), which is the one that is parallelized. How should I do it correctly?

RafSchietekat · ‎02-09-2015

The default partitioner is auto_partitioner; you don't have to mention it explicitly.

Just to avoid any misunderstanding, affinity_partitioner is relevant for successive loops over the same Range (in case you can't do it in just one loop, I suppose).

That said, for nested loops you would typically make only the outer loop parallel. You might then decide to explicitly vectorise the inner loop, or at least give the compiler every chance to optimise it as well as it can (sometimes by unrolling, sometimes even by auto-vectorising) by "hoisting" Range::end() out of the loop: assign it to a variable at the start of the loop and compare with that variable. It's not clear to me in which environments this is most relevant, though.

I don't know what you mean by unroll(4) or why you would want to do that to the outer loop.