Measurements and task_scheduler_init

Martin_K_9 · ‎02-03-2013

Hi

I am doing some measurement with TBB. I noticed that the first call to tbb::parallel_for is slow. I thought, this might come from the task scheduler initialization that is executed the first time a tbb algorithm is called. And yes it had an influence but I don't realy understand the behaviour.

Inserting

[cpp]task_scheduler_init();[/cpp]

before the first call to tbb::parallel_for is made reduces the time needed for the first call to tbb::parallel_for.

However when I insert:

[cpp]task_scheduler_init init;[/cpp]

the time for the first call to tbb::parallel_for is even shorter but I couldn't get any speedup during the measurement (further calls to tbb::parallel_for)

Is this behavior explainable and what should I call before the measurements to get meaningful results. The used test file is attached.

Thanks for any hints

tzannes · ‎02-04-2013

In the code you included, it looks like you forgot to set "after" after the parallel_for, so it's using the value you set it to after the sequential for.

RafSchietekat · ‎02-04-2013

TBB is about lightweight parallelism with tasks partly because creating and destroying threads takes time (even without additional overhead of preemptive scheduling by the kernel), so it stands to reason that starting up the scheduler takes time that you will want to amortise over the lifetime of an application, explicitly or implicitly. Don't use "task_scheduler_init();", which either does nothing or starts up and immediately shuts down the scheduler; I don't know how it can reduce the time needed for the first subsequent parallel_for and by how much, but that doesn't seem very relevant. I don't see what you mean by "speedup during the measurement", though: isn't it a good thing to have full parallel performance from the very first parallel_for?

(Added after seeing intervening posting) Without updating "after", you should have noticed some rather conspicuous timing results (growing negative intervals), not quite covered by "I couldn't get any speedup"...

Martin_K_9 · ‎02-04-2013

yes the updateing "after" was missing because I copied the code from the meaturement and changed it afterwards. Sorry for that.

To clarify what I mean with "speedup during the measurement" I post my results. First without an explicit call to init task scheduling:

--------------------

Data size = 1000M
serial simple kernel done, time: 0.000001
parallel simple kernel done, time: 0.004401

Data size = 10000M
serial simple kernel done, time: 0.000010
parallel simple kernel done, time: 0.000029

Data size = 100000M
serial simple kernel done, time: 0.000126
parallel simple kernel done, time: 0.000077

Data size = 1000000M
serial simple kernel done, time: 0.001266
parallel simple kernel done, time: 0.000793

Data size = 10000000M
serial simple kernel done, time: 0.007043
parallel simple kernel done, time: 0.004931

Data size = 100000000M
serial simple kernel done, time: 0.071461
parallel simple kernel done, time: 0.044711

Calculation finished.

--------

And now with the task scheduling initialized: "task_scheduler_init init;"

------------------

Data size = 1000M
serial simple kernel done, time: 0.000001
parallel simple kernel done, time: 0.000027

Data size = 10000M
serial simple kernel done, time: 0.000009
parallel simple kernel done, time: 0.000028

Data size = 100000M
serial simple kernel done, time: 0.000095
parallel simple kernel done, time: 0.000132

Data size = 1000000M
serial simple kernel done, time: 0.000832
parallel simple kernel done, time: 0.001100

Data size = 10000000M
serial simple kernel done, time: 0.007140
parallel simple kernel done, time: 0.010731

Data size = 100000000M
serial simple kernel done, time: 0.071809
parallel simple kernel done, time: 0.098906

Calculation finished.

------------------

While the first measurement shows a speed up of the parallel version if the problem size if bigger than 10000. In the second measurement there is never a speed up for the parallel version. Do I something wrong?

RafSchietekat · ‎02-04-2013

Strange...

Martin_K_9 · ‎02-04-2013

I have some new findings.

The measurement as shown was done on Windows 7 with the Microsoft Visual C++ compiler and TBB 4.1.

The execution of the measurement code on Ubuntu 12.04 64 Bit, with gcc 4.7.2 and TBB 4.1 doesn't show the strange behavior. Here are these test results with "task_scheduler_init init;":

----------
Data size = 1000M
serial simple kernel done, time: 0.000001
parallel simple kernel done, time: 0.000173

Data size = 10000M
serial simple kernel done, time: 0.000008
parallel simple kernel done, time: 0.000069

Data size = 100000M
serial simple kernel done, time: 0.000074
parallel simple kernel done, time: 0.000307

Data size = 1000000M
serial simple kernel done, time: 0.000751
parallel simple kernel done, time: 0.000631

Data size = 10000000M
serial simple kernel done, time: 0.008356
parallel simple kernel done, time: 0.004282

Data size = 100000000M
serial simple kernel done, time: 0.083278
parallel simple kernel done, time: 0.044280

Calculation finished.
---------------

Btw. This test is compiled with -O3 compiler option and is running on an Intel® Xeon(R) CPU E5520 @ 2.27GHz × 16 (16 cores)

Anton_M_Intel · ‎02-05-2013

Martin, your kernels are very small are they supposed to run repeatedly? You may want to set a barrier which waits when all the threads are up&ready before the measurment starts to get an idea how it could work in a real aplication.

RafSchietekat · ‎02-05-2013

Shouldn't all threads have started at least by the time the last parallel_for starts, even on Windows?

Anton_M_Intel · ‎02-06-2013

Raf Schietekat wrote:
Shouldn't all threads have started at least by the time the last parallel_for starts, even on Windows?

Depends on time of the last parallel_for... but still not guaranteed of course :)

It is also a good idea to track min, med, max times of each parallel_for run or their destribution.

RafSchietekat · ‎02-06-2013

So how long would it be before you can be certain that all threads are up and running? I never suspected it could be more than a few milliseconds (see the Linux test), but on the Windows test apparently more than 10 ms have already passed by the time the last parallel_for() starts!

And what then could explain the difference with the implicit initialisation?

A loop around the whole set of parallel_for calls for a total time of a full second may provide more information. Or whatever time is needed...

Anton_M_Intel · ‎02-06-2013

On certain platforms and conditions (e.g. under a profiler/debugger) it can take up to few seconds. In my exeriments I always rely on explicit barrier rather than on OS timings. I'm not sure how creating and destroying a task_scheduler_init object before using autoinitialization could affect performance... it might be that there is some additional latencies which give more time for threads to start and since the benchmark in non-deterministic in this regard, it affects the results.

RafSchietekat · ‎02-06-2013

A few seconds... that's like an eternity! Could you "out" those platforms and conditions, so we can stay at a safe distance? :-)

Martin_K_9 · ‎02-08-2013

the kernels were just an experiment. Thanks for the additional information. How would you insert a barrier to wait for all threads to be up and ready?

RafSchietekat · ‎02-08-2013

I was also wondering about a barrier in this context (TBB user code, simple and portable)... maybe a parallel_for with the right number of elements, using a (non-default) simple_partitioner with (default) grainsize 1, and busy-waiting for an atomic to fall back to zero?

But did you get any more information yet, by running the experiment repeatedly for a few seconds?

Martin_K_9 · ‎02-08-2013

I just did this test. For the barrier I've inserted the following code:

[cpp]

//waiting for all worker threads up and running
int elementsToProcess = 100000;
std::vector<int> initVector(elementsToProcess);
tbb::atomic<int> wait;
wait = 1;
while (wait != 0) {
tbb::parallel_for(tbb::blocked_range<int>(0, elementsToProcess, 1),
[&wait] (tbb::blocked_range<int> b) {
wait = 0;
}, tbb::simple_partitioner());
}

[/cpp]

With this code at the beginning, an explicit task_scheduler_init and repeating the test 4 times (runs) result in a total running time of about 9 seconds. During these 9 seconds no parallel version is ever faster than the serial version. The parallel version does also not improve over the runs.

The same configuration but with implicit task_scheduler_init (in this case by the parallel_for in the barrier) the parallel version improves of about 40% in every run for the biggest data size.

I do the tests on Windows using Eclipse CDT with MSVC Compiler. For setting up the environment for the compilation I use the Visual Studio Command prompt.

RafSchietekat · ‎02-08-2013

I meant something like:

[cpp]
//waiting for all worker threads up and running
const int nThreads =
  task_scheduler_init::default_num_threads();
atomic<int> wait = make_atomic(nThreads);
parallel_for(
  blocked_range<int>(0, nThreads, 1),
  [&wait] (blocked_range<int> b) {
      --wait;
      while(wait) {
          // maybe pause
      }
  }, simple_partitioner());
[/cpp]

(Edited 2013-02-09) Non-essential stylistic changes. So what's the timing now?

(Added 2013-02-10) It should be interesting to include timings for this poor man's startup barrier (on both platforms).

Martin_K_9 · ‎02-10-2013

Ok I see what you meant now.

I am not able to do the test today and tomorrow but I will do them as soon as I am back and have both platforms available.

Martin_K_9 · ‎02-12-2013

Okay at did the tests on both platforms with the following results:

- Ubuntu - 16 cores

The behaviour with the barrier in place at the beginning is equal for the explicit and the implicit taks scheduler init. The parallel version starts to outperform the serial version if the data size is around 1 mio. The tests are done on a 16 core machine.

- Windows - 4 cores

With the explicit task scheduler init the measurements do not even start. The programm loops in the barrier. My interpretation is, that there is never a worker thread available that steals tasks to reach parallelism. The worker thread that split the range and created the task runs the latest split task and loops in the busy wait as the tasks scheduling is non preemtive. Why there is not another worker thread that steals a task is strange to me. The task scheduler is initialized with 4 threads.

The measurements with an implicit task scheduler init (parallel_for of barrier) shows the expected behaviour with a parallel version starting to outperform the serial one if the data size is bigger than about 100000.

RafSchietekat · ‎02-12-2013

You couldn't give it a few more minutes, could you? :-)

At least this kind of failure should be easier to diagnose than something that's intermittent.

Martin_K_9 · ‎02-15-2013

I might find some time at the weekend. But I think I have to dive into the TBB source code that is not (yet) familiar to me. I'll keep you informed.

RafSchietekat · ‎02-16-2013

I think that's more something for the TBB development team (first to confirm your findings and then to remedy them). But don't let me dissuade you if you think you would like tackling that more mundane yet often quite delicate side of the code (see previously reported shutdown issues), and good luck!

You should probably start instead by comparing earlier TBB versions (to home in on when this started and have two close versions to compare) and/or different versions of your environment (perhaps less likely, but...), which should help a great deal in further analysis (don't delve into the source code without this knowledge and assurance or without a good graphical difference analyser!). Make sure to report the configuration in enough detail (earlier you wrote that you tested TBB 4.1, but have you tried update 2, the most recent "stable" version, and what was the exact name of the file from which you installed?), and also have a look whether its tests and examples indicate any problems.

Also, as far as timing goes (once you get past the barrier), try to avoid the iterators in the loop and instead use std::vector::data() and pointer arithmetic, which may avoid any overhead from boundary checking (see an earlier thread, I forgot the details). And I don't like that you only get about 2 times speedup with so much available parallelism (E5520 is documented as having 4 cores and 8 hardware threads, so do you have 2 or really 4 sockets?), although you might get better scaling with a more CPU-intensive operation per element (to at least get that out of the picture).