I'm going crazy with parallel_for in TBB 3 under Linux.
I have a (more or less) simple matrix calculation with similar operations for each element, so I thought it would be easy to parallelize and expected good scaling. It works fine on my 6-core machine, but when I run the code on a 32-core system (MTL) the performance is similar to that of my 6-core machine.
I tried to check the load balancing with the top command from the shell and observed that the cores/processors are not fully utilized over time. The number of busy cores varies strongly, like 1-7-30-7-11-30-..., whereas it should be 32 all the time (there is no other load except the OS).
I got mad at the partitioning inside TBB. There is too much "magic inside" for me. If I have 1000 columns
in my matrix and 32 processors, I need 32 chunks with 1000/32 items each. But the recursive splitting does something else, and I observed that I get both large chunks and very small ones. Why?
Does the partitioning work like this? (I don't think so; just for double-checking purposes.)
1000 --> split 2x500
Assign 500 to the first thread and continue to split the other 500.
500 --> split to 2x250
Assign 250 to the second thread and continue to split the other 250.
250 --> split to 2x125
and so on...
In this case the first thread has 50% of the workload, the second thread 25%, and so on. This would at least explain my observations, since the threads with small chunks have to wait for the threads with large chunks.
I tried to modify the grainsize and also the {simple|auto|affinity}_partitioner, without any success.
- How can I force parallel_for to create P similar chunks of size N/P and distribute them over P threads, where P is the number of available cores/processors and N the number of columns/rows of my matrix?
- Is there a better way to observe the load balancing than using top (without using Intel's complete tool chain)?
- Has anybody observed a similar "bad-scaling" phenomenon with parallel_for?
Any other hints would also be helpful!
Thank you.
Best regards,
Michael
(Added) To convince yourself that you won't have big chunks, just log their sizes; you probably won't see anything larger than 2 with auto_partitioner (which starts with a number of chunks equal to a multiple of the number of workers and may chop them up further later on). Having 32 chunks with 1000/32 items would actually provide less opportunity for work balancing than what the partitioners do now. This is where TBB normally shines, so you're far more likely to find a problem elsewhere.
(Added) I don't really know if it would be false sharing; that's just the first thing I thought of when you mentioned going over the columns. If you have a large number of rows, maybe it's thrashing (or was it another term?) in the cache, with only part of each cache line used per iteration, but it has the same remedy. I'll leave it to others now.
1000 iterations / 32 threads =
1000/32 (parallel work time) + time to launch 31 tasks
1000 iterations / 6 threads =
1000/6 (parallel work time) + time to launch 5 tasks
When your work time is relatively short, the amount of time to launch all the tasks may become a large portion of the process time.
How much work is being done per iteration?
Try structuring a proof test where each iteration takes ~0.1 second of compute time. IOW, the serial test runs ~100 seconds. Run the same test on your 6-core system and on the MTL system.
Also get the serial run times so that you can compute the scaling relative to the serial time on the respective systems.
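A minimal sketch of such a proof test might look like the following; the busy-work kernel and its calibration constant are assumptions standing in for the real per-element computation, not Jim's code:
[cpp]
#include <cstdio>
#include <cmath>
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/tick_count.h"

// Busy-work kernel: tune KERNEL_REPS so one call takes ~0.1 s on your machine.
const long KERNEL_REPS = 50000000L;

double work(int i) {
    double x = i;
    for (long k = 0; k < KERNEL_REPS; ++k)
        x = std::sqrt(x + 1.0);          // dependent chain, hard to optimize away
    return x;
}

const int N = 1000;
double result[N];

struct Body {
    void operator()(const tbb::blocked_range<int>& r) const {
        for (int i = r.begin(); i != r.end(); ++i)
            result[i] = work(i);
    }
};

int main() {
    tbb::tick_count t0 = tbb::tick_count::now();
    for (int i = 0; i < N; ++i)          // serial baseline, ~100 s if calibrated
        result[i] = work(i);
    tbb::tick_count t1 = tbb::tick_count::now();

    tbb::parallel_for(tbb::blocked_range<int>(0, N), Body());
    tbb::tick_count t2 = tbb::tick_count::now();

    std::printf("serial %.1f s, parallel %.1f s, speedup %.2f\n",
                (t1 - t0).seconds(), (t2 - t1).seconds(),
                (t1 - t0).seconds() / (t2 - t1).seconds());
    return 0;
}
[/cpp]
With ~0.1 s of work per iteration, scheduling overhead is negligible, so the reported speedup on each system isolates how well the work is actually being spread across the cores.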
Jim Dempsey
For example, if columns are single precision, then a 64-byte cache line spans 16 columns. 1000 columns partitioned across 32 cores is 31.25 columns per core. So threads updating different columns could frequently hit on the same cache line.
I recommend using a 2D partitioning across both the rows and columns of the output, and a fixed block size. Suppose the matrix is an MxN matrix.
- Use tbb::parallel_for( tbb::blocked_range2d<size_t>(0,M,100,0,N,100), ..., simple_partitioner() ). For a 1000x1000 matrix, that will create at least 100 separate tasks, each of which will compute a 100x100 submatrix or smaller. The submatrices will be at least 50x50, so that's at least 250,000 floating-point operations, which should be plenty to amortize scheduling overheads.
- Accumulate the submatrix in a local temporary and then copy it out to the global array. That will avoid false sharing in the inner loops. Because simple_partitioner is being used in conjunction with an explicit grain size of 100x100, the local submatrix will not exceed 100x100. So it can be declared as a fixed size 100x100 array. It might even be possible to get away with declaring just a row of it (see next bullet).
- Accumulate the submatrix in a way that reduces consumption of memory bandwidth. 32 cores all pulling from main memory can easily outrun the memory subsystem. For example, suppose the local computation is local[i][j] += a[i][k]*b[k][j]. I'd make j the innermost loop and i the outermost loop. That way the bandwidth pressure is only on b[k][j], which ranges over a 1000x100 submatrix. Even for double precision, that will fit in outer-level cache on most modern machines, and so only has to be fetched once. If i is the outermost loop, then local can be declared as a one-dimensional array [100] instead of [100][100], and copied out after each row of the conceptual "local" is computed. Never use k as the inner loop, because that introduces a loop-carried dependence on the innermost loop that slows down the hardware.
- If the matrix might be much larger, block over the k loop too, i.e., accumulate the local submatrix as the sum of products of approximately 100x100 matrices. That way the inner three loops operate on three matrices (two inputs and one output) that are only 100x100, which all easily fit in outer-level cache. If you go really crazy with blocking, you can get essentially everything that counts into L1 cache or registers for the innermost loops. (A sketch of this blocked approach follows below.)
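For concreteness, here is a minimal sketch of that scheme, assuming a plain row-major float matrix product C = A*B with the sizes, block size, and names chosen only for illustration (this is not code from Arch's post):
[cpp]
#include "tbb/parallel_for.h"
#include "tbb/blocked_range2d.h"
#include "tbb/partitioner.h"

// Assumed sizes: C (MxN) = A (MxK) * B (KxN), row-major floats, block size 100.
const size_t M = 1000, K = 1000, N = 1000, BS = 100;

struct MultiplyBody {
    const float* a; const float* b; float* c;
    void operator()(const tbb::blocked_range2d<size_t>& r) const {
        // Local block, at most BSxBS because of the grainsize + simple_partitioner:
        // accumulating here avoids false sharing on c and keeps the working set in cache.
        float local[BS][BS] = {};
        const size_t i0 = r.rows().begin(), j0 = r.cols().begin();
        for (size_t kk = 0; kk < K; kk += BS)                          // block over k as well
            for (size_t i = r.rows().begin(); i < r.rows().end(); ++i)
                for (size_t k = kk; k < kk + BS && k < K; ++k) {
                    const float aik = a[i*K + k];
                    for (size_t j = r.cols().begin(); j < r.cols().end(); ++j)  // j innermost
                        local[i - i0][j - j0] += aik * b[k*N + j];
                }
        // Copy the finished block out to the shared result once.
        for (size_t i = r.rows().begin(); i < r.rows().end(); ++i)
            for (size_t j = r.cols().begin(); j < r.cols().end(); ++j)
                c[i*N + j] = local[i - i0][j - j0];
    }
};

void multiply(const float* a, const float* b, float* c) {
    MultiplyBody body = { a, b, c };
    tbb::parallel_for(tbb::blocked_range2d<size_t>(0, M, BS, 0, N, BS),
                      body, tbb::simple_partitioner());
}
[/cpp]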
Sorry for the confusion: I wrote columns but I meant rows. I have already arranged the algorithm in a way that tries to minimize negative cache effects.
Michael
Raf,
My point, as clearly stated, was to:
Take the iteration space and divide it by the number of threads.
I did not say:
Take the iteration space and divide it into n partitions (tasks or non-task aggregations).
Therefore, the grainsize has nothing to do with the number of threads.
When the thread pool, sans the current thread, is sleeping, the TBB library must communicate with the operating system to wake up these threads (set an event on Windows, or a condition variable on Linux (Pthreads)). This communication adds overhead, whether it is done sequentially by the instigating thread or in a cascade as the additional threads join the task. This communication overhead is independent of the TBB library overhead as its thread pool varies in size.
When the thread pool, sans the current thread, is busy (running something else), then the overhead becomes an "it depends" type of situation; however, the more threads in the TBB thread pool, the larger the overhead.
I think a synthetic test program could sort this out.
Jim Dempsey
(2010-10-29 Removed exaggeration.)
There is the latency to start execution of all the tasks (chunks performed by the scheduled threads). So when the grainsize reduces the number of threads used from the full complement to less than the full complement, the answer is that the startup latency is lower. So in the case of dropping from 32 threads to 8 threads you would be reducing the startup latency.
As to whether the (startup latency + all threads (chunks) run to completion latency) is reduced, this depends on the run time for the larger chunks and on memory contention for the memory bus/subsystem. If this is a simple array copy/addition/scale function, then likely 4 threads per socket would swamp the memory bandwidth.
Jim Dempsey
#10 "If this is a simple array copy/addition/scale function then likely 4 threads per socket would swamp the memory bandwidth."
And Dmitriy's point in #4?
But maybe I wasn't really exaggerating before in #8 about not nearly having enough information (see questions in #6). Giving hypothetical replies may not be the best use of anyone's time until the original poster, who seems to have a preconceived idea about where the problem lies, provides access to what we need to know to say anything meaningful.
Now I did some tests with parallel_for. The more I program with TBB, the less I understand. My synthetic
test works like this. There is an array with n elements, where n is 10000. I run a parallel_for on this array. There is a
simple body that just puts the size of the chunk into the first element of the chunk. The second element of the chunk
gets the thread ID. The remaining elements get 0. And then I do some senseless calculation to keep the CPU busy.
I ran the program on my notebook (Linux in a virtual machine) and also on my hex-core machine with HT enabled.
There are 2 observations:
- The program on my notebook is faster (2 times).
- The chunk sizes on the hex-core machine have a large variance.
Thanks.
--Michael
This is the output on my notebook:
Parallel Runtime=21.8565
Size Freq.
1250 8
Histogram checking ok.
Chunks per Thread
tid=5345 : 1250 1250 1250 1250 sum=5000
tid=5346 : 1250 1250 1250 1250 sum=5000
This is the output on my hex-core machine:
Parallel Runtime=38.8367
Size Freq.
1 9
2 22
3 4
5 15
9 1
10 11
19 6
20 5
39 8
78 8
156 13
157 2
312 11
313 9
Histogram checking ok.
Chunks per Thread
tid=3476 : 312 313 39 39 78 2 3 9 10 2 sum=807
tid=3477 : 312 313 312 313 312 2 3 2 2 sum=1571
tid=3478 : 19 20 312 313 5 5 10 10 78 78 sum=850
tid=3479 : 312 313 312 313 19 20 5 5 19 20 sum=1338
tid=3480 : 156 156 156 2 2 3 39 39 5 5 sum=563
tid=3481 : 2 5 5 2 2 78 78 312 313 19 20 sum=836
tid=3482 : 10 10 2 2 312 2 156 156 39 39 sum=728
tid=3483 : 156 156 156 157 39 39 2 2 2 sum=709
tid=3484 : 156 156 156 157 2 10 10 sum=647
tid=3485 : 312 313 312 313 2 3 2 5 5 5 sum=1272
tid=3486 : 2 2 78 78 78 10 10 10 10 sum=278
tid=3487 : 5 5 5 19 2 19 20 5 156 156 sum=392
This is my program:
[cpp]
#include "tbb/tick_count.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/partitioner.h"
using namespace tbb;
[/cpp]
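The rest of the listing was garbled by the forum. A rough sketch of a test along the lines described above might look like the following; the thread-id call, the busy-work constant, and all names are guesses for illustration, not Michael's actual code:
[cpp]
#include <cstdio>
#include <unistd.h>
#include <sys/syscall.h>
#include "tbb/tick_count.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/partitioner.h"
using namespace tbb;

const int N = 10000;
int data[N];

struct ChunkLogger {
    void operator()(const blocked_range<int>& r) const {
        int first = r.begin();
        data[first] = (int)r.size();                        // first element of the chunk: its size
        if (r.size() > 1)
            data[first + 1] = (int)syscall(SYS_gettid);     // second element: Linux thread id
        for (int j = first + 2; j < r.end(); ++j)
            data[j] = 0;                                    // remaining elements: 0
        volatile double x = 0.0;                            // senseless calculation to keep the CPU busy;
        for (unsigned int k = 0; k < (1u << 28); ++k)       // note: fixed work per chunk, not per element
            x += 1.0;
    }
};

int main() {
    tick_count t0 = tick_count::now();
    parallel_for(blocked_range<int>(0, N), ChunkLogger());  // default auto_partitioner in TBB 3
    tick_count t1 = tick_count::now();
    std::printf("Parallel Runtime=%g\n", (t1 - t0).seconds());
    // ...histogram of chunk sizes and per-thread sums would be built from data[] here...
    return 0;
}
[/cpp]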
I guess it's not faster; it just performs much less work. The amount of work in your test is not constant: it depends on the number of parallel tasks. The more hardware threads the machine has, the more parallel tasks it creates, and the more work it performs. That's a ridiculous benchmark.
Instead of:
for (unsigned int i = 0; i < (1 << 28); i++) {
put:
for (unsigned int i = 0; i < (1 << 24) * r.size(); i++) {
Did you kill *all* other processes in the system?
In any case, parallelism support libraries usually do not provide any guarantees to the end user with respect to the uniformity of chunk sizes; they usually only try to achieve linear parallel speedup. Is it really important to you?
My intention was not to write a benchmark. I'm still wondering about the partitioning of parallel_for.
If I have n elements and p processors/cores, then I would like to have n/p chunks of data. If I
know that the calculation of each element in the vector/matrix is the same and constant, then I
do not need a recursive splitting. I'm a little bit worried about the overhead of this partitioning scheme.
I understand that the current behavior is very flexible and self-adapting to the problem and the
workload on the machine that runs the algorithm. But so far I have not understood how I can control
the partitioning.
Let's have a look at the documentation of the auto_partitioner (Ref. Man. p. 26):
The range subdivision is initially limited to S subranges, where S is proportional to the
number of threads specified by the task_scheduler_init (11.2.1). Each of these
subranges is not divided further unless it is stolen by an idle thread. If stolen, it is
further subdivided to create additional subranges. Thus a loop template with an
auto_partitioner creates additional subranges only when necessary to balance load.
From this documentation, with n=10000 elements and 12 cores I would expect 12 subranges
with approx. 830 elements each. But I get something like this (including the change suggested by Dmitriy):
Run Test1
Parallel Runtime=12.5559
Size Freq.
1 3
2 13
3 7
5 22
9 4
10 28
19 10
20 10
39 25
40 1
78 15
79 1
156 14
312 9
313 6
Histogram checking ok.
Chunks per Thread
tid=4123 : 312 78 78 78 79 39 39 78 78 sum=859
tid=4124 : 312 313 312 sum=937
tid=4125 : 78 312 313 10 10 10 5 5 5 2 3 2 3 9 10 10 10 sum=797
tid=4126 : 312 313 19 19 20 2 2 5 5 5 5 2 3 2 3 19 20 19 20 sum=795
tid=4127 : 19 20 19 20 10 10 10 10 39 39 39 156 156 156 78 78 sum=859
tid=4128 : 39 39 39 39 39 39 156 156 156 2 2 10 10 10 10 39 5 5 9 10 10 39 sum=863
tid=4129 : 156 9 10 10 10 156 156 5 5 78 78 156 sum=829
tid=4130 : 39 39 39 40 39 39 39 39 39 156 156 39 39 19 20 sum=781
tid=4131 : 2 3 2 3 312 313 9 10 2 3 19 2 2 78 sum=760
tid=4132 : 78 78 10 10 78 78 10 10 10 10 156 156 19 20 19 20 20 20 sum=802
tid=4133 : 312 313 312 sum=937
tid=4134 : 5 5 5 5 5 5 5 312 313 5 5 5 5 10 10 39 39 sum=778
The overall workload is nearly uniformly distributed, but there is a huge number of very small partitions.
Does the documentation explain the observed behavior of this simple algorithm? I cannot see that it does.
Best regards,
Michael
Your program is inappropriate if you are ever going to compare execution times of two runs. And you do.
> But so far I have not understood how I can control the partitioning.
If you want static partitioning (each thread processes N/P elements), just set the grainsize to N/P.
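A minimal sketch of what that looks like; the element count and the per-element function are placeholders, not Dmitriy's code:
[cpp]
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include "tbb/partitioner.h"
#include "tbb/task_scheduler_init.h"

extern void process(int i);   // hypothetical per-element work

struct Body {
    void operator()(const tbb::blocked_range<int>& r) const {
        for (int i = r.begin(); i != r.end(); ++i)
            process(i);
    }
};

void run_static(int n) {
    // grainsize = N/P plus simple_partitioner stops the recursive splitting at that
    // size, giving roughly one chunk per thread (static partitioning).
    int p = tbb::task_scheduler_init::default_num_threads();
    size_t grain = (size_t)((n + p - 1) / p);
    tbb::parallel_for(tbb::blocked_range<int>(0, n, grain), Body(),
                      tbb::simple_partitioner());
}
[/cpp]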
> Does the documentation explain the observed behavior of this simple algorithm? I cannot see that it does.
I suspect the root cause is randomized work-stealing: an idle thread steals a piece of work from a *random* thread, not from the thread that has the biggest piece of work. So small partitions are inevitable.
You mean chunks of size n/p, but that would be suboptimal unless used in an outer loop on a machine that isn't running anything else and where there is negligible variability across elements in execution time. TBB's way is better.
auto_partitioner tends toward smaller range sizes and higher number of tasks as the number of available threads increases, exposing it to higher parallel overhead. You may see better performance if you set a reasonable grainsize in the blocked_range to avoid the smallest ranges that you get on the higher number of threads. Tell us how that works out!
The assertion "the calculation of each element in the vector/matrix is the same and constant" is generally not enough to guarantee equal load. Cache misses, page faults, and interrupts can create unequal work for cores. Many years back when I was prototyping TBB on a 32-way Altix system, I observed that even for something as simple as a matrix multiply, differences in cache misses made dynamic load balancing pay off in some cases.
The intended usage model of TBB is to not think about the number of cores, but about reasonably amortizing overhead of chunks of work. For example, find chunk sizes so that about 5% of the time is spent on parallel scheduling overhead. The scheduling overhead per chunk is roughly independent of the number of processors, so the chunk size should be good across different processor counts.
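For instance, with purely illustrative numbers (not measurements): if scheduling one chunk costs on the order of 10 microseconds, then each chunk should carry roughly 200 microseconds of useful work to keep scheduling near 5% of the total, and that target does not change when going from 6 to 32 cores.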
For the detailed workings of the partitioners, see the header file "tbb/include/partitioner.h". Method auto_partitioner::partition_type::should_execute_range has the basic subdivide-on-steal logic. The logic in affinity_partitioner::partitioner_type is similar, albeit obscured somewhat by the affinity logic.
Arch, can you confirm that auto_partitioner (currently) requires a sensible grainsize setting to avoid degraded efficiency on a higher number of cores? The problem seems to be that thieves do not always go for the biggest chunks (except if they stumble upon them by accident), which would make for a partially exponential growth in the number of tasks with the number of cores (and increased overhead as a result). A solution might be to lead the secondary and tertiary thieves etc. back to the original victim for a more even distribution of work, if that were doable. But perhaps I overlooked something when arriving at that conclusion "in cerebro"...
Meanwhile, the recommended use would be to find a suitable grainsize using a simple_partitioner to avoid some of the sources of degraded performance with small subranges (not completely reliable without an actual many-core test machine, but still), and then replace it with an auto_partitioner for workloads that don't exhibit extreme variance across elements for improved performance on fewer cores.
Does that seem plausible?
(Added) Hmm, that goes against recursive parallelism, of course. But doesn't the auto_partitioner also, in a way? New idea: instead of splitting up the initial range into N times the number of worker threads and letting each part be stolen, record the current auxiliary grainsize starting at N times the number of cores parts (same division as now), but use recursive parallelism to get down to that size. Only when a smaller piece gets stolen will it be subdivided using the same mechanism. This will get the work divided across workers with less overhead in two ways that both contribute to better performance: recursive parallelism, and a higher probability of stealing a bigger chunk. Well, that's just what I came up with during the drive home, but I'm off again now, so maybe I'll get some more inspiration on the way. :-)
(Added 2010-11-16) So each task would be marked as one of about log_2(N*default_num_threads) generations for normal splitting (unless the range runs into its grainsize first), with the last generation executed completely when not stolen or starting another cycle when stolen. The generation could be recorded inside the local copy of the partitioner, and evaluated by incrementation modulo the value above in the next generation.
(Added 2010-11-18) Now that's embarrassing... auto_partitioner already works almost exactly like I "proposed", sorry about that. Note to self: don't just wing it.
