Intel® oneAPI Threading Building Blocks

Squeezing more speed out of parallel_for (or should I try tasks?)

ayembee
Beginner
I have been experimenting with TBB for the last day, after sorting out a compile issue. I like the syntax and was able to rewrite a major bit of array processing fairly quickly to use parallel_for. However, I am seeing very little speedup, and I REALLY do not understand the grainsize concept on anything remotely like an intuitive level.

I would really appreciate some suggestions for getting the kind of performance out of this that I intuitively expect. (And some better understanding of grainsize in the context of this problem would be nice too ;)

So here's the setup:

  • SourceData: a large variant array -- say 50,000 rows x 20 columns, a mix of strings and numbers; roughly 10 MB in total.
  • Each column will be processed individually in some user-specified way; it could be a simple math operation, a regex substitution, a string operation, or some other custom function.
  • Each column's transform is 100% independent of any other column's transform.
  • All results are written into unique cells in a pre-constructed output array.

What I'd obviously like to do is parallelise at the column level, so each column is processed independently. Then each time this processing operation is run, my critical path should depend on the slowest individual column(s), not on the serial time to go through each one.


So, I refactored the core processing code so it could run inside a parallel_for. My first thought was that I should set grainsize = 1. My intuitive feeling (clearly MASSIVELY wrong) was that the blocked_range would be split so that each column got its own thread, i.e. 1 grain = 1 column.

Time taken was around 2-5x slower than serial!
Assumption about grainsize clearly wildly wrong? :p
Tried auto_partitioner; also got terrible results.

Set grainsize manually to 10,000 with simple_partitioner -- aha!
Execution time finally faster than serial -- but only just: 5-8% faster.


Given that I have a quad-core Xeon with hyperthreading, and this looks like a VERY parallelisable problem, I was thinking I should be able to get the processing time down to about a half or a quarter of what it was before (the most flexible transform operations have a fixed overhead, so I don't expect perfect scaling).

Could anyone suggest how to get more performance?
It seems like it is the kind of problem that is absolutely begging to get huge speedups ;)

So...

  • Is parallel_for the best algorithm in this case?
  • What in god's name does grainsize really imply in this context?
  • Should I write my own split to force each column onto a thread?
  • Any other suggestions? :)

Many thanks for the help as I embark on a TBB journey...

Regards.

-A


The current invoking code is very simple; it looks like this:
[cpp]tbb::task_scheduler_init init;
tbb::parallel_for(
    tbb::blocked_range<size_t>(0, n_columns, 10000),
    TransformArrayColumns(&SourceData, &Parameters)
);[/cpp]
RafSchietekat
Valued Contributor III
If you have only 20 columns, a grainsize bigger than 20 will have the same effect as 20, i.e., no parallelism, because a range is only split up if its size is strictly larger than the grainsize (although the resulting pieces may be smaller than the grainsize, which I find a bit counterintuitive).

Can you rearrange the computations by nesting the column computations inside the row iterations? Use parallel_for to iterate over the rows, and don't bother about parallelism for the columns. Your cache will reward you bountifully.
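Something like this rough sketch, with made-up names and a trivial stand-in transform (lambda form for brevity, assuming a lambda-capable compiler; the functor form works the same way):

[cpp]#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include <vector>

// Toy version of the row-wise arrangement: the parallel_for walks blocks of
// rows, and the plain inner loop applies each column's transform to that row.
void transform_rows(std::vector< std::vector<double> >& data) {
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, data.size()),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t row = r.begin(); row != r.end(); ++row)
                for (size_t col = 0; col < data[row].size(); ++col)
                    data[row][col] *= 2.0;  // stand-in for the real per-column transform
        });
}[/cpp]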
ayembee
Beginner
That sounds like my initial assumption about grainsize would have been correct then -- e.g. each column should have been run in a separate thread with a grainsize of 1 -- in which case I can't see why I'd get such dramatically worse performance characteristics vs serial execution... :-\

Unfortunately I can't easily arrange to invoke the parallel_for row-wise; some of the columns can (but don't necessarily) incur a fixed penalty as their transform is set up. I'm specifically thinking about the user-defined transform functions that are specified in JS via Google's V8 engine: the initial parsing to machine code occurs once, and each element of the array is then fed into the resultant machine code. Going row-wise on that would absolutely murder performance, as you'd incur that fixed cost each time.

Having said that, maybe each individual transform operation (with the exception of these specific custom functions; I don't believe V8 is thread-safe) could itself be threaded. Then you'd be threading each column at the row level, as you suggest. I can give it a go.
RafSchietekat
Valued Contributor III
It would surprise me very much if you couldn't keep some kind of reference to a transform that has already been set up, so that you can reuse it on each new row. You could do that inside auto_partitioner-provided ranges (which tend to be fairly big at least in the beginning, amortising some of the setup costs), or using thread-local storage, all assuming that you have to keep things separated by thread.
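With TBB's thread-local storage that could look something like this sketch (ExpensiveTransform is a made-up stand-in for whatever is costly to set up -- a compiled regex, say):

[cpp]#include "tbb/enumerable_thread_specific.h"
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include <vector>

// Made-up stand-in for a transform with a costly one-time setup.
struct ExpensiveTransform {
    ExpensiveTransform() { /* expensive setup runs here, once per thread */ }
    double apply(double x) const { return x + 1.0; }
};

static tbb::enumerable_thread_specific<ExpensiveTransform> tls_transform;

void process_column(std::vector<double>& values) {
    tbb::parallel_for(tbb::blocked_range<size_t>(0, values.size()),
        [&](const tbb::blocked_range<size_t>& r) {
            // Built lazily on first use, then reused for every range this thread takes.
            ExpensiveTransform& t = tls_transform.local();
            for (size_t i = r.begin(); i != r.end(); ++i)
                values[i] = t.apply(values[i]);
        });
}[/cpp]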

Just for the heck of it, how about using a pipeline instead, with filters for each column? Use parallel filters where you can (those are very cheap to operate), and unordered serial filters for the operations that are expensive to set up. Put the parallel filters in front (normally, the order of the filters matters, and any parallel ones add to the execution time of the previous serial filter, which can be unfortunate if that's already the bottleneck, but here you can easily avoid that). The result might not be so great yet, but at least the cache gets used better.
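A bare-bones sketch of that shape with the classic pipeline interface -- the Row type and the two column transforms are made up, one cheap and parallel, one serial and unordered:

[cpp]#include "tbb/pipeline.h"
#include <vector>

struct Row { double a, b; };              // stand-in for one row of the table

// Serial input filter: hands each row to the rest of the pipeline.
class InputFilter : public tbb::filter {
    std::vector<Row>& rows;
    size_t next;
public:
    InputFilter(std::vector<Row>& r)
        : tbb::filter(tbb::filter::serial_in_order), rows(r), next(0) {}
    void* operator()(void*) { return next < rows.size() ? &rows[next++] : NULL; }
};

// Cheap column transform: safe to run in parallel, so it goes up front.
class ColumnAFilter : public tbb::filter {
public:
    ColumnAFilter() : tbb::filter(tbb::filter::parallel) {}
    void* operator()(void* item) { static_cast<Row*>(item)->a *= 2.0; return item; }
};

// Expensive-to-set-up column transform: serial, but order doesn't matter.
class ColumnBFilter : public tbb::filter {
public:
    ColumnBFilter() : tbb::filter(tbb::filter::serial_out_of_order) {}
    void* operator()(void* item) { static_cast<Row*>(item)->b += 1.0; return item; }
};

void run_column_pipeline(std::vector<Row>& rows) {
    tbb::pipeline p;
    InputFilter in(rows);
    ColumnAFilter fa;
    ColumnBFilter fb;
    p.add_filter(in);
    p.add_filter(fa);
    p.add_filter(fb);
    p.run(16);                            // max rows in flight at once
}[/cpp]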

(Added) I'm not sure I have the whole picture, but also don't forget about the bad effects of a lack of parallel slack: with 4 times 2 hardware threads to work on 8 tasks, the tasks had better be similar in size, otherwise near the end you'll likely have some hardware threads idle while others are still processing. This and the best use of the cache both point toward parallelism across rows.
ARCH_R_Intel
Employee

As Raf remarked before, operating over columns incurs significant cache overhead. Each cache line is typically 64 bytes. If an element of the array is only 4 bytes, the processor still brings in the entire 64 bytes, so you're paying a memory bandwidth tax of 16x. Even the serial code might benefit significantly if the array could be transposed so that the physical operations are over rows.

http://www.upcrc.illinois.edu/workshops/paraplop10/papers/paraplop10_submission_4.pdf has a nice introduction to basic cache issues. It might be worth experimenting with changing the code to use a transposed form of the matrix. Even if you can't use that in the production version, timing the transposed form would indicate whether cache issues are the bottleneck.

Over the last two decades, processor speeds have increased much faster than memory bandwidth. A cache miss is on the order of a hundred cycles. So designers of programming interfaces really do need to consider this when choosing data layouts.

ayembee
Beginner
This is great stuff... Exactly the sort of thing I haven't really had to think about before (regarding cache). I have to say it doesn't immediately form any intuitive links back to the data, but I think it'll start to make sense once I build the appropriate mental pictures.

At the moment the serial code already operates column by column, walking down the rows within each column, and has been fairly well optimised, avoiding all unnecessary data copies etc. The same operation is done to all the rows in one column before moving on to the next operation on the adjacent column -- I believe this is what you are suggesting? So the operation is the same, and the data is similar, tens of thousands of times, before moving on to the dissimilar operation/data.

So, if I've got this right, parallelising at the column level, with each column in its own thread, could be causing major cache issues, hence the (for me) surprising drop in performance as it thrashes data in and out of the cache.

I've run some more profiling to find out which specific operations are most costly, and there are three that really stand out from the pack: regular expression substitution via PCRE, date string parsing, and the V8 (JavaScript) freeform transforms. With the exception of V8, the other two could work really nicely with a parallel_for inside these functions (which take a const ref to the column as an argument and loop internally). This would definitely seem to tick the boxes referenced above for good parallel behaviour -- then the blocked_range will be across the rows inside the function, not the columns above.
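In other words, something shaped like this inside each of those functions (substitute() here is just a placeholder for the real PCRE call):

[cpp]#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"
#include <string>
#include <vector>

std::string substitute(const std::string& s) { return s; }  // placeholder for the PCRE substitution

// The column transform keeps its existing signature; only its internal
// loop over the rows becomes a parallel_for.
void transform_column(const std::vector<std::string>& in, std::vector<std::string>& out) {
    out.resize(in.size());
    tbb::parallel_for(tbb::blocked_range<size_t>(0, in.size()),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                out[i] = substitute(in[i]);   // each row is independent
        });
}[/cpp]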

I'm somewhat more leery about V8, if only because of the comment in the header about threading:

Multiple threads in V8 are allowed, but only one thread at a time is allowed to use V8. The definition of 'using V8' includes accessing handles or holding onto object pointers obtained from V8 handles.

Looks like you'd have to lock the incoming calls so they block each other for access to the engine; little benefit there. I have found that I can almost certainly cache a compiled script between sessions though. I was already compiling only once per block, and it looks like I'll be able to cache between blocks too, which will still save some time when I may be executing 50 or more large data blocks consecutively (with multi-GB source data this is totally normal).

Lots of food for thought... ;)
ARCH_R_Intel
Employee
What I was suggesting (as an experiment) is to physically transpose the data so that what is logically a column can be stored as a row of a matrix, and vice versa. That way, you have 20 physical rows with 50000 physical columns, and adjacent elements in a physical row (logical column) would pack into cache lines nicely.
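The experiment itself is just a reshuffle along these lines (plain std::vector shown purely for illustration), after which each logical column sits contiguously in memory:

[cpp]#include <vector>

// Store the table as transposed[col][row]: 20 physical rows of 50,000
// contiguous elements each, so walking a logical column walks consecutive
// memory and packs neatly into cache lines.
std::vector< std::vector<double> >
transpose(const std::vector< std::vector<double> >& by_row) {
    const size_t n_rows = by_row.size();
    const size_t n_cols = n_rows ? by_row[0].size() : 0;
    std::vector< std::vector<double> > by_col(n_cols, std::vector<double>(n_rows));
    for (size_t r = 0; r < n_rows; ++r)
        for (size_t c = 0; c < n_cols; ++c)
            by_col[c][r] = by_row[r][c];
    return by_col;
}[/cpp]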

ayembee
Beginner
Got it; my mental picture has now more or less caught up after a lot more reading about cache behaviour and memory allocation. It comes down to data locality etc. A few interesting things to try when I get back in on Monday. Appreciate the pointers (pun not intended ;)
RafSchietekat
Valued Contributor III
#4 "It might be worth experimenting with changing the code to use a transposed form of the matrix."
Unless another part of the program needs significant amounts of random access (right?), I agree that this is the way out if any column really must be executed serially.

#5 "Multiple threads in V8 are allowed, but only one thread at a time is allowed to use V8. The definition of 'using V8' includes accessing handles or holding onto object pointers obtained from V8 handles."
Not quite what you would expect in this day and age, and from someone like Lars Bak... a V8 engine running on only one cylinder. So it really cannot be partitioned in any way?

I assume that only one column uses V8 (otherwise the parallel_for you tried wouldn't have been legal), so I would propose a transposition of the matrix, and using parallel_for or parallel_invoke to process the former columns, with nested parallel_for's for all but the V8-related one, unless the V8-related work is the last to finish anyway (I don't think the generated code is highly optimised, and if the algorithm were simple it could easily be rewritten in C++).
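In outline, something like this -- the three column functions are placeholders, and each of the non-V8 ones would carry its own nested parallel_for over the rows:

[cpp]#include "tbb/parallel_invoke.h"

// Placeholders for the per-column work.
void do_regex_column() { /* nested parallel_for over this column's rows */ }
void do_date_column()  { /* nested parallel_for over this column's rows */ }
void do_v8_column()    { /* kept serial: V8 allows only one thread at a time */ }

void transform_all_columns() {
    // The column transforms are independent, so run them concurrently.
    tbb::parallel_invoke(do_regex_column, do_date_column, do_v8_column);
}[/cpp]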
ayembee
Beginner
I can reshape the incoming data before it reaches the parsing function, so it'll be straightforward to benchmark what speedup can be had from pure transposition, and then feed that back up the relevant codepath properly, so the data is actually allocated that way to start with. Definitely looking like a great place to start.

As for V8, it could be used in zero or one of the columns; in the case above I actually had zero -- it's for use where the built-in native transforms aren't sufficient and the parsing needs more flexibility. Usually it's going to be zero, so in this case I'd actually just be looking to make sure that the total parse time is no more than the V8 column's... Most transforms will not use V8, as the built-ins are reasonable in the common cases.

Given how strict the conditions above are (even just accessing handles!), it may not be possible to work around it. I'm going to follow up with them on that score, as my conditions are well-defined; each execution is guaranteed to be in a separate V8 context, and I've just discovered I can create my own out-of-context persistent cache for the compiled scripts (meaning that if I need to process 50 large blocks, I can at least save 49 additional compiles; previously I could stash a block-local compiled script, but it couldn't persist until the next call due to scope-related garbage collection).

Found a nice opportunity for a small function-level cache in the date parsing routine too; looking at the typical profile of the data coming through, many dates are repeated frequently, so it should speed things up massively and wouldn't even need threading.
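Something along these lines is what I have in mind (parse_date stands in for the existing routine; as written the cache assumes serial use, so it would need a lock or a per-thread copy otherwise):

[cpp]#include <ctime>
#include <map>
#include <string>

time_t parse_date(const std::string& s);   // the existing (expensive) parser, assumed elsewhere

// Function-level memo: identical date strings are parsed once and reused.
time_t parse_date_cached(const std::string& s) {
    static std::map<std::string, time_t> cache;
    std::map<std::string, time_t>::iterator it = cache.find(s);
    if (it != cache.end())
        return it->second;
    time_t value = parse_date(s);
    cache[s] = value;
    return value;
}[/cpp]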

I was going to take a look at the Intel VTune tools; visualising thread behaviour, AND (I think) cache-use statistics/profiling? Will investigate...
RafSchietekat
Valued Contributor III
I just saw some presentation about V8 with Lars Bak, in which it is said that while JavaScript is essentially single-threaded, it can make sense to run several "worker threads" that perhaps communicate by messages. Maybe it does allow multiple carefully partitioned instantiations already? It just doesn't seem right to start with a nonreentrant implementation if there's any chance whatsoever that this may have to be revised later on.

What is a V8 "context", if not exactly that?
ayembee
Beginner
It seems contexts and threads are actually not related after all... Further digging brings this up from Erik Corry:

>Contexts and threads have nothing to do with each other in V8. A single thread can handle several contexts (as in Chromium) or a context could be used by more than one thread. The important thing is that only one thread is doing V8 stuff at a time.

...and

>Normally you only use V8 in one thread. That's nice and simple. If you want to use more than one thread you have to make use of the V8::Locker class from include/v8.h. See that file for more details. This puts a single big lock around everything V8 does. So although the program as a whole can have more than one OS thread that uses V8 there can be only one thread at a given time that is using V8.

Not looking so good.
RafSchietekat
Valued Contributor III
The upside is that additional contexts are fairly inexpensive.

(Added 2010-04-04) That would be about reusing the work that has already been done, assuming that this doesn't prevent getting a fresh environment even though JavaScript can change the basic libraries. But why would it not also be doable to start from a snapshot of a context that has already been warmed up, for either an initial or additional partitioned execution environment... it would mainly involve copying stuff, and even that can be optimised using something like private memory mapping, I guess (hmm, does that work for multiple mappings inside the same process?).

On the other hand, what are you doing with JavaScript that you couldn't also do with C++?
ayembee
Beginner
>> On the other hand, what are you doing with JavaScript that you couldn't also do with C++?

Offering users of the framework an elegant way to define their own parsing functions; the built-in transforms cover all typical cases, but there are always going to be custom parsing conditions that can't be anticipated or covered by generalised code -- it gives great flexibility and maintains good execution speed. (I've wrapped Python the same way; it's the user's choice which they use.)

Basically there are three primary bottlenecks, and two of them are disk/network I/O; the third area is this transform pipeline, and though it's reasonably fast already, there's definitely scope to make it non-trivially faster... It's also such a conceptually clean problem that it makes a good test-bed for getting familiar with TBB concepts.
RafSchietekat
Valued Contributor III
Time to get desperate: parallel_for(auto_partitioner) over the rows, and fork/wait to process each range. :-)

(Added) Or maybe not, because apparently there's very little you can safely do in a forked multithreaded program. :-(
jimdempseyatthecove
Honored Contributor III
Arch,

The transposition experiment should be done, but the payback will come only if the transposed data is referenced a sufficient number of times to recover the cost of rewriting it (effectively you are performing a read of one stream (cache-line-packed data), a write of 20 streams, and then 20x read-stream-and-operate()). Without additional information regarding what will be done, it would be premature to speculate about the effectiveness of the transposition.

A variation of this (transposition) should be explored as well. This would be to set up a parallel pipeline where you transpose batches of the 50,000 rows, with the batch size tuned for the particular processor cache architecture and number of sockets.

e.g.

On a system with 2 sockets, each with a unified L3 cache, one thread from each socket could perform:

loop:
transposition of slice of data into batch
distributed task distribution (and participating in shortest enqueued tasks)
end loop

The parallel_pipeline should be able to handle this type of programming.
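A rough sketch of that shape with parallel_pipeline (lambda form, trivial stand-in transform; batch_rows would be tuned to the cache):

[cpp]#include "tbb/pipeline.h"
#include <algorithm>
#include <vector>

struct Batch { std::vector< std::vector<double> > by_col; };   // transposed slice

void process_in_batches(const std::vector< std::vector<double> >& by_row, size_t batch_rows) {
    size_t next_row = 0;
    tbb::parallel_pipeline(8,   // max batches in flight
        // Serial stage: carve off the next slice of rows and transpose it.
        tbb::make_filter<void, Batch*>(tbb::filter::serial_in_order,
            [&](tbb::flow_control& fc) -> Batch* {
                if (next_row >= by_row.size()) { fc.stop(); return NULL; }
                size_t end = std::min(next_row + batch_rows, by_row.size());
                size_t n_cols = by_row[0].size();
                Batch* b = new Batch;
                b->by_col.assign(n_cols, std::vector<double>(end - next_row));
                for (size_t r = next_row; r < end; ++r)
                    for (size_t c = 0; c < n_cols; ++c)
                        b->by_col[c][r - next_row] = by_row[r][c];
                next_row = end;
                return b;
            }) &
        // Parallel stage: column-at-a-time work on the cache-friendly batch.
        tbb::make_filter<Batch*, void>(tbb::filter::parallel,
            [](Batch* b) {
                for (size_t c = 0; c < b->by_col.size(); ++c)
                    for (size_t r = 0; r < b->by_col[c].size(); ++r)
                        b->by_col[c][r] *= 2.0;               // stand-in transform
                delete b;
            }));
}[/cpp]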

Jim Dempsey
RafSchietekat
Valued Contributor III

"The transposition experiment should be done but the payback will come only if the transposed data is referenced a sufficient number of times to recover the cost of rewriting this data (effectively you are performing a read one stream (cache line packed data) - write 20 streams then 20x read stream and operate(). Without additional information regarding what will be done it would be premature to speculate as to the effectiveness or the transposition."
I think the assumption is an architectural transposition, with data immediately written that way, maybe in a number of blocks.
