Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Concurrent Containers vs STL Containers + Mutex/Lock

timminn
Beginner

I am wondering: if efficiency is the ONLY concern,

could we replace the official Intel TBB concurrent containers with STL containers in cooperation with a TBB mutex?

If not, please specify the efficiency difference, with some test data on timing.

If so, how do we choose the best kind of mutex for a particular circumstance? For example, to perform push/pop/sort/swap/... on a queue/vector?

robert-reed
Valued Contributor II
Quoting - timminn

I am wondering: if efficiency is the ONLY concern,

could we replace the official Intel TBB concurrent containers with STL containers in cooperation with a TBB mutex?

If not, please specify the efficiency difference, with some test data on timing.

If so, how do we choose the best kind of mutex for a particular circumstance? For example, to perform push/pop/sort/swap/... on a queue/vector?

Certainly a scoped TBB lock around all accesses to an STL container would be thread-safe, as long as each transaction was contained within a single locking scope (to assure container integrity). How much slower would it be? That really depends on the type of container and on the distribution of accesses to it in support of the parallel algorithm being implemented.
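
For illustration, here is a minimal sketch of that pattern, assuming TBB's spin_mutex; the queue and the try_pop helper are made up for this example. The point is that the whole check-then-pop transaction lives inside a single locking scope:

#include <queue>
#include "tbb/spin_mutex.h"

std::queue<int> q;
tbb::spin_mutex q_mutex;

// The whole check-then-pop transaction sits inside one locking scope,
// so no other thread can empty the queue between the test and the pop.
bool try_pop(int& value) {
    tbb::spin_mutex::scoped_lock lock(q_mutex);
    if (q.empty())
        return false;
    value = q.front();
    q.pop();
    return true;
}   // the lock is released when 'lock' goes out of scope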

The engineering discipline includes a certain amount of trial and error. If you are interested enough to find out how the various scoped locks would affect common data structure access, perhaps you'll want to try some experiments yourself. And if you discover anything interesting, be sure to share it with all of us.

Anton_Pegushin
New Contributor II
Quoting - timminn

I am wondering: if efficiency is the ONLY concern,

could we replace the official Intel TBB concurrent containers with STL containers in cooperation with a TBB mutex?

If not, please specify the efficiency difference, with some test data on timing.

If so, how do we choose the best kind of mutex for a particular circumstance? For example, to perform push/pop/sort/swap/... on a queue/vector?

Andrey Marochko has just posted a blog on how and why TBB concurrent containers are different from STL+mutex ones. Here is the link: http://software.intel.com/en-us/blogs/2008/10/13/tbb-containers-vs-stl-functionality-rift/. In short: no, it's not just efficiency that drove the choice of the TBB concurrent containers' API and implementation. Andrey gives good background (even with an example) on what "inherently not thread-safe" means for STL containers. He also references Arch's blog on "concurrent linked lists", but the link is broken. Watch out :).

And in a way I'd have to agree with Robert. Of course it's important to understand the choice of a certain container for a certain algorithm, but practice makes perfect. Go ahead, implement (for example) a word counter and find out what the most frequent word in "War and Peace" is, using concurrent_hash_map and std::map. I tried it and was impressed with the results.

RafSchietekat
Valued Contributor III

How about sharing those results anyway? I could not implement such a test without wondering why I was not using parallel_reduce instead, looking for a way to strike the right balance between creating parallel slack and avoiding join overhead, and fretting about the most efficient merge algorithm (not something easily parallelised without privileged access to the data structure's internal workings, now there's an idea...). And wouldn't the existence of a more efficient implementation have some impact on the relevance of this benchmark to situations that cannot be improved the same way (probably not, but my only certainty is doubt)?

Anton_Pegushin
New Contributor II
Quoting - raf_schietekat

How about sharing those results anyway? I could not implement such a test without wondering why I was not using parallel_reduce instead, looking for a way to strike the right balance between creating parallel slack and avoiding join overhead, and fretting about the most efficient merge algorithm (not something easily parallelised without privileged access to the data structure's internal workings, now there's an idea...). And wouldn't the existence of a more efficient implementation have some impact on the relevance of this benchmark to situations that cannot be improved the same way (probably not, but my only certainty is doubt)?

I'm not sure we're talking about the same problem. The statement I'm proposing is: let's say there is a vector into which we've read all of "War and Peace", and we need to find the one word which is most frequently found in the text (by the way, the word is "and" :)... pretty obvious). The way I did it: I created a concurrent_hash_map with <string, int> pairs, where the value is the number of times the word (the string) was found. And in a parallel_for, several threads go through the text (the vector), modifying this concurrent_hash_map in parallel. I've been thinking about what you said - using parallel_reduce - for a good 15 minutes, but I can't figure out how it's applicable here.

Anyway, the results I've seen were a ~4x overall speedup on my dual-core laptop. The reason for it was concurrent access + hashing instead of locked access + no hashing.
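
For reference, a minimal sketch of that approach, assuming the text has already been tokenized into a vector of words (the names are illustrative, not the original code):

#include <cstddef>
#include <string>
#include <vector>
#include "tbb/blocked_range.h"
#include "tbb/concurrent_hash_map.h"
#include "tbb/parallel_for.h"

typedef tbb::concurrent_hash_map<std::string, int> WordTable;

void count_words(const std::vector<std::string>& words, WordTable& table) {
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, words.size()),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i) {
                WordTable::accessor a;       // holds a write lock on the entry
                table.insert(a, words[i]);   // inserts {word, 0} if absent
                a->second += 1;              // bump the count under the lock
            }
        });
}

The per-entry lock held by the accessor is what replaces the single global mutex of the STL+lock approach.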

Dmitry_Vyukov
Valued Contributor I

I'm not sure we're talking about the same problem. The statement I'm proposing is: let's say there is a vector into which we've read all of "War and Peace", and we need to find the one word which is most frequently found in the text (by the way, the word is "and" :)... pretty obvious). The way I did it: I created a concurrent_hash_map with <string, int> pairs, where the value is the number of times the word (the string) was found. And in a parallel_for, several threads go through the text (the vector), modifying this concurrent_hash_map in parallel. I've been thinking about what you said - using parallel_reduce - for a good 15 minutes, but I can't figure out how it's applicable here.

Anyway, the results I've seen were a ~4x overall speedup on my dual-core laptop. The reason for it was concurrent access + hashing instead of locked access + no hashing.

It's difficult to draw any conclusions here. What was the speedup due to? The replacement of a binary tree with hashing? Or smarter locking? It would be more sensible to compare std::map+mutex with hash_map+mutex, and then to compare hash_map+mutex with concurrent_hash_map.
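
As an illustration, the std::map+mutex baseline in such a comparison might be sketched like this (a hypothetical counter, not anyone's actual benchmark code; a single coarse lock serializes every update):

#include <cstddef>
#include <map>
#include <string>
#include <vector>
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
#include "tbb/spin_mutex.h"

std::map<std::string, int> counts;   // ordered tree, not thread-safe by itself
tbb::spin_mutex counts_mutex;        // one global lock guards every access

void count_words_locked(const std::vector<std::string>& words) {
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, words.size()),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i) {
                tbb::spin_mutex::scoped_lock lock(counts_mutex);
                ++counts[words[i]];  // tree lookup + update, all under the lock
            }
        });
}

Swapping std::map for a hash map, or the global mutex for per-entry locking, changes exactly one variable at a time, which is what makes the comparison meaningful.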

RafSchietekat
Valued Contributor III

"I've been thinking about what you said - using parallel_reduce - for a good 15 minutes, but I can't figure out how it's applicable here." Why not? Aggregations are not limited to scalar values, e.g., an average can only be determined after sum and count have individually been computed, but this can still be combined in one pass. With std::map, joins could be merges of ordered collections, not necessarily maps (the result of a join might well be a list or vector); with tbb::concurrent_hash_map, it may be best to first just assemble a collection of maps and to then merge their segments in parallel (assuming privileged access).

"The reason for it was concurrent access + hashing instead of locked access + no hashing." Apples and oranges...

BTW, Dmitriy, what is the story with regard to those superfast hash maps we keep hearing about if every access involves a write, before a single read-only reduction pass finds the maximum?

Dmitry_Vyukov
Valued Contributor I
Quoting - raf_schietekat

BTW, Dmitriy, what is the story with regard to those superfast hash maps we keep hearing about if every access involves a write, before a single read-only reduction pass finds the maximum?

I am not sure I get you.

It's just not a situation that can be significantly hastened with a super-fast hash map, although it would still be faster with one. Note that not all accesses involve mutation of the hash map. Assume all threads encounter the word "and": only the first thread will insert it into the hash map, and all other threads will just atomically increase the counter - no mutation of the hash map from those threads with respect to the word "and".

It's not my quirk: usually all benchmarks of hash maps measure performance on a substantially read-heavy workload, just because that's usually the case. And if it's not the case, then maybe one is better off with another structure (not a hash map).

Anton_Pegushin
New Contributor II
Quoting - raf_schietekat

Something's wrong with quoting here... anyway,

First - yeah, I totally agree on the "apples and oranges" thing. I should have compared concurrent_hash_map to a hash_map to eliminate the effect of hashing, but I wrote this small exercise at the time when I was learning TBB, so I just wanted to see how it worked and whether it worked fast enough. And what's more important, I originally suggested this task in my first comment to focus on what the _differences_ between concurrent containers and STL containers are _besides_ efficiency (ease of concurrent access, an inherently thread-safe API, etc.).

Regarding the parallel_reduce approach. That actually was my first guess about what you were suggesting, but for me it boiled down to comparing:

(1) the overhead of write-accessing a locked concurrent_hash_map node (when several threads are updating the value for the same key in one global concurrent_hash_map)

with

(2) the overhead of merging a bunch of locally created (inside a parallel_reduce function object) non-thread-safe containers.

I might be missing something here, but to me both approaches are single-pass ones with just different natures of overhead, and I'm not sure one would be dramatically better (meaning smaller) than the other... it should be a good idea to go back to that old code and try it, though.

Dmitry_Vyukov
Valued Contributor I
Quoting - Dmitriy V'jukov

It's not my quirk: usually all benchmarks of hash maps measure performance on a substantially read-heavy workload, just because that's usually the case. And if it's not the case, then maybe one is better off with another structure (not a hash map).

For example, in this case, I think it would be better to apply the following reduction.

In the 'map' phase every thread processes its own part of the text, so at the end of the 'map' phase every thread has created a private map.

Assume we have 4 threads/cores. The 'reduce' phase goes as follows. Every thread splits its map into two parts, A..M and N..Z (by the first letters of the words). Then thread 1 combines the A..M parts of threads 1 and 2. Thread 2 combines the N..Z parts of threads 1 and 2. Thread 3 combines the A..M parts of threads 3 and 4. And thread 4 combines the N..Z parts of threads 3 and 4.

Then every thread splits its map into two parts again: A..G, H..M, N..R, S..Z. Then thread 1 combines the A..G parts of threads 1 and 3. And so on.

At this point we have 4 fully aggregated regions: A..G, H..M, N..R, S..Z. Then every thread finds the maximum in its region. And then the main thread finds the maximum among the 4 values. Q.E.I.

I am not sure whether this algorithm can be expressed with a predefined TBB algorithm. Nevertheless, it can be straightforwardly expressed with TBB tasks (with finer granularity in order to eliminate the possibility of load imbalance).

Dmitry_Vyukov
Valued Contributor I

Regarding the parallel_reduce approach. That actually was my first guess about what you were suggesting, but for me it boiled down to comparing:

(1) the overhead of write-accessing a locked concurrent_hash_map node (when several threads are updating the value for the same key in one global concurrent_hash_map)

with

(2) the overhead of merging a bunch of locally created (inside a parallel_reduce function object) non-thread-safe containers.

I might be missing something here, but to me both approaches are single-pass ones with just different natures of overhead, and I'm not sure one would be dramatically better (meaning smaller) than the other... it should be a good idea to go back to that old code and try it, though.

Well, it's one of those situations where it's difficult to predict anything without measurements.

But my intuition says that the approach with reduction can yield much better results, even though they have the same "big O". I think it can be a matter of orders of magnitude (10x-100x), although it will depend on the details.

For example, in the approach that I propose here:

http://software.intel.com/en-us/forums/showpost.php?p=67547

I try to take advantage of locality, i.e. a thread combines 2 lists, one of which must already be in that thread's cache. It's also possible to prefetch the other list.

Dmitry_Vyukov
Valued Contributor I

Something's wrong with quoting here... anyway,

If this weird stuff happens, I just manually copy the contents of the post to which I am replying...

RafSchietekat
Valued Contributor III

Dmitriy: "and all other threads will just atomically increase the counter" Oh yes, of course, it's only the membership mutations that are costly. Follow-up question: what if the distribution is mainly a very long tail of low-count words?

Dmitriy: "For example, in this case, I think, it would be better to apply following reduction." Interesting... are you quite sure that manipulating data in 3 different places (merge of 2 merges) is faster than doing 4-way merges? My intuition would be the opposite, especially for a majority of low counts.

ARCH_R_Intel
Employee
Generating histograms in parallel is a perplexing problem. I had a summer intern look at it two summers ago, specifically at the problem of writing an efficient "map-reduce" algorithm. There are two extreme cases to consider:
  1. There are only a few keys, i.e., the keys are replicated many times. (E.g., a book like "Green Eggs and Ham", written by Seuss in response to a bet that he couldn't write a book with only 50 words.) In this case, a parallel reduction is more efficient. Most of the time can be spent doing independent counting, with only a little work for merging.
  2. There are many keys, and each key appears only a few times. In this case, a central hash table will be more efficient, because there is practically no independent work. Even if each thread sorts its portion of the histogram before merging, the merging work will dominate. Maybe parallel merging will pay off compared to a centralized hash table.

The point is that the best algorithm is sensitive to the data distribution.

Dmitry_Vyukov
Valued Contributor I
Quoting - raf_schietekat

Dmitriy: "and all other threads will just atomically increase the counter" Oh yes, of course, it's only the membership mutations that are costly. Follow-up question: what if the distribution is mainly a very long tail of low-count words?

I didn't say that a fast hash map will significantly hasten this algorithm. I said exactly the opposite: it won't significantly hasten this algorithm.

Though a faster hash map will make the algorithm faster; note that I didn't say to what extent.

Dmitry_Vyukov
Valued Contributor I
Quoting - raf_schietekat

Dmitriy: "For example, in this case, I think, it would be better to apply following reduction." Interesting... are you quite sure that manipulating data in 3 different places (merge of 2 merges) is faster than doing 4-way merges? My intuition would be the opposite, especially for a majority of low counts.

I think it's a good idea!

So we split the map into fine-grained ranges: A..B, C..D, ....

Then every thread picks up a range and aggregates the data in that range from all threads. Right?

I think it will be better than my initial proposal.

Dmitry_Vyukov
Valued Contributor I
Quoting - Dmitriy V'jukov

I think it's a good idea!

So we split the map into fine-grained ranges: A..B, C..D, ....

Then every thread picks up a range and aggregates the data in that range from all threads. Right?

I think it will be better than my initial proposal.

Hmmm... If we are talking only about finding the most frequent word, then I think a thread doesn't even need to build an aggregation; the thread can calculate the most frequent word on the fly, i.e. it remembers only the most frequent word and its count.

Hmmm... Well... I think this solution will completely blow the others out of the water...

Dmitry_Vyukov
Valued Contributor I
Generating histograms in parallel is a perplexing problem. I had a summer intern look at it two summers ago, specifically at the problem of writing an efficient "map-reduce" algorithm. There are two extreme cases to consider:
  1. There are only a few keys, i.e., the keys are replicated many times. (E.g., a book like "Green Eggs and Ham", written by Seuss in response to a bet that he couldn't write a book with only 50 words.) In this case, a parallel reduction is more efficient. Most of the time can be spent doing independent counting, with only a little work for merging.
  2. There are many keys, and each key appears only a few times. In this case, a central hash table will be more efficient, because there is practically no independent work. Even if each thread sorts its portion of the histogram before merging, the merging work will dominate. Maybe parallel merging will pay off compared to a centralized hash table.

The point is that the best algorithm is sensitive to the data distribution.

What do you think about the following algorithm, which was proposed by raf_schietekat?

Note that we are calculating only the most frequent word, not the whole histogram.

A thread gets a part of the initial text and calculates a sorted aggregation, i.e. a map. This is a purely thread-local operation.

When the whole initial text has been processed, we split all the thread-local aggregations into small regions, for example A..B, C..D, E..F, ... (by the initial letters of the words).

Then every thread grabs a region and calculates the most frequent word in that region on the fly. The aggregations are already sorted, so this must be easy. No additional data structures are created in this step.

When all regions are processed, the most frequent word among all regions is chosen. This can be done in a single thread, because the number of regions is, for example, NUMBER_OF_PROCESSORS*16.
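
A hedged sketch of that reduce phase, assuming each thread has already built a sorted std::map over its slice of the text; the 'boundaries' split points and all other names are made up for illustration. Each region is read by exactly one thread, so the phase needs no locking, and the regional maximum is computed on the fly by a k-way merge of the sorted maps:

#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>
#include "tbb/parallel_for.h"

typedef std::map<std::string, int> LocalMap;
typedef std::pair<std::string, int> WordCount;

WordCount most_frequent(const std::vector<LocalMap>& locals,
                        const std::vector<std::string>& boundaries) {
    const std::size_t nregions = boundaries.size() + 1; // e.g. processors * 16
    std::vector<WordCount> best(nregions, WordCount("", 0));

    tbb::parallel_for(std::size_t(0), nregions, [&](std::size_t r) {
        const std::size_t n = locals.size();
        std::vector<LocalMap::const_iterator> cur(n), stop(n);
        for (std::size_t t = 0; t < n; ++t) {           // region r of each map
            cur[t]  = r == 0 ? locals[t].begin()
                             : locals[t].lower_bound(boundaries[r - 1]);
            stop[t] = r == nregions - 1 ? locals[t].end()
                                        : locals[t].lower_bound(boundaries[r]);
        }
        for (;;) {                                      // k-way merge on the fly
            const std::string* min_key = nullptr;
            for (std::size_t t = 0; t < n; ++t)
                if (cur[t] != stop[t] && (!min_key || cur[t]->first < *min_key))
                    min_key = &cur[t]->first;
            if (!min_key) break;                        // region exhausted
            const std::string key = *min_key;
            int total = 0;
            for (std::size_t t = 0; t < n; ++t)
                if (cur[t] != stop[t] && cur[t]->first == key) {
                    total += cur[t]->second;            // sum counts for 'key'
                    ++cur[t];
                }
            if (total > best[r].second)
                best[r] = WordCount(key, total);        // regional maximum only
        }
    });

    WordCount winner("", 0);                 // final single-threaded step
    for (std::size_t r = 0; r < nregions; ++r)
        if (best[r].second > winner.second)
            winner = best[r];
    return winner;
}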

Anton_Pegushin
New Contributor II
Quoting - Dmitriy V'jukov

Note that we are calculating only the most frequent word, not the whole histogram.

Dmitriy, I'm not sure the proposed algorithm will be any different in the case of "most frequent word" or in the case of "whole histogram". As Arch noted earlier, the implementation of the "map" part of this map-reduce algorithm will depend on the data distribution. Let's assume we've chosen the "local maps" approach. Then we aggregate all those partial results into one big map, divide the map into parts as proposed (A-F, F-L, ... etc.), and let each thread take one part and process it in parallel with the others. We can ask for a histogram as a result or we can ask for the most frequent word; it would not matter, the algorithm will work exactly the same.

As for the techniques proposed to better parallelize the aggregation + post-processing part: of course optimizations are possible, but the significant difference in performance will be caused by the data distribution, and either by the concurrent access to one global map or by the merging of the local maps. Right?

RafSchietekat
Valued Contributor III

Not having to compute a global histogram means that the merging phase is even more efficient, over and above the advantage of running in parallel (which Arch has also referred to). Data has to be moved around, but there is no significant locking activity, so everything seems fairly scalable. I think we have a winner...

ARCH_R_Intel
Employee
It seems a very reasonable approach. The step between the collection of the local maps and the analysis of the pieces is roughly equivalent to a matrix transpose in memory. It reminds me of columnsort and a parallel quicksort that I've seen somewhere.
