Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Concurrent Containers vs STL Containers + Mutex/Lock

timminn
Beginner
1,535 Views

I am wondering: if efficiency is the ONLY factor of concern,

could we replace the official Intel TBB concurrent containers with STL containers used in combination with a TBB mutex?

If not, please quantify the efficiency difference with some timing test data.

If so, how does one choose the best kind of mutex for a particular circumstance, e.g. to perform push/pop/sort/swap/... on a queue/vector?

0 Kudos
39 Replies
Dmitry_Vyukov
Valued Contributor I
404 Views
Quoting - Dmitriy V'jukov

Note that we are calculating only the most frequent word, not the whole histogram.

Dmitriy, I'm not sure the proposed algorithm will be any different in the case of "most frequent word" or in the case of "whole histogram". As Arch noted earlier, the implementation of the "map" part of this map-reduce algorithm will depend on the data distribution. Let's assume we've chosen the "local maps" approach. Then we aggregate all those partial results into one big map...

NO! In the case of the most frequent word, there is no need for further aggregation of local maps.

So this method basically has no synchronization/shared state (except coarse-grained work distribution); it must be extremely scalable.

Assume we are running on an 864-core Azul system: what will the contention be on a shared hash map?

0 Kudos
jimdempseyatthecove
Honored Contributor III
404 Views

Dmitriy, I'm not sure the proposed algorithm will be any different in the case of "most frequent word" or in the case of "whole histogram". As Arch noted earlier, the implementation of the "map" part of this map-reduce algorithm will depend on the data distribution. Let's assume we've chosen the "local maps" approach. Then we aggregate all those partial results into one big map and divide the map into parts as proposed (A-F, F-L, ... etc.), and we let each thread take one part and process it in parallel with the others. We can ask for a histogram as a result or we can ask for the most frequent word - it would not matter, the algorithm will work exactly the same.

As for the techniques proposed to better parallelize the aggregation+post-processing part, of course optimizations are possible, but the significant difference in performance will be caused by the data distribution and either by the concurrent access to one global map or by merging the local maps. Right?

Due to the likelihood of different word sets per thread, one cannot simply pick the most frequent word in each set and then compare the count values. The sum of the tally counts of a lesser-valued word across the lists may exceed the count of the most-valued word of any one list. What you could do is a clipped merge, whereby you obtain the maximum word count of each thread, multiply by the number of threads (collections), then merge into a hash+count container only those words from each list whose counts equal or exceed the clipping point.

A faster route to find the most frequent word (assume words use only A-Z in this example) is for each thread to take 1/N of the "book" (adjusting its character zone to the next word break). Each thread then creates a tally count list for each of the possible 1st letters (26 counters in this case), initializes the counts to 0, and tallies the counts of words in its zone beginning with each letter. When all N tally lists are complete, the lists are merged to find the first letter of the most frequent word (each thread can duplicate the merge, as opposed to stopping/starting).

Then begin a second parallel scan of the list, with each thread allocating a locator list of size_t or char* elements whose extent is the count value of its respective tally for the letter of the most frequent word. Then re-zero the A:Z tally counters (or recurse and create a new tally counter list) and rescan the thread's respective portion of the book, finding the locations of words beginning with the most frequent 1st letter found on the first pass. When a word is found, place its index or pointer into the locator list while simultaneously tallying the count of the 2nd letter in the word (assuming there is a 2nd letter). Once the tallies of the 2nd letter are complete, merge the tallies to find the most frequent 2nd letter (of the words with the most frequent 1st letter).

Then begin a 3rd pass, which clears the A:Z tally counters (or recurses and creates a new tally counter list) and uses the locator list as a thread-local reduction list (in and out), whereby it filters and compresses the list to contain only the entries matching the most frequent 2nd letter while simultaneously tallying the counts for the 3rd letter (assuming there is a 3rd letter). (Note: the locator list needs to be allocated only once, as it only shrinks.)

This process continues until there are no words with remaining letters in each list. Some particulars remain to be worked out, such as whether this is a recursive or an iterative process. If you recurse, then you could propagate the tally count upwards and produce the frequency count in addition to the word with the highest frequency.

In this manner only one container is used (the one holding the string for the book), no additional containers are created for hash keys, and no locks are required (each thread can independently sum the tallies of the other threads), but you will need a barrier between each pass.

The whole book is scanned twice, the 2nd letter of the reduced word list is examined twice more, the 3rd letter of the reduced-reduced word list twice more, etc.

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
404 Views

Oops "multiply by number of threads" -> "divide by the number of threads" (to compute clipping point)

Jim Dempsey

0 Kudos
RafSchietekat
Valued Contributor III
404 Views

A loose end in the building of partial histograms is load balancing. parallel_reduce will reuse the same body for adjacent regions in order, but what if we would prefer it to reuse that body also for "stolen regions", to limit the number of histograms to be merged later to the number of processors?

For load balancing the merge phase it seems best to encapsulate an array of hash maps, with the array expressing part of the full hash key and sufficiently large to allow good load balancing with another parallel_reduce. Even if the desired end result is a full histogram in alphabetical order, wouldn't it always be best to merge from hash values, and to order alphabetically only as an additional stage?

0 Kudos
Dmitry_Vyukov
Valued Contributor I
404 Views
Quoting - Raf Schietekat

A loose end in the building of partial histograms is load balancing. parallel_reduce will reuse the same body for adjacent regions in order, but what if we would prefer it to reuse that body also for "stolen regions", to limit the number of histograms to be merged later to the number of processors?

Efficient parallel algorithms expressed only with off-the-shelf components - that is a different problem. I don't think it's solvable in the general case at all :)

The general solution is simple - just store a pointer to the histogram in TLS/TSS.

Quoting - Raf Schietekat

For load balancing the merge phase it seems best to encapsulate an array of hash maps, with the array expressing part of the full hash key and sufficiently large to allow good load balancing with another parallel_reduce. Even if the desired end result is a full histogram in alphabetical order, wouldn't it always be best to merge from hash values, and to order alphabetically only as an additional stage?

What do you mean by "with the array expressing part of the full hash key"? Please, elaborate a bit more here.

0 Kudos
RafSchietekat
Valued Contributor III
404 Views

"General solution is simple - just store pointer to histogram in TLS/TSS." Hmm, what's the cost of TLS on various architectures again?

"What do you mean by "with the array expressing part of the full hash key"? Please, elaborate a bit more here." I was thinking of concurrent_hash_map's implementation as a fixed array of dynamic arrays of chains, with the lowest bits of the hash value used to index the outer array, and some of the rest to index a dynamic array. In this case the array would hold single-threaded hash maps, which I suppose is faster than any alternative during the write phase.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
404 Views
Quoting - Raf Schietekat

"General solution is simple - just store pointer to histogram in TLS/TSS." Hmm, what's the cost of TLS on various architectures again?

Usually it's a few loads.

But if it's 1 TLS access per big block of input data, I think it doesn't matter.

0 Kudos
jimdempseyatthecove
Honored Contributor III
404 Views


After thinking about this, I will have to admit that I made an error in the second proposed solution, as it would produce the frequency count of the aggregate of the letter sequences in the words and not the frequency count of the words themselves. To correct for this, an alternate approach to contemplate is a sparse tree of nodes. A derivation of the following might be best.

const int MAX_CHARS = 27; // number of different characters in alphabet plus 1 for null
const int MAX_CHARS_SQUARED = MAX_CHARS * MAX_CHARS;
struct node
{
    size_t counts[MAX_CHARS_SQUARED];
    struct node* links[MAX_CHARS_SQUARED];
    node();
};
// The ctor for node produces a wiped (zeroed) node.
node::node() { memset(this, 0, sizeof(*this)); }

...
ReadBook(); // Book is a null-terminated string
node* ResultsTallys = NULL;
#pragma omp parallel
{
    node* localResultsTallys = new node; // ctor also wipes
    size_t iFrom, iTo;
    // get this thread's unique zone of the book
    if(GetFromTo(iFrom, iTo))
    {
        // scan for the next word, advancing iFrom; true if found
        while(NextWord(iFrom, iTo))
            Tally(localResultsTallys, iFrom); // tallies and advances iFrom
        // if we are the first thread done, try to substitute our list for the null results list
        if(!ResultsTallys && CAS(&ResultsTallys, NULL, localResultsTallys))
        {
            // our localResultsTallys was set into ResultsTallys
            // nothing more to do
        } else {
            Merge(*ResultsTallys, *localResultsTallys);
            delete localResultsTallys;
        }
    }
}
...

void Tally(node* aNode, size_t& iFrom)
{
    // create an index from one or two of the remaining letters in the word,
    // advancing iFrom in the process
    size_t i = Index(iFrom);
    // see if the next letter is a word break
    // N.B. requires the end of the book to contain a word break character
    if(isWordBreak(iFrom))
        ++aNode->counts[i];
    else
    {
        // if the next node is not allocated, allocate it
        if(!aNode->links[i])
            aNode->links[i] = new node;
        // recurse to complete the tally
        Tally(aNode->links[i], iFrom);
    }
} // end of Tally
...
size_t Index(size_t& iFrom)
{
    // compute index from 1 or 2 characters of the word
    // (assuming there is a word)
    size_t i = tolower(Book[iFrom]) - 'a' + 1;
    if(i <= 0 || i >= MAX_CHARS) return 0;
    size_t j = tolower(Book[++iFrom]) - 'a' + 1;
    if(j <= 0 || j >= MAX_CHARS) return i;
    return i*MAX_CHARS + j;
}
inline bool isWordBreak(size_t iFrom)
{
    return !isalpha(Book[iFrom]); // replace with appropriate test
}
bool GetFromTo(size_t& iFrom, size_t& iTo)
{
    // initialize indexes
    // return true if a zone is available, false if not
    // returns iFrom = index of first word in zone
    //         iTo   = index of the word break following the last word of the zone
    ...
}

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
404 Views

And the merge

void Merge(node& results, node& localResults)
{
    for(size_t i = 0; i < MAX_CHARS_SQUARED; ++i)
        if(localResults.counts[i])
            InterlockedAdd(&results.counts[i], localResults.counts[i]);

    for(size_t i = 0; i < MAX_CHARS_SQUARED; ++i)
    {
        if(localResults.links[i])
        {
            if(!results.links[i] && CAS(&results.links[i], NULL, localResults.links[i]))
            {
                // we were the first thread to do this,
                // therefore our node is now the results node
                // nothing more to do
            } else {
                // recurse
                Merge(*results.links[i], *localResults.links[i]);
                delete localResults.links[i];
            }
        }
    }
}

0 Kudos
RafSchietekat
Valued Contributor III
404 Views

"Usually it's few loads." I see how TLS can be made fairly efficient, but how about "unusually"...

"But if it will be 1 TLS access per big block of input data, I think, it doesn't matter." Well, the thing is that this is all about using a sufficiently small grain size to allow good load balancing, so the words "big block" are not very reassuring.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
404 Views
Quoting - Raf Schietekat

"Usually it's few loads." I see how TLS can be made fairly efficient, but how about "unusually"...

I just meant that the data must be "private to the thread". I used the term TLS/TSS incorrectly.

It's always possible to pass all necessary data as parameters to functions.

Quoting - Raf Schietekat

"But if it will be 1 TLS access per big block of input data, I think, it doesn't matter." Well, the thing is that this is all about using a sufficiently small grain size to allow good load balancing, so the words "big block" are not very reassuring.

What you mean by "sufficiently small grain size"? Please, refine.

With the current TBB scheduler implementation one has to make the grain size AT LEAST 10,000 cycles. I think that 1 TLS access on a USUAL platform (i.e. around a few cycles) will not change anything.

I am not sure what you are talking about...

0 Kudos
RafSchietekat
Valued Contributor III
404 Views

"It's always possible to pass all necessary data as parameters to functions." How would that help a body find a suitable map?

"With current TBB scheduler implementation one have to make grain size AT LEAST 10,000 cycles. I think that 1 TLS access on USUAL platform (i.e. around few cycles) will not change anything." Ah, that's what you mean with "big block", nothing to worry about then. But there's that "usual" again...

Actually, only the splitting constructor would need to access TLS/TSS, so performance is not a significant issue even if TLS/TSS is costly. On the other hand, exploring a loose end such as this might lead to the potentially useful conclusion that it would be nice to have a flag to parallel_reduce saying that the operation is commutative and that this should be used to avoid the creation of a new body, which would make the use of TLS/TSS unnecessary, resulting in less time wasted to write and maintain the code. Or not, as the case may be. Still, best to have a quick glance at the idea before tossing it.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
404 Views
Quoting - Raf Schietekat

"It's always possible to pass all necessary data as parameters to functions." How would that help a body find a suitable map?

A suitable map will be passed to the body as a parameter.

Quoting - Raf Schietekat

"With current TBB scheduler implementation one have to make grain size AT LEAST 10,000 cycles. I think that 1 TLS access on USUAL platform (i.e. around few cycles) will not change anything." Ah, that's what you mean with "big block", nothing to worry about then. But there's that "usual" again...

Granularity must be chosen according to the number of available processors. There is really no need to have 10^6 tasks on a quad-core platform. So every task can take seconds or minutes. That's the case on Google clusters. I mention 10,000 cycles only as a lower bound.

Quoting - Raf Schietekat

Actually, only the splitting constructor would need to access TLS/TSS, so performance is not a significant issue even if TLS/TSS is costly.

Indeed. This is what I am talking about.

Hey! Stop! And what if access to TLS takes 1 year? :)

Quoting - Raf Schietekat

Actually, only the splitting constructor would need to access TLS/TSS, so performance is not a significant issue even if TLS/TSS is costly. On the other hand, exploring a loose end such as this might lead to the potentially useful conclusion that it would be nice to have a flag to parallel_reduce saying that the operation is commutative and that this should be used to avoid the creation of a new body, which would make the use of TLS/TSS unnecessary, resulting in less time wasted to write and maintain the code. Or not, as the case may be. Still, best to have a quick glance at the idea before tossing it.

This looks reasonable. I think in some situations this can reduce the number of "bodies" several times.

0 Kudos
RafSchietekat
Valued Contributor III
404 Views

"Suitable map will be passed to body as parameter." Well, no, because you don't know what its new thread is going to be, do you?

"Granularity must be chosen accordingly to number of available processors. There is really no need to have 10^6 tasks on quad-core platform. So every task can take seconds and minutes. It's the case on Google clusters. I mention 10,000 cycles only as lower boundary." In this case granularity affects the time wasted between the phases, and maybe it's an interactive situation.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
404 Views
Quoting - Raf Schietekat

"Suitable map will be passed to body as parameter." Well, no, because you don't know what its new thread is going to be, do you?

The new thread will pass its own hash map to the body.

Quoting - Raf Schietekat

"Granularity must be chosen accordingly to number of available processors. There is really no need to have 10^6 tasks on quad-core platform. So every task can take seconds and minutes. It's the case on Google clusters. I mention 10,000 cycles only as lower boundary." In this case granularity affects the time wasted between the phases, and maybe it's an interactive situation.

I have to agree. Absolute size of a task matters. You are right.

0 Kudos
RafSchietekat
Valued Contributor III
404 Views

"New thread will pass it's own hash map to body." How?

0 Kudos
Dmitry_Vyukov
Valued Contributor I
404 Views
Quoting - Raf Schietekat

"New thread will pass it's own hash map to body." How?

Something like:

void thread_func() {
    hash_map m;
    while (task* t = get_next_task_to_execute()) {
        t->execute(&m);
    }
}

0 Kudos
RafSchietekat
Valued Contributor III
404 Views

But we were talking about parallel_reduce here...

0 Kudos
Dmitry_Vyukov
Valued Contributor I
404 Views
Quoting - Raf Schietekat

But we were talking about parallel_reduce here...

I am talking about the general solution, starting from post #25:

Efficient parallel algorithms expressed only with off-the-shelf components - that is a different problem. I don't think it's solvable in the general case at all :)

The general solution is simple - just store a pointer to the histogram in TLS/TSS.

0 Kudos