Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

parallel_find()

Walter_D_
Beginner
1,276 Views

TBB does not provide a parallel_find(), which would act similarly to std::find(): find the first element in a range meeting a given condition/predicate. I think those who know the TBB library well should be able to implement it as follows: a range is split into a first and a second half, with the first half searched at higher priority (put on the stack last). If the first child task succeeds (finds a match), the second task is cancelled. This is done recursively, of course.

Could you consider implementing this? I'm not sure how to implement the cancellation of the second child task if the first succeeds.

0 Kudos
27 Replies
jimdempseyatthecove
Honored Contributor III
218 Views

>>Yeah, I was thinking of "first" more like "give me any as fast as possible and if you have currently more than one hit give me the one most to the left".

Then, in my posted algorithm, you would abort the search when the found index != "not found" (0xFFFFFFFF or 0x7FFFFFFF), with an additional CAS should you observe a "found index" while you yourself have a find whose index is less than the current found index.

In this situation, the best case is one CAS (or zero CASes in the event of no match); the worst case is one CAS per thread in the pool.
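For concreteness, a minimal sketch of that CAS rule (record_hit and the sentinel are illustrative names, not part of any existing code):

#include <atomic>
#include <cstddef>

// Sketch of the "keep the leftmost hit" rule: publish index i only if it is smaller
// than the currently recorded index (which the caller initializes to a "not found" sentinel).
inline void record_hit(std::atomic<std::size_t>& found, std::size_t i)
{
    std::size_t cur = found.load(std::memory_order_relaxed);
    while (i < cur && !found.compare_exchange_weak(cur, i)) {
        // on failure, cur is reloaded by compare_exchange_weak; retry only while i still improves it
    }
}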

----------------------------

Variation on binary split

The ..._helper function:
1) splits the iteration space into a non-splittable size + remainder
2) searches the non-splittable size; if found, follows the CAS rules stated earlier and returns (done)
3) (must handle the remainder) if there is no remainder, returns
4) if the remainder is <= the non-splittable size, searches the remainder; if found, follows the CAS rules stated earlier and returns (done)
5) parallel_invoke(first half of remainder, second half of remainder)

What the above does is defer additional task enqueues for the higher portion of each branch. This permits each thread in the thread pool to drill down to the lowest partition size without enqueuing additional tasks.
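A minimal sketch of such a helper, assuming a shared std::atomic found-index initialized to a "not found" sentinel and a user-supplied predicate (all names, the grain size, and the wrapper are illustrative, not an existing TBB facility):

#include <algorithm>
#include <atomic>
#include <cstddef>
#include <tbb/parallel_invoke.h>

constexpr std::size_t kNotFound = ~std::size_t(0);   // sentinel, matching the 0xFFFF... idea above

template <typename It, typename Pred>
void find_helper(It base, std::size_t lo, std::size_t hi, Pred pred,
                 std::atomic<std::size_t>& found, std::size_t grain)
{
    if (found.load(std::memory_order_relaxed) <= lo) return;       // an earlier hit already exists
    std::size_t head_end = std::min(lo + grain, hi);
    for (std::size_t i = lo; i < head_end; ++i) {                  // 1) + 2) search the non-splittable head
        if (pred(base[i])) {
            std::size_t cur = found.load(std::memory_order_relaxed);
            while (i < cur && !found.compare_exchange_weak(cur, i)) {}   // CAS rule from earlier
            return;
        }
    }
    if (head_end == hi) return;                                    // 3) no remainder
    if (hi - head_end <= grain) {                                  // 4) small remainder: search it directly
        find_helper(base, head_end, hi, pred, found, grain);
        return;
    }
    std::size_t mid = head_end + (hi - head_end) / 2;              // 5) split the remainder in two
    tbb::parallel_invoke([&] { find_helper(base, head_end, mid, pred, found, grain); },
                         [&] { find_helper(base, mid, hi, pred, found, grain); });
}

template <typename It, typename Pred>
std::size_t parallel_find_index(It first, std::size_t n, Pred pred, std::size_t grain = 1024)
{
    std::atomic<std::size_t> found(kNotFound);
    find_helper(first, 0, n, pred, found, grain);
    return found.load();                                           // kNotFound if nothing matched
}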

Jim Dempsey

0 Kudos
Walter_D_
Beginner
218 Views

amazing what you all work out on the weekend ...

Okay. So, I now regret having quoted Anthony Williams, because, obviously, his approach to this particular problem is just stupid: it potentially tests many elements unnecessarily, namely elements to the right of the one we want. A clever algorithm must minimise unnecessary tests.

If there were no parallel overhead and computers were perfectly symmetric machines, then the best strategy with p synchronous parallel threads is simple: the i-th thread tests elements i + k*p, k = 0, 1, ... So the first p elements are tested synchronously (k = 0), then the next p (k = 1), etc. Thus, the number of elements tested unnecessarily is never larger than p-1, and is p/2 on average.

In practice, there is parallel overhead, and you presumably want to avoid reading the same cache line from different threads, i.e. each "element" in the above algorithm should be a super-element no smaller than a cache line. It also seems that this algorithm is better implemented directly at the thread level rather than using tasks. An important issue is that performance depends critically on the ability of the threads to act near-synchronously. If one thread is much slower than the others and the element to be found lies in its share of the elements, performance is reduced a lot. In such a case, one may consider using an atomic index to the next (super-)element to be searched.

Perhaps I'll have time later to implement a simple algorithm directly in C++11 threads along these lines.
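As a rough illustration only, the strided idea with C++11 threads might look something like this (untested; the block size, the names, and the CAS detail that keeps the leftmost hit are assumptions, and the atomic-next-block variant mentioned above is not shown):

#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Thread t scans super-elements (blocks) t, t+p, t+2p, ... left to right; the shared
// 'best' index records the leftmost hit found so far and lets later threads stop early.
template <typename It, typename Pred>
std::size_t strided_find(It first, std::size_t n, Pred pred,
                         unsigned p, std::size_t block = 64)      // block >= one cache line of elements
{
    std::atomic<std::size_t> best(n);                             // n means "not found"
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < p; ++t) {
        pool.emplace_back([&, t] {
            for (std::size_t b = std::size_t(t) * block; b < n; b += std::size_t(p) * block) {
                if (b >= best.load(std::memory_order_relaxed)) return;   // a hit at or before this block is known
                std::size_t end = std::min(b + block, n);
                for (std::size_t i = b; i < end; ++i) {
                    if (pred(first[i])) {
                        std::size_t cur = best.load(std::memory_order_relaxed);
                        while (i < cur && !best.compare_exchange_weak(cur, i)) {}
                        return;
                    }
                }
            }
        });
    }
    for (auto& th : pool) th.join();
    return best.load();                                           // leftmost matching index, or n
}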

0 Kudos
jimdempseyatthecove
Honored Contributor III
218 Views

>>Perhaps I have time later to implement a simple algorithm directly in C++11 threads with these ideas.

You can do much of what you want in TBB; this may only require the addition of a few features. I suggest looking at the problem and determining the expected number of nodes matching the search criteria. Divide the search-set size by (expected number of matches) * (thread pool size) and use that as the grain size for parallel_for. This might be a good starting point. See my next post relating to hypotheticals.
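A rough sketch of that starting point (the names, the early-exit check, and the CAS that keeps the leftmost index are illustrative assumptions):

#include <algorithm>
#include <atomic>
#include <cstddef>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// Grain size = set size / (expected matches * thread pool size), then an ordinary
// parallel_for with an early exit once a hit to the left of a chunk is already known.
template <typename It, typename Pred>
std::size_t coarse_find(It first, std::size_t n, Pred pred,
                        std::size_t expected_matches, std::size_t pool_size)
{
    std::size_t grain = std::max<std::size_t>(1, n / std::max<std::size_t>(1, expected_matches * pool_size));
    std::atomic<std::size_t> best(n);                              // n means "not found"
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n, grain),
        [&](const tbb::blocked_range<std::size_t>& r) {
            if (r.begin() >= best.load(std::memory_order_relaxed)) return;   // a hit to the left exists
            for (std::size_t i = r.begin(); i != r.end(); ++i) {
                if (pred(first[i])) {
                    std::size_t cur = best.load(std::memory_order_relaxed);
                    while (i < cur && !best.compare_exchange_weak(cur, i)) {}
                    return;
                }
            }
        });
    return best.load();
}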

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
218 Views

This is in response to a private message with Raf. We thought it pertinent to add it to this forum thread. Raf's questions begin with >>.

>> I'll leave you the opportunity to correct yourself about that comparison (or not).

Raf is referring to an error in my pseudo code where the if test for early termination was backwards. Thank you Raf.

>> I must say I'm "surprised" about your suggestion to use tasks as if they were threads. How could that possibly be beneficial for any purpose?

Threads perform tasks. Due to lack of knowledge of the number of threads immediately available, one tends to find it beneficial to partition the workspace into more tasks than threads.

Example:

Should partitioning == nThreads, with say 8 hw threads, this yields 8 tasks (assume equal work per task/partition). With all 8 hw threads available, the work gets done in 1/8 of the time (plus 8x thread-scheduling time). However, should one of the hw threads not be available, the work gets done in 1/8 + 1/8 of the time (2x the expected time), because the eighth task must wait for one of the first seven to finish.

Alternatively, should the partitioning be finer, say 2 partitions per thread (16 tasks): when all 8 threads are available the time is 1/16 + 1/16 (two rounds of 8 tasks) plus 16x thread-scheduling time, i.e. the same time as with 8 tasks plus 8x more thread-scheduling time.

Now look at 7 hw threads with 16 tasks: 1/16 + 1/16 + 1/16 (7 tasks, then 7, then the remaining 2) plus 16x thread-scheduling time.

8 threads, 8 partitions = 1/8 run time (0.125 rt) + 8x thread scheduling
7 threads, 8 partitions = 1/4 run time (0.25 rt) + 8x thread scheduling (~2x ideal)
8 threads, 16 partitions = 1/8 run time (0.125 rt) + 16x thread scheduling (~ideal)
7 threads, 16 partitions = 3/16 run time (0.1875 rt) + 16x thread scheduling (~1.5x ideal)
If on average 1 thread is not available, then the larger number of tasks yields ~33% faster code.

The smaller partitioning (more tasks than the potential number of threads) yields faster code when not all threads are available, but at the expense of higher thread-scheduling overhead. At some point you reach a diminishing return on creating additional tasks.

What is the optimal number of partitions (tasks)?

This will vary with the dynamics of the system and the ratio of thread-scheduling overhead versus iteration through a partition. Let's consider the circumstance where whatever number of threads is immediately available at the start of the parallel_foobar remains available for the duration of the run, and no additional threads become available during that time. Also, let's assume zero overhead for task enqueue/dequeue. The optimal number of partitions (tasks) is then a number that is evenly divisible by each of the potential numbers of threads.

8 hw threads, nThreads {1,2,3,4,5,6,7,8}, nTasks 8*7*6*5 = 1680 (note, 4 fits into 8, 3 fits into 6, 2 fits into 8)
16 hw threads, nThreads {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16}, nTasks = 16*15*14*13*12*11*10*9 = 518,918,400

Task-scheduling overhead is not insignificant (neither computationally nor in memory requirements). Given the above requirements, partitioning into the optimal number of tasks, even under the assumption of zero overhead for task enqueue/dequeue, is untenable.

Some foreknowledge of iteration overhead versus task-scheduling overhead is required to keep the task-scheduling portion of compute time at an acceptable level.

What's acceptable is subjective.

Assume the match test is heavy enough that 1,000 iterations take about as long as the task-scheduling overhead; then 100,000 iterations per task yields 1% task overhead, and 10,000 iterations yields 10% overhead. On an 8 hw thread system, at the 1% level, the vector size would have to be > 1,680 x 100,000.

Now ask yourself: is the default partitioning of, say, parallel_for knowledgeable of the state of the application at the time of partitioning? The answer is no, so it uses a rule of thumb for partitioning. As to what that is, I cannot say. The programmer can tune the partitioning to minimize both functional run time and thread-scheduling overhead.

>> Any time I see a grainsize that depends on problem size, I see that as a red flag.

tbb::parallel_for uses a grain size whether you specify it or not. In many cases one finds it beneficial to specify a grain size, based on the programmer's knowledge of how many threads will be available at that point in the program (e.g. is this parallel_for run at the top level where all threads are available, or at some nested level where few threads are available), as well as the overhead of task scheduling versus iteration.
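For illustration, a sketch of specifying a grain size explicitly (the wrapper and the body are placeholders): with simple_partitioner the chunk size is approximately the grainsize the programmer chose, whereas auto_partitioner would treat the same grainsize only as a lower bound and keep splitting stolen ranges.

#include <cstddef>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <tbb/partitioner.h>

// A tuned grainsize encodes the programmer's knowledge of scheduling overhead
// versus per-iteration cost; 'body' is a placeholder for the per-element work.
template <typename Body>
void tuned_for(std::size_t n, std::size_t grainsize, Body body)
{
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, n, grainsize),
                      [&](const tbb::blocked_range<std::size_t>& r) {
                          for (std::size_t i = r.begin(); i != r.end(); ++i)
                              body(i);
                      },
                      tbb::simple_partitioner());   // chunks of ~grainsize; auto_partitioner would use it only as a lower bound
}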

>> But it seems more important that the algorithm shouldn't start searching toward the right before the left half has been exhausted,

With a random distribution of nodes to find, and with unique entries (one match in the vector), it does not matter which side you start processing first. The better strategy is to have each thread take an equi-partition and search left to right (i.e. nTasks == nThreads).

With an average of 2 possible matches, it may be beneficial for all threads to examine the left half first, then have all threads examine the right half next. In other words, with an average of n possible matches, divide the range into n primary partitions and process each of the n primary partitions in sequence using the preferred number of threads (task-set size); see the sketch below and the next question.
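A sketch of that staged strategy (the partition count, names, and leftmost-hit bookkeeping are illustrative assumptions):

#include <algorithm>
#include <atomic>
#include <cstddef>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// Divide the range into n_primary partitions; all threads search one primary partition
// (each chunk scanned left to right) before moving on, stopping at the first partition with a hit.
template <typename It, typename Pred>
std::size_t staged_find(It first, std::size_t n, Pred pred, std::size_t n_primary)
{
    std::size_t part = (n + n_primary - 1) / std::max<std::size_t>(1, n_primary);   // primary partition size
    for (std::size_t lo = 0; lo < n; lo += part) {
        std::size_t hi = std::min(lo + part, n);
        std::atomic<std::size_t> best(n);                          // n means "not found"
        tbb::parallel_for(tbb::blocked_range<std::size_t>(lo, hi),
            [&](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t i = r.begin(); i != r.end(); ++i) {
                    if (pred(first[i])) {
                        std::size_t cur = best.load(std::memory_order_relaxed);
                        while (i < cur && !best.compare_exchange_weak(cur, i)) {}
                        return;
                    }
                }
            });
        if (best.load() != n) return best.load();                  // leftmost hit in this primary partition
    }
    return n;                                                       // no match
}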

>> And maybe something to automatically avoid using too many CPUs in the presence of memory saturation (how would one go about that?).

This requires foreknowledge of the ratio of L1 cache hits (instruction and data) to memory (or L3) hits. When the search test is lightweight (int key == value, or (double key >= MinVal) && (double key <= MaxVal)), then depending on the memory system 2-4 threads per memory channel might be optimal. For heavier-weight tests, possibly more threads per memory channel, due to more of the code and match criteria fitting in L1. The default grain-size settings won't know this; the programmer may. Heuristics might help automate this, assuming the overhead of the heuristics is insignificant.

>> Things certainly depend on numerical details not provided, but I'm assuming that they are such that parallel execution could be beneficial, and I wouldn't go as far as wondering about element removal.

Agreed. Knowledge of the application requirements is paramount to writing efficient code. Selecting a generic parallel_find may be counterproductive.

Jim Dempsey

 

0 Kudos
RafSchietekat
Valued Contributor III
218 Views

It seems to have been the royal "we" who decided it was "pertinent" to bring this back online...

There is overhead to task scheduling, but I'll go with the analysis in the User Guide, section "Parallelizing Simple Loops/parallel_for/Controlling Chunking", that shows a graph indicating that execution time significantly decreases with increasing grainsize and then stays flat "because the parallel overhead becomes insignificant for a sufficiently large grainsize", and a "rule of thumb is that grainsize iterations of operator() should take at least 100,000 clock cycles to execute". I also don't want to run into the right end of the graph because of undersubscription either right away or at the end (because of insufficient parallel slack). The takeaway here is that chunks should not become too big. It is true that auto_partitioner starts with chunks about 1/16th of the fair share of the work, i.e., they do depend on the problem size, but that is not an absolute, because stolen work will be subdivided further (even smaller than 1/16th). A specified grainsize based on the total amount of work cannot do that, so it certainly should not approach a fair share and thereby make tasks behave like threads. (Note that it can still be useful to use a tuned but fixed grainsize together with auto_partitioner.)

The assumption here is that finding the first is not the same as finding any (despite what one Anthony Williams may or may not have written), i.e., there are more than one qualifying elements, so searching should indeed be asymmetrical. If you have an average of 2, finding 2 does not give you the knowledge that you have found all of them and that therefore the leftmost of those 2 is the first in the collection (there might be another one further toward the beginning), and you will still have to exhaustively search from the beginning to the presumptive first hit before you can stop and go home. That goes for any number n. And your argument also carries in itself the reason that most of the work will have been wasted, even before my argument that an exhaustive search is still required: if all n partitions have on average 1 hit, on average n-1 partition searches will have been wasted, and these are exactly the ones not starting from the beginning.

Avoiding memory bandwidth saturation can obviously be done by tuning, but it does seem to depend on the system in a way that TBB or we won't be able to abstract away just yet.

(Corrected 2013-05-02) I was aiming for "imperial we" when mixing it up with another word (that I'm not repeating now), but that isn't it, either, according to Wikipedia at least. Serves me right for trying to translate to English when "pluralis maiestatis" will do just fine... :-)

0 Kudos
jimdempseyatthecove
Honored Contributor III
218 Views

>>if all n partitions have on average 1 hit, on average n-1 partition searches will have been wasted

Raf, you need to re-read my lengthy post.

When you expect ~n matches to the criteria, and these nodes are randomly placed in the vector, you make a primary partitioning of the vector into n partitions. Each of these partitions will, on average, contain 1 matching node at a random position within the partition. You do NOT assign a task to each of these partitions. Rather, you sub-partition each primary partition into tasks, with # tasks == # available threads (an unknown in TBB, therefore you use the thread pool size). The first primary partition is searched by all threads; most of the time a hit is made in this first primary partition. Search time is ~(size() / averageNumberOfNodesMatchingCriteria / threadPoolSize / 2) iterations. Note, this further requires 1 CAS. In the next, smaller fraction of circumstances, the first primary partition is void of matches and 2 matching nodes are present in the second partition. This yields a search time of:

~(size() / averageNumberOfNodesMatchingCriteria / threadPoolSize) iterations for first primary partition
+~(size() / averageNumberOfNodesMatchingCriteria / threadPoolSize / 4) iterations for second primary partition
+ 1 CAS most of the time, 2 CAS a smaller fraction of time.

Jim Dempsey

0 Kudos
RafSchietekat
Valued Contributor III
218 Views

Ah, missed the redesign...

Well, in general you most likely won't know what to expect, and even if you do it would still be suboptimal to skip ahead by O(input size).

0 Kudos
Reply