Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Abstracted parallel sort, when to use task_scheduler_init

joeansys
Beginner
Hi.
I'm new to this forum and to TBB. I have a DLL that gets loaded by my main application, and a customized container class that is used throughout this DLL. It currently uses std::sort to sort data in the container. Since there are many sorts in this DLL, some of which can sort over 10,000 entries, I was wondering how efficient it would be to blindly change the sort to parallel_sort, given that parallel_sort calls std::sort for < 500 entries.
My next question involves using task_scheduler_init. If I only use task_scheduler_init once, either in my main or when my DLL gets loaded, my program crashes inside parallel_sort. If I add an extra task_scheduler_init right before the sort, my program runs very, very slowly. I saw another post that said that having one task_scheduler_init in main and many elsewhere would not be problematic because of reference counting.

Does anyone have any ideas? P.S. I hope to massively parallelize my code with parallel_for and parallel_... as well, so any tips in advance w.r.t. task_scheduler_init would be greatly appreciated.

Thanks
Joe
Alexey-Kukanov
Employee
With parallel_sort, you don't lose much for <500 elements, and you could potentially gain performance for 10000 entries. Why not give it a try? :)

Re the TBB initialization: for the moment, you have to create a task_scheduler_init object in every thread that uses TBB tasks or algorithms. When TBB parallel algorithms are used inside DLLs, our recommendation is to have one "global" task_scheduler_init object (either in the main module, or in the DLL, covering most of its lifetime) and then create additional local objects in the functions where TBB is used. The global object creates the TBB worker threads and keeps them alive at least for the lifetime of the DLL, and the local objects are necessary to ensure that every thread that calls the DLL has initialized the TBB scheduler's supporting structures. These extra initializations should not cost much, but your post suggests that in your experience they might be very slow. Unless the global task_scheduler_init object is somehow absent, this problem is very strange. Can you share the relevant part of the code, or maybe a reproducing test?
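To illustrate why the long-lived object matters, here is a minimal sketch of the reference-counting idea. It is a toy model only, not TBB code: PoolModel, SchedulerGuard and count_pool_startups are hypothetical names invented for this example, the model is single-threaded, and it assumes that the worker pool is started by the first live guard and torn down when the last one goes away.

```cpp
#include <cassert>
#include <memory>

// Toy stand-in for the shared scheduler state; NOT the TBB implementation.
// It only counts how often the (expensive) worker pool would be started/stopped.
struct PoolModel {
    int users = 0;      // guard objects currently alive
    int startups = 0;   // times the pool had to be created
    int shutdowns = 0;  // times the pool was destroyed
};

// Hypothetical RAII guard playing the role of a task_scheduler_init object.
class SchedulerGuard {
public:
    explicit SchedulerGuard(PoolModel& pool) : pool_(pool) {
        assert(pool_.users >= 0);
        if (pool_.users++ == 0) ++pool_.startups;   // first user starts the pool
    }
    ~SchedulerGuard() {
        if (--pool_.users == 0) ++pool_.shutdowns;  // last user stops the pool
    }
    SchedulerGuard(const SchedulerGuard&) = delete;
    SchedulerGuard& operator=(const SchedulerGuard&) = delete;

private:
    PoolModel& pool_;
};

// Simulates `calls` entries into a DLL function that creates a local guard.
// With keep_global_alive set, one long-lived guard brackets all of the calls.
int count_pool_startups(int calls, bool keep_global_alive) {
    PoolModel pool;
    std::unique_ptr<SchedulerGuard> global;
    if (keep_global_alive) global.reset(new SchedulerGuard(pool));

    for (int i = 0; i < calls; ++i) {
        SchedulerGuard local(pool);  // per-call guard, as in each DLL function
        // ... the parallel algorithm would run here ...
    }

    global.reset();  // "DLL unload": release the long-lived guard last
    return pool.startups;
}
```

In this model, count_pool_startups(100, true) reports a single pool startup, while count_pool_startups(100, false) reports one startup per call. If the real scheduler behaves similarly, a missing or short-lived global object would be one possible explanation for the slowdown described in the question.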
RafSchietekat
Valued Contributor III
"I was wondering how efficient it would be to change the sort blindly to parallel_sort as parallel_sort calls std::sort for < 500 entries"
If you were supposed to be able to do it blindly, they wouldn't call it software engineering, I suppose, but in this case it seems like a fairly safe bet to me. If you have any doubts, how about posting your own benchmark results about what the tipping point should be? Maybe it differs for different kinds of objects, and then we would all have learned something...

Regarding task_scheduler_init: do you have an instance that gets allocated at dll load time and deallocated at dll unload time (good), or only a temporary object at dll load time (bad), or the equivalent in main()? Despite its name, this is not a function that is called once (or once per thread): you should have the various instances work together to keep the common scheduler alive for sufficiently long periods (although it typically falls on one instance to make sure of that), with at least one instance per participating thread. Sorry for asking whether you plugged in the device, but your findings are rather suspicious. :-)
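For the benchmark idea, a small timing harness along the following lines might be a starting point. It is a sketch using only the standard library: time_sort_us and serial_sort are hypothetical helper names, the data is pseudo-random ints (an assumption; real container elements may behave differently), and a single run per size is far too noisy for real conclusions.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Times one call of `sorter` on a fresh pseudo-random vector of n ints and
// returns the elapsed wall-clock time in microseconds.
// `sorter` is any callable that takes std::vector<int>& and sorts it in place.
template <typename Sorter>
double time_sort_us(std::size_t n, Sorter sorter, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> dist(0, 1000000);
    std::vector<int> data(n);
    for (int& x : data) x = dist(gen);

    const auto start = std::chrono::steady_clock::now();
    sorter(data);
    const auto stop = std::chrono::steady_clock::now();

    // Sanity check: the candidate must actually have sorted the data.
    assert(std::is_sorted(data.begin(), data.end()));
    return std::chrono::duration<double, std::micro>(stop - start).count();
}

// Serial baseline; a parallel candidate would be another callable of this shape.
inline void serial_sort(std::vector<int>& v) { std::sort(v.begin(), v.end()); }
```

One would call time_sort_us for a range of sizes (say 100, 500, 1,000, 10,000 and up), once with serial_sort and once with a callable that wraps tbb::parallel_sort(v.begin(), v.end()), repeat each measurement several times, and look for the size at which the parallel version starts to win.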
joeansys
Beginner
Thanks all. It seems that the overhead of creating the threads is more than the data that I'm sorting. I will try to unit test this code at some point to verify this. Thanks again for the help

Joe
Alexey-Kukanov
Employee
If you have a global task_scheduler_init object that stays alive for as long as your DLL is loaded, as Raf and I suggested above, then the threads will only be created once. The overhead of creating additional task_scheduler_init objects is rather low. Still, you might simply have too little data to justify the total overhead induced by parallelization.
RafSchietekat
Valued Contributor III
"Thanks all. It seems that the overhead of creating the threads is more than the data that I'm sorting. I will try to unit test this code at some point to verify this. Thanks again for the help"
You have not provided us with any information to justify this conclusion. Do you perhaps do only a few sort operations, or even just one, per process invocation, mainly with small data sets? Then it would probably help to use TBB only conditionally, i.e., externalise the size test that parallel_sort now does internally, and only create a task_scheduler_init when the outcome of the test warrants it, as long as it is still kept around until DLL unload time if there is a chance it might be needed again (and you have not confirmed yet that you have such a long-lived task_scheduler_init object). But otherwise I don't see any reason not to use TBB, because sometimes you do have those larger data sets to sort.
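The conditional approach could be sketched roughly as follows. This is an illustration under stated assumptions, not tested against TBB: SchedulerHolder, scheduler(), kParallelThreshold and sort_maybe_parallel are hypothetical names, the holder is a stand-in for an object that would own a task_scheduler_init, the cutoff of 500 is simply the figure quoted in this thread, and std::sort is a placeholder where the parallel sort call would go.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the long-lived scheduler object. In real code it
// would own a task_scheduler_init; here it only records that it was created.
struct SchedulerHolder {
    static int& creations() { static int n = 0; return n; }
    SchedulerHolder() { ++creations(); }
};

// Created on first use, then kept until static destruction
// (for a DLL, roughly until unload).
inline SchedulerHolder& scheduler() {
    static SchedulerHolder holder;
    return holder;
}

// Assumed cutoff, taken from the "< 500 entries" figure in this thread.
const std::size_t kParallelThreshold = 500;

// Sorts v in place and returns true when the parallel path was taken.
inline bool sort_maybe_parallel(std::vector<int>& v) {
    if (v.size() < kParallelThreshold) {
        std::sort(v.begin(), v.end());  // small input: never touch the scheduler
        return false;
    }
    (void)scheduler();                  // large input: make sure the scheduler exists
    std::sort(v.begin(), v.end());      // placeholder for the parallel sort call
    assert(std::is_sorted(v.begin(), v.end()));
    return true;
}
```

A process that only ever sorts small containers never pays for scheduler creation in this scheme, and once a large sort has triggered it, the same holder is reused by every later call. Each thread that enters the parallel path would still need its own local initialization, as discussed earlier in the thread.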