Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Can I know which core operator() is mapped to at runtime in the case of parallel_for?

sunwei1688
Beginner
Hello,

Can I find out which core operator() is mapped to at runtime in the case of parallel_for, via a TBB method or otherwise? I am curious to see the TBB mapping result. At the moment, my code gets nearly a 2X gain with TBB on a dual-core processor, but I have little idea what really happens beyond the fact that TBB works. :)

It could also be useful for code optimization. Thanks.

Sunwei
10 Replies
TimP
Honored Contributor III
No. In the normal case, on a single-processor dual-core machine, there is no reason to pin threads to cores. Whenever a thread must restart, the OS will prefer to assign it to the idle core. As there is a unified last-level cache, there is no problem with data locality in the usual case. The threads will in fact swap cores occasionally; only for certain detailed performance measurements does it become desirable to restrict this.
Concerns about the optimum mapping of threads to cores become relevant on machines with more than 2 cores. Intel OpenMP supports an environment variable, KMP_AFFINITY, which gives some control over this without tying your application to a specific core topology. I hope, but don't have sufficient knowledge to say, that such a facility is available for TBB.
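For illustration only, here is roughly what pinning a thread to a core looks like at the OS level on Linux (this is the kind of call a facility like KMP_AFFINITY arranges for you; it is not a TBB or OpenMP API, and it is Linux-specific):

#include <stdio.h>
#include <sched.h>   // Linux-specific: cpu_set_t, CPU_ZERO, CPU_SET, sched_setaffinity

int main() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);   // allow this thread to run only on logical core 0
    // pid 0 means "the calling thread"
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to core 0\n");
    return 0;
}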

robert-reed
Valued Contributor II
Actually, you might check out Kevin Farnham's blog for a technique that won't give you the mapping per se, but could give you enough information to infer it. The technique involves recording each range along with its start and end times in operator(), then sorting on the times to see which ranges were executed concurrently.
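A minimal sketch of that logging idea (my own reconstruction, not Kevin's exact code): each operator() call appends its range plus start/end timestamps to a concurrent container, and the log is sorted afterwards to see which ranges overlapped in time.

#include <stdio.h>
#include <algorithm>
#include <vector>
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
#include "tbb/concurrent_vector.h"
#include "tbb/tick_count.h"

struct Record { int begin, end; double t0, t1; };

tbb::concurrent_vector<Record> g_log;   // safe to append from many threads
tbb::tick_count g_start;

bool byStartTime( const Record& a, const Record& b ) { return a.t0 < b.t0; }

struct Body {
    void operator()( const tbb::blocked_range<int>& r ) const {
        double t0 = (tbb::tick_count::now() - g_start).seconds();
        // ... the real work on [r.begin(), r.end()) would go here ...
        double t1 = (tbb::tick_count::now() - g_start).seconds();
        Record rec = { r.begin(), r.end(), t0, t1 };
        g_log.push_back(rec);
    }
};

int main() {
    g_start = tbb::tick_count::now();
    tbb::parallel_for( tbb::blocked_range<int>(0,100,10), Body() );
    // Sort by start time; entries whose [t0,t1] intervals overlap ran concurrently.
    std::vector<Record> v( g_log.begin(), g_log.end() );
    std::sort( v.begin(), v.end(), byStartTime );
    for (size_t i = 0; i < v.size(); ++i)
        printf("[%d,%d) %.6f..%.6f\n", v[i].begin, v[i].end, v[i].t0, v[i].t1);
    return 0;
}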
ARCH_R_Intel
Employee

One crude trick I've used for finding the mapping from tasks to threads (not cores) is taking the address of a local variable. Since each thread has its own stack, this works for simple programs. For example, the following code prints the address of a local variable x and the range being processed.

#include <stdio.h>
#include <unistd.h>
#include "tbb/task_scheduler_init.h"
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
#include "tbb/task.h"

using namespace tbb;

struct Body {
    void operator()( const blocked_range<int>& r ) const {
        int x;   // each thread has its own stack, so &x identifies the thread
        printf("&x=%p [%d,%d)\n", (void*)&x, r.begin(), r.end());
        usleep(1000);   // linger briefly so concurrent chunks overlap in time
    }
};

int main() {
    task_scheduler_init init;
    parallel_for( blocked_range<int>(0,100,10), Body() );
    return 0;
}

I consider this only a demo for learning purposes; it won't work for programs with more complex nesting or recursion. The technique in Kevin Farnham's blog presents more useful information, specifically the execution begin/end times.

TBB pushes the paradigm that the programmer should concern themselves with breaking a program up into tasks and let the TBB scheduler do the mapping of those tasks onto threads/cores. For serious study of a large program's concurrent behavior, a tool like Intel's Thread Profiler is the way to go.

robert-reed
Valued Contributor II
Yes, and my latest blog entry shows just such an example of using Intel Thread Profiler to investigate the threading behavior of a TBB application.
robert-reed
Valued Contributor II
Oops, what I meant to say was that my latest blog entry shows such an example.
bjoernknafla
Beginner
A facility to identify the thread you are running on would be very helpful, for example when using thread-specific memory to collect results from the tasks scheduled to that thread.

Example: while running parallel game AI, you want to collect graphics commands to visualize the internal state of the simulated characters, to understand why they are doing what they are doing. You could allocate a buffer of such graphics commands for each thread in use.

On a NUMA system these buffers could be created from each thread during a startup phase, thereby associating the memory with the thread (and hopefully the OS will try to schedule each thread onto the same core most of the time).

Later on, in a sequential part of the application, these thread-specific buffers could be read and sent to the non-thread-safe OpenGL.
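In code, the pattern looks roughly like this; the per-thread container is tbb::enumerable_thread_specific (added in later TBB releases), and GraphicsCmd is a made-up placeholder:

#include <stdio.h>
#include <vector>
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
#include "tbb/enumerable_thread_specific.h"

struct GraphicsCmd { int characterId; };   // placeholder for a real draw command

// One std::vector<GraphicsCmd> per thread, created lazily on first access.
typedef tbb::enumerable_thread_specific< std::vector<GraphicsCmd> > CmdBuffers;
CmdBuffers g_buffers;

struct UpdateAI {
    void operator()( const tbb::blocked_range<int>& r ) const {
        std::vector<GraphicsCmd>& mine = g_buffers.local();  // this thread's buffer
        for (int i = r.begin(); i != r.end(); ++i) {
            GraphicsCmd cmd = { i };
            mine.push_back(cmd);   // no locking needed: the buffer is thread-private
        }
    }
};

int main() {
    tbb::parallel_for( tbb::blocked_range<int>(0, 1000), UpdateAI() );
    // Sequential phase: walk every thread's buffer; this is where the
    // commands would be handed to the non-thread-safe renderer.
    size_t total = 0;
    for (CmdBuffers::const_iterator b = g_buffers.begin(); b != g_buffers.end(); ++b)
        total += b->size();
    printf("collected %u commands\n", (unsigned)total);
    return 0;
}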

OpenMP associates a thread number with each thread, though with nested parallelism the thread numbers aren't really usable anymore (as far as I know).

Summary: it would be a great and important enhancement to TBB 1) to be able to identify the current thread, and 2) to be able to set thread-core affinity.

Cheers,
Bjoern
bjoernknafla
Beginner
Two more considerations:
- the thread id and thread affinity might not belong at TBB's current level of abstraction, but rather in a lower-level threading layer underneath TBB.
- thread ids would be helpful for building a general-purpose logging system (like the one described above for graphics commands); see the sketch below.
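A minimal sketch of such thread-tagged logging, assuming a C++11 standard library (TBB itself deliberately hides thread ids, as explained in a reply below):

#include <iostream>
#include <mutex>
#include <thread>

std::mutex g_logMutex;

// Prefix each log line with the OS-level thread id from the standard library.
void logLine( const char* msg ) {
    std::lock_guard<std::mutex> lock(g_logMutex);   // serialize interleaved output
    std::cout << "[thread " << std::this_thread::get_id() << "] " << msg << "\n";
}

int main() {
    logLine("hello from main");
    std::thread t( logLine, "hello from a worker" );
    t.join();
    return 0;
}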

Cheers,
Bjoern
sunwei1688
Beginner
Hello tim18, rreed, adrobiso, Bjoern,

Thanks for your replies. I checked out Kevin's blog, and it helps. I did some experiments to get more of a feel for the algorithm inside. To my surprise, the mapping isn't exactly what I had imagined by intuition. For instance, if the grain size is set to 65, the chunks come out as 64. If the grain size is set to 384 and the iteration range is 512, the chunks come out as 256. But I think it's smart to do that anyway.
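Presumably parallel_for splits the range in half recursively until each piece is no larger than the grain size, which would explain both numbers. A toy illustration of that halving rule (my guess at the logic, not TBB source):

#include <stdio.h>

// Guess at the rule: split a range in half while it is larger than the
// grain size; pieces at or below the grain size are executed as chunks.
void split( int begin, int end, int grainsize ) {
    if (end - begin > grainsize) {
        int mid = begin + (end - begin) / 2;
        split(begin, mid, grainsize);
        split(mid, end, grainsize);
    } else {
        printf("chunk [%d,%d) size %d\n", begin, end, end - begin);
    }
}

int main() {
    split(0, 512, 384);   // two chunks of 256
    split(0, 512, 65);    // eight chunks of 64
    return 0;
}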

Robert's blog is also very interesting in showing the advantage of Thread Profiler for complex designs. I will check that out later. Thanks for the information.


ARCH_R_Intel
Employee

The original prototype of TBB had a function that let users get the thread id. The OpenMP experts here argued strongly for removing the function, on the basis that it encouraged regrettable programming practices (e.g., programming in terms of threads instead of in terms of tasks). We took their advice and removed the thread id function.

We are working on an affinity mechanism based on U. Acar, G. Blelloch, and R. Blumofe, "The Data Locality of Work Stealing," Proceedings of the Twelfth Annual ACM Symposium on Parallel Algorithms and Architectures (Bar Harbor, Maine, United States, July 9-13, 2000), SPAA '00, pp. 1-12. See section 5.3 of http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2006/n2104.pdf for what the interface might look like.
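A sketch of the interface as it eventually shipped in TBB, namely tbb::affinity_partitioner, which remembers which thread executed each chunk and replays that mapping on repeated calls (the shipped interface differs in detail from the paper's):

#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
#include "tbb/partitioner.h"

struct Smooth {
    float* a;
    void operator()( const tbb::blocked_range<int>& r ) const {
        for (int i = r.begin(); i != r.end(); ++i)
            a[i] *= 0.5f;   // stand-in for real per-element work
    }
};

int main() {
    static float data[10000];
    Smooth body;
    body.a = data;
    // The partitioner must live across calls: it carries the chunk-to-thread
    // mapping from one sweep to the next, so repeated sweeps over the same
    // data tend to land on warm caches.
    tbb::affinity_partitioner ap;
    for (int sweep = 0; sweep < 10; ++sweep)
        tbb::parallel_for( tbb::blocked_range<int>(0, 10000), body, ap );
    return 0;
}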

sunwei1688
Beginner
Hello adrobiso,

I can understand your worry. On the other hand, it's probably helpful, as a convention, to have an easy-to-use interface for beginners as well as some instruments for advanced users.

Sunwei