Hi,
I am experiencing a performance issue with TBB on Xeon Phi. On a server machine with two X5680 processors, TBB runs faster than OpenMP on a group of benchmarks I have. The same is true on another machine of mine, which has a single i7-3820. On Xeon Phi, however, the opposite happens.
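The benchmark source is not shown in this post; below is only a minimal sketch of the kind of side-by-side measurement being described. The workload, names, and sizes are illustrative, not the actual benchmark:

```cpp
#include <cstdio>

#include <omp.h>
#include <tbb/parallel_for.h>
#include <tbb/tick_count.h>

// Stand-in for the real per-element work; the volatile local keeps the
// compiler from deleting the loop entirely.
void dummy_work(int i) {
    volatile double x = 0.0;
    for (int j = 0; j < 100; ++j) x += i * 0.5;
}

int main() {
    const int N = 1000000;  // illustrative problem size

    // Time the TBB version of the loop.
    tbb::tick_count t0 = tbb::tick_count::now();
    tbb::parallel_for(0, N, [](int i) { dummy_work(i); });
    tbb::tick_count t1 = tbb::tick_count::now();
    std::printf("tbb: %f\n", (t1 - t0).seconds());

    // Time the OpenMP version of the same loop.
    double w0 = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        dummy_work(i);
    double w1 = omp_get_wtime();
    std::printf("omp: %f\n", w1 - w0);
    return 0;
}
```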
The output log I posted seems malformatted. This should be more readable:
    [userid@machine test]$ make
    icpc -o bin.x64 main.cc work.cc -fopenmp -g -O3 -no-vec -tbb
    icpc -o bin.mic main.cc work.cc -fopenmp -g -O3 -DCOMPILE_FOR_XEON_PHI -no-vec -tbb
    [userid@machine test]$ make run
    ./bin.x64 tbb: 0.592992
    ./bin.x64 omp: 0.937134
    ./bin.mic tbb: 0.807429
    ./bin.mic omp: 0.284160
    [userid@machine test]$ which icc
    /var/intel/parallel_studio_xe_2013/composer_xe_2013_sp1.2.144/bin/intel64/icc
Thanks,
Hee-Seok
The first thing I would look at is the difference between the total time and the minimum time required to execute the dummy workload.
The second thing I would look at is whether the compiler actually generated (or used) the same code for the dummy workload in all cases. (The dummy work is defined in a single function, but since the entire source is in one file it is possible that the compiler could apply different optimizations for different programming models -- and for different processor models.)
If it looks like the code being run is the same, but that there are significant differences in overhead, then I would want to compare the default scheduling policies for TBB vs OpenMP.
There are a number of differences in the performance balances of Xeon and Xeon Phi that could make different scheduling models have very different relative performance:
- Xeon Phi has a lot more threads.
- Xeon Phi has much higher cache-to-cache intervention latency.
- Xeon Phi does not have a shared L3 cache.
These differences make dynamic thread allocation much more expensive on Xeon Phi than on Xeon. In OpenMP codes, for example, static scheduling gives the best results on all systems (assuming the work is naturally load-balanced), but the differences are typically small on Xeon systems. In contrast, Xeon Phi pays big performance penalties for guided or dynamic scheduling, and the cost of changing the size of the thread pool is also large.
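For reference, a minimal illustration of selecting those OpenMP scheduling policies explicitly (the loop body is a placeholder, not from the benchmark):

```cpp
#include <omp.h>

void process(int i) { (void)i; /* placeholder for the per-iteration work */ }

void run(int n) {
    // Static scheduling: iterations are assigned to threads once, up front.
    // Cheapest on Xeon Phi when the work is naturally load-balanced.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) process(i);

    // Dynamic scheduling: threads grab chunks at run time. The extra
    // cross-thread coordination is what Xeon Phi pays heavily for.
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < n; ++i) process(i);
}
```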
Hi,
There are a number of reasons.
1. You are not measuring what you think. The 'work' in DummyClass::run() can be optimized away by most compilers. To prevent this, make `j` or `m` volatile (see the first sketch after this list).
2. TBB thread creation on MIC can take up to a few seconds, and TBB has no implicit barrier like OpenMP's for the warm-up loop: the first parallel_for can finish before all the threads are created and ready, so thread under-utilization is likely included in your results. Write the barrier explicitly using tbb::atomic<int> (acceptable for benchmarking purposes only), or increase the warm-up workload to give thread creation more time (second sketch below).
3. TBB has higher work-distribution overhead on MIC than Intel OpenMP, and it can become visible for very small workloads. This presentation might give additional clues on how to handle the situation if necessary. In short: pin the worker threads and use tbb::affinity_partitioner (third sketch below).
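To illustrate point 1, a minimal sketch of a dummy loop in the spirit of DummyClass::run() (the actual class is not reproduced here); the volatile qualifier is what keeps the loop alive:

```cpp
// Without 'volatile', the compiler can see that 'm' is never read and
// remove the entire loop, so the benchmark would time an empty region.
void run(int n) {
    volatile int m = 0;
    for (int j = 0; j < n; ++j)
        m += j;  // the "work" being timed now survives optimization
}
```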
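For point 2, one possible shape for the explicit warm-up barrier, assuming the default thread count; again, busy-waiting like this is acceptable for benchmarking only:

```cpp
#include <tbb/atomic.h>
#include <tbb/parallel_for.h>
#include <tbb/task_scheduler_init.h>
#include <tbb/tbb_thread.h>

// Spin until every worker has entered the loop, so that timings taken
// afterwards start with the whole thread pool created and ready.
void warm_up() {
    const int p = tbb::task_scheduler_init::default_num_threads();
    tbb::atomic<int> arrived;
    arrived = 0;
    // simple_partitioner keeps one iteration per task, so the loop can
    // only finish once p distinct threads are running simultaneously.
    tbb::parallel_for(0, p, [&](int) {
        ++arrived;
        while (arrived < p) tbb::this_tbb_thread::yield();
    }, tbb::simple_partitioner());
}
```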
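And for point 3, roughly how tbb::affinity_partitioner is used (the workload is a placeholder; pinning the workers themselves would be done separately, e.g. with a task_scheduler_observer):

```cpp
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

void process(int i) { (void)i; /* placeholder workload */ }

// The affinity_partitioner must survive across calls: it remembers which
// thread ran each chunk and replays that mapping on the next call, which
// helps cache reuse (and matters most once workers are pinned to cores).
static tbb::affinity_partitioner ap;

void run(int n) {
    tbb::parallel_for(tbb::blocked_range<int>(0, n),
                      [](const tbb::blocked_range<int>& r) {
                          for (int i = r.begin(); i != r.end(); ++i)
                              process(i);
                      },
                      ap);
}
```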
You can also increase the grain size to improve cache locality for this particular example.
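A minimal sketch of what that looks like, assuming an integer iteration space; the grain size of 1024 is illustrative and should be tuned for the real workload:

```cpp
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

void process(int i) { (void)i; /* placeholder workload */ }

void run(int n) {
    // blocked_range's third argument is the grain size: with
    // simple_partitioner, chunks are never split below 1024 iterations,
    // so each thread works on a contiguous block that stays in its own
    // cache, and scheduling traffic drops.
    tbb::parallel_for(tbb::blocked_range<int>(0, n, 1024),
                      [](const tbb::blocked_range<int>& r) {
                          for (int i = r.begin(); i != r.end(); ++i)
                              process(i);
                      },
                      tbb::simple_partitioner());
}
```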
Thank you so much.
I simplified the code too much, so the benchmark blurred some of the point I wanted to make. I actually pass a function pointer to either runtime to execute, so better optimization by the OpenMP compiler is not what I intended to measure.
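Roughly, the setup is something like the following (the names are illustrative, not the actual benchmark code):

```cpp
#include <omp.h>
#include <tbb/parallel_for.h>

typedef void (*work_fn)(int);

// Both runtimes receive the same function pointer, so the per-element
// code they execute is compiled exactly once, identically.
void run_tbb(work_fn f, int n) {
    tbb::parallel_for(0, n, [=](int i) { f(i); });
}

void run_omp(work_fn f, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) f(i);
}
```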
Nevertheless, there is still a noticeable performance gap. I think Anton's presentation covers this very well. Thanks again for sharing it.
Regards,
Hee-Seok
