Hi,
I am experiencing a performance issue with TBB on Xeon Phi. On a server machine with two X5680 processors, TBB runs faster than OpenMP on a group of benchmarks I have. The same is true on another machine of mine, which has a single i7-3820. On Xeon Phi, however, the opposite happens.
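The benchmark source is not shown in this post; below is only a minimal sketch of the kind of side-by-side measurement being described. The workload, names, and sizes are illustrative, not the actual benchmark:

```cpp
#include <cstdio>

#include <omp.h>
#include <tbb/parallel_for.h>
#include <tbb/tick_count.h>

// Stand-in for the real per-element work; the volatile local keeps the
// compiler from deleting the loop entirely.
void dummy_work(int i) {
    volatile double x = 0.0;
    for (int j = 0; j < 100; ++j) x += i * 0.5;
}

int main() {
    const int N = 1000000;  // illustrative problem size

    // Time the TBB version of the loop.
    tbb::tick_count t0 = tbb::tick_count::now();
    tbb::parallel_for(0, N, [](int i) { dummy_work(i); });
    tbb::tick_count t1 = tbb::tick_count::now();
    std::printf("tbb: %f\n", (t1 - t0).seconds());

    // Time the OpenMP version of the same loop.
    double w0 = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        dummy_work(i);
    double w1 = omp_get_wtime();
    std::printf("omp: %f\n", w1 - w0);
    return 0;
}
```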
The output log I posted seems malformatted. This should be more readable:
    [userid@machine test]$ make
    icpc -o bin.x64 main.cc work.cc -fopenmp -g -O3 -no-vec -tbb
    icpc -o bin.mic main.cc work.cc -fopenmp -g -O3 -DCOMPILE_FOR_XEON_PHI -no-vec -tbb
    [userid@machine test]$ make run
    ./bin.x64 tbb: 0.592992
    ./bin.x64 omp: 0.937134
    ./bin.mic tbb: 0.807429
    ./bin.mic omp: 0.284160
    [userid@machine test]$ which icc
    /var/intel/parallel_studio_xe_2013/composer_xe_2013_sp1.2.144/bin/intel64/icc
Thanks,
Hee-Seok
The first thing I would look at is the difference between the total time and the minimum time required to execute the dummy workload.
The second thing I would look at is whether the compiler actually generated (or used) the same code for the dummy workload in all cases. (The dummy work is defined in a single function, but since the entire source is in one file it is possible that the compiler could apply different optimizations for different programming models -- and for different processor models.)
If it looks like the code being run is the same, but that there are significant differences in overhead, then I would want to compare the default scheduling policies for TBB vs OpenMP.
There are a number of differences in the performance balances of Xeon and Xeon Phi that could make different scheduling models have very different relative performance:
- Xeon Phi has a lot more threads.
- Xeon Phi has much higher cache-to-cache intervention latency.
- Xeon Phi does not have a shared L3 cache.
These differences make dynamic thread allocation much more expensive on Xeon Phi than on Xeon. In OpenMP codes, for example, static scheduling gives the best results on all systems (assuming the work is naturally load-balanced), but the differences are typically small on Xeon systems. In contrast, Xeon Phi pays big performance penalties for guided or dynamic scheduling, and the cost of changing the size of the thread pool is also large.
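For reference, a minimal illustration of selecting those OpenMP scheduling policies explicitly (the loop body is a placeholder, not from the benchmark):

```cpp
#include <omp.h>

void process(int i) { (void)i; /* placeholder for the per-iteration work */ }

void run(int n) {
    // Static scheduling: iterations are assigned to threads once, up front.
    // Cheapest on Xeon Phi when the work is naturally load-balanced.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) process(i);

    // Dynamic scheduling: threads grab chunks at run time. The extra
    // cross-thread coordination is what Xeon Phi pays heavily for.
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < n; ++i) process(i);
}
```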
Hi,
There are a number of reasons.
1. You are not measuring what you think. The 'work' in DummyClass::run() can be optimized away by most compilers. To prevent this, make `j` or `m` volatile (see the first sketch after this list).
2. TBB thread creation on MIC can take up to a few seconds, and TBB has no implicit barrier like OpenMP's for the warm-up loop: the first parallel_for can finish before all the threads are created and ready, so thread under-utilization is likely included in your results. Write the barrier explicitly using tbb::atomic<int> (acceptable for benchmarking purposes only), or increase the warm-up workload to give thread creation more time (second sketch below).
3. TBB has higher work-distribution overhead on MIC than Intel OpenMP, and it can become visible for very small workloads. This presentation might give additional clues on how to handle the situation if necessary. In short: pin the worker threads and use tbb::affinity_partitioner (third sketch below).
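To illustrate point 1, a minimal sketch of a dummy loop in the spirit of DummyClass::run() (the actual class is not reproduced here); the volatile qualifier is what keeps the loop alive:

```cpp
// Without 'volatile', the compiler can see that 'm' is never read and
// remove the entire loop, so the benchmark would time an empty region.
void run(int n) {
    volatile int m = 0;
    for (int j = 0; j < n; ++j)
        m += j;  // the "work" being timed now survives optimization
}
```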
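For point 2, one possible shape for the explicit warm-up barrier, assuming the default thread count; again, busy-waiting like this is acceptable for benchmarking only:

```cpp
#include <tbb/atomic.h>
#include <tbb/parallel_for.h>
#include <tbb/task_scheduler_init.h>
#include <tbb/tbb_thread.h>

// Spin until every worker has entered the loop, so that timings taken
// afterwards start with the whole thread pool created and ready.
void warm_up() {
    const int p = tbb::task_scheduler_init::default_num_threads();
    tbb::atomic<int> arrived;
    arrived = 0;
    // simple_partitioner keeps one iteration per task, so the loop can
    // only finish once p distinct threads are running simultaneously.
    tbb::parallel_for(0, p, [&](int) {
        ++arrived;
        while (arrived < p) tbb::this_tbb_thread::yield();
    }, tbb::simple_partitioner());
}
```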
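And for point 3, roughly how tbb::affinity_partitioner is used (the workload is a placeholder; pinning the workers themselves would be done separately, e.g. with a task_scheduler_observer):

```cpp
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

void process(int i) { (void)i; /* placeholder workload */ }

// The affinity_partitioner must survive across calls: it remembers which
// thread ran each chunk and replays that mapping on the next call, which
// helps cache reuse (and matters most once workers are pinned to cores).
static tbb::affinity_partitioner ap;

void run(int n) {
    tbb::parallel_for(tbb::blocked_range<int>(0, n),
                      [](const tbb::blocked_range<int>& r) {
                          for (int i = r.begin(); i != r.end(); ++i)
                              process(i);
                      },
                      ap);
}
```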
You can also increase the grain size to improve cache locality for this particular example.
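A minimal sketch of what that looks like, assuming an integer iteration space; the grain size of 1024 is illustrative and should be tuned for the real workload:

```cpp
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

void process(int i) { (void)i; /* placeholder workload */ }

void run(int n) {
    // blocked_range's third argument is the grain size: with
    // simple_partitioner, chunks are never split below 1024 iterations,
    // so each thread works on a contiguous block that stays in its own
    // cache, and scheduling traffic drops.
    tbb::parallel_for(tbb::blocked_range<int>(0, n, 1024),
                      [](const tbb::blocked_range<int>& r) {
                          for (int i = r.begin(); i != r.end(); ++i)
                              process(i);
                      },
                      tbb::simple_partitioner());
}
```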
Thank you so much.
I simplified the code too much, so the benchmark blurred some of the point I wanted to make. I actually pass a function pointer to either runtime to execute, so better optimization by the OpenMP compiler is not what I intended to measure.
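Roughly, the setup is something like the following (the names are illustrative, not the actual benchmark code):

```cpp
#include <omp.h>
#include <tbb/parallel_for.h>

typedef void (*work_fn)(int);

// Both runtimes receive the same function pointer, so the per-element
// code they execute is compiled exactly once, identically.
void run_tbb(work_fn f, int n) {
    tbb::parallel_for(0, n, [=](int i) { f(i); });
}

void run_omp(work_fn f, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) f(i);
}
```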
Nevertheless, there is still a noticeable performance gap. I think Anton's presentation covers this very well. Thanks again for sharing it.
Regards,
Hee-Seok
