Amlesh_K_ · ‎12-05-2015

Hi,

We are trying to improve the performance of our climate modeling application on Xeon Phi (offload mode), but we have not been able to achieve the desired speed-up. So we took a small section of code from the application and created a standalone version of it to compare performance on Xeon vs Xeon Phi, as well as scaling efficiency on both machines. The code is fully vectorized, and we are using the O2 optimization level with the Intel Fortran compiler.

Ideally, xeon phi 1 thread should be only 4X slower than xeon 1 thread (no consecutive instructions for a single thread and half clock speed) and xeon phi 2 threads should be only 2X slower than xeon 1 thread. But for this standalone code, we are getting 7.3X slowdown over a single thread on xeon phi vs xeon and 3.7X slowdown for 2 xeon phi threads vs single xeon thread. Also, we see that for this particular code, scaling efficiency on the Xeon Phi drops significantly after about 60 threads, regardless of the input problem size. Could you please help us in finding out the reason for these two observations, namely:

1. Why is Xeon Phi single thread roughly 7-8 times slower than the Xeon single thread when we should expect roughly 4x slower?

2. Why are we seeing the scaling efficiency drop for the Xeon Phi after 60 threads, but scaling efficiency remains at 100% for Xeon from 1->16 threads?

Hardware used:

Xeon (Host) : E5-2650v2

Xeon Phi : 7120 P

Please find attached the following:

1. Standalone code sample which we have used in this experiment

2. Timings for various runs (variables are problem sizes and threads)

jimdempseyatthecove · ‎12-06-2015

Have you verified (examined the disassembly) to assure your statement function dbvt actually vectorized?

If there is an issue, consider transforming your statement function into an (inlinable) vector function (see !DIR$ ATTRIBUTES VECTOR) placed in one of your modules.

Jim Dempsey

McCalpinJohn · ‎12-07-2015

There are lots of reasons why Xeon and Xeon Phi have different scaling characteristics -- too many to list briefly....

A few thoughts:

- It is not practical to try to figure out how to interpret the results without knowing how the threads were bound (on both host and Phi).
- The vector lengths are quite short for Xeon Phi. With 8 elements per vector instruction and a pipeline length of 6 cycles, it takes a vector length of 48 just to fill the pipeline with independent instructions. To approach asymptotic performance takes a lot longer. This may be provided by the multiple computations in the statement function or by the 10 invocations of that function, but you would need to look at the assembly code to see how well the compiler manages to interleave independent instructions.
- It is generally helpful to try to come up with a model of how long it "should" take to execute a piece of code so that you can compare to the observations quantitatively. This code has a boat-load (TM) of exponentiations and I don't have a good intuition about how those are likely to be implemented by the compiler -- or whether they are likely to be implemented differently on the host and on the Xeon Phi.
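
The pipeline-fill arithmetic in the second point can be written down as a small model. This is only a sketch: the module and function names are made up for illustration, and the 8 lanes and 6-cycle latency are the figures quoted above, not measured values.

```fortran
module pipeline_fill
  implicit none
contains
  ! Number of elements needed so that an independent vector instruction
  ! is available for every cycle of the pipeline latency.
  pure integer function fill_length(lanes, latency_cycles)
    integer, intent(in) :: lanes, latency_cycles
    fill_length = lanes * latency_cycles
  end function fill_length

  ! Number of vector instructions one operation generates for a loop of
  ! n elements (the last one may be only partly filled).
  pure integer function vector_ops(n, lanes)
    integer, intent(in) :: n, lanes
    vector_ops = (n + lanes - 1) / lanes
  end function vector_ops
end module pipeline_fill
```

With these, fill_length(8, 6) returns 48, and comparing vector_ops(n, 8) for your own loop length n against the 6-cycle latency gives a rough idea of how many other independent operations the compiler would have to interleave.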

Amlesh_K_ · ‎12-08-2015

Hi,

Thanks Jim and John. We are looking into your suggestions.

We have tried to put !dir$ attributes vector to the statement function but the compiler ignores it. The affinity of the threads is COMPACT.

In the meantime, we've made some changes to the code to make it easier to understand. The code now has only two files: emulate.f90 calls a function in radae.f90. Before making that call, the input data for radae.f90 is buffered in an array of structures (named array). After buffering is complete, the function in radae.f90 (radabs) is called for all the chunks. We have also put all the test cases in emulate.f90 itself (i.e. timings for different numbers of threads on Xeon and Xeon Phi). The timing results are the same, and I am attaching them again along with the changed code and the -opt-report output for the modules.

The summary of the results remains the same: the Xeon Phi single thread is 7.3X slower than the Xeon single thread. I am giving the timings for one of the problem sizes (the ratios don't change with problem size).

We are looking into the assembly code, is there anything else that you want to suggest to us?

Thanks

Amlesh_K_ · ‎12-08-2015

Hi John,

Also, apart from scaling efficiency, we are more concerned about the single thread slowdown on Xeon Phi as compared to Xeon.

Thanks.

jimdempseyatthecove · ‎12-08-2015

The VECTOR attribute has to be applied to an actual function, not to a statement function:

!-------------------------------------------------------------------------------------
! Planck fnctn tmp derivative for o3
!
!dir$ attributes offload : mic :: dbvt
!DIR$ ATTRIBUTES VECTOR :: dbvt
function dbvt(t)
    real(kind=8) :: dbvt
    real(kind=8), intent(in) :: t
    dbvt=10**(10**(10**(-2.8911366682e-4+log(2.3771251896e-6+1.1305188929e-10**t)**t)/ &
    (1.0+log(-6.1364820707e-3+1.5550319767e-5**t)**t)))
!-------------------------------------------------------------------------------------
end function dbvt

and remove statement function from radabs

Jim Dempsey

TimP · ‎12-08-2015

You might check whether you are seeing the expected vector speedup. If the time is spent executing the vectorized loop of length 26 as 3 vector iterations plus a remainder which may take longer than one of those vector iterations, you can see that John's point about falling short of full vector speedup is well taken.
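
To make the "3 vector iterations plus a remainder" point concrete, here is a minimal sketch of how a loop splits into full vector iterations and a remainder. The module and routine names are hypothetical, and 8 lanes is assumed for double precision on the coprocessor.

```fortran
module vector_trips
  implicit none
contains
  ! Split a loop of n elements into full vector iterations and the
  ! number of leftover elements handled by the remainder loop.
  pure subroutine split_trips(n, lanes, full, remainder)
    integer, intent(in)  :: n, lanes
    integer, intent(out) :: full, remainder
    full = n / lanes
    remainder = mod(n, lanes)
  end subroutine split_trips
end module vector_trips
```

Calling split_trips(26, 8, full, remainder) gives full = 3 and remainder = 2, so two elements fall outside the full-width vector iterations.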

KNC firmware acceleration of exponential applies, if at all, only to single precision. With normal compiler options, your single precision constants are rounded to single precision then widened to double at compile time.
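
The effect of writing a constant without a kind suffix can be checked directly. The sketch below uses one of the constants from dbvt; the module and function names are illustrative only, and it assumes the compiler is not invoked with an option that promotes default reals to double.

```fortran
module constant_precision
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Relative difference between a literal written as default (single)
  ! real and the same digits written as a double precision literal.
  pure function widening_error() result(rel)
    real(dp) :: rel
    ! Rounded to single precision first, then widened to double.
    real(dp), parameter :: as_written = 1.1305188929e-10
    ! Double precision from the start.
    real(dp), parameter :: as_double  = 1.1305188929e-10_dp
    rel = abs(as_written - as_double) / as_double
  end function widening_error
end module constant_precision
```

The returned relative error is bounded by single precision rounding (about 6e-8). If full double precision constants are intended, the _dp suffix (or d exponent) is the usual way to ask for them.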

Charles_C_Intel1 · ‎12-08-2015

I've always understood that scalar code running on a single thread of Xeon Phi is about 10x slower (ballpark) than scalar code running on a Xeon. Xeon Phi has a slower clock frequency, more primitive prefetching, in-order processing and a smaller pipeline, one instruction every 2 cycles, etc. Which is why doing a good job on all three optimization axes is key to best performance: vectorization, parallelization, and memory utilization.

As "the usual suspects" have suggested, short loops can do away with the benefits of vectorization, loops vectorized by scalar vector instructions don't run nearly as fast as those with packed vector instructions (both will be reported as "vectorized"), and I think we can only play precision games to speed things up with the single-precision math functions. Now might be a good time to bring out VTune as well, to understand where your execution time is going and to verify which exponentiation functions you are running.

Srinivasan_R_1 · ‎12-09-2015

Hi Tim,

I would like to point out that the inner loop has 16 iterations, which is a multiple of the vector length (on both the Xeon and the Xeon Phi). This is the loop that is getting vectorized. The loop with 26 iterations is the outer loop (which is not vectorized, since the inner loop is).

We are looking at the assembly code, and it seems that vector instructions are generated for pow and log functions (svml_pow and svml_log).

I would be grateful if you confirm our understanding of a couple of things:

1. Is our assumption that the theoretical upper limit to Xeon Phi single thread : Xeon single thread performance as 1:4 correct? (Please note that we are using a Xeon E5 2650-v2 and Xeon Phi 7110P in our experiments)

2. Related to question 1: Could you point us to a whitepaper / article that will help us programmers in understanding the key architectural differences between the specific classes of processors listed above?

3. Although we are looking into the assembly code, we aren't entirely sure what we should be looking for. (I am referring to the second point that John makes about filling the Xeon Phi pipeline.)

@Jim: We tried the !dir$ attributes vector on the "dbvt" function. There is no difference in performance, both on the Xeon and the Xeon Phi.

Thanks for your help.

Srinivasan Ramesh

McCalpinJohn · ‎12-09-2015

There is no "theoretical upper limit" to the performance ratio between a Xeon and a Xeon Phi core unless you provide a very tight definition of exactly what is being executed. The processors have very different implementations and very different performance characteristics in many different axes of performance.

The Peak GFLOPS ratio is easy enough to compute, but this is of marginal utility unless you are primarily running large DGEMM-based codes using Intel's optimized (MKL) implementations. Xeon Phi has a peak FLOPS rate of 16 DP FLOPS per cycle per core, but a single thread is limited to 1/2 this performance. The Xeon E5 v2 has a peak FLOPS rate of 8 DP FLOPS per cycle per core, and a single thread is enough to approach this value. Assuming that a single thread is running at the maximum Turbo frequency on each system, the Peak GFLOPS ratio is therefore (2.6*8)/(1.33*8) = 1.95x in favor of the Xeon E5-2620 v2.

This analysis is not particularly useful because it does not take into account the wider instruction issue of the Xeon E5 v2, the shorter latencies of the vector instructions on the Xeon E5 v2, the ability to use unaligned memory references in vector instructions on the Xeon E5 v2, and the out-of-order execution capability of the Xeon E5 v2.
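
For reference, the peak-rate arithmetic above can be reproduced in a few lines. This is a sketch with made-up names; the frequencies and FLOPS-per-cycle figures are the ones quoted in this post.

```fortran
module peak_flops
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Peak double precision GFLOPS for one thread.
  pure function peak_gflops(ghz, flops_per_cycle) result(g)
    real(dp), intent(in) :: ghz
    integer,  intent(in) :: flops_per_cycle
    real(dp) :: g
    g = ghz * real(flops_per_cycle, dp)
  end function peak_gflops

  ! Single-thread peak ratio, host over coprocessor.
  pure function single_thread_ratio() result(r)
    real(dp) :: r
    ! Xeon E5 v2: 8 DP FLOPS/cycle at 2.6 GHz.
    ! Xeon Phi: 16 DP FLOPS/cycle per core, halved for one thread, at 1.33 GHz.
    r = peak_gflops(2.6_dp, 8) / peak_gflops(1.33_dp, 16 / 2)
  end function single_thread_ratio
end module peak_flops
```

single_thread_ratio() evaluates to about 1.95, the figure given above. As noted, this is a peak-rate bound only and says nothing about issue width, latency, or out-of-order execution.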

In the case of OpenMP codes, it is also important to note that the Xeon E5 v2 has much smaller overhead for parallel loop synchronization. I don't have numbers for these exact models, but on my systems the Xeon E5 OpenMP Parallel For overhead is about 8x lower than on the Xeon Phi -- roughly 3 usec using 16 cores (2 sockets) on a Xeon E5-2680 vs roughly 25 usec using 240 threads on a Xeon Phi SE10P. So it is a good idea to check the execution time per loop to see if it is big enough to make these overheads small.
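
One way to estimate the parallel loop overhead on a given system is to time a nearly empty parallel loop many times. The following is a rough sketch, not a calibrated benchmark: the module and function names are invented, system_clock is used so that it also builds without OpenMP, and the lines starting with "!$ " are only compiled when OpenMP is enabled.

```fortran
module omp_overhead
  !$ use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Average wall-clock seconds per (nearly) empty parallel loop.
  function parallel_do_overhead(nreps) result(sec_per_loop)
    integer, intent(in) :: nreps
    real(dp) :: sec_per_loop
    integer(kind=8) :: c0, c1, rate
    integer :: rep, i, nthreads, sink

    nthreads = 1
    !$ nthreads = omp_get_max_threads()
    sink = 0
    call system_clock(c0, rate)
    do rep = 1, nreps
      !$omp parallel do reduction(+:sink)
      do i = 1, nthreads
        sink = sink + 1          ! one trivial iteration per thread
      end do
      !$omp end parallel do
    end do
    call system_clock(c1)
    ! Using the result keeps the compiler from discarding the loops.
    if (sink /= nreps * nthreads) error stop "unexpected iteration count"
    sec_per_loop = real(c1 - c0, dp) / real(rate, dp) / real(nreps, dp)
  end function parallel_do_overhead
end module omp_overhead
```

Comparing parallel_do_overhead(1000) with the measured time of one real parallel loop shows whether the loop is long enough for the overhead to be negligible.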

jimdempseyatthecove · ‎12-09-2015

Results Windows 7 Pro x64 Xeon E5-2620 V2 (6 core), Xeon Phi 5110P

 Time taken is Phi         240  0.889602007111534
 Time taken is Phi         240  0.144868887960911
 Time taken is Phi         240  0.144316749181598
 Time taken is Phi         240  0.144506318261847
 Time taken is Phi         240  0.143957591149956
 Time taken is Phi         240  0.145747046452016
 Time taken is Phi         240  0.145542857469991
 Time taken is Phi         240  0.144442966207862
 Time taken is Phi         240  0.144829901866615
 Time taken is Phi         240  0.148381035076454
 Time taken is Xeon          12  0.383179801749066
 Time taken is Xeon          12  0.194863015320152
 Time taken is Xeon          12  0.194480953039601
 Time taken is Xeon          12  0.199148059356958
 Time taken is Xeon          12  0.193625698564574
 Time taken is Xeon          12  0.192459044046700
 Time taken is Xeon          12  0.193541391752660
 Time taken is Xeon          12  0.192257291637361
 Time taken is Xeon          12  0.202465276932344
 Time taken is Xeon          12  0.193891777889803
 Time taken is Phi           1   10.4746447526850
 Time taken is Phi           1   10.4768747494090
 Time taken is Phi           1   10.4756520523224
 Time taken is Phi           1   10.4757992243394
 Time taken is Phi           1   10.4762899600901
 Time taken is Phi           1   10.4757714467123
 Time taken is Phi           1   10.4762987317517
 Time taken is Phi           1   10.4754127759952
 Time taken is Phi           1   10.4757402581163
 Time taken is Phi           1   10.4778225955088
 Time taken is Phi           2   5.24285326502286
 Time taken is Phi           2   5.24383814772591
 Time taken is Phi           2   5.24338396149687
 Time taken is Phi           2   5.24539563665166
 Time taken is Phi           2   5.24507156619802
 Time taken is Phi           2   5.24515246204101
 Time taken is Phi           2   5.27666139882058
 Time taken is Phi           2   5.24337616423145
 Time taken is Phi           2   5.24409740441479
 Time taken is Phi           2   5.24369536177255
 Time taken is Phi           3   3.50098610110581
 Time taken is Phi           3   3.50043006381020
 Time taken is Phi           3   3.49954313342459
 Time taken is Phi           3   3.49829022213817
 Time taken is Phi           3   3.49863183661364
 Time taken is Phi           3   3.49817716307007
 Time taken is Phi           3   3.49935210216790
 Time taken is Phi           3   3.49822394596413
 Time taken is Phi           3   3.49944664328359
 Time taken is Phi           3   3.49791595689021
 Time taken is Phi           4   2.62827865802683
 Time taken is Phi           4   2.62598628387786
 Time taken is Phi           4   2.62757544871420
 Time taken is Phi           4   2.62849161867052
 Time taken is Phi           4   2.62607448967174
 Time taken is Phi           4   2.62721434142441
 Time taken is Phi           4   2.62601503590122
 Time taken is Phi           4   2.62675674376078
 Time taken is Phi           4   2.62652623909526
 Time taken is Phi           4   2.62654719431885
 Time taken is Phi           5   2.10510418750346
 Time taken is Phi           5   2.10604569828138
 Time taken is Phi           5   2.10648770164698
 Time taken is Phi           5   2.10593848698772
 Time taken is Phi           5   2.10716459527612
 Time taken is Phi           5   2.10523186647333
 Time taken is Phi           5   2.10511295939796
 Time taken is Phi           5   2.10512952832505
 Time taken is Phi           5   2.10524989757687
 Time taken is Phi           5   2.10513342684135
 Time taken is Phi           8   1.31911320588551
 Time taken is Phi           8   1.31968971085735
 Time taken is Phi           8   1.31825064169243
 Time taken is Phi           8   1.31840853486210
 Time taken is Phi           8   1.31906593544409
 Time taken is Phi           8   1.31799382157624
 Time taken is Phi           8   1.31815512594767
 Time taken is Phi           8   1.32028132257983
 Time taken is Phi           8   1.31947528803721
 Time taken is Phi           8   1.31832325318828
 Time taken is Phi          15  0.714768026256934
 Time taken is Phi          15  0.712407913990319
 Time taken is Phi          15  0.714274853933603
 Time taken is Phi          15  0.714610133087263
 Time taken is Phi          15  0.712441052077338
 Time taken is Phi          15  0.713129641488194
 Time taken is Phi          15  0.713629148900509
 Time taken is Phi          15  0.714674460003152
 Time taken is Phi          15  0.712888903217390
 Time taken is Phi          15  0.714587716152892
 Time taken is Phi          30  0.363400764297694
 Time taken is Phi          30  0.363277471391484
 Time taken is Phi          30  0.361719495151192
 Time taken is Phi          30  0.362908566836268
 Time taken is Phi          30  0.361732165561989
 Time taken is Phi          30  0.361592790810391
 Time taken is Phi          30  0.362836929969490
 Time taken is Phi          30  0.362052824813873
 Time taken is Phi          30  0.362059647450224
 Time taken is Phi          30  0.362176605500281
 Time taken is Phi          60  0.236995625309646
 Time taken is Phi          60  0.237463456811383
 Time taken is Phi          60  0.236693971324712
 Time taken is Phi          60  0.236688123550266
 Time taken is Phi          60  0.235453730914742
 Time taken is Phi          60  0.236695433501154
 Time taken is Phi          60  0.237739769741893
 Time taken is Phi          60  0.237671056995168
 Time taken is Phi          60  0.235785598633811
 Time taken is Phi          60  0.236349920276552
 Time taken is Phi         120  0.171246298123151
 Time taken is Phi         120  0.170057713752612
 Time taken is Phi         120  0.170133249135688
 Time taken is Phi         120  0.168912501540035
 Time taken is Phi         120  0.168748273048550
 Time taken is Phi         120  0.169358403189108
 Time taken is Phi         120  0.170121066039428
 Time taken is Phi         120  0.170868621673435
 Time taken is Phi         120  0.170095237903297
 Time taken is Phi         120  0.168979752110317
 Time taken is Phi         180  0.135696954326704
 Time taken is Phi         180  0.134168217657134
 Time taken is Phi         180  0.135523466859013
 Time taken is Phi         180  0.135474247159436
 Time taken is Phi         180  0.136024436447769
 Time taken is Phi         180  0.135391889372841
 Time taken is Phi         180  0.134440632071346
 Time taken is Phi         180  0.134434784064069
 Time taken is Phi         180  0.134247164009139
 Time taken is Phi         180  0.135335847036913
 Time taken is Xeon           1   1.67457740427926
 Time taken is Xeon           1   1.62742973864079
 Time taken is Xeon           1   1.64145006309263
 Time taken is Xeon           1   1.64677018416114
 Time taken is Xeon           1   1.67419729125686
 Time taken is Xeon           1   1.65926128439605
 Time taken is Xeon           1   1.60814436106011
 Time taken is Xeon           1   1.66869101254269
 Time taken is Xeon           1   1.64853722252883
 Time taken is Xeon           1   1.63643598183990

Modified your program:

module mod_input
    use radae
    type input
     integer*8 :: lchnkbuf
     real*8 :: padd(7)
     real*8 :: tintbuf1(pcols,pverp)
     real*8 :: tlayrbuf1(pcols,pverp)
     real*8 :: tintbuf2(pcols,pverp)
     real*8 :: tlayrbuf2(pcols,pverp)
     real*8 :: tintbuf3(pcols,pverp)
     real*8 :: tlayrbuf3(pcols,pverp)
     real*8 :: tintbuf4(pcols,pverp)
     real*8 :: tlayrbuf4(pcols,pverp)
     real*8 :: tintbuf5(pcols,pverp)
     real*8 :: tlayrbuf5(pcols,pverp)
    end type input

    !dir$ attributes offload : mic :: array
    !dir$ attributes align : 64 :: array
    type(input) :: array(1:chunks)
end module mod_input
    
program mainfunction
!use radlw
use omp_lib
use mod_input


real(kind=8) :: tlayr1(pcols,pverp)
real(kind=8) :: tint1(pcols,pverp)
real(kind=8) :: tlayr2(pcols,pverp)
real(kind=8) :: tint2(pcols,pverp)
real(kind=8) :: tlayr3(pcols,pverp)
real(kind=8) :: tint3(pcols,pverp)
real(kind=8) :: tlayr4(pcols,pverp)
real(kind=8) :: tint4(pcols,pverp)
real(kind=8) :: tlayr5(pcols,pverp)
real(kind=8) :: tint5(pcols,pverp)



do lchnk = 1,chunks

tint1 = 345.2345
tlayr1 = 354.1354
tint2 = 345.2345
tlayr2 = 354.1354
tint3 = 345.2345
tlayr3 = 354.1354
tint4 = 345.2345
tlayr4 = 354.1354
tint5 = 345.2345
tlayr5 = 354.1354

array(lchnk)%lchnkbuf = lchnk
array(lchnk)%tlayrbuf1 = tlayr1
array(lchnk)%tintbuf1 = tint1
array(lchnk)%tlayrbuf2 = tlayr2
array(lchnk)%tintbuf2 = tint2
array(lchnk)%tlayrbuf3 = tlayr3
array(lchnk)%tintbuf3 = tint3
array(lchnk)%tlayrbuf4 = tlayr4
array(lchnk)%tintbuf4 = tint4
array(lchnk)%tlayrbuf5 = tlayr5
array(lchnk)%tintbuf5 = tint5

end do

    call TestMIC(240)
    
    call TestXeon(12)

    call TestMIC(1)

    call TestMIC(2)

    call TestMIC(3)

    call TestMIC(4)

    call TestMIC(5)

    call TestMIC(8)

    call TestMIC(15)

    call TestMIC(30)

    call TestMIC(60)

    call TestMIC(120)
     
    call TestMIC(180)

    call TestXeon(1)


end program mainfunction

subroutine TestMIC(nThreads)
    use omp_lib
    use mod_input
    implicit none
    integer :: nThreads
    
    integer :: ii
    real*8 :: start_time, end_time
    integer*8 :: lchnk
    do ii = 1,10

    start_time = omp_get_wtime()

    !dir$ offload begin target(mic:0) inout(array) in(nThreads)
    !$omp parallel do num_threads(nThreads)

    do lchnk = 1,chunks

    call radabs( array(lchnk)%lchnkbuf, &
                 array(lchnk)%tlayrbuf1, array(lchnk)%tintbuf1, &
                 array(lchnk)%tlayrbuf2, array(lchnk)%tintbuf2, &
                 array(lchnk)%tlayrbuf3, array(lchnk)%tintbuf3, &
                 array(lchnk)%tlayrbuf4, array(lchnk)%tintbuf4, &
                 array(lchnk)%tlayrbuf5, array(lchnk)%tintbuf5 )

    enddo
    !$omp end parallel do
    !dir$ end offload 

    end_time = omp_get_wtime()

    print *,'Time taken is Phi',nThreads, (end_time - start_time)

    enddo
end subroutine TestMIC
    

subroutine TestXeon(nThreads)
    use omp_lib
    use mod_input
    implicit none
    integer :: nThreads
    
    integer :: ii
    real*8 :: start_time, end_time
    integer*8 :: lchnk
    do ii = 1,10

    start_time = omp_get_wtime()

    !$omp parallel do num_threads(nThreads)

    do lchnk = 1,chunks

    call radabs( array(lchnk)%lchnkbuf, &
                 array(lchnk)%tlayrbuf1, array(lchnk)%tintbuf1, &
                 array(lchnk)%tlayrbuf2, array(lchnk)%tintbuf2, &
                 array(lchnk)%tlayrbuf3, array(lchnk)%tintbuf3, &
                 array(lchnk)%tlayrbuf4, array(lchnk)%tintbuf4, &
                 array(lchnk)%tlayrbuf5, array(lchnk)%tintbuf5 )

    enddo

    end_time = omp_get_wtime()

    print *,'Time taken is Xeon',nThreads, (end_time - start_time)

    enddo
end subroutine TestXeon

The 1 thread ratio on this system is 1 : 6.35

Running a performance comparison using 1 thread per core on Xeon Phi is a non-practical endeavor.

The results table above indicates use of 3 threads per core yields best performance for this application.

Note, running an offload, transferring data of the size of your array, which takes 0.135 seconds to complete, is likely impractical (though on my system the Xeon Phi was 44% faster than the host).
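
The timings in the table can be turned into speedup and parallel efficiency with the usual definitions. This is a small sketch with illustrative names; it assumes the 1-thread time is the right baseline.

```fortran
module scaling
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Speedup of an n-thread run relative to the 1-thread run.
  pure function speedup(t1, tn) result(s)
    real(dp), intent(in) :: t1, tn
    real(dp) :: s
    s = t1 / tn
  end function speedup

  ! Parallel efficiency: speedup divided by the thread count.
  pure function efficiency(t1, tn, nthreads) result(e)
    real(dp), intent(in) :: t1, tn
    integer,  intent(in) :: nthreads
    real(dp) :: e
    e = speedup(t1, tn) / real(nthreads, dp)
  end function efficiency
end module scaling
```

With the Phi timings above (about 10.476 s on 1 thread, 0.362 s on 30 threads, 0.237 s on 60 threads), efficiency comes out at roughly 0.96 for 30 threads and roughly 0.74 for 60 threads.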

Jim Dempsey

Amlesh_K_ · ‎12-11-2015

Hi,

The reason we are focusing on the single-thread comparison between the two machines is that if we take a large hit at one thread, then even with good scalability we will still be slower than the Xeon.

Also, if we see the scalability on Xeon Phi, we see that it drops off after 60 threads. As a result of this, even 240 threads of Xeon Phi are barely able to keep up with Xeon 16 threads. In our climate modeling application (part of which we have tried to simulate here), the scalability numbers are similar for the Xeon Phi 240 threads, but single thread performance in our climate modeling application is about 12-13x slower than the Xeon.

Thanks,

Amlesh

McCalpinJohn · ‎12-11-2015

A couple of points.....

You should optimize for the best performance on the target configuration, not for single thread performance.
Some techniques that are useful for optimization using a single thread do not give the best performance for a multi-threaded implementation.
- On Xeon Phi we see significant numbers of jobs that get the best throughput using 1, 2, 3, or 4 threads per core, with no obvious way to determine the optimum thread count in advance.
- The variation in performance is the result of a complex interaction between instruction issue rates, latency tolerance, cache size per thread, the ratio of memory access streams to DRAM banks, etc., etc., etc....
  - Jim Dempsey's results (above) show the best Xeon Phi performance at 3 threads per core. This is not unusual.
- On Xeon Phi we see a majority of cases that get the best performance using N-1 cores.
  - Some applications get the best performance using significantly fewer than all of the cores.
    - This is usually due to inadequate parallelism in at least part of the code.
    - The original results show decreasing performance for more than 60 threads. If this was really run with COMPACT affinity, then you are only using 15 of the 60 cores. This suggests that the code does not have enough parallelism.
Single-thread instruction issue on Xeon Phi is restricted to 1/2 of the single core instruction issue rate.
- This makes the single thread performance ratio mostly useless as a comparison tool.
Scaling studies are useful, but they need to be performed on each proposed implementation (to see if optimizations for a single thread degrade the multi-threaded performance), and they need to be performed using both COMPACT and SCATTER affinities (preferably with KMP_PLACE_THREADS used to keep the application from running on at least one physical core).
- Xeon Phi has an extremely low-overhead RDTSC instruction. I have found it useful to use this to measure the start & stop times of each OpenMP thread to use in evaluating thread synchronization overhead and load imbalance.
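
The per-thread timing idea in the last point can be sketched as follows. Note the substitutions: system_clock stands in for RDTSC (which is not directly callable from standard Fortran), the loop body is placeholder work, and the module and routine names are invented for illustration.

```fortran
module thread_timing
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: i8 = selected_int_kind(18)
contains
  ! Runs a placeholder parallel loop and reports the spread between the
  ! earliest and latest thread finish times, in seconds.
  subroutine finish_spread(niter, spread_sec, checksum)
    integer,  intent(in)  :: niter
    real(dp), intent(out) :: spread_sec, checksum
    integer(i8) :: earliest, latest, t, rate
    integer :: i

    call system_clock(count_rate=rate)
    earliest = huge(earliest)
    latest = -huge(latest)
    checksum = 0.0_dp

    !$omp parallel private(t) reduction(min:earliest) reduction(max:latest) reduction(+:checksum)
    !$omp do schedule(static)
    do i = 1, niter
      checksum = checksum + sqrt(real(i, dp))   ! placeholder work
    end do
    !$omp end do nowait
    call system_clock(t)          ! this thread's finish time, taken before the barrier
    earliest = min(earliest, t)
    latest = max(latest, t)
    !$omp end parallel

    spread_sec = real(latest - earliest, dp) / real(rate, dp)
  end subroutine finish_spread
end module thread_timing
```

A spread that is large compared with the loop time points to load imbalance; a small spread with poor scaling points more toward synchronization or memory effects. The nowait clause matters here, since without it every thread would be timed after the implicit barrier.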

TimP · ‎12-11-2015

Scaling to 3 or 4 threads per core usually depends on achieving local cache sharing between those threads by use of KMP_PLACE_THREADS, as well as on sufficiently small cache footprint that increasing number of threads doesn't cause excessive capacity evictions.

I'm not sure if John has in mind aggressive unrolling as one of the measures for peak single thread performance which could degrade multi-thread performance.

I have examples, such as those which offer a choice between array reduction and parallel scalar reductions, where the choice which gives best single thread performance will peak at less than 20% of best multi-thread performance. My examples of outer loop vector parallel via omp parallel simd give up to 3 times the multi-thread performance of optimum single thread vectorization with loop nests switched, which isn't impressive in terms of efficiency per thread, but may be worth doing if otherwise cores would be left idle.

As John hinted, coprocessor core 0 may be busy enough with data transfers and MPSS system functions that it can't be used effectively for application threads (more so when profiling performance by VTune). The logical thread numbers for core 0 are 0 and the last 3 (possibly equivalent to [-3..0]).

If total thread stack usage among all threads approaches the hardware limit, optimum scaling may occur with hybrid MPI/OpenMP, for example 6 ranks of 30 threads each, with each rank pinned to and spread evenly across a separate group of 10 cores. If your single thread optimization involved increasing stack usage, that would degrade multi-thread performance.

Xeon Phi single thread performance and Scaling.