Hi,
We are trying to improve the performance of our climate modeling application on Xeon Phi (offload mode), but have not been able to achieve the desired speed-up. So we took a small section of code from the application and created a standalone version of it to compare performance on Xeon vs Xeon Phi, as well as scaling efficiency on both machines. The code is fully vectorized, and we are using the O2 optimization level with the Intel Fortran compiler.
Ideally, one Xeon Phi thread should be only 4X slower than one Xeon thread (a single thread cannot issue instructions on consecutive cycles, and the clock speed is about half), and two Xeon Phi threads should be only 2X slower than one Xeon thread. But for this standalone code, we see a 7.3X slowdown for a single Xeon Phi thread vs a single Xeon thread, and a 3.7X slowdown for 2 Xeon Phi threads vs a single Xeon thread. We also see that, for this particular code, scaling efficiency on the Xeon Phi drops significantly after about 60 threads, regardless of the input problem size. Could you please help us find the reason for these two observations, namely:
1. Why is Xeon Phi single thread roughly 7-8 times slower than the Xeon single thread when we should expect roughly 4x slower?
2. Why are we seeing the scaling efficiency drop for the Xeon Phi after 60 threads, but scaling efficiency remains at 100% for Xeon from 1->16 threads?
Hardware used:
Xeon (Host) : E5-2650v2
Xeon Phi : 7120 P
Please find attached the following:
1. Standalone code sample which we have used in this experiment
2. Timings for various runs (variables are problem sizes and threads)
Have you examined the disassembly to verify that your statement function dbvt was actually vectorized?
If there is an issue, consider transforming your statement function into an (inlinable) vector function (see !DIR$ ATTRIBUTES VECTOR) placed in one of your modules.
Jim Dempsey
There are lots of reasons why Xeon and Xeon Phi have different scaling characteristics -- too many to list briefly....
A few thoughts:
- It is not practical to try to figure out how to interpret the results without knowing how the threads were bound (on both host and Phi).
- The vector lengths are quite short for Xeon Phi. With 8 elements per vector instruction and a pipeline length of 6 cycles, it takes a vector length of 48 just to fill the pipeline with independent instructions. To approach asymptotic performance takes a lot longer. This may be provided by the multiple computations in the statement function or by the 10 invocations of that function, but you would need to look at the assembly code to see how well the compiler manages to interleave independent instructions.
- It is generally helpful to try to come up with a model of how long it "should" take to execute a piece of code so that you can compare to the observations quantitatively. This code has a boat-load (TM) of exponentiations and I don't have a good intuition about how those are likely to be implemented by the compiler -- or whether they are likely to be implemented differently on the host and on the Xeon Phi.
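The pipeline-fill arithmetic in the second point can be written out as a small sketch. The vector width (8 double-precision elements) and pipeline length (6 cycles) are the figures quoted above; the inner-loop trip count `inner_len` is a hypothetical value chosen for illustration, not something read from the attached code.

```fortran
program pipeline_fill
  implicit none
  integer, parameter :: lanes = 8        ! DP elements per vector instruction (from the post)
  integer, parameter :: pipe_depth = 6   ! pipeline length in cycles (from the post)
  integer, parameter :: inner_len = 16   ! hypothetical inner-loop trip count
  integer :: fill_len, vec_instr

  ! Elements that must be in flight to keep every pipeline stage busy
  fill_len = lanes * pipe_depth

  ! Vector instructions one operation needs to cover the inner loop once
  vec_instr = (inner_len + lanes - 1) / lanes

  print *, 'elements needed to fill the pipeline   :', fill_len
  print *, 'vector instructions per operation/pass :', vec_instr
  print *, 'further independent instructions needed:', max(pipe_depth - vec_instr, 0)
end program pipeline_fill
```

With a short inner loop, the remaining independent instructions have to come from elsewhere (other operations in the same expression, or other calls), which is what the assembly listing would need to confirm.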
Hi,
Thanks Jim and John. We are looking into your suggestions.
We have tried applying !dir$ attributes vector to the statement function, but the compiler ignores it. The affinity of the threads is COMPACT.
In the meantime, we've made some changes to the code to make it easier to understand. The code now has only two files, where emulate.f90 calls a function in radae.f90. Before making that call, the input data for radae.f90 is first buffered in an array of structures (named array). After buffering is complete, the function in radae.f90 (named radabs) is called for all the chunks. We have also put all the test cases in emulate.f90 itself (i.e. timings for different numbers of threads on Xeon and Xeon Phi). The timing results are the same, and I am attaching them again along with the changed code and the -opt-report for the modules.
The summary of the results remains the same: a single Xeon Phi thread performs 7.3X slower than a single Xeon thread. I am giving the timings for one of the problem sizes (the ratios don't change with problem size).
We are looking into the assembly code. Is there anything else you would suggest?
Thanks
Hi John,
Also, apart from scaling efficiency, we are more concerned about the single thread slowdown on Xeon Phi as compared to Xeon.
Thanks.
The VECTOR attribute has to be applied to an actual function, not to a statement function:
!-------------------------------------------------------------------------------------
! Planck fnctn tmp derivative for o3
!
!dir$ attributes offload : mic :: dbvt
!DIR$ ATTRIBUTES VECTOR :: dbvt
function dbvt(t)
  real(kind=8) :: dbvt
  real(kind=8), intent(in) :: t
  dbvt=10**(10**(10**(-2.8911366682e-4+log(2.3771251896e-6+1.1305188929e-10**t)**t)/ &
       (1.0+log(-6.1364820707e-3+1.5550319767e-5**t)**t)))
!-------------------------------------------------------------------------------------
end function dbvt
and remove the statement function from radabs.
Jim Dempsey
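For reference, here is a self-contained sketch of a module function marked as a vector function and called from a loop. Everything in it is hypothetical: `planck_mod`, `dbvt_sketch` and its simplified expression are stand-ins, not the real dbvt. The directive is placed in the function's specification part, which is an assumption about where the compiler expects it; check the compiler documentation for your version. Compilers that do not recognize `!DIR$` treat the line as a comment.

```fortran
module planck_mod
  implicit none
contains
  ! Hypothetical stand-in for dbvt: scalar in, scalar out, no side effects.
  function dbvt_sketch(t) result(r)
    !DIR$ ATTRIBUTES VECTOR :: dbvt_sketch
    real(kind=8), intent(in) :: t
    real(kind=8) :: r
    ! Simplified expression for illustration only (not the real formula)
    r = (-2.8911366682d-4 + 2.3771251896d-6 * t) / (1.0d0 + 1.5550319767d-5 * t)
  end function dbvt_sketch
end module planck_mod

program call_vector_function
  use planck_mod
  implicit none
  integer, parameter :: n = 16      ! hypothetical inner-loop length
  real(kind=8) :: t(n), d(n)
  integer :: i

  t = 345.2345d0
  ! The loop the compiler is expected to vectorize, calling the vector function
  do i = 1, n
     d(i) = dbvt_sketch(t(i))
  end do
  print *, d(1), d(n)
end program call_vector_function
```

Whether the call is actually vectorized still has to be checked in the optimization report or the disassembly.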
You might check whether you are seeing the expected vector speedup. If the time is spent executing the vectorized loop of length 26 as 3 vector iterations plus a remainder which may take longer than one of those vector iterations, you can see that John's point about falling short of full vector speedup is well taken.
KNC firmware acceleration of exponentials applies, if at all, only to single precision. With normal compiler options, your single-precision constants are rounded to single precision and then widened to double at compile time.
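The effect of a single-precision literal in a double-precision expression can be seen with a minimal sketch. The constant is one of those from the dbvt expression; the program and variable names are hypothetical.

```fortran
program literal_precision
  implicit none
  real(kind=8) :: widened, native

  widened = 1.5550319767e-5    ! single-precision literal, widened on assignment
  native  = 1.5550319767d-5    ! double-precision literal

  print *, 'single literal widened:', widened
  print *, 'double literal        :', native
  print *, 'relative difference   :', abs(widened - native) / native
end program literal_precision
```

If full double-precision constants are intended, writing them with a `d` exponent (or a kind suffix) avoids the rounding.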
I've always understood that scalar code running on a single thread of Xeon Phi is about 10x slower (ballpark) than scalar code running on a Xeon. Xeon Phi has a slower clock frequency, more primitive prefetching, in-order processing and a smaller pipeline, one instruction every 2 cycles, etc. Which is why doing a good job on all three optimization axes is key to best performance: vectorization, parallelization, and memory utilization.
As "the usual suspects" have suggested, short loops can do away with the benefits of vectorization, loops vectorized by scalar vector instructions don't run nearly as fast as those with packed vector instructions (both will be reported as "vectorized"), and I think we can only play precision games to speed things up with the single-precision math functions. Now might be a good time to bring out VTune as well, to understand where your execution time is going and to verify which exponentiation functions you are running.
Hi Tim,
I would like to point out that the inner loop has number of iterations as 16, so it is a multiple of vector length (both on the Xeon and the Xeon Phi). This is the loop that is getting vectorized. The one with 26 iterations is the outer loop (which is not vectorized, since the inner loop is getting vectorized).
We are looking at the assembly code, and it seems that vector instructions are generated for pow and log functions (svml_pow and svml_log).
I would be grateful if you confirm our understanding of a couple of things:
1. Is our assumption correct that the theoretical upper limit on the single-thread performance ratio of Xeon Phi to Xeon is 1:4? (Please note that we are using a Xeon E5 2650-v2 and Xeon Phi 7110P in our experiments)
2. Related to question 1: Could you point us to a whitepaper / article that will help us programmers in understanding the key architectural differences between the specific classes of processors listed above?
3. Although we are looking into the assembly code, we aren't entirely sure what we should be looking for. (I am referring to the second point that John makes about filling the Xeon Phi pipeline.)
@Jim: We tried the !dir$ attributes vector on the "dbvt" function. There is no difference in performance, both on the Xeon and the Xeon Phi.
Thanks for your help.
Srinivasan Ramesh
There is no "theoretical upper limit" to the performance ratio between a Xeon and a Xeon Phi core unless you provide a very tight definition of exactly what is being executed. The processors have very different implementations and very different performance characteristics in many different axes of performance.
The Peak GFLOPS ratio is easy enough to compute, but this is of marginal utility unless you are primarily running large DGEMM-based codes using Intel's optimized (MKL) implementations. Xeon Phi has a peak FLOPS rate of 16 DP FLOPS per cycle per core, but a single thread is limited to 1/2 this performance. The Xeon E5 v2 has a peak FLOPS rate of 8 DP FLOPS per cycle per core, and a single thread is enough to approach this value. Assuming that a single thread is running at the maximum Turbo frequency on each system, the Peak GFLOPS ratio is therefore (2.6*8)/(1.33*8) = 1.95x in favor of the Xeon E5-2620 v2.
This analysis is not particularly useful because it does not take into account the wider instruction issue of the Xeon E5 v2, the shorter latencies of the vector instructions on the Xeon E5 v2, the ability to use unaligned memory references in vector instructions on the Xeon E5 v2, and the out-of-order execution capability of the Xeon E5 v2.
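The peak-rate arithmetic above can be reproduced with a short sketch. The clock speeds and FLOPS-per-cycle figures are the ones quoted in this post; nothing here is measured, and the program and variable names are hypothetical.

```fortran
program peak_ratio
  implicit none
  real(kind=8), parameter :: xeon_ghz = 2.6d0       ! Turbo frequency used in the post
  real(kind=8), parameter :: phi_ghz  = 1.33d0      ! Turbo frequency used in the post
  real(kind=8), parameter :: xeon_flops_cyc = 8.0d0 ! DP FLOPS per cycle per core
  real(kind=8), parameter :: phi_flops_cyc  = 16.0d0
  real(kind=8) :: xeon_peak, phi_peak_1t

  xeon_peak   = xeon_ghz * xeon_flops_cyc
  ! A single Phi thread can reach only half of the per-core peak
  phi_peak_1t = phi_ghz * phi_flops_cyc / 2.0d0

  print *, 'Xeon single-thread peak (GFLOPS):', xeon_peak
  print *, 'Phi single-thread peak (GFLOPS) :', phi_peak_1t
  print *, 'ratio Xeon / Phi                :', xeon_peak / phi_peak_1t
end program peak_ratio
```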
In the case of OpenMP codes, it is also important to note that the Xeon E5 v2 has much smaller overhead for parallel loop synchronization. I don't have numbers for these exact models, but on my systems the Xeon E5 OpenMP Parallel For overhead is about 8x lower than on the Xeon Phi -- roughly 3 usec using 16 cores (2 sockets) on a Xeon E5-2680 vs roughly 25 usec using 240 threads on a Xeon Phi SE10P. So it is a good idea to check the execution time per loop to see if it is big enough to make these overheads small.
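One rough way to check this overhead on a given system is to time many nearly empty parallel loops. The sketch below is an assumption about how such a measurement could look, not the method used for the numbers above; `nrep` is a hypothetical repetition count, and the code must be compiled with OpenMP enabled.

```fortran
program omp_overhead
  use omp_lib
  implicit none
  integer, parameter :: nrep = 1000        ! hypothetical repetition count
  integer :: rep, i, nthreads
  real(kind=8) :: t0, t1
  real(kind=8), allocatable :: work(:)

  nthreads = omp_get_max_threads()
  allocate(work(nthreads))
  work = 0.0d0

  t0 = omp_get_wtime()
  do rep = 1, nrep
     ! One trivial iteration per thread, so the time is dominated by
     ! starting and synchronizing the parallel loop
     !$omp parallel do
     do i = 1, nthreads
        work(i) = work(i) + 1.0d0
     end do
     !$omp end parallel do
  end do
  t1 = omp_get_wtime()

  print *, 'threads                          :', nthreads
  print *, 'approx. time per parallel do (us):', (t1 - t0) / nrep * 1.0d6
  print *, 'checksum                         :', sum(work)
end program omp_overhead
```

Comparing this figure with the time spent in one pass of the real parallel loop shows whether the overhead is negligible.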
Results (Windows 7 Pro x64, Xeon E5-2620 v2 (6 cores), Xeon Phi 5110P):
Time taken is Phi 240 0.889602007111534
Time taken is Phi 240 0.144868887960911
Time taken is Phi 240 0.144316749181598
Time taken is Phi 240 0.144506318261847
Time taken is Phi 240 0.143957591149956
Time taken is Phi 240 0.145747046452016
Time taken is Phi 240 0.145542857469991
Time taken is Phi 240 0.144442966207862
Time taken is Phi 240 0.144829901866615
Time taken is Phi 240 0.148381035076454
Time taken is Xeon 12 0.383179801749066
Time taken is Xeon 12 0.194863015320152
Time taken is Xeon 12 0.194480953039601
Time taken is Xeon 12 0.199148059356958
Time taken is Xeon 12 0.193625698564574
Time taken is Xeon 12 0.192459044046700
Time taken is Xeon 12 0.193541391752660
Time taken is Xeon 12 0.192257291637361
Time taken is Xeon 12 0.202465276932344
Time taken is Xeon 12 0.193891777889803
Time taken is Phi 1 10.4746447526850
Time taken is Phi 1 10.4768747494090
Time taken is Phi 1 10.4756520523224
Time taken is Phi 1 10.4757992243394
Time taken is Phi 1 10.4762899600901
Time taken is Phi 1 10.4757714467123
Time taken is Phi 1 10.4762987317517
Time taken is Phi 1 10.4754127759952
Time taken is Phi 1 10.4757402581163
Time taken is Phi 1 10.4778225955088
Time taken is Phi 2 5.24285326502286
Time taken is Phi 2 5.24383814772591
Time taken is Phi 2 5.24338396149687
Time taken is Phi 2 5.24539563665166
Time taken is Phi 2 5.24507156619802
Time taken is Phi 2 5.24515246204101
Time taken is Phi 2 5.27666139882058
Time taken is Phi 2 5.24337616423145
Time taken is Phi 2 5.24409740441479
Time taken is Phi 2 5.24369536177255
Time taken is Phi 3 3.50098610110581
Time taken is Phi 3 3.50043006381020
Time taken is Phi 3 3.49954313342459
Time taken is Phi 3 3.49829022213817
Time taken is Phi 3 3.49863183661364
Time taken is Phi 3 3.49817716307007
Time taken is Phi 3 3.49935210216790
Time taken is Phi 3 3.49822394596413
Time taken is Phi 3 3.49944664328359
Time taken is Phi 3 3.49791595689021
Time taken is Phi 4 2.62827865802683
Time taken is Phi 4 2.62598628387786
Time taken is Phi 4 2.62757544871420
Time taken is Phi 4 2.62849161867052
Time taken is Phi 4 2.62607448967174
Time taken is Phi 4 2.62721434142441
Time taken is Phi 4 2.62601503590122
Time taken is Phi 4 2.62675674376078
Time taken is Phi 4 2.62652623909526
Time taken is Phi 4 2.62654719431885
Time taken is Phi 5 2.10510418750346
Time taken is Phi 5 2.10604569828138
Time taken is Phi 5 2.10648770164698
Time taken is Phi 5 2.10593848698772
Time taken is Phi 5 2.10716459527612
Time taken is Phi 5 2.10523186647333
Time taken is Phi 5 2.10511295939796
Time taken is Phi 5 2.10512952832505
Time taken is Phi 5 2.10524989757687
Time taken is Phi 5 2.10513342684135
Time taken is Phi 8 1.31911320588551
Time taken is Phi 8 1.31968971085735
Time taken is Phi 8 1.31825064169243
Time taken is Phi 8 1.31840853486210
Time taken is Phi 8 1.31906593544409
Time taken is Phi 8 1.31799382157624
Time taken is Phi 8 1.31815512594767
Time taken is Phi 8 1.32028132257983
Time taken is Phi 8 1.31947528803721
Time taken is Phi 8 1.31832325318828
Time taken is Phi 15 0.714768026256934
Time taken is Phi 15 0.712407913990319
Time taken is Phi 15 0.714274853933603
Time taken is Phi 15 0.714610133087263
Time taken is Phi 15 0.712441052077338
Time taken is Phi 15 0.713129641488194
Time taken is Phi 15 0.713629148900509
Time taken is Phi 15 0.714674460003152
Time taken is Phi 15 0.712888903217390
Time taken is Phi 15 0.714587716152892
Time taken is Phi 30 0.363400764297694
Time taken is Phi 30 0.363277471391484
Time taken is Phi 30 0.361719495151192
Time taken is Phi 30 0.362908566836268
Time taken is Phi 30 0.361732165561989
Time taken is Phi 30 0.361592790810391
Time taken is Phi 30 0.362836929969490
Time taken is Phi 30 0.362052824813873
Time taken is Phi 30 0.362059647450224
Time taken is Phi 30 0.362176605500281
Time taken is Phi 60 0.236995625309646
Time taken is Phi 60 0.237463456811383
Time taken is Phi 60 0.236693971324712
Time taken is Phi 60 0.236688123550266
Time taken is Phi 60 0.235453730914742
Time taken is Phi 60 0.236695433501154
Time taken is Phi 60 0.237739769741893
Time taken is Phi 60 0.237671056995168
Time taken is Phi 60 0.235785598633811
Time taken is Phi 60 0.236349920276552
Time taken is Phi 120 0.171246298123151
Time taken is Phi 120 0.170057713752612
Time taken is Phi 120 0.170133249135688
Time taken is Phi 120 0.168912501540035
Time taken is Phi 120 0.168748273048550
Time taken is Phi 120 0.169358403189108
Time taken is Phi 120 0.170121066039428
Time taken is Phi 120 0.170868621673435
Time taken is Phi 120 0.170095237903297
Time taken is Phi 120 0.168979752110317
Time taken is Phi 180 0.135696954326704
Time taken is Phi 180 0.134168217657134
Time taken is Phi 180 0.135523466859013
Time taken is Phi 180 0.135474247159436
Time taken is Phi 180 0.136024436447769
Time taken is Phi 180 0.135391889372841
Time taken is Phi 180 0.134440632071346
Time taken is Phi 180 0.134434784064069
Time taken is Phi 180 0.134247164009139
Time taken is Phi 180 0.135335847036913
Time taken is Xeon 1 1.67457740427926
Time taken is Xeon 1 1.62742973864079
Time taken is Xeon 1 1.64145006309263
Time taken is Xeon 1 1.64677018416114
Time taken is Xeon 1 1.67419729125686
Time taken is Xeon 1 1.65926128439605
Time taken is Xeon 1 1.60814436106011
Time taken is Xeon 1 1.66869101254269
Time taken is Xeon 1 1.64853722252883
Time taken is Xeon 1 1.63643598183990
Modified your program:
module mod_input
  use radae
  type input
    integer*8 :: lchnkbuf
    real*8 :: padd(7)
    real*8 :: tintbuf1(pcols,pverp)
    real*8 :: tlayrbuf1(pcols,pverp)
    real*8 :: tintbuf2(pcols,pverp)
    real*8 :: tlayrbuf2(pcols,pverp)
    real*8 :: tintbuf3(pcols,pverp)
    real*8 :: tlayrbuf3(pcols,pverp)
    real*8 :: tintbuf4(pcols,pverp)
    real*8 :: tlayrbuf4(pcols,pverp)
    real*8 :: tintbuf5(pcols,pverp)
    real*8 :: tlayrbuf5(pcols,pverp)
  end type input
  !dir$ attributes offload : mic :: array
  !dir$ attributes align : 64 :: array
  type(input) :: array(1:chunks)
end module mod_input

program mainfunction
  !use radlw
  use omp_lib
  use mod_input
  real(kind=8) :: tlayr1(pcols,pverp)
  real(kind=8) :: tint1(pcols,pverp)
  real(kind=8) :: tlayr2(pcols,pverp)
  real(kind=8) :: tint2(pcols,pverp)
  real(kind=8) :: tlayr3(pcols,pverp)
  real(kind=8) :: tint3(pcols,pverp)
  real(kind=8) :: tlayr4(pcols,pverp)
  real(kind=8) :: tint4(pcols,pverp)
  real(kind=8) :: tlayr5(pcols,pverp)
  real(kind=8) :: tint5(pcols,pverp)
  do lchnk = 1,chunks
    tint1 = 345.2345
    tlayr1 = 354.1354
    tint2 = 345.2345
    tlayr2 = 354.1354
    tint3 = 345.2345
    tlayr3 = 354.1354
    tint4 = 345.2345
    tlayr4 = 354.1354
    tint5 = 345.2345
    tlayr5 = 354.1354
    array(lchnk)%lchnkbuf = lchnk
    array(lchnk)%tlayrbuf1 = tlayr1
    array(lchnk)%tintbuf1 = tint1
    array(lchnk)%tlayrbuf2 = tlayr2
    array(lchnk)%tintbuf2 = tint2
    array(lchnk)%tlayrbuf3 = tlayr3
    array(lchnk)%tintbuf3 = tint3
    array(lchnk)%tlayrbuf4 = tlayr4
    array(lchnk)%tintbuf4 = tint4
    array(lchnk)%tlayrbuf5 = tlayr5
    array(lchnk)%tintbuf5 = tint5
  end do
  call TestMIC(240)
  call TestXeon(12)
  call TestMIC(1)
  call TestMIC(2)
  call TestMIC(3)
  call TestMIC(4)
  call TestMIC(5)
  call TestMIC(8)
  call TestMIC(15)
  call TestMIC(30)
  call TestMIC(60)
  call TestMIC(120)
  call TestMIC(180)
  call TestXeon(1)
end program mainfunction

subroutine TestMIC(nThreads)
  use omp_lib
  use mod_input
  implicit none
  integer :: nThreads
  integer :: ii
  real*8 :: start_time, end_time
  integer*8 :: lchnk
  do ii = 1,10
    start_time = omp_get_wtime()
    !dir$ offload begin target(mic:0) inout(array) in(nThreads)
    !$omp parallel do num_threads(nThreads)
    do lchnk = 1,chunks
      call radabs( array(lchnk)%lchnkbuf, &
                   array(lchnk)%tlayrbuf1, array(lchnk)%tintbuf1, &
                   array(lchnk)%tlayrbuf2, array(lchnk)%tintbuf2, &
                   array(lchnk)%tlayrbuf3, array(lchnk)%tintbuf3, &
                   array(lchnk)%tlayrbuf4, array(lchnk)%tintbuf4, &
                   array(lchnk)%tlayrbuf5, array(lchnk)%tintbuf5 )
    enddo
    !$omp end parallel do
    !dir$ end offload
    end_time = omp_get_wtime()
    print *,'Time taken is Phi',nThreads, (end_time - start_time)
  enddo
end subroutine TestMIC

subroutine TestXeon(nThreads)
  use omp_lib
  use mod_input
  implicit none
  integer :: nThreads
  integer :: ii
  real*8 :: start_time, end_time
  integer*8 :: lchnk
  do ii = 1,10
    start_time = omp_get_wtime()
    !$omp parallel do num_threads(nThreads)
    do lchnk = 1,chunks
      call radabs( array(lchnk)%lchnkbuf, &
                   array(lchnk)%tlayrbuf1, array(lchnk)%tintbuf1, &
                   array(lchnk)%tlayrbuf2, array(lchnk)%tintbuf2, &
                   array(lchnk)%tlayrbuf3, array(lchnk)%tintbuf3, &
                   array(lchnk)%tlayrbuf4, array(lchnk)%tintbuf4, &
                   array(lchnk)%tlayrbuf5, array(lchnk)%tintbuf5 )
    enddo
    end_time = omp_get_wtime()
    print *,'Time taken is Xeon',nThreads, (end_time - start_time)
  enddo
end subroutine TestXeon
The 1 thread ratio on this system is 1 : 6.35
Running a performance comparison using 1 thread per core on Xeon Phi is a non-practical endeavor.
The results table above indicates use of 3 threads per core yields best performance for this application.
Note, running an offload, transferring data of the size of your array, which takes 0.135 seconds to complete, is likely impractical (though on my system the Xeon Phi was 44% faster than the host).
Jim Dempsey
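The scaling behaviour in these results can be summarized as speedup and parallel efficiency relative to the 1-thread Phi run. The timings in the sketch below are rounded, representative values read off the output above (the slow first 240-thread run is treated as warm-up and ignored), so treat the percentages as approximate. Efficiency here is per thread, not per core.

```fortran
program scaling_efficiency
  implicit none
  integer, parameter :: ncase = 6
  integer, parameter :: nthreads(ncase) = [1, 2, 60, 120, 180, 240]
  ! Representative Phi timings in seconds, rounded from the posted results
  real(kind=8), parameter :: t(ncase) = &
       [10.476d0, 5.244d0, 0.2367d0, 0.1700d0, 0.1351d0, 0.1446d0]
  integer :: i
  real(kind=8) :: speedup

  do i = 1, ncase
     speedup = t(1) / t(i)
     ! Parallel efficiency = speedup divided by the number of threads
     print '(a,i4,a,f8.2,a,f6.1,a)', 'threads ', nthreads(i), &
          '  speedup ', speedup, &
          '  efficiency ', 100.0d0 * speedup / nthreads(i), ' %'
  end do
end program scaling_efficiency
```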
Hi,
The reason we are focusing on the single-thread performance comparison between the two machines is that if we take a hit right at the start, then even with good scalability we will still be slower than the Xeon.
Also, the scalability on the Xeon Phi drops off after 60 threads. As a result, even 240 Xeon Phi threads are barely able to keep up with 16 Xeon threads. In our climate modeling application (part of which we have tried to simulate here), the scalability numbers are similar for 240 Xeon Phi threads, but single-thread performance in the application is about 12-13x slower than on the Xeon.
Thanks,
Amlesh
A couple of points.....
- You should optimize for the best performance on the target configuration, not for single thread performance.
- Some techniques that are useful for optimization using a single thread do not give the best performance for a multi-threaded implementation.
- On Xeon Phi we see significant numbers of jobs that get the best throughput using 1, 2, 3, or 4 threads per core, with no obvious way to determine the optimum thread count in advance.
- The variation in performance is the result of a complex interaction between instruction issue rates, latency tolerance, cache size per thread, the ratio of memory access streams to DRAM banks, etc., etc., etc....
- Jim Dempsey's results (above) show the best Xeon Phi performance at 3 threads per core. This is not unusual.
- On Xeon Phi we see a majority of cases that get the best performance using N-1 cores.
- Some applications get the best performance using significantly fewer than all of the cores.
- This is usually due to inadequate parallelism in at least part of the code.
- The original results show decreasing performance for more than 60 threads. If this was really run with COMPACT affinity, then you are only using 15 of the 60 cores. This suggests that the code does not have enough parallelism.
- Single-thread instruction issue on Xeon Phi is restricted to 1/2 of the single core instruction issue rate.
- This makes the single thread performance ratio mostly useless as a comparison tool.
- Scaling studies are useful, but they need to be performed on each proposed implementation (to see if optimizations for a single thread degrade the multi-threaded performance), and they need to be performed using both COMPACT and SCATTER affinities (preferably with KMP_PLACE_THREADS used to keep the application from running on at least one physical core).
- Xeon Phi has an extremely low-overhead RDTSC instruction. I have found it useful to use this to measure the start & stop times of each OpenMP thread to use in evaluating thread synchronization overhead and load imbalance.
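The per-thread start/stop measurement from the last point can be sketched as follows. Standard Fortran has no direct RDTSC call, so this sketch swaps in `omp_get_wtime()`, which is coarser and has more overhead; the loop body and the size `n` are hypothetical placeholders for the real work. It must be compiled with OpenMP enabled.

```fortran
program thread_timing
  use omp_lib
  implicit none
  integer, parameter :: n = 2000000        ! hypothetical amount of work
  real(kind=8), allocatable :: t_start(:), t_stop(:), partial(:)
  real(kind=8) :: acc, t_first
  integer :: nmax, nteam, tid, i

  nmax = omp_get_max_threads()
  allocate(t_start(0:nmax-1), t_stop(0:nmax-1), partial(0:nmax-1))
  t_start = 0.0d0
  t_stop  = 0.0d0
  partial = 0.0d0
  nteam   = 1

  !$omp parallel private(tid, i, acc)
  !$omp single
  nteam = omp_get_num_threads()            ! actual team size
  !$omp end single
  tid = omp_get_thread_num()
  acc = 0.0d0
  t_start(tid) = omp_get_wtime()           ! per-thread start time
  !$omp do schedule(static)
  do i = 1, n
     acc = acc + sqrt(real(i, kind=8))     ! placeholder work
  end do
  !$omp end do nowait
  t_stop(tid) = omp_get_wtime()            ! per-thread stop time, before any barrier
  partial(tid) = acc
  !$omp end parallel

  t_first = minval(t_start(0:nteam-1))
  do tid = 0, nteam - 1
     print '(a,i4,a,f10.6,a,f10.6)', 'thread ', tid, &
          '  start offset (s) ', t_start(tid) - t_first, &
          '  busy (s) ', t_stop(tid) - t_start(tid)
  end do
  print *, 'spread of finish times (s):', &
       maxval(t_stop(0:nteam-1)) - minval(t_stop(0:nteam-1))
  print *, 'checksum:', sum(partial(0:nteam-1))
end program thread_timing
```

A large spread in finish times points to load imbalance, while late start offsets point to thread wake-up or synchronization overhead.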
Scaling to 3 or 4 threads per core usually depends on achieving local cache sharing between those threads by use of KMP_PLACE_THREADS, as well as on sufficiently small cache footprint that increasing number of threads doesn't cause excessive capacity evictions.
I'm not sure if John has in mind aggressive unrolling as one of the measures for peak single thread performance which could degrade multi-thread performance.
I have examples, such as those which offer a choice between array reduction and parallel scalar reductions, where the choice which gives best single thread performance will peak at less than 20% of best multi-thread performance. My examples of outer loop vector parallel via omp parallel simd give up to 3 times the multi-thread performance of optimum single thread vectorization with loop nests switched, which isn't impressive in terms of efficiency per thread, but may be worth doing if otherwise cores would be left idle.
As John hinted, coprocessor core 0 may be busy enough with data transfers and MPSS system functions that it can't be used effectively for application threads (more so when profiling performance by VTune). The logical thread numbers for core 0 are 0 and the last 3 (possibly equivalent to [-3..0]).
If total thread stack usage among all threads approaches the hardware limit, optimum scaling may occur with hybrid MPI/OpenMP, for example 6 ranks of 30 threads each, with each rank pinned to and spread evenly across a separate group of 10 cores. If your single thread optimization involved increasing stack usage, that would degrade multi-thread performance.