Hi,
I have a program that takes a week or so to run on an i7 or E5-2690 (single threaded) and I'm looking forward to moving it to a MIC/PHI board (multi-threaded). About 15% of the time is spent on evaluating a single polynomial that uses four 122-element parameter statements as sources for the coefficients in the equation. I would like to try to lock those real coefficients into cache. Is there a way to do that from Fortran? Is there a way to see if they are already being kept in cache (there is a fair amount of work between calls to that function so I doubt it)?
Thanks for any ideas here.
34 Replies
---
PARAMETER constants are an unlikely subject for cache miss activity, but you can check with profiling tools such as Intel VTune. Code cache misses would be equally likely.
Intel(R) Xeon Phi(TM) increases the importance of vectorization as well as threading for full performance, so the list of questions which are usually ignored about performance of polynomial evaluation is lengthened.
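To make that concrete, here is a hypothetical sketch (the names and interface are assumptions, not from the original program) of the shape of loop a compiler can vectorize when many evaluation points are available per call:

```fortran
! Hypothetical sketch: evaluating the polynomial for many points per
! call gives the compiler a unit-stride loop to vectorize over.
! All names (eval_poly, npts, idx, c0..c4) are illustrative.
subroutine eval_poly(npts, a, idx, c0, c1, c2, c3, c4, H)
   integer, intent(in)  :: npts, idx(npts)
   real,    intent(in)  :: a(npts)
   real,    intent(in)  :: c0(:), c1(:), c2(:), c3(:), c4(:)
   real,    intent(out) :: H(npts)
   integer :: k, i
   do k = 1, npts          ! vectorizable loop (idx gather permitting)
      i = idx(k)
      H(k) = c0(i) + a(k)*(c1(i) + a(k)*(c2(i) + a(k)*(c3(i) + a(k)*c4(i))))
   end do
end subroutine eval_poly
```

Whether the surrounding program can supply many points at once is, of course, a separate question.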
---
I used VTune to identify this equation as the hottest hotspot in a couple thousand lines of code. I suppose I could, if I knew how, lock the equation in cache as well. The coefficients would certainly fit in the Phi or E5-2690 L2 caches, and it would be reasonable to try the L1 caches. I understand the importance of vectorization, which is difficult in this code, and we are addressing that as well.
The question remains: is there a way I can lock the data (and code) in cache from a Fortran program?
H=a*H4(i)
H=a*(H3(i)+H)
H=a*(H2(i)+H)
H=a*(H1(i)+H)
H=H0(i)+H
or any other way to speed this up?
thanks
---
The statements shown access four different arrays. Thus, the memory accesses have rather large stride. Have you considered declaring Hx(5,122), and writing, for example,
H=(((Hx(5,i)*a+Hx(4,i))*a+Hx(3,i))*a+Hx(2,i))*a+Hx(1,i)?
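If the existing tables are named constants (as in the code Bruce posts later in the thread), the interleaved table could itself be built as a named constant, e.g. (a sketch; the 121-element length and the H0..H4 names come from that later post):

```fortran
! Sketch: interleave the five coefficient tables so all coefficients
! for a given i are adjacent in memory (fewer cache lines touched).
! Assumes H0..H4 are the existing 121-element PARAMETER arrays.
real, parameter :: Hx(5,121) = reshape( &
   [ (H0(j), H1(j), H2(j), H3(j), H4(j), j = 1, 121) ], [5,121] )
```

The evaluation then becomes H = (((Hx(5,i)*a + Hx(4,i))*a + Hx(3,i))*a + Hx(2,i))*a + Hx(1,i), exactly as written above.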
---
Hi mecej4:
Good idea, we'll test it. I had discarded that idea on the basis that double indices were slow, but had not considered the stride issue. Still, it would be much faster if it didn't have to go out to memory for the array. Anyone? Locking in cache?
---
mecej4: I should have mentioned that I appreciate the Boltzmann gravestone as the Saha-Boltzmann equation is a central part of what I'm doing.
TimP: could you elaborate on "PARAMETER constants are an unlikely subject for cache miss activity"?
---
As your code is limited by the latencies of multiplication and addition, and you don't appear to be making an effort to vectorize, there would be plenty of time to resolve any cache misses on the coefficient arrays. Even if there are no cache misses, you are spending 52 cycles (on second-generation Core i7) on multiply and add instructions per result.
You could achieve more instruction level parallelism by
H = H0(i) + a*H1(i) + a*a*(H2(i) + a*H3(i) + a*a*H4(i))
but you would still need to get both vector and threaded parallelism to approach available CPU performance, where now you are well below 1% of peak floating point performance.
If you were limited by cache misses, you would expect VTune to show most of the time spent on the first multiply and none on the adds, accompanied by a high level of last level cache misses.
If you vectorize, you have effectively locked each coefficient in register at least across the parallel operands. So, if vectorization has no other appeal to you, you can at least consider it as improving data locality as you suggested.
---
If "i" is varying more than "a", you could create a vector of powers of a, of the form "forall (i=0:4) aa(i) = a**i"
or: aa(0) = 1 ; do i = 1,4 ; aa(i) = a*aa(i-1) ; end do
then, using mecej4's approach,
H=(((Hx(5,i)*a+Hx(4,i))*a+Hx(3,i))*a+Hx(2,i))*a+Hx(1,i)
becomes
H = dot_product (hx(:,i), aa)
This might be more suited to vectorisation; i.e., simplify what is not in the inner loop.
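Assembled into one routine, this suggestion might look like the following (a sketch, under the assumption that an interleaved Hx(5,121) table, as in mecej4's post, is available):

```fortran
! Sketch: power vector of a, then DOT_PRODUCT against one column of
! the interleaved coefficient table. Hx, i and a come from the caller.
real function h_dot(Hx, i, a)
   real,    intent(in) :: Hx(:,:)   ! shape (5,121)
   integer, intent(in) :: i
   real,    intent(in) :: a
   real    :: aa(5)
   integer :: k
   aa(1) = 1.0
   do k = 2, 5
      aa(k) = a*aa(k-1)             ! aa = (1, a, a**2, a**3, a**4)
   end do
   h_dot = dot_product(Hx(:,i), aa)
end function h_dot
```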
---
TimP:
Thanks for the elaboration. I had originally written it simply: H= H0(i)+ a* H1(i)+ a*a* H2(i)+ a*a*a* H3(i)+ a*a*a*a* H4(i), but found, with a non-Intel compiler on an earlier chip, that the multi-line version seemed to be more efficient. In the old days, that form would help the compiler optimize register use. One of the major reasons we recently switched to the Intel compiler was VTune. Actually, the first line, H=a*H4(i), does take longer than any of the rest, but I was/am puzzled by the fact that it was not hugely more time consuming. I assumed this meant that all the coefficients were being pulled from memory. It seems that one lesson here is to reconsider some of the optimizations when changing compilers and CPUs.
Threaded parallelism is probably not much of an option as we are multi-threading at the program level (the outputs of each program image can be added). Also, only one of these is calculated at a time (calculate H, do a lot of calculations to determine a new a, calculate a new H) so I think that the only target for vectorization of this polynomial is one instance of this equation.
I have not figured out how to show L2 or L1 data cache misses yet. I'll start breaking the code apart to tease out some better diagnostics.
---
Hi John,
This function is not in an inner loop and i and a are unpredictable. The routine is called by a parent routine that does much calculation to come up with an a and a (precursor to) i. I'll be trying all the suggested forms as small improvements in this evaluation will have a noticeable effect on my run times.
thanks
---
Bruce,
The combination of the DO loop in a, then DOT_PRODUCT, might improve the caching. It can be difficult to predict.
Also, based on what I have understood from your posts, it appears to me that searching for the updated values of a and i might be the more time-consuming part. A smarter way of finding a better estimate of "a" can be an elusive goal. Would using a history of "a" help, say using the last 2 estimates of "a" as a 2-point interpolation or 3-point extrapolation, although this would require storing 2 "a" fields?
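The 2-point history idea could be as simple as a linear extrapolation (purely illustrative; the names are assumptions, not from the original code):

```fortran
! Illustrative only: linear extrapolation from the last two estimates
! of "a" to seed the next search.
real function a_seed(a_prev1, a_prev2)
   real, intent(in) :: a_prev1   ! most recent estimate
   real, intent(in) :: a_prev2   ! estimate before that
   a_seed = 2.0*a_prev1 - a_prev2
end function a_seed
```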
---
Hi John,
H is actually the calculation of the Hjerting function, used to calculate the Voigt function. The values that go into the calculation are usually independent of their history. About 15% of the time they are correlated, which turns out to be about the cost of saving the resulting H and testing it against the previous call to the function. Not saving the previous value, which is a push either way, lets me make the H function elemental. It has to be calculated each time.
---
"I had originally written it simply: H= H0(i)+ a* H1(i)+ a*a* H2(i)+ a*a*a* H3(i)+ a*a*a*a* H4(i),"
In principle, various re-arrangements between this and the Horner form (which minimizes the number of multiplications) are possible within the limits of the Fortran (but not C) rules. C rules would permit this to be optimized down to 7 multiplications. I haven't seen a Fortran compiler which could optimize polynomial evaluation beyond what C rules permit.
Where the series has terms of magnitude decreasing with order, the Horner evaluation should be the most accurate.
If the evaluation of H is at all time consuming, your VTune profile could be showing a delay in availability of H for the first usage. That would further swamp the possibility of re-fetching the coefficients as a bottleneck. It would also indicate that vectorization would not help unless this H evaluation can be vectorized.
---
Bruce,
I haven't tested the code (I do not have all the particulars), but an outline of my suggestion is shown in the following code
[fortran]
subroutine compute(n,a,H,H0,H1,H2,H3,H4)
   integer :: n
   real(8) :: a
   real(8) :: H(n),H0(n),H1(n),H2(n),H3(n),H4(n)
   real(8) :: Awide(4),Hwide(4)
   integer :: i
   Awide = a
   Awide(2:4) = Awide(2:4) * a
   Awide(3:4) = Awide(3:4) * a
   Awide(4) = Awide(4) * a
   do i=1,n
      Hwide(1) = H1(i)
      Hwide(2) = H2(i)
      Hwide(3) = H3(i)
      Hwide(4) = H4(i)
      Hwide = Hwide * Awide
      H(i) = H0(i) + (Hwide(1) + Hwide(2) + Hwide(3) + Hwide(4))
   end do
end subroutine compute
[/fortran]
Where your actual implementation might use module arrays for H, H0, ...
Jim Dempsey
---
I appreciate all the responses to my cache question. This function costs me about 25 cpu hours/week/thread! Again, there is no significant hysteresis in u and a. Since folks have asked questions about other features of the function and it is very short (now), I reproduce here the full working code:
real elemental function H(u,a) !elemental
real, intent(in) :: u, a
integer :: i
real :: u2, u4, Hsub1, Hsub3
real, parameter :: H0(121)= (/1.,.990050,.960789,.913931,.852144, &
.778801,.697676,.612626,.527292,.444858, &
.367879,.298197,.236928,.184520,.140858, &
.105399,.077305,.055576,.039164,.027052, &
.0183156,.0121552,.0079071,.0050418,.0031511, &
.0019305,.0011592,.0006823,.0003937,.0002226, &
.0001234,.0000671,.0000357,.0000186,.0000095, &
.0000048,.0000024,.0000011,.0000005,.0000002,.0000001, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0/)
real, parameter :: H1(121)= (/-1.12838,-1.10596,-1.04048,-.93703,-.80346, &
-0.64945,-0.48552,-0.32192,-.16772,-.03012, &
.08594,.17789,.24537,.28981,.31394, &
.32130,.31573,.30094,.28027,.25648, &
.231726,.207528,.184882,.164341,.146128, &
.130236,.116515,.104739,.094653,.086005, &
.078565,.072129,.066526,.061615,.057281, &
.053430,.049988,.046894,.044098,.041561,.039250, &
.037222,.035195,.033478,.031762,.030293,.028824, &
.027556,.026288,.025184,.024081,.023113,.022146, &
.021293,.020441,.019685,.018929,.018255,.017582, &
.016978,.016375,.015833,.015291,.014801,.014312, &
.013869,.013426,.013023,.012620,.012253,.011886, &
.01155025,.0112145,.01090675,.0105990,.0103161,.0100332, &
.00977255,.0095119,.00927125,.0090306, &
.0088079,.0085852,.0083787,.0081722,.00798035,.0077885, &
.00760995,.0074314,.00726495,.0070985, &
.0069430,.0067875,.0066421,.0064967,.0063605,.0062243, &
.00609655,.0059688,.00584875,.0057287, &
.00561585,.0055030,.00539665,.0052903,.00519005,.0050898, &
.0049952,.0049006,.00481115,.0047217, &
.00463715,.0045526,.0044725,.0043924,.00431645,.0042405, &
.00416845,.0040964,.00402795,.0039595/)
real, parameter :: H2(121)= (/1.,.9702,.8839,.7494,.5795, &
.3894,.1953,.0123,-.1476,-.2758, &
-.3679,-.4234,-.4454,-.4392,-.4113, &
-.3689,-.3185,-.2657,-.2146,-.1683, &
-.12821,-.09505,-.06863,-.04830,-.03315, &
-.02220,-.01451,-.00927,-.00578,-.00352, &
-.00210,-.00122,-.00070,-.00039,-.00021, &
-.00011,-.00006,-.00003,-.00001,-.00001,.00000, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0/)
real, parameter :: H3(121) = (/-.752,-.722,-.637,-.505,-.342, &
-.165,.007,.159,.280,.362, &
.405,.411,.386,.339,.280, &
.215,.153,.097,.051,.015, &
-.0101,-.0265,-.0355,-.0391,-.0389, &
-.0363,-.0325,-.0282,-.0239,-.0201, &
-.0167,-.0138,-.0115,-.0096,-.0080, &
-.0068,-.0058,-.0050,-.0043,-.0037,-.0033, &
-.00293,-.00257,-.00231,-.00205,-.00186,-.00166, &
-.00151,-.00137,-.00125,-.00113,-.00104,-.00095, &
-.00087,-.00080,-.00074,-.00068,-.00063,-.00059, &
-.00055,-.00051,-.00047,-.00044,-.00041,-.00038, &
-.00036,-.00034,-.00032,-.00030,-.00028,-.00026, &
-.000245,-.00023,-.00022,-.00021,-.00020, &
-.00019,-.00018,-.00017,-.00016,-.00015,-.00014, &
-.00013,-.000125,-.00012,-.000115,-.00011,-.000105, &
-.00010,-.000095,-.00009,-.000085,-.00008,-.00008, &
-.00008,-.000075,-.00007,-.00007,-.00007,-.000065, &
-.00006,-.00006,-.00006,-.000055,-.00005,-.00005, &
-.00005,-.000045,-.00004,-.00004,-.00004,-.00004, &
-.00004,-.000035,-.00003,-.00003,-.00003,-.00003, &
-.00003,-.00003,-.00003/)
real, parameter :: H4(121)= (/.50,.48,.40,.30,.17, &
.03,-.09,-.20,-.27,-.30,-.31,-.28,-.24,-.18,-.12, &
-.07,-.02,.02,.04,.05, &
.058,.056,.051,.043,.035,.027,.020,.015,.010,.007, &
.005,.003,.002,.001,.001,.000,.000,.000,.000,.000,.000, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0, &
.0,.0,.0,.0,.0,.0,.0,.0,.0,.0/)
i = (10*abs(u)) + 1
if (i > 121) then
   !cc outside of table use far wing formula p229 Gray.
   u2 = u*u
   u4 = u2*u2
   Hsub1 = 0.56419/u2 + 0.846/u4
   Hsub3 = -0.56/u4
   H = a*(Hsub1 + a*a*Hsub3)
   return
endif
!cc H= H0(i)+ a* H1(i)+ a*a* H2(i)+ a*a*a* H3(i)+ a*a*a*a* H4(i)
H=a*H4(i)
H=a*(H3(i)+H)
H=a*(H2(i)+H)
H=a*(H1(i)+H)
H=H0(i)+H
return
end
There have been some good ideas about re-structuring the expression, which I will test in the coming week. HOWEVER, my first question (TimP?) remains: how can I lock the parameters into cache? If it is reasonably possible, it seems worth trying. I have not yet found how to see L1 or L2 cache misses in VTune. There seem to be a modest number of LLC misses, but it should be useful to move up the cache hierarchy.
thanks to all.
---
An alternative might be:
i = (10* abs(u)) + 1
!
if (i > 121) then
!cc outside of table use far wing formula p229 Gray.
u2 = u*u
u4 = u2*u2
Hsub1 = 0.56419/u2 + 0.846/u4
Hsub3 = -0.56/u4
!
H = a*(Hsub1+a*a*Hsub3)
!
else if (i > 41) then
!
! H = a*a* H3(i)
! H = a * (H1(i) + H)
H = a* (H1(i)+ a*a* H3(i))
else
!
!cc H= H0(i)+ a* H1(i)+ a*a* H2(i)+ a*a*a* H3(i)+ a*a*a*a* H4(i)
! H = a* H4(i)
! H = a*(H3(i) + H)
! H = a*(H2(i) + H)
! H = a*(H1(i) + H)
! H = H0(i) + H
!
H = H0(i)+ a* (H1(i)+ a* (H2(i)+ a* (H3(i)+ a* H4(i) )))
endif
return
end
.
I would expect this to perform better if a significant proportion of cases have i > 41.
I tried to test this, but found your use of the following might do better:
H=a*H4(i)
H=a*(H3(i)+H)
H=a*(H2(i)+H)
H=a*(H1(i)+H)
H=H0(i)+H
.
It might be that H is retained in a register and improves performance ?
Test both and see.
These results, which defy old-style optimisation logic, make it difficult to understand how to do better !!
.
John
---
Bruce,
.
I wondered if the following (change of order) might work better:
H = ((( H4(i)*a + H3(i))*a + H2(i))*a + H1(i))*a + H0(i)
.
Certainly the performance difference between this and your 5-line expression is a puzzle.
.
I am attaching my test example, which shows, using Ifort Ver 12.1, that all attempts I have made to improve performance fail !!
Even the attempt to minimise multiplications for i > 41 fails !!
.
I note that the accuracy of the Hi tables appears to be poor. Could these be calculated at the first call, or is accuracy not that important ?
.
Attached is my test example, which shows that all alternatives I have tried do not work !
H_dev was my best guess at an improved coding and it does not get there.
The only explanation I can give is that your 5-line expression is more suited to vectorising. Otherwise, how can fewer multiplications produce a worse result ?
Including the test "if (i > 41) then" to reduce the multiplication count makes things worse ? Is the code size causing the problem ?
Does anyone else have an explanation for this result ?
.
John
---
If I have not made a silly mistake, this is a good example of how frustrating it can be to optimise code, when you don't know what existing optimisation the compiler and the processor are providing.
The simple approach assumes that a reduced operations count or fewer instructions should improve performance.
I can make the excuse that I don't understand the way the processor manages the cache, as the reason for this simple approach not working, but at the end of the day, I don't understand this result !
( I have tried to write the testing code to trick the compiler out of skipping redundant calls, which can be a problem, so hopefully this is not the reason.)
(Testing for i>41 looks to be a good option, although I don't know the probability that i > 41, to make the test worthwhile.)
.
Does anyone have a good explanation why testing for i > 41, to reduce the multiply count, does not work in my test example ?
.
John
---
This article
http://chadaustin.me/2009/02/latency-vs-throughput/
contains a brief mention of dependency chains, for those who haven't become used to the concepts of instruction pipelining, throughput, and latency, over the last 3 decades.
A fuller description of the topics might be found in Agner Fog's articles.
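As a concrete illustration of shortening the dependency chain for this particular polynomial, an Estrin-style split forms two independent chains that a pipelined FPU can overlap (a sketch, not code from the thread):

```fortran
! Horner form H0 + a*(H1 + a*(H2 + a*(H3 + a*H4))) is one chain of
! four dependent multiply-adds. The split below makes two shorter,
! independent chains. Coefficients are passed as scalars here.
real function h_estrin(c0, c1, c2, c3, c4, a)
   real, intent(in) :: c0, c1, c2, c3, c4, a
   real :: a2, even, odd
   a2   = a*a
   even = c0 + a2*(c2 + a2*c4)   ! even-power chain
   odd  = a *(c1 + a2* c3)       ! odd-power chain
   h_estrin = even + odd
end function h_estrin
```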
---
Tim,
.
Thank you for providing advice on the concepts of instruction pipelining, throughput, and latency, over the last 3 decades.
My question relates to the impact of changes I made to the H function.
Basically, I changed the H function code from:
.
H=a*H4(i)
H=a*(H3(i)+H)
H=a*(H2(i)+H)
H=a*(H1(i)+H)
H=H0(i)+H
return
.
to:
.
if (i > 41) then
H_alt = a*a* H3(i)
H_alt = a * (H1(i) + H_alt)
else
H_alt = a* H4(i)
H_alt = a*(H3(i)+H_alt)
H_alt = a*(H2(i)+H_alt)
H_alt = a*(H1(i)+H_alt)
H_alt = H0(i)+H_alt
end if
return
.
where i > 41 occurs 66% of the time in my test program.
.
A traditional analysis would suggest that identifying i>41 should provide a significant improvement, especially in my test example, where the value of i is randomly selected between 1:120.
The run times I get with ifort Ver 12.1 are (seconds):
H      1.606   original multi-line formula
H_new  1.685   use of DOT_PRODUCT for vectorisation (not much worse!)
H_alt  2.325   limit multiply if i>41
H_dev  2.371   limit multiply if i>41 and single-line formula
H_xx   2.371   limit multiply if i>41 and multi-line formula
.
This basically shows that function H_alt runs significantly slower than function H, while there are significantly fewer multiplications in H_alt. The introduction of the "if (i>41)" test slows the function down.
The article you referred to identifies ways of removing dependency-chain latency, although function H does not appear to have solved that problem, as the main calculation sequentially updates the value of H.
.
I am still wondering why the function H_alt did not succeed in improving performance?
.
Perhaps I should try:
H=a* H3(i)
H=a* H
H=a*(H1(i)+H)
.
John
Postscript:
I cannot see how either:
- the introduction of the "if (i > 41)" test can change the dependency chain latency characteristic of the main (i<=41) code, or
- how the code for i>41 runs slower.
.
There is, however, a definite change in the efficiency of the modified code, as the slower performance of H_alt occurs for multiple compilers, so something is improving the performance of the original H calculation.
.
There must be something in the way the processor groups the instructions, which the if test is upsetting.
Could Bruce's idea be correct, where jumping to i>41 drops H0, H2 and H4 from the cache ?
Without the code for i>41, the code for H is more compact. Could this be an explanation ?
If anyone can explain this change I'd appreciate their input.
---
Hi,
The way I broke up the lines originally reflected what I had learned about FORTRAN compilers in the 60s and 70s. It is nice to see that that approach still seems to work well -- thanks John. Agner Fog's manuals are on C++ or assembly; C++ compilers are, I gather, substantially different from Fortran compilers, and writing assembly seems to be what one does when the compiler can't do its job well. I think that my original coding reflects the chadaustin article to some degree (H and a should be kept in registers, the parameters are only pulled once, and there are no intermediate calculations); I'll have to look at it more carefully to test the optimization.
BUT, Tim, returning again to my original question about locking the parameter data in cache: I take it that either it can't be done or you don't know how to do it. But thanks for the other insights.
