Keeping data in cache - Page 2

Bruce_Weaver · ‎11-23-2012

HI,

I have a program that takes a week or so to run on an i7 or E5-2690 (single threaded) and I'm looking forward to moving it to a MIC/PHI board (multi-threaded). About 15% of the time is spent on evaluating a single polynomial that uses four 122-element parameter statements as sources for the coefficients in the equation. I would like to try to lock those real coefficients into cache. Is there a way to do that from Fortran? Is there a way to see if they are already being kept in cache (there is a fair amount of work between calls to that function so I doubt it)?

Thanks for any ideas here.

John_Campbell · ‎11-26-2012

Bruce, . I am not aware of how you could lock your parameter arrays in the cache to improve performance. I would not think it possible or a good idea, as optimisation of the cache is best kept away from us programmers. . I have tested two ideas for improving your routine: 1) was to generate an a power vector and store the Hi values in a vector to utilise the vector instructions. I called this H_new in the attached file. I tested this and it runs slightly slower than your H routine; so no help. . 2) was to identify that for i > 41, only H1 and H3 are non zero, so I wrote a change of the form: . else if (i > 41) then ! H = a* (H1(i)+ a*a* (H3(i)) else ! H = H0(i)+ a* (H1(i)+ a* (H2(i)+ a* (H3(i)+ a* H4(i) ))) end if . Unfortunately this change results in significantly slower performance, which I did not expect. . I did change your H function by: -introducing a MODULE to store your parameter constants H0...H4, and -using a 1-line statement for the general case. . Both "tidy" up the code (personal opinion) but do nothing to improve performance. I also noted that your parameter array values are not very accurate, especially H4 and H3, which might imply a << 1 . Compiling with /O2 will introduce vector instructions, which is the best option of improving a single thread. . If you want to improve performance, then you might want to look at other parts of the code. While I don't understand your solution method, if the values of "a" and "u" are being interatively estimated, then a trending algorithm could help. By identifying if they are monotonically increasing or oscilating, this might help for predicting a better estimate, from past values. . If this assumption of convergence is relevant, then steping between values of i might also cause problems. Could using the value of U interpolating between H0(i) and H0(i+1) provide a smoother value to converge more quickly ? r= (10*abs(u)) + 1 i = r j = i+1 fj = (r-i) ! factor to apply to H0(i+1) fi = 1 - fj ! factor to apply to H0(i) ! H = a * (fi*H4(i)+fj*H4(i+1) H = a * (fi*H3(i)+fj*H3(i+1) + H) H = a * (fi*H2(i)+fj*H2(i+1) + H) H = a * (fi*H1(i)+fj*H1(i+1) + H) H = fi*H0(i)+fj*H0(i+1) + H ! ! or H = ( H0(i)+ a* (H1(i)+ a* (H2(i)+ a* (H3(i)+ a* H4(i) ))) * fi + & ( H0(j)+ a* (H1(j)+ a* (H2(j)+ a* (H3(j)+ a* H4(j) ))) * fj . possibly another idea that could work . Sorry I could not be of more help. (still puzzled as to why i>41 fails!). . John

jimdempseyatthecove · ‎11-27-2012

Bruce, Your function (H) is declared as elemental. Implying your intended use is: SomeWhere(i) = H(u,a) Or more likely: SomeWhere(i) = H(u(i),a) A performance problem as I see it is the conditional branching in your elemental function H making it non-friendly to vectorization. An optimization technique to explore is to create (perhaps a once-only thing) a pick list of regions of u(i) that qualify for each branch in your H functions. Then: ! run thru the .LE. 121 sections of U DO I=1,maxPick if(PickLE121low(i) .eq. 0) exit Somewhere(PickLE121low(i):PickLE121high(i)) = HLE121(u(PickLE121low(i):PickLE121high(i)), a) END DO ! run through the .GT. sections of U DO I=1,maxPick if(PickGT121low(i) .eq. 0) exit Somewhere(PickGT121low(i):PickGT121high(i)) = HGT121(u(PickGT121low(i):PickGT121high(i)), a) END DO The idea being you have clusters of adjacent (vectorizable) like entries in u. Jim Dempsey

John_Campbell · ‎11-27-2012

Bruce, Your function H uses a lookup table for U and a quartic polynomial for a. An alternative might be to also use a lookup table for a, then use linear interpolation from the table. You could also use a by-linear interpolation for a and u, although I do not know if u=>i is a continuous or step function. Depending on the range of values for a, a lookup table could be a good alternative. Transforming a to another value then using this can also help, such as b = log(a) or (exp, tan or sqrt) might transform to a table of b that better suits uniform spacing of the b table. This way you can reduce the run time, and size the spacing of values in the table to achieve acceptable accuracy. I would expect that a linear interpolation from this table would be faster than your quartic polynomial, while the generation of the table values could be an initialisation state or a set of data statements. John

Bruce_Weaver · ‎11-27-2012

Hi, The lookup table(s) for a is about 45 Mbytes and about to become much larger; u is calculated from quantum thermodynamics conditions under some otherwise unpredictable conditions and is continuous.. The tables for the a values are pre-calculated by another program for a specific set of conditions. I don't think that there are any interpolation schemes that will be profitable. Bruce

John_Campbell · ‎11-27-2012

Bruce, I looked at the polynomials for H0-H4 and the values of H for various values of "a" and "u"(i). Depending on the range of a, I would expect that you could achieve similar accuracy with a look up table of values, much less than the 45mb you are using to calculate a. You could populate the table of a/u interpolation, using the existing function for H, although it would be worth checking the accuracy of your H0-H4, especially if a can be large. Your dividing u into steps of 0.1 also appears to be a significant comprimise for the H estimate, in comparison to the accuracy provided to "a" in the formula for H. It looks to me that there is much better accuracy for H in relation to "a", than you achieve for "u". (perhaps du could be less than 0.1) Depending on the range of "a" values, you might be able to reduce the table size if you use a table for b = atan (a*factor), so that b is in the range -pi/2 to pi/2. If you can reduce the look-up tables spacing (du and da) to use linear interpolation, and provide sufficient accuracy, then I think there would be some run time gains. Hopefully some uninformed ideas that might provide a new look at your solution. John . I roughed up a lookup table approach for u and a; for u in the range 0 to 12 and a in the range 0 to 1.2 (using du = 0.1 and da = 0.01) I also changed the test program to report time as processor ticks per function call. From testing, the Function H takes about 16 processor ticks, (which is fairly quick) With H_table being a lookup for u and a, it takes about 6 ticks If H_table does a linear interpolation for a (see below) it takes about 8.5 ticks, which is half the reported ticks for H (I am attaching the latest version of the test program, which tries a few different ideas.) . This shows there is potential for this approach, depending on the range of u and a. You could use H_table lookup if a and u are within range, else use H. . real function H_yy (u,a) !elemental use H_tables real, intent (in) :: u,a ! ! retain multi-line formula, but test for i > 41 ! integer iu,ia real*4 :: du = 10 ! steps per U value ( U range 0:12.0 ) real*4 :: da = 100 ! steps per A value ( A range 0:1.2 ) logical :: first = .true. real*4 H, ut, at, fa external H ! if (first) then do ia = 1,120 at = (ia-1)/da do iu = 1,121 ut = (iu-1)/du ua_table(iu,ia) = H(ut,at) end do end do first = .false. end if ! iu = (abs(u)*du)+1 ! fa = (abs(a)*da)+1 ia = fa fa = fa-ia ! h_yy = ua_table(iu,ia)*(1.0-fa) + ua_table(iu,ia+1)*fa ! return end function H_yy

Bruce_Weaver · ‎11-28-2012

Thanks for the ideas. I guess I didn't explain the source of 'a' very completely...they are values from QM calculations for spectral lines. Each of thousands of lines has a series of these depending on the conditions at different altitudes in the atmosphere of a star. They are non-linear and typically range for 1e-3 to 1e-1. This doesn't keep me from bringing a & u into a 2-D table lookup as opposed to four 1-D lookups. I'll look into your suggestion. thanks.

John_Campbell · ‎11-28-2012

Bruce, . I produced a surface chart for H as a fn of a and u. For what I anticipate is the range of a and u, H shows greater variation wrt u, than a, which makes your 121 step look-up table for u appear to be a relatively high source of inaccuracy. On this basis, having a look-up for a and an interpolation for u could provide a more accurate estimate of H on average. (The example I posted was a look-up for u and an interpolation for a). Based on the performance target, doing a linear interpolation for both a and u from a 2D table would probably be similar/slower than your existing H function, so the better solution might be a look-up for a and a linear interpolation for u. You might also want to review the accuracy of the H function, for populating the 2D table values and to use it when a or u are outside the range of the 2D table. . you could also speed up the interpolation formula by also having a first difference table(s): h_yy = ua_table(iu,ia)*(1.0-fa) + ua_table(iu,ia+1)*fa ! original interpolation in a rearranging, becomes = ua_table(iu,ia) + fa*(ua_table(iu,ia+1)-ua_table(iu,ia)) ! only 1 multiplication by providing a first difference table, removes a difference calculation = ua_table(iu,ia) + fa*da_table(iu,ia) ! where difference table "da_table(iu,ia)" can be pre-calculated . To improve accuracy, and reduce the u spacing, two difference tables might help to improve performance. ( fa = (abs(a)*da)+1 ; ia = fa ; fa = fa-ia ) ( fu = (abs(u)*du)+1 ; iu = fu ; fu = fu-iu ) h = ua_table(iu,ia) + fa*da_table(iu,ia) + fu*du_table(iu,ia) ! is a possibility for 2 way interpolation with 3 pre-calculated tables . This would probably save on storage and improve performance ! The table spacing du and da could be refined, depending on the accuracy required, although a finer spacing only affects the size of the table and not the run time. . John . I did some tests with the following code sample: . real function H_yy (u,a) !elemental use H_tables real, intent (in) :: u,a ! ! bi-linear interpolation from a table ! assumes a and u are in range ! integer iu,ia real*4 :: du = 5 ! steps per U value ( U range 0:12.0 ) real*4 :: da = 500 ! steps per A value ( A range 0:0.12 ) logical :: first = .true. real*4 H, ut, at, fa, fu external H ! if (first) then do ia = 1,60 at = (ia-1)/da do iu = 1,61 ut = (iu-1)/du ua_table(iu,ia) = H(ut,at) end do end do do ia = 1,60-1 do iu = 1,61-1 da_table(iu,ia) = ua_table(iu,ia+1) - ua_table(iu,ia) du_table(iu,ia) = ua_table(iu+1,ia) - ua_table(iu,ia) end do end do first = .false. end if ! fu = (abs(u)*du)+1 iu = fu fu = fu - iu ! fa = (abs(a)*da)+1 ia = fa fa = fa - ia ! h_yy = ua_table(iu,ia) & + da_table(iu,ia)*fa & + du_table(iu,ia)*fu ! return end function H_yy . Based on modifying yesterday's test program : The H function uses about 16.5 clock cycles H_yy using bi-linear interpolation (above) uses 12.5 clock cycles H_yy using linear in a or u and lookup in the other (omiting 1 difference table) uses 10 clock cycles H_yy using lookup in a and u uses 6.2 clock cycles (yesterday's example) . If we knew why the "if(i>41)" did not work, then possibly we could improve it some more, as H_yy needs a test for ia or iu out of range ! . I've found this an interesting problem and hopefully this might give you some options.

Bruce_Weaver · ‎11-28-2012

Hi, I hope to examine the range of a next week. This is an interesting idea and I'll be able to explore it then. thanks

John_Campbell · ‎05-30-2013

Bruce,

I hope you had some success with your problem.

I have been addressing a similar problem, where I have a 2-dimensional interpolation, similar to the requirement of your function H(u,a). I have resisted the use of a look-up table of values, as the results from this approach produce a jagged plot of performance. I have persisted with the bi-linear interpolation, producing a smoother system response. This is easier than trying to explain the reason for rough chart results.

To improve the run-time performance, I have found an outer loop (do i = 1,7), which is independent of my H function, and shifted it to be an inner loop, and so reducing the number of times I have to call the interpolation functions by a factor of 7. This produces a vector of results, rather than a single value at the H loop level. There is a penalty to pay, as more information needs to be stored to support the new loop order, but there has been a significant performance improvement.

John

Bruce_Weaver · ‎05-31-2013

Hi John,

I've been off working on other enhancements to the code and just got back to looking at current hotspots. It has gotten much faster so now H takes up 10% of my compute time. I'm going to have a summer college intern start next week and the first thing I'm going to have him do is to compare the tabular-generated values to the computation of the exact (enough) value.

I'm not sure I completely understand your comments. I can only use one H at a time in the calling code. Are you saying that you generate several of your Hs at a time or are you saying that you get several interations of the interpolation at a time? The former doesn't help me and the later seems hard to imagine. I'm starting to wonder about a large table with very simple interpolation as the function is quite well behaved. I did a plot of it about the itme this thread started and it seemed pretty smooth with very modest second derivatives.

After I see how the comparisons are that the student generates, I am going to set him to to work on trying your approach or something like it. Since, for the entire program, I am getting huge differences between execution times in release and debug mode, I guess I will have to forgo Vtune and do old-fashioned timing.

--Bruce

jimdempseyatthecove · ‎06-01-2013

John,

I haven't seen your code. What are your thoughts if your inner loop was 1,8 (add dummy result), and you work the code such that the inner loop is fully vectorized?

Jim Dempsey

John_Campbell · ‎06-01-2013

@bruce,
The loop "do i = 1,7" is independent of the arguments of function H, so I was able to move this "i" loop to inside, but now have to store the results for the 7 values of I. Cost only about 10mb in extra storage for this change, although more arrays to kep track of.

@jim,
I have not fully optimised the code. There are 5 interpolation functions, 4 are linear and one is bi-linear. There was significnt savings in shifting all these outside the inner loop. This is a "working" code, so I don't have a lot of time to do fine optimisation. The first time I ran this latest approach, it took over 3 hours. I have now reduced it to 15 minutes with a number of changes. Prior to the need for this latest approach, the run took about 5 minutes, so I thought I needed to identify something. This will probably not be the final version in an evolving simulation, where as it adapts, I need to keep the run times under control. I will probably do more if this approach remains in use.

John

jimdempseyatthecove · ‎06-02-2013

Good catch on reorganizing your loop order (3hr to 15min). There may be some other changes that get you the additional distance back down to ~5min (that won't break working code).

Jim Dempsey

Clinton_Lewis · ‎06-12-2013

Just trying to follow alogn. Not sure which loop (do i = 1,7) is being discussed. Is the loop outside the call to H in freep1, which is in the line

alphadL= 0.014974* (fn/dnudop) * H(u,a) * dL * fudge

I have an old version of the code - has H been converted to a bilinear interpolation function? Does stride become an issue with this approach? Stride length depends upon which index is incremented squentially.