Hi,
I have a program that takes a week or so to run on an i7 or E5-2690 (single threaded) and I'm looking forward to moving it to a MIC/PHI board (multi-threaded). About 15% of the time is spent on evaluating a single polynomial that uses four 122-element parameter statements as sources for the coefficients in the equation. I would like to try to lock those real coefficients into cache. Is there a way to do that from Fortran? Is there a way to see if they are already being kept in cache (there is a fair amount of work between calls to that function so I doubt it)?
Thanks for any ideas here.
Bruce,
I hope you had some success with your problem.
I have been addressing a similar problem, where I have a 2-dimensional interpolation, similar to the requirement of your function H(u,a). I have resisted the use of a look-up table of values, as the results from this approach produce a jagged plot of performance. I have persisted with the bi-linear interpolation, producing a smoother system response. This is easier than trying to explain the reason for rough chart results.
To improve the run-time performance, I found an outer loop (do i = 1,7) that is independent of my H function and moved it inside, which reduces the number of calls to the interpolation functions by a factor of 7. This produces a vector of results, rather than a single value, at the H loop level. There is a penalty to pay, as more information needs to be stored to support the new loop order, but there has been a significant performance improvement.
John
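For readers following along, the loop interchange described above can be sketched roughly as follows. This is an illustration only, not John's code: the function `interp`, the arrays `x` and `w`, and the sizes `m` and `n` are all hypothetical stand-ins. It counts how often the expensive function is evaluated in each loop order and checks that both orders give the same results.

```fortran
program loop_interchange_sketch
  implicit none
  integer, parameter :: m = 7       ! length of the loop that does not affect interp
  integer, parameter :: n = 1000    ! number of distinct interp arguments (assumed)
  real :: x(n), w(m), before(m, n), after(m, n), h
  integer :: i, k, calls_before, calls_after

  call random_number(x)
  call random_number(w)
  calls_before = 0
  calls_after = 0

  ! Original order: the i loop is outermost, so interp runs m*n times
  ! even though its argument does not depend on i.
  do i = 1, m
    do k = 1, n
      before(i, k) = w(i) * interp(x(k))
      calls_before = calls_before + 1
    end do
  end do

  ! Interchanged order: interp runs once per k and the value is reused
  ! for all m values of i. The cost is storing a vector of m results per k.
  do k = 1, n
    h = interp(x(k))
    calls_after = calls_after + 1
    do i = 1, m
      after(i, k) = w(i) * h
    end do
  end do

  if (maxval(abs(before - after)) > 1.0e-6) error stop "results differ"
  print *, "interp calls, original order:    ", calls_before
  print *, "interp calls, interchanged order:", calls_after

contains

  ! Hypothetical stand-in for an expensive interpolation function.
  pure function interp(u) result(v)
    real, intent(in) :: u
    real :: v
    v = 1.0 / (1.0 + u*u)
  end function interp

end program loop_interchange_sketch
```

With these sizes the original order makes 7000 calls and the interchanged order makes 1000, which is the factor of 7 mentioned in the post.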
Hi John,
I've been off working on other enhancements to the code and just got back to looking at current hotspots. It has gotten much faster so now H takes up 10% of my compute time. I'm going to have a summer college intern start next week and the first thing I'm going to have him do is to compare the tabular-generated values to the computation of the exact (enough) value.
I'm not sure I completely understand your comments. I can only use one H at a time in the calling code. Are you saying that you generate several of your Hs at a time, or that you get several iterations of the interpolation at a time? The former doesn't help me and the latter seems hard to imagine. I'm starting to wonder about a large table with very simple interpolation, as the function is quite well behaved. I did a plot of it about the time this thread started and it seemed pretty smooth, with very modest second derivatives.
After I see how the comparisons that the student generates turn out, I am going to set him to work on trying your approach or something like it. Since, for the entire program, I am getting huge differences between execution times in release and debug mode, I guess I will have to forgo VTune and do old-fashioned timing.
--Bruce
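A minimal sketch of the "large table with very simple interpolation" idea, including the table-versus-exact comparison Bruce mentions, might look like the following. Everything here is an assumption for illustration: `h_exact` is a smooth placeholder and not the real H(u,a), and the grid ranges and sizes (`u_min`, `u_max`, `a_min`, `a_max`, `nu`, `na`) are made up.

```fortran
program table_lookup_sketch
  implicit none
  integer, parameter :: nu = 201, na = 101          ! table size (assumed)
  real, parameter :: u_min = 0.0, u_max = 10.0      ! argument ranges (assumed)
  real, parameter :: a_min = 0.0, a_max = 1.0
  real :: table(nu, na), du, da, u, a, max_err
  integer :: i, j

  du = (u_max - u_min) / real(nu - 1)
  da = (a_max - a_min) / real(na - 1)

  ! Fill the table once, first index innermost (column-major order).
  do j = 1, na
    do i = 1, nu
      table(i, j) = h_exact(u_min + real(i - 1)*du, a_min + real(j - 1)*da)
    end do
  end do

  ! Compare table lookups against the exact function at off-grid points.
  max_err = 0.0
  do j = 1, 50
    do i = 1, 50
      u = u_min + (u_max - u_min) * (real(i) - 0.5) / 50.0
      a = a_min + (a_max - a_min) * (real(j) - 0.5) / 50.0
      max_err = max(max_err, abs(h_table(u, a) - h_exact(u, a)))
    end do
  end do
  print *, "max abs error of table lookup:", max_err

contains

  ! Placeholder smooth function standing in for the real H(u,a).
  pure function h_exact(u, a) result(h)
    real, intent(in) :: u, a
    real :: h
    h = exp(-u*u) + a / (1.0 + u*u)
  end function h_exact

  ! Bilinear interpolation in the precomputed table.
  function h_table(u, a) result(h)
    real, intent(in) :: u, a
    real :: h, fu, fa, tu, ta
    integer :: iu, ia
    fu = (u - u_min) / du
    fa = (a - a_min) / da
    ! Clamp the cell index so iu+1 and ia+1 stay inside the table.
    iu = min(max(int(fu) + 1, 1), nu - 1)
    ia = min(max(int(fa) + 1, 1), na - 1)
    tu = fu - real(iu - 1)
    ta = fa - real(ia - 1)
    h = (1.0 - tu)*(1.0 - ta)*table(iu, ia)     + tu*(1.0 - ta)*table(iu + 1, ia) &
      + (1.0 - tu)*ta        *table(iu, ia + 1) + tu*ta        *table(iu + 1, ia + 1)
  end function h_table

end program table_lookup_sketch
```

For a smooth function the error shrinks roughly with the square of the grid spacing, so the printed maximum error gives a quick way to judge how fine the table needs to be.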
John,
I haven't seen your code. What are your thoughts if your inner loop was 1,8 (add dummy result), and you work the code such that the inner loop is fully vectorized?
Jim Dempsey
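The padding idea can be sketched as below. This is only an illustration under stated assumptions: the arrays `w` and `res` and the scalar `h` are hypothetical, and whether the loop actually vectorizes depends on the compiler and flags. The point is that eight double-precision values fill one 512-bit vector register (or two 256-bit ones), so a trip count of 8 with one dummy element can avoid a remainder loop.

```fortran
program padded_inner_loop_sketch
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, parameter :: n_real = 7   ! results actually needed
  integer, parameter :: n_pad  = 8   ! padded trip count (one dummy element)
  real(real64) :: w(n_pad), res(n_pad), h
  integer :: i

  ! The dummy eighth element is given a harmless value.
  w = 0.0_real64
  w(1:n_real) = [(real(i, real64), i = 1, n_real)]
  h = 0.5_real64   ! stands in for a value computed once outside the loop

  ! Fixed trip count of 8 with unit stride: a simple candidate for vectorization.
  do i = 1, n_pad
    res(i) = w(i) * h
  end do

  ! Only the first n_real results are used; res(n_pad) is ignored.
  print *, res(1:n_real)
end program padded_inner_loop_sketch
```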
@bruce,
The loop "do i = 1,7" is independent of the arguments of function H, so I was able to move this "i" loop inside, but I now have to store the results for the 7 values of i. The change cost only about 10 MB in extra storage, although there are more arrays to keep track of.
@jim,
I have not fully optimised the code. There are 5 interpolation functions: 4 are linear and one is bi-linear. There were significant savings in shifting all of these outside the inner loop. This is a "working" code, so I don't have a lot of time to do fine optimisation. The first time I ran this latest approach, it took over 3 hours. I have now reduced it to 15 minutes with a number of changes. Prior to the need for this latest approach, the run took about 5 minutes, so I thought I needed to identify something. This will probably not be the final version of an evolving simulation, and as it adapts I need to keep the run times under control. I will probably do more if this approach remains in use.
John
Good catch on reorganizing your loop order (3hr to 15min). There may be some other changes that get you the additional distance back down to ~5min (that won't break working code).
Jim Dempsey
Just trying to follow along. Not sure which loop (do i = 1,7) is being discussed. Is the loop outside the call to H in freep1, which is in the line
alphadL= 0.014974* (fn/dnudop) * H(u,a) * dL * fudge
I have an old version of the code. Has H been converted to a bilinear interpolation function? Does stride become an issue with this approach? Stride length depends upon which index is incremented sequentially.
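On the stride question, a short sketch may help. Fortran stores arrays in column-major order, so the first index is contiguous in memory. The array names and sizes below are hypothetical and not taken from the code under discussion. Both loop nests compute the same thing; only the memory access pattern differs.

```fortran
program stride_sketch
  implicit none
  integer, parameter :: m = 7, n = 1000   ! sizes are assumptions
  real :: a(m, n), b_unit(m, n), b_strided(m, n)
  integer :: i, k

  call random_number(a)

  ! Unit stride: the first index varies fastest, so consecutive
  ! iterations touch adjacent memory locations.
  do k = 1, n
    do i = 1, m
      b_unit(i, k) = 2.0 * a(i, k)
    end do
  end do

  ! Stride of m elements: the second index varies fastest, so each
  ! iteration jumps m elements ahead in memory.
  do i = 1, m
    do k = 1, n
      b_strided(i, k) = 2.0 * a(i, k)
    end do
  end do

  if (any(b_unit /= b_strided)) error stop "results differ"
  print *, "both loop orders give identical results"
end program stride_sketch
```

If the short loop of 7 is moved inside, as described earlier in the thread, dimensioning the result array with that index first keeps the inner loop at unit stride.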
