Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Faster 3-D array access?

Bruce_Weaver
Beginner
4,174 Views

Hi,

I've got a very large 3-D array that I access randomly, so I am memory-bound 32% of the time.  A quick attempt to move the data into several very large 2-D arrays didn't seem to help significantly.  I'm starting to wonder about some version of equivalence statements into thousands of vectors, or some such.  There is no obvious (static) pattern of data access that would suggest a way to rearrange the indices.  Any thoughts?

thanks,

Bruce

36 Replies
Brooks_Van_Horn
New Contributor I

This problem's solution is very specific to what you are trying to do. Is the matrix symmetric? Sparsely populated? Is it a variance-covariance matrix? Is it a filter and if so what kind?

You are asking for a solution without stating the real issues. Please be more specific. For example, why do you even need a matrix? Is the problem partitionable? Without loss of generality, what drove you to 3D?

Best of luck,

Brooks

Bruce_Weaver
Beginner

It is an array of physical data, not a matrix.

"The array contains pre-calculated parameters required in an astrophysics computation."  The 3-D-ness of the array are the coordinates of the data: species name, height in the atmosphere, precomputed parameter type.

The stated real issue is: "I've got a very large 3-D array I access randomly so that I am memory-bound 32% of the time."  The replies by Jim  & John, although not a solution, are very much to the point.

 

jimdempseyatthecove
Honored Contributor III

>>The 3-D-ness of the array are the coordinates of the data: species name, height in the atmosphere, precomputed parameter type.

Is the transition of values from point to point smooth?

If so, you may be able to increase the spatial separation and interpolate the in between values. A factor of 2 in all three directions yields 8x reduction in array size.
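
For instance (just a sketch; the array name, index roles, and uniform altitude spacing here are assumed, not taken from your code), you could store only every second altitude level and interpolate linearly between them:

! Sketch only: linear interpolation in altitude on a coarsened table.
! coarse(is, j, ip): is = species, j = coarse altitude level, ip = parameter type.
! Assumes uniform coarse-grid spacing dh and that j+1 stays within bounds.
real function lookup_param(coarse, is, h, ip, dh)
  implicit none
  real,    intent(in) :: coarse(:,:,:)
  integer, intent(in) :: is, ip
  real,    intent(in) :: h, dh            ! altitude and coarse-grid spacing
  integer :: j
  real    :: f
  j = int(h/dh) + 1                       ! coarse level at or below h
  f = h/dh - real(j-1)                    ! fraction of the way to level j+1
  lookup_param = (1.0-f)*coarse(is,j,ip) + f*coarse(is,j+1,ip)
end function lookup_param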

Also, in astrophysics problems, data points tend to be sparsely located in clusters. Is it possible to partition your one very large array into several/many smaller 3D arrays (spatially), with each having a different sample-point density (the distance between sample points varying among the sub-arrays)? Note: in the atmosphere, air density varies with altitude; therefore, your sample-point density may be able to vary by altitude.

Third suggestion, with a proper use of barycenters, you might be able to use REAL(4)'s where you currently use REAL(8)'s and maintain as good or better results. (Hard to say with code unseen)

Jim Dempsey

Bruce_Weaver
Beginner

Jim:

"You can also reduce the number of page table entries (and thus cache requirements for page table) by (if permissible) selecting larger page sizes."

How do I do that?

Could be useful if I could convince the system to keep the small array (~1 Meg) in the L3 cache.

 

Bruce_Weaver
Beginner

Jim:

"If so, you may be able to increase the spatial separation and interpolate the in between values. A factor of 2 in all three directions yields 8x reduction in array size."

Does making it smaller make it faster?  When I broke the array into five 2-D arrays, it didn't seem to improve the speed.  It'll never get small enough to fit into cache until I get a machine with near memory (although I don't understand how Knights Landing is going to make the near memory act like cache, since large caches get slow as the system tries to figure out where to find the data you're asking for).  Only the spatial dimension might be reduced; the other two are data type and transition case.

"Is the transition of values from point to point smooth?"

At the moment it is, since I have fit sparser data with a spline to produce the data density I need.  This does sound like something I might do in real time.  I will soon be putting in discontinuities, but that could be done in real time as well, I suppose.

"you might be able to use REAL(4)'s where you currently use REAL(8)'s and maintain as good or better results."  Nah, this is astrophysics, most of the data are not worth double precision.  I have very few double precision variables & then, only for precision, not accuracy.

These questions have led me to take a deeper look at the VTune results, and now I don't think I understand them very well at all.  It flags a bunch of lines as memory bound, but I think that includes cache bound (32%) with only 0.1% DRAM bound.  I have one line of code listed as 33% memory bound but with zero LLC misses.  I've gotta look at the definitions of the VTune outputs more carefully.  If essentially all my memory-bound time is cache bound, then there may be little to be gained here.

puzzled.

jimdempseyatthecove
Honored Contributor III

>>Large-Page Support

For MIC in Fortran, set the environment variable MIC_USE_2MB_BUFFERS=nnnnK (or nnnnM). Works on KNC, not sure if this works on KNL. Unfortunately this won't work on Intel64 Windows. Linux has better Large Page support control. On Windows... (my limited search yields):

This is more of a do-it-yourself project, unless you can find something on the internet. Here are links to the basic information:

https://msdn.microsoft.com/en-us/library/windows/desktop/aa366720(v=vs.85).aspx
https://msdn.microsoft.com/en-us/library/windows/desktop/aa366887(v=vs.85).aspx (VirtualAlloc, also look at VirtualAllocEx, VirtualAllocExNUMA, VirtualAllocFromApp)

Scheme: reserve a block of unused virtual memory (using VirtualAlloc with a NULL base address), then commit this memory (VirtualAlloc with the same address and size, plus the attribute for Large/Huge page size):

https://msdn.microsoft.com/en-us/library/windows/desktop/aa366803(v=vs.85).aspx

Note: the above example uses C++ with exception handlers; you can rewrite it in C with your own error handling after the calls.

Then you would use C_F_POINTER to produce a Fortran pointer to the array.

(Remember to add cleanup to release the memory after use if you will be reallocating, or just let the process terminate.)
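
For what it's worth, here is an untested sketch of that scheme done directly from Fortran via ISO_C_BINDING. x64 is assumed (so the name 'VirtualAlloc' is undecorated), the array shape is made up, the constants are the Windows SDK values, and large pages additionally require the SeLockMemoryPrivilege and a size that is a multiple of GetLargePageMinimum():

! Sketch only (untested): commit a large-page buffer on Windows x64 and map
! it onto a Fortran 3-D array with C_F_POINTER.
program large_page_demo
  use, intrinsic :: iso_c_binding
  implicit none
  interface
    function VirtualAlloc(lpAddress, dwSize, flAllocationType, flProtect) &
        bind(C, name='VirtualAlloc')
      import :: c_ptr, c_size_t, c_int
      type(c_ptr),       value :: lpAddress
      integer(c_size_t), value :: dwSize
      integer(c_int),    value :: flAllocationType, flProtect
      type(c_ptr)              :: VirtualAlloc
    end function VirtualAlloc
  end interface
  ! Values from the Windows SDK headers
  integer(c_int), parameter :: MEM_COMMIT      = int(z'00001000', c_int)
  integer(c_int), parameter :: MEM_RESERVE     = int(z'00002000', c_int)
  integer(c_int), parameter :: MEM_LARGE_PAGES = int(z'20000000', c_int)
  integer(c_int), parameter :: PAGE_READWRITE  = int(z'00000004', c_int)
  integer, parameter :: nSpecies = 100, nLevels = 500, nParams = 20   ! illustrative sizes
  real(4), pointer   :: table(:,:,:)
  type(c_ptr)        :: p
  integer(c_size_t)  :: nBytes
  nBytes = int(nSpecies, c_size_t) * nLevels * nParams * 4_c_size_t
  ! In real code, round nBytes up to a multiple of GetLargePageMinimum().
  p = VirtualAlloc(c_null_ptr, nBytes, &
                   ior(ior(MEM_RESERVE, MEM_COMMIT), MEM_LARGE_PAGES), PAGE_READWRITE)
  if (.not. c_associated(p)) stop 'VirtualAlloc with MEM_LARGE_PAGES failed'
  call c_f_pointer(p, table, [nSpecies, nLevels, nParams])   ! use like a normal array
  table = 0.0
end program large_page_demo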

Additional info:

https://blogs.msdn.microsoft.com/ntdebugging/2010/02/05/understanding-pte-part-1-lets-get-physical/
(general info on page table)

If you successfully implement Large/Huge pages, please respond to this forum with a simple code example.

I am surprised that Intel Fortran doesn't have an attribute to selectively (per allocation) provide support for Large/Huge pages. (If it does, it is cleverly hidden in the documentation.) IVF has implemented the FASTMEM attribute for KNL; Intel could add an option (compatible with Windows and Linux) to indicate Large/Huge page allocation.

Jim Dempsey

Brooks_Van_Horn
New Contributor I

If you are running Windows, can you increase the RAM size? It seems like you need 32-64 GB of RAM so you need not worry about paging at all. I assume this large-data problem will persist for you for some time. If it is not too much trouble, you might look at doing a principal components analysis or a cluster analysis to get a better idea of your data structure. Like Jim was saying above, in astrophysics the database is sparse. If you are dealing with earth-centric (ECS) observations, I suggest you change your coordinate system to solar-centric (SCS) so you don't have to worry about seasons. You can always transform items back to ECS easily. This may improve your array size. Extending Jim's comments, you could look at visual clusters of, say, 10-steradian cones and write special code to handle the boundaries.

Brooks

Bruce_Weaver
Beginner

Brooks:

Thanks for your interest; if you read the posts, you'll see the problem has nothing to do with these kinds of coordinates.  It turns out that if you turn off paging, Intel Fortran will not run, even if there is plenty of memory to run the program in (at least some of the time)!

Jim:

Thanks for the references.  That will be enough to chew on for a bit.

John_Campbell
New Contributor II

Brooks Van Horn wrote:

If you are running Windows, can you increase the RAM size?

My understanding of the problem being described is that the bottleneck is the transfer between memory and cache. Increased memory will not improve this problem; it would probably make it worse. The solution is finding ways to reduce the required transfer rate between memory and cache, which I have best been able to address by changing the algorithm. Jim has also suggested other ways, which I need to understand better. All of these solution approaches can be difficult to identify. Ways to do more with what is in the cache before it is replaced can be a good approach, but unfortunately we don't have any direct control of what is there.

This problem becomes more significant when using !$OMP or vector instructions with large arrays. When learning how to use these approaches, the importance of cache management should be given more prominence.

John

Anthony_Richards
New Contributor I

You say a random function generates three indices (i,j,k) which address an element in a look-up table, and randomness is at the heart of your algorithm, whatever it is.

Let's assume the look-up table remains unchanged as a result of calls made to access one of its values.

Let's also assume that use of an accessed value will affect subsequently-generated index triples.

In this case, the speed with which the index triple is randomly generated and then turned into an array address for accessing a table value would benefit from optimisation. That is one area for investigation.

If you truly have to live with randomness, then the quickest gain in speed is obviously to reduce the size of one or more of the array dimensions. At a minimum, a factor-of-2 reduction should be aimed at in order to get a proportionate decrease in execution time.

Without giving away sensitive stuff, I am interested to know the origin of the randomness. Are you tracking something through an atmosphere which is randomly being deflected/changed in up to 3 ways at each point in its trajectory?
 

jimdempseyatthecove
Honored Contributor III

>>the speed with which the index triple is randomly generated and then turned into an array address for accessing a table value would benefit from optimisation. That is one area for investigation.

I concur with Anthony. Your large 3D array may of necessity have to be allocatable, with the size of each rank undeterminable at compile time. This being the case, the compiler cannot use a simple 3-index to linear-index conversion expression with the size of each dimension as literals or local variables. Instead it must fetch these from the array descriptor. One technique that works in this case is to continue to use your 3D array in the larger scope of your program, but when calling the compute-intensive subroutines, you can declare the dummy argument receiving the array as 1D (you pass in the sizes of the 3D array):

call foo(array3D, size(array3D, DIM=1), size(array3D, DIM=2), size(array3D, DIM=3))
...
subroutine foo(array1D, nX, nY, nZ)
implicit none
integer :: nX, nY, nZ
real :: array1D(nX*nY*nZ)
integer :: iX, iY, iZ, i
...
do iZ=1,nZ
  do iY=1,nY
    do iX=1,nX
      ! 1-based linear index of array3D(iX,iY,iZ)
      i = ((iZ-1)*nY + (iY-1))*nX + iX
      ...array1D(i)...
...

The compiler should be able to relocate the loop-invariant part of the index calculation (the ((iZ-1)*nY + (iY-1))*nX term), but you can expressly do that yourself if you wish, as shown below.
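
For example, continuing the same sketch (add iBase to foo's integer declarations; this is just the fragment above with the invariant hoisted):

do iZ=1,nZ
  do iY=1,nY
    iBase = ((iZ-1)*nY + (iY-1))*nX   ! invariant over the iX loop
    do iX=1,nX
      ...array1D(iBase+iX)...
...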

The compiler can also relocate loop-invariant code using the array descriptor, but due to register pressure it may have to resort to refetching the dimension sizes from the rank-size table of the array descriptor.

Please note, this is one tool for your toolbox. Do not be satisfied by making a single incremental improvement in performance... go for all improvements.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

For a more complete description of the above suggestion, see:

https://software.intel.com/en-us/articles/peel-the-onion-optimization-techniques

Jim Dempsey

Bruce_Weaver
Beginner

Hi Anthony,

Thanks for your interest in this issue.  If you read through my improved explanations, you'll see that the columns in the table have nothing to do with directions. E.g., "The array contains pre-calculated parameters required in an astrophysics computation."   &

"The program follows photons though a stellar atmosphere.  Periodically along the path, the physics of the local area are checked to see if one of the possible interactions with matter occurs.  The data array has the required pre-calculated parameters for possible interactions at that altitude.  Depending on wavelength & random numbers, the next altitude could be any altitude.  However, the number of possible interactions is limited to wavelengths close to the wavelength of the photon -- that is what I meant by a 'local region'."

These pre-calculated values never change.  The  selection of the next set of  data is NOT from a random selection of indices but from combinations of physical processes selected from probability distributions of those processes.  In terms of predicting the next address, that appears random.  Also I noted earlier in the discussion that I was surprised to discover that converting to a bunch of 2-D arrays did NOT speed things up, contrary to my expectations. 

Jim:

Still chewing on your suggestion (when I get a break from other issues).  I think you're saying that if I convert a 3-D array to a 1-D array, access will be faster.  But that is contrary to my experience converting from a 3-D array to 2-D arrays.  I have not tried 1-D yet.  I think that some earlier comments may be right: the computations are so fast compared to memory fetch times that doing the 3-D index arithmetic (calculating the memory location) may be inconsequential.

My last look at the 2017 VTune claims that I'm cache bound for most of my 30% memory-bound time.  If that is L1 (or L2?) bound, I suppose that's the end of that conversation.  However, VTune keeps giving me zero activity in the L2 cache, which I don't understand at all. Is that part of VTune not ready for prime time?  These are tested with an I7, which seems faster than my pair of E-5s (all obvious things being equal).  QUESTION:  What happens to the contents of the L3 caches when the OS decides to switch the thread between the chips?

 

andrew_4619
Honored Contributor III

First, I will add a disclaimer in that I will not claim any special expertise in this area; however, it is an interesting discussion, so I will add my observations/summary, which may or may not bring anything new to the party.

1) Your "database" is very large in relation in relation to cache sizes so if there is no method to pre-predict what data will be needed at the next step it seems inevitable that you will get a large number of cache misses.

2) Fetching data from main memory I would expect to equate to the processing time of several hundred CPU instructions.

3) Calculating a memory offset for your 3D arrays is but a few instructions so I would not expect that to make a significant difference.

4) If the program is memory bound that suggests the computation that you make at each 'point' is not particularly intensive in relation to point 2)

So IMO if it is not practical/possible to change algorithms for better prediction and/or ordering of the 'database', it would seem to me that you need a computer with faster memory; otherwise you are a bit stuffed. Not a very helpful conclusion......

John_Campbell
New Contributor II

Bruce,

You need to test whether faster memory will reduce the delay. Could you get access to an i7-6950X? It would be interesting to see a comparison test on a processor with more cache and faster memory.

The benchmarking that I have done for large arrays shows that the memory speed (and bandwidth) is the most critical parameter for performance.

In terms of changing the code, cache-aware algorithms would offer a much more significant improvement, but in some cases (like yours?) this is not possible. If it were possible to sort the calculations based on memory usage, that would be a good approach, but you have indicated this is not compatible with your numerical method. The other option involves storing much less data in memory and re-calculating what you now store in the large array. This would only be effective if the reduced data set approached the cache size. You could package this approach up in a routine that provides the data required for each random point and that optionally reads the large array or recalculates the values; a sketch follows below.
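
As a rough sketch of what I mean (the module, array, and routine names are only illustrative, and compute_param stands in for whatever spline/physics evaluation would reproduce a table entry):

! Sketch only: one access point that can serve a value either from the
! precomputed table or by recomputing it on the fly.
module param_access
  implicit none
  real, allocatable :: table(:,:,:)        ! (species, altitude level, parameter type)
  logical           :: use_table = .true.  ! flip to compare the two strategies
contains
  real function get_param(is, ih, ip)
    integer, intent(in) :: is, ih, ip
    if (use_table) then
      get_param = table(is, ih, ip)
    else
      get_param = compute_param(is, ih, ip)
    end if
  end function get_param
  real function compute_param(is, ih, ip)
    integer, intent(in) :: is, ih, ip
    compute_param = 0.0                    ! placeholder for the real recalculation
  end function compute_param
end module param_access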

I would be interested to read how you go with any improvements you may find.

John

jimdempseyatthecove
Honored Contributor III

Andrew,

Good points....

If Bruce can implement the Huge/Large pages, this application may potentially see a halving (or some goodly reduction) of the number of memory fetches. The reduction comes from fewer non-cached page-table accesses to update the TLB. Under the worst scenario, after the memory address is computed (from the multiple indexes), a memory fetch at that address could conceivably require 4 (maybe 5) memory fetches when using small pages. The worst scenario is observed only when none of the multi-level page-table entries are resident in the TLB caching system. In a well-behaved program, the majority of the memory references are handled with the currently mapped TLB entries, and when an entry needs to be replaced, only the lowest level of the page table will experience a TLB miss (and thus incur one additional memory fetch).

Bruce's program, as currently described (written), has a very large array, much larger than cache, with completely random access. This results in the majority of the accesses not being satisfied from cache, as well as experiencing a high level of TLB misses (Bruce can confirm this assumption with VTune).

>>What happens to the contents of the L3 caches when the OS decides to switch the thread between the chips?

In a single-threaded application, you can avoid this by affinity-pinning your process to all the logical processors of a single CPU (chip), or to a single core when you are concerned about L2/L1. In a multi-threaded application, you typically avoid this by affinity-pinning the individual threads of the process to a core or logical processor; see the example below.
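
For example (one way of doing it, assuming the Intel OpenMP runtime on Windows; the program name and mask are placeholders, and Linux has equivalents such as taskset):

rem Pin OpenMP threads, keeping them close together (Intel runtime)
set KMP_AFFINITY=compact,granularity=fine
rem or, with the standard OpenMP environment variables
set OMP_PLACES=cores
set OMP_PROC_BIND=close
rem Single-threaded: confine the whole process to the first 8 logical processors
start /affinity 0xFF YourProgram.exe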

>>with an I7, which seems faster than my pair of E-5s

What happens when you run on one of the E-5's?

Jim Dempsey
