Faster 3-D array access?

Bruce_Weaver · ‎05-28-2016

Hi,

I've got a very large 3-D array I access randomly so that I am memory-bound 32% of the time. A quick attempt to move the the data into several very large 2-D arrays didn't seem to help significantly. I'm starting to wonder about some version of equivalence statements into thousands of vectors or some such. There is no obvious (static) pattern of data access that would suggest a way to rearrange the indices. Any thoughts?

thanks,

Bruce

John_Campbell · ‎05-29-2016

If you have a large array(i,j,k), then you can consider the option of changing the order of your i,j,k to improve performance by reducing the rate of memory transfers.

This can be achieved if you can identify which subscript is varied most frequently and make it the "i" subscript. If you can identify that the first subscript "i" that changes sequentially in the inner loop, then you are likely to have a best case outcome. The effect of this change is further enhanced if the memory footprint of the part of the array you frequently addressing is about the cache size of the processor you are using, so that this memory image of the active part of the array is stored in the cache. (note a large array is significantly larger than the cache)

Your description suggests that you have a very large array that uses the subscripts randomly. This implies that the solution approach I have described is not going to be easily achieved. You could experiment with changing the subscript order or partitioning the solution to do a multi- pass approach that considers a smaller sub-set of the array. If this is possible this approach can provide improved performance from both reduced memory access times and also improved efficiency of vector instructions, if the inner "i" subscript can be used sequentially. Partitioning the problem for array subsets that are about cache size can have a dramatic effect with this type of problem.

Using array sections can appear to overcome the problem of not using the inner subscript. In this case ifort can replace the array section with stride addressing, but my understanding of this approach is that it does not improve the memory addressing of cache problem. (others may correct my assumption)

Sometimes, there are multiple stages of the analysis process and the preferred inner subscript changes. You can choose the subscript order that effects the most time consuming stage of the analysis.

If you are having a memory accessing bottleneck with a single thread problem, changing to !$OMP is unlikely to improve the solution as this can make this problem worse.

Alternatively you can choose a subscript order that best documents the calculation and accept the performance penalty.

andrew_4619 · ‎05-29-2016

It would help if you define large - type & array bounds.

Is the array "static" or "allocated"

Do you grab 'chunks or slices' of the array or just 'random' individual elements. As John suggests that might give possibilities.

Bruce_Weaver · ‎05-29-2016

Hi all,

The allocated array is often gigabytes. The subscript order is probably as good as it can get in terms of the first index moves through regions each of the size of several thousand elements. I need a single element at a time and nothing is sequential. The program takes hours to days to run so modest improvements are significant in people time.

Since I'm running flat out [3/4] on OMP threads, I wonder if I assign affinity of each thread to a specific core, it might help the cache issue as the array region of each computational thread would be different for each thread? I.e., A thread has a useful cache content but when that computational thread is switched to a different cpu, the cache contents don't follow it and the new cpu must start loading the cache again. I gather that Windows moves the computational thread about between cores on a fairly short timescale...which would require a whole new area of the array to drift into the cache of the new assigned core. I had not thought about the conflict between OS scheduling and cache management before. Is this correct?

I'm not sure how stride addressing helps except when I know a lot about the order of access.

I gather that no one thinks there is any profit in persuing the Equivalence approach.

thanks for your thoughts.

Bruce_Weaver · ‎05-29-2016

Oops, I didn't answer all the questions. the array is real of size (hundreds of thousands: few thousand: five)

mecej4 · ‎05-29-2016

I need a single element at a time and nothing is sequential. The program takes hours to days to run so modest improvements are significant in people time.

Perhaps that is an aspect that ought to be worked on? I do not know the details of your program, so allow me to pick a simpler example: looking up items (e.g., words) in a list.

If the list has N items, and you want to do m lookups, the work would be proportional to m.N/2 if the list is in random order and the distribution is uniform. If, on the other hand, the list is sorted in a preparatory step before any lookups are done, the work would be proportional to N lg N for sorting plus m lg N for the subsequent lookups (lg stands for "log₂). Thus, with m and N about 10⁶, pre-sorting speeds up the overall task by a factor of over 10⁴.

I do not know what part of your computational task, if any, is the counterpart to the sorting in my example, but you would certainly know.

Bruce_Weaver · ‎05-29-2016

The array contains pre-calculated parameters required in an astrophysics computation. Other than the regions mentioned above, there is no order to their selection, which is driven by the simulation. The table is highly ordered already & I don't see how sorting, which implies I know something about the order of access, which I've already pointed out, I don't (other than this regionality I've discussed).

John_Campbell · ‎05-29-2016

What about a faster processer for memory access. You could look at an alternative processor, as memory transfer rate and bandwidth are critical for these types of calculation. I am not sure what processor you are using but there is good comparative info at ark.info.com. The key parameters are:

Memory type: identifies the type and speed of memory transfers
Max memory bandwidth: identifies the peak throughput rate, and
cache: identifies the size of cache available.
# of cores:

Probably there are not many practical alternatives, but it can identify that the cost of changing the hardware performance can be small compared to the software development cost.

You say "Since I'm running flat out [3/4] on OMP threads" You should check the performance multiplier are you achieving (not CPU ratio) as depending on the memory bandwidth and using gb size arrays I find that 3 to 4x processing performance can be optimistic, although there is a large range of possibilities for different large memory calculations. Regarding threads vs cores; I interpret you have seen you are maxing out at 3-4 (threads?) so more may not help unless you can get a higher memory bandwidth and when threads exceeds cores the benefit is not always there when the bottleneck is memory transfers.

My capability in this area is limited and allocating threads to cores is beyond me. I have not read how this is being well controlled. The main strategy I have had most success with is trying to manage the computation to stay in the cache as long as possible. I must admit I don't understand how the L3 cache is shared between threads and cpu's but if you can see a way of localising the calculation it could help. Could you process one element at a time, by creating a smaller dataset of the information for this element? ( this all assumes that the large arrays are shared and not private, as private large arrays would provide an increased level of memory transfer to bottleneck, especially in a REDUCTION clause)

This exhausts my knowledge on this and I suspect that without more detailed information of the calculation others are not likely to provide much more. I will be interested to read what others suggest as I also find the memory bottleneck is a major issue in using omp with large arrays.

andrew_4619 · ‎05-30-2016

My knowledge of OMP is next to nothing so in that regard i'm out on this one. However as a general point and maybe you have already covered such things in great depth but....

Is it that you:

1] select a 'random data point'

2] do some calculation on that data point

3] repeat 1-2

If that is the case and memory access is big bottleneck it would suggest 2] is a quick process. If 1] is not dependent on the result of previous 2]'s maybe you can only consider selections 1] that are within a defined partition of the whole data and deffer processing of other points until you have a certain number of points pending to process within some another partition of the whole data. That would prevent a lot of data thrashing maybe.

That said without a deep knowledge of what your program actualy does we are all guessing really but optimised algorithms in many instances is the way be make huge gains in speed....

Bruce_Weaver · ‎05-30-2016

John:

Sure. In general, I'm running with the fastest Xeons & memory of a year ago. The code is designed to scale well at any number of threads and does up to testing on a PHI. The limit of gain, however, seems to universally top out at about 3/4 the max number of threads, independent of the number of cores. Typically I run between 8 & 24 threads, depending on the machine.

My discussion of 'regions' is already our attempt to limit, for each thread, the extent of its span in the array -- we don't fuss with the array, this just greatly limits the number of calculations and, hopefully, limits the cache failures. The array is public.

Andrew_4619:

The element is selected from the requirement of the current physics calculation. But, besides the 'regional' issue discussed above, it is not predictable w/o running the physics, which is what the code is doing. The physics calculation reaches a point where it needs some specific data from the array (three or four real numbers) to complete the calculation. The thread cannot proceed to the next calculation until the current one is complete.

This discussion thread is not meant to be about how to change the physics code, it is about tactics in accessing data from large arrays. I guess what I'm learning is that data access has little to do with calculating the memory location & everything to do with the excruciatingly slow speed of memory. I.e., creating five 2-D arrays did nothing to speed things up.

Can anybody address the affinity issue -- what happens to the cache when the OS changes cpu for the computational thread? Microsoft seems mute on the issue.

thanks

JohnNichols · ‎05-30-2016

Dear Bruce:

When you run into problems in Fortran - often it helps to look at other languages for ideas. Fortran is super fast - granted, but LISP is so much better at conceptual data analysis if you are prepared to spend the time. Winston and Horn's book is worth reading at night - still one of the best. Not for the LISP but for the ideas.

You have an array of let us say 1 million, by 1000 by 5 -- so you have 5 billion entries. Break it into 5 worlds - each world is a separate data set. Use a separate processor to look up each world. Instead of mapping the data onto linear vector map it to a sphere, the first offset gives you a ring on the sphere and then search on the ring, likely your next search location is close on the ring - so offset from the last location do not return to to 111. I have no doubt you could then order each ring as mecej4 suggests.

I would then run a small problem and look for the statistical data -- there will be patterns in your search because there are patterns in everything - whether they are real or not - just good for searching. Think of Gauss and the addition problem.

Or think of the human mind, you want to know the lat and long of Chicago - no idea - but if I say Chicago your mind know where it is

if you break the problem into a 7 dimension problem and use 25 counters in each dimension to represent your i,j,k then you problem is solvable

if perhaps if you make the 25 counters alpha - and then do some statistical analysis to find the shortest path to a particular 7 element word, by ordering the dimensions

Good luck

John

jimdempseyatthecove · ‎05-31-2016

>>The element is selected from the requirement of the current physics calculation.

Without knowing the full problem, it is difficult to offer constructive advice. Some hints have been offered. If I were to assume "current physics calculation" (together with specified volume of data) implies you have many physics calculations. For example N-Body. You mentioned the problem requires random access (which means this is atypical of N-body) John Campbell suggested sorting the access, to which you though impractical. You may have failed to understand that he was suggesting to sort by address within the array, and then selectively choose which order to choose to perform your physics calculations. It may help you to understand why this may be effective.

The memory fetch architecture operates via the virtual memory page table. This page table is hierarchical. This means that not only is cache hierarchical, but Virtual Memory mapping is hierarchical. On a cache miss, the Virtual address is translated to a physical address using multi-level page tables. The entire page table cannot be cached. When a page table level page entry is not in cache then a memory access will be required. Depending on page size selected, (64-bit 48-bit virtual address space)

4 KiB pages 5 levels
2 MiB pages 4 levels
1 GiB pages 3 levels

By sorting the calculations by addresses of arguments, you can reduce the number of page table misses.
You can also reduce the number of page table entries (and thus cache requirements for page table) by (if permissible) selecting larger page sizes.

An additional tactic to test, is to configure pairs of threads per core, with one thread performing fetch ahead, a sorted quazi-random read and copy to ring buffer), while the other thread runs the selected calculations in order. You might be able to get by performing the read only, performing an innocuous operation that the compiler optimization won't optimize out, and examining a progress indicator of the compute thread such that you do not get too far ahead.

Jim Dempsey

Bruce_Weaver · ‎05-31-2016

Hi,

The program follows photons though a stellar atmosphere. Periodically along the path, the physics of the local area are checked to see if one of the possible interactions with matter occurs. The data array has the required pre-calculated parameters for possible interactions at that altitude. Depending on wavelength & random numbers, the next altitude could be any altitude. However, the number of possible interactions is limited to wavelengths close to the wavelength of the photon -- that is what I meant by a 'local region'. That will change sometimes as the photon is scattered to very different wavelengths (energies). So the cache should be good for awhile (unless it is lost when Windows switches the thread to a different CPU. Hence the questions about the affinity issue).

Some time back I asked about having some control over cache behavior, & was told there was no way so I'm surprised at

"You can also reduce the number of page table entries (and thus cache requirements for page table) by (if permissible) selecting larger page sizes." This might be helpful for one of the big bandwidth hogs is a lookup for polynomial coefficients. It is always the same size & if I can stick it in (L3, all threads access it) cache, it would be a big help. How do I control the page sizes?

I don't know how to coordinate your last suggestion between cores but the trade-off would be losing the 1/2 core of compute power I would give up (since I get about 1.5 cores worth out of each dual threaded core). Since the OpenMP threads are not synchronized, I suppose that suggests a box with fewer cores/chip but the largest size shared L3 caches. Of course the near memory in Knights Landing, should I ever see on, might help that.

JohnNichols · ‎05-31-2016

Try CUDA

John_Campbell · ‎05-31-2016

Bruce,

I think we are at a similar position in trying to identify how to get better efficiency out of omp calculations, where the bottleneck is the transfer rate between memory and L3 cache.

I have found that if I can do more calculations on the memory info that is in the cache, then I can get better performance. For my large skyline solver, by dividing the problem into half cache size blocks (one shared and one private) I have managed to increase the computation per rate of change of the cache and get a better rate of solution. I have tested up to a 23gb array so lots of blocks. However, I do not know how the L3 cache is shared between multiple threads, which can be an issue as the thread count is increased.

My impression is the linking of threads to cpu's does not appear to be a significant issue, as the bottleneck appears to be transfers between memory and (presumably) L3 cache and not between L1 to L2 to L3 caches. If you were able to control the cache behaviour (as you suggested), I suspect you would do a worse job than what the processor does at the moment. It appears to me that the trend with more recent i7 processors to a reduced size cache is going in the opposite direction to what is wanted for large array usage. It would be good to have a 1gb L3 cache, which is what I understand is a Phi processor.

I need to read Jim's comments a few more times to understand what he is suggesting, although it was mecej4's suggestion to sort the computation. My blocking strategies for my solver did involve significant re-ordering of the computation to reduce the rate of change of cache information. Your need for random computation could be significant problem for these approaches, but may be well worth investigating.

Could another approach be to change your shared array data and so provide some partially computed results ? This may result in larger arrays.

Another problem I have is load balancing between threads, which can be seen by fluctuating %CPU usage. I use SCHEDULE(DYNAMIC) but there is a compromise between required small cache blocks of computation (which have variable run times) vs larger packets of computation for each thread that could mitigate this problem.

An interesting related performance problem is that when the memory transfer bottleneck is the problem, AVX vector performance also suffers. AVX needs the vectors to be in the cache. Similarly, I would not expect the problem to respond to a CUDA approach, as the problem is memory transfer rates and not processor rates. I would expect it would make it worse, as it would increase the memory transfer rate demand for cache shared between more threads.

I hope my review is of some help, as this is a problem I am trying to understand and solve myself and if not solve, at least improve.

John

jimdempseyatthecove · ‎06-01-2016

>>The data array has the required pre-calculated parameters for possible interactions at that altitude.

What is the time cost to re-compute verses fetch from RAM?
Can the data array be reduced in size by holding partially pre-computed values (and you re-compute the remainder)?

Bear in mind that multiply and divide are a handful of clock ticks in latency (approaching 1 in throughput in some cases), whereas memory access has 100's of clock cycle latencies.

Jim Dempsey

Bruce_Weaver · ‎06-01-2016

Jim: How do I control the cache page sizes? I expect it could be problematic but it seems worth a try to play with. I've thought about the compute vs mem access tradeoff a bit. Most of the precomputation requires access to data as well,although a much smaller data set -- still bigger than cache but I think I would get many fewer cache fails. I've been a bit reluctant to pursue this as the target machine is Knights Landing where I'm hoping that the near memory, acting as cache (will it be cache fast?), will take care of the problem. Also, it is starting to be specific-processor oriented; for a code that is meant to be transportable. Still, I think I'll take a closer look. I think the issue is whether the small arrays I'll need for the computation will fit & stay in cache -- they'll be several meg. Thread balancing is not an issue; each thread run the whole program (after alotta serial preliminaries) and just sums the results of each photon in a shared output array. John: The cache to cache discussion may explain why my pair of E5s (reloading the L3) run significantly slower than my I7 (shared L3). I assume my observed fluctuating CPU usage is due to the OS swapping the tasks about since I only run about 3/4 the available threads & reserve the rest, since I don't gain after 3/4, for keeping the OS responsive & plotting tasks & such. Both: I read some time ago that if the cache gets too big, too much time is spent looking for the data in cache. Has that been solved or is that still an issue? thanks

Steven_L_Intel1 · ‎06-01-2016

What I am not seeing in this thread is any indication that you ran the program under Intel VTune Amplifier XE to see what it is actually doing. It looks to me as if you're just guessing as to what the problem might be. The whole point of a profiler such as VTune is to get hard data on what is going on and where. VTune can give you loads of memory use and parallelization info.

Bruce_Weaver · ‎06-01-2016

Steve,

"I've got a very large 3-D array I access randomly so that I am memory-bound 32% of the time. "

No guesswork here; I use VTune frequently. I used the beta 2017 VTune to discover detailed mem access delays & used VTune in general to locate exactly the lines of code (those accessing data from the array) that were causing the hangups....basically when I access one of two arrays. One of these arrays is GBytes but the other is just 1.1Meg & I would love to find a way to stick it permanently in cache because accessing it uses about half of my memory access time.

IS THERE A WAY to stick in cache? I've always been told there is no way to do this but it would solve half of my problem.

Cheers,

John_Campbell · ‎06-02-2016

Jim,

>> Bear in mind that multiply and divide are a handful of clock ticks in latency (approaching 1 in throughput in some cases), whereas memory access has 100's of clock cycle latencies.

Could you explain this a bit more:

Is the "100's of clock cycle latencies", which I interpret as delay, for transferring a "page" of memory between physical memory and cache ? So if the information required to generate this memory page is more concise and remains in cache and also the "page"'s of information are being randomly retrieved, then we can come out ahead, in terms of run time. (I have not considered this approach before)

Another feature I don't understand is how is the L1, L2 and L3 cache shared between multiple threads. If I have "8mb of smartcache", running 8 threads on an i7-4790k, each doing unrelated calculations, do each get about 1mb of cache ? (assuming they are all using different information) If however, the 8 threads are doing related calculations, so the part of the SHARED(array) is being cached and used by all threads, does this improve the situation? (I am trying to size the active parts of the shared and private arrays for each thread so not to flood the cache or at least minimise cache-memory transfers). The general operation of multiple threads is each thread uses it's own private memory and might use the same shared memory. With a strategy of squeezing this information into cache, I try to partition the calculation to do the most with these cache size chunks, but how many chunks can I put into the cache for 8 threads?

I also find that with more recent processors (4th gen +) that provide hyper-threading, it can be a good compromise to set the number of threads = number of cores. You loose some performance but the pc is more responsive for other use. (This apparent performance where 4 threads get 80% performance could be due to hyper-thread conflicts or simply increased cache transfers)

I have done an interesting test of what could be achieved by omp coding, by running multiple single thread processes on the same pc to replicate the memory transfer bottleneck when all threads are using different pages of memory. I think it does show the impact of memory bandwidth, as an indicator of what could be achieved with little memory sharing. Is this a reasonable test or am I totally wrong ?

John

jimdempseyatthecove · ‎06-02-2016

L3 is shared amongst all cores (and threads therein) within the CPU.

Depending on CPU model, L2 is either exclusive to core, or shared by two cores. Look at the data sheet.

L1 is almost always shared by the threads within a core (2 on IA32/Intel64, 4 on KNC/KNL)

In decades ago division took a long time, square root and sin/cos took much longer. Therefore algorithms were developed to save the calculations for use later (assuming enough memory). The memory fetch time (no cache on some of these systems) was much shorter than the time to re-calculate the data. If your code has a history going back several decades, it may have been designed under the design rules of avoiding division, square root, sin, cos ... if at all possible. For today's CPU this programming advice is not necessarily true.

Your observation is that the bottleneck is the memory I/O to the point that you have diminishing returns as you throw more threads at the problem. The solutions you have are:

a) replace the system with one that has higher memory bandwidth
b) replace the algorithm with one that has lower memory bandwidth requirements
c) both

Jim Dempsey