Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

how to optimize a loop with nested indexing

Laszlo_Ladanyi
Beginner
706 Views
I have a very simple loop in C:
for (i=0; i < len; ++i) {
    beta[index[i]] += d * value[i];
}
In this loop beta and value are double arrays while index is an integer array. beta itself can be a very long array (potentially millions of elements), but len is typically much shorter, say, 5% of the length of beta. Of course, all the arrays are independent of each other. We can also assume that no two entries in index are the same. What bugs me is that no matter what I do, nothing seems to help. So far I have tried using the restrict keyword, specifying #pragma ivdep, manual unrolling, prefetching (though I may have applied these last two with the wrong unroll factor / prefetch lookahead), and even tried using MKL to first gather the values to be updated, do the update with daxpy, then scatter the results. Any suggestions on what could be done with this? Thanks, --Laci
0 Kudos
11 Replies
TimP
Honored Contributor III
706 Views

There isn't hardware support for parallel scatter with potentially duplicate index values until AVX-512. #pragma ivdep or #pragma simd might hint to the compiler that you don't care about duplicates. When "vectorization" occurs on a current host (e.g. for AVX2), it means the right-hand side is vectorized and then scattered in scalar fashion, which is OK.
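
A minimal sketch of what that hint could look like (assuming, as stated in the original post, that no two entries of index are the same, so ignoring dependences is safe; the function name is just for illustration):

/* Sketch only: tell the compiler the scatter carries no dependences. */
void update(double *restrict beta, const double *restrict value,
            const int *restrict index, double d, int len)
{
#pragma ivdep            /* or: #pragma simd / #pragma omp simd */
    for (int i = 0; i < len; ++i)
        beta[index[i]] += d * value[i];
}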

There are posted white papers about indirect prefetch for Intel Xeon Phi.  As you hint, the efficiency of software prefetch, supposing that this is the issue, depends on finding an effective prefetch distance and on sufficient data locality so that occasional prefetch, ideally once per cache line, is enough. You would issue software prefetch only for the indirect operands, as hardware prefetch would apply to the linear ones.
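
For illustration, a software-prefetch variant might look like the sketch below; PF_DIST is a hypothetical placeholder that would have to be tuned, and only the indirectly addressed beta elements are prefetched (the linear value and index streams are left to the hardware prefetchers):

#include <xmmintrin.h>   /* _mm_prefetch */

#define PF_DIST 16       /* hypothetical prefetch distance, tune experimentally */

void update_with_prefetch(double *restrict beta, const double *restrict value,
                          const int *restrict index, double d, int len)
{
    for (int i = 0; i < len; ++i) {
        if (i + PF_DIST < len)
            _mm_prefetch((const char *)&beta[index[i + PF_DIST]], _MM_HINT_T0);
        beta[index[i]] += d * value[i];
    }
}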

0 Kudos
McCalpinJohn
Honored Contributor III
706 Views

This is similar to the kernel that shows up in sparse matrices stored in compressed row format, so lots of work has been done on it over the years...

The performance tuning strategy depends on the size of the arrays and the sub-range of values accessed relative to the cache sizes on the system.

Before considering a tuning strategy, it is important to have a reasonable model of what performance to expect so that you can compare it with the observed performance.   The expected performance depends on lots of factors:

  • How big are the arrays relative to the various caches in the system?
  • Do you expect the values of "index" to show any strong locality features?
    • Are consecutive values of "index" rarely/sometimes/often very close?
    • What is the range of minimum to maximum values of the "index" vector relative to the size of the "beta" array?
  • How are the "beta", "value", and "index" arrays used in code before and after this loop?
  • Are you willing/able to use multiple threads to execute this loop? (e.g., using OpenMP)
  • What hardware is this being run on?

For array sizes bigger than the L1 cache, codes such as this are "memory-bound".  The performance goal is to ensure that the execution is (at least close to) bandwidth-limited, rather than latency-limited.   A code will become bandwidth-limited if the available concurrency matches or exceeds the latency-bandwidth product of the system that you are running on.

What does this mean?  A concrete example may help.   On my Xeon E5-2680 ("Sandy Bridge EP") systems, the local memory latency in each socket is about 68 ns when the chip is running at its maximum all-core Turbo frequency of 3.1 GHz.  The peak DRAM bandwidth for each socket is 51.2 GB/s.    Multiplying these together gives a latency-bandwidth product of 3482 Bytes, which rounds up to 55 cache lines.    This means that you need 55 cache lines "in flight" (i.e., in the process of being moved from the DRAM to the CPU, or vice-versa) at all times if you want to fully tolerate the memory latency and keep the DRAMs completely busy transferring data.   (It is not possible to keep the DRAMs 100% busy, but it is possible to reach ~90% utilization, so this approach does give a good estimate of the concurrency required.)    On this system, a single core can only directly support 10 Data Cache misses at any time, which is clearly a lot less than the 55 cache lines that need to be "in flight", so you will need to use multiple cores to reach maximum speed.  The system has 8 cores, so it needs to sustain about 7 cache lines "in flight" from each core in order to satisfy the concurrency requirement.  This is less than the 10 L1 Data Cache misses that are supported, so the hardware *can* provide enough concurrency -- we just need to make sure that the code is implemented in such a way that the cores actually *do* provide enough concurrency when executing this kernel.

The code has two contiguous access streams and one indirectly accessed stream that is probably mostly not contiguous.  The hardware prefetchers will notice the two contiguous access streams and prefetch additional cache lines along those streams to increase the number of cache lines "in flight".   The non-contiguous stream will require either OOO mechanisms or software prefetch to create extra concurrency.

So getting back to my original point, the first thing I would do is measure the performance with a single thread, then add an OpenMP directive and measure the performance with each of the available thread counts (including HyperThreading, if it is enabled).   If HyperThreading is enabled, I would run the sequence two different ways:

  • First with KMP_AFFINITY="verbose,granularity=fine,compact" 
    • This will "pack" threads together, so OMP_NUM_THREADS=2 will put both OpenMP threads on one physical core, OMP_NUM_THREADS=4 will pack the 4 threads onto 2 physical cores, etc.
  • Then again with KMP_AFFINITY="verbose,granularity=fine,compact,1"
    • This will put threads on separate physical cores (on the same chip) until all the cores have one thread assigned, then it will assign threads to the other "logical processor" on each core (still on the same chip) until all the cores have two threads assigned.

I would measure the execution time of the loop using calls to "gettimeofday()", compute the elapsed time for the loop, compute the amount of data moved for the loop, compute the corresponding average bandwidth, and print those values.   In most cases of memory-bound codes, the bandwidth will approach an asymptotic value as the number of OpenMP threads is increased.   The bandwidth that the hardware should be able to support depends on the array sizes, the location of the data in the memory hierarchy when the kernel begins execution, and (of course) the details of the hardware that you are running on.
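
A minimal sketch of such a harness (assuming unique index values so the OpenMP scatter is safe, and counting one read of value and index plus a read-modify-write of beta as the traffic estimate; function names are illustrative):

#include <stdio.h>
#include <sys/time.h>

/* Sketch: time the update loop and report an approximate bandwidth. */
static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1.0e-6 * tv.tv_usec;
}

void timed_update(double *beta, const double *value, const int *index,
                  double d, int len)
{
    double t0 = now_sec();
#pragma omp parallel for
    for (int i = 0; i < len; ++i)
        beta[index[i]] += d * value[i];
    double t1 = now_sec();

    /* Rough traffic: 8 B value + 4 B index read, 8 B beta read + 8 B written. */
    double bytes = 28.0 * len;
    printf("time = %g s, approx bandwidth = %g GB/s\n",
           t1 - t0, bytes / (t1 - t0) * 1.0e-9);
}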

You need to apply knowledge of your specific configuration to come up with the right formula for the "expected" performance.  For example, if "len" is small enough that the "index" and "value" arrays can be expected to fit some level of the cache, and if they were recently used so that they should actually start out in the cache, then the performance will probably be limited by the data motion for the values of the "beta" array, which are likely to be found either in a level of cache that is "further out" or in memory.

Memory bandwidth is relatively easy to look up from the product description.  Maximum expected cache bandwidth is not so easy to find.   For Sandy Bridge/Ivy Bridge cores I typically see sustained L3 bandwidth of about 7 Bytes per cycle per core when all of the cores are loading/storing data from/to L3 at the same time.  L2 bandwidth is about twice this value.     For Haswell/Broadwell cores, the values should be close to twice these values -- something like 12 Bytes/cycle to/from the L3 for each core, and something like 25 Bytes/cycle to/from the L2 for each core.

0 Kudos
KitturGanesh
Employee
706 Views

Nice writeup, John - thx.

_Kittur

0 Kudos
jimdempseyatthecove
Honored Contributor III
706 Views

Also, since the multiplication has a longer latency than the addition, you might try to vectorize the multiplication on its own (without involving a scatter):

for (i=0; i < len; ++i) {
    temp[i] = d * value[i];      /* temp is a double scratch array of at least len elements */
}

for (i=0; i < len; ++i) {
    beta[index[i]] += temp[i];
}

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
706 Views

If len is large enough that temp falls out of L1, then break it down into L1-cache-sized chunks.

Jim Dempsey

0 Kudos
KitturGanesh
Employee
706 Views

I was about to point that out but you did, thanks Jim!

_Kittur

0 Kudos
Laszlo_Ladanyi
Beginner
706 Views
John, thanks for the nice writeup. I knew parts of that info, but had not seen it written down so nicely and concisely before. It is really helpful. As for your questions: this loop is indeed in a sparse matrix computation :-). So it is embedded in another loop like this:
for (j=0; j < cnt; ++j) {
    d = update_multiplier[j];
    row = update_row[j];
    index = allindex+begin[row];
    value = allvalue+begin[row];
    len = alllength[row];
    for (i=0; i < len; ++i) {
        beta[index[i]] += d * value[i];
    }
}
In words: we have to update a dense vector (whose size is the number of columns) with "update_multiplier * A". cnt is much smaller than the number of rows in the matrix (frequently below 10), therefore the sparse update is more efficient than anything I could gain from hardware-assisted dense computation. Sometimes there is some locality within index, i.e., some elements in index are numbers close to each other, but that's more the exception than the rule. beta is definitely big; it does not fit into L1, and most likely not even the range spanned by the smallest/largest index values does. value and index are not going to be used again (we are looking at a different row in the next iteration). But beta is accessed again, though whether it is the same elements as in the previous row or not is not known. If I had to bet, I'd bet that it's usually different elements. I'm testing on an i7-4770, but the target is really all sorts of CPUs... Still, if I can get code that runs faster on Intel, it's worth it for me to create a platform-dependent ifdef.

Jim, that's a very good and logical suggestion, thanks! Unfortunately it does not work. It goes against all logic, but it has actually slowed down the code from ~640s to ~760s (according to VTune). I have used -vec-report3 and it did show that the first loop was vectorized. And I have looked at the assembly code and I can see there, too, that the code is vectorized. I'm leaving in a matter of hours for the rest of the week, but when I'm back I can post detailed measurements done by VTune. Any suggestions on what I should post that would help figure out why your suggestion does not work? (I have gathered the data with 'amplxe-cl -collect hotspots ...'.) Thanks, --Laci
0 Kudos
jimdempseyatthecove
Honored Contributor III
706 Views

The i7-4770 supports AVX2. AVX2 supports hardware gather (scatter is still simulated until AVX-512).

While the second loop of my post #5 will rely on this for performance (assuming you compile for at least AVX2), that loop is performing:

gather
add
scatter

You might want to experiment with separating the operations

double temp[SizeFitsInL1Cache];   /* scratch buffer sized to stay in L1 */
for (int j = 0; j < len; j += SizeFitsInL1Cache) {
  int iEnd = std::min(j + SizeFitsInL1Cache, len);
  for (int i = j; i < iEnd; ++i) {
    temp[i - j] = beta[index[i]] + d * value[i];   /* gather + multiply-add */
  }
  for (int i = j; i < iEnd; ++i) {
    beta[index[i]] = temp[i - j];                  /* scatter */
  }
}

*** Also, old C programmers need to adapt: use "for (int i = ..." whenever possible. This places the scope of the loop control variable inside the loop, meaning it need not be kept up to date for use outside the loop. This can improve the efficiency of the optimization.

Jim Dempsey

0 Kudos
TimP
Honored Contributor III
706 Views

I haven't seen an explanation of how AVX2 compilation makes the decision between simulated gather and use of the gather instructions.  The Haswell implementation of gather was acknowledged not to be a significant performance improvement.  I've seen Intel compilers generate either simulated gather-scatter or hardware gather/simulated scatter.  "Simulated" means implemented with AVX insert and extract instructions, keeping memory references explicitly in order.

In my benchmarks on a Haswell laptop, gather-scatter depends more than usual on optimizing the degree of unroll (-Qunroll4 is a good start).
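
With the Intel compiler the unroll factor can also be requested per loop rather than globally (a sketch; the factor 4 is just a starting point to tune from, and the function name is illustrative):

/* Sketch: request a specific unroll factor for this loop only. */
void update_unrolled(double *restrict beta, const double *restrict value,
                     const int *restrict index, double d, int len)
{
#pragma unroll(4)
    for (int i = 0; i < len; ++i)
        beta[index[i]] += d * value[i];
}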

If the loops are likely to hit some recently accessed cache lines, it may strengthen the case for trying to issue software prefetch only once per 64 or 128 bytes, with a generous prefetch distance.
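
One hypothetical way to approximate "once per cache line" when consecutive index values show some locality is to skip prefetches that would fall into the same 64-byte line as the previous one (a sketch only; PF_DIST is again a tunable placeholder):

#include <stdint.h>
#include <xmmintrin.h>

#define PF_DIST 16   /* hypothetical prefetch distance */

void update_line_prefetch(double *restrict beta, const double *restrict value,
                          const int *restrict index, double d, int len)
{
    uintptr_t last_line = 0;
    for (int i = 0; i < len; ++i) {
        if (i + PF_DIST < len) {
            uintptr_t line = (uintptr_t)&beta[index[i + PF_DIST]] >> 6;
            if (line != last_line) {   /* prefetch at most once per cache line */
                _mm_prefetch((const char *)(line << 6), _MM_HINT_T0);
                last_line = line;
            }
        }
        beta[index[i]] += d * value[i];
    }
}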

If you are trying to use VTune to analyze the effectiveness of splitting the gather and scatter references into separate loops, you will need to analyze L1D and fill buffer activity. I would have thought the splitting to be a long shot, based on avoiding reads from cache lines that are still in fill buffers and allowing for more delayed write-back.  Not having seen this discussed by experts, it may not be a common consideration.

0 Kudos
JenniferJ
Moderator
706 Views

Check this article: https://software.intel.com/en-us/articles/understanding-gather-scatter-instructions-and-the-gather-scatter-unroll-compiler-switch and see if you can use the option.

On Windows, the option is /Qopt-gather-scatter-unroll:n

Jennifer

0 Kudos
TimP
Honored Contributor III
706 Views

That article is specific to MIC KNC (Knights Corner).

0 Kudos
Reply