velvia
Beginner
80 Views

Memory bandwidth and scalar product

Hi,

It is well known that the performance of the scalar (dot) product is limited by the memory bandwidth of the machine.

auto n = std::size_t{100000000};
auto v1 = new double[n];
auto v2 = new double[n];

auto sum = double{0.0};
for (auto k = std::size_t{0}; k < n; ++k) {
  sum += v1[k] * v2[k];
}

delete[] v1;
delete[] v2;

What is surprising is that this loop still benefits from parallelization. If I throw a #pragma omp parallel for just before the loop, I get a 3x speedup on my Core i7-4850HQ. That is quite surprising for an algorithm known to be memory limited.

Here is my guess at the explanation:

- The memory prefetcher is there to hide the latency of memory access, but it can't prefetch more than one page at a time (about 4 KB). Running 4 prefetchers at the same time makes the code faster because, with one thread, the bandwidth of the CPU <-> RAM connection was not saturated.

- A similar story around the Translation Lookaside Buffer (TLB).

Is either of these explanations correct?

PS: The original message, which used std::vector, has been edited because it did not show the claimed behaviour: std::vector initializes memory to 0, which warms up the cache.
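To make the pitfall in the PS concrete, here is a hypothetical side-by-side sketch (not the original benchmark): std::vector's zero-initialization touches every page at construction time, while new[]-allocated memory is left untouched until the first access inside the timed loop.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical illustration: the two allocation styles fault their pages at
// different times, which changes what a timed loop over them measures.
double allocate_two_ways(std::size_t n) {
    std::vector<double> warm(n);   // zero-initialized: every page is faulted up front
    double* cold = new double[n];  // uninitialized: pages fault on first access
    cold[0] = warm[0];             // first touch of 'cold' happens only here
    double out = cold[0];
    delete[] cold;
    return out;                    // 0.0, taken from the zero-initialized vector
}
```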

5 Replies
Martyn_C_Intel
Employee

Hi,

    Your observation is surprising to me, too. I agree with your expectation that the large dot product will be bandwidth limited. Once you are bandwidth limited for a very large array, more prefetching doesn't help.

On some architectures, the available memory bandwidth increases with the number of cores used (Intel(R) MIC Architecture, for example). On a dual-socket Intel(R) Xeon system, there is typically a benefit from having at least one thread on each socket. But on a single-socket Core i7, I would expect minimal benefit from running more than 1 thread. I ran a quick test here on a Core i5 system that supports Intel AVX2, and that's what I found.

   Some things to check:

     Your OpenMP parallel directive should include a clause  reduction(+:sum)

     The loop should be vectorized as well as parallelized - check the optimization report. I compiled and ran on Linux using:

 icc -std=c++11 -xcore-avx2 -qopenmp dotp.cpp -qopt-report-file=stderr -qopt-report-routine=main -qopt-report-phase=vec,openmp; time ./a.out

     Try measuring the memory bandwidth using the STREAM benchmark, for one or more threads (set OMP_PROC_BIND=spread or KMP_AFFINITY=scatter). That would show whether you are getting additional bandwidth from additional threads.

     Is the code fragment above embedded in an outer loop or other more complex code? It's possible the compiler might then make other optimizations or transformations that might change the behavior.

     Incidentally, on what OS are you running?
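The bandwidth check suggested in the list above can also be sketched directly in code. This is a hypothetical helper, not part of the thread's benchmark: it streams both arrays once through a dot-product loop and reports the effective bandwidth in GB/s. The vectors are initialized at construction, so every page is faulted before the timed region (the very effect discussed in this thread).

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical bandwidth probe: returns GB/s for a streamed dot product over
// n doubles. Both arrays are read exactly once inside the timed region.
double measure_read_bandwidth_gbs(std::size_t n) {
    std::vector<double> v1(n, 1.0);  // pages faulted here, outside the timing
    std::vector<double> v2(n, 2.0);

    auto start = std::chrono::steady_clock::now();
    double sum = 0.0;
#pragma omp parallel for reduction(+ : sum)  // a no-op unless built with -fopenmp
    for (std::size_t k = 0; k < n; ++k) {
        sum += v1[k] * v2[k];
    }
    auto stop = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(stop - start).count();
    double bytes = 2.0 * static_cast<double>(n) * sizeof(double);
    volatile double sink = sum;  // keep the reduction observable to the compiler
    (void)sink;
    return bytes / seconds / 1.0e9;
}
```

Comparing the result for OMP_NUM_THREADS=1 and 4 would show whether extra threads actually buy extra bandwidth on a given machine.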

 

Martyn_C_Intel
Employee

That is very close to what I was running.

I have run your code on both Core i5 and Core i7, with different versions of the Intel compiler, and I cannot reproduce the behavior you report. I can see a slight speedup with OpenMP, perhaps 10-20%, as I would expect. I also tried building with gcc and the behavior was similar, though a little slower. It doesn't ever complete in 0.03 seconds. I even tried a different timer, but nothing changed. The nominal bandwidth for the Core i7 processor I used was the same as for yours.

I was only able to see a substantial speedup with OpenMP when I moved to a Xeon system with two sockets.

I don't know what to conclude; is there something special about your system and its memory subsystem, or its clocks? Or is it a Xeon masquerading as Core i7? At least, it's better to have it running too fast than too slow.

velvia
Beginner

Hi Martyn,

Here is the full code so everybody can play around with.

DISCLAIMER: This code is not the right one and should be ignored. I leave it here so everybody can understand Martyn's answer. Look at the code that is in my next message.

#include <iostream>
#include <vector>
#include <chrono>

int main(int argc, const char *argv[]) {
    auto n = std::size_t{100000000};
    auto v1 = std::vector<double>(n);
    auto v2 = std::vector<double>(n);

    auto start_time = std::chrono::high_resolution_clock::now();
    auto sum = double{0.0};
#pragma omp parallel for reduction(+ : sum)
    for (auto k = std::size_t{0}; k < n; ++k) {
        sum += v1[k] * v2[k];
    }
    auto end_time = std::chrono::high_resolution_clock::now();
    auto time = std::chrono::duration_cast<std::chrono::nanoseconds>(
                    end_time - start_time).count();

    std::cout << "Time: " << time / 1.0e9 << std::endl;
    std::cout << "Sum: " << sum << std::endl;

    return 0;
}

Here are my answers to your questions, and my remarks:

- I have experienced this on both OSX and Linux on the same machine. So I don't think it is related to the OS.

- I have also tested this program on an Amazon Web Services c4.2xlarge instance. It is always difficult to know how these machines work, but when compiled with g++ -std=c++11 -Ofast -march=native main.cpp -o main, it takes 0.104 s to complete and 0.031 s if I use OpenMP. I agree that in this case I have no proof that all the threads are running on the same socket. But on the other hand, whatever socket the threads are running on, the memory still has to go through the socket that is attached to the RAM holding the vectors.

- I have tried disabling the vectorizer, but it does not make a difference. My guess is that we are limited by the amount of memory per unit time that can be prefetched by a given core; therefore, the speed of the code does not change with vectorization.

- I don't want to use tools such as the STREAM benchmark. The "bandwidth" of a given system is not well defined anyway, so I prefer to stick to code that I understand.

I think we come back to my initial guess: this algorithm is not limited by the bandwidth of the connection between the RAM and the socket, but by the prefetcher of the core it runs on. But I would like to get a confirmation from someone at Intel.

Martyn_C_Intel
Employee

Ahhhh!      Thanks, that's educational for everyone.

TimP
Black Belt

Multi-threading stands to gain more with gcc than with icc, even when you enable vectorization (g++ -march=native -ffast-math -O3 ....), because gcc doesn't "riffle" (Intel terminology) or "interleave" (30-year-old terminology) using multiple batched sums to overcome the latency of repeated operations updating the same register. icc riffles aggressively; I see riffling by a factor of 8, thus 32 batched sums, for double on AVX2. If you put threaded parallelism on top, you are using 64 batched sums on a dual core without hyperthreading with icc, but only 2 with gcc and no simd.
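The "riffling" described above can be written out by hand. This is a minimal sketch (with 4 accumulators rather than the 32 TimP reports for icc on AVX2): independent partial sums break the serial dependence on a single register, so the adds can overlap in the pipeline.

```cpp
#include <cstddef>

// Hand-riffled dot product: four independent accumulators hide the latency of
// repeatedly updating one sum register, at the cost of reassociating the adds.
double dot_riffled(const double* v1, const double* v2, std::size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t k = 0;
    for (; k + 4 <= n; k += 4) {
        s0 += v1[k]     * v2[k];
        s1 += v1[k + 1] * v2[k + 1];
        s2 += v1[k + 2] * v2[k + 2];
        s3 += v1[k + 3] * v2[k + 3];
    }
    double sum = (s0 + s1) + (s2 + s3);
    for (; k < n; ++k) sum += v1[k] * v2[k];  // remainder iterations
    return sum;
}
```

This is also why -ffast-math matters for gcc: the transformation reassociates floating-point addition, which the compiler may not do on its own under strict semantics.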

g++ requires -ffast-math and -O3 (or -O2 -ftree-vectorize) to enable simd reduction. icpc requires -O2 and -fp-model fast (both defaults) for this purpose. So the default icpc code without omp parallel may, as you say, use nearly the entire available memory bandwidth on a single CPU, where g++ default options may require several threads to use up the memory bandwidth.

When you write a single reduction loop with omp parallel for reduction(+...) it seems that you are forbidding simd operation.  I don't expect compilers to respond well to omp parallel for simd reduction(); I think gcc will not implement the simd clause in any parallel for.
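For reference, the combined construct in question looks like the sketch below. Whether the simd clause actually produces vector code here is compiler-dependent; as noted above, gcc may not honor simd inside a parallel for at all, so treat this as the form of the directive rather than a guarantee of vectorization.

```cpp
#include <cstddef>

// Combined worksharing + simd reduction, as discussed above. The pragma is
// ignored entirely unless the file is compiled with OpenMP support enabled.
double dot_parallel_simd(const double* v1, const double* v2, std::size_t n) {
    double sum = 0.0;
#pragma omp parallel for simd reduction(+ : sum)
    for (std::size_t k = 0; k < n; ++k) {
        sum += v1[k] * v2[k];
    }
    return sum;
}
```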

icc avoids use of fma for AVX2 dot products where it can't batch the sums sufficiently to overcome the higher latency of fma vs. separate multiply and add instructions. Neither icc nor gcc always makes the best choice between fma and no-fma.

When someone says they are using STL, I would expect to see inner_product(). In the past, both gcc and icc have implemented simd more reliably with inner_product() than with omp for simd reduction (with the latest icc, either should optimize). I would put an outer parallel for outside inner_product when attempting to optimize for multi-core.
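TimP's last suggestion can be sketched as follows: std::inner_product handles the simd-friendly inner reduction over each chunk, and an outer parallel for splits the range across threads. The chunked decomposition here is an assumed illustration, not code from the thread.

```cpp
#include <cstddef>
#include <numeric>

// Outer OpenMP loop over chunks, inner reduction via std::inner_product.
// nchunks would typically be the thread count or a small multiple of it.
double dot_chunked(const double* v1, const double* v2,
                   std::size_t n, int nchunks) {
    double sum = 0.0;
#pragma omp parallel for reduction(+ : sum)  // one chunk per iteration
    for (int c = 0; c < nchunks; ++c) {
        std::size_t lo = n * static_cast<std::size_t>(c) / nchunks;
        std::size_t hi = n * static_cast<std::size_t>(c + 1) / nchunks;
        sum += std::inner_product(v1 + lo, v1 + hi, v2 + lo, 0.0);
    }
    return sum;
}
```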