Behavior of streaming (nontemporal) reads and writes in NUMA systems


The short version:

At least under some circumstances, I'm seeing the _mm512_stream_load_si512 and _mm512_stream_si512 intrinsics achieve significantly higher throughput when dealing with memory addresses on a *different* NUMA node than when dealing with memory addresses on the *same* NUMA node.

I would like to understand:

  • Why that would be (if anything I'd have expected the opposite)
  • Whether there is anything I can do to make intra-node traffic achieve the same throughput I see for inter-node traffic.

More Detailed Context:

I'm working on high-throughput, highly multiplexed signal analysis software.  To facilitate development and testing, we have an execution mode that takes a few unique input signals and replicates them in order to produce the millions of signals the system is designed to analyze in real time.  Unfortunately, using a straightforward memcpy for the data copy/replication proved to be a bottleneck in and of itself, though I don't entirely understand why.  I'm only observing a throughput of ~5.8GB/s, where something like the STREAM benchmark indicates I should have a single-threaded throughput of nearly 12GB/s.

Either way, that is beside the point of this post.  The point is that I re-implemented these memory copies in terms of manual intrinsics.  If I used _mm512_load_si512 and _mm512_store_si512, I did see some modest speed gains for our use cases.  If I used _mm512_stream_load_si512 and _mm512_stream_si512, I got very puzzling and seemingly inconsistent behavior.  Most of the time it would run slightly slower than the normal load/store version, but some of the time it would run significantly faster.  To give concrete numbers, when filling a 32GB memory buffer, memcpy would take about 5.6 seconds, normal load/stores about 4.9 seconds, and streaming load/stores either about 5.2 seconds *or* 3.5 seconds.  At a minimum we need it to complete in 5.1 seconds, so the non-streaming intrinsics tip the scales enough to get us where we need to be, but really we'd like to reliably hit the 3.5-second time.

After scratching my head a bit, I managed to make the performance of the streaming load/stores more deterministic by using numactl to control which cores and memory arenas were in use.  To my surprise, the faster time corresponds to locking the execution cores to one NUMA node and all memory allocations to the other NUMA node.
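For reference, the core-pinning half of what numactl does can also be done programmatically on Linux (the memory binding half would additionally need libnuma, which is not shown here); a minimal sketch using sched_setaffinity:

```cpp
#include <sched.h>   // Linux-only: sched_setaffinity, sched_getcpu

// Pin the calling thread to a single CPU and report where we ended up.
// Returns the CPU the thread is now running on, or -1 on failure.
int PinToCpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0 /* calling thread */, sizeof(set), &set) != 0)
        return -1;
    return sched_getcpu();
}
```

With the thread pinned, Linux's default first-touch policy then places freshly touched pages on the node local to that CPU, so pinning before the initialization pass controls both halves for the "local" case.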

Hardware:

I'm not sure which specs are relevant, and I've replicated this behavior on multiple Skylake machines, but one particular machine in question is configured as follows:

  • Two Intel(R) Xeon(R) Silver 4114 processors
  • 256GB of DDR4 RAM (at 2.4GHz)
  • Supermicro X11DPU motherboard
  • OS is CentOS 7.3.1611 with Linux kernel 3.10

Toy Experiment:

I've written a toy program to demonstrate the issue I'm observing.  It doesn't show quite the same throughput numbers I measure within my real code, but it does show similar trends.  In particular, the program runs a number of experiments.  It uses two data movement patterns, both of which are related to what the real software actually does:

  • Filling a 32GB output buffer by replicating a 262KB input buffer.  This is done via separate 65KB mini transfers.
  • Filling a 32GB output buffer by replicating a 268MB input buffer.  This is done via full 268MB transfers.

It also uses three data movement techniques:

  • Standard memcpy
  • Manual (slightly unrolled) loop using _mm512 load/store intrinsics
  • Manual (slightly unrolled) loop using _mm512 streaming load/store intrinsics

Both input and output data buffers are cache-line aligned (in this toy code as well as in the real application).  If I run the program twice, once with "local" memory and once with "far" memory, the results are:
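As an aside, the hand-rolled alignment in the source below could also be expressed with C++17's std::aligned_alloc (assuming a new enough standard library; this is just a sketch of the alternative, not what the listing below uses):

```cpp
#include <cstdlib>
#include <cstddef>

// Allocate `count` elements of T on a 64-byte (cache line) boundary.
// std::aligned_alloc requires the total size to be a multiple of the
// alignment, so round the byte count up.  Free with std::free.
template <typename T>
T* AllocCacheAligned(std::size_t count)
{
    constexpr std::size_t alignment = 64;
    std::size_t bytes =
        ((count * sizeof(T) + alignment - 1) / alignment) * alignment;
    return static_cast<T*>(std::aligned_alloc(alignment, bytes));
}
```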

 

[bbyington@rt-gpudev ~]$ icpc -O3 -xCORE-AVX512 memcpy.cpp -o memcpyTest
[bbyington@rt-gpudev ~]$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
node 0 size: 130719 MB
node 0 free: 125271 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39
node 1 size: 131072 MB
node 1 free: 127188 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

[bbyington@rt-gpudev ~]$ numactl --physcpubind 0 --membind 0 ./memcpyTest

Filling 32GB array by tiling out 262KB source array, and using 65KB transfers
        SmallSource Memcpy Duration: 5.21826
        SmallSource LoadStoreIntrin Duration: 3.8994
        SmallSource StreamIntrin Duration: 5.28715

Filling 32GB array by tiling out 268MB source array, and using 268MB transfers
        MediumSource Memcpy Duration: 6.66535
        MediumSource LoadStoreIntrin Duration: 5.90025
        MediumSource StreamIntrin Duration: 7.16456

[bbyington@rt-gpudev ~]$ numactl --physcpubind 0 --membind 1 ./memcpyTest

Filling 32GB array by tiling out 262KB source array, and using 65KB transfers
        SmallSource Memcpy Duration: 9.70955
        SmallSource LoadStoreIntrin Duration: 5.49025
        SmallSource StreamIntrin Duration: 1.95921

Filling 32GB array by tiling out 268MB source array, and using 268MB transfers
        MediumSource Memcpy Duration: 5.8158
        MediumSource LoadStoreIntrin Duration: 8.70741
        MediumSource StreamIntrin Duration: 5.74727

Outstanding Questions:

  • Why does the throughput of streaming instructions decrease with a larger input size?  Does that mean they are still interacting with the CPU cache?  If they are not bypassing the cache, how are streaming operations (sometimes) outperforming the non-streaming operations?
  • I'm not sure why memcpy improved its performance so much with the larger copy on "far" data.  Since it did not get a similar boost when using "near" data, does that mean it switches to streaming instructions if the copy is large enough (or it otherwise determines they should be advantageous)?
  • Why are streaming loads/stores faster when interacting with memory from another NUMA node?  The normal load/stores slow down in this case, which is more in line with what I'd have guessed.
  • Is there a way for accesses on "near" memory to perform as well (using streaming intrinsics or otherwise) as seen with streaming accesses on far memory?

Source Code for Toy Experiment:

#include <stdexcept>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <chrono>
#include <iostream>
#include <limits>
#include <string>
#include <x86intrin.h>

using namespace std::chrono;

static constexpr auto cachLen = 64;

// Sizes for source and destination arrays
static constexpr size_t destinationDataLen = 503808ul*64*512;
static constexpr size_t smallSourceDataLen = 256*512;
static constexpr size_t largeSourceDataLen = 4096*64*512;

// Sizes for individual copies
static constexpr auto smallBatchSize = 64*512;
static constexpr auto mediumBatchSize = 4096*64*512;

// Helper struct for RAII control of allocation that presents itself
// as aligned.
template <typename T>
struct AlignedAlloc
{
    AlignedAlloc(size_t count)
        : data_{new T[count + cachLen / sizeof(T)]}
        , count_(count)
    {
        aligned_ = data_;
        while (reinterpret_cast<size_t>(aligned_) % cachLen != 0) aligned_++;
    }
    ~AlignedAlloc() { delete [] data_; }

    T* data(size_t idx = 0) { return aligned_ + idx; }
    const T* data(size_t idx = 0) const { return aligned_ + idx; }

    T& operator[](size_t idx) { return aligned_[idx]; }
    const T& operator[](size_t idx) const { return aligned_[idx]; }

    size_t Count() const { return count_; }

private:
    T* data_;
    T* aligned_;
    size_t count_;
};

// memcpy replacement that uses _mm512 streaming load/stores
void CopyStream(void* dest, void* src, size_t count)
{
    // Byte-wise pointers: arithmetic on void* is a nonstandard extension.
    auto* d = static_cast<char*>(dest);
    auto* s = static_cast<char*>(src);
    _mm_mfence();
    for (size_t i = 0; i < count; i += cachLen*4)
    {
         auto l1 = _mm512_stream_load_si512(reinterpret_cast<__m512i*>(s + i + cachLen*0));
         auto l2 = _mm512_stream_load_si512(reinterpret_cast<__m512i*>(s + i + cachLen*1));
         auto l3 = _mm512_stream_load_si512(reinterpret_cast<__m512i*>(s + i + cachLen*2));
         auto l4 = _mm512_stream_load_si512(reinterpret_cast<__m512i*>(s + i + cachLen*3));

         _mm512_stream_si512(reinterpret_cast<__m512i*>(d + i + cachLen*0), l1);
         _mm512_stream_si512(reinterpret_cast<__m512i*>(d + i + cachLen*1), l2);
         _mm512_stream_si512(reinterpret_cast<__m512i*>(d + i + cachLen*2), l3);
         _mm512_stream_si512(reinterpret_cast<__m512i*>(d + i + cachLen*3), l4);
    }
    _mm_mfence();
}
// memcpy replacement that uses _mm512 load/stores
void CopyLoadStore(void* dest, void* src, size_t count)
{
    auto* d = static_cast<char*>(dest);
    auto* s = static_cast<char*>(src);
    for (size_t i = 0; i < count; i += cachLen*4)
    {
         auto l1 = _mm512_load_si512(s + i + cachLen*0);
         auto l2 = _mm512_load_si512(s + i + cachLen*1);
         auto l3 = _mm512_load_si512(s + i + cachLen*2);
         auto l4 = _mm512_load_si512(s + i + cachLen*3);

         _mm512_store_si512(d + i + cachLen*0, l1);
         _mm512_store_si512(d + i + cachLen*1, l2);
         _mm512_store_si512(d + i + cachLen*2, l3);
         _mm512_store_si512(d + i + cachLen*3, l4);
    }
}


int main()
{
    AlignedAlloc<int16_t> smallSource(smallSourceDataLen);
    AlignedAlloc<int16_t> mediumSource(largeSourceDataLen);
    AlignedAlloc<int16_t> dest(destinationDataLen);

    // Initialize all data, both to have a verifiable payload for transfer,
    // as well as to make sure all memory gets mapped before timing anything
    for (size_t i = 0; i < smallSource.Count(); ++i)
    {
        smallSource[i] = i % std::numeric_limits<int16_t>::max();
    }
    for (size_t i = 0; i < mediumSource.Count(); ++i)
    {
        mediumSource[i] = i % std::numeric_limits<int16_t>::max();
    }
    memset(dest.data(), 0, dest.Count() * sizeof(int16_t));

    // Copies memory from src to dest.  src is (significantly) smaller than dest,
    // and will be replicated as necessary.  batchSize controls how many elements are
    // transferred at once
    auto TestCopyFunction = [](auto&& CopyFunc, const std::string& name, size_t batchSize,
                               AlignedAlloc<int16_t>& dest, 
                               AlignedAlloc<int16_t>& src)
    {
        auto dCount = dest.Count();
        auto sCount = src.Count();
        assert(dCount % batchSize == 0);
        assert(sCount % batchSize == 0);

        auto t1 = high_resolution_clock::now();
        for (size_t i = 0; i < dCount; i+= batchSize)
        {
            CopyFunc(dest.data(i), src.data(i % sCount), batchSize*sizeof(int16_t));
        }
        auto t2 = high_resolution_clock::now();
        auto span = duration_cast<duration<double>>(t2 - t1);
        std::cerr << name << " Duration: " << span.count() << std::endl;

        // Verify correct receipt on the destination side
        for (size_t i = 0; i < dCount; i+= sCount)
        {
            for (size_t j = 0; j < sCount; ++j)
                if (src[j] != dest[i+j]) throw std::runtime_error("Validation Failed");
        }
        std::memset(dest.data(), 0, destinationDataLen*sizeof(int16_t));
    };

    std::cerr << std::endl;
    std::cerr << "Filling 32GB array by tiling out 262KB source array, and using 65KB transfers\n";
    TestCopyFunction(std::memcpy, "\tSmallSource Memcpy", smallBatchSize, dest, smallSource);
    TestCopyFunction(CopyLoadStore, "\tSmallSource LoadStoreIntrin", smallBatchSize, dest, smallSource);
    TestCopyFunction(CopyStream, "\tSmallSource StreamIntrin", smallBatchSize, dest, smallSource);

    std::cerr << std::endl;
    std::cerr << "Filling 32GB array by tiling out 268MB source array, and using 268MB transfers\n";
    TestCopyFunction(std::memcpy, "\tMediumSource Memcpy", mediumBatchSize, dest, mediumSource);
    TestCopyFunction(CopyLoadStore, "\tMediumSource LoadStoreIntrin", mediumBatchSize, dest, mediumSource);
    TestCopyFunction(CopyStream, "\tMediumSource StreamIntrin", mediumBatchSize, dest, mediumSource);

}

 


Accepted Solutions
McCalpinJohn
Black Belt

I have seen this behavior on my Xeon Platinum 8160 and Xeon Platinum 8280 boxes.  The boost in performance only applies for low thread counts.  My results show local performance to be higher when using more than 3-5 threads -- the details depend on the ratio of reads/writebacks/streaming stores.   For small thread counts, the remote results are slightly higher if "LLC Prefetch" is enabled in the BIOS (not the default in our systems), but the difference is small.    In some cases single-thread performance is faster with allocating stores than with streaming stores -- see notes at http://sites.utexas.edu/jdm4372/2018/01/01/notes-on-non-temporal-aka-streaming-stores/

The STREAM Copy values of ~10 GB/s for a single thread on SKX correspond to 5 GB/s of read traffic and 5 GB/s of write traffic.  This convention may not match the way you are counting Bytes.

I don't think I ever got around to doing a detailed analysis of this anomalously high remote bandwidth -- it is generally a small effect (except for the store-only case, which is not a significant overhead in any applications that I know of). 

With multi-threaded code there should be no problem getting better performance with local memory than with remote memory.  You will need to use most (perhaps all) of the cores in a socket to reach asymptotic local memory bandwidth levels.
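A minimal sketch of the multi-threaded approach (untuned; a real implementation would also pin each worker to a core on the local socket, e.g. via numactl or sched_setaffinity, and could swap the per-slice memcpy for the intrinsic kernels from the original post):

```cpp
#include <cstring>
#include <cstddef>
#include <thread>
#include <vector>

// Split one large copy across nThreads workers, each handling a
// contiguous byte slice of the buffer.
void ParallelCopy(void* dest, const void* src, std::size_t bytes, unsigned nThreads)
{
    std::vector<std::thread> workers;
    std::size_t chunk = bytes / nThreads;
    for (unsigned t = 0; t < nThreads; ++t)
    {
        std::size_t begin = t * chunk;
        // Last worker also takes any remainder bytes.
        std::size_t len = (t + 1 == nThreads) ? bytes - begin : chunk;
        workers.emplace_back([=] {
            std::memcpy(static_cast<char*>(dest) + begin,
                        static_cast<const char*>(src) + begin, len);
        });
    }
    for (auto& w : workers) w.join();
}
```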


5 Replies

I guess I forgot to mention that I'm compiling with icpc 17.0.4 (gcc version 6.3.1 compatibility).


Thank you so much for responding.

I did in fact misunderstand how STREAM was counting bytes.  That eliminates the rough factor of two I was seeing between the STREAM output report and my computations for my own application.

You mention that the faster remote bandwidth you observe for streaming operations is only significantly different for store-only workloads.  In my own tests I saw the significant difference (over 2x) when populating the 32GB destination by replicating the 262KB input, but that involved the same number of load and store instructions regardless of the input data size.  Does the fact that the input is small enough to fit in cache mean I'm still effectively in a "store only" scenario?  Or is this potentially just a quirk of my particular hardware?  I guess I'm still a bit confused about how/if streaming loads and stores interact with the cache.

You are right that if I throw more threads at the copy, my bandwidth goes up, though I was hoping to use a single core for the copy and keep the rest available for computation.  When I saw the significantly faster performance in the one particular case (streaming operations replicating a small input to a large output on a distant NUMA node), I was hoping I could somehow obtain that in the general case.  It sounds like I'm unlikely to achieve that, but throwing another core or three at the memory movement is probably still good enough to have sufficient memory throughput while reserving the bulk of the machine for computation.

Thanks again!

McCalpinJohn
Black Belt

The details of how streaming stores are processed are minimally documented.  The Uncore Performance Monitoring Reference Manual lists the available mesh transactions, and it would not be hard to figure out which transactions are used by the streaming stores in your application.  Unfortunately, this does not tell us about several important performance-related properties....

We know that Non-temporal stores will invalidate any cache entries mapping the target address (for all caches in the system).   So streaming store processing must involve coherence processing on both the local chip and on the remote chip(s).   These can (in principle) be performed sequentially or in parallel, and it is possible that the degree of overlap is different for local and remote accesses.  Throughput is also dependent on the number of queue entries available to process particular transaction types.   Streaming stores to local targets use different UPI transactions than streaming stores to remote targets, and it is certainly possible that the remote case allows more buffers to be used.   Some of these items could be understood by generating various carefully controlled tests and studying them using all of the potentially relevant uncore performance counters in the potentially relevant uncore units (mesh, CHA, IMC, UPI link layer, M2M, and M3UPI).  This is the sort of exercise that I normally enjoy, but in this case we already know that the local performance is limited by single thread bandwidth, so the details are not terrifically important....

Your best performance is with the "small source" buffer and the load and store intrinsics operating on local memory.  This is consistent with the behavior of several generations of recent Intel processors.  In this case, allocating (i.e., "normal") stores will activate the L2 HW prefetch engine to generate extra concurrency and to reduce the average latency of the L1 Dcache store misses.  There is no corresponding mechanism for reducing latency when executing streaming stores.   Allocating stores do generate more DRAM traffic than the streaming stores, but at these low levels of DRAM utilization this does not matter.  My analysis suggests that none of your cases generate more than ~20 GB/s of DRAM bandwidth, which is less than 18% of peak for one socket.  Performance is limited by available concurrency and transaction occupancy, not by channel bandwidth.
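One experiment this suggests is issuing software prefetches ahead of the copy loop to recover some of the concurrency the L2 hardware prefetcher would otherwise provide; a portable sketch using the GCC/Clang/ICC `__builtin_prefetch` builtin (the prefetch distance here is a guess that would need tuning on real hardware, and the per-line memcpy is a stand-in for the streaming-store kernel from the original post):

```cpp
#include <algorithm>
#include <cstring>
#include <cstddef>

// Copy with the source explicitly prefetched a fixed distance ahead of
// the line currently being moved.
void CopyWithPrefetch(char* dest, const char* src, std::size_t bytes)
{
    constexpr std::size_t kLine  = 64;         // cache line size
    constexpr std::size_t kAhead = 8 * kLine;  // prefetch distance (tunable)
    for (std::size_t i = 0; i < bytes; i += kLine)
    {
        if (i + kAhead < bytes)
            __builtin_prefetch(src + i + kAhead, /*rw=*/0, /*locality=*/0);
        // min() handles a tail that is not a whole cache line.
        std::memcpy(dest + i, src + i, std::min(kLine, bytes - i));
    }
}
```

Whether this closes any of the gap to allocating stores would need to be measured with the uncore counters discussed above.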


Thank you very much for all the useful information John, it's very appreciated