Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Is RAM to RAM DMA possible and would it be fast?

DLake1
New Contributor I
4,221 Views

I have a 9920X at up to 4.8GHz on all cores and 32GB of quad-channel 3600MHz CL16 RAM, yet memcpy copies at a rate of just 11GB/s, which is clearly limited by the single CPU core doing the copy. I could write a multithreaded memcpy routine for large copies, but I would like to know if there is a more elegant solution, maybe using DMA, to fully utilize the memory bandwidth without writing messy multithreaded memcpy code.
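Something like this rough std::thread sketch (untested, just to show what I mean by messy) is what I'd rather not have to write and tune:

#include <cstring>
#include <thread>
#include <vector>

// Rough sketch: split a large copy across a few threads.
// Assumes non-overlapping buffers and a size large enough to be worth splitting.
void parallel_memcpy(void* dst, const void* src, size_t n, unsigned nthreads = 4)
{
	std::vector<std::thread> workers;
	size_t chunk = n / nthreads;
	for (unsigned t = 0; t < nthreads; ++t)
	{
		size_t off = t * chunk;
		size_t len = (t == nthreads - 1) ? n - off : chunk;
		workers.emplace_back([=] {
			std::memcpy(static_cast<char*>(dst) + off,
			            static_cast<const char*>(src) + off, len);
		});
	}
	for (auto& w : workers) w.join();
}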

0 Kudos
20 Replies
jimdempseyatthecove
Honored Contributor III
4,210 Views

Is your compiler an Intel C++ compiler?
If so, is it configured to use AVX or AVX512?
If so, then it should be generating a call to _intel_fast_memcpy (you can verify this by looking at the map file or using the debugger at the point of the memcpy).

An additional factor is how your BIOS has configured your memory access. There are two methods:

One where the memory is uniform (meaning interleaved)
The other non-uniform (aka NUMA, or not interleaved)

The descriptive text in your BIOS for this option may be confusing; you may need to consult your motherboard's forum to get a clear description of the setting, or you could simply test by setting the mode the other way, assuming you have verified that _intel_fast_memcpy is (was) being used.

There are different reasons for selecting which memory access mode is better.

For improved memcpy (_intel_fast_memcpy) you would want it set to interleaved (not NUMA).

Depending on source and target placement, NUMA mode might yield only 1/4 to 1/2 of the memory bandwidth for single-threaded _intel_fast_memcpy. This is not a statement that interleaved is best under all circumstances.

Jim Dempsey

0 Kudos
DLake1
New Contributor I
4,210 Views
The compiler is Intel C++ 18.0. It's calling _intel_avx_rep_memcpy. The i9 9920X is single-socket, not NUMA. There are BIOS options for IMC, channel and rank interleaving, and I get an 11,500MB/s copy rate with IMC and channel interleaving set to 2. AIDA64 benchmarks show about 107,000MB/s read and 83,000MB/s write, so a single core just can't copy any faster than 11,500MB/s. Why is there not a faster way of copying memory without multithreading?
0 Kudos
McCalpinJohn
Honored Contributor III
4,210 Views

I can't quite tell how you are defining your terms here....

There are three commonly used ways to define "bandwidth" in the context of memory copies.  These are discussed at the bottom of https://www.cs.virginia.edu/stream/ref.html -- it is often possible to reverse-engineer someone else's assumptions, but clarity and specificity make it much easier to understand what is going on.....

In your configuration the peak DRAM BW is 115.2 GB/s, which is consistent with the reported 107 GB/s read bandwidth.  (Counting bytes for contiguous reads is not ambiguous, so this consistency is not surprising.)   The 83 GB/s for writes could mean either (a) 83 GB/s for streaming stores (counting only the store traffic), or (b) 83 GB/s for normal (allocating) stores, counting both the allocate and writeback traffic (41.5 GB/s each).   Similarly, your 11 GB/s memory copy number could be interpreted in several different ways.....   
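To make the counting concrete with your numbers (an illustration only): if memcpy moves N bytes in time T and you report N/T = 11 GB/s, the actual DRAM traffic is at least 2N/T = 22 GB/s if the destination is written with streaming stores (read source + write destination), and roughly 3N/T = 33 GB/s with ordinary allocating stores (read source + read-for-ownership of destination + writeback of destination).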

Single-core memory bandwidth is not limited by the DRAM interface -- it is limited by the number of outstanding cache misses that a single core can generate, combined with the latency of the memory accesses.  The standard formulation of Little's Law can be used to solve for any of the three terms:

Concurrency = Latency * Bandwidth

It is important to think about the differences between "peak" and "effective" in these three terms, and to remember that for concurrency and bandwidth "bigger is better", while for latency "smaller is better".   

If you know how much concurrency (i.e., how many outstanding cache misses) a system supports, and you can measure the idle memory latency, then you can compute an approximate upper bound on the effective bandwidth:

Effective Bandwidth <= Maximum Concurrency / Minimum Latency

E.g., if a core supports 10 outstanding cache misses (64 Bytes/cacheline) and has a latency of 64 ns, then the "effective bandwidth" can be no larger than (10 cachelines * 64 Bytes/cacheline)/(64 ns) = 10 GB/s.    The measured bandwidth will be lower than this maximum if anything in the test reduces the amount of available concurrency or increases the actual latency.   The measured bandwidth can only exceed this bound if there are other mechanisms (e.g., hardware prefetch) that can reduce the "effective latency" by moving data into the cache before the core requests it.
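Expressed as trivial code (an illustration of the arithmetic, not a measurement tool, using the example numbers above):

#include <cstdio>

// Little's Law bound: bandwidth <= concurrency / latency.
int main()
{
	const double lines      = 10.0;   // outstanding cache misses
	const double line_bytes = 64.0;   // bytes per cache line
	const double latency_ns = 64.0;   // idle memory latency in ns

	double gb_per_s = (lines * line_bytes) / latency_ns;   // bytes/ns == GB/s
	std::printf("effective bandwidth bound: %.1f GB/s\n", gb_per_s);
	return 0;
}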

The issues are discussed in a bit more detail in slides 26-32 of: http://sites.utexas.edu/jdm4372/2016/11/22/sc16-invited-talk-memory-bandwidth-and-system-balance-in-hpc-systems/

Quick comments before I head off for more coffee:

  • Latency has been steady or increasing for a bunch of reasons (discussed in my SC16 presentation linked above), and it is relevant to note that the Core i9-9920X processor is actually built using a 28-core die with 12 of the 28 cores enabled and 14 of the 28 L3 slices enabled.  In order to make the L3 work effectively, cache-line addresses are hashed pseudo-randomly around the 14 [Coherence Agent/L3 slice] pairs, then hashed across the four DRAM channels.  So any L2 miss to DRAM has to traverse many, many on-chip links, and traverse at least four clock-frequency boundaries (core->uncore, uncore->DRAM, DRAM->uncore, uncore->core).
  • The number of outstanding cache misses that a core can support has been increasing very slowly because the buffer that supports L1 Data Cache misses sits between the L1 Data Cache and the L2 cache.  Each L1 Data Cache access that misses in the L1 Data Cache tags has to perform an address comparison against all of the entries of the L1 Cache Miss Buffers, then attempt to allocate an entry if there is no address match.  If the buffer gets large enough to require multiple cycles to check, this will add directly to the L2 Cache Hit latency -- which is usually quite important for performance.  In 2005, AMD Opteron supported 8 L1 Data Cache misses per core.  In 2009, Intel Nehalem supported 10 L1 Data Cache misses per core.  Sandy Bridge through Broadwell stayed at 10.  There are indications that Skylake/Cascade Lake support 12 L1 Data Cache misses per core, but I can't find a definitive reference on that right now.  The number of cache misses required to fully overlap the memory latency on a Xeon Platinum 8280 is 140.8 GB/s * 80 ns = 11,264 Bytes = 176 cache lines. This is not a small difference compared to 12, suggesting that closing the gap would require significant design changes and tradeoffs.
0 Kudos
DLake1
New Contributor I
4,210 Views

But why does a memory copy from/to RAM have to be limited by the CPU cores and cache? Why can't it be handled like a DMA operation within the memory controller?

Your wisdom is greatly appreciated, Dr. Bandwidth.

0 Kudos
jimdempseyatthecove
Honored Contributor III
4,210 Views

If you are not using streaming stores, as John was telling us, each load from the source brings data that is not yet cached into a cache line that may currently hold "dirty" data, and that dirty data must first be written back. Then when the just-read data is stored (presumably to a different location), the store allocates another cache line, which may also be dirty and require a writeback. The scenario can be as bad as:

read cache line from RAM to register (and cache)
write evicted cache line to RAM
write cache line from register to RAM (and cache)
write evicted cache line to RAM

IOW ~half of the memory bandwidth could be consumed by cache line evictions.
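For comparison, here is a minimal intrinsics sketch of a copy that uses streaming (non-temporal) stores, so the destination lines are not read and allocated in the cache first -- assuming 32-byte aligned buffers, a byte count that is a multiple of 32, and no overlap:

#include <immintrin.h>
#include <cstddef>

// Sketch: non-temporal AVX copy; the streaming store writes full lines
// without a read-for-ownership, avoiding the eviction traffic described above.
void nt_copy(void* dst, const void* src, size_t size)
{
	const __m256i* s = static_cast<const __m256i*>(src);
	__m256i*       d = static_cast<__m256i*>(dst);
	for (size_t i = 0; i < size / 32; ++i)
	{
		__m256i v = _mm256_load_si256(s + i);   // ordinary (cached) load
		_mm256_stream_si256(d + i, v);          // non-temporal store, no RFO
	}
	_mm_sfence();                               // order the streaming stores
}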

YMMV

Jim Dempsey
 

0 Kudos
DLake1
New Contributor I
4,210 Views

This is some assembly from a little program that I use to experiment with memory bandwidth measurement; it gets 11,300MB/s with the affinity set to the third logical core (second physical core):

__asm{
	mov rax, data0          // source pointer
	mov rdx, data1          // destination pointer
	mov rcx, datasize
	shr rcx, 5              // iteration count = bytes / 32
loop_copy:
	vmovntdqa ymm0, 0[rax]  // 32-byte load
	vmovntdq 0[rdx], ymm0   // 32-byte non-temporal store
	add rax, 32
	add rdx, 32
	dec rcx
	jnz loop_copy
}

But I'm asking if a memory copy can be done with a DMA operation or something like that to eliminate the constraints of the CPU cores.

0 Kudos
jimdempseyatthecove
Honored Contributor III
4,210 Views

Your code will experience a memory stall at the vmovntdq store (the store cannot initiate until ymm0 is updated by the immediately preceding load).

Write the code to facilitate buffered reads and writes:

__asm{
	mov rax, data0
	mov rdx, data1
	mov rcx, datasize
	shr rcx, 8
loop_copy:
	vmovntdqa ymm0, 0[rax]
	vmovntdqa ymm1, 32[rax]
	vmovntdqa ymm2, 64[rax]
	vmovntdqa ymm3, 96[rax]
	vmovntdqa ymm4, 128[rax]
	vmovntdqa ymm5, 160[rax]
	vmovntdqa ymm6, 192[rax]
	vmovntdqa ymm7, 224[rax]
	vmovntdq 0[rdx], ymm0
	vmovntdq 32[rdx], ymm1
	vmovntdq 64[rdx], ymm2
	vmovntdq 96[rdx], ymm3
	vmovntdq 128[rdx], ymm4
	vmovntdq 160[rdx], ymm5
	vmovntdq 192[rdx], ymm6
	vmovntdq 224[rdx], ymm7
	add rax, 256
	add rdx, 256
	dec rcx
	jnz loop_copy
}

On your system, 4 registers (128 bytes) per loop iteration might be sufficient.

Jim Dempsey

0 Kudos
DLake1
New Contributor I
4,210 Views

I don't care about the code; is it possible to perform a memory copy without the data going into the CPU core and out again, like a DMA operation where the CPU doesn't handle the data directly?

0 Kudos
jimdempseyatthecove
Honored Contributor III
4,210 Views

DMA on IA32/Intel64 requires the use of a peripheral device, and a device driver for the O/S. Therefore, you will need to either build a device or find a suitable one. Preferably a portion of the device address space would be mapped into your process's virtual address space, but you may need to call the driver via an O/S call to pass in the source and destination virtual addresses and the byte count. The overhead of this call may negate any benefit of reduced transfer time.

An additional bad effect is that the DMA would be from RAM to RAM, not necessarily through any cache level, and it would not use cached data in the event that a cache holds newer data. Additionally, the DMA may not invalidate cached data (thus resulting in you later reading stale data rather than the newer data).

What may be worth investigating, if the O/S permits it, is to build or select a hardware device (and driver) that performs the transfer using the block I/O instructions INS[B,W,D] and OUTS[B,W,D]. You would have to resolve any "dispute" with the O/S over obtaining (for your process) exclusive access to an I/O port on your device (which will buffer your data).

An improved "benchmark" test might be something like the following:

__asm{
	mov rax, data0
	mov rdx, data1
	mov rcx, datasize
	shr rcx, 8
	vmovntdqa ymm0, 0[rax]
	vmovntdqa ymm1, 32[rax]
	vmovntdqa ymm2, 64[rax]
	vmovntdqa ymm3, 96[rax]
loop_copy:
	vmovntdq 0[rdx], ymm0
	vmovntdqa ymm4, 128[rax]
	vmovntdq 32[rdx], ymm1
	vmovntdqa ymm5, 160[rax]
	vmovntdq 64[rdx], ymm2
	vmovntdqa ymm6, 192[rax]
	vmovntdq 96[rdx], ymm3
	vmovntdqa ymm7, 224[rax]
	vmovntdq 128[rdx], ymm4
	add rax, 256
	vmovntdqa ymm0, 0[rax]
	vmovntdq 160[rdx], ymm5
	vmovntdqa ymm1, 32[rax]
	vmovntdq 192[rdx], ymm6
	vmovntdqa ymm2, 64[rax]
	vmovntdq 224[rdx], ymm7
	vmovntdqa ymm3, 96[rax]
	add rdx, 256
	dec rcx
	jnz loop_copy
}

*** Assure that your input and output buffers are 32-byte aligned
*** Assure that your input buffer has at least 128 bytes following it (as the reads go past the end of the buffer)

Note, your formal fast memcpy will require preamble code to advance to a cache-line boundary and to check cache-line alignment compatibility between the input and output buffers, plus postamble code to process the residual (non-multiple-of-32-byte) data at the end of the buffer.
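A rough C++ outline of that structure (a sketch only -- not what _intel_fast_memcpy actually does internally, and it sidesteps the relative-alignment check by using unaligned loads for the source):

#include <immintrin.h>
#include <cstring>
#include <cstdint>

// Sketch: byte preamble to align the destination, 32-byte streaming main loop,
// byte postamble for the residual tail. Assumes non-overlapping buffers.
void fast_copy(void* dst, const void* src, size_t n)
{
	char*       d = static_cast<char*>(dst);
	const char* s = static_cast<const char*>(src);

	// Preamble: advance until the destination is 32-byte aligned.
	size_t head = (32 - (reinterpret_cast<uintptr_t>(d) & 31)) & 31;
	if (head > n) head = n;
	memcpy(d, s, head);
	d += head; s += head; n -= head;

	// Main loop: 32 bytes per iteration, streaming stores to the aligned destination.
	size_t blocks = n / 32;
	for (size_t i = 0; i < blocks; ++i)
	{
		__m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(s) + i);
		_mm256_stream_si256(reinterpret_cast<__m256i*>(d) + i, v);
	}
	_mm_sfence();
	d += blocks * 32; s += blocks * 32; n -= blocks * 32;

	// Postamble: residual tail of fewer than 32 bytes.
	memcpy(d, s, n);
}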

_intel_fast_memcpy found in libirc.lib does this for you. Try this function as well as trying your own.

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
4,210 Views

You might be interested in this thread:

https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy

Read it all; pay particular attention to the "History" and "Official Advice" sections.

Jim Dempsey

0 Kudos
McCalpinJohn
Honored Contributor III
4,210 Views

The !*(@&^#*&^@#!$ forum decided I needed to sign in again upon submission of my response, which disappeared.

Short answers

* DMA engines need physical addresses, which requires that the user pin the address range (using mlock()) and then look up the physical address for every 4KiB page.   Ugh.  (A sketch of what that lookup looks like on Linux follows this list.)

* The "memory controller" is not necessarily "closer" to the memory than the core!   Why?  Because your processor has two memory controllers, on opposite sides of the chip, each managing two DDR4 DRAM channels (3 channels per controller on the Xeon Scalable processors).  A typical DRAM configuration for these processors will assign physical addresses to DRAM channels in 256 Byte blocks, with consecutive physical addresses mapped to a permutation of the channel numbers.  The permutation typically includes input from some higher-order address bits, so the permutations will be different for different starting addresses.  In general, half of the data read in will have to go across the chip to the other memory controller anyway, so going through a core is not a serious detour.

* It is really worse than this -- if the store has to be visible to the coherence domain, the address will need to be sent to a "CHA" ("Coherence and Home Agent") and its co-located L3 slice for processing.  Your chip has 14 of these enabled, spread across the full grid of 28.  The L3 slice will look up the address and may invalidate or update a valid copy of the cache line.  Intel processors are also typically configured to cache IO DMA writes in the L3, so even if the line misses in the L3 cache slice, it may be retained.  In parallel with the L3 lookup, the CHA will look up the address in the "Snoop Filter" to see if the address is cached in any of the L1 or L2 caches on the chip.  In most systems a DMA write will require that address to be invalidated in all L1 and L2 caches.  This occurs even if the line is dirty, which can cause ugly race conditions that the user is not expecting (since a CPU-based copy will properly merge dirty data in the caches with incoming DMA writes).
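To illustrate the first point above, the pin-and-translate step on Linux looks roughly like the sketch below (illustration only -- error handling omitted, and on recent kernels the PFN field of /proc/self/pagemap reads as zero without CAP_SYS_ADMIN).  This has to be repeated for every 4KiB page in the buffer:

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>

// Sketch: pin one page and look up its physical address via /proc/self/pagemap.
uint64_t physical_address(void* p)
{
	long page = sysconf(_SC_PAGESIZE);
	uintptr_t vaddr = reinterpret_cast<uintptr_t>(p);

	mlock(p, page);                              // pin so the page cannot move

	int fd = open("/proc/self/pagemap", O_RDONLY);
	uint64_t entry = 0;
	pread(fd, &entry, sizeof(entry), (vaddr / page) * sizeof(entry));
	close(fd);

	uint64_t pfn = entry & ((1ULL << 55) - 1);   // bits 0..54 = page frame number
	return pfn * page + (vaddr & (page - 1));
}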

But for some cases a DMA engine can be useful, and Intel supports these as part of their "I/OAT" technology.  An example memcpy implementation is discussed at https://software.intel.com/content/www/us/en/develop/articles/fast-memcpy-using-spdk-and-ioat-dma-engine.html

0 Kudos
DLake1
New Contributor I
4,205 Views

I feel your pain Dr. Bandwidth.

So then a multithreaded operation seems to be the best way of fully utilizing the memory bandwidth.

I found this curious-looking instruction, MOVDIR64B, in the "Other" section of https://software.intel.com/sites/landingpage/IntrinsicsGuide/. It's not recognized by the compiler as an instruction, and AIDA64 shows it's not supported by my CPU in a CPUID query. It seems to copy 64 bytes (512 bits) from one memory address to another rather than going through registers; could you tell me anything about this?

0 Kudos
jimdempseyatthecove
Honored Contributor III
4,205 Views

>>The !*(@&^#*&^@#!$ forum decided I needed to sign in again upon submission of my response, which disappeared.

That happens to me about twice a week. I am in the habit of copying all posts to paste buffer prior to Submit...

Edit: HA HA it requested (demanded) log in again for this post!!!

The only drawback is when the post contains formatted text via the {...} code button. The copy excludes formatting, and the resultant paste loses line numbering and other formatting (e.g. when the format was C++, Fortran, HTML, ...).

Intel - Please fix this such that the entered Comment is remembered and restored after login.

Jim Dempsey

0 Kudos
DLake1
New Contributor I
4,197 Views

jimdempseyatthecove (Blackbelt) wrote:

The only drawback of this is if the post contains formatted text via {...} code button. The copy excludes formatting, and the resultant paste loses line indexing and other formatting (e.g. when format was C++, Fortran, html, ...).

Maybe you could "Disable rich-text" so you can CTRL+A, CTRL+C the raw source.

0 Kudos
jimdempseyatthecove
Honored Contributor III
4,202 Views

John,

>>But for some cases a DMA engine can be useful, and Intel supports these as part of their "I/OAT" technology.

Interesting article and feature. Have you used this?

I am a little confused about the article. Perhaps you can add some context.

From the Introduction section:

Intel I/OAT can take advantage of PCI-Express nontransparent-bridging, which allows movement of memory blocks between two different PCIe connected motherboards, thus effectively allowing the movement of data between two different computers at nearly the same speed as moving data in memory of a single computer.

To me, this seems to state that you require some physical bridging device plugged into each computer's PCIe slot. That said, it is not unusual for a "communication" device to contain a loopback capability, and I suspect that this is what is used for intra-system (sans PCIe bus) transfers (aka "memcpy").

First question:

Am I correct in assuming that a portion of the PCIe bridging "hardware" device is contained in the CPU (supporting Intel QuickData Technology), permitting the Intel QuickData Technology driver to perform loop back (intra-system) transfers without the physical PCIe bridging "hardware" device inserted?

Second question:

I suppose I should read the full set of documentation for Intel QuickData Technology... In the memcpy example the transfers were within the same process (VM) on a single system. Whereas, in the case where a bridging device is used for inter-PCIe transfers, the spdk_ioat_chan handle would have to be opened referencing two processes, one process on each system. The spdk_ioat_probe function does not provide a means to reference the other system's process???

Jim Dempsey

 

 

0 Kudos
DLake1
New Contributor I
4,202 Views

That sounds like RDMA.

0 Kudos
McCalpinJohn
Honored Contributor III
4,202 Views

@Jim -- I dunno nuttin about I/OAT -- it seems like there is some context missing in the documents that make them unintelligible to me....

@CommanderLake -- MOVDIR64B is a new instruction for Tremont, Tiger Lake, and Sapphire Rapids (according to the Intel Architectures Instruction Set Extensions Programming Reference, document 319433-038, March 2020).   The description (now in Volume 2 of the SWDM) describes clearly enough *what* the instruction does, but it does not give a clear indication of *why* it is expected to be useful. 
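For reference, compilers that do support it expose the instruction through the _mm_movdir64b intrinsic (immintrin.h).  A minimal sketch of what a single 64-Byte direct store would look like -- not testable on the Core i9-9920X, which lacks the instruction:

#include <immintrin.h>

// Sketch only: copy one 64-byte block with MOVDIR64B.
// Requires a CPU with MOVDIR64B support and a compiler option that enables
// the intrinsic (e.g. -mmovdir64b on GCC/Clang); the destination address
// must be 64-byte aligned.
void movdir64b_copy(void* dst, const void* src)
{
	_mm_movdir64b(dst, src);   // direct store of 64 bytes; the write is atomic
}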

Comments and speculations:

  • This is the first "memory-to-memory" instruction that I have seen in a while.   Accessing two memory arguments has very different implementation requirements than existing instructions, so I would guess that it had to be considered fairly important to justify the implementation & validation cost.
  • The instruction description notes that the memory read is not guaranteed to be atomic, while the memory write is guaranteed to be atomic. Atomicity of stores larger than a single "word" can be quite valuable in the implementation of low-level producer-consumer operations -- allowing one or more "data" words to be combined with one or more "tag" words in a single cache transaction.  It would take some serious thinking to understand whether this is an important intended use case for this instruction (or the MOVDIRI register-to-memory version).
  • The instruction description notes that the 64-Byte atomic payload is guaranteed even for UC (UnCached) memory types.  This is potentially important for low-level device drivers that must communicate with memory-mapped IO devices.  With the UC type, payloads are typically limited to 4B or 8B, and there is no overlap between operations.  Being able to send a 64-Byte block in a single transaction could dramatically reduce the number of latencies required to interact with memory-mapped IO devices, without requiring that they implement the full complexity of being able to deal with Write-Combined memory types (which *usually* show up in 64-Byte blocks, but are allowed to be split into many smaller transactions).
  • The instruction description notes that the MOVDIRI and MOVDIR64B instructions are "volatile" and do not merge with prior stores in the write-combining buffer.   This makes the instruction suitable for driving streams of stores to a fixed-address memory-mapped hardware FIFO.  (With the Write Combining type, you would have to explicitly flush the WC buffers before writing to the same address, otherwise the stores could be merged, with only the final store actually moving out of the core.)
0 Kudos
DLake1
New Contributor I
4,202 Views

I know I/OAT was meant for offloading extra network traffic; I just don't know what exactly.

Here's Intel's summary of what I/OAT is and does: www.intel.com/content/www/us/en/wireless-network/accel-technology.html

0 Kudos
Viet_H_Intel
Moderator
3,713 Views

Let us know if this is still an issue. Otherwise, we will close it.


Thanks,


0 Kudos
Viet_H_Intel
Moderator
3,639 Views

We will no longer respond to this thread.  

If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.

Thanks,


0 Kudos
Reply