Why do MemReq TLPs take hundreds of nanoseconds to be generated after MMIO writes?

Alexis · ‎06-14-2020

I found several topics discussing how to build bigger TLPs, but none about reducing the time it takes to generate them.

My goal is to reduce the delay between the PCIe TLPs generated for successive MMIO writes.

In the user-space:

auto *addr = static_cast<uint64_t *>(mmap_addr);
*(addr) = 0x1111111111111111;
*(addr+1) = 0x2222222222222222;
*(addr+2) = 0x3333333333333333;

As expected, that code sends three 8-byte MemWr TLPs, with hundreds of nanoseconds of delay between them.

I also tried `memcpy`; the TLPs are then spaced irregularly, like this: ` | | |||| |`
* 4-5 TLPs can be back to back, while the others are far apart

Directly in the kernel:

for (k=0; k<32; k++){
   writeq(0x11111111, dev_bk->bar[0].base_addr+8*k);
}

Still 50-300ns delay between TLPs.

Questions:

How is an MMIO write converted to a TLP?
Which mechanism can I use to speed up TLP generation? (Linux kernel)
Does my application itself see these delays of hundreds of nanoseconds between calls?

I'm aware of the write-combining (WC) buffers and of SIMD stores (SSE, AVX, AVX-512). I'd like to set them aside for this question.

Best regards,

McCalpinJohn · ‎06-15-2020

There has traditionally been very little documentation of the implementation details of the interface between the core and the PCIe interface.

If the MMIO interface is mapped UC, then the writes will certainly happen one at a time, with a spacing that is determined by whatever mechanism the core uses to ensure that the stores arrive at the target address in order. Most implementations of such ordering require a round trip signal (even if that signal is not visible architecturally). Some implementations are able to use the properties of statically-routed, FIFO-ordered channels (typically virtual channels) to enable some degree of pipelining (while also guaranteeing ordering) without a full round trip. This is fairly common in low-level interfaces (not directly accessible to the users), and in my experience is almost non-existent in user-visible interfaces.
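
As a rough illustration of what that serialization costs, the sketch below converts a payload size and an inter-TLP spacing into an effective data rate. The function name is hypothetical, and the model assumes exactly one TLP per spacing interval with no other overhead; the spacings plugged in afterwards are the ones reported in this thread.

```c
/* Effective payload rate in MB/s (1 MB = 10^6 bytes) when one TLP carrying
 * payload_bytes leaves every spacing_ns nanoseconds.
 * bytes per ns * 10^9 = bytes per s; divided by 10^6 gives MB/s. */
double effective_mb_per_s(double payload_bytes, double spacing_ns)
{
    return payload_bytes * 1000.0 / spacing_ns;
}
```

Under this model, 8 bytes every 50 ns is 160 MB/s, 8 bytes every 300 ns is about 26.7 MB/s, and 64 bytes every 200 ns is 320 MB/s.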

One suggestion that occasionally shows up in Intel documents is to use write combining and ensure that the full 64 Bytes is written as quickly as possible. Filling the buffer is one of the triggers that causes it to drain -- presumably quickly, but with no guarantees. Other transactions that are guaranteed to close & flush write-combining buffers are similarly lacking in commentary on timing. Given that the purpose of write-combining buffers is to stay open long enough to combine writes, there will always be a tension between low latency and high utilization. (UC writes have no such tension -- the implementation is going to try to execute them as fast as possible while remaining consistent with ordering rules. Unfortunately "as fast as possible" is usually very, very slow compared to ordinary pipelined transactions.)
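
A minimal sketch of the "fill the whole 64 Bytes quickly" idea, assuming `dst` points into a 64-byte-aligned region that is mapped write-combining. The function name and macros are hypothetical. The store fence at the end is one of the operations that flushes write-combining buffers on x86, but, as noted above, nothing here guarantees when the resulting TLPs are actually emitted.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <xmmintrin.h>               /* _mm_sfence */
#define STORE_FENCE() _mm_sfence()
#else
#define STORE_FENCE() __sync_synchronize()   /* generic fallback, not WC-specific */
#endif

#define WORDS_PER_LINE 8u            /* 64 bytes / 8-byte stores */

/* Write n_lines complete 64-byte lines from src to dst.
 * Assumption: dst is 64-byte aligned and mapped write-combining, so each group
 * of eight consecutive stores fills exactly one WC buffer. */
void wc_write_lines(volatile uint64_t *dst, const uint64_t *src, size_t n_lines)
{
    for (size_t line = 0; line < n_lines; line++) {
        /* Eight back-to-back 8-byte stores, lowest address first,
         * with no other memory traffic in between. */
        for (size_t w = 0; w < WORDS_PER_LINE; w++)
            dst[line * WORDS_PER_LINE + w] = src[line * WORDS_PER_LINE + w];
    }
    /* Flush any buffer that is still open after the last line. */
    STORE_FENCE();
}
```

Calling `wc_write_lines(bar, data, 4)` would correspond to the 256-byte case, four 64-byte lines, where `bar` and `data` stand for the caller's own mapping and source buffer.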

Alexis · ‎06-15-2020

Dr. Bandwidth, all your posts are always very informative and useful, thank you!

During that time, I made progress.

First, I wasn't interested in generating combined TLPs (WC), but it seems to be the only way to get noticeably faster. I thought UC- and UC were the most performant ways of sending TLPs. As you describe, I expected to find a FIFO somewhere that removes the acknowledgement RTT delay.

In the end, using write-combining I could achieve what I want, but only in kernel space.

Sending 256B:

In the kernel:

for(int i=0;i<32;i++) {
  writeq(data64, mmap_addr+8*i);
}

In the user-space with mmap'd memory: vm_flags |= VM_PFNMAP | VM_DONTCOPY | VM_DONTEXPAND and pgprot_noncached

for(int i=0;i<32;i++) {
  *(uint64_t*)(mmap_addr+8*i) = data64;
}

Kernel: I can see 4x 64B TLPs back to back, with no delay between them
User-space: I can see 4x 64B TLPs with hundreds of nanoseconds between them

It seems the ack RTT signal is still present in user-space but not in the kernel. It's consistent over tens of tests.

Any thoughts on that behavior? How can I improve the user-space behavior?

I've already read many of your posts explaining the out-of-order behavior, the atomicity, and the lack of guarantees on TLP generation when WC is used.

Thank you again!

McCalpinJohn · ‎06-17-2020

The only thing that I can think of is to go back and verify that the MTRR and Page Table entries for the two cases are really absolutely the same.

There can be subtle differences in behavior between various memory types that are all "uncached". Essentially every "box" has a copy of the MTRRs, while only the core looks at the page table entries. Operations on addresses that are guaranteed to be uncached by an MTRR can skip snooping entirely, while operations that are uncached due to page table entries may have to generate snoops in the uncore. I usually think about this difference in terms of external DMA writes, but (depending on the specific transaction generated by the core) it could influence MMIO writes as well.

It should be possible to monitor the exact transaction types being used via the uncore performance counters -- particularly opcode-matching in the CHA. It looks like there is enough information in the tables of Chapter 3 of the SKX uncore performance monitoring guide, but I have not tried the opcode-matching stuff yet.

Alexis · ‎06-21-2020

I've made several tests in the kernel and in the user-space.

It seems easier in the kernel space to get back-to-back packets but in the user-space, the delay between two "flushes" seems to be nondeterministic even when fully sequential.

The memory is correctly set and the mapping doesn't change between kernel/user app: write-combining @ 0xa1000000-0xa2000000
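
To make the comparison between the kernel and user-space mappings mechanical, here is a small sketch that parses a line of the form quoted above and checks whether it is a write-combining entry covering a given BAR range. The line has the shape of an entry in the kernel's PAT memtype list, often exposed as /sys/kernel/debug/x86/pat_memtype_list; that path, the treatment of the end address as exclusive, and all names in the code are assumptions rather than details from this post, and the line format differs between kernel versions.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct memtype_range {
    char type[32];      /* e.g. "write-combining", "uncached-minus" */
    uint64_t start;     /* first physical address of the range */
    uint64_t end;       /* assumed exclusive end address */
};

/* Parse "<type> @ 0x<start>-0x<end>". Returns 1 on success, 0 otherwise. */
int parse_memtype_line(const char *line, struct memtype_range *r)
{
    int n = sscanf(line, "%31s @ 0x%" SCNx64 "-0x%" SCNx64,
                   r->type, &r->start, &r->end);
    return n == 3 && r->start < r->end;
}

/* Returns 1 if the line is a write-combining entry that fully covers
 * [bar_start, bar_start + bar_len), 0 otherwise. */
int line_covers_wc(const char *line, uint64_t bar_start, uint64_t bar_len)
{
    struct memtype_range r;
    if (!parse_memtype_line(line, &r))
        return 0;
    if (strcmp(r.type, "write-combining") != 0)
        return 0;
    return r.start <= bar_start && bar_start + bar_len <= r.end;
}
```

Running the check once while the kernel test is active and once while the user application has the region mapped would show whether both cases really see the same entry.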

In 1 run out of 20, the delay between 64B TLPs is much smaller (40 ns at most, and some 64B TLPs are back to back), which means it is possible: at some point all the required conditions are met.
In 19 runs out of 20, the delay between all the TLPs is huge (200-300 ns) and none are back to back.

As said, in the kernel, a for-loop writing N times 8B sequentially generates TLPs back to back.

I'm very curious about those conditions to get performant writes using WC.

I'll try to use your app found on your Github to read the uncore performance counters.

Thank you for your help.

Previous edit: Somehow, by moving the data to stack memory in the user app, I could get all the TLPs back to back without any delay. But that changed on the second run. So the results are unreliable, and it's very frustrating not to know what's happening in detail.

Alexis · ‎06-26-2020

Hello,

After a lot of reading and testing, I'm back. I think my question is more about the WC/cache architecture than about PCIe.

I can consistently get the expected result with sequential writes in the probe/remove functions of the kernel module, but when the same writes are located elsewhere in the code, performance is very poor.

I still struggle to understand why, for sequential writes in any other function of the module (and in user space), the CPU correctly flushes the WC buffer when it is full but adds a delay between flushes.

I'm open to any suggestions.

I'm going to try an AMD Threadripper CPU, in the hope that its WCB/LFB is better documented or behaves more consistently.

CPUs: i9-9900K + i9-7980XE (Coffee Lake and Skylake-X)