Intel® ISA Extensions

TSX and PCI consistent memory

A_H_
Beginner

Hello Forum,

I am evaluating Intel TSX with respect to contention from an I/O card. Specifically, I expect to see a transaction abort when it reads a memory location that is allocated as PCI coherent (pci_alloc_consistent) and modified by a PCIe device. My test measures the time it takes for the CPU to abort my transaction, which would otherwise run forever since it contains an infinite loop.

I am observing a duration of about 4 ms when using a global or local variable, and the abort status is 0. That makes some sense: the transaction is aborted when my thread's time slice expires and the Linux scheduler takes over.

When using a memory location that is allocated as PCI coherent by the device driver and mapped into the user process, the transaction aborts in a very short time (< 1 µs), again with a status of 0, even though the PCIe device is not touching the memory (no reads, no writes).

The documentation does not mention anything about memory type / caching with respect to TSX.

Does someone have a good explanation for these observations?

Why is the abort status 0 in both cases?

Thanks,

A

 

 

#include <stdint.h>
#include <immintrin.h>   /* RTM intrinsics (_xbegin/_xend); compile with -mrtm */

int wait_on_address(volatile uint64_t *ptr)
{
    unsigned status = _xbegin();

    if (status == _XBEGIN_STARTED) {
        // we're in transactional context: this read adds *ptr to the read set
        uint64_t val = *ptr;
        (void)val;

        // spin forever; the test measures how long until the hardware aborts us
        while (1)
            ;

        _xend();                  // never reached
        //printf("Optimistic path success\n");
        return 0;
    }

    //printf("Optimistic path failed. status=0x%x\n", status);
    return 1;   // Value changed or transaction got cancelled
}
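
For reference, the abort status returned by _xbegin() on the fallback path can be decoded with the standard _XABORT_* masks from immintrin.h. Here is a minimal sketch (not part of my test; a status of 0 simply means none of these bits were set):

#include <stdio.h>
#include <immintrin.h>

/* Print the reason bits reported for an aborted transaction. */
static void print_abort_status(unsigned status)
{
    if (status == 0)
        printf("abort reported with no status bits set\n");
    if (status & _XABORT_EXPLICIT)
        printf("explicit abort (xabort), code=%u\n", _XABORT_CODE(status));
    if (status & _XABORT_RETRY)
        printf("transaction may succeed on retry\n");
    if (status & _XABORT_CONFLICT)
        printf("memory conflict with another agent\n");
    if (status & _XABORT_CAPACITY)
        printf("read/write set exceeded internal buffering capacity\n");
    if (status & _XABORT_DEBUG)
        printf("debug breakpoint hit\n");
    if (status & _XABORT_NESTED)
        printf("abort occurred inside a nested transaction\n");
}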

9 Replies
Roman_D_Intel
Employee

The SDM (www.intel.com/sdm) documents it in Section 15.3.8.2, "Runtime Considerations". It says: "Transactional execution only supports write-back cacheable memory type operations. A transactional region may always abort if it includes operations on any other memory type. This includes instruction fetches to UC memory type."
Roman

A_H_
Beginner

I must have missed those lines in the documentation. Changing the memory type to WB did the trick.

Thanks for your help!

BTW, where in Linux do I look up the cache type of some pages? Can I read it from /proc/PID/pagemap, /proc/mtrr, or /sys/kernel/debug/x86/pat_memtype_list?
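
(For the record, here is a minimal sketch of what I had in mind for the last option: just dumping the debugfs list, assuming the kernel is built with PAT support, debugfs is mounted at /sys/kernel/debug, and the program is run as root.)

#include <stdio.h>
#include <stdlib.h>

/* Dump the kernel's list of reserved memory-type regions (PAT).     */
/* Each line describes a physical range and the memory type it uses. */
int main(void)
{
    const char *path = "/sys/kernel/debug/x86/pat_memtype_list";
    FILE *f = fopen(path, "r");
    if (!f) {
        perror(path);
        return EXIT_FAILURE;
    }

    char line[256];
    while (fgets(line, sizeof line, f))
        fputs(line, stdout);

    fclose(f);
    return EXIT_SUCCESS;
}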

McCalpinJohn
Honored Contributor III

The page type is determined by the PAT entry for the page, the value in the corresponding PAT look-up table, and the MTRR for the region (or the default memory type if there is no MTRR defined for the region).  This is discussed in Chapter 11 of Volume 3 of the Intel Architecture SW Developer's Manual (document 325384).   Table 11-2 in Section 11.3 is a particularly good high-level summary, while Table 11-7 in Section 11.5.2.2 shows how the MTRRs and PATs combine to form a specific memory type.

Generally systems are set up to use only 3 modes: WriteBack for all system memory, UnCached for memory-mapped IO control space, and WriteCombining for memory-mapped IO data regions.

Changing a region of memory to be UnCached is relatively easy, since (as shown in Table 11-7) changing the MTRR to UC will override the PATs (which are much more difficult to change).  Linux has very limited support for WriteProtect and WriteThrough memory types -- the MTRR driver knows about these, but it is often very difficult to get the PATs set up in a compatible mode.

FYI: The WB type will not work with memory-mapped IO.   You can program the bits to set up the mapping as WB, but the system will crash as soon as it gets a transaction that it does not know how to handle.  It is theoretically possible to use WP or WT to get cached reads from MMIO, but coherence has to be handled in software. 

A_H_
Beginner

Thanks for the information. I can see that extracting the page type is not trivial, especially understanding the combination of MTRRs and PATs.

>> FYI: The WB type will not work with memory-mapped IO. You can program the bits to set up the mapping as WB, but the system will crash as soon as it gets a transaction that it does not know how to handle.

In my scenario, this memory area is used for Rx only, so mapping it as WB worked (the CPU never writes to it).

jimdempseyatthecove
Honored Contributor III

>> this memory area is used for Rx only, so mapping it as WB worked (the CPU never writes to it).

However, if after you read a location the data remains in cache, a subsequent write by the device to that area will not invalidate the cache line. This is what John was talking about. To correct for this you will need to introduce a means to re-establish coherency, IOW evict the stale cache lines using CLFLUSH; MFENCE on the host, or CLEVICT0; CLEVICT1; MFENCE on KNC. Though it is not clear to me whether, on KNC, CLEVICTn invalidates any cache other than the L1/L2 of the core that issues the CLEVICTn; I suspect it does not. John, do you have any additional information on this? I suspect (but cannot confirm) that KNL supports CLFLUSH. A sketch of the host-side eviction step is below.
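
Here is a minimal sketch of that host-side eviction, purely for illustration; the flush_rx_window() name, the buffer parameters, and the 64-byte line size are my own assumptions:

#include <stddef.h>
#include <immintrin.h>   /* _mm_clflush, _mm_mfence */

#define CACHE_LINE_SIZE 64   /* assumed cache line size */

/* Evict every cache line covering [buf, buf + len) so that subsequent */
/* reads fetch fresh data written by the device rather than stale data. */
void flush_rx_window(const void *buf, size_t len)
{
    const char *p   = (const char *)buf;
    const char *end = p + len;

    for (; p < end; p += CACHE_LINE_SIZE)
        _mm_clflush(p);

    _mm_mfence();   /* ensure the flushes complete before re-reading */
}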

Jim Dempsey

McCalpinJohn
Honored Contributor III

It is possible that WB mode will function for MMIO if there are no stores, but in that case it is safer to use WP or WT.  (These can be created by the combination of WB PAT entries with a WP or WT MTRR.)   

Using the CLFLUSH instruction for coherence could easily cause a coherence protocol violation since CLFLUSH is required to remove the line from all caches in the system.  (The specific transactions are not documented, but an "invalidate" transaction does not map to a PCIe read or write instruction directly.)

I suspect that KNL will support CLFLUSH rather than the CLEVICT0/CLEVICT1 instructions used in the KNC.  (It does not matter here, since KNL is not likely to support TSX.)  The KNC's CLEVICT0/CLEVICT1 instructions are directives to the "local" cache(s) only (very similar to my earlier IBM patent: http://www.google.com/patents/US7194587), so they should not send any message(s) to the rest of the system (unless the line was dirty, in which case an ordinary writeback will be generated).  This should be "safe" for cached MMIO space for clean data, while WP and WT don't support dirty lines in the cache.  With WB-mapped MMIO space, the writeback of a dirty line will likely cause a coherence fault.

jimdempseyatthecove
Honored Contributor III

John,

Can you comment on my thoughts?

A H would like to access an I/O card within a TSX protected region. As I see it, there are two issues that have to be worked around: a) if the I/O region is (if possible) configured to be cached, then he has the problem of getting the device's writes to update the CPU cache line(s), and b) being able to complete a TSX region including the I/O page in one TSX session.

Now comes the tricky part: CLFLUSH. It is not clear to me (maybe you can clarify this) whether it clears only the specified cache line, or possibly any cached line that matches the tag (a subset of the bits of the cache line address). If CLFLUSH flushes only the specified (fully resolved) cache line, then AH can use it (with restrictions); however, if CLFLUSH flushes any cache line of a different address that matches the tag, then the CLFLUSH can cause a coherency protocol violation. Can you comment as to which is flushed?

Now then (assuming the exact cache line is flushed): if, prior to entering the TSX protected region, sufficient CLFLUSHes are issued to evict possibly stale data, and then the TSX region is entered, the region will complete (barring an interrupt) unless some other thread enters the CLFLUSH preamble to a TSX region referencing the I/O page. A sketch of that pattern follows.
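
Roughly what I have in mind, as an illustrative sketch only; it reuses the hypothetical flush_rx_window() helper from my earlier post, and the rx_window layout is invented for the example:

#include <stddef.h>
#include <stdint.h>
#include <immintrin.h>   /* RTM intrinsics; compile with -mrtm */

void flush_rx_window(const void *buf, size_t len);   /* from the earlier sketch */

/* Hypothetical layout of the device's Rx window (one cache line). */
struct rx_window {
    uint64_t seq;
    uint64_t payload[7];
};

/* CLFLUSH preamble followed by a transactional read of the window. */
/* Returns 0 if the transaction committed, 1 if it aborted.         */
int read_rx_window(volatile struct rx_window *win, struct rx_window *out)
{
    /* Preamble: evict possibly stale lines before starting the region. */
    flush_rx_window((const void *)win, sizeof(*win));

    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        *out = *(const struct rx_window *)win;   /* joins the read set */
        _xend();
        return 0;
    }
    return 1;   /* aborted; caller retries (and may re-flush first) */
}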

An additional issue that may be insurmountable is that the TSX protected region will write to the I/O page only the final value, and only where that final value differs from the original content. IOW if a location in the I/O page contained A and the protected region wrote B, C, A to that cache line, then nothing is written.

Jim Dempsey

McCalpinJohn
Honored Contributor III

CLFLUSH should do a flush of the private caches on the fully resolved address, so aliasing should not be a problem.

If the data is dirty, the CLFLUSH will generate a writeback of the modified line.   If the address maps to MMIO, the writeback transaction will probably be considered a protocol error and crash the system.  (It would be possible for the hardware to translate the writeback into a 64Byte write transaction, but since this is an explicitly unsupported mode of operation there is no good reason to do that translation.)

If the data was clean (or not present) in the local caches, the CLFLUSH instruction must also broadcast an invalidate transaction on the target address to all caching agents in the system.  (In a single socket system the invalidate may only need to be sent to the local L3.)  If the address translates to MMIO space, the invalidate operation might be treated as a protocol error and crash the system.  (Alternatively, the invalidate operation may work correctly -- clearing all the processor caches, but not sending any message to the PCIe device.   This mode would allow the use of WP and WT caching for MMIO regions.)

There is a fundamental disconnect between TSX and MMIO.  TSX is used for atomically updating multiple data items, so some agent has to write to the set of addresses.  If the core writes to the MMIO addresses using the WB memory type, then you have dirty data in the caches and the coherency protocol will crash the system.   If the PCIe device writes to the MMIO addresses, it is unable to generate invalidate transactions to send to the processor caches.  Manually flushing the lines can be used to maintain coherence, but the point of TSX is that the hardware is monitoring reads and writes to the set of addresses, which in this case it clearly cannot do (because there are no PCIe transactions of the type needed).  

It was almost possible to hack around the cached MMIO coherence part of this problem on AMD Family10h processors by using the IORR registers, but the specific combination of settings I needed was explicitly disallowed by the HW.  I never did figure out if this was due to a fundamental protocol inconsistency or simply because it was not an explicitly supported mode of operation.

jimdempseyatthecove
Honored Contributor III

>>If the data is dirty, the CLFLUSH will generate a writeback of the modified line.

A H mentioned that only the PCIe device writes to that memory area. The purpose of the CLFLUSH is to invalidate the cache line so that new data written by the PCIe device is fetched on the next read. The CLFLUSH is performed intermittently.

>> If the PCIe device writes to the MMIO addresses, it is unable to generate invalidate transactions to send to the processor caches.  Manually flushing the lines can be used to maintain coherence, but the point of TSX is that the hardware is monitoring reads and writes to the set of addresses, which in this case it clearly cannot do (because there are no PCIe transactions of the type needed). 

Right, so AH will need to incorporate a means for the PCIe device to notify the CPU after it updates the window, together with a guaranteed dwell time before the device touches the data in the window again. This is somewhat the same technique as directly updating the old VGA frame buffer during the vertical scan interval, though (I guess) there is no equivalent of the frame interrupt here. If the device is user-built (an FPGA thingie), it may be possible to program it so that the CPU can see when an update session is complete and the FPGA then enters a dwell or compute time of known duration. The program on the CPU would then be required to extract the data within that interval. Code would have to be added to detect a dropped frame or a corrupt frame; a rough sketch of one possible scheme is below.
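
One hypothetical way to structure that detection, sketched purely for illustration (the sequence-number convention, the frame layout, and the flush_rx_window() helper are all my own assumptions, not something A H described):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

void flush_rx_window(const void *buf, size_t len);   /* from the earlier sketch */

/* Hypothetical Rx frame: the device sets `seq` to an odd value while  */
/* it is writing the frame and to the next even value once the frame   */
/* is complete, so the CPU can detect torn or dropped frames.          */
struct rx_frame {
    volatile uint64_t seq;
    uint8_t data[4096 - sizeof(uint64_t)];
};

/* Returns 0 on a consistent copy, 1 if the frame was torn or replaced. */
int copy_rx_frame(volatile struct rx_frame *win, uint8_t *dst)
{
    flush_rx_window((const void *)win, sizeof(*win));   /* drop stale lines */

    uint64_t before = win->seq;
    if (before & 1)
        return 1;                            /* device is mid-update */

    memcpy(dst, (const void *)win->data, sizeof(win->data));

    flush_rx_window((const void *)&win->seq, sizeof(win->seq));
    uint64_t after = win->seq;
    return (after == before) ? 0 : 1;        /* changed => dropped/corrupt */
}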

Jim Dempsey
