Cache write-back for PCI device

le_g_1 · ‎05-29-2013

Hi everyone.

I have a PCI device registered with some physical memory addresses. I tested its cache behavior on an i7 CPU. Basically I confirmed what the SDM says about caching, except that when I set both PAT and MTRR to write-back, the Linux kernel hit a fatal error. Using mcelog, I got the following message. How can I fix this bug? Or can anyone give me some hints on how to analyse the information? Thanks.

Hardware event. This is not a software error.
CPU 0 BANK 6 TSC 14d88398d30
RIP !INEXACT! 60:c097ce67
MISC 3000038086 ADDR f7c20000
TIME 1369815438 Wed May 29 16:17:18 2013
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Generic CACHE Level-2 Generic Error
STATUS be2000000003110a MCGSTATUS 5
CPUID Vendor Intel Family 6 Model 42
RIP: __schedule+0x347/0x7c0}
SOCKET 0 APIC 0 microcode 28
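
For anyone trying to read a record like this by hand, the STATUS and MCGSTATUS values can be decoded directly. Below is a minimal Python sketch, assuming the IA32_MCi_STATUS and IA32_MCG_STATUS bit layouts and the compound "cache hierarchy" error-code format (000F 0001 RRRR TTLL) described in the Intel SDM, Volume 3. The function names are made up for illustration, and only the fields that mcelog printed above are covered.

```python
# Bit positions assumed from the Intel SDM machine-check chapter.
# All helper names here are hypothetical, chosen for this sketch only.

def bit(value, n):
    """Return bit n of value as 0 or 1."""
    return (value >> n) & 1

def decode_mcg_status(mcg):
    """List the flags set in IA32_MCG_STATUS."""
    flags = {"RIPV": 0, "EIPV": 1, "MCIP": 2}
    return [name for name, pos in flags.items() if bit(mcg, pos)]

def decode_mci_status(status):
    """Split IA32_MCi_STATUS into its flag bits and error codes."""
    flags = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
             "MISCV": 59, "ADDRV": 58, "PCC": 57}
    out = {name: bit(status, pos) for name, pos in flags.items()}
    out["mca_code"] = status & 0xFFFF
    out["model_specific_code"] = (status >> 16) & 0xFFFF
    return out

REQUESTS = ["ERR", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP"]
TYPES = ["Instruction", "Data", "Generic", "reserved"]
LEVELS = ["L0", "L1", "L2", "Generic"]

def decode_cache_error(code):
    """Decode a compound cache hierarchy code: 000F 0001 RRRR TTLL."""
    if (code & 0xEF00) != 0x0100:
        return None  # not a cache hierarchy error
    r = (code >> 4) & 0xF
    return {
        "filtered": bool(bit(code, 12)),
        "request": REQUESTS[r] if r < len(REQUESTS) else "reserved",
        "type": TYPES[(code >> 2) & 0x3],
        "level": LEVELS[code & 0x3],
    }

status = decode_mci_status(0xbe2000000003110a)
print(decode_mcg_status(0x5))
print(status)
print(decode_cache_error(status["mca_code"]))
```

Fed the two values from the log, this should agree with what mcelog printed: RIPV and MCIP set in MCGSTATUS; UC, EN, MISCV, ADDRV and PCC set in STATUS; and a generic Level-2 cache error with the filtering bit on.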

Patrick_F_Intel1 · ‎05-29-2013

Hello le,

I'm not completely sure but I think that PCI-E devices only support uncached/write-combining memory accesses. Otherwise it seems like you'd run into memory coherency problems.

Have you tried setting the memory type to uncached/write-combining?

Pat

le_g_1 · ‎05-29-2013

Hi, Mr Fay,

Thanks for your quick reply. Yes, I tried setting memory type to UC/UC- and the program runs well.

The SDM says that the effective memory type is determined by the mechanism that prevents caching. When I set MTRR to UC and PAT to WB, or MTRR to WB and PAT to UC, the program runs well. Only when I set both PAT and MTRR to WB does the kernel crash. What confuses me is that I did nothing but write a single byte to that memory location. As far as I know, no other process could access that location, so I cannot think of any memory coherency problem. Furthermore, I disabled multi-processor support in the BIOS.

Patrick_F_Intel1 · ‎05-29-2013

It seems like the PCI-E device itself is like 'another process' with which you need to worry about coherency. When you do a write to WB memory, the line sits in the cache for some time until it gets kicked out for some reason. Usually when I think of doing a write to a device via memory mapped region, you want the write to not sit in cache but to go straight to the device. So I'm not sure why you'd want WB (except that WB is faster due to not accessing memory every read/write).

In any case, it sounds like you've found that the only (as far as I know) supported PCI-E mode (uc/wc) is working properly... which is good.

Pat

McCalpinJohn · ‎05-29-2013

It is possible to map IO devices to cacheable memory on at least some processors, but the accesses have to be very carefully controlled to keep within the capabilities of the hardware -- some of the transactions to cacheable memory can map to IO transactions and some cannot.
I don't know the details for Intel processors, but I did go through all the combinations in great detail when I worked for that other company that makes x86-64 processors.

Speaking generically, some examples of things that should and should not work (though the details will depend on the implementation):

Load miss -- generates a cache line read -- converted to a 64 Byte IO read -- works OK.
    BUT, there is no way for the IO device to invalidate that line in the processor(s) cache(s), so coherence must be maintained manually using the CLFLUSH instruction. NOTE also that the CLFLUSH instruction may or may not work as expected when applied to addresses that are mapped to MMIO, since the coherence engines are typically associated with the memory controllers, not the IO controllers. At the very least you will need to pin threads doing cached MMIO to a single core to maximize the chances that the CLFLUSH instructions will actually clear the (potentially stale) copies of the cache lines mapped to the MMIO range.

Streaming Store (aka Write-Combining store, aka Non-temporal store) -- generates one or more uncached stores -- works OK.
    This is the only mode that is "officially" supported for MMIO ranges. It was added in the olden days to allow a processor core to execute high-speed stores into a graphics frame buffer (i.e., before there was a separate graphics processor). These stores do not use the caches, but do allow you to write to the MMIO range using full cache line writes and (typically) allow multiple concurrent stores in flight.
    The Linux "ioremap_wc" maps a region so that all stores are translated to streaming stores, but because the hardware allows this, it is typically possible to explicitly generate streaming stores (MOVNT* instructions such as MOVNTI or MOVNTDQ) for MMIO regions that are mapped as cached.

Store Miss (aka "Read For Ownership"/RFO) -- generates a request for exclusive access to a cache line -- probably won't work.
    The reason that it probably won't work is that RFO requires that the line be invalidated in all the other caches, with the requesting core not allowed to use the data until it receives acknowledgements from all the other cores that the line has been invalidated -- but an IO controller is not a coherence controller, so it (typically) cannot generate the required probe/snoop transactions.
    My guess is that the hardware crashed on this instruction.
    It is possible to imagine implementations that would convert this transaction to an ordinary 64 Byte IO read, but then some component of the system would have to "remember" that this translation took place and would have to lie to the core and tell it that all the other cores had responded with invalidate acknowledgements, so that the core could place the line in "M" state and have permission to write to it.

Victim Writeback -- writes back a dirty line from cache to memory -- probably won't work.
    Assuming that you could get past the problems with the "store miss" and get the line in "M" state in the cache, eventually the cache will need to evict the dirty line. Although this superficially resembles a 64 Byte store, from the coherence perspective it is quite a different transaction. A Victim Writeback actually has no coherence implications -- all of the coherence was handled by the RFO up front, and the Victim Writeback is just the delayed completion of that operation. Again, it is possible to imagine an implementation that simply mapped the Victim Writeback to a 64 Byte IO store, but when you get into the details there are features that just don't fit. I don't know of any processor implementation for which a mapping of Victim Writeback operations to MMIO space is supported.

There is one set of mappings that can be made to work on at least some x86-64 processors, and it is based on mapping the MMIO space *twice*, with one mapping used only for reads and the other mapping used only for writes:

1. Map the MMIO range with a set of attributes that allow write-combining stores (but only uncached reads). This mode is supported by x86-64 processors and is provided by the Linux "ioremap_wc()" kernel function.
2. Map the MMIO range a second time with a set of attributes that allow cache-line reads (but only uncached, non-write-combined stores).
   The MTRR type(s) that allow this are "Write-Through" (WT) and "Write-Protect" (WP).
   These might be mapped to the same behavior internally, but the nominal difference is that in WT mode stores *update* the corresponding line if it happens to be in the cache, while in WP mode stores *invalidate* the corresponding line if it happens to be in the cache. In this case it does not matter, since we will not be executing any stores to this region. On the other hand, we will need to execute CLFLUSH operations to this region, since that is the only way to ensure that (potentially) stale cache lines are removed from the cache and that the subsequent read operation to a line actually goes to the MMIO-mapped device and reads fresh data.

On the particular device that I am fiddling with now, the *device* exports two address ranges using the PCIe BAR functionality. These both map to the same memory locations on the device, but each BAR is mapped to a different *physical* address by the Linux kernel. The different *physical* addresses allow the MTRRs to be set differently (WC for the write range and WT/WP for the read range). These are also mapped to different *virtual* addresses so that the PATs can be set up with values that are consistent with the MTRRs.

A slightly extended & generalized version of this discussion is on my blog at http://blogs.utexas.edu/jdm4372/

I hope this helps....

le_g_1 · ‎05-30-2013

Hi Mr. McCalpin,

I have studied your post carefully; it is really good work.

I have one question to confirm. We need to execute CLFLUSH to invalidate the cache line in the WT range once device memory is modified. Is this based on the fact that in the WT range a cache line is never marked as modified, so CLFLUSH is equivalent to simply invalidating that line?

I tried for a whole day, but failed to solve my bug... It seems that IO devices hardly support WB caching...

McCalpinJohn · ‎05-30-2013

For the "read-only" range, cached copies of MMIO lines will never be invalidated by external traffic, so repeated reads of the data will always return the cached copy. Since there are no external mechanisms to invalidate the cache line, we need a mechanism that the processor can use to invalidate the line, so the next load to that line will go to the IO device and get fresh data.
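
To make the staleness problem concrete, here is a toy Python model. It is only a sketch under simplifying assumptions (one cache level, 64-byte lines, no evictions), and the class and method names are invented for illustration. It models nothing about real hardware beyond the one property discussed above: the device can change its memory without the cached copy being invalidated.

```python
LINE = 64  # cache line size in bytes (assumed)

class ToyDevice:
    """Stands in for device memory behind an MMIO range (hypothetical)."""
    def __init__(self):
        self.mem = {}

    def read_line(self, line_addr):
        return self.mem.get(line_addr, 0)

    def device_side_update(self, line_addr, value):
        # The device changes its own memory; nothing snoops the CPU cache.
        self.mem[line_addr] = value

class ToyCache:
    """A cache that only ever fills on a load miss (hypothetical)."""
    def __init__(self, device):
        self.device = device
        self.lines = {}

    def load(self, addr):
        line = addr - addr % LINE
        if line not in self.lines:
            # Load miss: fetch the whole line from the device.
            self.lines[line] = self.device.read_line(line)
        return self.lines[line]

    def clflush(self, addr):
        # Drop the line so the next load has to go back to the device.
        self.lines.pop(addr - addr % LINE, None)

dev = ToyDevice()
cache = ToyCache(dev)
dev.device_side_update(0x1000, 1)
first = cache.load(0x1008)         # miss, reads 1 from the device
dev.device_side_update(0x1000, 2)  # device data changes behind the cache
stale = cache.load(0x1008)         # hit, still returns the old value
cache.clflush(0x1008)
fresh = cache.load(0x1008)         # miss again, reads the new value
print(first, stale, fresh)
```

The second load returns the old value because nothing told the cache that the device changed; only the explicit flush forces the third load to fetch fresh data.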

There are a number of ways that a processor should be able to invalidate a cached MMIO line. Not all of these will work on all implementations!

Cached copies of MMIO addresses can, of course, be dropped when they become LRU and are chosen as the victim to be replaced by a new line brought into the cache.
    A code could read enough conflicting cacheable addresses to ensure that the cached MMIO line would be evicted.
    The number is typically 8 for a 32 KiB data cache, but you need to be careful that the reads have not been rearranged to put the cached MMIO read in the middle of the "flushing" reads. There are also some systems for which the pseudo-LRU algorithm has "features" that can break this approach. (HyperThreading and shared caches can both add complexity in this dimension.)

The CLFLUSH instruction operating on the virtual address of the cached MMIO line should evict it from the L1 and L2 caches.
    Whether it will evict the line from the L3 depends on the implementation, and I don't have enough information to speculate on whether this will work on Xeon processors. For AMD Family 10h processors, due to the limitations of the CLFLUSH implementation, cached MMIO lines are only allowed in the L1 cache.

For memory mapped by the MTRRs as WP ("Write Protect"), a store to the address of the cached MMIO line should invalidate that line from the L1 & L2 data caches. This will generate an *uncached* store, which typically stalls the processor for quite a while, so it is not a preferred solution.

The WBINVD instruction (kernel mode only) will invalidate the *entire* processor data cache structure and according to the Intel Architecture Software Developer's Guide, Volume 2 (document 325338-044), will also cause all external caches to be flushed. Additional details are discussed in the SW Developer's Guide, Volume 3. Additional caution needs to be taken if running with HyperThreading enabled, as mentioned in the discussion of the CPUID instruction in the SW Developer's Guide, Vol 2.

The INVD instruction (kernel mode only) will invalidate all the processor caches, but it does this non-coherently (i.e., dirty cache lines are not written back to memory, so any modified data gets lost). This is very likely to crash your system, and is only mentioned here for completeness.

AMD processors support some extensions to the MTRR mechanism that allow read and write operations to the same physical address to be sent to different places (i.e., one to system memory and the other to MMIO). This is *almost* useful for supporting cached MMIO, but that is a story for another time and place.... Perhaps on my blog: http://blogs.utexas.edu/jdm4372/

There are likely to be more complexities that I am not remembering right now, but the preferred answer is to bind the process doing the cached MMIO to a single core (and single thread context if using HyperThreading) and use CLFLUSH on the address you want to invalidate. There are no guarantees, but this seems like the approach most likely to work.
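
For the "read enough conflicting addresses" approach mentioned above, the addresses that compete for the same cache set can be computed ahead of time. The Python sketch below assumes a 32 KiB, 8-way L1 data cache with 64-byte lines (so 64 sets, with the set index taken from address bits 6 to 11); these parameters and the function names are assumptions for illustration, not something to rely on for a particular processor. It only computes addresses. It says nothing about whether the replacement policy will actually evict the MMIO line.

```python
CACHE_BYTES = 32 * 1024   # assumed L1 data cache size
WAYS = 8                  # assumed associativity
LINE_BYTES = 64           # assumed line size
SETS = CACHE_BYTES // (WAYS * LINE_BYTES)
SET_STRIDE = SETS * LINE_BYTES  # addresses this far apart share a set

def set_index(addr):
    """Cache set that addr maps to under the assumed geometry."""
    return (addr // LINE_BYTES) % SETS

def conflicting_addresses(target, buffer_base, count=WAYS):
    """Line addresses at or above buffer_base that map to target's set."""
    # Line-aligned offset of the target within one set stride.
    offset = target % SET_STRIDE - target % LINE_BYTES
    first = buffer_base - buffer_base % SET_STRIDE + offset
    if first < buffer_base:
        first += SET_STRIDE
    return [first + i * SET_STRIDE for i in range(count)]

# Example: eight ordinary cacheable addresses that compete with one MMIO line.
mmio_line = 0xd0000040
for addr in conflicting_addresses(mmio_line, 0x10000000):
    print(hex(addr), set_index(addr) == set_index(mmio_line))
```

Under these assumptions the set stride is 4 KiB, which is why 8 reads spaced one page apart are the usual starting point.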

McCalpinJohn · ‎06-10-2013

For "le g":
Just out of curiosity, what interface did you use to set the PAT attributes to Write Back?

I was able to set up my MMIO range with an MTRR of "Write-Protect" using the /proc/mtrr interface:
echo "base=0xd0000000 size=0x08000000 type=write-protect" >/proc/mtrr
and "cat /proc/mtrr" confirmed that this set up a 128MiB region with Write-Protect mode at the desired address:
reg05: base=0x0d0000000 ( 3328MB), size= 128MB, count=0: write-protect
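
As a sanity check before writing to /proc/mtrr, the base and size can be validated first. This is a small Python sketch, assuming that a variable-range MTRR has to describe a power-of-two sized region whose base is aligned to its size; the helper name is made up, and it only builds the command string rather than touching the system.

```python
MIB = 1 << 20

def mtrr_command(base, size, mem_type):
    """Build the /proc/mtrr command string after basic checks (hypothetical helper)."""
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a power of two")
    if base % size:
        raise ValueError("base must be aligned to size")
    return 'echo "base=0x%x size=0x%08x type=%s" >/proc/mtrr' % (base, size, mem_type)

base, size = 0xd0000000, 0x08000000
print(mtrr_command(base, size, "write-protect"))
print(base // MIB, "MB base,", size // MIB, "MB size")
```

For the values used here, the base works out to 3328 MB and the size to 128 MB, matching the "cat /proc/mtrr" line above.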

Unfortunately, in my version of Linux (2.6.32.279), the kernel function "ioremap_cache()" sets up the PATs as "UC-", which seems more than a little counter-intuitive. I got this info using a trick that I found elsewhere on the interwebs:
        mount -t debugfs debugfs /sys/kernel/debug
        cat /sys/kernel/debug/x86/pat_memtype_list
The latter command returned a bunch of lines, the relevant ones being:
PAT memtype list:
[...]
write-combining @ 0xc8000000-0xd0000000
uncached-minus @ 0xd0000000-0xd8000000
[...]

The first of those entries is the "write-only" range to my FPGA, which has no MTRR entry (so is UC by default), but which allows the PAT to upgrade to WC. So "ioremap_wc()" clearly does what I was expecting.

The second entry is the "read-only" range to my FPGA, for which I set up an MTRR of type WP, but the kernel call "ioremap_cache()" clearly set up the region with an uncached attribute. According to the Intel Arch SW Developer's Manual, Volume 3, Chapter 11, Table 11-7, setting the PAT attribute to WB or WP, when combined with an MTRR of WP, will produce a WP range. Even a PAT setting of WT would be acceptable, since it would combine with the MTRR to give a WT type.

My interpretation is that Linux does not think I should be allowed to make MMIO cached (of any sort), and so it just ignores the request implicit in the "ioremap_cache()" call. Now I need to figure out how to override that irritating "protection".
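
When checking many ranges, the pat_memtype_list output can be parsed rather than read by eye. The following Python sketch assumes each entry has the form "type @ 0xSTART-0xEND" as in the excerpt above, and that the end address is exclusive (the first range above ends exactly where the second begins). The function names are invented for illustration.

```python
import re

ENTRY_RE = re.compile(
    r"^(?P<type>[\w-]+) @ 0x(?P<start>[0-9a-fA-F]+)-0x(?P<end>[0-9a-fA-F]+)$"
)

def parse_memtype_list(text):
    """Return (type, start, end) tuples; lines that are not entries are skipped."""
    entries = []
    for line in text.splitlines():
        m = ENTRY_RE.match(line.strip())
        if m:
            entries.append(
                (m.group("type"), int(m.group("start"), 16), int(m.group("end"), 16))
            )
    return entries

def memtype_at(entries, addr):
    """Return the PAT memtype covering addr, or None (end treated as exclusive)."""
    for mem_type, start, end in entries:
        if start <= addr < end:
            return mem_type
    return None

sample = """PAT memtype list:
write-combining @ 0xc8000000-0xd0000000
uncached-minus @ 0xd0000000-0xd8000000
"""
entries = parse_memtype_list(sample)
for mem_type, start, end in entries:
    print(mem_type, hex(start), (end - start) >> 20, "MiB")
print(memtype_at(entries, 0xd0000000))
```

On a live system the text would come from reading /sys/kernel/debug/x86/pat_memtype_list (as root, with debugfs mounted) instead of the sample string.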

le_g_1 · ‎06-11-2013

For "Dr. McCalpin":

I ran into the same problem, and also looked at /sys/kernel/debug/x86/pat_memtype_list. The Linux kernel apparently keeps a list of memory type ranges to protect. I "solved" the problem by hacking the kernel code a little, just to do the experiment. Specifically, I changed __ioremap_caller in http://lxr.linux.no/linux+v2.6.32/arch/x86/mm/ioremap.c. At line 177, I override the kernel logic by giving prot_val the expected value. I verified this by looking at the resulting page table entry, which shows that the PAT, PCD and PWT bits are all 0, which means a WB page.
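
The check on the page table entry can be written down as a small Python sketch. It assumes a 4 KiB page entry, where PWT is bit 3, PCD is bit 4 and PAT is bit 7, and that the three bits form an index into the 8-entry PAT. The two PAT tables below are assumptions: the power-on default from the SDM, and the layout that Linux kernels of the 2.6.32 era are believed to program (WC in place of WT). The names and the example entry values are made up for illustration.

```python
PWT_BIT, PCD_BIT, PAT_BIT_4K = 3, 4, 7

# Assumed PAT contents, index 0..7.
POWER_ON_PAT = ["WB", "WT", "UC-", "UC", "WB", "WT", "UC-", "UC"]
LINUX_2_6_32_PAT = ["WB", "WC", "UC-", "UC", "WB", "WC", "UC-", "UC"]

def pat_index(pte):
    """PAT index selected by a 4 KiB page table entry: PAT:PCD:PWT."""
    pwt = (pte >> PWT_BIT) & 1
    pcd = (pte >> PCD_BIT) & 1
    pat = (pte >> PAT_BIT_4K) & 1
    return (pat << 2) | (pcd << 1) | pwt

def pte_memory_type(pte, pat_table):
    """PAT memory type for the entry (before combining with the MTRR type)."""
    return pat_table[pat_index(pte)]

# Hypothetical entries for a page at physical address 0xd0000000.
for pte in (0xd0000063, 0xd000006b, 0xd0000073):
    print(hex(pte), pat_index(pte), pte_memory_type(pte, LINUX_2_6_32_PAT))
```

With all three bits clear the index is 0, which is WB in both tables. This is only the PAT side; the effective type still depends on the MTRR covering the address.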

One more piece of news. I switched to a relatively old PC (Core 2 Quad Q8200) in place of the i7 2600, and things got better. The machine works well when only a small amount of IO memory is accessed, but if more than 1 KB is written, the kernel still panics. Within that restricted memory area, I found that when MTRR and PAT are both set to WB for memory-mapped IO, reads work well, but written data is never flushed out to the device (I tried both CLFLUSH and WBINVD). I verified this by connecting several LEDs to my board.

McCalpinJohn · ‎06-11-2013

Thanks for the update. I was just looking at that bit of the kernel yesterday. In my case this should also work because (according to Table 11-7 of the Intel Arch SW Developer's Guide, Volume 3) an MTRR of type WP and a PAT of type WB combine to give a result of WP, which is what I want. Trying to give a PAT of WP would be a lot more work because the Linux PAT subsystem does not even include encodings for the WP or WT types, so I would have to either replicate or extend a lot more code....
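
For reference, the MTRR/PAT combinations that have come up in this thread can be collected into a lookup. This Python sketch deliberately covers only those combinations, as they were described in the posts above; it is not a transcription of Table 11-7, and anything else raises an error rather than guessing. The names are invented for illustration.

```python
# (MTRR type, PAT type) -> effective type, limited to cases stated in this thread.
EFFECTIVE = {
    ("UC", "WB"): "UC",  # either mechanism preventing caching wins
    ("WB", "UC"): "UC",
    ("WB", "WB"): "WB",  # the combination that crashed for MMIO
    ("UC", "WC"): "WC",  # PAT can upgrade a UC MTRR range to WC
    ("WP", "WB"): "WP",
    ("WP", "WP"): "WP",
    ("WP", "WT"): "WT",
}

def effective_type(mtrr, pat):
    """Look up the effective memory type for a combination discussed here."""
    try:
        return EFFECTIVE[(mtrr, pat)]
    except KeyError:
        raise LookupError(
            "combination not covered in this sketch; see SDM Vol. 3, Table 11-7"
        ) from None

print(effective_type("WP", "WB"))
print(effective_type("UC", "WC"))
```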

u9012063 · ‎07-21-2013

Hello le,

When you said program runs well under UC/UC, do you mean one processor or multi-processor?

I did the same setup (UC/UC). My simple read/write application seems ok. But when I use this UC memory to do complex work (such as make kernel), my system crashes.

William

le_g_1 · ‎07-31-2013

William, sorry for the delay.

I meant multi-processor. But my program only did very, very simple work, to verify what the Intel manual says.....

UC should not introduce memory consistency problems, so I don't think it was SMP that caused the crashes.

Would building the kernel access the UC memory location? I cannot see the connection.