Nios® V/II Embedded Design Suite (EDS)

Dual port memory corruption

Altera_Forum
Honored Contributor II
1,933 Views

We have a system that uses an M9K memory block for IPC between two Nios II CPUs. 

 

CPU A accesses it as tightly coupled data memory, CPU B via the data port and Avalon bus (no data cache on either cpu). The memory block is set to return OLD_DATA during concurrent read and write. 

 

One of the memory locations contains the 'write pointer' into a ring buffer; it is written by CPU A (actually updated every 125us) and polled by CPU B (along with a couple of other locations) in its idle loop. 

 

On one card the 'OLD_DATA' option doesn't seem to be working properly: CPU B reads 0x00015460 (lots of times), then 0x00015420 (invalid), followed by the valid 0x00015480 (I've added a double read & compare to detect the error). 
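For reference, the double read & compare guard is nothing more than this (a minimal sketch with made-up names, not the actual project code):

#include <stdint.h>

/* Re-read the shared ring 'write pointer' until two consecutive reads
 * agree, discarding any part-changed value latched during a concurrent
 * write. The pointer lives in the shared, uncached M9K block. */
static uint32_t read_write_ptr(volatile const uint32_t *wr_ptr)
{
    uint32_t a, b;

    do {
        a = *wr_ptr;
        b = *wr_ptr;
    } while (a != b);

    return a;
}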

 

So it looks as though the read has latched a part-changed value. 

Does anyone know whether 'OLD_DATA' actually works correctly, or do we have a more general timing error?
Altera_Forum
Honored Contributor II
699 Views

That's supposed to work with M9K blocks. Do you have the 2nd port running on a different clock frequency, or the read latency set to 2 cycles, by any chance? I wouldn't expect either to be a problem, but it might be something to check. 

 

The behavior you are seeing sounds more like a timing problem to me, especially if you have the read latency set to 1 on the 2nd port. With a read latency of 1, the output of the RAM is not registered and perhaps CPU 2 reading the data is not seeing a settled copy of the data. If this is happening then the timing report should be flagging this as a violation (assuming the constraints are set up properly). 

 

Speaking of proper constraints.... I forget which version of the tools had this problem, but there was a bug that caused data corruption in the on-chip memories due to an incorrect timing constraint being made. I can't remember if it was 11.0 or 11.1 that had that issue, but if you are using Quartus II without service packs you might want to grab a service pack, since that issue could result in what you are seeing.
Altera_Forum
Honored Contributor II
699 Views

Everything is running off the same 100MHz clock. 

I don't remember there being options in SOPC Builder for read latency - but it's been a while since I've used it - so any values are likely to be the defaults. 

 

This part of the fpga (it is a large Arria part) is a dual nios with 2 tightly coupled code blocks and 3 data blocks, booted from the PCIe slave -> Avalon bridge. I had intended to dual port the 'shared data' directly between two tightly coupled data ports, but CPU B doesn't make that many accesses (when not idle) and isn't that time/clock critical, so accessing via the Avalon bus lets me read it out over PCIe for debug/diagnostics. 

The device also contains a big TDM switch, an FFT block for tone detect etc. 

 

I'm also not sure which version of Quartus the hw guys use. I think they are frightened to change because something might break!
Altera_Forum
Honored Contributor II
699 Views

OK, if the memory was left at the defaults then it would probably be 1 cycle read latency. So the commands and write data will be pipelined into the memory, but the readdata out of the memory will not be pipelined. 

 

I dug around my email and found the memory corruption issue. Turns out it only affected DCFIFO at 11.1 (no service pack) so you shouldn't be running into this with the on-chip RAM component. 

 

I would have the hardware team take a look at the timing analysis reports to make sure they don't have failing paths from the on-chip memory over to CPU 2 (or from CPU 2 into the memory). If there are failing paths you could bump the read latency of that memory port to 2 to get some extra pipelining in the readdata path. The TCM connection would need to remain set to a read latency of 1, since that's the only latency Nios II supports with the tightly coupled connections. If the failing paths are into the memory then additional pipelining in the fabric will be needed; I would probably do this with a pipeline bridge so that you don't get pipeline stages all over the place.
Altera_Forum
Honored Contributor II
699 Views

Apparently the total design has quite a lot of timing errors (and I mean 1000s) for signals that cross between clock domains (not related to these nios cpus and their memory). Some of these signals are truly 'don't care' (e.g. LED outputs), some are probably synchronised - but the tools haven't been told - and others will be bugs waiting to strike! 

 

Marking about 7 clock pairs as 'false paths' (all related to a specific 96MHz clock generated from an external 8MHz input) lets the rest of the fpga be synthesised without any timing errors - and the 'failing' board passes the test it was failing on. 

 

The suspicion is that in attempting to meet these impossible timing constraints so much fpga real estate was used that other paths also failed.
Altera_Forum
Honored Contributor II
699 Views

What might be happening is that, without constraints, the fitter is putting most of its effort into paths that are failing due to a lack of constraints, and causing other valid paths to fail in the process. The fitter optimizes based on slack times, so if you had some paths that should be false paths and they have the worst slack times due to a lack of constraints, the fitter will work extra hard on those paths. So it could be that the fitter was working hard on improving the slack time to the LEDs instead of focusing on on-chip paths like those between Nios II and the dual-port RAM.

Altera_Forum
Honored Contributor II
699 Views

I think that is what I wanted to say :-)

Altera_Forum
Honored Contributor II
699 Views

After fixing a few clock domain crossing issues and removing all the 'red' timing paths, the rebuilt fpga image worked on the boards that were failing. 

 

However we've recently had some boards fail the production tests (which now include a test that is likely to show up this issue). Program the older fpga image and these boards work. 

 

The whole thing is very strange. I've added a memory test in the idle loop of both cpus (using a 32-bit value whose low 16 bits are the bit-reversed one's complement of the high bits, which increment). While this test gives occasional errors, the ring index fails 3-4 times as often, even though the read/write collision is much more likely. 

The error has to be associated with the memory write, otherwise the following read would return the 'old' data - and that just doesn't happen. 
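For reference, a sketch of that idle-loop test pattern (function names are made up for illustration): the high 16 bits hold an incrementing counter and the low 16 bits hold the bit-reversed one's complement of it, so any part-changed word fails the self-check.

#include <stdint.h>

/* Bit-reverse a 16-bit value (MSB <-> LSB). */
static uint16_t bit_reverse16(uint16_t v)
{
    uint16_t r = 0;
    for (int i = 0; i < 16; i++) {
        r = (uint16_t)((r << 1) | (v & 1u));
        v >>= 1;
    }
    return r;
}

/* Writer side: publish the next test word into the shared block. */
static void memtest_write(volatile uint32_t *slot, uint16_t counter)
{
    *slot = ((uint32_t)counter << 16) | bit_reverse16((uint16_t)~counter);
}

/* Reader side: returns 0 if the word read back is self-consistent. */
static int memtest_check(volatile const uint32_t *slot)
{
    uint32_t v  = *slot;
    uint16_t hi = (uint16_t)(v >> 16);
    uint16_t lo = (uint16_t)v;
    return (lo == bit_reverse16((uint16_t)~hi)) ? 0 : -1;
}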

 

I've tried many things to increase the error rate, but nothing seems to make a significant difference. The only other avalon slave (to that memory block) is the PCIe interface, but it won't be accessing that memory at all. Loading the PCIe slave makes no difference at all. 

 

The whole nios block (and quite a lot of other stuff) is running off the same 100MHz clock; there will be a bus width adapter and clock crossing bridge between the PCIe and any avalon slaves. 

We do have to put all the Avalon bus signals from the PCIe block through a lump of vhdl in order to get the correct address lines (otherwise the BAR becomes massive, not just 32MB), but that doesn't contain any logic - it just renames signals. 

 

About the only thing I haven't tried is accesses through the other 3 BARs. They go into different logic that is in the same SOPC build, but has no shared avalon master/slave parts. 

 

IIRC Fmax is a lot higher than 100MHz.
Altera_Forum
Honored Contributor II
699 Views

I just looked at this doc and the section "Mixed-Port RDW" suggests that you can't use a dual clocking scheme with mixed ports and the old data option all at the same time: http://www.altera.com/literature/ug/ug_ram_rom.pdf 

 

If that's really true then that's a bug with the on-chip memory component, and I'm not sure why Quartus allowed the compile to go through unless the memory component does something to the instantiation that makes it work.
Altera_Forum
Honored Contributor II
699 Views

Both ports to the memory should be on the same clock. The same 100MHz clock is fed to the nios (using it as a tightly coupled data memory), the other nios (accessing via the avalon bus) and the memory interface itself. 

 

I've just had a thought: the only other avalon master (to this particular memory block) is the PCIe slave. That will require a clock crossing bridge and a bus-width adapter. Will those be generated separately for each slave, or will the interface be converted once? 

Is there an easy way to tell? 

If the former, maybe we should add the explicit bridges? 

 

Plausibly having two separate sets of bridges might actually work best. 

One to map the top 16MB of the BAR to the SDRAM (optimised for burst transfers). 

The other to map the low 16MB to various on-chip resources (optimised for single word transfers; this could alias addresses at 256kB). 

 

I'm not sure we've really looked at any of the avalon bridges. 

 

If relevant, we are 'still' using Q9.1sp2; moving forwards is causing grief.
Altera_Forum
Honored Contributor II
699 Views

With SOPC Builder you will get clock crossing adapters and width adapters per master:slave pairing when they are needed. Clock crossing adapters can be spotted during generation time since you'll see messages like "clock_0" flying by the screen for each one. Finding the width adaptation logic is not so easy. 

 

So from the way you described the system, assuming you only have the second Nios II data master and the PCIe master hooked up to the second port of the on-chip RAM, I would only expect one clock crossing adapter and one width adapter to be placed between the PCIe and memory cores. 

 

Not related to the problem, but one thing you probably want to avoid is the additional bursting from the PCIe through the clock crossing adapter into the memory. Those automatic clock crossing adapters only let one beat through at a time, and each beat takes a few clock cycles to make its way through. The clock crossing bridge will do a better job because the data crosses domains using FIFOs, so you can fill the FIFO at a rate of one beat per clock cycle, assuming there is room in the FIFO. 

 

All of this is starting to jog my memory. I can't believe I managed to find it, but you could be running into this: http://www.altera.com/support/kdb/solutions/rd11022009_41.html 

 

I think the PCIe core performs bursts of 32 beats. So what you could do is set up a clock crossing bridge to be the width of the PCIe master and non-bursting. Place the clock crossing bridge between the PCIe core and the memory. What this will do is force the burst adapter to be placed between the bridge and the PCIe core, and the width adaptation between the bridge and the memory. Place the clock crossing bridge master on the same domain as your CPUs and memory, and the bridge slave on the same domain as the PCIe master.
Altera_Forum
Honored Contributor II
699 Views

I think I almost follow that! 

 

In our case the PCIe master doesn't usually access the internal memory block that is causing us issues. It does do single word cycles into a different M9K block (tightly coupled to the other cpu), and to a small avalon slave we use as an interrupt requestor (to the ppc at the other end of the PCIe link). 

The PCIe slave also does read/write to the SDRAM - these will be longer requests. 

 

Looks like it would make sense (in any case) for us to put an explicit non-bursting clock crossing bridge between the PCIe Avalon master and all the Avalon slaves except SDRAM. The SDRAM might benefit from a bursting bridge - since the software tries quite hard to do multi-word accesses to the SDRAM. 

 

At the moment we've seen errors on different cards with different fpga images. The problem I see is that any fpga rebuild is likely to change the resource allocation - and the whole problem looks like a marginal timing issue somewhere, so the new version only works because it doesn't use the part of the specific fpga that is close to the tolerance limits. 

 

I can, of course, detect the specific error we are seeing in software (by doing a re-read). But there are other shared locations where that would be much more difficult. Making the M9K data blocks not 'tightly coupled' (so they use the Avalon arbiter) would also slow the code down too much.
Altera_Forum
Honored Contributor II
699 Views

The SDRAM would benefit from having a clock crossing bridge between it and the PCIe master for sure (you would probably end up with a 4-8x throughput increase compared to relying on the asynchronous clock crossing adapters). 

 

In the case of the TCM problems the only other thing that comes to mind to check is whether or not your design meets timing at all three corners (slow 0C, slow 85C, and fast 0C). I assume that you are using this TCM to share data between the two processors and are using some sort of locking mechanism between them like a mutex.
Altera_Forum
Honored Contributor II
699 Views

The PCIe slave access times are dominated by the PCIe 'cycle' time; the per-byte times are fast enough for what we are doing. We are feeding the avalon clock into the hard PCIe block - not sure what type of clock crossing bridge that adds (that feature is removed in Quartus 12). 'Interestingly' the PCIe -> avalon bridge shows 32bit data until you try to add a pipeline bridge, when it suddenly becomes 64bit! 

 

The memory block we are seeing problems with is used to share data between the two processors. It could be TCM to both - but then we wouldn't be able to dump it out for debug. 

The locking is fine; the only (byte) location that can be written by both sides is covered by a lock (Dekker's algorithm). The lock is also used for the one place where two data items are updated together. 
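For anyone following along, a minimal sketch of a two-CPU Dekker lock of this kind (the layout and names are illustrative, not our actual code; it assumes both CPUs see the shared block uncached and that stores become visible in program order):

#include <stdint.h>

typedef struct {
    volatile uint8_t want[2];   /* want[i] != 0: CPU i wants the lock */
    volatile uint8_t turn;      /* whose turn it is to wait (0 or 1)  */
} dekker_lock_t;

static void dekker_lock(dekker_lock_t *l, int self)
{
    int other = 1 - self;

    l->want[self] = 1;
    while (l->want[other]) {
        if (l->turn != self) {
            l->want[self] = 0;          /* back off while it isn't our turn */
            while (l->turn != self)
                ;                       /* spin */
            l->want[self] = 1;
        }
    }
}

static void dekker_unlock(dekker_lock_t *l, int self)
{
    l->turn = (uint8_t)(1 - self);
    l->want[self] = 0;
}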

 

I don't do these fpga builds (I just write the software - I have done some builds for one of the dev cards), so I can't confirm we meet all the timing corner cases. However we did fix a few timing issues (there are no nasty red errors now) and that just seems to have changed which boards fail.
Altera_Forum
Honored Contributor II
699 Views

We may have found something.... 

Although the sopc builder lets you select OLD_DATA for the memory blocks, if you look deeply enough into the generated stuff the memory is marked 'dual clock' - even though the same 'avalon clock' is requested on the component. 

Dual clock operation can't do 'old data' (for good reason). 

Quartus 12 has an extra option on the memory block for 'single clock operation'; unfortunately we can't build with that without major changes. 

 

See: http://www.altera.com/support/kdb/solutions/rd04172006_685.html?gsa_pos=2&wt.oss_r=1&wt.oss=altsyncram%20parameters 

 

and: http://www.alteraforum.com/forum/showthread.php?t=3648 

Altera_Forum
Honored Contributor II
699 Views

That might explain why the Qsys on-chip memory component issues the following warning when in dual port, single clock mode:  

 

Info: onchip_memory2_0: Tightly Coupled Memory operation is not supported with s2 Avalon interface during single clock operation.  

 

If you have an easily reproducible case, perhaps manually editing the HDL to force it back into single clock mode would determine whether it's worth exploring a solution to that problem. If that solves it then it might be possible to create a custom tightly coupled memory; I can't remember which tool prevents this, but I seem to recall someone managed to get that working once.... I'm just not sure if it was in SOPC Builder or Qsys. How many memory locks do you need? Perhaps you could add a mutex component and use it as the locking mechanism instead, to avoid the simultaneous read-write to the same address. 

 

If you have a lot of memory locks, perhaps a secondary small single port memory could be used for your locking mechanism, but that would require regular data master connections and cache bypassing. Last but not least, another technique I like using for message passing (especially if it's frequent) is FIFOs, but that requires a pair if you want to move messages in both directions.
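To picture the FIFO idea in software terms, here is a minimal sketch (made-up names, not an existing Altera component; a hardware Avalon FIFO pair would do the same job in fabric with less polling) of a one-way channel in shared, uncached memory where each index has exactly one writer, so no location is ever written by both CPUs. It also assumes stores reach the shared RAM in program order.

#include <stdint.h>

#define MSG_SLOTS 16u                    /* must be a power of two */

typedef struct {
    volatile uint32_t head;              /* written by the producer CPU only */
    volatile uint32_t tail;              /* written by the consumer CPU only */
    volatile uint32_t slot[MSG_SLOTS];
} msg_chan_t;

/* Producer side: returns 0 on success, -1 if the channel is full. */
static int msg_put(msg_chan_t *c, uint32_t msg)
{
    uint32_t h = c->head;
    if (h - c->tail == MSG_SLOTS)
        return -1;
    c->slot[h % MSG_SLOTS] = msg;
    c->head = h + 1;                     /* publish after the payload write */
    return 0;
}

/* Consumer side: returns 0 on success, -1 if the channel is empty. */
static int msg_get(msg_chan_t *c, uint32_t *msg)
{
    uint32_t t = c->tail;
    if (t == c->head)
        return -1;
    *msg = c->slot[t % MSG_SLOTS];
    c->tail = t + 1;
    return 0;
}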
Altera_Forum
Honored Contributor II
699 Views

We do need TCM with OLD_DATA - doesn't matter if it is the s1 or s2 side (assuming that warning only applies to s2). 

We can do some experiments with qsys - but not on the real project. 

(The PCIe to avalon bus component isn't supported (or is different) - so that has to be changed.) 

 

Or does OLD_DATA enforce a read-latency of 2? 

 

Neither nios cpu has a data cache - the frequently accessed memory is tightly coupled. 

Changing the shared memory to use the Avalon interface (i.e. not dual ported to TCM) would also slow down the cycles too much. 

 

Adding a mutex isn't possible; the one cpu doesn't have enough free clock cycles to acquire it. 

For the one 'mutex' I do have, that cpu does a 'trylock()' action and takes a different path if the mutex can't be obtained. 

 

Earlier in the year I did a quick audit of the locations that might have issues; I remember that at least some of them were very problematic.
Altera_Forum
Honored Contributor II
699 Views

I asked around and a custom on-chip memory as tightly coupled memory is only possible in Qsys. 

 

I'm not sure if that message is specific to s2 or if it really should apply to both. Yesterday was the first time I've seen that message so I don't know what the story behind it is. 

 

It's been so long since I've parameterized a memory with the OLD_DATA parameter value that I'm not sure if it enforces a read latency of 2. I suspect it doesn't because, unless things have changed in the on-chip memory component, I recall the additional register that is added when you select 2 cycles actually lives outside of the on-chip RAM block (which gives the place and route engine more freedom to move it around). 

 

If the mutex component overhead is too high due to the lack of a data cache, perhaps a single cycle custom instruction shared by both CPUs, implementing a faster lock, would do the trick. It could actually be 0 cycles, but that might hinder the CPU fmax.
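From the software side it could look something like the sketch below. This is purely illustrative: the opcode number, the operand encoding (lock id in dataa, operation in datab) and the return convention are all assumptions for a hypothetical lock block; only the __builtin_custom_inii intrinsic itself is a standard Nios II GCC built-in.

#define LOCK_CI_N   2                     /* custom instruction opcode (assumed) */
#define LOCK_OP_TRY 1                     /* try to acquire the lock             */
#define LOCK_OP_REL 0                     /* release the lock                    */

/* Returns non-zero if the lock identified by lock_id was obtained. */
static inline int hw_trylock(int lock_id)
{
    return __builtin_custom_inii(LOCK_CI_N, lock_id, LOCK_OP_TRY);
}

static inline void hw_unlock(int lock_id)
{
    (void)__builtin_custom_inii(LOCK_CI_N, lock_id, LOCK_OP_REL);
}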
Altera_Forum
Honored Contributor II
699 Views

We've managed to hack the files generated by SOPC Builder to remove the second clock; the image builds, but there is something not quite right somewhere. 

The TCM is on s2, we could switch to s1 if it might make a difference. 

 

The problem with any mutex is that the cpu with the TCM doesn't have any spare clock cycles to waste waiting for the mutex to be available - or really to test it either. 

 

The cpu is doing hdlc transmit (entirely in software) and has about 190 clocks to process a receive and transmit byte. The tx side only needs to look at shared data when looking for a new frame - so it can send an extra flag. The rx side has bigger problems - since it can't stall, although it doesn't need to look at any sequence numbers.
Altera_Forum
Honored Contributor II
699 Views

With the TCM set for 'single clock' we see occasional memory read errors. 

The code snippet below calculates the transmit CRC16 (the custom instruction is combinatorial and updates the crc for a new byte). 

If 'tx_src' points into the TCM both versions work. 

If 'tx_src' points into SDRAM (no data cache) the '-' version generates correct TX data, but a consistently invalid crc. 

 

  ldbu    r6, 0(r7)          # * tx_src, s
- ldhu    r11, 42(r8)        # <variable>.hdlc_tx_crc, crc
  addi    r7, r7, 1          # , tx_src, tx_src
+ ldhu    r11, 42(r8)        # <variable>.hdlc_tx_crc, crc
  custom  1, r3, r6, r11     # , <anonymous>, s, crc
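A rough C-level equivalent of that listing, for readability only; it assumes the Nios II GCC __builtin_custom_inii intrinsic and the operand order shown above (byte in dataa, current crc in datab), and the function name is made up.

#include <stdint.h>

/* One step of the transmit CRC16 update: load the next byte and feed
 * (byte, crc) to the combinatorial custom instruction, opcode 1. */
static uint16_t hdlc_tx_crc_update(const uint8_t *tx_src, uint16_t crc)
{
    uint8_t s = *tx_src;                                /* ldbu   r6, 0(r7)       */
    return (uint16_t)__builtin_custom_inii(1, s, crc);  /* custom 1, r3, r6, r11  */
}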
Altera_Forum
Honored Contributor II
629 Views

We've switched s1 and s2 - so that s1 is the TCM and s2 the avalon bus. 

 

With the second clock removed we definitely see latched 'old data' (with signal tap), but some of the avalon cycles are returning corrupt data. 

 

These errors seem to be related to the cpu that has the memory as TCM accessing avalon slaves! The EN2 signal is being driven by "*/*/HDLC_CPU_*stall*" - which looks like a signal for the wrong cpu. 

Something is very cross wired! 

In 'dual clock mode' the EN2 signal is pulled high. 
