Our company makes products based on a board we designed that carries two Stratix-V parts. Several of our FPGA designs already run on this board. We recently changed one of those designs and added some features, and found that features which had worked before stopped working. The strange part is that the failures move around: on one build a particular feature stops working, and on a later build that feature works again while a different feature stops working. Even a given build sometimes behaves differently from one FPGA or board to another.
It is a fairly large, complex design that can use up to 90% of the ALMs and 85% of the M20K blocks. We tried removing some existing features to see whether the remaining features would work reliably in a less dense design, but the results were still sporadic from one build to the next. It is rare for us to get a build in which all features work.
We believe our design has proper timing constraints, though they are all open to review. Our builds frequently, but not always, achieve timing closure, and we only test builds that do.
Sorry for the delay. Here are more details.
The two Stratix-V parts on our board are each 5SGXEA7H2F35C3.
The board was designed symmetrically, in the sense that the same FPGA build file can be loaded into both FPGAs for fully symmetric functionality.
Between the two FPGAs there are four serial lines each operating an independent Interlaken link running at 12.4 Gbps.
Each FPGA receives the four Interlaken signals sent by the other. There is no channel bonding.
The GX transceivers running Interlaken are instantiated using the Native PHY IP core.
Each Stratix-V part also uses some GX transceivers to interface to SFP+ ports on the board.
On each FPGA two of those ports operate in 10G Ethernet mode.
The GX transceivers for those Ethernet ports are instantiated using a single 10GBASE-R PHY IP core configured for 2 channels.
For each Ethernet port an "Ethernet 10G MAC" IP core is also instantiated.
For the Interlaken links, on some builds, one of the four links reports a failure to achieve 64b/67b PCS block lock.
So far the remaining three links operate correctly.
For a given build, on some boards both directions of the link report the failure. On other boards there is no failure.
In rare cases we had a board on which one direction works while the other does not.
On a different build, all four links might work correctly.
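Since the failures vary by build, board, and link, it may help to tally test results along each of those axes to see whether a failure follows the build or the board. Below is a minimal Python sketch of such a tally. The record fields (`build`, `board`, `link`, `block_lock`) and the function name `failure_rates` are hypothetical, and the records are placeholders, not measurements from our boards.

```python
from collections import defaultdict


def failure_rates(records, key):
    """Group test records by `key` and return {key value: fraction that failed}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        if not record["block_lock"]:
            failures[record[key]] += 1
    return {value: failures[value] / totals[value] for value in totals}


# Placeholder records only; replace with real per-test observations.
records = [
    {"build": "A", "board": 1, "link": 0, "block_lock": True},
    {"build": "A", "board": 2, "link": 0, "block_lock": False},
    {"build": "B", "board": 1, "link": 0, "block_lock": True},
    {"build": "B", "board": 2, "link": 0, "block_lock": True},
]

by_build = failure_rates(records, "build")
by_board = failure_rates(records, "board")
print(by_build)
print(by_board)
```

A failure rate that is concentrated on particular builds would point toward placement or timing, while one concentrated on particular boards would point toward the hardware.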
For some builds we saw a strange phenomenon. The link reports block lock, but when we pass a certain payload in the Interlaken metaframes, the CDR on the receive side fails to retain lock to data.
When the payload is removed, the CDR locks again. With a different payload the CDR also locks fine.
When we pass the payload that causes the trouble on that one link through a different Interlaken link, the CDR on that other link has no trouble keeping lock.
Again, on other builds this may not occur.
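A CDR needs bit transitions to hold lock, so one thing that could be checked for the payload-dependent case is whether the bits actually sent on the line have unusually long runs of identical bits or a large imbalance of ones and zeros. Interlaken scrambles the payload before transmission, so the relevant bits are the words after the scrambler, not the raw payload. Below is a minimal Python sketch, assuming those 67-bit words can be captured as integers. The function names and the MSB-first bit order are assumptions made for illustration and would need to be matched to the actual transmit order.

```python
def word_to_bits(word, width=67):
    """Expand an integer word into a list of bits, MSB first (assumed order)."""
    return [(word >> i) & 1 for i in reversed(range(width))]


def longest_run(bits):
    """Return the length of the longest run of identical consecutive bits."""
    longest = 0
    current = 0
    previous = None
    for bit in bits:
        current = current + 1 if bit == previous else 1
        previous = bit
        longest = max(longest, current)
    return longest


def disparity(bits):
    """Return the number of ones minus the number of zeros."""
    ones = sum(bits)
    return ones - (len(bits) - ones)


def analyze(words, width=67):
    """Concatenate captured words into one bit stream and summarize it."""
    stream = []
    for word in words:
        stream.extend(word_to_bits(word, width))
    return {"longest_run": longest_run(stream), "disparity": disparity(stream)}
```

Running `analyze` on a capture taken with the troublesome payload and on one taken with a payload that works would show whether the two streams differ in run length or balance.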
10G Ethernet trouble:
Our design generates Ethernet frames for transmission and then forwards them through the Avalon interface into the MAC core which then forwards frames through the XGMII interface to the 10GBASE-R transceiver core.
We also support the reverse direction of receiving frames through the transceiver core and through the MAC core.
On some builds we pass frames into the MAC core as usual and the Avalon txstatus ports indicate transmission with no errors, yet the receive side of the Ethernet link does not indicate any frames. The Ethernet link is reported as up.
We sometimes see frames go out on one Ethernet port and not the other. The failing port could change from one build to another.
On some builds there are no transmit problems. On some boards and some builds we have receive problems: the Ethernet frames are transmitted successfully and the Ethernet link is up, but the Avalon receive port does not indicate any frames.
When we experience similar problems, the cause often turns out to be incorrect timing constraints, but you are already investigating that. You could also look at power issues. We once ran into similar problems on an Arria 10 when we added a computationally expensive feature: sometimes it worked and sometimes it did not. It turned out that as the chips heated up, their power consumption increased and the power supply was not good enough. Better cooling solved the issue. So I would recommend investigating temperature and power quality.
Thank you for the reply. As you say large designs do require more power and we are checking into that as a possible cause. It is in our pipeline to get a power source that supports higher current to see whether it makes a difference. As it is we don't see dips in the core voltage when we put a scope on it. And the temperature of the part does not measure as excessive. We do have heat sinks.
I see you write several times about the core voltage, but you describe trouble with I/O interfaces. Did you also check the other voltage rails? Transceivers can behave strangely if their supply voltage is not correct, particularly during calibration and reconfiguration.
As far as we know our timing constraints are correct. And we only test builds with timing closure. The core voltage seems stable when we measure it on a scope and temperature does not read as excessive.
Yesterday I posted more details. I am not sure if you saw them since my posts seem displayed out of order on the web page. Based on those details do you have any more specific suggestions on what to review?
We can have an expert review our constraints if that is possible.