We have four instances of the SerialLite II IP core in our Stratix IV GX device.
It worked fine when we used it in streaming mode. After we changed it to packet mode, a sporadic problem started to show up. When the problem occurs, the word order within the packet is scrambled, and sometimes the downstream device receives two EOPs for every SOP.
It does not happen all the time, and it happens only with particular compilations. For example, after a recompile, the problem can jump from instance one to instance three, or it can disappear altogether. Using SignalTap, we have verified that the packets going into SerialLite are fine when the problem happens. My guess is that this is a classic case of poor design practice: a timing failure or a clock domain crossing problem. Since the IP is encrypted, there is no way for me to look into it and debug. Looking in SignalTap, it seems some odd 8x83 FIFOs are used, and I suspect that is where the problem is. Again, I can't look further without the source code. One more piece of information: the first run is always clean after a reset is issued to SerialLite.
How do you suggest I go forward to troubleshoot this?
Thank you
Hi,
As I understand it, you are observing an intermittent issue with the SLII IP where the RX receives two EOPs for every SOP after you switched from streaming mode to packet mode. This seems to depend on the Quartus build: some builds exhibit the issue and some do not. The failing instance also seems to change randomly from build to build.
Based on this observation, this is trending towards a potential timing problem. To narrow down the issue further, would you mind running a ModelSim simulation of your design to rule out a functional issue before we debug timing?
Also, I recommend creating a simple test design, i.e. one or two instances with a single lane, to facilitate the debugging process.
Do you observe any timing violations or anomalies in the failing compilations?
Please let me know if there is any concern. Thank you.
Best regards,
Chee Pin
First, in my experience, simulation is a good tool for catching logic issues but a very poor way to debug timing issues, especially if the problem is between two different clocks. However, I have run simulation and did not see any problem. Running a simulation of a high-speed serial link for a duration long enough for the problem to show up is not very realistic. Sometimes simulation takes longer than a compile before it shows meaningful results. It is actually easier to put in SignalTap to debug the problem.
Your suggestion to simplify the design does not work either. As I said, every compilation of the same design gets different results, and in most cases the problem goes away, so a simplified design won't reproduce the problem. The problem also only shows up in our system, which comprises multiple boards, FPGAs and software, when running some special cases. A simplified design won't trigger the issue.
In the compilations that reproduced the problem, there are no timing violations and no unconstrained clocks, and everything looks normal.
The biggest obstacle to debugging is that the source code is encrypted. I can put some signals on SignalTap, but I don't know the logic around those signals. And again, what makes it harder is that most times, after I change the SignalTap setup, the problem goes away.
Thanks,
Hi,
Thanks for your update. You are right, the main purpose of running the simulation here is to rule out a potential functional issue and so narrow down the problem. Glad to hear that there is no issue in simulation.
Thanks for sharing that the problem goes away after you change the SignalTap logic. This is trending toward a potential timing issue. I would just like to clarify the following:
1. When you mentioned two EOPs for every SOP, I believe you are observing this in SignalTap. Is my understanding correct?
2. When the issue occurs in SignalTap for a specific lane, do you observe a failure in that SL lane as well? This is to isolate whether it is a SignalTap-only issue or an SL lane issue, for example a timing issue that causes SignalTap sampling errors.
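The SOP/EOP pairing asked about in question 1 can also be checked mechanically on an exported capture rather than by eye. The following is a minimal sketch in Python, assuming the capture has already been reduced to one record per valid transfer on the Rx interface. The `Beat` record and the `check_framing` function are hypothetical names made up for illustration; they are not part of SignalTap or the SerialLite II core.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Beat:
    """One valid transfer sampled on the Rx interface (hypothetical layout)."""
    sop: bool   # start-of-packet flag
    eop: bool   # end-of-packet flag
    data: int   # 64-bit data word


def check_framing(beats: Iterable[Beat]) -> List[Tuple[int, str]]:
    """Return (beat index, message) for every SOP/EOP pairing violation."""
    errors: List[Tuple[int, str]] = []
    in_packet = False
    for i, beat in enumerate(beats):
        if beat.sop:
            if in_packet:
                errors.append((i, "SOP before the previous packet's EOP"))
            in_packet = True
        elif not in_packet:
            # Neither inside a packet nor starting one.
            if beat.eop:
                errors.append((i, "EOP without a matching SOP"))
            else:
                errors.append((i, "data outside a packet"))
        if beat.eop:
            in_packet = False
    return errors


if __name__ == "__main__":
    # One SOP followed by two EOPs, the symptom described in this thread.
    capture = [Beat(True, False, 0x1), Beat(False, True, 0x2), Beat(False, True, 0x3)]
    for index, message in check_framing(capture):
        print(f"beat {index}: {message}")
```

A packet that is still open when the capture window ends is deliberately not reported, since a truncated capture is not by itself a framing error.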
Please let me know if there is any concern. Thank you.
Best regards,
Chee Pin
Chee Pin,
Thanks for the reply. I will answer your questions one by one.
1. Yes, but in the downstream device. Let's call the device that sources the traffic, and that we suspect is causing the problem, FPGA1, and the downstream device that receives all the packets over SerialLite FPGA2. We put SignalTap in FPGA2 at the Rx Atlantic interface coming out of SerialLite and saw that the words are out of order within the packets and that SOP and EOP do not match each other. We also put SignalTap in FPGA1 at the Tx Atlantic interface input, right before SerialLite, and did not see any problem there while problems were being observed in SignalTap in FPGA2. The reasons we believe the problem is in the FPGA1 Tx are that the problem goes away for the first run after we reset SerialLite in FPGA1, and FPGA1 only, and that the problem comes and goes only when we recompile FPGA1, not FPGA2. We recompiled FPGA2 many times and nothing changed.
2. This is a bonded 4-lane link, and the overall link status signal "stat_rr_link" never goes down, even when the problem shows up. We never put SignalTap at the lane level since it would not be of much use to us. Again, SerialLite is encrypted and is a black box to us. We can pull signals out in SignalTap, but we don't know what we are looking at without the source code. I would assume byte alignment, word alignment and lane bonding all work, since the words are out of order on 64-bit word boundaries, not on individual bits or bytes. Plus, the link never goes down. In SignalTap in both FPGAs, we use the clock of the Atlantic interface to sample the data, so the signals should be captured on the right clock. There is no setup violation in the timing report and no unconstrained clock.
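The comparison described above, between the Tx capture in FPGA1 and the Rx capture in FPGA2, can be scripted so that each packet is classified as intact, reordered on word boundaries, or altered in content. This is only a sketch under assumptions: it presumes each packet has already been extracted from the two captures as a list of 64-bit words, and `classify_packet` is a hypothetical helper name, not a tool shipped with Quartus.

```python
from collections import Counter
from typing import List, Optional, Tuple


def classify_packet(tx_words: List[int], rx_words: List[int]) -> Tuple[str, Optional[int]]:
    """Compare one transmitted packet with its received counterpart.

    Returns a (verdict, index) pair where verdict is one of:
      "match"            - identical word sequence
      "reordered"        - same words, different order
      "content_mismatch" - words were lost, added, or changed
    and index is the first word position that differs (None for a match).
    """
    if tx_words == rx_words:
        return ("match", None)

    # First position where the sequences disagree; if one is a prefix of
    # the other, that is the length of the shorter sequence.
    first_diff = next(
        (i for i, (t, r) in enumerate(zip(tx_words, rx_words)) if t != r),
        min(len(tx_words), len(rx_words)),
    )

    # Equal multisets of words means only the order changed.
    if Counter(tx_words) == Counter(rx_words):
        return ("reordered", first_diff)
    return ("content_mismatch", first_diff)


if __name__ == "__main__":
    print(classify_packet([0xA, 0xB, 0xC, 0xD], [0xA, 0xC, 0xB, 0xD]))
```

A run that reports only "reordered" verdicts would be consistent with the observation that whole 64-bit words are displaced while the words themselves arrive undamaged.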
Thanks
Chee Pin,
If you would like, I can send you some SignalTap waveforms, but not on this forum, because of IP and security concerns.
Thanks,
Hi,
Sorry for the delay, and thanks for your update. I understand that there are two FPGAs and that you suspect the issue is caused by the upstream FPGA1. When you recompile the FPGA1 design, the error may go away. There are 4 bonded lanes in the design.
Before we engage our timing team, would you mind helping to test the following:
1. With the failing build, can you perform a loopback from FPGA1's TX back to its own RX to see if the issue occurs? This would help narrow the problem down to FPGA1 only. If there is an issue in the SL IP on the FPGA1 TX side, its own RX should see errors similar to those seen by FPGA2's RX.
2. You can try both internal serial loopback and external loopback to see if there is any difference.
3. If there is no issue with internal or external loopback, you might want to check for any trace length mismatch in the connection between FPGA1 and FPGA2, to rule out a connection issue.
Please let me know if there is any concern. Thank you.
Again, let me answer your questions one by one.
1. This is really hard to do. We first observed this problem in our system, with multiple boards in the chain in different cabinets. It is a duplex link, but we only use it in one direction, FPGA1 to FPGA2. The end result was that the RF signal came out wrong. Once we realized there was a problem, we put SignalTap at each stage and came to suspect FPGA1. I could put a loopback at the FPGA1 output, but I could only test it by putting SignalTap on the FPGA1 Rx. Since we don't use this direction, that is quite a big change, and again, every time we make changes to the design, or even just recompile with no change, the problem goes away. Second, if we put FPGA1 in loopback, FPGA2 won't detect a link, and the system software will not run and create packets. We would have to change the software to make the test. That is quite an exercise.
2. Same reason as 1. I also really doubt it is the PHY, the transceivers or the high-speed serial link itself, since our system software monitors error signals such as err_rr_8berrdet, err_rr_disp, err_rr_pol_rev_required, overflow and underflow all the time and keeps counters of how many times they occur. The error counters stay at zero. I am convinced that the problem is not in the link layer but somewhere above it.
3. Again, I doubt it is a problem in the link layer or below. One more piece of information: at the beginning, we used streaming mode for SerialLite and never saw this problem. About one year ago, we changed this link to packet mode, and the problem started to show up. That further indicates it is something in the SerialLite core.
Thank you
Hi,
Thanks for your update. I understand the effort required to create a test design to help narrow down the potential root causes. With the whole system hooked up, it is rather difficult to debug and narrow down the problem. This is why I am suggesting that we perform a loopback at FPGA1, where you suspect the issue is coming from. To avoid affecting other components in the system, would it be possible for you to create a simple duplex SL II design in FPGA1 and perform a loopback to see if the issue appears? If we can replicate it this way, it would help debugging, since we would have narrowed the problem down to FPGA1 and a single SLII IP core. I understand this might require some effort.
On the other hand, to avoid any further delay, I suggest we engage our timing team to look into this and advise whether there is any anomaly from a timing perspective. Since I am unable to reproduce the case here, would you mind opening a new Forum case with a title specific to timing, e.g. "Timing debugging required on Quartus build dependent packet issue"? You may then briefly describe your observation that the behavior varies from build to build and mention that timing analysis debugging is required. Please then let me know the case so that I can help route it to our timing team.
Please let me know if there is any concern. Thank you.
Best regards,
Chee Pin
Hi,
I would just like to follow up and ask whether you have had a chance to open a new case to request the timing team's assistance. Please feel free to let me know the case so that I can help expedite the routing. Thank you.
We shall continue to support you in the new case. This thread will be transitioned to community support. If you have a new question, feel free to open a new thread to get support from Intel experts. Otherwise, community users will continue to help you on this thread. Thank you.