I use a Cyclone 10 GX FPGA (10CX150YF780C5G) on my PCIe card. One transceiver is used as a 10 Gbps fiber link and four other transceivers are used as PCIe Gen1 x4 lanes. There is no external memory interface. The total power consumption is about 2 W.
The problem is that when the FPGA's die temperature rises above about 55 °C, the transceiver used for the fiber link (SFP) shows missing-word errors on the transmit side. I use the transceiver loopback function to monitor the data at both the remote receiver output and the local loopback receiver output. They show exactly the same fault, which suggests that something is wrong in the PCS stage on the transmitter side. The reference clock to the transceiver ATX PLL is 644.53125 MHz. Other functions in the FPGA are working well.
When I lower the temperature by adding a fan to blow on it, it works normally. Does anyone know how to fix it?
Thanks a lot!
To better understand the problem: do you mean that if you enable the internal serial loopback (without using a cable), the same problem is observed when the temperature rises above 55 °C, while PCIe and the other functions still work well?
How many devices have a similar problem? Could you please try to test on another channel to determine if there is any channel dependency on this particular device?
You are right. Using the internal loopback (from the transmitter's PCS output to the input of the receiver's PCS) at the XGMII interface of the transceiver, we get wrong data (mainly a lost 64-bit word in a packet) when the operating temperature rises to 55 °C. Since the data is still sent to the remote receiver over the fiber link while loopback is enabled, we can see the same loss of data at the remote receiver output as at the loopback output.
For the design, the Timing Analyzer result is clean, with no timing violations.
We have five boards in total. They fail at different temperatures: the lowest is 55 °C and the highest is 72 °C, none reaching the specified 100 °C. In the board design we use only one transceiver as a 10 Gbps fiber link, so it may not be easy to switch to another transceiver in the FPGA to test other channels.
Thanks. Looking forward to your further help.
If the problem can be seen with serial loopback enabled, you should be able to test on other channels as well, since loopback does not use the PHY channel (fiber cable); you only need to change the pin assignment.
Do you have a Signal Tap capture that shows the pass and fail conditions?
Can you provide a simplified design consisting of only one 10G channel that replicates the issue on your hardware, so that I can better understand the settings in the transceiver and 10G MAC IP?
Besides, please ensure your board design meets the Pin Assignment Guideline, especially for the power supply and transceiver pins.
The board works well below 55 °C, so I think the PCB design should be OK, and the pin assignments do not appear to have any problem so far.
For your further review, I can archive my FPGA design and email it to you instead of posting it here. Please give me an email address. Does that sound right?
Thanks a lot for your help.
Please refer to the following link (table 1), and check the GXB power supplies (e.g. Vcct_GXB, Vccr_GXB, and Vcch_GXB) to determine whether there is any difference between a pass and a fail case. The FPGA is supposed to work above 55 °C, and since you encounter this on multiple boards, a board issue is suspected.
Besides, you can send me a private message with the simple design so that I can do a sanity check on it. Thanks.
Since this problem can be replicated with internal loopback, can you please change the pin assignments as below and determine whether there is any channel dependency?
set_location_assignment PIN_AF25 -to "SFP_RXD0(n)"
set_location_assignment PIN_AG27 -to "SFP_TXD0(n)"
set_location_assignment PIN_AF26 -to SFP_RXD0
set_location_assignment PIN_AG28 -to SFP_TXD0
Besides, please refer to the attached screenshot, add the interface signals of the Native PHY to Signal Tap, and then compare "tx_parallel_data" and "rx_parallel_data" to determine whether there is a mismatch. The data drop may happen before this module.
We always test the data at these points; we usually call them the XGMII interface. The only difference is a logical conversion of words from big endian to little endian between the Native PHY parallel interface and XGMII. The data at XGMII shows OK.
I will add "tx_parallel_data" and "rx_parallel_data" to my Signal Tap to see if there is any difference.
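Once both buses are captured, the comparison could be done offline rather than by eye. The following is only a rough sketch in Python, assuming the "tx_parallel_data" and "rx_parallel_data" samples have been exported from Signal Tap into two lists of 64-bit integers that are already aligned to the start of the same packet and already in the same byte order. The function names `first_mismatch` and `classify_capture` are hypothetical helpers, not part of any Intel tool.

```python
from typing import List, Optional, Tuple


def first_mismatch(tx: List[int], rx: List[int]) -> Optional[int]:
    """Return the index of the first differing word, or None if identical."""
    for i, (t, r) in enumerate(zip(tx, rx)):
        if t != r:
            return i
    # No differing pair found; a length difference is still a mismatch.
    if len(tx) != len(rx):
        return min(len(tx), len(rx))
    return None


def classify_capture(tx: List[int], rx: List[int]) -> Tuple[str, Optional[int]]:
    """Classify a capture as 'match', 'dropped_word' or 'corrupted'.

    Assumption: both lists start at the same packet boundary, so a single
    missing word shows up as rx continuing with tx shifted by one word.
    """
    i = first_mismatch(tx, rx)
    if i is None:
        return ("match", None)
    tail = rx[i:]
    # One word dropped at index i: the rest of rx equals tx from i + 1 on.
    if i < len(tx) and tx[i + 1:i + 1 + len(tail)] == tail:
        return ("dropped_word", i)
    return ("corrupted", i)


if __name__ == "__main__":
    # Made-up example words, not real capture data.
    tx_words = [0x1111, 0x2222, 0x3333, 0x4444, 0x5555]
    rx_words = [0x1111, 0x2222, 0x4444, 0x5555]
    print(classify_capture(tx_words, rx_words))
```

If the result is "dropped_word", the index points at the word in the transmit stream that went missing, which may help to correlate the failure with a position inside the packet. A "corrupted" result would suggest bit errors rather than a lost word.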