Nios® II Embedded Design Suite (EDS)

Altera Cyclone V: UDP data transfer without packet loss

Honored Contributor II

Dear all, 


I am new to the Altera Cyclone V SoC. 


We are conducting a feasibility study on the Altera Cyclone V SoC. 

Streaming data arrives at the FPGA at 300 megabits per second. 

We have to transfer this data to a server machine over a Wi-Fi connection without any packet loss. 

The UDP client application runs on the Altera SoC and the server application runs on an Ubuntu desktop machine. 


Initially we tried a sample UDP client and UDP server application that sends a hardcoded buffer of data [from user space only, not data read from the FPGA via kernel space] to measure the maximum data rate over Wi-Fi. 

We measured between 250 Mbps and 500 Mbps (without tuning, and after tuning some network parameters as suggested in a link). 


We are running Angstrom Linux supplied by Altera (Linux kernel version 4.9). 


We have already tried two methods. 


Method 1:- The FPGA is programmed to write the streaming data to DDR memory. 

We wrote a character driver and tried an interrupt-driven approach (one interrupt every 1 ms). 

From the ISR we wake up the read() method of our character driver, which copies the 30 KB buffer available in DDR memory every 1 ms to user space (copy_to_user()). 

From the user application we use sendto() for the UDP transfer. We found that packets periodically get lost. 
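For reference, the user-space side of Method 1 could look roughly like the sketch below. The helper name, device handling, and buffer size are our own illustration, not the actual code from this project. Note that a short sendto() return value means the datagram was dropped locally before ever reaching the network:

```c
#include <arpa/inet.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Forward one fixed-size buffer from the character device fd `dev`
 * to the UDP socket `sock`, addressed to `dst`.  Returns what
 * sendto() returned, or read()'s result if the read failed.
 * A return value smaller than the bytes read means the packet was
 * discarded locally (e.g. the socket send buffer was full). */
ssize_t forward_once(int dev, int sock, const struct sockaddr_in *dst,
                     char *buf, size_t len)
{
    ssize_t n = read(dev, buf, len);   /* blocks until the ISR wakes us */
    if (n <= 0)
        return n;                      /* device error or end of stream */
    return sendto(sock, buf, (size_t)n, 0,
                  (const struct sockaddr *)dst, sizeof *dst);
}
```

In the real application this would run in a loop, with the character driver's read() sleeping on a wait queue until the 1 ms interrupt fires.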


Method 2:- The FPGA is programmed to write the streaming data every 1 ms to a reserved 1 MB region of DDR memory, and a register tracks the number of packets already written (clear-on-read). 

A character driver is written to access the DDR memory. 

Every 10 ms the user application issues an IOCTL to read the register holding the number of packets available, then copies the available packets to user space using read() (the copy_to_user() mechanism on the kernel side). 

The received buffer is split into chunks of one packet (30 KB each, sent via UDP in a for loop using sendto()). We found that packets periodically get lost. 
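The chunked-send loop of Method 2 can be sketched as below (our illustration; the helper name and error handling are assumptions, not the project's code):

```c
#include <arpa/inet.h>
#include <string.h>
#include <sys/socket.h>

#define PKT_SIZE (30 * 1024)   /* one 30 KB packet per sendto() call */

/* Split `total` bytes in `buf` into PKT_SIZE datagrams and send each
 * one.  Returns 0 on success, -1 if a datagram was dropped locally
 * (sendto() failed or sent fewer bytes than requested). */
int send_chunks(int sock, const struct sockaddr_in *dst,
                const char *buf, size_t total)
{
    size_t off = 0;
    while (off < total) {
        size_t len = total - off < PKT_SIZE ? total - off : PKT_SIZE;
        ssize_t sent = sendto(sock, buf + off, len, 0,
                              (const struct sockaddr *)dst, sizeof *dst);
        if (sent < 0 || (size_t)sent != len)
            return -1;   /* dropped before ever reaching the air */
        off += len;
    }
    return 0;
}
```

Checking the return value of every sendto() here distinguishes local drops (full socket send buffer) from losses on the Wi-Fi link or at the receiver.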


Again, the packets were lost at the server side. 


Please share your valuable thoughts on this. Are we going in the right direction? 

Are there better methods to attack this problem? Would zero-copy / mmap()-ing the kernel buffer into user space solve this issue? 

Or are there techniques to avoid the repeated overhead of kernel-space/user-space transitions so that the data can be sent directly via UDP? 


Please help; we are almost blocked. 


Looking forward to hearing from you soon. 

Thank you, 

Honored Contributor II

You have to expect some packet loss, especially over Wi-Fi. For your specific problem, the first thing is to figure out whether the lost packet was actually sent at all. If the UDP transmit buffer for your socket becomes full, the IP stack on the sender can decide to throw the data away instead of sending it. I think that in that case sendto() will report a count lower than expected. 

The next possibility is that an error occurred in the packet data while it was being sent over the Wi-Fi link. In that case there is nothing to do: packet loss occurs due to noise on the channel. 

The last possibility is that the packet gets discarded on the receiving end, if the UDP receive buffer is full. This can happen for example if packets arrive in bursts and the application isn't fast enough to read them. 
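For the receive-side case, one concrete mitigation (a sketch of our own, not from the thread) is to enlarge the socket's receive buffer so short bursts are queued rather than dropped; on Linux the granted size is capped by net.core.rmem_max:

```c
#include <sys/socket.h>

/* Ask the kernel for a larger UDP receive buffer.  On Linux the
 * requested value is capped by net.core.rmem_max (raise it with
 * sysctl for multi-megabyte buffers), and getsockopt() reports
 * roughly double the requested value because the kernel reserves
 * extra space for bookkeeping.  Returns the granted size or -1. */
int grow_rcvbuf(int sock, int bytes)
{
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF,
                   &bytes, sizeof bytes) < 0)
        return -1;
    socklen_t len = sizeof bytes;
    if (getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bytes, &len) < 0)
        return -1;
    return bytes;   /* the size actually granted */
}
```

The receiving application on the Ubuntu side would call this right after creating its socket, before the burst arrives.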


The UDP protocol is unreliable. If you need every packet delivered, you need to implement a higher-level retransmission scheme: for example, give each packet a sequence number and have the receiver send feedback packets listing the missing sequence numbers, which the transmitter then resends. This requires a lot of changes, of course, especially because the sender must keep the data around for a while even after it has been transmitted. Alternatively you can use TCP, which does all this and more, but for high-bandwidth applications with a few lost packets its congestion management can cause more problems than it solves.
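The receiver's half of that sequence-number scheme can be sketched as follows (our illustration; the helper shows only gap detection, not the feedback packet or the sender's retransmit queue):

```c
#include <stddef.h>
#include <stdint.h>

/* Scan the sequence numbers of received datagrams (in the order the
 * receiver saw them, assumed increasing) and collect the missing
 * ones into `missing`, up to `cap` entries.  Returns how many gaps
 * were found; the receiver would send this list back to the sender
 * so those packets can be retransmitted. */
size_t find_missing(const uint32_t *seen, size_t n,
                    uint32_t *missing, size_t cap)
{
    size_t k = 0;
    for (size_t i = 1; i < n; i++)
        for (uint32_t s = seen[i - 1] + 1; s < seen[i] && k < cap; s++)
            missing[k++] = s;
    return k;
}
```

The sender must hold on to each transmitted packet until the receiver acknowledges it, which is where most of the extra complexity lives.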
Honored Contributor II

Hi Daixiwen, 


Thank you for your valuable response. We understand your points about the transmission side. 


A slight update based on our latest analysis results.  


We ran a long test with the data-transmission portion commented out. That is, a user thread with a 10 ms period calls read() [copy_to_user() on the kernel side] for a fixed number of bytes. 

We checked the circular buffer usage during this test and found that circular buffer overflows occur. 

So the first reason for our packet loss is that we cannot copy the data out as fast as it is generated. We need to understand whether copy_to_user() is the real culprit. 


To confirm this, we ran another overnight test measuring the time taken by the read() API with gettimeofday(), along the following lines: 





```c
gettimeofday(&t0, NULL);
read(fd, buf, BUF_SIZE);   /* fixed-size buffer */
gettimeofday(&t1, NULL);
/* elapsed time = t1 - t0 */
```




From this we learned that the elapsed time varies from 2 ms to 14 ms for a fixed-size 300 KB buffer. We suspect that scheduling latency at different layers, plus the fixed overhead of memcpy, is causing this. 


Hence we are trying to change our design to an interrupt-driven approach using memory mapping. 


Please share your thoughts on this. We will update you with our findings in the meantime. 


Thank you, 

Honored Contributor II

The variation in elapsed time seems to indicate that another task is taking priority over yours. Another explanation could be that the time returned by gettimeofday() is not that accurate. I'm not very familiar with embedded Linux so I can't give you more details there, but what period are you using for the system timer? It is used by the scheduler, so it may be worth checking that it is set to 1 ms rather than 10 ms; you may get somewhat more consistent results. 

Do you really need Linux on your target? Given your requirements it may be worth using a real-time operating system instead, where you have more control over how the processor is used for critical tasks. You could also squeeze more performance out of a lighter IP stack such as lwIP. 

I think there are some variations of Linux with real-time optimizations, but I don't know whether they can be used on the SoC platform.
Honored Contributor II

Hi Daixiwen, 


Thank you for your valuable comments. 


Please find answers for your queries below:- 

1. Which period are you using for the system timer? 

[Lullaby] >> The scheduler is triggered every HZ quantum. HZ defaults to 100 in the Linux configuration [the default value for ARM], so the scheduler tick fires every 10 ms. 

I am not sure whether setting it to 1 ms is best practice. If the scheduler ticks every 1 ms, that would also add overhead, right? 


2. Do you really need Linux on your target? 

[Lullaby] >> As I said, we are doing a proof of concept. We consider Linux the first choice because it is open source, has community support, offers a wide range of drivers and stacks, and is proven. 

If we find we cannot handle this data rate deterministically, we will definitely consider RTOS options. But what is your opinion on using the RT patch on Linux? Do you know any useful links we could refer to for enabling RT in the Altera Linux 4.9 kernel? 

As I am new to the Altera community, I am not sure which forum to post questions about Altera-provided Linux in. Please guide me. 



Looking forward to hearing from you soon. We will update you with the results of our experiments. 


Thank you, 

Honored Contributor II



I think you are using a dual-core ARM CPU. How about using CPU affinity? 
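A minimal sketch of that idea, pinning the data-pump task to one of the two Cortex-A9 cores with sched_setaffinity() (the helper name is ours; _GNU_SOURCE is required on Linux):

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling task to a single CPU so it is never migrated
 * between cores mid-stream; passing pid 0 means "current task". */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);
}
```

On a dual-core SoC one core could run the read()/sendto() loop while interrupts and housekeeping are steered to the other.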


Honored Contributor II

Yes, setting the system timer to 1 ms instead of 10 ms causes a bit more overhead, but it also gives better granularity in how processor time is distributed between tasks. On a non-RT operating system it could make a difference, so it is worth trying. I don't think the overhead is that high for a >600 MHz CPU. 

As for your other question, I'm sorry, but as I said I'm not familiar with embedded Linux. If you don't find an answer here, you could try a more generic ARM Linux forum. 

We are using eCosPro ourselves but I think some people have also been successful with FreeRTOS on the SoC platform.