
Altera SGDMA & TSE

Altera_Forum
Honored Contributor II
4,237 Views

Hi all, 

I am working with the TSE and am using the SGDMA with avalon_stream_to_mem descriptors. In my code I have four descriptors, and I receive network packets into a local buffer from the SGDMA in every SGDMA callback function. The issue is that I am getting low throughput, which lowers my transfer speed. Does anybody know how to use the SGDMA at its best to get a higher data rate?
0 Kudos
20 Replies
Altera_Forum
Honored Contributor II
3,104 Views

Are your four descriptors chained? When is your callback called? Is it when the chain is completed, or for each descriptor completion? 

To make the most of the SGDMA, the secret is to make sure it is always kept busy and always has at least one usable descriptor. If you wait until it has parked at the end of the chain before you set up a new one, you are losing some time. I've never used the callback mechanism myself (I used my own ISR), but I think this is possible with callbacks too. 

Often the problem with low throughput isn't the SGDMA itself but the software that controls it. Be sure that you do as little processing as possible when handling the SGDMA interrupts, and in particular use pre-allocated memory buffers; a malloc() call is very costly. 

Be sure also that your processing of the packet data isn't the bottleneck. Obviously it should take less time than the interval between two packets.
0 Kudos
Altera_Forum
Honored Contributor II
3,104 Views

Hi.. 

My descriptor chain size is 1, and the callback is called when the chain completes. I am using the Altera SGDMA drivers and I have tried, but I couldn't get the most out of the SGDMA. In every callback I get a single frame received by the TSE. I have removed all the logic from the callback function, no malloc or memcpy is done, and I am using pre-allocated buffers. So what should I do now to get higher throughput? With the current configuration my data transfers at up to 270 KBPS, but I need it in MBPS.
0 Kudos
Altera_Forum
Honored Contributor II
3,104 Views

You should definitely be able to get a higher data rate than that. You could try a longer chain if you can handle several buffers at the same time. Which Nios CPU are you using, and at what frequency? Do you compile your code with optimization (-O2)? 

Do you call the do_async_transfer() function as soon as possible in the callback? To get the highest throughput it is important to do it with the lowest possible delay. Your callback should first look for a new buffer for the next transfer, set up the DMA for the async transfer, and only then process the received packet. That way the DMA can begin loading the next packet while you process the current one. 

If you have enough on-chip RAM in your FPGA, it can also be a good idea to use it for your network packets instead of main RAM. You can even use a dual-port RAM with one port connected to the DMA and the other to the Nios CPU.
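
To illustrate the ordering described above, here is a minimal sketch using the altera_avalon_sgdma HAL API. The buffer names, the process_frame() hook and the "/dev/sgdma_rx" device name are placeholders to adapt to your own system, and error handling is omitted:

    #include "alt_types.h"
    #include "altera_avalon_sgdma.h"
    #include "altera_avalon_sgdma_descriptor.h"
    #include "altera_avalon_sgdma_regs.h"

    #define RX_BUF_COUNT 4
    #define RX_BUF_SIZE  1536                       /* room for a full Ethernet frame */

    /* Pre-allocated packet buffers and descriptors (no malloc in the data path).
     * Descriptors must be 32-byte aligned and reachable by the SGDMA, typically
     * placed in on-chip descriptor memory. On a core with a data cache the
     * packet buffers must be flushed or accessed uncached. */
    static alt_u8 rx_buf[RX_BUF_COUNT][RX_BUF_SIZE];
    static alt_sgdma_descriptor rx_desc[RX_BUF_COUNT] __attribute__((aligned(0x20)));
    static alt_sgdma_descriptor rx_desc_end           __attribute__((aligned(0x20)));
    static alt_sgdma_dev *rx_sgdma;
    static int rx_cur;

    extern void process_frame(alt_u8 *frame);        /* application hook (placeholder) */

    static void rx_callback(void *context)
    {
        int done = rx_cur;
        rx_cur = (rx_cur + 1) % RX_BUF_COUNT;

        /* 1) Re-arm the SGDMA for the next buffer first, so the next frame is
         *    received while the current one is still being processed.
         *    Length 0 means "transfer until end-of-packet". */
        alt_avalon_sgdma_construct_stream_to_mem_desc(&rx_desc[rx_cur], &rx_desc_end,
                                                      (alt_u32 *)rx_buf[rx_cur], 0, 0);
        alt_avalon_sgdma_do_async_transfer(rx_sgdma, &rx_desc[rx_cur]);

        /* 2) Only then handle the frame that just arrived. */
        process_frame(rx_buf[done]);
    }

    void rx_init(void)
    {
        /* "/dev/sgdma_rx" is whatever name your BSP gives the RX SGDMA. */
        rx_sgdma = alt_avalon_sgdma_open("/dev/sgdma_rx");
        alt_avalon_sgdma_register_callback(rx_sgdma, rx_callback,
            ALTERA_AVALON_SGDMA_CONTROL_IE_GLOBAL_MSK |
            ALTERA_AVALON_SGDMA_CONTROL_IE_CHAIN_COMPLETED_MSK, NULL);

        alt_avalon_sgdma_construct_stream_to_mem_desc(&rx_desc[0], &rx_desc_end,
                                                      (alt_u32 *)rx_buf[0], 0, 0);
        alt_avalon_sgdma_do_async_transfer(rx_sgdma, &rx_desc[0]);
    }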
0 Kudos
Altera_Forum
Honored Contributor II
3,104 Views

Hi, I am using the Nios II small CPU at 125 MHz and the RAM size is 128 MB. I have tried -O2 optimization but didn't see any difference in performance. In the callback function I first call tse_mac_rcv(), then after checking the chain-complete mask and status, tse_mac_axRead() is called, which internally calls do_async_transfer(). After that comes my logic for the received packet. Is this the correct way to handle the ISR callback? With this configuration my ISR callback is called at most 250-300 times per second, which limits the maximum speed we can achieve to 250-300 KBPS. Is there any other way to handle this logic? I have not tried a multiple-descriptor chain.

0 Kudos
deepag
Beginner
330 Views

Hi..

I am using Quartus version 21.1 for FPGA programming and the Nios II SDK tool for software development. I am working on TSE Ethernet. We have added the TSE IP in Quartus and the BSP files were generated for the Nios tool, but I could not find functions like tse_mac_rcv(), tse_mac_axRead(), or tse_mac_raw_send().

Can you please send the files that contain these functions?

Thanks & Regards 

Deepa G

0 Kudos
Altera_Forum
Honored Contributor II
3,104 Views

It looks okay but I haven't used the tse_mac_* functions in a while and I don't remember how bloated they are. You could have a look inside them and see what they do. Ideally they should just signal the IP stack that a new packet has arrived and return, instead of doing all the packet processing. 

What network protocol are you using? TCP is slower than UDP as it requires more processing from the IP stack. 

How big is the instruction cache on the Nios? Does changing to a /f core improve anything?
0 Kudos
Altera_Forum
Honored Contributor II
3,104 Views

These functions I took from the NicheStack port for uC/OS-II. I removed all the OS-dependent code and modified it for use with bare-metal software, so all these functions internally call do_async_transfer() and the other basic APIs. My actual job is to take raw Ethernet frames from a USB port and send them all to the TSE, and similarly to send all packets received from the TSE to USB. I receive a large number of raw frames in a single USB transfer, so I split them up and send them one by one to the TSE using the tse_mac_raw_send() API. In the SGDMA callback I receive a single packet/raw frame and send it to USB. The speeds I measured with iperf are 4.72 MBPS from USB to TSE and 300-315 KBPS from TSE to USB. Now I want to increase the speed of the TSE-to-USB path. Can you tell me what the possible bottleneck in the system might be? And yes, the Nios II instruction cache size is 32 KB.

0 Kudos
Altera_Forum
Honored Contributor II
3,104 Views

Can you use a profiler? It should indicate what part of your code is using the most CPU and should help you find what to optimize. 

Another way to do that is to comment out part of the functionality and see if you get any significant speed increase. If yes, it means the bottleneck is in the commented-out code. Do you also concatenate several Ethernet frames together before sending them over USB? If not, and if there is a significant per-transfer overhead when sending through USB, that could be it. 

How are you sending to USB? Are you using a DMA that reads from the same buffer the ethernet frame was received in? If your software copies data around then it can reduce the bandwidth significantly.
0 Kudos
Altera_Forum
Honored Contributor II
3,104 Views

No, at the moment I am not concatenating Ethernet frames before sending them to USB. I tried it once, but I found that I have to wait for the frames to arrive, then concatenate them, and also wrap all the frames with a USB header and trailer and attach information like the length of each frame, the number of frames attached, etc. I am using the NCM class for this, so I have to follow the fixed NCM format for sending data. All of this takes longer and results in increased delay when I ping from a remote PC. I am also not using a DMA here, because the frame sizes are mostly 60 or 90 bytes and at most 1514 bytes. Do you really think that concatenating a number of frames will increase the speed?
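
For reference, the NCM framing mentioned here looks roughly like this (field names per the USB NCM 1.0 specification; this is only an illustrative sketch, not the code used in this project):

    #include <stdint.h>

    /* NCM Transfer Header (NTH16), at the start of every NTB sent over USB. */
    struct nth16 {
        uint32_t dwSignature;          /* "NCMH" = 0x484D434E */
        uint16_t wHeaderLength;        /* 12 */
        uint16_t wSequence;            /* NTB sequence number */
        uint16_t wBlockLength;         /* total length of this NTB */
        uint16_t wNdpIndex;            /* offset of the first NDP16 in the NTB */
    } __attribute__((packed));

    /* NCM Datagram Pointer table (NDP16): lists where each Ethernet frame sits. */
    struct ndp16 {
        uint32_t dwSignature;          /* "NCM0" = 0x304D434E (no CRC) */
        uint16_t wLength;              /* length of this NDP16, multiple of 4 */
        uint16_t wNextNdpIndex;        /* 0 if this is the last NDP in the NTB */
        struct {
            uint16_t wDatagramIndex;   /* offset of the frame within the NTB */
            uint16_t wDatagramLength;  /* length of the frame */
        } datagram[];                  /* terminated by a zero index/length pair */
    } __attribute__((packed));

Each NTB therefore costs one NTH16 plus one NDP entry per frame, so packing several frames into one NTB amortizes both the header overhead and the per-transfer USB overhead; that is the bandwidth/latency trade-off discussed in the next reply.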

0 Kudos
Altera_Forum
Honored Contributor II
3,104 Views

Without profiling it's not possible to give a definite answer, but yes, I think so. Packing lots of small packets in software takes some time and will reduce the bandwidth. If you want high bandwidth while keeping the same solution, you need to pack several packets together. With the SGDMA you can even configure the DMA to automatically pick up the different fragments where they are in memory and assemble them, reducing the CPU usage and again increasing the bandwidth. But as you say, it will increase the latency. You have to choose between latency and bandwidth. 
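
As a concrete illustration of the scatter-gather idea, here is a sketch using the SGDMA HAL descriptor helpers in the memory-to-stream (transmit) direction; the buffer and device names are placeholders. A frame whose header and payload sit in two different buffers is described by a two-descriptor chain, and the SGDMA assembles it into a single packet without any CPU copy:

    #include "alt_types.h"
    #include "altera_avalon_sgdma.h"
    #include "altera_avalon_sgdma_descriptor.h"

    /* Two fragments of one outgoing frame, living at different addresses. */
    extern alt_u8 frame_header[14];
    extern alt_u8 frame_payload[1500];
    extern alt_sgdma_dev *tx_sgdma;      /* opened elsewhere with alt_avalon_sgdma_open() */

    static alt_sgdma_descriptor tx_desc[2]  __attribute__((aligned(0x20)));
    static alt_sgdma_descriptor tx_desc_end __attribute__((aligned(0x20)));

    void send_scattered_frame(alt_u16 payload_len)
    {
        /* First fragment: generates start-of-packet, no end-of-packet. */
        alt_avalon_sgdma_construct_mem_to_stream_desc(&tx_desc[0], &tx_desc[1],
            (alt_u32 *)frame_header, sizeof(frame_header),
            0 /* read_fixed */, 1 /* SOP */, 0 /* EOP */, 0 /* channel */);

        /* Second fragment: no start-of-packet, generates end-of-packet. */
        alt_avalon_sgdma_construct_mem_to_stream_desc(&tx_desc[1], &tx_desc_end,
            (alt_u32 *)frame_payload, payload_len,
            0, 0 /* SOP */, 1 /* EOP */, 0);

        /* One transfer walks the whole chain and emits a single packet. */
        alt_avalon_sgdma_do_async_transfer(tx_sgdma, &tx_desc[0]);
    }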

Another solution to get both high bandwidth and low latency would be to do the whole Ethernet-to-USB conversion in hardware instead, and not use the Nios CPU or the DMAs at all. But it's more work.
0 Kudos
Altera_Forum
Honored Contributor II
3,104 Views

Hi, 

I tried accumulating multiple raw frames and wrapping them with the USB header and trailer, but it adds latency to pings. Another issue is that my USB side can't keep up with the rate of received frames: right from the start the RX interrupt fires very fast, so my queue gets full and data packets are dropped. Can you please tell me how to handle this kind of situation, where the upstream flow is polling-based and the downstream flow is interrupt-based? And how do stacks like lwIP or NicheStack handle data, given that in all of them reception is SGDMA interrupt-based?
0 Kudos
Altera_Forum
Honored Contributor II
3,104 Views

The IP stacks only use interrupts; polling usually uses too much CPU. For higher performance, some stacks have a high-priority thread that is only responsible for receiving the packet after an interrupt and re-configuring the DMA for the next one, while a lower-priority thread does the actual processing of the packet. 
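
On a bare-metal system the same split can be done without threads: the SGDMA callback only records which buffer completed and re-arms the DMA, while the main loop, which also polls the USB side, drains a small queue at its own pace. A minimal single-producer/single-consumer sketch with placeholder names (the queue depth bounds how large a burst of frames can be absorbed before dropping):

    #include "alt_types.h"

    #define RX_QUEUE_DEPTH 16        /* absorbs short bursts of received frames */

    /* Single-producer (ISR) / single-consumer (main loop) ring of buffer indices.
     * With one writer per index variable, no locking is needed on a Nios II. */
    static volatile alt_u8 rx_queue[RX_QUEUE_DEPTH];
    static volatile int rx_head, rx_tail;

    extern void usb_send_frame(int buf_index);   /* placeholder: wrap with NCM and send */
    extern void usb_poll(void);                  /* placeholder: service the USB controller */

    /* Called from the SGDMA callback: enqueue the completed buffer, re-arm, return. */
    void rx_isr_enqueue(int buf_index)
    {
        int next = (rx_head + 1) % RX_QUEUE_DEPTH;
        if (next == rx_tail)
            return;                  /* queue full: frame dropped, count this in real code */
        rx_queue[rx_head] = (alt_u8)buf_index;
        rx_head = next;
    }

    /* Called from the main loop between USB polling steps. Returns -1 if empty. */
    static int rx_dequeue(void)
    {
        int buf_index;
        if (rx_tail == rx_head)
            return -1;
        buf_index = rx_queue[rx_tail];
        rx_tail = (rx_tail + 1) % RX_QUEUE_DEPTH;
        return buf_index;
    }

    /* Main loop sketch: USB is polled, Ethernet RX is interrupt driven. */
    void main_loop(void)
    {
        for (;;) {
            int idx = rx_dequeue();
            if (idx >= 0)
                usb_send_frame(idx);
            usb_poll();
        }
    }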

It's difficult to tell without the code, but from your description it looks like you have too much processing on the CPU side, and that is causing the packet drops. Use a profiler to find where the bottleneck is. 

Do you have to do much software processing on the Ethernet frames? If it is just simple encapsulation, you'll get better performance by doing everything in hardware instead of going through a DMA and a software stack. I've never used USB cores, but if yours has Avalon Streaming interfaces it shouldn't be too complicated to set up.
0 Kudos
Altera_Forum
Honored Contributor II
3,104 Views

Actually we are planning to put everything into hardware, but assuming we can get some good results in software first, we'll then convert that logic into hardware. When returning data to USB I take one packet at a time from the SGDMA and wrap it. I also tried wrapping five packets into one USB packet and sending that, but the issue is that the receive packet queue gets full before all of them are forwarded to USB. So I think my USB side is slower than the TSE, which results in packet loss. 

I used the profiler and the high-performance counter as well, but the run executed only once, and nothing looked like a bottleneck; neither USB nor TSE was eating more CPU. 

Another option I am now trying is to poll the SGDMA instead of using interrupts, so that packet loss can be controlled at the cost of bandwidth. Do you think this is a viable solution, or is putting everything into hardware the only way?
0 Kudos
Altera_Forum
Honored Contributor II
3,104 Views

I'm not sure you'll get a significant improvement by using polling. It's true that you would save the time lost in the CPU's interrupt handling, but on the other hand, if you use the Altera HAL do_sync_transfer() function to poll, the CPU can't do anything else (like sending the data to USB) while it waits for a DMA transfer. Another solution could be to bypass the HAL and use the SGDMA registers directly. When a packet is received, you configure the SGDMA for the next packet, enable it again, then process the received packet, send it through USB, and only then poll the SGDMA status and wait for a new packet. That way, if an Ethernet packet is received while you are sending the previous one to USB, you won't waste any CPU cycles. But again, the gain compared to the interrupt-based solution could be marginal. 
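
A rough sketch of that polling structure, mixing the HAL descriptor helpers with direct reads of the SGDMA status register (the register macros and masks are the ones from altera_avalon_sgdma_regs.h; SGDMA_RX_BASE, the buffers and usb_send() are placeholder names to adapt):

    #include "system.h"                      /* SGDMA_RX_BASE from your BSP (name assumed) */
    #include "alt_types.h"
    #include "altera_avalon_sgdma.h"
    #include "altera_avalon_sgdma_descriptor.h"
    #include "altera_avalon_sgdma_regs.h"

    #define RX_BUF_COUNT 4
    #define RX_BUF_SIZE  1536

    static alt_u8 rx_buf[RX_BUF_COUNT][RX_BUF_SIZE];
    static alt_sgdma_descriptor rx_desc[RX_BUF_COUNT] __attribute__((aligned(0x20)));
    static alt_sgdma_descriptor rx_desc_end           __attribute__((aligned(0x20)));

    extern alt_sgdma_dev *rx_sgdma;          /* opened elsewhere, no callback registered */
    extern void usb_send(alt_u8 *frame);     /* placeholder USB transmit routine */

    static void arm(int i)
    {
        alt_avalon_sgdma_construct_stream_to_mem_desc(&rx_desc[i], &rx_desc_end,
                                                      (alt_u32 *)rx_buf[i], 0, 0);
        alt_avalon_sgdma_do_async_transfer(rx_sgdma, &rx_desc[i]);
    }

    void rx_poll_loop(void)
    {
        int cur = 0;
        arm(cur);

        for (;;) {
            /* Wait for the one-descriptor chain to complete. Instead of
             * spinning, this is where USB housekeeping could run. */
            while (!(IORD_ALTERA_AVALON_SGDMA_STATUS(SGDMA_RX_BASE) &
                     ALTERA_AVALON_SGDMA_STATUS_CHAIN_COMPLETED_MSK))
                ;

            /* Status bits are sticky; clear them by writing them back. */
            IOWR_ALTERA_AVALON_SGDMA_STATUS(SGDMA_RX_BASE,
                IORD_ALTERA_AVALON_SGDMA_STATUS(SGDMA_RX_BASE));

            int done = cur;
            cur = (cur + 1) % RX_BUF_COUNT;
            arm(cur);                 /* receive the next frame in the background... */
            usb_send(rx_buf[done]);   /* ...while this one is forwarded to USB */
        }
    }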

The optimal way is to bypass the CPU completely and put everything in hardware.
0 Kudos
Altera_Forum
Honored Contributor II
3,104 Views

Hi, 

After a long time I again need some help from you. We are planning to move all of this to hardware, removing the Nios and creating a component that takes over the Nios's part. Now I want to know whether it is possible to use multiple descriptors with the SGDMA. The drivers given here from Altera use and process a single descriptor at a time; if I want to use multiple descriptors, how can I do that? And if I want to use this multiple-descriptor concept when developing the hardware, how can it be done? 

Waiting for your response...
0 Kudos
Altera_Forum
Honored Contributor II
3,104 Views

I'm not sure how much "move this whole thing to hardware" entails, but using the SGDMA is associated with a Nios (software) implementation. If you're actually going to move everything into a hardware implementation, you may be better off not using the SGDMA and its RAM-based descriptors; i.e. it may be simpler to write your own block to tx/rx the packet data than to write a block that programs the descriptor RAM for the Altera SGDMA. 

 

If you are still using the Altera-supplied InterNiche driver (ins_tse_mac.c), they have already done the work to make the driver support multiple descriptors: search for ALTERA_TSE_SGDMA_RX_DESC_CHAIN_SIZE, for example. Or, if you're using your own driver at this point, review their tse_sgdma_read_init() function and see how it loops to create a chain of descriptors.
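
A condensed sketch of what that init loop does, with simplified placeholder names (each descriptor gets its own buffer, and the chain ends at a blank terminating descriptor that the construct call marks as not owned by hardware):

    #include "alt_types.h"
    #include "altera_avalon_sgdma.h"
    #include "altera_avalon_sgdma_descriptor.h"

    #define RX_CHAIN_SIZE 8       /* plays the role of ALTERA_TSE_SGDMA_RX_DESC_CHAIN_SIZE */
    #define RX_BUF_SIZE   1536

    static alt_u8 rx_buf[RX_CHAIN_SIZE][RX_BUF_SIZE];
    static alt_sgdma_descriptor rx_desc[RX_CHAIN_SIZE + 1] __attribute__((aligned(0x20)));
    /* rx_desc[RX_CHAIN_SIZE] is the blank terminating descriptor: the construct
     * call below clears its "owned by hardware" bit so the SGDMA stops there. */

    extern alt_sgdma_dev *rx_sgdma;

    void rx_chain_init(void)
    {
        int i;

        /* Link descriptor i to descriptor i+1; each one receives a whole packet
         * (length 0 = until end-of-packet) into its own buffer. */
        for (i = 0; i < RX_CHAIN_SIZE; i++)
            alt_avalon_sgdma_construct_stream_to_mem_desc(&rx_desc[i], &rx_desc[i + 1],
                                                          (alt_u32 *)rx_buf[i], 0, 0);

        /* Start the whole chain with a single transfer; the SGDMA now receives up
         * to RX_CHAIN_SIZE packets back-to-back without software intervention. */
        alt_avalon_sgdma_do_async_transfer(rx_sgdma, &rx_desc[0]);
    }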
0 Kudos
Altera_Forum
Honored Contributor II
3,104 Views

Hi ted, 

As you say there is already support for multiple descriptors in the InterNiche drivers; can you tell me how to use it for multiple descriptors and/or a chain? I tried calling altera_sgdma_stream_to_mem_descriptor() but couldn't get it to work. If we increase the value stored in ALTERA_TSE_SGDMA_RX_DESC_CHAIN_SIZE to 2 or more, it will be a chain with two (or more) descriptors. If I want to use the SGDMA at its maximum capacity, how should I go about multiple chains or descriptors? 

 

Another question: there are many references, like the uC/OS-II InterNiche stack, the lwIP stack, and the standalone SGDMA software, but none of them uses multiple-chain logic. Is there any specific reason for this, or do people generally not prefer to use multiple chains?
0 Kudos
Altera_Forum
Honored Contributor II
3,104 Views

 

--- Quote Start ---  

Another question: there are many references, like the uC/OS-II InterNiche stack, the lwIP stack, and the standalone SGDMA software, but none of them uses multiple-chain logic. Is there any specific reason for this, or do people generally not prefer to use multiple chains? 

--- Quote End ---  

I don't think any of these stacks are optimized out of the box for the absolute highest performance. They work well for their intended use, with modest performance on a modest processor. I think there is a general understanding that if you want to saturate one or more 1000 Mbps links, you aren't going to do it with a ~100 MHz processor and software. See http://www.alterawiki.com/wiki/nios_ii_udp_offload_example 

--- Quote Start ---  

Hi ted, 

As you say there is already support for multiple descriptors in the InterNiche drivers; can you tell me how to use it for multiple descriptors and/or a chain? I tried calling altera_sgdma_stream_to_mem_descriptor() but couldn't get it to work. If we increase the value stored in ALTERA_TSE_SGDMA_RX_DESC_CHAIN_SIZE to 2 or more, it will be a chain with two (or more) descriptors. If I want to use the SGDMA at its maximum capacity, how should I go about multiple chains or descriptors? 

--- Quote End ---  

This is largely a repeat of the good advice Daixiwen already gave you earlier in this thread: 

Software driving the SGDMA at its maximum capacity would consist of a (very?) long circular chain of descriptors stored in dual-port on-chip RAM, with the other port of the RAM connected to the Nios tightly coupled data memory interface. The software would basically consist of initializing the chain and starting the SGDMA, and then continuously processing the chain looking for completed descriptors. You can do that either by foreground polling of the chain, or by having the SGDMA raise an interrupt on each descriptor and having your ISR process every completed descriptor each time it takes an interrupt (see the sketch below). 

To do this you will need to overcome whatever issue you ran into with the HAL API's altera_sgdma_stream_to_mem_descriptor() etc., as you're going beyond any of the readily available examples. 

You can get quite far with an approach like this; however, at some point you will want to consider other options, including the Modular SGDMA and developing custom IP for your fixed function.
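
A sketch of the "continuously process the chain" part, building on the chain-init sketch earlier in the thread. It assumes the descriptors live in on-chip/uncached memory, and the owned-by-hardware test uses the descriptor control bit as named in altera_avalon_sgdma_descriptor.h, which you should verify against your HAL version. It re-arms the whole chain only once every descriptor has been consumed, because the HAL construct helpers clear the control word of the next descriptor and would otherwise break a chain that is still active:

    #include "alt_types.h"
    #include "altera_avalon_sgdma.h"
    #include "altera_avalon_sgdma_descriptor.h"

    #define RX_CHAIN_SIZE 8
    #define RX_BUF_SIZE   1536

    extern alt_u8 rx_buf[RX_CHAIN_SIZE][RX_BUF_SIZE];
    extern alt_sgdma_descriptor rx_desc[RX_CHAIN_SIZE + 1];  /* last entry terminates the chain */
    extern alt_sgdma_dev *rx_sgdma;
    extern void handle_frame(alt_u8 *frame);                 /* placeholder consumer */

    /* The SGDMA clears the descriptor's "owned by hardware" control bit when it
     * has finished with it, so a cleared bit on an armed descriptor means a frame
     * has landed in its buffer. (Mask name assumed from the HAL header.) */
    static int descriptor_done(const alt_sgdma_descriptor *d)
    {
        return !(d->control & ALTERA_AVALON_SGDMA_DESCRIPTOR_CONTROL_OWNED_BY_HW_MSK);
    }

    /* Call from the foreground loop or from the ISR: consume whatever has
     * completed so far, and once the whole chain has been used, rebuild and
     * restart it in one go. */
    void rx_process_chain(void)
    {
        static int next = 0;       /* next descriptor expected to complete */
        int i;

        while (next < RX_CHAIN_SIZE && descriptor_done(&rx_desc[next])) {
            handle_frame(rx_buf[next]);
            next++;
        }

        if (next == RX_CHAIN_SIZE) {        /* chain fully consumed: re-arm it */
            for (i = 0; i < RX_CHAIN_SIZE; i++)
                alt_avalon_sgdma_construct_stream_to_mem_desc(&rx_desc[i], &rx_desc[i + 1],
                                                              (alt_u32 *)rx_buf[i], 0, 0);
            alt_avalon_sgdma_do_async_transfer(rx_sgdma, &rx_desc[0]);
            next = 0;
        }
    }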
0 Kudos
Altera_Forum
Honored Contributor II
3,104 Views

Thanks! I got the multiple descriptors working fine.

0 Kudos
Msg06484
Beginner
2,400 Views

Hi,

I hope it is OK to append to this thread; I am new to this forum. My question is similar but more basic, so I thought this might be the place to ask.

I have the lwIP core, and I have the sample design for the Cyclone IV E found at https://github.com/adwinying/lwIP-NIOSII/tree/master/FPGA/software/lwIP_NIOS_II_Example

I am trying to develop this for the Arria 10 in Quartus 18.1.

I first instantiated the SGDMA and found that lwIP does not like that...

I then found that it does like the mSGDMA, so I instantiated that, but there are still errors, involving things like ALTERA_TSE_SGDMA_INTR_MASK and ALTERA_TSE_FIRST_RX_MSGDMA_DESC_OFST.

To tell the truth, I also instantiated on-chip memory (RAM) for the DMA, and a descriptor_memory ROM.

I am the only person in the company on this project right now. I have access to corporate tech support at Intel, but we are struggling.

Does anyone know of a sample design that would be a better fit for lwIP on Quartus 18 (possibly with the Arria 10)?

Thanks in advance.

0 Kudos