Improve performance of TSE on Stratix IV GX

Altera_Forum
Honored Contributor II
1,076 Views

Hello,

I'm using the Standard design of Stratix IV, on which I run the Nios II Simple Socket Server in order to receive incoming data from my other computer.

I managed to establish a TCP/IP connection between my client and the FPGA server.

My client application sends 40 bytes of data with each socket send() call.
Every time my Nios II receives a packet, it acknowledges it by sending a packet back to the client.

Everything is fine except the throughput of my Nios II, which is about 70 kbytes/s.

I've read the "No free buffers for rx" thread from kaushal, which helped me instantiate the tightly coupled memory, and I modified the ipport.h file according to the help note of the Ethernet_accel_design.

Following AN440, I also included the optimization.c and .h files, and my ipport.h looks like this:


#define ALTERA_MRAM_FOR_PACKETS     1
# define ALTERA_MRAM_ALLOC_BASE     PACKET_MEMORY_BASE
# define ALTERA_MRAM_ALLOC_SPAN     PACKET_MEMORY_SPAN

# ifdef ALTERA_TRIPLE_SPEED_MAC
char * ncpalloc(unsigned size);
void ncpfree(void *ptr);
char * ncpballoc(unsigned size);
int ncpbfree(char *ptr);
# if ALTERA_MRAM_FOR_PACKETS
//#include "../net/optimizations.h"
#  define BB_ALLOC(size)   ncpballoc(size)
#  define BB_FREE(ptr)     ncpbfree(ptr)
#  define LB_ALLOC(size)   ncpballoc(size)
#  define LB_FREE(ptr)     ncpbfree(ptr)
# else
#  define BB_ALLOC(size)   ncpalloc(size)   /* Big packet buffer alloc */
#  define BB_FREE(ptr)     ncpfree(ptr)
#  define LB_ALLOC(size)   ncpalloc(size)   /* Little packet buffer alloc */
#  define LB_FREE(ptr)     ncpfree(ptr)
# endif
# else /* Not ALTERA_TRIPLE_SPEED_MAC */
# define BB_ALLOC(size)    npalloc(size)    /* Big packet buffer alloc */
# define BB_FREE(ptr)      npfree(ptr)
# define LB_ALLOC(size)    npalloc(size)    /* Little packet buffer alloc */
# define LB_FREE(ptr)      npfree(ptr)
# endif /* ALTERA_TRIPLE_SPEED_MAC */

 

and also: 

#if ALTERA_MRAM_FOR_PACKETS
# define NUMBIGBUFS 30
# define NUMLILBUFS 30
/* some maximum packet buffer numbers */
# define MAXBIGPKTS 30
# define MAXLILPKTS 30
# define MAXPACKETS (MAXLILPKTS+MAXBIGPKTS)
# define BIGBUFSIZE 1536
# define LILBUFSIZE 128

 

Please see my .qsys file and my ipport.h file attached. 

 

Apart from hardware checksum calculation, is there anything else I could try in order to improve the Simple Socket Server performance to at least 1 Mbyte/s?

Do you think that putting my .stack and .heap directly into a 300-kbyte on-chip memory would give me better results?

 

Thank you in advance 

 

With regards, 

Michel
5 Replies
Altera_Forum

Putting the stack in on-chip memory will help. Did you remember to compile with optimisations? Going from -O0 to -O2 is what gave me the biggest speed boost.

With such small packets you won't get very high performance, unfortunately, because the time goes into header processing rather than the data transfer itself. Do you have any way of combining the data from several packets into bigger ones? That will give you a big speed boost too.
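
A minimal sketch of the "combine several packets into bigger ones" idea on the client side, in plain C. The batch size of 25 records and the names batch_t, batch_add and batch_reset are hypothetical and do not come from the Simple Socket Server sources; only the 40-byte record size is taken from the thread.

```c
#include <stddef.h>
#include <string.h>

#define RECORD_SIZE   40u   /* one application record, as in the thread */
#define BATCH_RECORDS 25u   /* hypothetical: 25 * 40 = 1000 bytes per send() */

typedef struct {
    unsigned char data[RECORD_SIZE * BATCH_RECORDS];
    size_t        used;     /* bytes currently queued in data[] */
} batch_t;

/* Append one 40-byte record to the batch.
 * Returns  1 if the batch is now full and should be sent in one call,
 *          0 if there is still room,
 *         -1 if the batch was already full (nothing is copied). */
int batch_add(batch_t *b, const unsigned char *record)
{
    if (b->used + RECORD_SIZE > sizeof b->data)
        return -1;
    memcpy(b->data + b->used, record, RECORD_SIZE);
    b->used += RECORD_SIZE;
    return b->used == sizeof b->data;
}

/* Mark the batch as sent. Returns how many bytes had been queued. */
size_t batch_reset(batch_t *b)
{
    size_t n = b->used;
    b->used = 0;
    return n;
}
```

When batch_add() returns 1, the caller would pass b->data and b->used to a single send() and then call batch_reset(). The per-packet header processing is then paid once per 1000 bytes instead of once per 40 bytes.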
Altera_Forum

You'll increase the throughput significantly if you allow more than one data packet before requiring the application-level ack.

If you need both throughput and data reliability, you may find that the TCP retries following packet loss take too long. Using your own protocol over UDP may be necessary in the long run.
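
One way to allow several packets per application-level ack is a cumulative acknowledgement: each data packet carries a sequence number, and the receiver replies every few packets with the next sequence number it expects. The sketch below shows only the receiver-side bookkeeping. The window of 5 and all names (rx_state_t, rx_on_packet, ACK_EVERY) are assumptions made for illustration, not part of any Altera or InterNiche API.

```c
#include <stdint.h>

#define ACK_EVERY 5u   /* hypothetical: one ack per 5 in-order packets */

typedef struct {
    uint32_t next_seq;   /* next in-order sequence number expected */
    uint32_t since_ack;  /* packets accepted since the last ack */
} rx_state_t;

typedef enum { RX_NO_ACK, RX_SEND_ACK } rx_action_t;

/* Process one received sequence number.
 * When RX_SEND_ACK is returned, *ack_out holds the cumulative ack:
 * the next sequence number the receiver expects. */
rx_action_t rx_on_packet(rx_state_t *s, uint32_t seq, uint32_t *ack_out)
{
    if (seq != s->next_seq) {
        /* Gap or duplicate: ack immediately so the sender can resync
         * instead of waiting forever for an ack that never comes. */
        *ack_out = s->next_seq;
        s->since_ack = 0;
        return RX_SEND_ACK;
    }
    s->next_seq++;
    if (++s->since_ack >= ACK_EVERY) {
        s->since_ack = 0;
        *ack_out = s->next_seq;
        return RX_SEND_ACK;
    }
    return RX_NO_ACK;
}
```

Because the ack names a sequence number rather than "the last 5 packets", both sides can still agree on where the stream stands if their counts drift.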
Altera_Forum

I already tried asking for the ack only after 5 received packets, but I ran into a synchronisation problem: it was as if the TSE received one packet late while the client was already waiting for its ack. This deadlocked my transmission, and I can't tolerate any packet loss or a crash.

I'll try adding an ack_num field to my own packets (just like the one in the TCP protocol), but does this improvement really matter for the Ethernet speed?
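
One possible cause of this kind of desynchronisation, not confirmed anywhere in this thread, is that TCP is a byte stream: the data from one send() may arrive split across several receive calls or merged with the next one. Counting records instead of receive calls avoids that. The sketch below reassembles fixed 40-byte records from arbitrary chunks; framer_t, framer_feed and record_fn are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

#define RECORD_SIZE 40u   /* application record size used in the thread */

typedef struct {
    unsigned char partial[RECORD_SIZE];  /* bytes of an incomplete record */
    size_t        have;                  /* how many bytes are in partial[] */
} framer_t;

/* Callback invoked once per complete record (may be NULL). */
typedef void (*record_fn)(const unsigned char *record, void *ctx);

/* Feed an arbitrary chunk of received bytes into the framer.
 * Returns the number of complete records emitted from this chunk. */
size_t framer_feed(framer_t *f, const unsigned char *chunk, size_t len,
                   record_fn on_record, void *ctx)
{
    size_t emitted = 0;
    while (len > 0) {
        size_t want = RECORD_SIZE - f->have;
        size_t take = len < want ? len : want;
        memcpy(f->partial + f->have, chunk, take);
        f->have += take;
        chunk   += take;
        len     -= take;
        if (f->have == RECORD_SIZE) {
            if (on_record)
                on_record(f->partial, ctx);
            f->have = 0;
            emitted++;
        }
    }
    return emitted;
}
```

The return value is what an "ack every 5 records" counter would be driven by, regardless of how the bytes were split on the wire.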

 

Actually, I managed to use bigger data packets (1024 bytes) and my Ethernet speed is now about 2 Mbytes/s, but I'd like to reach something like 5 Mbytes/s if possible. Do you think that's possible without hardware checksum calculation?

Otherwise, I guess I'll have to write my own transmission protocol, is that right?
Altera_Forum

The UDP data checksum is optional over IPv4 (just very recommended). Setting the checksum field to 0x0000 tells the receiving side not to validate the checksum; a computed checksum that happens to be zero is transmitted as 0xffff instead.
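
For reference, a minimal sketch of the 16-bit one's-complement Internet checksum that UDP uses, written in portable C. It assumes the caller passes a buffer that already contains the UDP pseudo-header, the UDP header and the data, and that the buffer is well under 64 kbytes so the 32-bit accumulator cannot overflow. The function names are illustrative, not taken from the Nios II stack.

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum of 16-bit big-endian words, complemented.
 * A trailing odd byte is padded with a zero byte on the right. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len  -= 2;
    }
    if (len == 1)
        sum += (uint32_t)data[0] << 8;
    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Value to place in the UDP checksum field: 0x0000 is reserved for
 * "no checksum computed", so a computed zero is sent as 0xFFFF. */
uint16_t udp_checksum_field(uint16_t computed)
{
    return computed == 0 ? 0xFFFFu : computed;
}
```

This is the per-byte work a software stack has to do for every packet, which is why hardware checksum offload comes up so often in this kind of tuning.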

 

To obtain 5 MB/s you probably need to overlap the network time with the Nios processing of the previous/next (depending on direction) packet.

Whether that speed is attainable, I don't know.

Whether it is attainable without writing a bespoke TCP/IP stack is also questionable.
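
Overlapping here means the receive path can fill one buffer while the application is still working on an earlier one. A small first-in, first-out ring of packet slots is one way to sketch it, and it keeps packets in arrival order. The slot size reuses BIGBUFSIZE (1536) from the ipport.h posted above; the depth of 4 and the names pkt_ring_t, ring_put and ring_get are assumptions. A real version shared between an interrupt handler and the main loop would also need volatile indices or proper synchronisation, which this single-context sketch leaves out.

```c
#include <stddef.h>
#include <string.h>

#define SLOT_SIZE  1536u  /* same value as BIGBUFSIZE in the posted ipport.h */
#define SLOT_COUNT 4u     /* hypothetical depth; a power of two */

typedef struct {
    unsigned char data[SLOT_COUNT][SLOT_SIZE];
    size_t        len[SLOT_COUNT];
    unsigned      head;   /* total packets written so far */
    unsigned      tail;   /* total packets read so far */
} pkt_ring_t;

/* Producer side (receive path). Returns 1 on success, 0 if the ring
 * is full or the packet is empty or too large. */
int ring_put(pkt_ring_t *r, const void *pkt, size_t len)
{
    if (len == 0 || len > SLOT_SIZE || r->head - r->tail == SLOT_COUNT)
        return 0;
    unsigned i = r->head % SLOT_COUNT;
    memcpy(r->data[i], pkt, len);
    r->len[i] = len;
    r->head++;
    return 1;
}

/* Consumer side (processing). Copies the oldest packet into out, which
 * must hold SLOT_SIZE bytes. Returns its length, or 0 if the ring is empty. */
size_t ring_get(pkt_ring_t *r, void *out)
{
    if (r->head == r->tail)
        return 0;
    unsigned i = r->tail % SLOT_COUNT;
    size_t n = r->len[i];
    memcpy(out, r->data[i], n);
    r->tail++;
    return n;
}
```

Packets always come out in the order they went in, so the chronological ordering needed by a real-time application is preserved.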
Altera_Forum

Thank you dsl,

I think I'll keep the checksum if it's very recommended; I wouldn't like to do unrecommended things with my FPGA =)

I don't understand the idea of overlapping the network time with the processing of the previous/next packet. Do you mean that I should dequeue the received packets and store them anyway, even if they aren't in order like packet_t | packet_t+1 | packet_t+2?
If that's what you mean, I can't do that, because I need my packets in chronological order since it's a real-time application.

I don't think I'll create my own TCP/IP stack, since that may be risky for my project timeline.

Do you think I can offload my 1440-byte buffer using the Nios PIO?
I'd like my VHDL component to read that buffer 40 bytes at a time, but it seems that the Nios II is way too slow.

I was thinking about storing the data in 10 different FIFOs, but I'm not sure it'll be fast enough.
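
As a rough sketch of the arithmetic involved: a 1440-byte buffer holds 36 records of 40 bytes, and each record is ten 32-bit words, so one word per FIFO would map naturally onto 10 lanes. The code below walks the buffer that way and hands each word to a callback. The callback stands in for whatever register write the real hardware would need; word_sink_t, offload_records, lane_counts_t and count_sink are all hypothetical names, and count_sink only counts words so the loop can be checked on a PC.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RECORD_SIZE      40u  /* bytes per record, as in the thread */
#define WORDS_PER_RECORD 10u  /* 40 bytes / 4 bytes per 32-bit word */

/* Hypothetical sink: called once per word with the lane (0..9) it
 * belongs to. On hardware this would be a PIO or FIFO register write. */
typedef void (*word_sink_t)(unsigned lane, uint32_t word, void *ctx);

/* Split buf into 40-byte records and emit each record as 10 words.
 * Trailing bytes that do not fill a record are ignored.
 * Returns the number of complete records processed. */
size_t offload_records(const unsigned char *buf, size_t len,
                       word_sink_t sink, void *ctx)
{
    size_t records = len / RECORD_SIZE;
    for (size_t r = 0; r < records; r++) {
        for (unsigned lane = 0; lane < WORDS_PER_RECORD; lane++) {
            uint32_t w;
            /* memcpy avoids unaligned 32-bit loads from the byte buffer */
            memcpy(&w, buf + r * RECORD_SIZE + lane * sizeof w, sizeof w);
            sink(lane, w, ctx);
        }
    }
    return records;
}

/* Test stand-in for the hardware: counts the words sent to each lane. */
typedef struct { unsigned words[WORDS_PER_RECORD]; } lane_counts_t;

void count_sink(unsigned lane, uint32_t word, void *ctx)
{
    lane_counts_t *c = (lane_counts_t *)ctx;
    (void)word;
    c->words[lane]++;
}
```

That is 360 word writes per 1440-byte buffer if the CPU does the copy itself, so whether it is fast enough depends mostly on the cost of each write on the actual system.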