NicheStack multiple select() call fails!

Altera_Forum · 11-06-2009

Hello all,

I have created two (or more) network tasks in the µC/OS-II environment which use the NicheStack TCP/IP stack. When I call select() on their sockets from both tasks, only the select() of the higher-priority task returns when data is received! All select() calls made by lower-priority tasks block forever, even when data is received on the corresponding sockets!

I used the 8.1 IDE and have installed the patch from the Altera support solutions (rd01132009_588).

I have already debugged into the NicheStack functions; it seems that tcp_wakeup() doesn't work for the lower-priority tasks.

Any ideas?

Altera_Forum · 11-06-2009

There is a known problem with the nichestack being utilized from multiple UCOSII tasks. I was unable to get a specific problem ID from Altera, but they say the problem was supposed to be fixed in 9.0, but the fix didn't work, and they hoped that it would be fixed in 9.1, although at first glance of 9.1 I don't see any release notes describing a fix. The nichestack version in 9.1 remains at 3.1, the same as 9.0.

If you use the jtag uart for stdout and stdio, you will probably see "dtrap - breakpoint needed" messages generated by nichestack.

My solution was to refactor my code so that all tcp/ip calls are done from a single task which gives time slices to the individual network-related functions I need.

nioswiki.com has an example of a "superloop" version of nichestack usage, where all network calls are handled in a single task. This is a good starting point for you.

The catch with the superloop model is that it is all or nothing - either superloop or UCOSII. I needed the networking to function properly in the presence of UCOSII handling other functions so I ended up with "superloop in a task" with all of my network code in a superloop inside a single UCOSII task. In order to do this I had to reverse-engineer the ipport.h file found in your syslib, in order to set the appropriate DEFINES that weren't documented, and I had to modify netmain.c located in misclib under the Altera components syslib/Nios II Software Packages/altera_iniche/UCOSII/src/misclib.
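A minimal sketch of the "superloop in a task" idea, assuming each network function can be rewritten as a non-blocking poll routine that does a bounded amount of work and returns. The handler type and `superloop_run()` are hypothetical names, not NicheStack or UCOSII APIs; on the target the loop would run forever inside the single network task and the handlers would poll their sockets with a zero timeout instead of just counting calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handler: does one bounded slice of work and returns.
 * In a real port this would poll its socket(s) without blocking. */
typedef void (*net_handler_fn)(void *ctx);

typedef struct {
    net_handler_fn fn;
    void *ctx;
} net_handler;

/* Run 'iterations' passes of the superloop, giving each handler one
 * time slice per pass. On the target this would be an endless loop
 * inside the single network task, typically with a short task delay
 * at the end of each pass so other UCOSII tasks can run. */
static void superloop_run(const net_handler *handlers, size_t count,
                          unsigned iterations)
{
    assert(handlers != NULL || count == 0);
    for (unsigned pass = 0; pass < iterations; pass++) {
        for (size_t i = 0; i < count; i++) {
            handlers[i].fn(handlers[i].ctx);
        }
    }
}

/* Illustrative handler: counts how often it was serviced. */
static void count_calls(void *ctx)
{
    (*(unsigned *)ctx)++;
}
```

The trade-off is that every handler must give up control quickly; a handler that blocks stalls all networking in that task.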

Not the answer you wanted to hear, but an alternative if you aren't able to resolve your specific problem.

Mike

Altera_Forum · 11-06-2009

Thank you for your reply!

You are right! That is not what I wanted to hear :)!

I currently have a support request open with Altera; perhaps there is a bugfix available for that problem.

In my design I have implemented a web server (based on the simple_socket_server) which can hold up to 10 connections, plus some other network tasks, each of which must also be able to handle more than one socket connection. So for me the select() function is very powerful, because I can set up an fd set (fds) containing all sockets of the corresponding task that should be checked for readability. Then select() blocks, and other tasks can do their work until select() returns.

I have to fulfill very tough requirements regarding reaction time. When data arrives on a socket, it must be read and processed as fast as possible! The web server is only for diagnostic information and configuration, so that task has a much lower priority.

I think that all the advantages of using an RTOS are gone when multiple select() calls cannot be handled by the stack!

By the way, do you know if multiple blocking recvfrom() calls work?

Altera_Forum · 11-07-2009

There is a further suspicious thing...

When I used the original implementation that comes with the 8.1 design suite, there seemed to be no problem with multiple select() calls in general. But that implementation has the issue that select() sometimes doesn't return when TCP packets are received. So I decided to install the Altera patch.

Altera_Forum · 11-09-2009

Try to edit your ipport.h file. Look for these lines:

/* use the sleep/wakeup in misclib/netmain.c */
//#define TCPWAKE_RTOS    1

and uncomment the define.

It's not rock-solid, but it's a lot better. Nevertheless, I wouldn't recommend using the InterNiche stack from multiple tasks in production code... or I would recommend sticking with 8.1.

Altera_Forum · 11-09-2009

Thank you for your reply!

I will try it!

I noticed that when select() is first entered in both tasks, the wakeup function wakes both tasks correctly (this is done for timeout checking within select()).

But when I send data to the socket of the lower-priority task, this task is permanently excluded from the cyclic wakeup.

So I think the problem is located in the tcp_sleep() and/or tcp_wakeup() functions, or there is a bug in the semaphore/mutex handling...

But what do you recommend in my case?

(I need a very short latency between data reception on a socket and its processing.)

Altera_Forum · 11-09-2009

I saw the same thing as you... The application worked fine with the Nios II 8.0 IDE and started to have synchronization problems when we switched to 9.0. The TCPWAKE_RTOS define solved it, but we still have a crash of the TCP/IP stack about every 2-3 days. Those crashes could be unrelated to this synchronization problem, though.

Since then we have switched operating systems and no longer use the InterNiche stack, so I didn't investigate the matter further.

Altera_Forum · 11-11-2009

Hello all,

I found a solution to the problem with the select() call in a multitask environment.

How I found the bug:

First, to get proper debug output, I modified the tcp_wakeup() function, which is located in the file "tk_crnos.c", by inserting a printf just before the following line of code:


//my modification:
dprintf("w %u %d\n", OSTCBCur->OSTCBPrio ,i);
//END my modification
/* we found the TCB with our cookie */
error = OSSemPost(WEP->wake_sem);

This printf reports the priority of the task from which tcp_wakeup() was called (OSTCBCur->OSTCBPrio) and the priority of the task that will be woken up (i).

In my case I have the following tasks running in my little test implementation:

"inet_main" Prio 2 --> NicheStack network task

"clock_tick" Prio 3 --> NicheStack network task

"net_task_1" Prio 4 --> My first user task

"net_task_2" Prio 5 --> My second user task

On startup, each user task creates its own UDP socket and performs a blocking select() call for readability like this:

result = select(fd + 1, &readfds, NULL, NULL, NULL);

When the select returns, the data will be read from the socket and displayed on the nios2-terminal.

When I ran my implementation, I saw the following printout:

w 3 4

w 3 5

w 3 4

w 3 5

and so on.

This means that the task with priority 3 (clock_tick) wakes up all tasks which are currently sleeping in a select(). This wakeup is needed to monitor the select timeout when the user passes a timeout to select().

And here is the debug output when sending data to the socket of the higher-priority network task (the task where everything is fine):

w 3 4

w 3 5

w 2 4

[net_task_1] Received data from the socket!

w 3 4

w 3 5

As you can see, the "inet_main" task of the network stack calls tcp_wakeup() when there is data on the socket! select() then returns to user space, the received data can be handled, and everything is fine!

Let's have a look at the other socket:

w 3 4

w 3 5

w 2 4

w 3 4

This means that the "inet_main" task wakes up the wrong task: prio 4 instead of prio 5. And as you can see, after that "net_task_2" (prio 5) is never woken up again!

So i decided to take a closer look at the tcp_wakeup() function:

 
void
tcp_wakeup(void * event)
{
   int   i;          /* task table index */
   INT8U error;

   /*
    * gain control of the global wakeup mutex
    */
   OSMutexPend(global_wakeup_Mutex, 0, &error);
   if (error != OS_NO_ERR)
   {
      dprintf("*** tcp_wakeup, OSMutexPend = %d\n", error);
      dtrap();
   }
#ifdef TK_CRON_DIAGS
   dprintf("+++ tcp_wakeup = %lx\n", event);
#endif
   /*
    * we are now in mutex
    * -----------------------------------
    */

   /*
    * Loop through task tables, try to find the cookie.
    */
   for (i = 0; i < OS_LOWEST_PRIO; i++)
   {
      struct wake_event *WEP;
      OS_TCB *tcb;
      if ((tcb = (OS_TCB *)OSTCBPrioTbl[i]) == (OS_TCB *)NULL)
         continue; /* unassigned priority */
      /* use extension */
      WEP = tcb->OSTCBExtPtr;
      if (WEP->soc_event == event)
      {
#ifdef TK_CRON_DIAGS
         dprintf("+++ tcp_wakeup OSSemPost = %lx\n", event);
#endif
         //TBD
         dprintf("w %u %d\n", OSTCBCur->OSTCBPrio, i);
         //END TBD
         /* we found the TCB with our cookie */
         error = OSSemPost(WEP->wake_sem);
         if (error != OS_NO_ERR)
         {
            dprintf("*** tcp_wakeup, OSSemPost = %d, %p\n", error, WEP->wake_sem);
            dtrap();
         }
         /* clear the cookie */
         WEP->soc_event = NULL;

         /*
          * give up mutex
          */
         error = OSMutexPost(global_wakeup_Mutex);
         if (error != OS_NO_ERR)
         {
            dprintf("*** tcp_wakeup, OSMutexPost = %d\n", error);
            dtrap();
         }

         return;   /* we woke it up ! */
      }
   }  /* for() */

   /*
    * we didn't find the cookie in the wake set.
    * Q it up.
    */
   insertWakeSetEntry(event);
   /*
    * give up mutex
    */
   error = OSMutexPost(global_wakeup_Mutex);
   if (error != OS_NO_ERR)
   {
      dprintf("*** tcp_sleep, OSMutexPost = %d\n", error);
      dtrap();
   }

   /*
    * we are now out of the mutex
    * -----------------------------------
    */

   return;
}

I noticed that the tcp_wakeup() function simply loops through all tasks (from high priority to low priority) and searches for a cookie which indicates that the task is currently waiting on a select(). As soon as this cookie is found, the corresponding task is woken up (its tcp_sleep() returns) and tcp_wakeup() itself returns immediately, without checking the remaining tasks.

So when two tasks are pending on a select() with the same cookie, only the task with the highest priority is woken, even though the data received on the socket isn't for that task.

I have modified tcp_wakeup() so that the search loop doesn't break at the first matching task. In other words, every task pending on a select() is now woken up when there is data on a socket! Note that this wakeup doesn't cause a spurious return from select() in user space; it's just an "inner select wakeup".
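To make the failure mode concrete, here is a host-side simulation of the original loop; it is not NicheStack code. The task table is an array indexed by priority and each entry carries the cookie the task sleeps on. `sim_waiter` and `sim_wakeup_first()` are illustrative names, and the sketch assumes (as the "w 2 4" trace above suggests) that both user tasks sleep on the same cookie while inside select().

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the per-task wake_event extension. */
struct sim_waiter {
    const void *soc_event;   /* cookie the task is sleeping on, or NULL */
    int woken;               /* set when the task's semaphore is "posted" */
};

/* Mimics the original loop: scan priorities from high (0) to low and wake
 * only the FIRST task whose cookie matches, then return.
 * Returns the priority that was woken, or -1 if no task matched. */
static int sim_wakeup_first(struct sim_waiter *table[], int nprio,
                            const void *event)
{
    assert(table != NULL);
    for (int prio = 0; prio < nprio; prio++) {
        struct sim_waiter *w = table[prio];
        if (w == NULL)
            continue;                 /* unassigned priority */
        if (w->soc_event == event) {
            w->woken = 1;             /* OSSemPost() in the real code */
            w->soc_event = NULL;      /* clear the cookie */
            return prio;              /* stops at the first match */
        }
    }
    return -1;                        /* real code queues the event here */
}
```

With tasks at priorities 4 and 5 sleeping on the same cookie, the call returns 4 and the priority 5 task keeps sleeping, which matches the "w 2 4" line in the trace.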

Now everything works fine....

Here is the modified code:


/*
 * tcp_wakeup(void * event) - wakeup TCB with this event,
 * else put in wake set.
 */
void
tcp_wakeup(void * event)
{
   //Declaration:
   int i;          //task table index
   INT8U error;    //Error flag (needed for semaphore access)
   int cnt = 0;    //Counter for woken-up tasks

   //Gain control of the global wakeup mutex
   OSMutexPend(global_wakeup_Mutex, 0, &error);
   if (error != OS_NO_ERR)
   {
      dprintf("*** tcp_wakeup, OSMutexPend = %d\n", error);
      dtrap();
   }
   //We are now in mutex
#ifdef TK_CRON_DIAGS
   dprintf("+++ tcp_wakeup = %lx\n", event);
#endif
   //Loop through task tables, try to find the cookie.
   for (i = 0; i < OS_LOWEST_PRIO; i++)
   {
      struct wake_event *WEP;
      OS_TCB *tcb;

      if ((tcb = (OS_TCB *)OSTCBPrioTbl[i]) == (OS_TCB *)NULL)
         continue; //unassigned priority

      //use extension
      WEP = tcb->OSTCBExtPtr;
      if (WEP->soc_event == event)
      {
#ifdef TK_CRON_DIAGS
         dprintf("+++ tcp_wakeup OSSemPost = %lx\n", event);
#endif
         //We found a TCB with our cookie: wake it, but keep searching
         error = OSSemPost(WEP->wake_sem);
         if (error != OS_NO_ERR)
         {
            dprintf("*** tcp_wakeup, OSSemPost = %d, %p\n", error, WEP->wake_sem);
            dtrap();
         }
         //Clear the cookie:
         WEP->soc_event = NULL;
         //Count this wakeup:
         cnt++;
      }
   }
   //Check for woken tasks:
   if (cnt != 0)
   {
      //Tasks have been woken, so give up mutex...
      error = OSMutexPost(global_wakeup_Mutex);
      if (error != OS_NO_ERR)
      {
         dprintf("*** tcp_wakeup, OSMutexPost = %d\n", error);
         dtrap();
      }
      //...and get out of here:
      return;
   }

   //No task was sleeping on this cookie: queue it in the wake set.
   insertWakeSetEntry(event);

   //Give up mutex
   error = OSMutexPost(global_wakeup_Mutex);
   if (error != OS_NO_ERR)
   {
      dprintf("*** tcp_sleep, OSMutexPost = %d\n", error);
      dtrap();
   }
   //We are now out of the mutex, so leave:
   return;
}

As you can see, the search loop no longer exits at the first task found pending on a select(). The variable "cnt" is only used to check whether any task has been woken up during the search loop. If not, the event is queued up.

I don't know if this queuing is important, but the original code does the same.
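The same host-side simulation, changed to follow the modified loop: every matching task is woken, and the event is queued only if nobody matched. `sim_wake_set` is a hypothetical stand-in for the stack's wake set behind insertWakeSetEntry(); it is a fixed-size array here, and this sketch simply drops events when it is full.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_WAKE_SET_SIZE 8

struct sim_waiter {
    const void *soc_event;   /* cookie the task is sleeping on, or NULL */
    int woken;               /* set when the task's semaphore is "posted" */
};

/* Hypothetical stand-in for the stack's wake set. */
struct sim_wake_set {
    const void *pending[SIM_WAKE_SET_SIZE];
    int count;
};

/* Mimics the modified tcp_wakeup(): wake EVERY task sleeping on 'event'.
 * If none matched, queue the event (insertWakeSetEntry() in the real code).
 * Returns the number of tasks woken (the "cnt" variable). */
static int sim_wakeup_all(struct sim_waiter *table[], int nprio,
                          const void *event, struct sim_wake_set *set)
{
    int cnt = 0;

    assert(table != NULL && set != NULL);
    for (int prio = 0; prio < nprio; prio++) {
        struct sim_waiter *w = table[prio];
        if (w == NULL || w->soc_event != event)
            continue;
        w->woken = 1;          /* OSSemPost() in the real code */
        w->soc_event = NULL;   /* clear the cookie */
        cnt++;
    }
    if (cnt == 0 && set->count < SIM_WAKE_SET_SIZE)
        set->pending[set->count++] = event;
    return cnt;
}
```

The cost of this design is extra context switches: a woken task whose own sockets have no data just re-checks them inside select() and goes back to sleep.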
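A plausible reason for the queuing, which I have not verified against tcp_sleep() in tk_crnos.c, is the classic lost-wakeup problem: if the stack posts an event before the task has actually gone to sleep on it, the event is kept in the wake set so that the next sleep on that cookie can return immediately instead of blocking forever. A generic host-side sketch of that pattern, with hypothetical names throughout:

```c
#include <assert.h>
#include <stddef.h>

#define SIM_WAKE_SET_SIZE 8

/* Hypothetical stand-ins, not NicheStack types. */
struct sim_waiter {
    const void *soc_event;   /* cookie this task sleeps on, or NULL */
};

struct sim_wake_set {
    const void *pending[SIM_WAKE_SET_SIZE];
    int count;
};

/* Wakeup side when nobody is sleeping yet: remember the event. */
static void sim_queue_event(struct sim_wake_set *set, const void *event)
{
    if (set->count < SIM_WAKE_SET_SIZE)
        set->pending[set->count++] = event;
    /* A full set silently drops the event in this sketch. */
}

/* Sleep side: if a wakeup for 'event' was already queued, consume it and
 * return 0 (do not block). Otherwise record the cookie and return 1,
 * meaning the caller would now pend on its semaphore. */
static int sim_sleep(struct sim_waiter *self, struct sim_wake_set *set,
                     const void *event)
{
    assert(self != NULL && set != NULL);
    for (int i = 0; i < set->count; i++) {
        if (set->pending[i] == event) {
            set->count--;
            set->pending[i] = set->pending[set->count];  /* remove entry */
            return 0;
        }
    }
    self->soc_event = event;
    return 1;
}
```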

ANY FEEDBACK OUT THERE ?

Altera_Forum · 11-11-2009

G'day BWEIBERG,

I'll need to have a closer look at your changes.

One change that I've made is to the following lines:

    
   if ((tcb = (OS_TCB *)OSTCBPrioTbl[i]) == (OS_TCB *)NULL) 
      continue; //unassigned priority

changed to:

    
      tcb = (OS_TCB *)OSTCBPrioTbl[i];
      if ( (tcb == (OS_TCB *)NULL) || (tcb == OS_TCB_RESERVED ) )
         continue;    /* unassigned priority */

Ucos flags a priority as reserved when mutexes are used, so just checking for NULL is not sufficient. You'll get an access violation or misaligned address exception if you don't make this change (or other weird behavior if you don't have the MPU/MMU enabled).
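The check can be pulled out into a small predicate. This is a sketch with minimal stand-in definitions so it compiles off-target; on the target, OS_TCB and OS_TCB_RESERVED come from ucos_ii.h, where, as far as I know, OS_TCB_RESERVED is defined as ((OS_TCB *)1). `tcb_is_usable()` is a hypothetical name, and the NULL-extension test is an extra defensive check of my own, not part of the fix above: a task without a wake_event extension cannot be sleeping on a cookie anyway.

```c
#include <assert.h>
#include <stddef.h>

/* Off-target stand-ins; the real definitions live in ucos_ii.h. */
typedef struct os_tcb { void *OSTCBExtPtr; } OS_TCB;
#define OS_TCB_RESERVED ((OS_TCB *)1)

/* Returns 1 if a priority-table entry points at a real TCB whose
 * extension pointer can safely be dereferenced, 0 otherwise. */
static int tcb_is_usable(const OS_TCB *tcb)
{
    if (tcb == NULL)              /* unassigned priority */
        return 0;
    if (tcb == OS_TCB_RESERVED)   /* reserved slot, e.g. a mutex's PIP */
        return 0;
    /* Extra defensive check (assumption): skip tasks without an extension. */
    return tcb->OSTCBExtPtr != NULL;
}
```

In the loop this would replace the two-line test with `if (!tcb_is_usable(tcb)) continue;`.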

I'm pretty sure that I filed a bug with Altera on this. But maybe not.