Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Low latency application for sparse load

sahasay
Beginner
Hi

I have a financial message processing application with two extreme cases. In one, messages arrive at a very high rate of ~10k messages per second; in the other, messages arrive at a very low rate (the system is idle from time to time).

After parallelizing the application and running it on a multi-core processor, we get a very good figure for the high-rate case (~200 microseconds). But in the other case, when the system has been idle for some time and then needs to process, say, a couple of messages, the latency approaches a millisecond. I understand this may be because the OS removes the threads from the CPU when they are idle for a while, and it then takes time to bring the threads back onto the CPU, reload the registers, etc.

Is there a way to configure the OS (SUSE 64-bit Linux) on Intel hardware to behave more like a real-time OS for this scenario?

Thank you in advance for your thoughts and suggestions.

Regards
Sayandeep
11 Replies
Roman_D_Intel
Employee
Quoting - sahasay

Hi Sayandeep,

How does your application receive the messages? Via the network (sockets)? How does your application wait for the messages? You need to find out where the latency is coming from.

You might try polling or busy-waiting methods to decrease the latency. On Linux, look at the "epoll" method for I/O. Using busy-waiting spin locks instead of pthread mutexes/semaphores helps decrease the waiting latency for intra-process event processing.

Best regards,
Roman
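As a sketch of the busy-polling idea (the function name and fd setup are illustrative, not from this thread): an epoll_wait timeout of 0 turns the wait into a non-blocking readiness check, so the caller spins in user space instead of sleeping in the kernel.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Busy-poll an epoll instance: a timeout of 0 means "check readiness and
 * return immediately", so this loop never sleeps in the kernel. */
ssize_t busy_poll_read(int epfd, char *buf, size_t len) {
    struct epoll_event ev;
    for (;;) {
        int n = epoll_wait(epfd, &ev, 1, 0); /* 0 ms timeout: non-blocking */
        if (n > 0)
            return read(ev.data.fd, buf, len);
        /* n == 0: nothing ready yet; spin again */
    }
}
```

Compared with a blocking epoll_wait(epfd, &ev, 1, -1), this burns a core while idle but avoids the wake-up path through the scheduler.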
sahasay
Beginner

Hi Roman

Thanks for your mail. In fact, the application receives data from a socket, so we have a thread that constantly polls ( select() ) for data.

You are right, I need to find out where the latency is coming from in this scenario. As you will remember, when the threads are busy all the time, the latency is very low.

So I was thinking that the delay comes from some constraint/feature of the OS/platform, which is not meant to behave like a real-time system.

Regards
Sayandeep
TimP
Honored Contributor III
Threaded runtimes frequently provide application-level controls for adjusting spin-wait defaults or changing polling mode (see Intel OpenMP KMP_BLOCKTIME and Intel MPI I_MPI_WAIT_MODE, I_MPI_WAIT_TIMEOUT, I_MPI_SPIN_COUNT). The availability of such controls often facilitates performance investigations.
On Windows in particular, it is said to be difficult to maintain thread affinity once the spin count is exhausted and the thread yields the processor.
On a "real-time system," presumably, you don't care about being friendly to competing processes or enabling power-saving modes.
People whose "performance" goal is simply to minimize clocks per instruction can get there quickly by setting high spin-wait counts.
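For the OpenMP case, the spin-wait control is just an environment setting; a minimal example (the binary name is a placeholder, not from this thread):

```shell
# Keep idle OpenMP threads spinning ("active" waiting) instead of sleeping.
export KMP_BLOCKTIME=infinite    # or a large value in milliseconds
export OMP_NUM_THREADS=8
./message_processor              # placeholder for your application
```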
jimdempseyatthecove
Honored Contributor III

Sahasay,

I think Roman is pointing you in the right direction. Although I might make a minor suggestion.

Assume you have 8 or more threads active in high-volume mode.
However, in low-volume mode you have 1 thread active (the socket polling thread, as I understand your description).

The problem lies (I assume) in the latency of waking up the next thread.

Roman is suggesting that you not put these threads to sleep: keep them in poll mode.

This is effective when you have a dedicated system (use _mm_pause in the poll loop to conserve some energy if that becomes a concern).

However, if you do not have a dedicated system, consider placing only one or a few of the remaining threads (experiment with the number) in this poll loop, and let the remainder wait on the messaging system.

The lowest latency might come from:

1) Socket polling thread

2) Thread polling a shared memory variable holding a message pointer received from the socket polling thread. Call this your hot-standby thread.

3) Optionally, one or more additional type 2) threads (tuning parameter)

4) Thread polling shared memory for a message (perhaps a single word) from the socket polling thread indicating the number of additional threads to wake up.

Thread 4) is important because, if the socket polling thread itself signals the other threads to wake up, it generally incurs all the overhead of the scheduler: the internal interlocks needed for the semaphores, plus the scheduler itself (which needs to run on some core, and it might as well be the one instigating the wake-up). Having a thread of type 4) keeps this overhead out of thread type 1).

This should be relatively easy to set up within your current application (maybe a few tens of well-placed lines of code).

Please report back your experience should you try out Roman's and/or my suggestions as this will help others assess the situation.

Jim Dempsey
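A minimal sketch of the hot-standby idea in 2) (the names and the single-slot layout are illustrative assumptions, not Jim's code): the standby thread spins on a shared message pointer with _mm_pause, so picking up a message never goes through the scheduler.

```c
#include <stdatomic.h>
#include <stddef.h>
#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()   /* hint: spin-wait loop, save some power */
#else
#define cpu_relax() ((void)0)
#endif

/* Shared slot written by the socket polling thread, read by the hot standby. */
static _Atomic(void *) msg_slot = NULL;

/* Socket polling thread: publish a message pointer. */
void publish_message(void *msg) {
    atomic_store_explicit(&msg_slot, msg, memory_order_release);
}

/* Hot-standby thread: spin until a message appears, then claim it. */
void *await_message(void) {
    for (;;) {
        /* read-only spin keeps the cache line shared until work arrives */
        while (atomic_load_explicit(&msg_slot, memory_order_acquire) == NULL)
            cpu_relax();
        void *msg = atomic_exchange_explicit(&msg_slot, NULL,
                                             memory_order_acquire);
        if (msg != NULL)
            return msg;
    }
}
```

With several hot-standby threads of type 3), the exchange ensures exactly one of them claims each message.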
sahasay
Beginner

Hello Gentlemen

I think your suggestions are bang on point. The worker threads are waiting on a condition variable to learn whether a message has been placed in the queue by the thread that receives messages from the socket.

Under low-load conditions, of course, the queue is empty and the worker threads are put to sleep. So if I keep one (or a configurable number) of the worker threads busy-waiting, the problem should be solved.

Regards
Sayandeep
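That hybrid could be sketched as follows (the queue type, field names, and spin budget are illustrative assumptions): a hot-standby worker spins for a bounded number of iterations before falling back to the condition variable, while the other workers sleep immediately.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical message queue guarded by a mutex + condition variable. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    int             count;      /* messages currently queued */
} msg_queue_t;

#define SPIN_ITERS 100000       /* tuning parameter: latency vs. CPU burn */

/* Wait until the queue is non-empty. A hot-standby worker spins first;
 * everyone else (and the standby, after its spin budget) sleeps. */
void wait_for_message(msg_queue_t *q, bool hot_standby) {
    if (hot_standby) {
        for (int i = 0; i < SPIN_ITERS; ++i)
            if (__atomic_load_n(&q->count, __ATOMIC_ACQUIRE) > 0)
                return;         /* got work without touching the scheduler */
    }
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->nonempty, &q->lock);  /* millisecond-class wake-up */
    pthread_mutex_unlock(&q->lock);
}
```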
jimdempseyatthecove
Honored Contributor III

Sayandeep,

You might want to consider permitting any of the worker threads to assume the roles of:

Sleeper
Hot Standby
Waker-upper/Hot Standby
Active Worker

The Waker-upper wakes new Hot Standby threads as needed (in advance of work becoming available) and then transitions to Hot Standby itself when no more Sleeper threads are available.

When the population of Hot Standby threads rises above a set point, one of them assumes the role of Waker-upper and another goes to sleep; sleepers are then added back as threads finish their work.

This will reduce latency, but this also will have some cost in terms of additional load on the system.

Jim Dempsey
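The role rotation reads like a small state machine; a sketch (the enum names and the set-point rule are my reading of the description, not Jim's code):

```c
/* Hypothetical worker roles, after the rotation scheme described above. */
typedef enum { ROLE_SLEEPER, ROLE_HOT_STANDBY, ROLE_WAKER_UPPER, ROLE_ACTIVE } role_t;

role_t next_role(role_t current, int hot_standby_count, int sleeper_count,
                 int hot_standby_setpoint) {
    switch (current) {
    case ROLE_WAKER_UPPER:
        /* Waker-upper promotes sleepers in advance of work, then becomes
         * a hot standby itself once no sleepers remain. */
        return sleeper_count == 0 ? ROLE_HOT_STANDBY : ROLE_WAKER_UPPER;
    case ROLE_HOT_STANDBY:
        /* When the hot-standby pool exceeds the set point, one standby
         * takes over as waker-upper. */
        return hot_standby_count > hot_standby_setpoint ? ROLE_WAKER_UPPER
                                                        : ROLE_HOT_STANDBY;
    default:
        return current;
    }
}
```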

jimdempseyatthecove
Honored Contributor III

I forgot to mention,

Should your application execute a high proportion of floating-point instructions, and if your system has HT (Hyper-Threading), you might want to consider using the second thread of each HT pair for the I/O and Waker-upper roles (which should be predominantly integer-bound).

Jim Dempsey
sahasay
Beginner

Jim -- thanks again for the clarification...!

I am just thinking: since we have evolved so much, and so quickly, beyond the days of the classic software design patterns, is anyone taking note of the suggestions/discussions going on here? I am sure people will face the same issues again, and it would be great to have these available as a reference (low-latency design patterns, if I may say so).

Any thoughts?

Regards
Sayandeep
jimdempseyatthecove
Honored Contributor III

Sayandeep,

I am taking note, or should I say re-iterating notes taken during the development of QuickThread, a threading toolkit of my own design (see www.quickthreadprogramming.com for information). As you have experienced, and as you can see from our discussions, latency vs. load vs. undue system burden is somewhat of a vicious circle, and you are unlikely to get everyone satisfied (programmer, app user, sys admin).

Jim Dempsey
sahasay
Beginner

Jim -- it was wonderful to learn about your QT toolkit, and I should say it is very timely as well. I am in the early stages of porting some Solaris applications to Linux/Intel, and I am doing an initial analysis of Intel TBB to see if it can benefit our application.

How difficult would it be to port QT to Linux?

Regards
Sayandeep
jimdempseyatthecove
Honored Contributor III

QT (QuickThread) port to Linux is planned. We are in the process of launching the Windows version now.

I do not anticipate major problems in the port to Linux, but I have to stress "anticipate": as I have experienced many times in the past, trivial changes may introduce major impact. If you read the documentation on www.quickthreadprogramming.com you will note a description of a qtControl object. Inside the qtControl object is an Affinities member variable structure. On Windows this structure holds two 32-bit or 64-bit affinity bit masks; on Linux these masks are larger, so the Affinities structure will have to change. There are also changes to consider for Windows 7, as its full-capacity affinity selection differs from the 32/64-bit affinity mask as currently implemented.

The API in QuickThread treats the Affinities member variable structure in an opaque manner. So from the API point of view there will be no impact (unless you rely on manipulating bit masks yourself).

Internally, small sections of code will have to change (changing thread affinity, spawning threads, critical sections, WaitForSingleObject, etc...). These have well known counterparts between Windows and Linux. So I do not imagine significant effort in this area.

Surprisingly, the largest stumbling block is getting used to the different development environment.

Funding the port will be a different issue as we will rely on Windows sales to underwrite the port. It would be easy enough for interested Linux sites to experiment with QT on a Windows machine. If they like what they see, they could consider supporting the port development costs.

Jim Dempsey
Reply