Re: how long does it take for thread to start; thread pooling

Daniel_B_Intel2 · ‎05-19-2003

hi,

Recently I attended a lecture on performance tips for programming in the MS .NET environment.

One of the interesting things I learned was that it takes nearly 1 sec (!!!!) from the command that creates a thread until the thread starts its work.

1) Is it really true? Has anyone measured these figures?
2) Is it true only for the MS .NET environment, or for almost all operating systems?

As a solution to this problem the lecturer suggested using a "thread pool", but he did not give us an example.

Can someone share his/her thoughts/code examples for thread pool use?

Thank you
Daniel

netdevil · ‎05-19-2003

The startup for threads can be cumbersome sometimes. The basic idea behind a "thread pool" is that you create an optimized number of worker threads (based off of your logical processor count) when the process starts, and put them in wait states. Then, you have some sort of queue of functions that need to be performed... as the queue gets requests it picks an available thread and signals it to wake up and perform the requested function. In this fashion, threads aren't created and destroyed constantly, but rather reused in a controlled environment.

This is frequently a good idea but as with everything it really can depend on the sort of application you are doing :)

hth,
-Dale

bronx · ‎05-19-2003

using the very simple "Command" framework outlined here

see 14th reply

allows a simple and effective way to implement thread pools: a FIFO queue of ICommand is shared by all threads in the pool. Whenever a thread is done with its previous work, it extracts the next command from the queue and executes it.
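
As a rough illustration of the shared FIFO idea, here is a minimal sketch in portable C++11. The ICommand interface below is a hypothetical stand-in (the framework from the linked reply is not reproduced here), and LogCommand, CommandQueue and drain are names invented for the example. It only shows the FIFO ordering; in a real pool every access to the queue would be guarded by a lock.

```cpp
#include <cassert>
#include <memory>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical command interface (an assumption, not the linked framework).
struct ICommand {
    virtual ~ICommand() = default;
    virtual void execute() = 0;
};

// Example command that records its label when it runs.
class LogCommand : public ICommand {
public:
    LogCommand(std::string label, std::vector<std::string>& log)
        : label_(std::move(label)), log_(log) {}
    void execute() override { log_.push_back(label_); }
private:
    std::string label_;
    std::vector<std::string>& log_;
};

// FIFO of pending commands, shared by all workers in a pool.
using CommandQueue = std::queue<std::unique_ptr<ICommand>>;

// Pops and runs commands in arrival order; returns how many were executed.
inline int drain(CommandQueue& q) {
    int executed = 0;
    while (!q.empty()) {
        std::unique_ptr<ICommand> cmd = std::move(q.front());
        q.pop();
        assert(cmd != nullptr);   // the queue should never hold empty slots
        cmd->execute();
        ++executed;
    }
    return executed;
}
```

Each worker thread would run a loop much like drain, except that it would block instead of returning when the queue is empty.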

ClayB · ‎05-19-2003

Daniel -

Chapter 10 of "Win32 Multithreaded Programming" by Cohen & Woodring details the design of a thread pool class. I've also been told that there is a native thread pool class in the Win32 API. I've not seen or used the latter, but people assure me it is there. Has anyone used the Win32 Thread Pool?

Of course, you still need to create the threads (and this will take the necessary time), but since you shouldn't need to create threads after this, you only pay the penalty once. Any processing that doesn't use threads at all times is a candidate for thread pools, though you will need to specify how tasks are assigned to threads.

A good example for thread pool use would be a transaction processing system. Each time a new transaction arrives, a new thread could be created, which is later destroyed after completion of the transaction. A better implementation would be to queue up the transactions as they arrive and have threads pull these requests off the queue for processing. Threads are created only once and go into a wait state if the queue is empty. You will also need some signalling mechanism to "wake up" a thread if a transaction arrives at an empty queue (and it is assumed there may be threads waiting).
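
The queue-plus-wakeup scheme described above could be sketched roughly as follows. This is a minimal example in portable C++11 (std::thread and std::condition_variable standing in for Win32 threads and events), not the design from the book. The class name ThreadPool and its submit method are assumptions made for the illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Minimal fixed-size pool: workers are created once and sleep on a
// condition variable while the task queue is empty.
class ThreadPool {
public:
    explicit ThreadPool(unsigned workerCount) {
        assert(workerCount > 0);
        for (unsigned i = 0; i < workerCount; ++i)
            workers_.emplace_back([this] { workerLoop(); });
    }

    // Lets already queued tasks finish, then joins every worker.
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (std::thread& t : workers_) t.join();
    }

    ThreadPool(const ThreadPool&) = delete;
    ThreadPool& operator=(const ThreadPool&) = delete;

    // Queues a task and wakes one sleeping worker.
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        wake_.notify_one();
    }

private:
    void workerLoop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;   // stopping and nothing left to do
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();   // run outside the lock so other workers can dequeue
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};
```

A transaction system would call pool.submit(...) once per arriving transaction. The thread creation cost is paid only in the constructor.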

-- clay

Daniel_B_Intel2 · ‎05-21-2003

thank you all for your very useful inputs!!!
-Daniel

Intel_C_Intel · ‎05-23-2003

Hi all,
Native Thread Pools in Win32 are called "Completion Ports".

ClayB · ‎05-23-2003

Actually, Win32 completion ports are one of the means to perform overlapped I/O. By their structure and usage model, it is easy to create a pool of threads that wait on getting packets from the completion port before starting up to perform some task on the received data. So, it does allow you to create a specialized thread pool.

We can take this idea, though, and create a simple but robust thread pool model. Define an Event (like MoreWorkAvailable) that is signaled any time a new task is ready for threads to pick up and start executing. Start a pool of threads that run in an infinite loop waiting for the Event to be signaled. Once a thread receives the signal, it goes to some predetermined location to pick up something that defines the assigned task. With completion ports, the "task" is an I/O packet that must be processed in some predetermined fashion. With the more general thread pool, the tasks can be just about anything. The programmer must define and implement the details of where and what to store so that threads can "pick up" the next task. There are also several other questions to answer, such as what to do when more than one task is available at a time.
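
One possible answer to the "more than one task at a time" question is to count the signals rather than keep a single signaled/not-signaled flag. Win32 has semaphore objects for this purpose. The sketch below is a portable C++11 stand-in built from a mutex and a condition variable, with WorkSignal, post, wait and tryWait being names assumed for the example.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Counting "more work available" signal. Unlike a single auto-reset event,
// it remembers how many tasks were announced, so none is lost when several
// arrive before any worker wakes up.
class WorkSignal {
public:
    // Called by the producer once per queued task.
    void post() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ++pending_;
        }
        ready_.notify_one();
    }

    // Called by a worker: blocks until a task has been announced, then claims it.
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return pending_ > 0; });
        assert(pending_ > 0);
        --pending_;
    }

    // Non-blocking variant: claims a task only if one is pending.
    bool tryWait() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (pending_ == 0) return false;
        --pending_;
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    unsigned pending_ = 0;
};
```

Three calls to post() followed by three calls to wait() all return without blocking, which is the behaviour a plain event would not guarantee.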

-- clay

o_neuendorf · ‎05-24-2003

The answer is, as always, "it depends".
(work load, language used, libraries used etc.)

As a rule of thumb:

plain .NET thread:
Milliseconds

.NET Thread Pool:
If it's the very first thread your application creates, it may take 1 sec or a bit longer. If it's not the first one, but there are no free threads in your pool at the moment, it may again take a few seconds. If the thread would exceed the pool limit, it has to wait until another thread is freed. In all other circumstances it is pretty fast (milliseconds) and usually faster than creating a new plain thread.

The timing is similar (a little bit faster) for the Win32 API (which is of course used by the .NET Framework).

You can easily measure the values with the Intel VTune Profiler.

Who gave your lecture?

Regards,
Olaf

o_neuendorf · ‎05-24-2003

> Hi all,
> Native Thread Pools in Win32 are called "Completion
> Ports".

No, the Windows Thread Pool is not called Completion Port. But you're right, there is a strong relationship between the Completion Port and the Win Thread Pool. Before Windows 2000 the completion port was the only way to use some kind of a thread pool via the Win32 API.

Regards,
Olaf

Intel_C_Intel · ‎05-24-2003

Hi Olaf,
Can you please explain how you measured the thread creation time using Intel VTune? I have never heard about this.

o_neuendorf · ‎05-27-2003

> Hi Olaf,
> Can you, please, explain me, how did you measure the
> thread creation time using Intel VTune ? I never
> heard about this.
>

Hi kdmitry,

I guess you caught me! I shouldn't have written that it's easy to measure these values with VTune. It's obvious that you don't use VTune to measure "real" times (not to mention that there are different meanings for the term "real time").
To measure the timing you can use the Windows High Resolution Performance Counter (QueryPerformanceCounter). But that's only half of the story. To get an impression of what's going on behind the scenes, it's a good idea to use VTune. For example, to see how many cycles _beginthread or CreateThread need or to see the additional overhead of your C Library initialization. If there is an unexpected delay for the thread creation it might be due to a call to DllMain with some strange things happening in there (I once got that problem). A profiler can be a great help.
I guess, that's what I wanted to say, but I clearly missed the point.

Thanks for your excellent question,
Olaf

bronx · ‎05-27-2003

> To measure the timing you can use the Windows High
> Resolution Performance Counter
> (QueryPerformanceCounter).

I'd suggest using the RDTSC instruction instead for higher accuracy. Read the time stamp a first time just before creating the thread and a second time when entering the thread body. If you call Sleep(0) right after the thread creation, the measurement should be quite meaningful, provided that no other threads are significantly active.
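
The two-reading method can be sketched like this in portable C++11. Note the substitutions, which are assumptions made so the example runs anywhere: std::chrono::steady_clock replaces RDTSC, std::thread replaces the Win32 creation call, and std::this_thread::yield() plays the role of Sleep(0). The function name measureThreadStartLatency is invented for the example.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Returns the delay between the moment just before the thread is created
// and the first statement executed inside the thread body.
inline std::chrono::nanoseconds measureThreadStartLatency() {
    using Clock = std::chrono::steady_clock;
    Clock::time_point entered;   // written by the new thread

    const Clock::time_point before = Clock::now();               // first reading
    std::thread worker([&entered] { entered = Clock::now(); });  // second reading
    std::this_thread::yield();   // give up the creator's slice, like Sleep(0)
    worker.join();               // after join(), 'entered' is safe to read here

    assert(entered >= before);
    return std::chrono::duration_cast<std::chrono::nanoseconds>(entered - before);
}
```

A single run is noisy, so it is probably more useful to call this many times and look at the minimum or the median.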

o_neuendorf · ‎05-27-2003

> > To measure the timing you can use the Windows High
> > Resolution Performance Counter
> > (QueryPerformanceCounter).
>
> I'll suggest to use the RDTSC instruction instead for
> higher accuracy. Read the time stamp 1st just before
> to create the thread and a 2nd time when entering the
> thread body. If you call Sleep(0) right after the
> thread creation the measurement should be quite
> meaningful provided that no other threads are
> significantly active

Yes, of course you're right. Using RDTSC is the most accurate way to measure time intervals (although the performance counter is sufficient and easy to use). But I wanted to stress a different point - not very successfully, as it seems :-)
If you measure the interval between the call to _beginthread etc. and the very beginning of your start function executed on the newly created thread, the result might be quite long. But the reason is not always the thread creation itself. Between thread creation and the call of the start function, some other things might happen, like a library initialization. But that has, strictly speaking, nothing to do with thread creation. Therefore the use of a profiler can be very helpful. Otherwise the interpretation of the results might be difficult. In the original post a value was mentioned which seems to be way too high. But there might be some circumstances that cause such a delay. To figure that out, I recommend using a profiler.

Olaf

bronx · ‎05-27-2003

> But I wanted to stress a different point - not very
> successfull, as it seems :-)

I saw your main point and mostly agree with it, hence my lack of comment... I was just trying to point the original poster to a better methodology. IIRC the perf counter granularity is no better than a few microseconds: that's around 1000x coarser than the TSC. The overhead of the query call is huge too, compared to inline ASM with RDTSC.

> If you measure the interval between the call to
> _beginthread etc. and the very beginning of your
> start function executed on the newly created thread,
> the result might be quite long. But the reason is not

sure, the main reason IMO is that the new thread doesn't start executing immediately: if the scheduler waits for the remainder of the parent thread's slice (on a single-CPU system) before switching to the new one, it can take more than one *millisecond*. That's why I advise using Sleep(0) to immediately give up the creator's slice.

> always due to the thread creation itself. Between
> thread creation an calling of the start function,
> some other things might happen, like a library
> initialization. But that has, strictly spoken,

sure, to see the effect a complete test would compare a low-level "CreateThread()" with a full-fledged call like "_beginthreadex()"

> the original post a value was mentioned which seems
> to be way to high. But there might be some

probably due to the missing Sleep(0) I guess

netdevil · ‎05-27-2003

Bronx,

I've been using CPUID to serialize before my RDTSC calls. Does Sleep(0) do the same thing? Or are you just using Sleep(0) to help insure against a timeslice happening between your measurements?

-Dale

-- Quoted Message --
I'll suggest to use the RDTSC instruction instead for higher accuracy. Read the time stamp 1st just before to create the thread and a 2nd time when entering the thread body. If you call Sleep(0) right after the thread creation the measurement should be quite meaningful provided that no other threads are significantly active

bronx · ‎05-27-2003

> > the original post a value was mentioned which
> seems
> > to be way to high. But there might be some
>

seen the original post again, he says "nearly 1 second", it's way too high indeed & my explanation isn't sufficient to explain such a monstrous delay. It can be true only for some applications that make heavy use of per thread DLL initialization.

bronx · ‎05-27-2003

> just using Sleep(0) to help insure against a
> timeslice happening between your measurements?

see my msg above ;-)

ClayB · ‎05-27-2003

> IIRC the
> perf counter granularity is not better than a few
> microseconds: that's around 1000 x more than the TSC.
> The overhead due to the query call is huge too, as
> compared to inline ASM with RDTSC
>

Can someone post some code (bronx?) that illustrates this use for others to see? Just if you have something handy, that is. I'm sure once we have picked up the right manuals we can figure out how to do this, but I find it more helpful to see code that I can adapt to my own purposes and look things up if I have problems or other needs.

-- clay

netdevil · ‎05-27-2003

I usually use something like:

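//////////////
// 32-bit MSVC only: inline assembly is not available in x64 builds.
// RDTSC leaves the 64-bit counter in EDX:EAX, which is also where
// a 32-bit build expects an __int64 return value.
inline __int64 __declspec(naked) GetTimeStamp()
{
    __asm
    {
        rdtsc
        ret 0
    }
}
//////////////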

and I have a second version that calls cpuid first, to serialize the command.
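
For compilers without inline assembly, a similar reading can probably be obtained through the __rdtsc() compiler intrinsic. The sketch below is an assumption-laden variant, not the function above: readTimeStamp, ticksFor and the HAVE_RDTSC macro are invented names, no serializing instruction is issued, and on non-x86 targets it falls back to std::chrono::steady_clock ticks.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
  #include <intrin.h>
  #define HAVE_RDTSC 1
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__i386__) || defined(__x86_64__))
  #include <x86intrin.h>
  #define HAVE_RDTSC 1
#else
  #define HAVE_RDTSC 0
#endif

// Reads a raw tick count: the TSC on x86, otherwise steady_clock ticks.
inline std::uint64_t readTimeStamp() {
#if HAVE_RDTSC
    return __rdtsc();   // compiler intrinsic, no hand-written asm needed
#else
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Ticks spent in 'work' (any callable), measured on the calling thread.
template <typename Work>
std::uint64_t ticksFor(Work work) {
    const std::uint64_t start = readTimeStamp();
    work();
    const std::uint64_t stop = readTimeStamp();
    // Flags readings that went backwards, e.g. taken on cores whose
    // counters are not synchronized.
    assert(stop >= start);
    return stop - start;
}
```

The result is in raw ticks, so it is only comparable between runs on the same machine.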

-Dale

bronx · ‎05-27-2003

here is a dump of a handy class for accurate timings :

(the "//#expxxx" pseudo comments are used for automatic generation of headers, one should figure out easily how to factor out the decl. from the spec.)


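//#hgen

// StopWatch.cpp
// E.B.  22/08/02

//#export
#include <iostream>   // header name was lost in the post; ostream is used below
using namespace std;  // assumed, since the code uses unqualified ostream and cout
//#endexp

//#expcla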
class StopWatch
{
  __int64 totalTime;
  __int64 startTime;

  char name[128];

  __int64 getCurrentTime ()
    {  
      __int64 t = 0;
      _asm
      {
        RDTSC
        mov DWORD PTR t, eax
        mov DWORD PTR t+4, edx
      }
      return t;
    }

public:
  StopWatch (const char vName[]);
  const char *itsName () const {return name;}
  void clear () {totalTime = 0;}
  void start () {startTime = getCurrentTime();}
  void pause () {totalTime += getCurrentTime()-startTime;}
  __int64 getTotalTime () const;
  friend ostream & operator << (ostream &out, StopWatch &watch);
};



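#include <string.h>   // strncpy, strlen (header names were lost in the post)
#include <stdio.h>    // sprintf


StopWatch::StopWatch (const char vName[]) : totalTime(0), startTime(0)
{
  // bounded copy so an over-long name cannot overflow the 128-byte buffer
  strncpy(name,vName,sizeof(name)-1);
  name[sizeof(name)-1] = 0;
}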

__int64 StopWatch::getTotalTime () const
{
  return totalTime;
}

ostream & operator << (ostream &out, StopWatch &watch)
{
  const int MaxChar = 16;
  char clocksStr[100]; sprintf(clocksStr,"%I64d",watch.totalTime);
  char formatedStr[100];
  
  formatedStr[MaxChar] = 0;
  long i=strlen(clocksStr)-1,j=MaxChar-1,d=0; 
  while (i>=0)
  {
    formatedStr[j--] = clocksStr[i--]; d++;
    if (j<0) break;
    if (d%3 == 0) formatedStr[j--] = ' ';
    if (j<0) break;
  }
  while (j>=0)
    formatedStr[j--] = ' ';
    
  out << formatedStr << " clocks [" << watch.name << "]\n";
  return out;
}


// End of StopWatch.cpp
//

typical client code :



//
void Test1 ()
{
  StopWatch watch("local stopwatch");
  watch.start();
  // stuff ...
  watch.pause();
  cout << watch;
}

//
static StopWatch SW1("global stopwatch");

static void BeenThere ()
{
  SW1.start();
}

static void DoneThat ()
{
  SW1.pause();
}

void Test2 ()
{
  SW1.clear();
  BeenThere();
// ...
  DoneThat();
  cout << SW1;
}

Aaron_C_Intel · ‎06-10-2003

The implementation of your watch class seems to be correct. It's pretty much what I use.

However, when using this performance monitor with RDTSC there is an issue in an SMP operating system with either logical processors (HyperThreading) or multiple physical processors.

Suppose your application has multiple threads doing work, roughly as in the following pseudo code:

Main()
{
    begintime = watch.start()

    Launch threads(1...n) to process data(1...n)
    Wait for threads(1...n)

    endtime = watch.end()
}

The issue here is that in an SMP OS, there is no guarantee that the main thread is still running on the same processor when it takes the second reading.

So suppose the main thread runs on CPU0 when we call watch.start, then it launches some threads, and then it goes to sleep waiting for the work of the other threads to be completed. When it wakes up it happens to run on CPU1. Now the problem is that the first RDTSC was done on CPU0 and the second RDTSC is done on CPU1. The time stamp counters of the processors are not synchronized, and therefore the measured time will be incorrect.

There are a few solutions:
1) Don't measure the timing :)
2) Given option 1 is bad, you can use QueryPerformanceCounter which is supposed to be valid in a multi-processor system, regardless of which processor you call from. However, you lose some resolution.
3) You can use SetThreadAffinityMask to make the main thread run on one specific processor. The one problem I can think of with this is that the main thread could have to wait for an extra context switch.
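
In the spirit of option 2, the pseudo code could be timed with an OS-level monotonic clock instead of the per-processor counter. The sketch below uses std::chrono::steady_clock from portable C++11 as a stand-in for QueryPerformanceCounter (on Windows the standard library typically builds it on that call, but that is an assumption about the toolchain). The function name timeParallelRegion is invented for the example.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>
#include <vector>

// Runs work(i) on threadCount threads and returns the wall-clock time the
// main thread observed between launching the first and joining the last.
inline std::chrono::microseconds timeParallelRegion(
        unsigned threadCount, const std::function<void(unsigned)>& work) {
    assert(threadCount > 0);
    using Clock = std::chrono::steady_clock;   // monotonic, not tied to one CPU

    const Clock::time_point begin = Clock::now();

    std::vector<std::thread> threads;
    threads.reserve(threadCount);
    for (unsigned i = 0; i < threadCount; ++i)
        threads.emplace_back(work, i);         // launch threads(1...n)
    for (std::thread& t : threads)
        t.join();                              // wait for threads(1...n)

    const Clock::time_point end = Clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - begin);
}
```

Because both readings come from the same system clock, it does not matter which processor the main thread wakes up on. The price is the coarser resolution mentioned in option 2.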

Hopefully this makes sense.

Aaron