Re: Rationale for I/O completion port

daradder · ‎07-18-2003

I'm just starting to program using the Microsoft I/O completion port scheme. As I understand it, the design principle is to choke the number of runnable threads per processor down to one, thus avoiding the performance losses caused by having numerous runnable threads.

Excuse my ignorance, but I would appreciate help with the following three questions:

1. Does the multithreaded performance hit that is the rationale for the I/O completion port arise mainly from heap contention, kernel contention or some other factor (I am interested in this because I peddle the LeapHeap lockfree allocator)?

2. What thread becomes scheduled on a processor when its single, runnable thread page-faults?

3. Does using an I/O completion port design on my uniprocessor server turn the multiprogramming Windows NT operating system into something every bit as powerful as ..erm.. DOS?

Dar

ClayB · ‎07-22-2003

Dar -

> I'm just starting to program using the Microsoft I/O
> completion port scheme. As I understand it, the
> design principle is to choke the number of runnable
> threads per processor down to one, thus avoiding the
> performance losses caused by having numerous runnable
> threads.

I'm no expert on Win32 programming and have never used I/O completion ports, so I can only quote from the literature (propaganda?) that I've read on this subject.

I've always thought I/O completion ports were installed to support a large set of overlapped handles with a relatively small number of threads within a pool. When an I/O operation completes on one of the handles associated with the port, a thread is awakened and given the data and results of the I/O operation for processing. Upon finishing the processing, the thread can again wait on the port.

Two advantages are achieved: 1) threads that are processing data from the port are actually working on something (not spinning idle and soaking up cycles), and 2) very few threads can be used to take care of many more named pipe handles. This second point may give a hint to your first question. That is, the performance hit overcome is less memory space usage by threads. Having one thread per pipe or socket will increase the stack space needed to support all those threads. However, if one tenth the number of threads can keep up with the traffic on the pipes, you've saved quite a bit of memory usage of threads that sit idle most of the time.

As for the other questions, I will have to leave that to someone more well-versed in using I/O completion ports than I.

-- clay

daradder · ‎07-26-2003

Thanks for the response Clay. I will look at the memory cost of multiple threads. In the meantime:

>...I can only quote from the literature (propaganda?) that I've read on the subject.

Indeed, which is it? I would like to know what the I/O Completion Port scheme boils down to; where it fits on the evolutionary ladder.

I find it difficult to assess the relative importance of the factors involved and get a clear perspective. Here is my summary, which doesn't actually get as far as IOCP:

Why do we want a multiprogramming, virtual memory Operating System like Windows NT or Linux in the first place?

Assume the advantages of virtual memory are obvious, and ignore multiprocessor aspects. Concentrate on multithreading as a cohesive kind of multiprogramming. I would say the big three advantages are:

(1) When a thread initiates I/O, another thread can be scheduled to run on the processor, and that prevents the CPU being wasted while the first thread waits for the I/O to complete.

(2) When a thread incurs a page fault, another thread can be scheduled to run on the processor, and that prevents the CPU being wasted while the first thread waits for the memory page to be loaded from hard drive into RAM.

(3) One thread can be dedicated to each job (such as servicing a HTTP request). Programming the thread function is then simple, because:
(a) The function needs to manage the state of only one job rather than multiple concurrent jobs.
(b) If a thread must wait for I/O completion or user response, it doesn't have to coordinate with other threads, it just blocks.

Have I missed something?

Presumably you could write a DOS application that fields interrupts, backs RAM onto disc, handles multiple outstanding I/O requests, and thus achieves the performance benefits of (1) and (2) above. But then, of course, you would be starting to write your own virtual memory multiprogramming system. The beauty of Windows NT is that it does it all for you behind the scenes.

Having achieved this, what exciting new advances do we make when we learn to program the I/O Completion Port?

Dar

davids · ‎07-27-2003

The main thing I/O completion ports do is allow you to keep the number of threads that are ready-to-run very close to the number of CPUs. This minimizes context switch penalties and makes caches more effective.

I/O completion ports combine an I/O completion discovery mechanism with a thread dispatch mechanism. So if you're handling 5,000 network connections, you can dispatch each network event to a thread without having a thundering herd if 1,000 connections are ready at once.

You can configure the concurrency on an I/O completion port, it defaults to the number of CPUs. If a thread pulls a completion report from the report, it is 'charged' to the port. If the thread subsequently blocks for any reason, the port is 'credited' for the thread and another thread is released if a completion report is ready for it. (When the first thread resumes, you will temporarily have one too many threads running, but that's unavoidable.)

You could certainly discover that there are 2 CPUs and only dispatch two jobs at a time yourself. The problem would be that if one of those jobs blocked, you wouldn't know to dispatch a third job. The IOCP does.

It's a very clever mechanism and I wish UNIX had something comparable.

daradder · ‎07-30-2003

Obviously a fan. Nevertheless there are aspects of I/O Completion Port that worry me; some tradeoff perhaps not being recognised. IOCP employs two 'gates' for throttling active thread numbers down. I see these as independent, in that either technique could be used on its own. The one I want to get out of the way first receives more press, possibly because it is governed by a tunable parameter in Internet Information Server, but I judge it less important (and from your post you obviously think it is the other technique that characterises IOCP).

The case for thread pooling arises in a multiprogramming (in this case multithreading) system when handling stateless protocols like HTTP1.0. A server may be asked to cope with thousands of concurrent threads, and dedicating a thread to each is impractical. The simplest variant of thread pooling would have the server application start some fixed number of threads (say 64) initially, and maintain that same number throughout its life. This makes a difference to the programming of the server.

With simple multithreading, there is a main thread (it's a popular design at any rate) that intercepts client requests. For each such request the main thread starts a subsidiary thread dedicated to processing the request. The thread function is, in outline:
initialise;
parse request;
perform request;
transmit result to client;
terminate;
Under thread pooling, the main thread puts client requests into a queue. The thread function becomes:
repeat
reinitialise;
wait until queue is non-empty;
dequeue a client request;
parse request;
perform request;
transmit result to client;
forever

So for a thread pool, the thread function is slightly more complex. It still uses local variables and memory allocations from the heap in an entirely conventional way to assist it in processing the client request.

It should be clear that, while synchronising threads is never trivial, there is no particular difficulty in crafting a thread-pooling server, and you do not need IOCP to cover this aspect for you.

Whichever way thread pooling is achieved, client requests in excess of the fixed number of threads are not being serviced while they remain queued. Imagine two HTTP servers with the same throughput, one using the simple thread-per-request method, the other, the thread pool. The difference in behaviour is between a client browser getting its HTML file a bit at a time over the course of 15 seconds (say) and nothing happening for 14 seconds then the whole file arriving in one burst. Whether this reduction in responsiveness is a drawback depends on what the application is for.

The primary reason that the thread pool is an attractive technique for servers running one-shot protocols is the performance impact of starting and destroying a thread for each request. The cost of thread creation is likely to be disproportionate whatever the operating system. (BTW I have no practical experience of this subject; all my knowledge is derived from Jeffrey Richter's classic "Advanced Windows"; I have the third edition). Restricting the number of threads also reduces memory loading. How much memory each thread takes will of course vary hugely according to the nature of the application. Under Windows NT, only RAM consumption (as opposed to address range or pagefile memory) is likely to be of significance, and the appropriate metric is the number of 4 KB (assuming Pentium architecture) memory blocks that have to be paged in when a context switch occurs on a heavily loaded machine. I would guess that for a bloat-free program, one page of stack, five pages of heap, one page of Thread Environment Block and one page of some kernel data structure (for all I know) might be typical, for 32 KB RAM per thread.

I'll leave you with a quote from Richter's book. It's a persuasive argument for IOCP (the other aspect, the one David described). But what I've done with the quote is remove references to multiprocessor machines.

[modified quote]
To make Windows NT an awesome server environment, Microsoft needed to address this problem. The result is a new kernel object called the I/O completion port, which was first introduced in Windows NT 3.5. The theory behind the I/O completion port is that the number of threads running concurrently must have an upper bound; that is, 500 simultaneous client requests cannot allow 500 runnable threads to exist. But what is the proper number of concurrent runnable threads? Well, if you think about this question for a moment, you'll come to the realization that it really doesn't make sense to have more than one runnable thread. As soon as you have more runnable threads than one, the system has to spend time performing thread context switches, which
wastes precious CPU cycles.
[end of modified quote]

Does that sound odd to anyone else ?

Dar

joe-laughlin7 · ‎07-30-2003

There is more at here than just context switches. IOCP allows the NIC to DMA data directly into/from the buffers the application has posted thus saving one or more copy operations.

Data sent/received notifications for connections are intermixed and a single thread can handle requests for multiple clients. Thus data is sent/recieved based on the time required by the server to handle each incomming/outgoing buffer. This allows a smooth flow of data for each connection which is not possible in the thread per connection model.

The benefit of IOCP over a generic thread pool is the DMA to from user posted buffers and the ability to schedule a new thread when one of the IOCP worker threads block.

Also having fewer threads running preserves the processor caches which is critical to scalability.

daradder · ‎08-02-2003

That DMA aspect of I/O completion port is new to me. I don't have the hardware and kernel background to fully understand it.

I made a mistake in my last post. Thread pooling isn't part of IOCP; its a standard technique for applications designed around an IOCP. Also, a thread pool used with IOCP does not reduce responsiveness, unlike a thread pool used to cap thread numbers for thread-per-connection.

In fact I'm going to stop using the term 'thread pool' and describe IOCP as involving a 'primary' thread and a 'backup' thread; though it's understood that:
a) Multiprocessor machines run more threads; my description applies to uniprocessor.
b) It's equally easy to have several backup threads if your application warrants it.
c) The primary and backup threads may occasionally swap roles.
Setting the concurrency to one thread per processor like this is the norm; presumably it's what one million installations of Internet Information Server run with.

The central idea of IOCP is to interleave the CPU and I/O aspects of multiple concurrent jobs (such as client requests being satisfied by a server) so that they are all performed by the primary thread. The backup thread takes over when the primary thread blocks on something other than the completion port. This may not happen too often, as any subsidiary I/O the primary thread undertakes (e.g. reading the file a client requested) is likely to be designed as asynchronous I/O directed through the same completion port.

The primary thread and the backup thread run the same code:

repeat
wait on I/O completion port;
determine which job and segment the completed I/O relates to;
access the data for the job in question;
perform the succeeding segment of the job in question;
initiate I/O to complete the segment;
forever

The above generic pseudocode is based on a certain structuring:
A job is (must be) broken down into a sequence of CPU-intensive segments, each segment terminated by an I/O operation. This example:
extract name of index file from data passed by client;
read index file;
look up index to find path of data file;
read data file;
parse content of data file;
transmit result to client;
defines three segments (more in practice, unless each file can be read with a single I/O).

When released by the completion port, the thread first works out which job and segment to position itself for, based on a few words of tag data obtained from the port. The thread has to maintain the state of every current job; if it is on the first stage (e.g. the completed I/O was a new client request) it allocates heap entries to hold the state of the new job; otherwise it sets pointers to appropriate existing heap entries. The thread can then get on with the CPU-heavy stuff that constitutes the segment, often working with a buffer handed to it by the completion port, which has just received relevant input. When the time comes for more I/O, this is initiated as an asynchronous transfer. The thread then goes back to the completion port to collect another segment's worth of work; an arbitrary segment of some other job is probably already queued up for it.

This design outline, which could I suppose be called a workflow pattern or an elaborate finite state machine, is compulsory for IOCP applications. The heap entries (or whatever) that hold the state of current jobs must be globally ac cessible in case the backup thread runs. Thread safety is not too difficult because the jobs are assumed to be independent and their state data should not overlap. If the primary thread and backup thread are both runnable (which will happen if the primary thread blocks, the completion port releases the backup thread, then the primary thread resumes) they are known to be working on different jobs. However the shared data structures that provide an access path to the job-specifc state data need to be protected against concurrent access, unless the application is strictly for uniprocessors.

Unlike Joes's contribution, mine has been more or less a rehash of the material in the Richter book, for which I apologise. For me, stripping out the multiprocessor dimension makes IOCP easier to understand. I hope I got it right this time. I do have some original stuff on performance coming up though.

Dar

daradder · ‎08-02-2003

But first a short and snappy one.

My "Advanced Windows" book sets out the circumstances under which the primary thread is considered blocked, causing the I/O completion port to release a backup thread. The most detailed description is this:

Now if one of the running threads calls Sleep, WaitForSingleObject, WaitForMultipleObjects, SignalObjectAndWait, or any function that would cause the thread not to be runnable, the I/O completion detects this and wakes [another] thread immediately.
[end quote]

Of course Jeffrey Richter couldn't be expected to cover every little detail, but the text leaves open the possibility that when the primary thread page-faults, the thread which gets scheduled is the idle thread.

That would be astonishing.

Dar

laughlinjs · ‎08-18-2003

Windows will try and schedule the idle thread if possible. The only way it won't be the idle thread is if there are other threads in the ready state of equal or higher priority. I set my IOCP threads to time critical, making them the likely ones to be scheduled. Only OS threads or real time priority threads will be chosen over my IOCP thread.