Re: Using monitor/mwait for producer/consumer threads.

astor8 · ‎03-16-2004

I'm looking for examples / best practice for using monitor and mwait with a set of producer/consumer threads.

What I am trying to accomplish is to keep a pair (actually a set) of producer/consumer threads in the L1 cache using hyperthreading on a Prescott CPU. "In the L1 cache" means that the producer should stall if the buffer between the producer and the consumer becomes larger than 10 kB. When the buffer is "reasonably" filled, both threads should work simultaneously.

As documented, mwait/monitor are primarily intended to optimize idle loops and to ensure that the processor uses lower power.

Does this mean that mwait/monitor is slow and should not be used for producer/consumer code?

How long does it take to enter and exit mwait? How many L1 cache misses does one monitor/mwait pair cost?

astor

ClayB · ‎03-17-2004

Astor8 -

This is an excellent question, but one that is hard to answer. There is not much that has been published. Even some of my internal Intel sources said they didn't have anything that could be sent to me.

However, the Intel Technology Journal has published an article on the Pentium 4 architecture on 90nm technology. One section of this article talks about some of the new SSE3 instructions, which include monitor and mwait. (See the "Thread Synchronization" section at the bottom of the page at http://www.intel.com/technology/itj/2004/volume08issue01/art01_microarchitecture/p06_sse.htm). You can either jump over to the article or keep reading my paraphrased bits below.

The monitor instruction sets up hardware to detect changes in memory locations (typically in cache), while the mwait instruction puts a thread into low-power "sleep" until those memory changes are detected. The mwait only looks for changes that have been set up by the previous monitor instruction. Thus, there must be a monitor instruction executed for every mwait instruction; you can have more monitor instructions than mwaits without any problems.

For the Producer/Consumer code, the example given in the article cited above should be a good starting point for the consumer code.

while (more work to be put in queue) {
    MONITOR EAX, ECX, EDX
    while (queue is NOT full enough) {
        MWAIT EAX, ECX
        // if queue is not ready, must reset monitor
        MONITOR EAX, ECX, EDX
    }
    Pull item from queue and process
}
// finish after Producer done
while (queue not empty) {
    pull item and process
}

It's a little sketchy on the details, I know, but I've not found anything that gives more detailed syntax than this. Has anyone else seen better descriptions of the monitor or mwait instructions? Can you post a pointer to them?

-- clay

Chris_M__Thomasson · ‎03-18-2004

cpbreshe wrote:

while (more work to be put in queue) {
    MONITOR EAX, ECX, EDX
    while (queue is NOT full enough) {
        MWAIT EAX, ECX
        // if queue is not ready, must reset monitor
        MONITOR EAX, ECX, EDX
    }
    Pull item from queue and process
}
// finish after Producer done
while (queue not empty) {
    pull item and process
}

It's a little sketchy on the details, I know, but I've not found anything that gives more detailed syntax than this. Has anyone else seen better descriptions of the monitor or mwait instructions?

This could be used for an optimized load-locked (reserved)/store-conditional. In a normal LL (reserved)/SC operation, the SC fails on every memory change since the last "reservation". The monitor/mwait pair could be used to defer the SC's completion until memory has met a specific condition.

This would be useful for lock-free condition-variables...

ClayB · ‎03-18-2004

Astor -

astor8 wrote:
How long does it take to enter and exit mwait? How many L1 cache misses does one monitor/mwait pair cost?

One bit of information that has been pointed out to me is that this set of features is intended for operating system usage. Reading the text of the ITJ article section more carefully, OS use is pointed out in the first and last paragraphs. It also points out that it is the processor that is put into a low-power wait mode, not the thread calling mwait. So, on HT systems, this would effectively put one of the logical processors to sleep. In the case where you might have four threads on two processors, calling mwait would effectively cut your processing power in half and lump three threads onto the other processor.

Also, it apparently takes several thousand clocks to wake up the processor once the monitored memory access has happened. Not the kind of thing applications should be using for thread synchronization, IMO.

There are some more details in the "IA-32 Intel Architecture Software Developer's Manual." Volume 2A (http://www.intel.com/design/Pentium4/manuals/25366613.pdf), Chapter 3, has details of the monitor and mwait instructions. Volume 3 (http://www.intel.com/design/Pentium4/manuals/25366813.pdf), Chapter 7, Section 7.7, has something on usage of the monitor instruction.

This sounds like great functionality for supporting threads. It would be like a watchpoint in a debugger that could wake up threads. For now, though, we'll have to stick with conventional means of accomplishing this; e.g., condition variables in Pthreads.
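For reference, the conventional approach can be sketched as a bounded buffer guarded by a Pthreads mutex and two condition variables. This is only a minimal sketch: the names (bounded_buf, bb_put, bb_get, bb_demo) are hypothetical, and the capacity is an assumption chosen to match the 10 kB limit from the original question.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Assumed cap: 10 kB worth of ints, per the limit in the original question. */
#define BB_CAPACITY (10 * 1024 / sizeof(int))

typedef struct {
    int             items[BB_CAPACITY];
    size_t          head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t  not_full;   /* signalled when a slot frees up */
    pthread_cond_t  not_empty;  /* signalled when an item arrives */
} bounded_buf;

static void bb_init(bounded_buf *b) {
    b->head = b->tail = b->count = 0;
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->not_full, NULL);
    pthread_cond_init(&b->not_empty, NULL);
}

/* Producer side: blocks (stalls) while the buffer is at capacity. */
static void bb_put(bounded_buf *b, int v) {
    pthread_mutex_lock(&b->lock);
    while (b->count == BB_CAPACITY)
        pthread_cond_wait(&b->not_full, &b->lock);
    b->items[b->tail] = v;
    b->tail = (b->tail + 1) % BB_CAPACITY;
    b->count++;
    pthread_cond_signal(&b->not_empty);
    pthread_mutex_unlock(&b->lock);
}

/* Consumer side: blocks while the buffer is empty. */
static int bb_get(bounded_buf *b) {
    pthread_mutex_lock(&b->lock);
    while (b->count == 0)
        pthread_cond_wait(&b->not_empty, &b->lock);
    int v = b->items[b->head];
    b->head = (b->head + 1) % BB_CAPACITY;
    b->count--;
    pthread_cond_signal(&b->not_full);
    pthread_mutex_unlock(&b->lock);
    return v;
}

typedef struct { bounded_buf *buf; int n; } producer_arg;

static void *producer_main(void *p) {
    producer_arg *a = p;
    for (int i = 1; i <= a->n; i++)
        bb_put(a->buf, i);
    return NULL;
}

/* One producer pushes 1..n, the caller consumes; returns the consumed sum. */
long long bb_demo(int n) {
    bounded_buf buf;
    bb_init(&buf);
    producer_arg arg = { &buf, n };
    pthread_t prod;
    int rc = pthread_create(&prod, NULL, producer_main, &arg);
    assert(rc == 0);
    (void)rc;
    long long sum = 0;
    for (int i = 0; i < n; i++)
        sum += bb_get(&buf);
    pthread_join(prod, NULL);
    pthread_mutex_destroy(&buf.lock);
    pthread_cond_destroy(&buf.not_full);
    pthread_cond_destroy(&buf.not_empty);
    return sum;
}
```

Calling bb_demo(100000) should return the sum of 1..100000. The waiting threads here sleep in the kernel rather than in a hardware wait state, which is the trade-off discussed above.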

-- clay

Message Edited by cpbreshe on 03-18-2004 02:20 PM

olszewski_marek · ‎12-20-2005

Are you sure that other processes will migrate over to another processor? I thought the docs said that the processor would be woken up by an interrupt. If this is true, the timer interrupt used for context switching would work as usual and another process could start executing as if the processor was at full power.

Also, regarding the 1000 cycle startup time, this is how long the final memory read will take when a lock is released (since your cache line will be invalidated and you will probably have to go off chip to bring it in), so it's actually not that great of an overhead.

I've played around a bit with these two instructions and I've managed to create an MCS lock implementation that uses them. I had to drop into kernel space via a syscall for the actual spinning, since user space execution is apparently not yet supported. Regardless, I was able to get a 40% speedup on a microbenchmark and a 4% speedup on a parallel version of quicksort when running on 2 Xeons with SMT. With such speedups, I would imagine that a user level library such as pthreads could benefit. Of course, this would require user space support.

Why do the docs say that user level support is there when, currently, it is not? When can we expect user level runnable versions of these instructions?

Regards,

Marek

jseigh · ‎12-20-2005

Basically, as answered here and on previous questions on this topic, the current implementation of MONITOR/MWAIT isn't practical for this sort of application just yet. If you know your producer/consumer threads are scheduled on hyperthreads on the same processor, you can use the PAUSE instruction to allow reallocation of processor resources to the slower of the two threads. A little crude and restricted as to application, but general support for fine grained threading isn't really out there yet.
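A rough sketch of the PAUSE approach, assuming GCC or Clang inline assembly and C11 atomics: a single-producer/single-consumer ring where each side issues PAUSE while it waits for the other. The names (spsc_ring, ring_push, ring_pop, spsc_demo) and the 8 kB ring size are hypothetical choices for illustration, and on non-x86 targets cpu_relax() degrades to a plain busy loop.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Assumed size: 2048 ints = 8 kB, under the 10 kB target from the question. */
#define RING_SLOTS 2048

/* PAUSE hints to an SMT core that this logical processor is only spinning. */
static inline void cpu_relax(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ __volatile__("pause" ::: "memory");
#endif
}

typedef struct {
    int            slots[RING_SLOTS];
    _Atomic size_t head;  /* next slot to read; written only by the consumer */
    _Atomic size_t tail;  /* next slot to write; written only by the producer */
} spsc_ring;

/* Producer: spins with PAUSE while the ring is full. */
static void ring_push(spsc_ring *r, int v) {
    size_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    while (t - atomic_load_explicit(&r->head, memory_order_acquire) == RING_SLOTS)
        cpu_relax();
    r->slots[t % RING_SLOTS] = v;
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
}

/* Consumer: spins with PAUSE while the ring is empty. */
static int ring_pop(spsc_ring *r) {
    size_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
    while (atomic_load_explicit(&r->tail, memory_order_acquire) == h)
        cpu_relax();
    int v = r->slots[h % RING_SLOTS];
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return v;
}

typedef struct { spsc_ring *ring; int n; } push_arg;

static void *push_main(void *p) {
    push_arg *a = p;
    for (int i = 1; i <= a->n; i++)
        ring_push(a->ring, i);
    return NULL;
}

/* One producer pushes 1..n, the caller pops; returns the popped sum. */
long long spsc_demo(int n) {
    spsc_ring ring;
    atomic_init(&ring.head, 0);
    atomic_init(&ring.tail, 0);
    push_arg arg = { &ring, n };
    pthread_t prod;
    int rc = pthread_create(&prod, NULL, push_main, &arg);
    assert(rc == 0);
    (void)rc;
    long long sum = 0;
    for (int i = 0; i < n; i++)
        sum += ring_pop(&ring);
    pthread_join(prod, NULL);
    return sum;
}
```

The sketch does not pin the two threads to sibling hyperthreads; that would need an affinity call, which is left out here. Without pinning, the spinning side simply burns its time slice when the threads share a core.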

nik80 · ‎12-20-2005

Marek,

I suppose the speedups you present refer to the acceleration you observe over the serial versions of the programs you test.
It would be interesting to give us further info on two things:
a) what is the speedup (if any) of the parallel versions using your synch. mechanisms (syscalls with mwait/monitor) over the parallel versions using other common synch. mechanisms (e.g. pthreads functions, user level spin-wait loops, etc.).
b) what is the nature of the locks in the programs you test (i.e., lock characterisation in terms of contention rate between threads for lock acquisition, lock frequency, lock duration, ...)?

And a more general question:
what happens when a thread bound on a specific logical processor (e.g. through Linux affinity syscalls) calls mwait in kernel space? Since there is no migration opportunity for that thread, what exactly will be the resume-state of the processor and the thread after an update on the monitored memory?

Nik.

olszewski_marek · ‎01-31-2006

The speedups were over a parallel version.

I was using thread binding and MCS locks with and without mwait/monitor. I could only get 4 threads running concurrently as I only have a 2-way SMP (with SMT). I used a microbenchmark that was something like this (I don't have it in front of me right now):

for (i = 0; i < 1000000; i++)
{
    mthread_mutex_lock(&queue_mutex);
    sum++;
    mthread_mutex_unlock(&queue_mutex);

    // Do some work so that we benefit from the sibling
    // processor going to sleep.
    for (j = 0; j < 1000; j++)
    {
        // Some memory + arithmetic operation
    }
}

With this benchmark the monitor/mwait MCS lock outperformed the standard MCS lock version by a factor of 2. Of course, with only 4 threads, a simple test-and-set lock would probably be superior; however, for a machine capable of supporting 16 or more concurrent threads (e.g. 4 2-way CMPs with SMT), the bus traffic would start to become excessive and an MCS lock would be preferred. Of course, a simple test-and-set lock should also benefit; however, I didn't check it out. Maybe if I have the time I'll give it a try.
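For readers who have not seen one, an MCS lock can be sketched roughly as follows, using C11 atomics. This is not the implementation measured above; the names (mcs_node, mcs_acquire, mcs_release, mcs_demo) are hypothetical. Each waiter spins on a flag in its own queue node, and that local spin is the spot where a monitor/mwait wait would go. The occasional sched_yield() in spin_hint() is an added assumption so the sketch behaves on machines with fewer cores than threads; it is not part of the MCS algorithm.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;  /* successor in the wait queue */
    atomic_bool locked;               /* true while this waiter must spin */
} mcs_node;

typedef struct { _Atomic(mcs_node *) tail; } mcs_lock;

/* PAUSE while spinning; yield now and then in case we are oversubscribed. */
static void spin_hint(unsigned *spins) {
    if (++*spins % 64 == 0) { sched_yield(); return; }
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ __volatile__("pause" ::: "memory");
#endif
}

static void mcs_acquire(mcs_lock *l, mcs_node *me) {
    unsigned spins = 0;
    atomic_store_explicit(&me->next, (mcs_node *)NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);
    /* Append ourselves to the queue; prev is the old tail (NULL if free). */
    mcs_node *prev = atomic_exchange_explicit(&l->tail, me, memory_order_acq_rel);
    if (prev != NULL) {
        atomic_store_explicit(&prev->next, me, memory_order_release);
        /* Local spin on our own node: the candidate site for monitor/mwait. */
        while (atomic_load_explicit(&me->locked, memory_order_acquire))
            spin_hint(&spins);
    }
}

static void mcs_release(mcs_lock *l, mcs_node *me) {
    unsigned spins = 0;
    mcs_node *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        /* No visible successor: try to swing the tail back to empty. */
        mcs_node *expected = me;
        if (atomic_compare_exchange_strong_explicit(&l->tail, &expected,
                (mcs_node *)NULL, memory_order_acq_rel, memory_order_acquire))
            return;
        /* A successor is mid-enqueue; wait until it links itself in. */
        while ((succ = atomic_load_explicit(&me->next, memory_order_acquire)) == NULL)
            spin_hint(&spins);
    }
    atomic_store_explicit(&succ->locked, false, memory_order_release);
}

typedef struct { mcs_lock *lock; long *counter; int iters; } worker_arg;

static void *worker_main(void *p) {
    worker_arg *a = p;
    mcs_node me;  /* one queue node per thread */
    for (int i = 0; i < a->iters; i++) {
        mcs_acquire(a->lock, &me);
        (*a->counter)++;
        mcs_release(a->lock, &me);
    }
    return NULL;
}

/* nthreads workers each increment a shared counter iters times. */
long mcs_demo(int nthreads, int iters) {
    assert(nthreads > 0 && nthreads <= 16);
    pthread_t tid[16];
    mcs_lock lock;
    long counter = 0;
    atomic_init(&lock.tail, (mcs_node *)NULL);
    worker_arg arg = { &lock, &counter, iters };
    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&tid[i], NULL, worker_main, &arg);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return counter;
}
```

mcs_demo(4, 5000) should return 20000 if mutual exclusion holds. Because every waiter watches a different cache line, a release touches only the successor's line, which is why this lock suits a monitored-address wait.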

For quicksort, I simply used a parallel work queue version of quicksort with the standard MCS and with my monitor/mwait MCS locks. I tuned it so that each thread would continue sequentially when the work items become small. This kept the contention on the queue reasonable. With this tuning, I believe I was getting a speedup greater than 2 over the sequential version (remember that I only had 2 physical processors). Using the mwait/monitor MCS locks, I was able to get an additional speedup of 1.04.

I also compared this technique to one where I would yield if the lock was locked (to somewhat mimic the new NPTL locks). I found that this would improve performance on some benchmarks, but would be detrimental on others (e.g. quicksort). In all cases, using mwait/monitor was superior to this technique.

Regarding migration, I didn't play with this since MCS locks require that your threads are bound to a CPU.

I'm assuming that these numbers will only improve when the instructions become available in userspace, as one of my system calls will be obviated.

Cheers,

Marek

olszewski_marek · ‎01-31-2006

Perhaps I posted to the wrong thread. I was just saying that I could use the monitor/mwait instructions to implement MCS locks that outperformed regular MCS locks. These locks were for user space but the spinning was carried out in kernel space through a custom system call.

I guess I was slightly annoyed that people keep saying that these instructions have no place in userspace. I don't quite see why this would be the case. Since interrupts will wake up the processor, I don't see how using these instructions in user space can at all affect other applications in a multi-programmed environment. Please let me know if there is something that I am missing.

Also, I'm annoyed that the Intel documentation states that they are available in user space when they are not (at least not on my test machine).
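What can be checked from user space is whether the CPU reports the instructions at all. A minimal sketch, assuming GCC or Clang with <cpuid.h>: CPUID leaf 01H reports MONITOR/MWAIT in ECX bit 3, and leaf 05H gives the smallest and largest monitor-line sizes. The names monitor_info and query_monitor are hypothetical. Note that this bit only says the instructions exist; it does not say they can be executed outside ring 0.

```c
#include <assert.h>
#include <stdio.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define HAVE_CPUID 1
#endif

typedef struct {
    int      supported;  /* 1 = reported, 0 = not reported, -1 = cannot tell */
    unsigned min_line;   /* smallest monitor-line size in bytes (0 if unknown) */
    unsigned max_line;   /* largest monitor-line size in bytes (0 if unknown) */
} monitor_info;

monitor_info query_monitor(void) {
    monitor_info info = { -1, 0, 0 };
#ifdef HAVE_CPUID
    unsigned eax, ebx, ecx, edx;
    /* Leaf 01H: feature flags; ECX bit 3 is the MONITOR/MWAIT flag. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return info;
    info.supported = (int)((ecx >> 3) & 1u);
    /* Leaf 05H: EAX[15:0] = smallest line size, EBX[15:0] = largest. */
    if (info.supported && __get_cpuid(5, &eax, &ebx, &ecx, &edx)) {
        info.min_line = eax & 0xffffu;
        info.max_line = ebx & 0xffffu;
    }
#endif
    return info;
}

/* Prints a one-line summary of what CPUID reports. */
void print_monitor_info(void) {
    monitor_info mi = query_monitor();
    printf("MONITOR/MWAIT reported: %d, line size %u..%u bytes\n",
           mi.supported, mi.min_line, mi.max_line);
}
```

The monitor-line size matters for the lock designs discussed here: the word being waited on should sit alone in a region of that size, otherwise unrelated writes to the same line will wake the waiter.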

Cheers,

Marek

nik80 · ‎02-15-2006

olszewski_marek@yahoo.com wrote: Since interrupts will wake up the processor, I don't see how using these instructions in user space can at all affect other applications in a multi-programmed environment.

Marek,
In your mwait/monitor MCS locks implementation, do you let interrupts wake the processor up and exit the spin-loop? Don't you check continuously whether the processor has been woken up as a result of a write to the monitored area (which is the only case where the spin-loop must exit)?
I mean, something like this:

do {
    disable_interrupts;
    monitor(monitored_memory);
    enable_interrupts;
    mwait;
} while (monitored_memory is not changed);

Message Edited by nik80 on 02-15-2006 08:48 AM

olszewski_marek · ‎02-15-2006

I keep interrupts enabled at all times. I just check twice, within my loop, that the memory location that I am spinning on hasn't changed (at the end, and in between the monitor and mwait instructions). This is in line with some example code I saw by Intel.
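The shape of such a loop, with the two checks, might look like the sketch below. It is a guess at the structure described, not the actual lock code. The helpers arm_monitor() and wait_for_write() are hypothetical: by default they are user-space stand-ins (a no-op and a PAUSE) so the loop can be run anywhere, and only when built with the assumed KERNEL_MWAIT macro for ring 0 do they emit the real MONITOR and MWAIT opcodes.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#ifdef KERNEL_MWAIT
/* Ring 0 only. Opcodes emitted as bytes: MONITOR = 0F 01 C8, MWAIT = 0F 01 C9. */
static inline void arm_monitor(const volatile void *addr) {
    __asm__ __volatile__(".byte 0x0f, 0x01, 0xc8" :: "a"(addr), "c"(0), "d"(0));
}
static inline void wait_for_write(void) {
    __asm__ __volatile__(".byte 0x0f, 0x01, 0xc9" :: "a"(0), "c"(0));
}
#else
/* User-space stand-ins so the loop structure can be exercised. */
static inline void arm_monitor(const volatile void *addr) { (void)addr; }
static inline void wait_for_write(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __asm__ __volatile__("pause" ::: "memory");
#endif
}
#endif

/* Waits until *flag differs from old and returns the new value. */
static int wait_for_change(atomic_int *flag, int old) {
    int seen;
    for (;;) {
        arm_monitor((const volatile void *)flag);
        /* Check 1, between monitor and mwait: catches a write that landed
           before the monitor was armed, which mwait would otherwise miss. */
        seen = atomic_load_explicit(flag, memory_order_acquire);
        if (seen != old)
            break;
        wait_for_write();
        /* Check 2, at the end of the loop: mwait can also return because of
           an interrupt, so a wakeup alone does not prove the flag changed. */
        seen = atomic_load_explicit(flag, memory_order_acquire);
        if (seen != old)
            break;
    }
    return seen;
}

static void *writer_main(void *p) {
    atomic_store_explicit((atomic_int *)p, 42, memory_order_release);
    return NULL;
}

/* A second thread writes 42 to the flag; the caller waits for the change. */
int mwait_loop_demo(void) {
    atomic_int flag;
    atomic_init(&flag, 0);
    pthread_t writer;
    int rc = pthread_create(&writer, NULL, writer_main, &flag);
    assert(rc == 0);
    (void)rc;
    int seen = wait_for_change(&flag, 0);
    pthread_join(writer, NULL);
    return seen;
}
```

With interrupts left enabled, a spurious wakeup just costs one extra trip around the loop, so no interrupt masking is needed around the monitor instruction.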

Cheers,

Marek