Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Memory model / Help finding the race

jdybnis
Beginner
791 Views
Hi,

This is my first post to the Intel Software Network Forums. I hope this is the right place. I have a question about a multi-threaded program running on an Intel processor, but not using TBB.

I'm running into data race problems while implementing a basic two thread mutual exclusion algorithm. I've attached test code that demonstrates the problem. First I'll describe how the race is showing up, then I'll give a pseudo-code description of the algorithm, and then the actual C code is in the attachment.

I'm testing the effectiveness of the mutual exclusion by having two threads non-atomically increment a protected shared counter 10 million times each and then checking the total. The code works if I scatter mfences around, but I don't think they should be necessary under the memory model described in the Intel documentation.

Instead of using an mfence after a store, I'm using a dummy load from the location I just stored to. That should prevent the processor from re-ordering the store with subsequent loads from other locations. According to the Software Developer's Manual, a load will not be reordered with an older store to the same location, and loads will not be reordered with other loads even if they are from different locations.

The result of running the attached program with the mfences:
Count: 20000000 Time:750ms

The result of running it with a dummy load instead of an mfence:
Count: 19908764 Time:102ms

As you can see, the program runs much faster without the heavyweight mfences, so I would like to get to the bottom of the incorrect results.

The mutual exclusion algorithm is an asymmetrical two-thread algorithm. It is similar to Peterson's mutual exclusion algorithm in that it doesn't use atomic read-modify-write instructions, only regular loads and stores. The following pseudo-code is for a sequentially consistent abstract machine. The actual code in the attachment is for an x86 processor.

Initialize:
[cpp]guard0 = guard1 = 0;[/cpp]
Thread 1:
[cpp]guard0 = 1;
while (guard1) { /* spin */ }
/* critical section goes here */
guard0 = 0;[/cpp]
Thread 2:
[cpp]guard1 = 1;
while (guard0) {
guard1 = 0;
while (guard0) { /* spin */ }
guard1 = 1;
}
/* critical section goes here */
guard1 = 0;[/cpp]
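
For context, here's a rough sketch of what Thread 1's entry path looks like on x86 in the two variants I'm comparing (not the actual attached test code; the volatile qualifiers, the USE_MFENCE switch, and the function name are just illustrative):
[cpp]#include <emmintrin.h>   /* _mm_mfence() */

volatile int guard0 = 0, guard1 = 0;
volatile long count = 0;

void thread1_enter_and_increment(void)
{
    guard0 = 1;                  /* announce intent to enter */
#ifdef USE_MFENCE
    _mm_mfence();                /* variant 1: full store-load fence (correct but slow) */
#else
    (void)guard0;                /* variant 2: dummy reload of the just-stored location
                                    (this is the variant that produces the wrong count) */
#endif
    while (guard1) { /* spin */ }
    count = count + 1;           /* critical section: non-atomic increment */
    guard0 = 0;                  /* leave the critical section */
}[/cpp]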

0 Kudos
10 Replies
Dmitry_Vyukov
Valued Contributor I
791 Views
Quoting - jdybnis

Instead of using an mfence after a store, I'm using a dummy load from the location I just stored to. That should prevent the processor from re-ordering the store with subsequent loads from other locations. According to the Software Developer's Manual, a load will not be reordered with an older store to the same location, and loads will not be reordered with other loads even if they are from different locations.



Nah, you won't get store-load ordering that way. If it were that easy and cheap, why would everybody be paying the price of an honest store-load memory fence?
Loads actually do get reordered with other loads when store-to-load forwarding is involved (which is the case here).
The essence is that [traditional] mutual exclusion is expensive, and you simply need a store-load fence for it, no matter what kind of tricks you try to employ.
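
To illustrate (a sketch only, not code from the attachment): with both stores still sitting in their cores' store buffers, each dummy reload is satisfied locally by forwarding, and both threads can still read the other guard as 0:
[cpp]/* Illustrative interleaving only (pseudo-trace, not from the attachment):

   Thread 1                              Thread 2
   --------                              --------
   guard0 = 1   (sits in store buffer)   guard1 = 1   (sits in store buffer)
   r = guard0   (forwarded locally: 1)   r = guard1   (forwarded locally: 1)
   load guard1  -> still reads 0         load guard0  -> still reads 0
   enter critical section                enter critical section            */[/cpp]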



0 Kudos
Dmitry_Vyukov
Valued Contributor I
792 Views
Quoting - Dmitriy Vyukov

Nah, you won't get store-load ordering that way. If it were that easy and cheap, why would everybody be paying the price of an honest store-load memory fence?
Loads actually do get reordered with other loads when store-to-load forwarding is involved (which is the case here).
The essence is that [traditional] mutual exclusion is expensive, and you simply need a store-load fence for it, no matter what kind of tricks you try to employ.


However, a similar trick works on SPARC. On SPARC you can do the following (pseudo-code):
uint32_t X;
(uint16_t&)X = 1; // half-word store
uint32_t y = X; // full-word load
uint16_t w = y >> 16; // extract the other half-word (or mask, depending on endianness)

This effectively provides store-load ordering for a single store and a single load (which is sufficient for Peterson's algo).
You can find details and an example implementation here:
http://home.comcast.net/~pjbishop/Dave/QRL-OpLocks-BiasedLocking.pdf
(search for "collocation")

But I think such a trick has some performance penalty anyway, because the load has to stall in the pipeline until the store does a full round trip through the store buffers and the L1$.
And this "collocation trick" won't work on x86, because AFAIK x86 allows half-word store-to-load forwarding.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
791 Views
Quoting - jdybnis

As you can see, the program runs much faster without the heavyweight mfences, so I would like to get to the bottom of the incorrect results.

The mutual exclusion algorithm is an asymmetrical two-thread algorithm. It is similar to Peterson's mutual exclusion algorithm in that it doesn't use atomic read-modify-write instructions, only regular loads and stores. The following pseudo-code is for a sequentially consistent abstract machine. The actual code in the attachment is for an x86 processor.



The good news is that it is still possible to get very fast synchronization in special cases.

I've developed an asymmetric reader-writer mutual exclusion algorithm which removes the mfence from the reader side (however, the writer side becomes much heavier):
http://groups.google.ru/group/lock-free/browse_frm/thread/1efdc652571c6137


David Dice, Hui Huang, and Mingyao Yang have a similar solution for two-thread mutual exclusion, "Asymmetric Dekker Synchronization":
http://blogs.sun.com/dave/resource/Asymmetric-Dekker-Synchronization.txt


0 Kudos
jimdempseyatthecove
Honored Contributor III
791 Views

If the performance of the simultaneous increment of the count is important, and if your code will peek at the count during execution as an indicator of progress towards completion, then I would suggest you create a struct containing a count plus pad bytes to extend the size of the struct to 2x the cache line size (struct size = 128 bytes). Then allocate an array of these structs, one struct (and count) for each thread. Initialize all counts to 0; then each thread increments the count contained in the struct indexed by its thread number (0:n-1). Have a function that returns the sum of the counts as an approximation of the progress.

Generally you will be more concerned about incrementing the counts than about producing the sum of the counts. Note that gathering the sum of the counts will take a significant hit while the counters are incrementing; e.g. with 8 threads running, summing the counts will take 8 memory reads (7 if the thread reading the sum is one of the threads performing the counting).
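
A minimal sketch of this padded per-thread counter scheme (assuming 64-byte cache lines, hence 128-byte structs, and a fixed thread count; the names and NTHREADS value are illustrative):
[cpp]#include <stdint.h>

#define CACHE_LINE 64
#define NTHREADS   8

struct padded_count {
    volatile uint64_t count;                      /* this thread's private counter */
    char pad[2 * CACHE_LINE - sizeof(uint64_t)];  /* pad to 128 bytes to avoid false sharing */
};

static struct padded_count counts[NTHREADS];      /* one slot per thread, all initialized to 0 */

/* Each thread touches only its own slot, so no lock or fence is needed. */
static inline void count_increment(int tid)
{
    counts[tid].count++;
}

/* Approximate progress: sums slots that other threads may still be updating. */
static uint64_t count_sum(void)
{
    uint64_t sum = 0;
    for (int i = 0; i < NTHREADS; ++i)
        sum += counts[i].count;
    return sum;
}[/cpp]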

Jim Dempsey
0 Kudos
jdybnis
Beginner
791 Views
Quoting - Dmitriy Vyukov

However, a similar trick works on SPARC. On SPARC you can do the following (pseudo-code):
uint32_t X;
(uint16_t&)X = 1; // half-word store
uint32_t y = X; // full-word load
uint16_t w = y >> 16; // extract the other half-word (or mask, depending on endianness)

This effectively provides store-load ordering for a single store and a single load (which is sufficient for Peterson's algo).
You can find details and an example implementation here:
http://home.comcast.net/~pjbishop/Dave/QRL-OpLocks-BiasedLocking.pdf
(search for "collocation")

But I think such a trick has some performance penalty anyway, because the load has to stall in the pipeline until the store does a full round trip through the store buffers and the L1$.
And this "collocation trick" won't work on x86, because AFAIK x86 allows half-word store-to-load forwarding.


Too bad it won't work on x86. It's exactly what I need: a way to artificially insert a dependency between a single store and a single load. This post by Linus http://lkml.org/lkml/2008/6/4/412 seems to be in agreement with what you say about partial-word store-to-load forwarding.
0 Kudos
jdybnis
Beginner
791 Views

If the performance of the simultaneous increment of the count is important, and if your code will peek at the count during execution as an indicator of progress towards completion, then I would suggest you create a struct containing a count plus pad bytes to extend the size of the struct to 2x the cache line size (struct size = 128 bytes). Then allocate an array of these structs, one struct (and count) for each thread. Initialize all counts to 0; then each thread increments the count contained in the struct indexed by its thread number (0:n-1). Have a function that returns the sum of the counts as an approximation of the progress.

Generally you will be more concerned about incrementing the counts than about producing the sum of the counts. Note that gathering the sum of the counts will take a significant hit while the counters are incrementing; e.g. with 8 threads running, summing the counts will take 8 memory reads (7 if the thread reading the sum is one of the threads performing the counting).

Jim Dempsey

The counter isn't in the real application. It is just a means to test the correctness of the mutual exclusion algorithm.
0 Kudos
jdybnis
Beginner
791 Views
Quoting - Dmitriy Vyukov

However, a similar trick works on SPARC. On SPARC you can do the following (pseudo-code):
uint32_t X;
(uint16_t&)X = 1; // half-word store
uint32_t y = X; // full-word load
uint16_t w = y >> 16; // extract the other half-word (or mask, depending on endianness)

This effectively provides store-load ordering for a single store and a single load (which is sufficient for Peterson's algo).
You can find details and an example implementation here:
http://home.comcast.net/~pjbishop/Dave/QRL-OpLocks-BiasedLocking.pdf
(search for "collocation")

But I think such a trick has some performance penalty anyway, because the load has to stall in the pipeline until the store does a full round trip through the store buffers and the L1$.
And this "collocation trick" won't work on x86, because AFAIK x86 allows half-word store-to-load forwarding.


I just tried this trick on a Core i7 920 and it appears to work. Unfortunately it is no faster than the version with mfences. I've attached the code for interest anyway.
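
Roughly, the shape of what I tried looks like this (a sketch, not the exact attached code; the union layout and names are illustrative, and on little-endian x86 guard1 lands in the high half of the word):
[cpp]#include <stdint.h>

/* Both guards collocated in one 32-bit word so the full-word reload
   overlaps the half-word store. */
union guards {
    volatile uint32_t word;        /* full word containing both flags */
    volatile uint16_t half[2];     /* half[0] = guard0, half[1] = guard1 */
};

static union guards g = { 0 };
static volatile long count = 0;

void thread1_enter_and_increment(void)
{
    g.half[0] = 1;                         /* half-word store: announce intent */
    uint32_t snap = g.word;                /* full-word load overlapping that store */
    while ((uint16_t)(snap >> 16) != 0)    /* examine guard1 out of the snapshot */
        snap = g.word;                     /* keep re-reading the whole word while spinning */
    count = count + 1;                     /* critical section: non-atomic increment */
    g.half[0] = 0;                         /* leave the critical section */
}[/cpp]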

0 Kudos
Dmitry_Vyukov
Valued Contributor I
791 Views
Quoting - jdybnis

I just tried this trick on a Core i7 920 and it appears to work. Unfortunately it is no faster than the version with mfences. I've attached the code for interest anyway.


Hmmm... Interesting... IIRC I read somewhere in the Intel docs that partial store-to-load forwarding is allowed... IIRC there was something about partial store-to-load forwarding becoming faster in the latest processor generation, which I interpreted to mean that partial store-to-load forwarding would no longer serve as a store-load fence.
So it still serves as a store-load fence... but it is as expensive as an mfence :)
Mutual exclusion is indeed expensive!

0 Kudos
jdybnis
Beginner
791 Views
Quoting - Dmitriy Vyukov

Hmmm... Interesting... IIRC I read somewhere in the Intel docs that partial store-to-load forwarding is allowed... IIRC there was something about partial store-to-load forwarding becoming faster in the latest processor generation, which I interpreted to mean that partial store-to-load forwarding would no longer serve as a store-load fence.
So it still serves as a store-load fence... but it is as expensive as an mfence :)
Mutual exclusion is indeed expensive!


A whitepaper on the Core i7 that I saw talked about partial store-to-load forwarding for accesses spanning cache lines, where one cache line was in the store buffer and the other wasn't. Maybe that is what you saw. The whitepaper mentioned that it would speed up motion-detection algorithms.

The experiment I tried was splitting a short into bytes. It was actually about 15% faster, but I don't consider that significant enough in a micro-benchmark to warrant such a hack.
0 Kudos
jimdempseyatthecove
Honored Contributor III
791 Views

I think you should step back for a moment and view the counter-increment problem in terms of how it will be used in a particular application. There is no best solution for all situations (though that could be a bad assumption on my part).

The test scenarios you presented on this thread to simulate the environment were to have two (or more) threads in competition to increment the counter, and to do nothing but increment the counter.

A potential program environment might be one where a program intermittently enters a section with a high increment rate and then falls out into a section of code with no increments, with the other thread(s) doing the same. Under this scenario the better algorithm might be one that favors the same thread performing the next increment.

Jim Dempsey
0 Kudos