Re: Checking Against Other Memory Models (x86, PPC, Java, CLI)

Dmitry_Vyukov · ‎08-03-2008

I want to announce tool called Relacy Race Detector, which I've developed. It's synchronization algorithm verificator for C++0x's relaxed memory model. The tool is implemented in the form of header-only library for C++03, which can be used for efficient execution of unit-tests for synchronization algorithms. The tool executes unit-test many times under control of special scheduler, on every execution scheduler models different interleaving between threads, at the same time every execution is exhaustively analyzed for data races, accesses to freed memory, failed assertions etc. If no errors found then verification terminates when particular number of interleavings are verified (for random scheduler), or when all possible interleavings are verified (for full search scheduler). If error is found then tool outputs detailed execution history which leads to error and terminates.

The tool was designed for verification of algorithms like memory management, memory reclamation for lock-free algorithms, multithreaded containers (queues, stacks, maps), mutexes, eventcounts and so on.

My personal subjective feeling to date is that tool is capable of finding extremely subtle algorithmic errors, memory fence placement errors and memory fence type errors within a second. And after error is detected error fixing is relatively trivial, because one has detailed execution history which leads to error.

Main features:

- Relaxed ISO C++0x Memory Model. Relaxed/acquire/release/acq_rel/seq_cst memory operations and fences. The only non-supported feature is memory_order_consume, it's simulated with memory_order_acquire.

- Exhaustive automatic error checking (including ABA detection).

- Full-fledged atomics library (with spurious failures in compare_exchange()).

- Arbitrary number of threads.

- Detailed execution history for failed tests.

- Before/after/invariant functions for test suites.

- No false positives.

Types of detectable errors:

- Race condition (according to ISO C++0x definition)

- Access to uninitialized variable

- Access to freed memory

- Double free

- Memory leak

- Deadlock

- Livelock

- User assert failed

- User invariant failed

You can get some more information (tutorial, examples etc) here:

http://groups.google.com/group/relacy/web

And here is dedicated news group/discussion forum:

http://groups.google.com/group/relacy/topics

If you want to get a copy of the tool, please, contact me via dvyukov@gmail.com.

Any feedback, comments, suggestions are welcome.

Relacy Race Detector is free for open-source, non-commercial development; research with non-patented, published results; educational purposes; home/private usage. Please contact me (dvyukov@gmail.com) about other usages.

--------------------------------------------------------

Here is quick example.

Code of unit-test:

#include

// template parameter '2' is number of threads

struct race_test : rl::test_suite

    std::atomic a;

    rl::var x;

    // executed in single thread before main thread function

    void before()

        a($) = 0;

        x($) = 0;

    // main thread function

    void thread(unsigned thread_index)

        if (0 == thread_index)

            // code executed by first thread

            x($) = 1;

            a($).store(1, rl::memory_order_relaxed);

        else

            // code executed by second thread

            if (1 == a($).load(rl::memory_order_relaxed))

                x($) = 2;

    // executed in single thread after main thread function

    void after()

    // executed in single thread after every

    // 'visible' action in main threads.

    // disallowed to modify any state

    void invariant()

};

int main()

    // start simulation

    rl::simulate();

And here is output of the tool:

struct race_test
DATA RACE
iteration: 8

execution history:
[0] 0: <00366538> atomic store, value=0, (prev value=0), order=seq_cst, in race_test::before, test.cpp(14)
[1] 0: <0036655C> store, value=0, in race_test::before, test.cpp(15)
[2] 0: <0036655C> store, value=1, in race_test::thread, test.cpp(23)
[3] 0: <00366538> atomic store, value=1, (prev value=0), order=relaxed, in race_test::thread, test.cpp(24)
[4] 1: <00366538> atomic load, value=1, order=relaxed, in race_test::thread, test.cpp(28)
[5] 1: <0036655C> store, value=0, in race_test::thread, test.cpp(29)
[6] 1: data race detected, in race_test::thread, test.cpp(29)

thread 0:
[0] 0: <00366538> atomic store, value=0, (prev value=0), order=seq_cst, in race_test::before, test.cpp(14)
[1] 0: <0036655C> store, value=0, in race_test::before, test.cpp(15)
[2] 0: <0036655C> store, value=1, in race_test::thread, test.cpp(23)
[3] 0: <00366538> atomic store, value=1, (prev value=0), order=relaxed, in race_test::thread, test.cpp(24)

thread 1:
[4] 1: <00366538> atomic load, value=1, order=relaxed, in race_test::thread, test.cpp(28)
[5] 1: <0036655C> store, value=0, in race_test::thread, test.cpp(29)
[6] 1: data race detected, in race_test::thread, test.cpp(29)

--------------------------------------------------------

Dmitriy V'jukov

Dmitry_Vyukov · ‎08-03-2008

Q: Can I use Relacy Race Detector to check my algo againts other that C++0x memory models (x86, PPC, Java, CLI)?

A Yes, you can. Fortunately, C++0x memory model is very relaxaed, so for the main part it's a "superset" of basically any other memory model. You just have to define "binding" between your target memory model and C++0x memory model.

Let's create such binding, for example, for x86 memory model:
- plain load operation is always "acquire" (i.e. memory_order_acquire)
- plain store operation is always "release" (i.e. memory_order_release)
- atomic RMW operation is always sequentially consistent (i.e. memory_order_seq_cst)
- mfence instruction is std::atomic_thread_fence(memory_order_seq_cst)
That is all. You can create such bindings for other hardware memory models you are interested in (PPC, Itatium, SPARC etc).

And you can define such binding to other abstract memory model, like Java MM. Let's see:
- plain load is "relaxed" (i.e. memory_order_relaxed)
- plain store is "relaxed" (i.e. memory_order_relaxed)
- volatile load is "acquire" (i.e. memory_order_acquire)
- volatile store operation is "release" (i.e. memory_order_release)
- atomic RMW operation is always sequentially consistent (i.e. memory_order_seq_cst)
But here are some caveats. First, you have to emulate work of GC, i.e. put all allocated memory to special list, and free all allocated memory in test_suite::after(). Second, you have to manually emit sequentially consistent memory fence between every volatile store and volatile load. Third, you have to manually initialize all variables to default value (0). Fourth, there is no such thing as data race, so you have to define all variables as std::atomic, this will effectively disable data race detection mechanizm. Well, actually you can use rl::var variables, if you know that there must be no concurrent accesses to variable, this will enable some automatic error detection wrt data races.
Sounds not very cool... So I'm going to add built-in support for Java and CLI. Then user would have to define something like RL_JAVA_MODE/RL_CLI_MODE, and get all those things out-of-the-box. But yes, it still will be C++ library. What do you think?

Dmitriy V'jukov

RafSchietekat · ‎08-03-2008

Sounds cool!

I think that it would take a bit more to model Java volatiles, because they're also mutually sequentially consistent (right?), which on typical hardware probably means that they have to be sequentially consistent with everything (are there any exceptions?). So would you be able to test the memory model as defined and not as mapped to the hardware running the test?

Dmitry_Vyukov · ‎08-03-2008

Raf_Schietekat:
Sounds cool!

Thank you ;)

Raf_Schietekat:
I think that it would take a bit more to model Java volatiles, because they're also mutually sequentially consistent (right?), which on typical hardware probably means that they have to be sequentially consistent with everything (are there any exceptions?). So would you be able to test the memory model as defined and not as mapped to the hardware running the test?

Completely! Memory model is modelled entirely programmatically.
I'm running the tests on x86 hardware, but I'm able to test program against relaxed C++0x memory model, i.e. not total store order, relaxed interlocked operations and so on.

Btw, what you mean by "mutually sequentially consistent"?
As I understand Java's volatiles:
- volatile load is acquire
- volatile store is release
- and between every volatile store and volatile load must be #StoreLoad memory fence.
This basically means that "Peterson's mutex works w/o any additional memory fences". But there are no requirements for total store order to different locations.

So I can map following Java program:
volatile int x;
x = 1;
int y = x;

to C++0x this way:
std::atomic x;
x.store(1, std::memory_order_release);
std::atomic_thread_fence(std::memory_order_seq_cst);
int y = x.load(std::memory_order_acquire);
Right?

Dmitriy V'jukov

RafSchietekat · ‎08-04-2008

I don't see how unit testing can go beyond the constraints of the hardware on which it is running? Unless you replaced the basic building blocks with a mockup that opens up the cracks, so to say, and you are testing code that sits between your own building blocks replacement and the testing framework, at the cost of considerably reduced performance? What would be the relative merits between this approach and something that tries to find flaws in a more systematic way, by reasoning about what can go wrong and/or trying to provide formal ~~prove~~proof that the algorithm is correct?

For Java, I'm thinking of the following from <>: "5.2.1 Operations on Volatiles are sequentially consistent." To achieve that, you would need those heavy #StoreLoad barriers, although a compiler can do a lot to soften the impact on performance by eliminating redundant barriers and/or relocating them, even though they're still scheduled statically. A less sophisticated compiler, or a library of volatiles/atomics, may have to apply barriers directly around the volatiles that effectively, if not intentionally, make volatiles sequentially consistent with everything, so the interlocks are costlier than intended. Luckily C++ atomics don't have the mutual sequential consistency to consider.

From my still limited understanding of the matter the mapping seems strict enough (maybe a little too strict?), but in a slightly more complex example it would probably be difficult to model the exact constraints defined in the Java Language Specification, which probably cannot be specified in a linear way even within a control block, and probably require a graph instead, which can be flattened in more than one way.

Does that make sense?

Dmitry_Vyukov · ‎08-05-2008

Raf_Schietekat:
I don't see how unit testing can go beyond the constraints of the hardware on which it is running? Unless you replaced the basic building blocks with a mockup that opens up the cracks, so to say, and you are testing code that sits between your own building blocks replacement and the testing framework, at the cost of considerably reduced performance?

Yes. I replace 'basic building blocks'. My std::atomic<> is special object which tracks 'happens-before' relations. And this is why one have to use rl::var<> instead of plain variables. And actually execution of unit test is single-threaded, so underlying hardware memory model plays no role at all. I manually switch between different "user threads".
As to performance. Yes, it's reduced. As compared to "unchecked run". But I don't see any value in high speed of execution of unit-test, when unit-test doesn't actually test anything.
Execution speed is about 100'000 fully checked executions per second (very depends on unit-test). Most errors are found on iterations 1-10. Very subtle errors are found on iterations 1-10'000. So basically after 1 second of execution one has some confidence.
If you want to check *every possible* "interleaving" (it's possible with 'full search scheduler'), then, yes, execution speed plays significant role, because test can run 1 munute, 1 hour, 1 day...

Raf_Schietekat:

What would be the relative merits between this approach and something that tries to find flaws in a more systematic way, by reasoning about what can go wrong and/or trying to provide formal ~~prove~~proof that the algorithm is correct?

I've considered formal model verification too, and have concluded that it's not suitable to test real life C/C++ code, it's more for "scientists", so to say :)
And I'm an engineer, interested in testing real-life synchronization algorithms written in C/C++, including bit manipulation, pointer compression, access violations as part of the algorithm etc.
Also with my tool one don't have to rewrite his algorithm in some "scientific form", and can you his IDE/debugger of choice.

Dmitriy V'jukov

Dmitry_Vyukov · ‎08-05-2008

Raf_Schietekat:

For Java, I'm thinking of the following from <>: "5.2.1 Operations on Volatiles are sequentially consistent." To achieve that, you would need those heavy #StoreLoad barriers, although a compiler can do a lot to soften the impact on performance by eliminating redundant barriers and/or relocating them, even though they're still scheduled statically. A less sophisticated compiler, or a library of volatiles/atomics, may have to apply barriers directly around the volatiles that effectively, if not intentionally, make volatiles sequentially consistent with everything, so the interlocks are costlier than intended. Luckily C++ atomics don't have the mutual sequential consistency to consider.

From my still limited understanding of the matter the mapping seems strict enough (maybe a little too strict?), but in a slightly more complex example it would probably be difficult to model the exact constraints defined in the Java Language Specification, which probably cannot be specified in a linear way even within a control block, and probably require a graph instead, which can be flattened in more than one way.

Yes, I agree, that it's difficult to manually model Java's volatiles. It's possible to place #StoreLoad after every volatile store, for example. But this is not very good approximation. It's possible to try to make some manual "optimization" wrt #StoreLoad placement...
But note that C#/CLI doesn't have such problems, because C#'s volatiles are not sequentially consistent.
Now I am working on incorporating support for Java/C#. Wrt Java's volatiles I going to use *dynamic* scheduling of #StoreLoad barriers. Something like this:

struct thread
{
bool store_load_pending;

void volatile_store()
{
store_load_pending = true;
...
}

void store_load_fence()
{
// if user manually execute full fence between volatile store and volatile load,
// then run-time will not emit additional fence
store_load_pending = false;
...
}

void volatile_load()
{
if (store_load_pending)
store_load_fence();
...
}
};

So I will try to execute #StoreLoad as late as possible. I think that this is "the worst possible conforming implementation". And "worst" in context of unit-testing is good, because it can reveal more errors.

Dmitriy V'jukov

Dmitry_Vyukov · ‎08-05-2008

If you would not mind, please, let's continue discussion here:
http://groups.google.com/group/relacy/browse_frm/thread/b34be1439060fe84

This is not actually about TBB, and I don't want such information to spread over different forums.

Dmitriy V'jukov

Dmitry_Vyukov · ‎08-15-2008

I want to announce release 1.1 of Relacy Race Detector.

First of all, now you can freely DOWNLOAD latest version of Relacy
Race Detector DIRECTLY FROM WEB:
http://groups.google.com/group/relacy/files

Main change in release 1.1 is support for standard synchronization
primitives:
1. mutex (std::mutex, pthread_mutex_init, InitializeCriticalSection)
2. rw_mutex (pthread_rwlock_init, InitializeSRWLock)
3. condition variable (std::condition_variable,
std::condition_variable_any, pthread_cond_init,
InitializeConditionVariable)
4. semaphore (sem_init, CreateSemaphore)
5. event (CreateEvent)

Now you can test more traditional "mutex-based" algorithms (i.e. no
lock-free), and Relacy still will detect data races, deadlocks,
livelocks, resource leaks, incorrect usage of API etc.
Also you can test mixed lock-based/lock-free algorithms. For example
fast-path is lock-free, and slow-path is mutex-based.
For examples see 'spsc_queue' example in distributive, it uses
std::mutex and std::condition_variable as well as lock-free fast-path.
And following manual implementation of condition_variable via Win
API's CreateEvent and CreateSemaphore:
http://groups.google.com/group/comp.programming.threads/msg/30c2ec41c4d498a2

Also I add RL_DEBUGBREAK_ON_ASSERT and RL_MSVC_OUTPUT options. And
initial_state/final_state parameters. See details here:
http://groups.google.com/group/relacy/web/advanced-moments

Dmitriy V'jukov

Synchronization Algorithm Verificator for C++0x