Solved: Suspicious helgrind "possible datarace" errors for tbb::concurrent_queue

pachash · ‎07-24-2009

Guys, I'm using tbb21_20080605oss on Linux x86 with gcc-4.3.3 and I'm getting the following suspicious helgrind errors when using tbb::concurrent_queue:

==25178== Possible data race during write of size 4 at 0x5f2f404 by thread #1
==25178== at 0x53FEE26: tbb::internal::micro_queue::push(void const*, unsigned int, tbb::internal::concurrent_queue_base_v3&) (in /lib/tbb/build/tbb21_20080605oss/build/linux_ia32_gcc_cc4.3.3_libc2.9_kernel2.6.28_release/libtbb.so.2)
==25178== by ...: ... (in ...local/game/wc/shared/lib/tbb/build/tbb21_20080605oss/build/linux_ia32_gcc_cc4.3.3_libc2.9_kernel2.6.28_release/libtbb.so.2)
==25178== by ...: tbb::concurrent_queue<gmesh::Log::WriteRequest, tbb::cache_aligned_allocator<gmesh::Log::WriteRequest> >::push(gmesh::Log::WriteRequest const&)
...
==25178== This conflicts with a previous read of size 4 by thread #2
==25178== at 0x53FF27B: tbb::internal::micro_queue::pop(void*, unsigned int, tbb::internal::concurrent_queue_base_v3&) (in /home/pachanga/dev/local/game/wc/shared/lib/tbb/build/tbb21_20080605oss/build/linux_ia32_gcc_cc4.3.3_libc2.9_kernel2.6.28_release/libtbb.so.2)
==25178== by 0x53FF4C0: tbb::internal::concurrent_queue_base_v3::internal_pop_if_present(void*) (in /lib/tbb/build/tbb21_20080605oss/build/linux_ia32_gcc_cc4.3.3_libc2.9_kernel2.6.28_release/libtbb.so.2)
==25178== by 0x53934E5: tbb::concurrent_queue<gmesh::Log::WriteRequest, tbb::cache_aligned_allocator<gmesh::Log::WriteRequest> >::pop_if_present(gmesh::Log::WriteRequest&) (in /foo.so)
...

I'm using concurrent_queue for log messages, which are collected in a separate log and flushed to disk.

Should I worry about these errors?

Thanks.

Alexey-Kukanov · ‎07-24-2009

There is internal synchronization in TBB that a tool cannot recognize as correct. E.g. some common synchronization patterns (for example, test-and-test-and-set) might have benign data races. Also, tools usually cannot recognize synchronization that does not use "standard" primitives (such as pthread_mutex) and instead uses carefully designed protocols based on atomic operations.
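
To make the test-and-test-and-set pattern concrete, here is a minimal sketch of such a spin lock. It is not TBB's code: the names `TtasLock` and `count_with_ttas` are hypothetical, and it assumes C++11 `std::atomic`, which is newer than the TBB release discussed in this thread. The inner loop only reads the flag, without holding any lock, while another thread may be writing it in `unlock()`. That unprotected read is the kind of access a tool that only understands pthread-style primitives may report, even though the protocol as a whole is correct.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical test-and-test-and-set spin lock; illustrative only, not taken from TBB.
class TtasLock {
    std::atomic<bool> locked_{false};
public:
    void lock() {
        for (;;) {
            // "Test": spin on a read only, so waiting threads do not keep writing the flag.
            while (locked_.load(std::memory_order_relaxed)) {
                std::this_thread::yield();
            }
            // "Test-and-set": try to take the lock; acquire keeps the critical section after it.
            if (!locked_.exchange(true, std::memory_order_acquire)) return;
        }
    }
    // Release makes the writes of the critical section visible to the next owner.
    void unlock() { locked_.store(false, std::memory_order_release); }
};

// Increments a plain counter from several threads under the lock and returns the total.
long count_with_ttas(int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    TtasLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;  // protected by the lock
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Calling `count_with_ttas(4, 10000)` should return 40000 if the lock works; a lost update would show up as a smaller total.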

I cannot advise you to simply not worry, because as a TBB developer I have a biased view :). Instead, I will say what I would possibly do to decide whether to trust the library or the tool.
- (Simpler way) Build a unit test for your application that thoroughly and intensively exercises the related functionality (logging, in your case) from multiple threads and checks that all (and only) messages sent to the log were in fact flushed. Run it multiple times with a varied number of threads, including 2x - 4x more than the number of cores on your system (to ensure threads get preempted in different places of the code). If it does not fail under any conditions, I would consider it enough evidence that the logging implementation (including concurrent_queue as part of it) works correctly. Though of course it is not a logical proof of the absence of errors.
- (Harder way) All TBB sources are available, and also binaries with debug information. Can the tool point to the exact lines of code (ideally, exact variables) where it suspects races? If yes, you might be able to analyze the code and decide whether the report is a false positive, a true but harmless race, or a real bug.
- (Another hard way) Check the concurrent_queue implementation with tools for formal algorithm verification, e.g. Chess, Spin, Relacy.
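
The "simpler way" above could be sketched roughly as follows. This is only an illustration under stated assumptions: `LockedQueue` is a hypothetical mutex-based stand-in so the example compiles without TBB, and it mimics just the `push` and `pop_if_present` calls that appear in the helgrind trace. `stress_queue` is likewise a made-up name. To test the real thing, the queue type would be swapped for tbb::concurrent_queue and the consumer for the actual log flusher.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Stand-in queue (assumption): offers only push() and pop_if_present().
template <typename T>
class LockedQueue {
    std::mutex m_;
    std::queue<T> q_;
public:
    void push(const T& v) {
        std::lock_guard<std::mutex> g(m_);
        q_.push(v);
    }
    bool pop_if_present(T& out) {
        std::lock_guard<std::mutex> g(m_);
        if (q_.empty()) return false;
        out = q_.front();
        q_.pop();
        return true;
    }
};

// Producer p pushes the ids p*per_producer + i; a single consumer counts each id it pops.
// Returns true only if every id arrived exactly once ("all and only").
template <typename Queue>
bool stress_queue(int producers, int per_producer) {
    assert(producers > 0 && per_producer > 0);
    Queue q;
    const std::size_t total = static_cast<std::size_t>(producers) * per_producer;
    std::vector<int> seen(total, 0);
    std::atomic<int> finished_producers{0};
    bool only_known_ids = true;

    std::thread consumer([&] {
        int id = 0;
        for (;;) {
            // Sample the flag before polling: if all producers were already done
            // and the queue is then empty, nothing more can arrive.
            const bool all_done = finished_producers.load() == producers;
            if (q.pop_if_present(id)) {
                if (id >= 0 && static_cast<std::size_t>(id) < total) ++seen[id];
                else only_known_ids = false;  // a message nobody sent
            } else if (all_done) {
                break;
            } else {
                std::this_thread::yield();
            }
        }
    });

    std::vector<std::thread> pool;
    for (int p = 0; p < producers; ++p) {
        pool.emplace_back([&q, &finished_producers, p, per_producer] {
            for (int i = 0; i < per_producer; ++i) q.push(p * per_producer + i);
            ++finished_producers;
        });
    }
    for (auto& t : pool) t.join();
    consumer.join();

    for (int count : seen) {
        if (count != 1) return false;  // lost or duplicated message
    }
    return only_known_ids;
}
```

Running `stress_queue<LockedQueue<int> >(n, 10000)` repeatedly, with `n` ranging up to several times the number of cores, gives the oversubscription described in the first bullet. A passing run is evidence, not proof.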

pachash · ‎07-24-2009

Quoting - Alexey Kukanov (Intel)

There is internal synchronization in TBB that a tool cannot recognize as correct. E.g. some common synchronization patterns (for example, test-and-test-and-set) might have benign data races. Also, tools usually cannot recognize synchronization that does not use "standard" primitives (such as pthread_mutex) and instead uses carefully designed protocols based on atomic operations.

Thanks a lot for the clarification.

Dmitry_Vyukov · ‎07-24-2009

Quoting - Alexey Kukanov (Intel)

- (Another hard way) Check the concurrent_queue implementation with tools for formal algorithm verification, e.g. Chess, Spin, Relacy.

CHESS and SPIN, like Helgrind, are all memory-model blind, so they are not actually applicable to the verification of such code.
SPIN will also require you to rewrite the algorithm in a special language, Promela.
Relacy Race Detector is designed especially for verification of lock-free algorithms that exploit relaxed memory models.
Relacy ROCKS!
Yes, I'm an interested party ;)
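
As a small illustration of what "memory-model blind" can mean in practice, consider the hand-off below. It is a hypothetical sketch in C++11 (`Mailbox` and `hand_off` are made-up names, unrelated to TBB or Relacy). Correctness depends on the release/acquire pair on `ready`. If both orderings were weakened to `memory_order_relaxed`, the program would still look correct to a tool that only explores sequentially consistent interleavings, because the flag is always set after the payload in program order. Under the C++ memory model, though, the read of `payload` would then be a data race.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical single-slot hand-off; names are illustrative only.
struct Mailbox {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // publication flag
};

// Passes `value` from a producer thread to a consumer thread and returns what was received.
int hand_off(int value) {
    Mailbox box;
    int received = -1;

    std::thread producer([&] {
        box.payload = value;                                // plain write
        box.ready.store(true, std::memory_order_release);   // publishes the write above
    });
    std::thread consumer([&] {
        // The acquire load pairs with the release store; once it reads true,
        // the write to payload is guaranteed to be visible.
        while (!box.ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        received = box.payload;
    });

    producer.join();
    consumer.join();
    assert(received == value);  // holds thanks to the release/acquire pairing
    return received;
}
```

On x86 the relaxed variant would most likely also appear to work, which is part of why testing alone cannot settle such questions.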