Quote:Peter L. wrote:

Peter_L_1 · ‎08-25-2014

One of our customers is hitting a deadlock condition in our application where a lock is being held forever by one of our application threads. I have taken a memory dump of the application while deadlocked, and I believe I have traced the problem to an infinite loop in the tbbmalloc dll. Here is the stack trace:

>   tbbmalloc.dll!rml::internal::internalPoolMalloc(rml::internal::MemoryPool * memPool=0x00000001968aba91, unsigned __int64 size=0x00000001968a8000) Line 2288   C++
    tbbmalloc.dll!rml::internal::allocateAligned(rml::internal::MemoryPool * memPool, unsigned __int64 size=0x0000000000000030, unsigned __int64 alignment) Line 2105   C++
    tbbmalloc.dll!scalable_aligned_malloc(unsigned __int64 size, unsigned __int64 alignment=0x0000000000000030) Line 2754   C++
    rtserver.exe!operator new(unsigned __int64 size=0x0000000000000030) Line 275   C++

It appears that the actual last function entered (inlining optimizations cause these calls not to show up in the stack trace) is privatizePublicFreeList, called from getPublicFreeListBlock, which matches up with two previous posts to these forums:

https://software.intel.com/en-us/forums/topic/278331

https://software.intel.com/en-us/forums/topic/372934

I believe that in both of these earlier reports, as well as mine, an older version of tbbmalloc is in use (I am using 4.1.2013.613)

What I would like to know is whether or not this bug has been fixed in the latest version of tbbmalloc. I have done a comparison of the source code, and it does not appear that any meaningful changes have been made to privatizePublicFreeList, but the following was added to getPublicFreeListBlock:

!>     if (!FencedLoad((intptr_t&)mailbox)) // hotpath is empty mailbox
->         return NULL;
!>     else { // mailbox is not empty, take lock and inspect it

Is this intended to fix this infinite loop bug? The infinite loop appears to happen because a linked list that seems to be expected to end in a NULL or UNUSABLE, is in fact a circular list that does not contain a NULL or UNUSABLE, so the loop never reaches the end of the list... I believe I have verified that this is the case in my memory image.

Our customer is running on Windows Server 2008 R2, on a virtualized server. Our app was built with VS2012 with statically linked CRT. The dll is the vc9 build of 4.1.2013.613 version.

One other question: can we simply drop the newest version of the tbbmalloc DLL onto the customer's server? Is there any reason we need to rebuild our app? The only part of TBB that we use is the memory allocator (scalable_aligned_malloc).

I cannot offer a reproducible test case, as this same build of our application runs fine on all of our other customers, and the bug does not reproduce on any of our test environments. At the one customer where the problem does occur, it takes many hours (even days) of runtime before the problem occurs.

Vladimir_P_1234567890 · ‎09-01-2014

Peter L. wrote:

Our customer is running on Windows Server 2008 R2, on a virtualized server. Our app was built with VS2012 with statically linked CRT. The dll is the vc9 build of 4.1.2013.613 version.

One other question: can we simply drop the newest version of the tbbmalloc DLL onto the customer's server? Is there any reason we need to rebuild our app? The only part of TBB that we use is the memory allocator (scalable_aligned_malloc).

I cannot offer a reproducible test case, as this same build of our application runs fine on all of our other customers, and the bug does not reproduce on any of our test environments. At the one customer where the problem does occur, it takes many hours (even days) of runtime before the problem occurs.

Yes, you can drop the latest tbbmalloc.dll (4.3) instead of old one and check.

--Vladimir

Peter_L_1 · ‎09-02-2014

OK, thanks, we will try that and I will let you know if the bug still occurs.

Any feedback on the likelihood that this problem has been fixed? My guess is that the bug is actually that the list is circular... I expect it is supposed to be a logical invariant that the list never be circular, so whatever code is updating/inserting into the list is the ideal place to find and fix a bug. But since the assumption of data integrity is sometimes false, "defensive" coding would suggest changing the traversal of the list to "the whole list, not to exceed the first N elements" where N is set to the logical limit of the size of the list, or the actual size of the list (if its known), or a small constant (for performance reasons, one might only want to search a portion of the list). I do not know if there is a logical limit for N, but I would be surprised if the intention is for this list to be allowed to get very long.

Vladimir_P_1234567890 · ‎09-02-2014

We have not managed to reproduce the hang yet like you mentioned:

Peter L. wrote:

I cannot offer a reproducible test case, as this same build of our application runs fine on all of our other customers, and the bug does not reproduce on any of our test environments. At the one customer where the problem does occur, it takes many hours (even days) of runtime before the problem occurs.

So this is not clear whether there is a bug in tbbmalloc or not. We are looking into this.

Peter_L_1 · ‎09-05-2014

Thanks Vladimir! Of course its possible we have such a race condition... its hard to prove the negative, right? If it is possible for the allocator to detect the second free and throw an exception, that would be nice, but I can understand why it might not be possible. And I think some defensive programming might be worthwhile at the place where that infinite loop occurs... again, throwing an exception would be better than looping.

Thanks for your help with the investigation... we will try and find a double-free race condition in our code.

Peter_L_1 · ‎09-05-2014

Quick followup question: Is there a diagnostic mode or diagnostic version of the allocator that could help find this type of bug? Perhaps detect this type of double free and throw an exception? Or capture something in a log? (even just the thread id would be helpful... a stack trace would be even better)

Thanks!

Vladimir_P_1234567890 · ‎09-06-2014

Hello, we managed to reproduce the hang. It might happen in case the same pointer in being freed twice from different threads like in the sample below. One of next allocations can hang in this case.

#include <thread>
#include "tbb/scalable_allocator.h"

int main(void)
{
    for ( int i = 0; i < 10000; i++ )
    {
        int *a = (int*)scalable_malloc(32);
        std::thread thread1(
            {
            scalable_free(a);
        }
        );
        std::thread thread2(
            {
            scalable_free(a);
        }
        );
        thread1.join();
        thread2.join();
        a = (int*)scalable_malloc(32);
        scalable_free(a);
    }
    return 0;
}

double free() of non-zero pointers lead to undefined behavior.

Could you check, there are no such dataraces in the user code. For example in the below code there are 3 data races at once:

#include <thread>
#include "tbb/scalable_allocator.h"

int main(void)
{
    for ( int i = 0; i < 10000; i++ )
    {
        int *a = (int*)scalable_malloc(32);
        std::thread thread1(
            [&a]{
            //triple data race
            if ( a ){
                scalable_free(a);
                a = NULL;
            }
        }
        );
        std::thread thread2(
            [&a]{
            //triple data race
            if ( a ){
                scalable_free(a);
                a = NULL;
            }
        }
        );
        thread1.join();
        thread2.join();
        //possible hang
        a = (int*)scalable_malloc(32);
        scalable_free(a);
    }
    return 0;
}

--Vladimir

Vladimir_P_1234567890 · ‎09-08-2014

We are looking for adding similar diagnostics to debug libraries but anyway it will find real problems that can occur very rarely.

To find potential problems / data races that might lead to real problems it would better to try/use Intel Inspector (https://software.intel.com/en-us/intel-inspector-xe). Its role is to find exactly such problems.

Rishi_A_ · ‎02-25-2015

Hello @Peter, were you able to find the double free in your code? Also, @Vladimir, is there a more recent version of tbb that protects against duplicate free calls, by throwing an exception, or something similar? We're using TBB 4.2 Update 2 on Scientific Linux.

Peter_L_1 · ‎02-25-2015

@Rishi, yes, I think we did... I don't recall the specifics (wasn't MY code! :) )

@Vladimir, did you ever add any defensive coding or anything else to TBBMalloc to avoid the infinite loop in this case? I understand that a double free leads to undefined behavior, so from a pedantic point of view, there is not a bug here, but I still think the situation could be handled better than falling into an infinite loop. :)

Vladimir_P_1234567890 · ‎02-25-2015

Yes, it was addressed in 4.3 update 1 debug library.

- More checks for an incorrect address to release added to the debug version of the memory allocator.

--Vladimir

Rishi_A_ · ‎02-26-2015

@peter and @vladimir thank you for a quick response! Much appreciated.

Just saw this too:

Intel TBB 4.3 Update 1
TBB_INTERFACE_VERSION == 8001

Changes (w.r.t. Intel TBB 4.3):

- The ability to split blocked_ranges in a proportion, used by
    affinity_partitioner since version 4.2 Update 4, became a formal
    extension of the Range concept.
- More checks for an incorrect address to release added to the debug
    version of the memory allocator.
- Different kind of solutions for each TBB example were merged.

Does this mean the new checks aren't available in the release version? For performance reasons perhaps? We should probably not be using the debug version in a production environment, right?

Rishi_A_ · ‎02-26-2015

Just saw this too: https://github.com/sailrish/tbb/blob/master/src/tbbmalloc/frontend.cpp#L403-L433

There are several comparisons as part of this operation, but other than that, it doesn't seem hugely intense to have to execute this in a prouduction app.

    inline FreeObject *findObjectToFree(const void *object) const;
    void checkFreePrecond(const void *object) const {
#if MALLOC_DEBUG
        const char *msg = "Possible double free or heap corruption.";
        // small objects are always at least sizeof(size_t) Byte aligned,
        // try to check this before this dereference as for invalid objects
        // this may be unreadable
        MALLOC_ASSERT(isAligned(object, sizeof(size_t)), "Try to free invalid small object");
        // releasing to free slab
        MALLOC_ASSERT(allocatedCount>0, msg);
        // must not point to slab's header
        MALLOC_ASSERT((uintptr_t)object - (uintptr_t)this >= sizeof(Block), msg);
        if (startupAllocObjSizeMark == objectSize) // startup block
            MALLOC_ASSERT(object<=bumpPtr, msg);
        else {
            // non-startup objects are 8 Byte aligned
            MALLOC_ASSERT(isAligned(object, 8), "Try to free invalid small object");
            MALLOC_ASSERT(allocatedCount <= (slabSize-sizeof(Block))/objectSize
                          && (!bumpPtr || object>bumpPtr), msg);
            FreeObject *toFree = findObjectToFree(object);
            // check against head of freeList, as this is mostly
            // expected after double free
            MALLOC_ASSERT(findObjectToFree(object) != freeList, msg);
            // check against head of publicFreeList, to detect double free
            // involiving foreign thread
            MALLOC_ASSERT(findObjectToFree(object) != publicFreeList, msg);
        }
#else
        suppress_unused_warning(object);
#endif
    }

Rietschin__Axel · ‎10-09-2016

Guys, I think you have a race, and it's still there in today's TBB 2017 Update 1

Like the OP, I ran into this situation where a thread in my server was stuck in that same loop in privatizePublicFreeList(). I checked for double-frees all over the place but I do not use the allocator directly, only through containers such as tbb::concurrent_hash_map, tbb::concurrent_unordered_set and... tbb::combinable.

Before going any further, I want to state that your loop should be protected against infinite looping:

    while( isSolidPtr(temp->next) ){ // the list will end with either NULL or UNUSABLE
        temp = temp->next;
        if (--allocatedCount == UINT16_MAX) {
            // failfast: we should never get there
        }
        MALLOC_ASSERT( allocatedCount < (slabSize-sizeof(Block))/objectSize, ASSERT_TEXT );
    }

...or something similar, in the retail code. This would save a lot of people a lot of time.

Now, I've found a way to trigger the problem quite repeatably: just assign a scalar to a combinable, as in

tbb::combinable<int> bomb;

bomb = 0;

Just make sure you do that assignment from a few threads concurrently. Yes, I know it should be bomb.local() = 0 but instead of explaining that you should thank me for the following minimal repro :-) The two for loops are entirely irrelevant, they just create favorable conditions.

int main()
{
    // Assertion toFree != publicFreeList failed on line 432 of file .. / .. / src / tbbmalloc / frontend.cpp
    // Detailed description : Possible double free or heap corruption.

    combinable<uint32_t> g_combinable;

    parallel_for(blocked_range<size_t>(0, static_cast<size_t>(-1)), [&g_combinable](const blocked_range<size_t>& range)
    {
        for (size_t n = range.begin(); n != range.end(); ++n)
        {
            g_combinable = n; // CORRUPTS THE TBB ALLOCATOR. Should be g_combinable.local() = n;
        }
    });

    return 0;
}

My guess is that swapping the guts of the combinable under the nose of other threads creates the race that leads to double frees and ultimately leads the allocator to spin forever. Either a lock is missing in the combinable, or you should perhaps consider removing the move constructor and the assignment operator altogether. You should also look for a similar problem in other containers that use the tbb_allocator.

Axel

Alexei_K_Intel · ‎10-10-2016

Hi Axel,

Thank you very much for the reproducer. The main issue is that we allow assignment of anything to tbb::combinable. It is an error-prone interface that does not have any checks at compile time. I.e. if user forgets to call "local()" and performs assignment to the object of tbb::combinable class it must be a compile time error.

Why does the assignment works and why it leads to double free? As you have already mentioned, the assignment operator is called; however, if we look at the documentation there is no assignment operator for "any-type" (besides the tbb::combinable type). We fell a victim to the template constructor that can construct tbb::combinable from anything implicitly. So the object of tbb::combinable is created implicitly and the move assignment operator is called (because it is a temporary object). The implementation swaps the internal fields one by one in a thread unsafe manner. One of the fields (the construct callback) is always allocated from the allocator (tbb::cache_aligned_allocator by default). So it can be swapped into two (or more) implicitly created tbb::combinable objects that are automatically destroyed after the assignment; as a result, double-free occurs.

We definitely should prohibit implicit template constructor, i.e. make it explicit. However, I am not sure that it worth adding thread-safety of the copy and assignments methods in tbb::combinable because I do not see reasonable use cases but it will affect the implementation performance, including "local()" that should be aware of possible container copying. As for infinite loop in tbbmalloc, it is also a question of performance because the checks will be performed for each "free" call and can affect applications that use dynamic memory intensively. So we have these checks only in a debug version of tbbmalloc to provide a possibility to check double free conditions. It should be noted that it is a quite difficult problem to detect a double free occurrence in a common case because the deallocating object is already invalid. Perhaps, the allocator implementation can remember all allocated objects and check if the deallocating object is allocated but it seems ineffective solution (in terms of memory and performance).

In any case, sorry for the inconvenience.

Regards, Alex

Rietschin__Axel · ‎10-10-2016

Thanks for getting back so quickly Alex.

Yes it would be great if some safeguard could be put in order to avoid that race!

Regarding the runtime check in the loop, while I agree with your argument in general, I believe the overhead would be minimal in this particular case as the loop easily fits in the L1 and the (never taken) branch would be easily predictable.

Perhaps promote the variable from uint16 to int to avoid partial register stalls (if any), and pick the best test (check the sign or test against -1, not sure which is faster) and I bet the impact will be minimal, if any: how long that linked list is expected to grow in the average case?

People having sub-microsecond-scale perf concerns should probably not do dynamic memory allocation on their hot path anyway :-)

Thanks,
Axel

Alexei_K_Intel · ‎10-10-2016

You are welcome. I suppose your are right: one more runtime check is not a big deal. Moreover, it is already present but only in a debug mode:

MALLOC_ASSERT( allocatedCount < (slabSize-sizeof(Block))/objectSize, ASSERT_TEXT );

"(slabSize-sizeof(Block))/objectSize" is a compile time constant of maximal number of blocks that can be allocated in a slab. We only need to enable it in a release mode if we are sure that there is no performance issues.

Peter_L_1 · ‎10-10-2016

I'm the OP, and I really do think it is a worthwhile change to include this check in the release version of the library. In our case, we had a double-free bug, but because the result of that was an infinite loop, it made it very hard to diagnose. If an exception were thrown, then I expect we would have gotten a core dump, and probably would have found the bug much more quickly. BTW, thanks for a great library!

Rietschin__Axel · ‎10-10-2016

So it seems we have a nice consensus :-)

As far as how to signal the error, for as long as you are sure it's a fatal programmer error leading to a catastrophic failure, I'd just fail fast and stop the process dead, using for example __fastfail (int 0x29) instead of throwing, first because throwing unwinds the stack and this can have lots of side effects, but mainly because exceptions can always be caught and swallowed - and ultimately go unnoticed - while it's a lot harder to overlook a crash.

In general, and it's a bit of a philosophical concern, you will get our attention better by failing fast on fatal programmer error, and the process state captured in the dump will be as close as possible to the conditions that led to the error, making it easier to understand what's going on.

Anyways, I made a trivial change to avoid triggering the tbb:combinable race and my process went back to rock solid. I'm looking forward to use more of TBB in the future!

Alexei_K_Intel · ‎10-11-2016

Peter, Axel,

Thank you very much for feedback. We really want to add some checks to improve the robustness of the implementation and fail "as close as possible to the conditions that led to the error". However, there is a technical problem. This "infinite" loop is just a side effect of double-free issue. The naïve check in this loop will happen much later and on another thread then double-free has occurred. I.e. the thread that performed double-free can already finish and another thread will fail on malloc. As a result, we cannot see the problematic thread in a core dump.

Fortunately, we have some ideas how to detect double-free at the moment of the second free with some level of probability. However, these checks are not trivial and we need to examine the list of free objects (up to 2K elements) for each deallocation that can affect performance. We are thinking about some heuristics not to perform slow checks too often.

Thank you for the fruitful discussion, I am sure it will help us to improve the library.

Regards, Alex

tbbmalloc privatizePublicFreeList appears to have an inifinte loop bug