omp data race condition detected

mullervki · ‎09-23-2015

Hello,

Intel Inspector has detected a data race condition that I cannot understand. The entire program has A SINGLE omp pragma:

#pragma omp parallel num_threads(m)
{
#pragma omp for private(n)
      for(n=0; n<m; ++n) {
         efunc(data);
      }
}

Of course, "efunc" is supposed to be a thread-safe function and I'm using Inspector to verify this.

To my surprise, Inspector reports that the data race condition is happening between the Master and a Worker thread. This has me baffled: isn't there an implicit synchronization of the threads at the end of the loop? In other words, once the Master thread reaches the loop, it will only continue once all threads in the loop have finished. Correct?

If that's the case, how can Inspector report a data race condition between the Master and one of the Worker threads?

Thanks.

mullervki · ‎09-29-2015

I managed to create a stand-alone, self-contained program that replicates the problem I'm running into. Below is the entire program. The loop sizes were adjusted to make sure I had multiple threads running simultaneously on my machine. I have 32 cores and 192GB of memory on my machine.

If I did this correctly, then there should be a png image file showing Inspector's screen reflecting the data race condition. I can't see where this is happening, so if anybody can point it out to me I would appreciate it.

#include <omp.h>
#include <malloc.h>
#include <stdio.h>

static void
myfunc(void *a)
{
   int *b = (int*)a;
   int *c;
   int i, n, num, sum;

   num = b[0];
   c = (int*)malloc(num*sizeof(int));
   c[0] = 0;
   for(i = 0; i < 10000; ++i) {
      for(n = 1; n < num; ++n) {
         c = c[n-1] + 1;
      }
   }

   sum = 0;
   for(n = 0; n < num; ++n) {
      sum += c;
   }
   b[1] = sum;
   free (c);
}

int main(int argc, char* argv[])
{
   int sizes[10][2], n, nthreads;

   for(n = 0; n < 10; ++n) {
      sizes[0] = 1000*(n+1);
   }

   for(nthreads = 1; nthreads <= 10; ++nthreads) {
#pragma omp parallel num_threads(nthreads)
{
#pragma omp for private(n)
   for(n = 0; n < 10; ++n) {
      myfunc(sizes);
   }
}

   for(n = 0; n < 10; ++n) {
      printf("sum[%d] = %d\n",n,sizes[1]);
   }
   }
}

mullervki · ‎09-29-2015

Here's the png image I mentioned above

TimP · ‎10-01-2015

At a quick glance, it appears that all threads are updating a[1] and your way of calling myfunc() stores those in distinct locations only up to 10 threads, while Inspector may not be digging down to see that you have so limited it. I also don't know whether Inspector may get tripped up by what to me is obscure syntax.

mullervki · ‎10-01-2015

Tim,

Thanks for looking into this. The main program calls "myfunc(sizes)", and "n" is the loop variable that omp is supposed to thread on.

So on one thread, myfunc is called with sizes[0], on the other with sizes[1], then sizes[2], etc. Since "sizes" is declared as sizes[10][2], inside "myfunc" the variable is just a one-dimensional array of size 2. Note how inside "myfunc" I first cast the void* argument into an int* variable that I call "b", then set b[0] and b[1].

In other words, on one thread I may change sizes[0][0],sizes[0][1]. On another sizes[1][0],sizes[1][1], etc. all the way to sizes[9][0],sizes[9][1]. So each thread is working on a different memory location.

Kevin_O_Intel1 · ‎10-01-2015

Tim,

Inspector works at the binary level. So by the time Inspector gets involved everything is just an address. So the syntax is not is not the issues.

In the case of the data races reported, they are all from "c" which is heap allocated, and freed by the end of myfunc.

What I think is happening is when one thread frees "c", a subsequent thread allocates another "c" which just so happens to

be the same address of the last one freed.

Inspector compares the addresses the threads are accessing and they are the same, hence the data race.

Kevin

mullervki · ‎10-01-2015

Kevin,

If each thread is running "myfunc", then "c" is a local stack variable in each of these instances. So I expect each instance of myfunc to be fully self-contained, i.e., each instance of the stack variable "c" is distinct from each other. Am I right? I imagine what's happening as myfunc being called 10 times simultaneously, and each instance of it creates its own stack - hence, its own "c".

So just because the variable has the same name - "c" - it shouldn't mean anything because their address is distinct - each in its own stack.

So I still don't see why one malloc is conflicting with another free.

Kevin_O_Intel1 · ‎10-01-2015

Yes, each thread has its own stack and it's own c... the names are not the issues, they have different addresses.

But the heap is global.

if you run

a = malloc(10)

free(a)

b = malloc(10)

The heap allocator will give a and b the same piece of memory. Once memory is freed it becomes available for reuse.

There is no guarantee that it will give you will the same address, but it is a possibility. And if you are allocating the memory of the same size it

seems likely,

You could try printing "c" for each thread to verify. You could also not free and see what Inspector reports.

Kevin

mullervki · ‎10-01-2015

Kevin,

Thanks again.

Yes: the heap is global, but I assume that malloc and free are thread safe.This means there shouldn't be any memory contention here. If a thread mallocs the data then, until that data is freed again - which will only happen after I'm done using it - then no other thread will malloc the same memory location from the heap. Hence, there cannot be a data race.

If I'm correct, then Inspector is reporting a false positive. And, in a big program like the one I'm trying to use Inspector on, where each thread mallocs and frees memory all the time, Inspector reports so many false positives that it almost makes it unusable.

Is there a way to eliminate these false positives - assuming, of course, that these are indeed false positives?

Kevin_O_Intel1 · ‎10-02-2015

Agreed. The issue looks like a false positive.

You can suppress the issue using Inspector.

https://software.intel.com/en-us/articles/user-editable-suppression-format-for-intel-inspector-xe

mullervki · ‎10-02-2015

Kevin,

Thanks again for the reply.

I'm going to have to give this issue some more thought. I can certainly suppress the issue with the trivial example I posted that reproduces the issue. But my real multi-threaded code is about a half a million lines of code - some written by me, some not.

I was expecting Inspector to allow me to add some controls - even at the cost of slowing down the analysis - to inspect the code in a stricter way so that the false positives would go away. Given the trivial example I posted, I need to consider the value of using Inspector at all. I've spent a couple of weeks going after one false positive. I'm not sure I want to spend the time to look at every false positive just to conclude it's not a false positive at all; I have quite a few of them.

I did post a question to Intel's Premier Support, but got no reply from them after a week. I'll probably just sit on this for a while...

Thanks again.

Kevin_O_Intel1 · ‎10-02-2015

Also consider...

As you mention malloc and free are thread safe. But the cost of this safety is locking.

Using a solution such as heap pools or scalable_malloc in tbb would be preferred.

For your specific case I would need to do some further analysis to comment further.

What is the IPS number of your issue? I'll ask them to escalate the issue.

Hope this helps!

Kevin

mullervki · ‎10-02-2015

I assume that by IPS number you are referring to the case id.

Here it is: 6000132108

As for malloc issues, I wasn't really that concerned with them at first. But, just to be sure, since all our malloc/free/realloc calls go through a single function, I hacked in a mutex to make sure these calls were thread safe. It made no difference, as I was still getting the Inspector false positives.

I need to look at scalable_malloc. Heap pools may be harder, as I wouldn't know what to set the pool sizes to. Are you suggesting that Inspector will do a better job if I switch to a scalable_malloc? Like I said, all our memory allocations are in one location, so it should be relatively easy to replace the malloc calls.

Kevin_O_Intel1 · ‎10-02-2015

I have escalated your IPS issue. We will try to verify exactly what is happening with your test case.

We will be in touch shortly.

Kevin

mullervki · ‎10-02-2015

Kevin,

Wow! Thanks!

Jennifer_D_Intel · ‎10-05-2015

Thank you for the sample code. After compiling it with VC++ v120 I ran it through Inspector XE 2015, which showed the data races you reported. Running the same executable through the 2016 release did not show the data races, although there was a warning about cross-thread stack accesses. Hopefully upgrading to 2016 will solve the issue with your original code as well.

Thanks,

Jennifer