Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Problem with multithreading and common blocks on a dual core

cartman4
Beginner
1,245 Views
My app creates a child thread to do calculation while the GUI remains active. It works fine on a single-core machine, but behaves inconsistently on a dual core. If I change it to do the calculation in the parent thread (suspending the GUI) it works fine on the dual-core machine, so the problem seems related to the multithreading.

The child thread shares a common block with the parent thread, although the parent thread doesn't modify it while the child is active. My trace output indicates that when the problem occurs, the common block that the child reads is incomplete. It's as if the data in the common block is being copied somewhere else in memory for the child (I don't know why it would be), but the child starts reading it before the copy is complete. If I put a 100 millisecond sleep at the start of the child thread the problem goes away, but of course that's not a good fix. Any ideas on what is happening here, and how I should fix it?

9 Replies
Steve_Nuchia
New Contributor I

The data are "being copied somewhere else" -- into multiple locations in the cache hierarchy. But the cache coherency model in the Windows x86 / x64 world should guarantee that this is invisible to software.

Without seeing the code the only thing I know is that the data haven't been flushed into coherent memory before the child thread is eligible for scheduling. We need to see the code to know why. One possibility is that the common area is being initialized, in part, by an optimized copy loop that uses a specialized memory write instruction. That code *should* end with a "fence" instruction to force coherency if that's what's happening. Plus, the child thread initiation should involve system calls that would act as fences anyway.

So no, I can't think of a reason why the program as you describe it would act that way on typical hardware. We need more information.

-swn

Steven_L_Intel1
Employee
You should protect the shared memory accesses with some sort of synchronization call - mutex, etc. Or if you're using OpenMP, you can specify that the common block is shared. This will make sure that the memory view is consistent across the threads.
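
For example, with OpenMP it might look something like this. This is just a schematic, not your code - the common block /CALCDAT/ and the work inside the loop are invented - but it shows the common block named in a SHARED clause and every access to it serialized through the same named critical section:

program guarded_common
   implicit none
   real    :: accum
   integer :: nupdates
   common /calcdat/ accum, nupdates       ! the shared COMMON block
   integer :: i

   accum    = 0.0
   nupdates = 0

!$omp parallel do shared(/calcdat/)
   do i = 1, 1000
      ! every read-modify-write of the shared data goes through the same
      ! named critical section, so all threads see a consistent view
!$omp critical (calcdat_update)
      accum    = accum + real(i)
      nupdates = nupdates + 1
!$omp end critical (calcdat_update)
   end do
!$omp end parallel do

   print *, 'accum =', accum, '  updates =', nupdates
end program guarded_common

If you are creating the thread yourself rather than using OpenMP, the equivalent is to acquire and release a mutex or critical section around the same accesses.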
jimdempseyatthecove
Honored Contributor III

It is likely that you are starting your child thread before you are done initializing the common block.

Try adding a flag at the end of the common block statically initialized to 0.

Not seeing your code (or a template of your code) makes it difficult to make an assessment.

You may also be doing something like

main launches
1) GUI app thread
2) child thread
main waits for GUI done

And your GUI code assumes the GUI app thread runs first
And/Or your child thread assumes the GUI app thread runs first
And the data initialization code is in the GUI app thread initialization section prior to its message loop

The assumption that the 1st thread launched is the 1st thread run is false.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

I forgot to add: after the GUI app finishes its initialization, set the flag to 1.

Jim

Steve_Nuchia
New Contributor I

Jim's suggestion is an example of lockless concurrent programming, and it appears to be a correct one. It will fix the problem if his guess as to the root cause is correct. If it's something more exotic it may or may not fix or mask the problem.

In detail:

Allocate a memory cell (variable) that is small enough and properly aligned so that it is guaranteed to be updated atomically. Atomicity is not strictly required but it simplifies the analysis.

Initialize it to a known value (zero) before forking; this value signals the slave that the master has not yet completed initialization of the shared data area.

Slave "spins" waiting for the semaphore cell to be updated to a second agreed-upon value, signifying completion of the initialization. The spin loop should include some kind of sleep or yield call so your program will run in a uniprocessor environment in finite time.

Slave now owns the data structure, including the semaphore. When it is finished it may signal this fact to the master either by writing a third agreed-upon value into the semaphore cell or by some other means. The master may poll the semaphore periodically, e.g. on idle events or using a timer.

Updates to the semaphore cell are atomic by construction. The programmer must ensure that these updates are serialized with respect to all other updates to shared data. Unless you are using special cache control instructions, this will be the case (on conventional x86 / x64 platforms) for any C or C++ code that has sequence points before and after the synchronization mechanism, provided the shared data is all declared with the "volatile" type modifier. Otherwise, you get what you pay for. Those special instructions exist to improve the performance of things like array initialization, so there is a non-zero chance they are in play unless you can rule it out.

I don't know exactly what the equivalent caveats are in Fortran, but in general the language permits more movement of memory operations (with respect to their apparent source-level ordering) than C and C++ do. The documentation for the mechanism you are using to implement the fork operation should lead you to suitable incantations to ensure serializability of memory access. For OpenMP, the suggestion to mark the common block as shared in the directive sounds like a good idea to me.
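
To make that concrete in Fortran, here is a bare-bones sketch. It is not the original poster's code: the common block /WORKDAT/, the array size, and the use of OpenMP sections to stand in for the GUI thread and the calculation thread are all placeholders. The part that matters is the flag protocol (0 = not ready, 1 = data ready, 2 = slave done) and the explicit flushes, which play the role of the fences discussed above:

program flag_handshake
   implicit none
   integer, parameter :: NOT_READY = 0, READY = 1, DONE = 2
   real    :: work(1000), total
   integer :: state
   common /workdat/ work, total, state    ! data shared between the threads
   integer :: i, snapshot

   state = NOT_READY

   ! two OpenMP sections stand in for the two threads of the real program
   ! (the sketch needs at least two threads to make progress)
!$omp parallel sections num_threads(2) shared(/workdat/) private(i, snapshot)

!$omp section
   ! --- "master" / GUI-side thread ---
   do i = 1, 1000
      work(i) = real(i)                   ! fill the shared data ...
   end do
!$omp flush                               ! ... make it visible to the slave ...
!$omp atomic write
   state = READY                          ! ... and only then raise the flag
   do                                     ! poll for completion; a real GUI would
!$omp atomic read                         ! do this on idle events or a timer
      snapshot = state
      if (snapshot == DONE) exit
   end do
!$omp flush                               ! pull the slave's result into view
   print *, 'total =', total

!$omp section
   ! --- "slave" / calculation thread ---
   do                                     ! spin until the master raises the flag;
!$omp atomic read                         ! a real spin loop should sleep or yield
      snapshot = state
      if (snapshot == READY) exit
   end do
!$omp flush                               ! make the master's writes visible here
   total = sum(work)                      ! the slave now owns the shared data
!$omp flush                               ! publish the result ...
!$omp atomic write
   state = DONE                           ! ... then hand ownership back

!$omp end parallel sections
end program flag_handshake

Jim's statically initialized flag is the two-state version of the same protocol.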

-swn

cartman4
Beginner
Thanks for all the input! While attempting to add mutex protection to the common blocks I discovered that the child thread is accessing other global data that the parent thread may still be initializing. Unfortunately this is a legacy app with tons of global data so it may take me a while to find where the clash is. I don't think its original designers had multithreading in mind.
jimdempseyatthecove
Honored Contributor III

Cartman (Eric?:)

When you converted to multi-threaded you effectively moved (changed) the execution sequence. This is no fault of the original designers. You (or whoever enhanced the code) must assume responsibility for the faux pas.

While using a compiler option to generate code that checks for use of uninitialized variables may help catch some of the errors, it won't catch:

1) Race conditions where the problem does not show up when you are looking for it, but will show up when you are not looking for it.

2) Situations where a variable is initialized to a default value at the start of the application, then later reset by "your parent thread" for use by the "child thread", but where the "child thread" (impatient as they often are) uses the default value instead of the intended value because it reads the value before the parent thread finishes resetting it.

This is another responsibility the person converting the code must assume.

Often the only way to do this (reliably) is to walk the code and examine every variable and ascertain if there is a sequencing problem.

Jim Dempsey

Steven_L_Intel1
Employee
I would suggest that you download a free trial of Intel Thread Checker and spend some time with it. It may help you find many problems.
Steve_Nuchia
New Contributor I

http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-1.pdf

I'm rather enjoying this paper on the problem of grappling mentally with parallelism as implemented in current languages. A bit esoteric for the present discussion, but some participants may find it interesting.

-swn
