scott_g_ · ‎12-31-2013

I have a situation that developed trying to port a valid OpenMP application to Intel Cilk Plus (this is my first time using Cilk Plus).

I was getting the fatal error at runtime: "error: tried to pass SEH exception c0000005 through a spawn", sometimes two of them in a row.

I found that although I had implemented the Cilk for loops and syncs, my queries for CPU count and thread number were still OpenMP calls.
With OpenMP Support=Generate Parallel Code (/Qopenmp) set, I can induce this error if I (inappropriately) swap the OpenMP calls in for the corresponding Cilk calls. I suspect it's actually just omp_get_thread_num(), since that is the call that occurred in the worker. Of course, if OpenMP support is disabled, the OpenMP calls are undefined and the problem goes away entirely.

#ifdef USEOPENMP
#define CPU_COUNT omp_get_num_procs()
#define CPU_THREAD_NUM omp_get_thread_num()
#endif
#ifdef USECILKPLUS
#define CPU_COUNT __cilkrts_get_nworkers()
#define CPU_THREAD_NUM __cilkrts_get_worker_number()
#endif

Interestingly, I'm not seeing an exact equivalent to omp_get_num_procs() for Cilk. __cilkrts_get_nworkers() seems to be equivalent enough for me on the first call (avoiding hooking in unnecessary Windows calls here).
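
As a hedged aside on the processor-count question above: standard C++11 has a portable query that avoids both runtimes and any direct Windows calls. The sketch below is an assumption about what would be "equivalent enough", not something from the original code; `cpu_count` is a hypothetical helper name, and `std::thread::hardware_concurrency()` is only a hint that may return 0.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper: portable logical-processor count.
// hardware_concurrency() is allowed to return 0 when the value
// is not computable, so fall back to 1 in that case.
unsigned cpu_count() {
    const unsigned reported = std::thread::hardware_concurrency();
    const unsigned n = (reported == 0) ? 1u : reported;
    assert(n >= 1);  // callers may size per-worker arrays from this
    return n;
}
```

Unlike __cilkrts_get_nworkers(), this reports what the machine offers rather than how many workers the runtime was configured to use, so the two values can differ.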

Thanks

-Scott

Jim_S_Intel · ‎12-31-2013

Is the original code making assumptions about which threads a program executes on? OpenMP and Cilk Plus programs have different philosophies and semantics about controlling the mapping of tasks to threads, and it is possible that it could affect the correctness of the ported code. It is hard to say for sure without seeing code that reproduces the problem, but perhaps that might explain the problem?

Typically, it is recommended that one avoids writing Cilk Plus programs that use __cilkrts_get_nworkers() or __cilkrts_get_worker_number() for anything but debugging / output purposes. The task parallel constructs in Cilk Plus are intended to enable programmers to write processor-oblivious code, i.e., code that does not explicitly depend on the number of workers being used to execute the program. A direct mechanical translation of an OpenMP program to a Cilk Plus program is not generally guaranteed to preserve performance or even correctness, especially if the original OpenMP code is making explicit assumptions about which threads are being used to execute the program. In Cilk Plus, it is generally not safe to assume that a particular task executes on a particular thread inside parallel regions.

Cheers,

Jim

Barry_T_Intel · ‎12-31-2013

Let's start by noting that cilk_for is built out of cilk_spawns and cilk_syncs.

When the continuation after a cilk_spawn is stolen, the continuation is run on a different stack. At the sync, the application will continue on the original stack. Since exceptions propagate up the call stack, the Cilk runtime catches all exceptions that may cross over a cilk_spawn and "deals" with them, ensuring that the code is synced before re-raising the exception to find the appropriate handler.

SEH exception code c0000005 is an Access Violation. Since the Cilk runtime only knows how to propagate C++ exceptions, it is announcing that it cannot handle the exception and aborting your application.

The easiest way to deal with this is to run your application under the debugger. You may need to tell the debugger to stop the application when the Access Violation exception is raised (thrown). To do this:

Start your application under the debugger. Either step into it, or set a breakpoint on main() and then run your application.
Once the application is started and you're in the debugger at the beginning of main(), pull down the "Debug" menu and select "Exceptions..."
In the Exceptions dialog, click the "+" next to "Win32 Exceptions" to display the list of exceptions. Then make sure that the entry for "c0000005 Access violation" is checked in the "Thrown" column.
Click OK, then allow your application to continue.

Eventually your application should stop in the debugger when the Access Violation exception is raised. You'll be in the function that's faulting. Finding bugs like this one is almost fun. :o)

You could also have just checked all Win32 Exceptions, but if you're like me you'll forget that you did it and then be confused when you break into the debugger unexpectedly. The exception settings are something Visual Studio sets globally instead of on a per-project basis.

I assume that your analysis is correct; omp_get_thread_num() is attempting to access some state that hasn't been initialized which is generating the exception. Catching the exception when it's raised will confirm this (or point at something else). If that's where the Access Violation is coming from, that would be a bug to report to the OpenMP folks.

As for the number of processors, Cilk is much better about sharing the processors among tasks than OpenMP. Generally a Cilk programmer doesn't think about how many processors are available. Properly written code will scale to use whatever is available. Nested cilk_for loops and cilk_spawns will compose properly and not oversubscribe the machine, unless you force it to by setting the number of workers explicitly. We used to provide a function to calculate the number of processors, but that tended to encourage folks to write code specific to the number of processors on their machines, so we removed it.

- Barry

scott_g_ · ‎12-31-2013

Hi Barry and Jim, I'm not doing anything 'abusive' with either call. I'm doing performance and scalability analysis on different platforms so I simply need to know how many cores are available and then be able to choose an appropriate number accordingly (you could consider that to be debug code as it's not production oriented other than establishing an upperbound). One of the tests is whether Cilk will do any better than OpenMP with this. The actual thread/worker number is used to access an array of vectors so I can store my output (essentially polygons) dynamically without having to use access locks anywhere. The actual sequence of output is irrelevant here so everything is fine.

I'll take a further look with your information and get back to you later today.

Thanks!

-Scott

scott_g_ · ‎12-31-2013

The relevant portion of code is as follows with 'CPU_THREAD_NUM' (inappropriately) defined as omp_get_thread_num():

std::vector< P3< TLINDEXTYPENC > > polygons[ MAX_EXTRACTION_THREADS ];

  _Cilk_for
   ( int x=0; x<TLX_1; x+=2 )
  {
   const int proc=CPU_THREAD_NUM;
   std::vector< P3< TLINDEXTYPENC > > &_polygons=polygons[ proc ];
   if( _polygons.capacity()-_polygons.size()<POLYGON_RESERVE_THRESHOLD )
    _polygons.reserve( _polygons.size()+POLYGON_RESERVE_INCREMENT );

...

The threads are all picking up proc=0, which makes sense because omp_get_thread_num() is the wrong call inside a Cilk worker, and then they collide in the call to _polygons.reserve(...).

Conveniently, omp_get_num_procs() works as expected so that can still be used regardless. I have further inspiration to streamline the listed block, however.

jimdempseyatthecove · ‎01-12-2014

Scott,

In Cilk Plus there is no assurance that all of the Cilk worker threads will be used, or which ones. Will this present a problem when you process polygons[...] after the _Cilk_for? And will this create a problem when the same "proc" runs a 2nd, 3rd, ... time in the _Cilk_for?

In OpenMP you will likely see all threads from the thread pool engaged in the parallel for.

Jim Dempsey

scott_g_ · ‎01-12-2014

Hi Jim, no, this is actually not a problem at all. The 'polygons' vector is built so that each 'thread', can freely deposit its output triangles somewhere without having to lock a resource on a shared destination. The actual order of triangles and their distribution across each element of 'polygons' is uninteresting and they get aggregated together afterwards. There's never a guarantee that any thread produces triangles.

This application is one testbed for analyzing absolute performance and scalability so most of the time I actually expect some elements of 'polygons' to be empty as I'm often running on only a subset of cores. As long as the threads have unique identities so they don't collide in their writes I'm good.
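
The pattern described above can be sketched in standard C++ so it compiles without either runtime. This is only an illustration under stated assumptions: `std::thread` stands in for the Cilk workers, an explicitly assigned `id` stands in for __cilkrts_get_worker_number(), and `Triangle` and `extract` are hypothetical names that are not in the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real polygon type.
struct Triangle { int a, b, c; };

// Each worker appends only to buckets[id]. Because ids are unique,
// no two workers touch the same vector and no lock is required.
std::vector<Triangle> extract(int n_items, unsigned n_workers) {
    assert(n_workers > 0);
    std::vector<std::vector<Triangle>> buckets(n_workers);
    std::vector<std::thread> pool;
    for (unsigned id = 0; id < n_workers; ++id) {
        pool.emplace_back([&buckets, id, n_items, n_workers] {
            std::vector<Triangle>& mine = buckets[id];
            // Strided split of the iteration space; a bucket may stay empty.
            for (int x = static_cast<int>(id); x < n_items;
                 x += static_cast<int>(n_workers)) {
                mine.push_back(Triangle{x, x + 1, x + 2});
            }
        });
    }
    for (std::thread& t : pool) t.join();

    // Aggregate after the parallel phase; order across buckets is arbitrary.
    std::vector<Triangle> all;
    for (const auto& b : buckets) all.insert(all.end(), b.begin(), b.end());
    return all;
}
```

If every worker were handed the same id, as happened with omp_get_thread_num() returning 0, all of them would append to buckets[0] concurrently, which is exactly the collision reported earlier in the thread.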

Barry_T_Intel · ‎01-13-2014

STL vectors are definitely *not* thread safe. Your call to _polygons.reserve is a race. You should definitely check your program for races using Cilkscreen or Inspector.

- Barry

jimdempseyatthecove · ‎01-13-2014

Barry,

The reserve is performed on the thread's Cilk_for scoped _polygons (note "_"), not the outer scope polygons (sans "_").

From what Scott has shown, it appears thread-safe.

Jim Dempsey

jimdempseyatthecove · ‎01-13-2014

Scott,

Is TLX_1 a multiple of 2?

Try adding an assert(x%2==0); in the start of the Cilk_for loop scope as a sanity check.

Jim Dempsey

Barry_T_Intel · ‎01-13-2014

RE: Jim's comment

After looking at the code again, you're correct. That usage should be safe. But I'd still run a race detector over the code. Races sometimes reveal themselves as exceptions.

Scott, You never said if you managed to track the access violation into omp_get_thread_num(). If you do change it to __cilkrts_get_worker_number(), be aware that this is going to do a TLS lookup which can be slow.

- Barry

scott_g_ · ‎01-13-2014

Hi Barry and Jim,

Inspector finds no problems. Since I resolved the original condition I've built the application in thousands of different configurations and executed without any issues.

Barry-
Like I mentioned, omp_get_thread_num() doesn't itself result in an access violation. Instead, any Cilk thread calling it gets 0 back, which then DOES cause an access violation in this case because all the threads inadvertently map onto the same vector (and vector isn't thread-safe). This was resolved by putting in the proper call to __cilkrts_get_worker_number().

The lookup seems to be necessary as I have three options for accumulating output:
1- a common repository with locks :-(
2- an exclusive repository shaped according to the data (size proportionate to TLX_1) :-|
3- an exclusive repository shaped according to the machine (size proportionate to core count) :-)

If you think __cilkrts_get_worker_number() will really be painful I can put in an option to see which of options 2 or 3 is better, but I'm expecting 3 is ultimately more appropriate for more systems I'll be working on.
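
For comparison, option 2 might look roughly like the sketch below: one private bucket per loop iteration, indexed by the loop variable, so no worker-number lookup is needed at all. The names `Bucket`, `iteration_count`, `make_data_shaped`, and `slot_for` are hypothetical, and `int` stands in for the real polygon type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Bucket = std::vector<int>;  // hypothetical stand-in for a polygon list

// Number of iterations of: for (int x = 0; x < tlx; x += 2)
inline std::size_t iteration_count(int tlx) {
    return tlx <= 0 ? 0 : static_cast<std::size_t>((tlx + 1) / 2);
}

// Option 2: the repository is sized by the data, one bucket per iteration.
inline std::vector<Bucket> make_data_shaped(int tlx) {
    return std::vector<Bucket>(iteration_count(tlx));
}

// Iteration x owns exactly one bucket; no worker number is consulted.
inline std::size_t slot_for(int x) {
    assert(x >= 0 && x % 2 == 0);  // loop variable advances in steps of 2
    return static_cast<std::size_t>(x / 2);
}
```

The trade-off is many small vectors (about TLX_1/2 of them) instead of one per core, which costs more allocations but removes the TLS lookup from the loop body.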

Jim-
TLX_1 IS a multiple of 2. It's determined from the shape of input data.

Barry_T_Intel · ‎01-18-2014

We're adding a vector reducer. *if* the reducer lookup is being hoisted properly, that would give you better performance and a lock-free way of doing this. It's available on the cilkplus.org website as part of the CilkPub library.

- Barry

Jim_S_Intel · ‎01-18-2014

How many output triangles get deposited into the vector? If the vector that is created has size proportional to TLX_1, then you might not get as good a performance out of it as the solution #3 that is there already --- using the worker number. If the number of output triangles is sparsely populated, then it might work well.

The reducer_vector is one that must be used with caution. I suspect one of the reasons why the original design of reducers did not include it is because it has a "reduce" operation that is not a constant-time operation, i.e., merging two views which are vectors involves data movement from one into the other. The benefit of the reducer_vector is that you are guaranteed to get the same output vector every time, even in a parallel execution. But if you don't need the deterministic ordering, then it might not be what is needed.
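
To make the cost argument concrete, here is a minimal sketch of what merging two vector views involves. It is not the Cilkpub reducer_vector implementation; `merge_views` is a hypothetical helper that only illustrates why the reduce step is linear in the size of the right-hand view and why the result keeps serial order.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Append the right view to the left view, preserving left-then-right order.
// Returns the number of elements moved, which is the cost of this reduce step.
template <class T>
std::size_t merge_views(std::vector<T>& left, std::vector<T>&& right) {
    assert(&left != &right);
    const std::size_t moved = right.size();
    left.insert(left.end(),
                std::make_move_iterator(right.begin()),
                std::make_move_iterator(right.end()));
    right.clear();
    return moved;
}
```

Every element of the right view is moved once per merge, so the data movement grows with the output size, whereas per-worker buckets defer all copying to a single aggregation pass at the end.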

That being said, the easiest way to find out whether it works well or not is probably just to try it out. Also, I think using an object with the reducer_vector interface might result in cleaner code in the long run.

In this case, my primary objection to the use of worker number is more a matter of elegance of code rather than correctness. Ideally, there should be a way to express the same idea without explicitly referring to worker number in the user code. Perhaps if preserving the deterministic ordering ends up being too much overhead, then it might not be too bad to create a version of the reducer for Cilkpub that does abstract it away...

Cheers,

Jim S.

scott_g_ · ‎01-18-2014

I'm running tests between options 2, 3, and a bunch of other architectures; I can report back what shows up as best since you guys really seem interested.

Using vectors straight-up on this kind of operation really seems to be brutally inefficient and a reduction operation on anything similar doesn't sound like it's going to be any better. It seems like going in that direction just obscures using the worker number for another more complicated object which is going to have greater performance overhead particularly when determinism is absolutely irrelevant.

For a 256x256x256 voxel space which gets translated to TLX_1 ~512, I'm pulling about 1.5M polygons out of it. For this data it's probably about homogeneous on the index, but one can never tell. Have no fear, though, this code is absolutely intentionally NOT elegant in any way and it will resist all attempts to make it so :-)