Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

openmp question

finjulhich
Beginner
1,718 Views

Hello
icpc -v 10.0 MacOSX10.4

double *const opt = new double[...];
#pragma omp parallel shared(opt)
{
opt[threadSpecificIndex] = value;
cout << opt[threadSpecificIndex];
}

Once one thread writes a value to an element of opt (threadSpecificIndex is never the same for two different threads), is the same thread supposed to read that array element and find the same value it has written? Is there no "flushing" required?
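By "flushing" I mean whether something like this would be needed (just a sketch of what I have in mind):

#pragma omp parallel shared(opt)
{
    opt[threadSpecificIndex] = value;
    #pragma omp flush   // needed before the same thread reads its own write?
    cout << opt[threadSpecificIndex];
}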

The behavior seems different depending on the optimization level (no optimization vs -O3).
rds,

33 Replies
TimP
Honored Contributor III
1,065 Views
In this example, you have no control over the order in which the results go to stdout, so it will vary with optimization, from one run to another, etc. If you sorted the results by thread, you should not see any variation.
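For example (a sketch; compute() stands in for whatever each thread calculates):

#pragma omp parallel shared(opt)
{
    opt[omp_get_thread_num()] = compute();  // each thread fills only its own slot
}
for (int t = 0; t < nbrThreads; ++t)        // print after the region,
    cout << opt[t] << endl;                 // ordered by thread number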

finjulhich
Beginner
1,065 Views

Thanks. Let me rephrase, clarify the problem, and expand it into several questions:

double *const A = new double[ nbrThreads*N + 1 ];

1. I cannot allocate A simply on the stack, because the number of threads and N are determined only at runtime. N is of the order of 500000. Otherwise, I could have done
double A[nbr of omp threads];
Below, I will access A linearly as A[i*N+thread] instead of A[thread].
I wonder whether there will be a perf difference between a stack allocation and a heap allocation?

2.
int *const timestep = new int[nbrThreads];
int *const spacestep = new int[nbrThreads];
The threads in the // region will read the position reached by other threads (on which they depend for values) by reading timestep[otherthread] and spacestep[otherthread].

3.
#pragma omp parallel shared(opt,timestep,spacestep)
If I understand the OpenMP spec, by default all variables in scope before the start of the // region are shared anyway, unless one adds default(none)?
I still like to write the shared clause, because it shows me clearly the variables I intend to share. The pointers opt, timestep, spacestep are shared; any element of these arrays is always written by exactly one thread, never more, but it is read by other threads.
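For instance, default(none) would force every variable to be listed explicitly, which documents the same intent (a sketch):

#pragma omp parallel default(none) shared(opt,timestep,spacestep,nbrThreads,N)
{
    // any variable used here without an explicit data-sharing
    // clause is now a compile-time error
}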

#pragma omp parallel shared(opt,timestep,spacestep)
{
    const int thr = omp_get_thread_num();
    const int dependent = (thr+1)%nbrThreads;           // thread which depends on current thread
    const int dependee = (thr+nbrThreads-1)%nbrThreads; // thread on which this thread depends
    int& ss = spacestep[thr];
    int& ssdependent = spacestep[dependent];
    int& ssdependee = spacestep[dependee];
    int& ts = timestep[thr];
    int& tsdependent = timestep[dependent];
    int& tsdependee = timestep[dependee];

All these variables are thread-local. Each thread is both a dependent and a dependee.

while (ts>=0) {               // back through time
    for (ss=0; ss<=N; ss++) { // down across space

        // wait until the dependee has finished with the nodes this thread needs
        while ( tsdependee>ts && ssdependee<=ss+1 );
        // wait until the dependent has used our nodes before we override them at the next timestep
        while ( tsdependent>ts && ssdependent<=ss );
        ....
        // proceed with work
    }
    ts -= nbrThreads;
}


spacestep and timestep are updated for the current thread via the references ss and ts.
Other threads, at some point, will see the new values. No deadlock occurs.

The above works in Debug mode but not in Release with optimization. I suspect the waiting doesn't happen. Perhaps the while() gets optimized away?
In practice, not much time is spent in those loops. Is there still a point in making the thread sleep for some time? In OpenMP, how can that be done?

Thanks,

jimdempseyatthecove
Honored Contributor III
1,065 Views

finjulhich,

Make the control arrays spacestep and timestep volatile (the contents of the arrays may change without notice).

Also, verify that your program is not performing useless work while waiting for other threads to complete.

Your particular style will (may) work well if each thread takes an equal amount of time to finish. This is not always the case, as there are other factors beyond the control of the programmer. Principal among these: the operating system may run something else during your program's run time. You may need to consider dynamic scheduling of the work.

Also, by having each thread take adjacent array locations you inhibit the ability of the compiler to generate vector instructions (e.g. work on two of your doubles at a time).

And cache lines also reference adjacent packets of memory, e.g. 64 bytes (8 doubles), so your threads will be evicting each other's cache lines.

Your program may run significantly faster if you reorganize how you split up the work.
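For example, giving each thread one contiguous block instead of interleaving (a rough sketch only; A, N and work() stand in for your own names):

#pragma omp parallel
{
    const int thr   = omp_get_thread_num();
    const int nthr  = omp_get_num_threads();
    const int chunk = (N + nthr - 1) / nthr;       // elements per thread
    const int lo    = thr * chunk;
    const int hi    = (lo + chunk < N) ? lo + chunk : N;
    for (int i = lo; i < hi; ++i)
        A[i] = work(i);   // each thread stays within its own cache lines
}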

Jim Dempsey

finjulhich
Beginner
1,065 Views

Thanks for your answers.

>finjulhich,
>Make the control arrays spacestep and timestep volatile (the contents of the arrays may >change without notice).
I changed to

volatile int* timestep = new volatile int[nbrThreads];
volatile int* spacestep = new volatile int[nbrThreads];
...
#pragma omp parallel shared(opt,timestep,spacestep)
{
volatile int& ss = spacestep[thr];
volatile int& ssdependent = spacestep[dependent];
volatile int& ssdependee = spacestep[dependee];
volatile int& ts = timestep[thr];
volatile int& tsdependent = timestep[dependent];
volatile int& tsdependee = timestep[dependee];
...

And the wrong result in optimized mode still happens. More precisely, I had

while (ts>=0) {
    for (ss=0, SSi=Sinit; ss<=spacestepmax; ss++, SSi*=fr) {
        while ( tsdependee>ts && ssdependee<=ss+1 )
            os << "ss=" << ss << std::endl;
        os << ts << std::endl;
        while ( tsdependent>ts && ssdependent<=ss )
            os << " ss=" << ss << std::endl;
        os << ts << std::endl;
        // do work
    }
    ts -= nbrThreads;
}

os is a thread-specific file output stream. With these printouts, even in -O3, it works fine.
If I comment out the 2nd printout after the first while, then it fails.


>Also, verify that your program is not performing useless work while waiting for other >threads to complete. Your particular style will (may) work well if each thread takes an equal >amount of time to finish. This is not always the case, as there are other factors beyond the >control of the programmer. Principal among these: the operating system may run >something else during your program's run time. You may need to consider dynamic >scheduling of the work...
I have a different, working algorithm that splits up the work vertically instead of alternating each time step, but it has #pragma omp barrier at the frontiers.
I wanted to see if this time-step-alternating thread solution was faster.
In practice, no other CPU-intensive processes will be running at the same time, at least for now.

>Also, by having each thread take adjacent array locations you inhibit the ability of the >compiler to generate vector instructions (e.g. work on two of your doubles at a time).
Given the nature of the algo, it is likely threads will never be far from each other in the space level (I'm not sure about this).
That is why I thought to place adjacent doubles from consecutive threads.

>And cache lines also reference adjacent packets of memory, e.g. 64 bytes (8 doubles), so >your threads will be evicting each other's cache lines.
Basically, one thread's node (t,s) will = the previous thread's node (t,s)*const0 + the previous thread's node (t,s+1)*const1.
Ideally, I would get the 3 nodes in the cache line (actually that would happen only if the threads are synced perfectly).
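In code, that update is essentially (idx() being an illustrative indexing helper, p0/p1 the two constants):

opt[idx(t,s)] = p0*opt[idx(t+1,s)] + p1*opt[idx(t+1,s+1)];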

>>Your program may run significantly faster if you reorganize how you split up the work.
That is the algo I started with first.


finjulhich
Beginner
1,065 Views

After some more investigation,

while ( tsdependee>ts && ssdependee<=ss+1 );
os << std::endl;
while ( tsdependent>ts && ssdependent<=ss );

Removing os << std::endl; in -O3 triggers the failure. Also, os<

Seems the flushing makes the difference....

jimdempseyatthecove
Honored Contributor III
1,065 Views

As a test, temporarily do

#define Debug_os

#ifdef Debug_os
// testing potential rtl problem with os<<...
#else
os << ...
#endif

You may have a problem with using a non-multi-threaded version of the string template library, i.e. one that uses a static formatting buffer as opposed to a stack-based formatting buffer.

In your code:

while ( tsdependee>ts && ssdependee<=ss+1 )
    os << "ss=" << ss << std::endl;

If the above while executes the statement that follows, how does it break out?

Does this imply << modifies one or more of the test variables in the while statement?

As a test for the compiler optimization issue, introduce some "noise" into your program. Something that will thwart the compiler from joining code fragments.

while (ts>=0) {
    for (ss=0, SSi=Sinit; ss<=spacestepmax; ss++, SSi*=fr) {
        while ( tsdependee>ts && ssdependee<=ss+1 )
        {
            os << "ss=" << ss << std::endl;
            if(ss==0xF000F000) // impossible index
                exit(0);
        }
        os << ts << std::endl;
        while ( tsdependent>ts && ssdependent<=ss )
        {
            os << " ss=" << ss << std::endl;
            if(ss==0xFEEEFEEE) // impossible index
                exit(0);
        }
        os << ts << std::endl;
        // do work
    }
    ts -= nbrThreads;
}

You might find you only need the {}'s around the statement following the while.

Jim Dempsey

finjulhich
Beginner
1,065 Views

>>You may have a problem with using a
>>non-multi-threaded version of the string template library.
I doubt there is an issue there... I put the output stream there just for "debugging". Please let me show why:

In non optimized mode (no -O), all works fine.

In optimized mode, this (os is std::ofstream, just to see what's going on) works

while (ts>=0) {
for (ss=0, SSi=Sinit; ss<=spacestepmax; ss++, SSi*=fr) {
while ( tsdependee>ts && ssdependee<=ss+1 );
while ( tsdependent>ts && ssdependent<=ss );
os << std::endl;
// work
}
ts -= nbrthreads;
}

but neither this (which is how I want it to be, no os output)

while (ts>=0) {
for (ss=0, SSi=Sinit; ss<=spacestepmax; ss++, SSi*=fr) {
while ( tsdependee>ts && ssdependee<=ss+1 );
while ( tsdependent>ts && ssdependent<=ss );

// work
}
ts -= nbrthreads;
}

nor this (no endl in the os output, also just to see what's going on)

while (ts>=0) {
for (ss=0, SSi=Sinit; ss<=spacestepmax; ss++, SSi*=fr) {
while ( tsdependee>ts && ssdependee<=ss+1 );
while ( tsdependent>ts && ssdependent<=ss );



os << 5;
// work
}
ts -= nbrthreads;
}

does.

>>In your code:

while ( tsdependee>ts && ssdependee<=ss+1 )
    os << "ss=" << ss << std::endl;

>>If the above while performs the statement following how does it break out? Does this imply << modifies one or more of the test variables in the while statement?

The while loop is really the synchronization attempt. ts and ss are refs to the local thread's entries in the shared arrays. The volatile tsdependee and ssdependee will be changed by the dependee thread (hopefully in a short time), so the current thread should see the new values and break out of the while loop.

I would have done

while( tsdependee>ts && ssdependee<=ss+1 )
    // some OpenMP statement for making the thread sleep some usecs

if I could.
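As far as I can tell OpenMP itself has no sleep call, but on Mac OS X a POSIX usleep() in the loop body would presumably do:

#include <unistd.h>   // usleep() -- POSIX, present on Mac OS X

while ( tsdependee>ts && ssdependee<=ss+1 )
    usleep(10);       // back off ~10 microseconds instead of spinning hot

As a side effect, an opaque call like that should also stop the optimizer from caching the tested variables in registers.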

PS: I tried

while (ts>=0) {
for (ss=0, SSi=Sinit; ss<=spacestepmax; ss++, SSi*=fr) {
while ( tsdependee>ts && ssdependee<=ss+1 ){}
while ( tsdependent>ts && ssdependent<=ss ){}

// work
}
ts -= nbrthreads;
}

but that's the same.

Thanks Jim,

jimdempseyatthecove
Honored Contributor III
1,065 Views

finjulhich,

I would venture to guess that the compiler is registerizing one or more of tsdependee, ssdependee, tsdependent, ssdependent, and that the use of os << ... told the compiler that it could not rely on the preserved values of the registers, so it generated code to look at the memory values instead. Try inserting a call to a dummy function that simply returns, but which is not within the scope of global optimizations. That is, you do not want the optimizer to examine the dummy code to determine which registers are preserved across the call.

 while ( tsdependee>ts && ssdependee<=ss+1 ){DummyFunction();}
while ( tsdependent>ts && ssdependent<=ss ){DummyFunction();}
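A minimal sketch of such a dummy (note: compiling with -ipo may defeat this by letting the optimizer look across translation units):

// dummy.cpp -- keep in its own translation unit so the compiler
// must assume the call could touch memory and clobber registers
extern "C" void DummyFunction() {}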

You could confirm this hypothesis by placing a break on the 1st while (before you add the DummyFunction) and then opening a Disassembly window. Then examine it to see if the while clause is referencing registers or memory for any of the shared variables. If the optimized code is rearranged too much to make sense of, then try the following:

for (ss=0, SSi=Sinit; ss<=spacestepmax; ss++, SSi*=fr) {
DummyFunction();
while ( tsdependee>ts && ssdependee<=ss+1 ){}
while ( tsdependent>ts && ssdependent<=ss ){}
DummyFunction();

You should be able to see the calls to DummyFunction. The code sandwiched between the calls should be examined. Only ts and ss should be candidates for registerization; the other four must not be registerized. Note, it would be permitted for the while statement to move the contents of memory for one of the shared variables into a register and then immediately compare its value. The keyword volatile should prevent this.

If you have problems deciphering this then send the snippet of code to the forum.

Jim Dempsey

finjulhich
Beginner
1,065 Views

Thanks Jim,

I volatilized the opt, timestep and spacestep arrays. It still fails, unfortunately.

The compilation line is

icpc -xT -O3 -ipo -no-prec-div -openmp -parallel -fp-model fast=2 -fpic -fvisibility=hidden -I../api -I../common -DFINH_DYNAMICLIB="__attribute__((visibility("default")))" -c -o Tree.o Tree.cpp

Attached is one file, Tree.h, that contains the implementation of the method. It is included from Tree.cpp.

rds,

finjulhich
Beginner
1,065 Views

I don't see the file attached....

jimdempseyatthecove
Honored Contributor III
1,065 Views

Insert the call to a dummy function (or a system function such as Sleep(0);) before and after the two while statements. Set a breakpoint on the 1st Sleep, then run to the breakpoint. Then click

Debug | Windows | Disassembly

Select text in the disassembly window from the first call to Sleep through the 2nd call to Sleep, then paste it into a forum message for review. Also verify that the problem occurs with Sleep(0); in the code.
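If Sleep isn't visible on your platform, a thin wrapper along these lines should work (an untested sketch; sched_yield is the POSIX analogue):

#ifdef _WIN32
#include <windows.h>   // Sleep(0) gives up the rest of the time slice
inline void YieldThread() { Sleep(0); }
#else
#include <sched.h>
inline void YieldThread() { sched_yield(); }
#endif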

If problem persists I will assist in examining the assembly code.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III
1,065 Views

Finjulhich,

I haven't fully analyzed your code; however, I do have an observation.

Your master thread (thr==0) is not available to perform work in the major while loop. Is this by design?

Jim Dempsey

finjulhich
Beginner
1,065 Views

Jim,

I don't see why you conclude that.

#pragma omp parallel shared(opt,timestep,spacestep)
{

if (thr==0) {
    for (ss=0; ss<=itmIdx; ss++, SSi*=fr)
        opt[ss*nbrThreads+0] = cT::payoff(SSi,X);
    ts -= nbrThreads;
    Sinit *= ft;
}

while (ts>=0) {
    ....
}

}

The block inside the if(thr==0) gets executed only in thread 0, but the big while loop gets run in all threads.

PS: I will add sleep() and locate the assembly. Thank you so much for your assistance,

finjulhich
Beginner
1,065 Views

I'm writing this on Vista 64 with a Core 2 Duo.... I don't know the equivalent of a sleep(); there's no unistd.h, so I put a dummy function in another translation unit.
I compiled Debug mode with full optimization, like in Release mode, and I debugged in Debug mode.

while (ts>=0) {

00000001800CCD26 mov rax,qword ptr [ts]

00000001800CCD2D mov eax,dword ptr [rax]

00000001800CCD2F test eax,eax

00000001800CCD31 jl 00000001800CD11F

const int spacestepmax = itmIdx

00000001800CCD37 mov rax,qword ptr [rbp+1A8h]

00000001800CCD3E mov eax,dword ptr [rax]

00000001800CCD40 mov edx,dword ptr [rbp+238h]

00000001800CCD46 cmp edx,eax

00000001800CCD48 jl 00000001800CC9CC

00000001800CCD4E mov rax,qword ptr [rbp+1A8h]

00000001800CCD55 mov eax,dword ptr [rax]

00000001800CCD57 mov dword ptr [rbp+234h],eax

00000001800CCD5D mov eax,dword ptr [rbp+234h]

00000001800CCD63 mov dword ptr [rbp+21Ch],eax

for (ss=0, SSi=Sinit; ss<=spacestepmax; ss++, SSi*=fr) {

00000001800CCD69 mov rax,qword ptr [rbp+190h]

00000001800CCD70 mov dword ptr [rax],0

00000001800CCD76 fld qword ptr [rbp+180h]

00000001800CCD7C fstp qword ptr [rbp+188h]

00000001800CCD82 mov rax,qword ptr [rbp+190h]

00000001800CCD89 mov eax,dword ptr [rax]

00000001800CCD8B mov edx,dword ptr [rbp+21Ch]

00000001800CCD91 cmp eax,edx

00000001800CCD93 jle 00000001800CCDDA

00000001800CCD95 jmp 00000001800CD03E

00000001800CCD9A mov rax,qword ptr [rbp+190h]

00000001800CCDA1 mov eax,dword ptr [rax]

00000001800CCDA3 add eax,1

00000001800CCDA6 mov rdx,qword ptr [rbp+190h]

00000001800CCDAD mov dword ptr [rdx],eax

00000001800CCDAF fld qword ptr [rbp+188h]

00000001800CCDB5 fld tbyte ptr [rbp+148h]

00000001800CCDBB fmulp st(1),st

00000001800CCDBD fstp qword ptr [rbp+188h]

00000001800CCDC3 mov rax,qword ptr [rbp+190h]

00000001800CCDCA mov eax,dword ptr [rax]

00000001800CCDCC mov edx,dword ptr [rbp+21Ch]

00000001800CCDD2 cmp eax,edx

00000001800CCDD4 jg 00000001800CD03E

globalf();

00000001800CCDDA call globalf (1800010C3h)

while ( tsdependee>ts && ssdependee<=ss+1 ){}

00000001800CCDDF mov rax,qword ptr [rbp+1B8h]

00000001800CCDE6 mov eax,dword ptr [rax]

00000001800CCDE8 mov rdx,qword ptr [rbp+1A8h]

00000001800CCDEF mov edx,dword ptr [rdx]

00000001800CCDF1 cmp eax,edx

00000001800CCDF3 jle 00000001800CCE3D

00000001800CCDF5 mov rax,qword ptr [rbp+1A0h]

00000001800CCDFC mov eax,dword ptr [rax]

00000001800CCDFE mov rdx,qword ptr [rbp+190h]

00000001800CCE05 mov edx,dword ptr [rdx]

00000001800CCE07 add edx,1

00000001800CCE0A cmp eax,edx

00000001800CCE0C jg 00000001800CCE3D

00000001800CCE0E mov rax,qword ptr [rbp+1B8h]

00000001800CCE15 mov eax,dword ptr [rax]

00000001800CCE17 mov rdx,qword ptr [rbp+1A8h]

00000001800CCE1E mov edx,dword ptr [rdx]

00000001800CCE20 cmp eax,edx

00000001800CCE22 jle 00000001800CCE3D

00000001800CCE24 mov rax,qword ptr [rbp+1A0h]

00000001800CCE2B mov eax,dword ptr [rax]

00000001800CCE2D mov rdx,qword ptr [rbp+190h]

00000001800CCE34 mov edx,dword ptr [rdx]

00000001800CCE36 add edx,1

00000001800CCE39 cmp eax,edx

00000001800CCE3B jle 00000001800CCE0E

while ( tsdependent>ts && ssdependent<=ss ) {}

00000001800CCE3D mov rax,qword ptr [rbp+1B0h]

00000001800CCE44 mov eax,dword ptr [rax]

00000001800CCE46 mov rdx,qword ptr [rbp+1A8h]

00000001800CCE4D mov edx,dword ptr [rdx]

00000001800CCE4F cmp eax,edx

00000001800CCE51 jle 00000001800CCE95

00000001800CCE53 mov rax,qword ptr [rbp+198h]

00000001800CCE5A mov eax,dword ptr [rax]

00000001800CCE5C mov rdx,qword ptr [rbp+190h]

00000001800CCE63 mov edx,dword ptr [rdx]

00000001800CCE65 cmp eax,edx

00000001800CCE67 jg 00000001800CCE95

00000001800CCE69 mov rax,qword ptr [rbp+1B0h]

00000001800CCE70 mov eax,dword ptr [rax]

00000001800CCE72 mov rdx,qword ptr [rbp+1A8h]

00000001800CCE79 mov edx,dword ptr [rdx]

00000001800CCE7B cmp eax,edx

00000001800CCE7D jle 00000001800CCE95

00000001800CCE7F mov rax,qword ptr [rbp+198h]

00000001800CCE86 mov eax,dword ptr [rax]

00000001800CCE88 mov rdx,qword ptr [rbp+190h]

00000001800CCE8F mov edx,dword ptr [rdx]

00000001800CCE91 cmp eax,edx

00000001800CCE93 jle 00000001800CCE69

globalf();

00000001800CCE95 call globalf (1800010C3h)

With the volatilization of opt[] also, it's much harder to reproduce the error on the Mac (Xeon 5100, two dual-cores), and I have failed totally to reproduce it so far on a single Core 2 Duo 4300.

As I reproduced it at least once in optimized mode even with the volatile, I doubt the problem is gone.

Thanks,

jimdempseyatthecove
Honored Contributor III
1,065 Views

The generated code sequence for those two while statements is strange. For each while statement it performs the test twice. The code may run OK; maybe it is a loop-unrolling thing.

I think there is a race condition going on with your code. Will look closer.

Jim

finjulhich
Beginner
1,065 Views

Thanks, here is a schematic of the algorithm:

0 1 2 N

X X X ...... X 0
X X X 1
X X 2
..........
XXXX itmIdx
0 0 0
0

ts aka timestep runs backwards from N to 0, and ss aka spacestep goes from 0 to either itmIdx or N. Thread 0 takes the N column, thread 1 the N-1 column, and so on; then thread 0 takes the N-nbrThreads column, and so on.

Each node depends on the node in the column to its right at the same row, the node in that right column one row below, and 2 constants p0,p1.

rds,

jimdempseyatthecove
Honored Contributor III
1,065 Views

Here is a potential problem with the algorithm:

a) the for(ss=0 ...) loop exits with ss > spacestepmax
b) you exit the loop and decrement ts
c) note, ss is still greater than 0 (it's > spacestepmax)

Momentarily it will look like you've completed your space-step loop for the new ts (at least until you begin the for(ss=0 ...) loop again).

Perhaps you could consider something like:

int spacestepmin = 0;
while (ts>=0) {
    const int spacestepspan = itmIdx /* ... */;
    const int spacestepmax = spacestepmin + spacestepspan;
    for (ss=spacestepmin, SSi=Sinit; ss<=spacestepmax; ss++, SSi*=fr) {
        ...
    }
    Sinit *= ft;
    spacestepmin += spacestepspan;
    ts -= nbrThreads;
}

Run a test to check whether this fixes the problem.

Then consider omitting ts from the two while statements, as ss is now a composite of ts and the former ss.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III
1,065 Views

Additional comments:

Programming a compute-intensive wait loop as done in your code is counterproductive. Intuitively, your current empty loop body {} may look like the lowest-latency way to respond to other thread(s) completing a step. The problem is your synchronization variables are, or tend to be, contained within the same cache line, and there are performance issues relating to cache-line sharing.

Additionally, your code assumes it is the only thing running on your system. Unless this code is running on a system without an operating system (e.g. an embedded system), other things are running on the system. Therefore, you should code in a manner that takes into consideration that other code may intervene with yours; your current coding practice delays that other code's completion and thus delays the completion of a step in your own code.

Consider the following changes:

const int SpinMax = 250;// pick a reasonable number for spincount
int SpinCount;

#ifdef _DIAGNOSTIC
static int MaxSpinCountDependee = 0;
struct BlockMin128
{
union
{
int i;
char c[128];
};
};

...
volatile BlockMin128 *const timestep = new volatile BlockMin128[nbrThreads];
volatile BlockMin128 *const spacestep = new volatile BlockMin128[nbrThreads];
...
timestep.i = N-t;
spacestep.i = 0;
...
volatile int& ss = spacestep[thr].i;
volatile int& ssdependent = spacestep[dependent].i;
volatile int& ssdependee = spacestep[dependee].i;
volatile int& ts = timestep[thr].i;
volatile int& tsdependent = timestep[dependent].i;
volatile int& tsdependee = timestep[dependee].i;

// inner wait loops
static int MaxSpinCountDependent = 0;
#endif

SpinCount = 0;
while ( tsdependee>ts && ssdependee<=ss+1 )
{
if(++SpinCount>SpinMax)
Sleep(0);
else
_mm_pause();
}

#ifdef _DIAGNOSTIC
MaxSpinCountDependee =
max(MaxSpinCountDependee,SpinCount);
#endif

SpinCount = 0;
while ( tsdependent>ts && ssdependent<=ss )
{
if(++SpinCount>SpinMax)
Sleep(0);
else
_mm_pause();
}

#ifdef _DIAGNOSTIC
MaxSpinCountDependent =
max(MaxSpinCountDependent,SpinCount);
#endif

Jim Dempsey

jimdempseyatthecove
Honored Contributor III
1,065 Views

A paste got in the wrong place.

const int SpinMax = 250;// pick a reasonable number for spincount
int SpinCount;

#ifdef _DIAGNOSTIC
static int MaxSpinCountDependee = 0;
static int MaxSpinCountDependent = 0;
#endif

struct BlockMin128
{
union
{
int i;
char c[128];
};
};

...
volatile BlockMin128 *const timestep = new volatile BlockMin128[nbrThreads];
volatile BlockMin128 *const spacestep = new volatile BlockMin128[nbrThreads];
...
timestep.i = N-t;
spacestep.i = 0;
...
volatile int& ss = spacestep[thr].i;
volatile int& ssdependent = spacestep[dependent].i;
volatile int& ssdependee = spacestep[dependee].i;
volatile int& ts = timestep[thr].i;
volatile int& tsdependent = timestep[dependent].i;
volatile int& tsdependee = timestep[dependee].i;

// inner wait loops

SpinCount = 0;
while ( tsdependee>ts && ssdependee<=ss+1 )
{
if(++SpinCount>SpinMax)
Sleep(0);
else
_mm_pause();
}

#ifdef _DIAGNOSTIC
MaxSpinCountDependee =
max(MaxSpinCountDependee,SpinCount);
#endif

SpinCount = 0;
while ( tsdependent>ts && ssdependent<=ss )
{
if(++SpinCount>SpinMax)
Sleep(0);
else
_mm_pause();
}

#ifdef _DIAGNOSTIC
MaxSpinCountDependent =
max(MaxSpinCountDependent,SpinCount);
#endif

finjulhich
Beginner
1,005 Views

Reply to your 11-13-2007, 6:05 AM post,

Yes, as spacestep and timestep are separate, there will always be a possibility that other threads see a situation where one has been updated but not the other.

The above didn't work because I think that situation can still happen.

I am now encoding both spacestep and timestep in 1 int variable (location) like this:

. . . itmIdx+1 0
. . . . 1
. . . . 2
.
. . . 2itmIdx itmIdx

starting from 0 up to an end number at node (0,0). A complication is that the algo has 2 parts, one rectangular and the other triangular. It's a trapezoid, basically.

I just need to find out how to decode the number into timestep and spacestep.
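For the rectangular part, assuming each time step spans a fixed number of nodes W (going by the diagram, W = itmIdx+1 — my assumption), the decode would just be a div/mod, something like:

const int W = itmIdx + 1;            // nodes per time step (assumed constant)
const int ts_decoded = location / W; // which time step the counter is in
const int ss_decoded = location % W; // position within that time step

The triangular part has a row-dependent width, so a closed form there would involve triangular numbers.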

Once I get that running correctly, I will study your following posts.

PS: on Windows, I can't use sleep().... any idea which other primitive? No OpenMP runtime lib function can help?

A million thanks,
