Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

What is wrong with my code?

afd_lml
Beginner
Hi all,
I am new to OpenMP and am facing a strange problem. The code below runs correctly on my 2-core laptop, but it crashes on my 24-core workstation, and the operating system (Windows Vista) reports code 0xc0000005.

Would anyone like to help me? Thank you!


do {
    #pragma omp parallel for private(v)
    for (int i=0; i<...; i++) {
        double* vLast = new double[nvG]; //nvG=10000

        double delta = 0.30;
        for (int m=0; m<nvG; m++) {
            vLast[m] = v[i][m];
            v[i][m] = (1.0-delta)*v[i][m] + delta*vLast[m];
        }

        delete[] vLast;
    }
} while (!convergenced);
Michael_K_Intel2
Employee
Hi!

At first glance the code you've posted looks pretty OK to me. It would help to find the problem if you could strip down your program to the above loop with a little bit of skeleton around to have a compilable code that still shows the problem. It would also help to know how you compiled the code (compiler version and switches used, optimization levels, etc.).

One small remark on the loop: you should separate the parallel region ("parallel") from the work-sharing construct ("for") by structuring your code like this:

#pragma omp parallel
{
    do {
        #pragma omp for
        for (...) {...}
    } while (...);
}

With this structure, the parallel region is created only once, which matters because creating it is a rather expensive operation. In your original code, the parallel region is created and torn down on every iteration until convergence is reached. But that is an optimization to apply after you have found the bug that triggered your post.

Cheers,
-michael
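
A minimal sketch of the hoisted structure, assuming a made-up update (every element relaxes toward a fixed target) in place of the real one. The function name, `target` and the flag `done` are illustrative and not from the original code. The loop condition lives in a shared flag that one thread updates inside `single`, so every thread leaves the loop after the same sweep.

```cpp
#include <cmath>
#include <vector>

// Relaxes every element of v toward `target` until the largest change in a
// sweep drops to `tol` or below. Returns the number of sweeps performed.
int relax_until_converged(std::vector<double>& v, double target,
                          double delta, double tol, int maxIterations)
{
    const int n = static_cast<int>(v.size());
    int iterations = 0;      // shared
    double maxChange = 0.0;  // shared
    bool done = false;       // shared loop condition

    #pragma omp parallel
    {
        while (!done) {
            // One thread resets the shared state; the implicit barrier at the
            // end of `single` makes the reset visible before the sweep starts.
            #pragma omp single
            {
                maxChange = 0.0;
                ++iterations;
            }

            double localMax = 0.0;  // private: declared inside the region

            #pragma omp for
            for (int i = 0; i < n; ++i) {
                const double old = v[i];
                v[i] = (1.0 - delta) * old + delta * target;
                const double change = std::fabs(v[i] - old);
                if (change > localMax) localMax = change;
            }

            #pragma omp critical (merge_max)
            {
                if (localMax > maxChange) maxChange = localMax;
            }

            // Wait until every thread has merged its partial maximum.
            #pragma omp barrier

            // One thread decides; the implicit barrier publishes the decision.
            #pragma omp single
            {
                done = (maxChange <= tol) || (iterations >= maxIterations);
            }
        }
    }
    return iterations;
}
```

The region is entered once, and only the work-sharing loop, the critical section and the barriers are repeated per sweep. Compiled without OpenMP support, the pragmas are ignored and the function runs serially with the same result.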
TimP
Honored Contributor III
According to this example, it appears private(v) would cause everything done in the parallel loop to be hidden from visibility after the parallel. Contrary to the first response, with no change in shape of the private copy of v[], the same private copy should persist from one instance of parallel to the next, but the private v[] hasn't been initialized, and requires stack space increasing with number of threads. If v[] is a shared array, OpenMP will take care automatically of partitioning access to it among the threads. There appears no purpose in making an array of vLast[], other than to assist in exhausting stack as number of threads increases; a scalar would accomplish the same purpose, with the vectorizer extending it automatically to something like 8 wide.
If this example is a true sample of your problem, one would have to be concerned that strange code could result in "a strange problem."
Michael_K_Intel2
Employee
Hi tim18,

You assume that v is declared as an array. If it is a pointer, only the pointer is privatized, while the array it refers to is still shared. Without access to the declaration of v we can't be sure. One other thing: if v were a true array that is privatized, each private copy would hold 3000*10000 elements of "something", which would already cause a stack overflow with only a few threads, wouldn't it?

Cheers,
-michael
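
To put rough numbers on that, here is a small sketch that assumes the elements are doubles (the thread does not show the declaration of v). With an 8-byte double, one private copy of a true 3000 x 10000 array is 240,000,000 bytes, about 229 MiB per thread, which is far more than the few megabytes a thread stack typically gets by default. The constant and function names are made up for illustration.

```cpp
#include <cstddef>

// Sizes quoted in the thread: 3000 rows of 10000 elements each.
constexpr std::size_t kRows = 3000;
constexpr std::size_t kCols = 10000;

// Bytes each thread would need if a true 2-D array of doubles were privatized.
constexpr std::size_t bytes_per_private_copy()
{
    return kRows * kCols * sizeof(double);
}

// Bytes each thread needs when only a double** is privatized.
constexpr std::size_t bytes_per_private_pointer()
{
    return sizeof(double**);
}
```

Whether private copies actually live on the stack is up to the implementation, so this is only an estimate of the order of magnitude.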
TimP
Honored Contributor III
Excellent points.
afd_lml
Beginner
Thank you all for your replies.

I think I should post a more detailed code segment for my problem.

v is a two-dimensional dynamic array, which is allocated before the parallel region.

v is updated in each iteration step. vLast is used to test whether the iteration has converged.


double** v = new double*[M]; // M=3000
for (int m=0; m<M; m++) {
    v[m] = new double[N]; // N=10000
    for (int n=0; n<N; n++) {
        v[m][n] = .... // initialize v
    }
}

const int maxIterations = 500;
double maxError = 0.0;
int iterationCount = 0;
do
{
    ++iterationCount;
    maxError = 0.0;

    #pragma omp parallel for
    for (int m=0; m<M; m++) {
        double* vLast = new double[N]; // N=10000
        double delta = 0.30;
        for (int n=0; n<N; n++) {
            // keep v in vLast for evaluating the max error.
            vLast[n] = v[m][n];

            // call a function to update v
            ...........

            // update v by using Gauss-Seidel method
            v[m][n] = (1.0-delta)*v[m][n] + delta*vLast[n];
        }

        double tm = ComputeError(vLast, v[m]);
        if (tm > maxError)
            maxError = tm;

        delete[] vLast;
    }
}
while (maxError>TOL && iterationCount<maxIterations);

for (int m=0; m<M; m++) delete[] v[m];
delete[] v;

============================================================

I wonder whether vLast is shared or private. If it is shared, should it be allocated in a critical region? Something like:

double* vLast = NULL;
#pragma omp parallel for private(vLast)
for (int m=0; m<M; m++) {
    #pragma omp critical
    vLast = new double[N]; // N=10000
.............................................................

Furthermore, I have tested the following on my laptop:
#pragma omp parallel for num_threads(24)
and the program runs correctly, so I am really confused about the problem.

Michael_K_Intel2
Employee
Hi!

So v is a pointer that you make private. As private only creates an uninitialized thread-private incarnation of v, the pointer to your data structure is gone when the thread starts executing. You should use firstprivate instead to pass the value of the pointer along from outside the parallel region. That makes your data structure accessible to all threads through a private copy of the pointer to the data. I guess that this is what you want.

For vLast, I don't know exactly what you need.

If you need a private array vLast[0..N] for each thread, then your code is not quite right when vLast is declared outside the parallel region: it is then shared by default, so each thread allocates an array and concurrently overwrites the one vLast variable. You should declare vLast as private (as you did in your last code snippet) or move the declaration of the vLast pointer into the parallel region. Then you don't need the critical section anymore, as calls to the new operator should be thread-safe.

If you need only a single instance of the vLast[0..N] array shared among the workers, you had better allocate it once before entering the parallel region and pass only a firstprivate pointer into it (as with v).

Cheers,
-michael
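
A small sketch of the firstprivate advice, assuming a flat array instead of the poster's two-dimensional one. The function name and the `scale` parameter are made up for illustration.

```cpp
// Fills data[0..n) with i * scale through a firstprivate copy of the pointer.
void fill_through_firstprivate(double* data, int n, double scale)
{
    double* p = data;  // declared outside the parallel region

    // firstprivate(p): every thread gets its own pointer variable, initialized
    // with the outer value, so all copies refer to the same shared array.
    // With private(p) instead, each copy would start out uninitialized and
    // dereferencing it would be undefined behaviour.
    #pragma omp parallel for firstprivate(p)
    for (int i = 0; i < n; ++i) {
        p[i] = i * scale;  // iterations touch disjoint elements, so no race
    }
}
```

Leaving the pointer shared (no clause at all) would work just as well here; firstprivate only changes which pointer variable each thread reads, not the memory it points to.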

afd_lml
Beginner
Many thanks for your help.

I am not clear about dynamic memory.
Some books say that variables allocated with new (malloc) are shared by default. Is that correct? I am confused about this. For example, in the following code segment, what are the sharing attributes of a and b?

double* a = new double [100]; // a is shared ? private ?

#pragma omp parallel
{
    double* b = new double [100]; // b is shared ? private ?
}


In fact, what I need is for v to be shared and vLast to be private.

Thank you!

Tudor
New Contributor I
In your example, a is shared because it is declared outside the parallel region and is not listed in a private clause. b is private because any variable declared within the parallel region is implicitly private.
Michael_K_Intel2
Employee
Just to make the example more complete:

double *a = new double[10];
double *b = new double[10];
double *c = new double[10];

#pragma omp parallel private(b) firstprivate(c)
{
    double *d = new double[10];

}

We get:

  • a is shared, and so is the memory behind it.
  • b is private, but the pointer is dangling, as the privatized b is left uninitialized.
  • c is made private and the pointer to the array is passed along. The memory pointed to is shared among all threads.
  • d is automatically private by scoping rules. As each thread executes the new operator, each thread has private memory referenced by d.
Cheers,
-michael
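
The cases for c and d can be checked with a small experiment. This is only a sketch: `SharingReport`, `inspect_sharing` and `outerC` are hypothetical names, and the private(b) case is left out on purpose because reading the uninitialized copy would be undefined behaviour.

```cpp
#include <set>

struct SharingReport {
    int threads;          // number of threads that ran the region
    int distinctBuffers;  // distinct addresses returned by new inside the region
    bool cKeptValue;      // every firstprivate copy of c held the outer value
};

SharingReport inspect_sharing()
{
    double* c = new double[10];  // firstprivate below: private pointer, shared memory
    double* const outerC = c;    // shared, read-only reference value
    std::set<double*> seen;      // shared; only touched inside the critical section
    SharingReport report{0, 0, true};

    #pragma omp parallel firstprivate(c)
    {
        double* d = new double[10];  // private pointer, private memory

        #pragma omp critical (record)
        {
            ++report.threads;
            seen.insert(d);
            if (c != outerC) report.cKeptValue = false;
        }

        // Keep every buffer alive until all threads have recorded theirs,
        // so the allocator cannot hand out the same address twice.
        #pragma omp barrier
        delete[] d;
    }

    report.distinctBuffers = static_cast<int>(seen.size());
    delete[] outerC;
    return report;
}
```

If the explanation above holds, `distinctBuffers` equals `threads` (one allocation per thread through d) and `cKeptValue` stays true (all copies of c refer to the one outer array).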
afd_lml
Beginner
According to your explanation, I am sure that my code should be correct. But why does it crash on the 24-core platform? My program runs correctly on my 2-core laptop even when the number of threads is set to 24 to simulate the 24-core workstation.

The hardware/software configuration of the workstation is: 4 Intel Xeon 7400 processors (24 cores in total); 128 GB RAM; 3 TB HDD; Windows 2008 Enterprise; Intel Parallel Studio 1.0.
Tudor
New Contributor I
We may be looking at the problem the wrong way. Perhaps it is an OS problem, rather than a code problem.
What OS do you have on your laptop? Code 0xc0000005 in Vista means access violation. Try using Run as administrator.
Michael_K_Intel2
Employee
Hi,

I have checked your code on my machine with different thread counts and did not receive any error. Did you try debugging your code with a debugger to see where the error comes from? Maybe that can give an indication on where the problem actually is.

Cheers,
-michael

robert-reed
Valued Contributor II
I think Michael is on the right track. Are you declaring v private? The first code example shows a private v but the second example doesn't specify. For the size of that array and the purposes you're putting it to use, the pointer should be declared and initialized outside the parallel region (as you're doing) and then shared across the threads; that is, don't declare it private.

It's also unfortunate that this code has to declare and destroy vLast so many times. It shouldn't cause a fault but must be adding a lot of overhead to the algorithm to repeatedly allocate and destroy those arrays. Maybe you could do something like this:


double *vLast = (double *)0;

#pragma omp parallel firstprivate(vLast)
{
    #pragma omp for
    for (int m=0; m<M; m++) {
        if (! vLast) vLast = new double[N]; // N=10000

        // All the loop over n and error computations
    }

    if (vLast) delete[] vLast;
}

I think this should work. Declaring and initializing vLast outside the parallel region and then declaring it firstprivate means each thread's copy starts out as NULL. The cost of an if in the middle loop should be a lot less than the repeated allocate/deallocate of the original example. Separating the parallel from the for provides a place where the use of the vLast components is complete but the private pointers still exist. The dynamic memory can be cleanly released once per thread per outer loop iteration (500 times in the second example?). The parallel region construct might be pulled outside the outer loop, but it would take more OMP hair to properly handle the outer loop exit condition. That should be the best-performing code, as it eliminates all those extraneous memory allocation operations.

robert-reed
Valued Contributor II
Du-oh! I was having trouble finding an exit strategy, and when I found it, I didn't realize it makes the whole thing simpler; it came to me as I was bumbling about the house that this is all you probably need:

#pragma omp parallel
{
    double *vLast = new double[N]; // N = 10000

    #pragma omp for
    for (int m=0; m<M; m++) {
        for (int n=0; n<N; n++) {
            vLast[n] = v[m][n];

            .........
        }
    }

    delete[] vLast;
}

Inside the scope of the parallel block, each thread gets a private copy of the pointer vLast, to which it attaches a dynamic array. Threads divide the range of m, and each has a private vLast array to reuse over the interval(s) it gets in the omp for.

Oh, but there is another wrinkle I hadn't considered earlier:

double tm = ComputeError(vLast, v[m]);
if (tm > maxError)
    maxError = tm;

Oops. tm is declared inside the parallel section so it is private, but maxError needs to be shared in order to find the largest value over the range of m. Unfortunately, though max and min are associative operations, they are not supported in the reduction clause for C/C++ up to OpenMP 3.0, so you'll need to do a manual reduction. It might look something like this:


#pragma omp parallel
{
    double *vLast = new double[N]; // N = 10000

    #pragma omp for
    for (int m=0; m<M; m++) {
        for (int n=0; n<N; n++) {
            vLast[n] = v[m][n];

            .........
        }

        double tm = ComputeError(vLast, v[m]);

        #pragma omp critical (reduce_maxError)
        {
            if (tm > maxError)
                maxError = tm;
        }
    }

    delete[] vLast;
}

Because maxError is shared, access to it from multiple threads needs to be guarded. This is a serial region within an otherwise parallel loop, so you could reduce the overhead even more by doing a real reduction:


#pragma omp parallel
{
    double *vLast = new double[N]; // N = 10000
    double localMaxError = 0.0;

    #pragma omp for
    for (int m=0; m<M; m++) {
        for (int n=0; n<N; n++) {
            vLast[n] = v[m][n];

            .........
        }

        double tm = ComputeError(vLast, v[m]);
        if (tm > localMaxError)
            localMaxError = tm;
    }

    #pragma omp critical (reduce_maxError)
    {
        if (localMaxError > maxError)
            maxError = localMaxError;
    }

    delete[] vLast;
}

Each thread in the team keeps a localMaxError that collects the error over the range of m it handles; then, through the named critical section (the name keeps it separate from any other critical section in the program), the partial results are combined into maxError.
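
The manual reduction can be tried in isolation. The sketch below assumes two flat arrays of equal length and measures the largest absolute difference, which stands in for the thread's ComputeError. Both function names are made up. The second function uses the built-in max reduction, which C/C++ gained in OpenMP 3.1, so compilers limited to older OpenMP levels will reject it.

```cpp
#include <cmath>
#include <vector>

// Largest |a[i] - b[i]|, merged through a named critical section.
double max_abs_diff_critical(const std::vector<double>& a,
                             const std::vector<double>& b)
{
    const int n = static_cast<int>(a.size());
    double maxError = 0.0;  // shared

    #pragma omp parallel
    {
        double localMaxError = 0.0;  // private partial result

        #pragma omp for
        for (int i = 0; i < n; ++i) {
            const double e = std::fabs(a[i] - b[i]);
            if (e > localMaxError) localMaxError = e;
        }

        // Entered once per thread, not once per element.
        #pragma omp critical (reduce_maxError)
        {
            if (localMaxError > maxError) maxError = localMaxError;
        }
    }
    return maxError;
}

// Same result with reduction(max : ...), available from OpenMP 3.1 on.
double max_abs_diff_reduction(const std::vector<double>& a,
                              const std::vector<double>& b)
{
    const int n = static_cast<int>(a.size());
    double maxError = 0.0;

    #pragma omp parallel for reduction(max : maxError)
    for (int i = 0; i < n; ++i) {
        const double e = std::fabs(a[i] - b[i]);
        if (e > maxError) maxError = e;
    }
    return maxError;
}
```

Where the reduction clause is available it is the shorter option; the critical-section version is the fallback for older compilers and is what the post above describes.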
