a/ In the function with the main loop that I'm attempting to thread there are a number of stack-based variables. My understanding is that these variables default to OMP shared. The variables need to be OMP private in the spawned threads to avoid race conditions. I've tried using the private clause on the omp parallel for directive
#pragma omp for private(xVar)
but I get an error from ICC
(0): internal error: 0_12032
which is on the cryptic side.
b/ I thought that I might be able to use the parallel construct before the private variables were declared to generate individual copies of the OMP private variables for each thread:
#pragma omp parallel
{
    var definitions
    ...
    parallel for
    {
        loop
    }
} // end of parallel region
This would be inefficient in that any initialisation involving computation would be repeated in each thread, rather than being done once and copied to each thread's OMP private instance, but I could wear this to get it going. The parallel section is entered, 8 threads are created (the number of cores on my box) and some initialisation is done, but then the program crashes.
Is this an OK way to get OMP thread private variable instances?
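For what it's worth, a minimal sketch of that layout (illustrative names, not the original code): variables declared inside the parallel region live on each thread's own stack, so they are automatically private, and the inner construct should be an orphaned omp for, which binds to the existing team rather than spawning new threads:

```cpp
// Sketch of the b/ layout: per-thread variables declared inside the
// parallel region; the inner "omp for" slices the loop across the
// team created by the enclosing "parallel".
double sum_squares(int n)
{
    double total = 0.0;

    #pragma omp parallel reduction(+:total)
    {
        double scratch = 0.0;  // declared inside the region: one copy per thread

        #pragma omp for        // note: "for", not "parallel for" -- no nesting
        for (int i = 0; i < n; ++i) {
            scratch = static_cast<double>(i) * i;
            total  += scratch;
        }
    } // end of parallel region

    return total;
}
```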
c/ A library used by the code, Blitz, isn't thread safe. It's possible that this is interfering with the approach in b/. Getting Blitz running thread safe requires POSIX threads, and Windows doesn't support native POSIX threads. There is an open-source Windows pthread implementation at http://sourceware.org/pthreads-win32/. Getting Blitz thread safe under ICC on Windows using that implementation would no doubt be an interesting little project, but not one I want to take on at the moment. Does anyone have any experience with this? I noticed that the Parallel Composer install includes a pthread.h file used by the tachyon example, which provides emulation of a few POSIX threads bits and pieces. I could go down this route by looking for the POSIX dependencies in Blitz and implementing emulations, but it's outside the scope of what I'm trying to do.
Regards
David
a) It would help to know the compiler version you are using; if you still see the internal error, please try with the latest version.
I hope it's a compile-time error and not a runtime one?
It would be useful to see the structure of the routine/loop you are parallelizing. If possible, provide a test case.
b) The default is shared. I don't think that would work with the private clause, as I believe private variables do not persist across parallel regions. If you use the threadprivate directive for the stack variables needed by each thread, the variable will persist into the for(...) region; otherwise it will hold a garbage value in the second parallel region, I think. Yes, the computation will be done for each thread, but I suppose the extra computations will just amount to num_threads - 1?
Also, I think you can give the threadprivate directive a try and see if the program still crashes. You specify it after the declaration and before first use, i.e. before the first parallel region. The runtime crash may possibly be due to the variables not retaining their values?
#pragma omp threadprivate(a, x)
That is not to say that a) should not work fine.
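To address the concern about repeating the initialisation in every thread, threadprivate can be combined with copyin, which broadcasts the master thread's value into each thread's copy on entry to the region. A sketch, noting that threadprivate only applies to file-scope or static variables, and that the names here are made up:

```cpp
// threadprivate requires file-scope or static storage; "state" is a
// hypothetical stand-in for the expensively-initialised variable.
static double state = 0.0;
#pragma omp threadprivate(state)

double run(int n)
{
    state = 42.0;   // initialise once, in the master thread only

    double total = 0.0;
    // copyin(state) copies the master's value into every thread's
    // threadprivate copy, so the initialisation is not repeated per thread.
    #pragma omp parallel for copyin(state) reduction(+:total)
    for (int i = 0; i < n; ++i)
        total += state;

    return total;
}
```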
An internal error is a compiler bug, so if you can submit a test case that shows it with the current compiler release, that would be valuable.
Jim Dempsey
omp parallel
{
    vars
    omp for
    for loop
    {
    }
} // end of parallel region
My thought was that private copies of the vars following the initial parallel pragma would be allocated per thread when the initial parallel region was entered. These threads would in turn be assigned iterations of the parallelised loop, so the threads would execute the parallelised loop iterations already prepared with private copies of the variables concerned.
But if OMP regards these as separate regions I guess this wouldn't work.
Thanks for the help
David
Sorry, I thought you were defining in one region and using in another, which is not required.
I do not see an internal compiler error.
$ icc -V
Intel C Intel 64 Compiler Professional for applications running on Intel 64, Version 11.1 Build 20100806 Package ID: l_cproc_p_11.1.073
Copyright (C) 1985-2010 Intel Corporation. All rights reserved.
$ uname -a
Linux dpd22 2.6.18-52.el5 #1 SMP Wed Sep 26 15:26:44 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
$ cat tst.cpp
#include <iostream>
#define N 10000
int main()
{
    int a[N];
    int x, y;
    int sum = 0;
    int i;
    for (i = 0; i < N; ++i)
        a[i] = i;
#pragma omp parallel for private(x, y) reduction(+:sum)
    for (i = 0; i < N; ++i) {
        sum = sum + a[i];
        x += i;
        y += i*i;
    }
    std::cout << "Hi sum = " << sum << std::endl;
    return 0;
}
$ icc -openmp tst.cpp
$ ./a.out
Hi sum = 49995000
It would be nice if you can share your test case.
omp parallel
{
    [n threads running here]
    vars
    omp for
    for loop
    {
        [same n threads running different slices of for loop]
    }
    [same n threads running here]
} // end of parallel region
[same thread as first thread running above]
There is only one parallel region.
If you should change "omp for" to "omp parallel for", and if nested regions are enabled, then the inner loop will create n teams of m threads, each team slicing the entire array.
IOW, as to whether these are separate parallel regions: it depends on whether the keyword "parallel" is contained in the omp statement.
Jim Dempsey
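The point about a single region can be checked mechanically: since the orphaned omp for binds to the enclosing team, every loop iteration is executed exactly once in total. A sketch with an illustrative function name:

```cpp
// One parallel region: the orphaned "omp for" does not create threads,
// it distributes iterations across the team already running the region,
// so each iteration of the loop runs exactly once.
int count_iterations(int n)
{
    int done = 0;
    #pragma omp parallel reduction(+:done)
    {
        // [n threads running here]
        #pragma omp for
        for (int i = 0; i < n; ++i)
            ++done;   // [same n threads, different slices of the loop]
        // [same n threads running here]
    }
    return done;
}
```

With "omp parallel for" on the inner loop and nesting enabled, the count would instead be n_outer_threads times the trip count.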
omp parallel
{
    [n threads running here]
    vars
    omp for
    for loop
    {
        [same n threads running different slices of for loop]
    }
    [same n threads running here]
} // end of parallel region
[same thread as first thread running above]
There is only one parallel region.
If you should change "omp for" to "omp parallel for", and if nested regions are enabled, then the inner loop will create n teams of m threads, each team slicing the entire array.
That's what I said "...if nested regions enabled..."
Jim
#pragma omp threadprivate(...)

#pragma omp parallel
{
    ---- definitions ----
}

#pragma omp parallel for
for (...)
{
    ---- use and compute ----
}

In the above case there is no nested parallelism, so no n*m threads will be created; it will be just n threads. Also, I don't think any barrier is needed.
Though I'm not sure about any extra overheads or performance impact, please let me know whether the above solution is also right.
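Written out, that two-region pattern might look like the sketch below (hypothetical names). Note that relying on a threadprivate value surviving from the first region into the second assumes the team does not change between regions:

```cpp
// Two regions sharing per-thread state via threadprivate: region one
// initialises each thread's copy, region two uses it. "counter" is a
// made-up stand-in for the real per-thread definitions.
static int counter = 0;
#pragma omp threadprivate(counter)

int two_regions(int n)
{
    #pragma omp parallel
    {
        counter = 0;          // ---- definitions ---- (once per thread)
    }

    int total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; ++i) {
        ++counter;            // ---- use and compute ---- (value persisted)
        total += 1;
    }
    return total;
}
```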
*** however ***
It is the programmer's responsibility to account for runtime differences that may occur outside the program test environment.
In particular:
Do all the threads run through the first region and then the second region?
If not, do the same threads run through both regions?
The test environment may not be using nested parallel regions (you always see the same threads passing through both regions), whereas the production environment may be using nested parallel regions, and the threads passing through the first region are not the same threads (or the same number of threads) as those passing through the second region(s).
This should not scare you away from using thread-local storage. It should only caution you to pay attention to the details.
Jim Dempsey
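That "same team in both regions" assumption can be asserted cheaply at runtime. A sketch, where the helper name is made up and the serial stub only exists so the code also builds without OpenMP:

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_num_threads() { return 1; }  // serial-build stub
#endif

// Record the team size seen by the first region and check that the
// second region gets the same team size -- a cheap guard for the
// assumption that threadprivate data carries over between regions.
int checked_team_size()
{
    int first = 0, second = 0;

    #pragma omp parallel
    {
        #pragma omp master
        first = omp_get_num_threads();
    }
    #pragma omp parallel
    {
        #pragma omp master
        second = omp_get_num_threads();
    }

    assert(first == second && "team size changed between regions");
    return first;
}
```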