Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.
7944 Discussions

Couldn't create more than 981 OpenMP threads with Intel(R) C++ Composer XE 12 Update 9 - RESOLVED - more than 18,607 threads created

SergeyKostrov
Valued Contributor II
944 Views
I'd like to report an OpenMP related problem.

Intel software engineers stated some time ago that Intel's implementation of OpenMP allows creating up to 16,384 threads.

I've just completed a test, and an OpenMP-based application compiled with Intel C++ Composer XE 12 Update 9 couldn't create
more than 981 OpenMP threads.

Error messages are as follows:

...
OMP: Error #136: Cannot create thread.
OMP: System error #8: Not enough storage is available to process this command.
OMP: Error #178: Function GetExitCodeThread() failed:
OMP: System error #6: The handle is invalid.
...

OpenMP support was enabled in the Visual Studio project: Generate Parallel Code (/openmp, equiv. to /Qopenmp).

My environment:

OS: Windows XP 32-bit
IDE: Visual Studio 2005 SP1
C++ compiler: Intel C++ Composer XE 2011 Update 9

Best regards,
Sergey

0 Kudos
22 Replies
Vladimir_P_1234567890
767 Views
Hello Sergey,
Do you have enough free memory and enough handles? It looks like you have reached the 2GB-per-process Windows limit. Could you work with a 64-bit version to get more threads working?
--Vladimir
0 Kudos
SergeyKostrov
Valued Contributor II
767 Views
Hi Vladimir,

I'll continue investigation today and keep you informed.

Quoting Vladimir Polin (Intel)
...Do you have enough free memory and number of handles?

Yes.

It looks you have reached 2GB per process windows limitation.

The total amount of allocated memory ( for thread stacks, etc ) was significantly less than 2GB and I'll provide
exact numbers later.

Could you work with 64 bit version to get more threads working?

No.


Best regards,
Sergey

0 Kudos
jimdempseyatthecove
Honored Contributor III
767 Views
Without using the BOOT.INI option to instruct the 32-bit Windows to permit processes to use up to 3GB of user space, the user application is limited to 2GB (plus system space in upper 2GB address range).

With 2GB
Subtract code size
Subtract static data
Subtract main thread initial stack
The remaining memory is in your initial heap
Prior to creating your threads you may perform allocations; subtract these from the amount of available memory as well.

Assume for example you have 1GB remaining.

Default thread stack limit is 1MB. Therefore 1000 threads could possibly be created in the remaining 1GB assuming they used no additional resources. *** and leaving 0 RAM for additional allocations ***

64-bit does not have this limitation.
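
The arithmetic above can be sketched in a few lines of C++. This is only an illustration of the estimate, not of how Windows or the OpenMP runtime actually accounts for memory; the function and parameter names are hypothetical, and the 1GB and 1MB figures are the example values used above.

```cpp
#include <cassert>
#include <cstdint>

// Rough upper bound on the number of threads whose stack reservations fit
// into the user address space that is still free. Names are illustrative
// assumptions; guard pages and per-thread bookkeeping are ignored here.
inline std::uint64_t max_threads_by_address_space(std::uint64_t free_address_space_bytes,
                                                  std::uint64_t stack_reserve_bytes)
{
    assert(stack_reserve_bytes > 0 && "stack reservation must be non-zero");
    return free_address_space_bytes / stack_reserve_bytes;
}
```

With 1GB (2^30 bytes) remaining and a 1MB (2^20 bytes) stack reservation, the function returns 1024, which is the "roughly 1000 threads" figure. Any other allocation in the process lowers that bound.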

Does your system have more than 981 logical processors?
If not, then why so many threads???

Jim Dempsey
0 Kudos
TimP
Honored Contributor III
767 Views
The (important) thread-affinity facilities of each OpenMP implementation are limited to the number of logical processors with hardware support on the supported systems (no Intel platforms currently support more than 248 logical processors, and not on Windows).
0 Kudos
SergeyKostrov
Valued Contributor II
767 Views
Jim, Tim,

I'll follow up on your posts some time later. Thank you for the feedback!

I'm simply overwhelmed by a number of different issues and little problems related to the integration of the Intel C++ compiler with the project.

Best regards,
Sergey
0 Kudos
SergeyKostrov
Valued Contributor II
767 Views
Hi Vladimir,

I still can't resolve the problem. Here is a new Test-Case 2 that reproduces it:

[cpp]
// Test-Case 2 - Maximum number of OpenMP threads for Intel C++ compiler ( XE v12.1.3 )
...
uint uiNumThreads = 0;

// uiNumThreads = 512;    // No Errors: Created 512 threads
uiNumThreads = 981;       // No Errors: Created 981 threads
// uiNumThreads = 982;    // OMP: Error #136: Cannot create thread
// uiNumThreads = 1024;   // OMP: Error #136: Cannot create thread

omp_set_num_threads( uiNumThreads );

#pragma omp parallel for
for( int i = 0; i < 4096; i++ )
{
    int iValue = 2;
    // %d matches the int arguments; %u matches the unsigned thread count
    printf( "Iteration: %4d - Thread %4d out of %4u\n",
            i, omp_get_thread_num() + 1, uiNumThreads );
}
...
[/cpp]
Could you forward my concerns to the Intel Engineering Team, please?

Best regards,
Sergey
0 Kudos
SergeyKostrov
Valued Contributor II
767 Views
Hi Vladimir,

I'll continue investigation today and keep you informed.

Quoting Vladimir Polin (Intel)
...It looks you have reached 2GB per process windows limitation.

The total amount of allocated memory ( for thread stacks, etc ) was significantly less than 2GB and I'll provide
exact numbers later.

Here is a screenshot ( ~110MB allocated ):



983 - 2 ( Default process threads of the test application )= 981
0 Kudos
Vladimir_P_1234567890
767 Views
Hello Sergey,
try this example
[cpp]
#include <windows.h>
#include <stdio.h>

// Each thread only sleeps, so it stays alive while the others are created.
DWORD WINAPI thread_routine(LPVOID lpParameter)
{
    Sleep(20000);
    return 0;
}

int main()
{
    unsigned int uiNumThreads = 0;
    HANDLE h[10000];

    for (uiNumThreads = 1; uiNumThreads < 10000; uiNumThreads += 1)
    {
        // A stack size of 0 requests the default reservation for the executable.
        h[uiNumThreads] = CreateThread(NULL, 0, thread_routine,
                                       (LPVOID)(ULONG_PTR)uiNumThreads, 0, NULL);
        if (h[uiNumThreads] == NULL) {
            printf("Kernel object limit is %u\n", uiNumThreads);
            break;
        }
    }
    Sleep(1000);
    return 0;
}
[/cpp]
To utilize all 2^24 kernel objects you need a 64-bit OS and a 64-bit application.
This has nothing to do with OpenMP.
--Vladimir
0 Kudos
Vladimir_P_1234567890
767 Views
Quoting TimP (Intel)
(no Intel platforms currently support more than 248 logical processors, and not on Windows).

In theory our RTL should work on a 4096-way SGI* UV 1000 on Windows (http://www.sgi.com/products/servers/uv/specs.html). Are there any volunteers to check? :)

--Vladimir

0 Kudos
jimdempseyatthecove
Honored Contributor III
767 Views
What I think the task manager is failing to account for is the address space reserved by the threads, as opposed to the page file space committed. Let's see if I can explain (surmise) this.

When your test program starts and runs up to the point just before OpenMP starts, your virtual memory address space is something like this (order may differ):

(4KB reserved) at 0x00000000
(static data) at +4KB
(code)
(initial heap)
(unmapped address) 2GB/3GB less above and below items
(reserved 4KB)
(main thread stack)
--------------------
0x80000000 or 0xC0000000 to 0xFFFFFFFF system address space of your virtual memory

If/when the heap expires prior to or following additional thread allocations, additional heaps are mapped/allocated/reserved from the unmapped address space (a portion thereof), assuming there is available address space.

Now then, when a new thread is allocated/created (the surmise part):

The O/S checks the unmapped address space to see if it has sufficient space for:

thread stack (default 1MB, you may specify differently)
guard page (4KB on x32)
optional thread context information (?KB)

These addresses come out of the virtual memory address space (assuming address space available)

*** Now then, until something is pushed onto a thread's stack, each untouched page of that stack (4KB page granularity) only holds a reservation of 4KB of virtual address space; it requires neither physical memory nor page file space. The first touch of such a page causes a page fault, and the O/S then maps the page (assuming page file space is available). A similar thing happens each time you add an additional heap (expand the heap).

What this means is your 981 threads have:

981x (default thread stack + 4KB guard) virtual address space consumed (~1GB)
981x (4KB touched stack + 4KB guard) RAM/pagefile space consumed (~8MB)

When the program attempts to allocate the 982nd thread there is no available virtual address space.

At least this is my assessment as to what you are observing.
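
To make the reserved-versus-committed distinction concrete, here is a small C++ sketch of the two sums above. It is only a model of the estimate under the assumptions stated in this post (1MB default stack, one touched 4KB page and one 4KB guard page per thread); the type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative model only: how much address space is reserved and how much
// memory is actually committed for a given number of thread stacks.
struct ThreadFootprint {
    std::uint64_t reserved_bytes;   // virtual address space set aside
    std::uint64_t committed_bytes;  // RAM / page file backing touched pages
};

inline ThreadFootprint estimate_footprint(std::uint64_t threads,
                                          std::uint64_t stack_reserve_bytes,
                                          std::uint64_t touched_stack_bytes,
                                          std::uint64_t guard_bytes)
{
    assert(touched_stack_bytes <= stack_reserve_bytes);
    ThreadFootprint f;
    f.reserved_bytes  = threads * (stack_reserve_bytes + guard_bytes);
    f.committed_bytes = threads * (touched_stack_bytes + guard_bytes);
    return f;
}
```

Calling `estimate_footprint(981, 1048576, 4096, 4096)` gives 1,032,671,232 reserved bytes (about 985MB) against 8,036,352 committed bytes (under 8MB), which is why a tool that reports committed memory can show a small number while the address space is already nearly full.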

As TimP pointed out, in OpenMP, creating more threads than you have logical processors is generally counter-productive.

Jim Dempsey

0 Kudos
SergeyKostrov
Valued Contributor II
767 Views
Quoting TimP (Intel)
(no Intel platforms currently support more than 248 logical processors, and not on Windows).

In theory our RTL should work on a 4096-way SGI* UV 1000 on Windows (http://www.sgi.com/products/servers/uv/specs.html). Are there any volunteers to check? :)

I would be glad to verify it.


I finally resolved it and my test application created more than 16,384 threads. The maximum number of threads I was able
to see was 18,623!

I'll provide more details later today.

Best regards,
Sergey

0 Kudos
SergeyKostrov
Valued Contributor II
767 Views
Hi Vladimir,

Thank you for the Test-Case.

Best regards,
Sergey
0 Kudos
jimdempseyatthecove
Honored Contributor III
767 Views
>>A maximum number of threads I was able to see was 18,623!

Yes, you can (by reducing the stack size) but on x32 what is the point?

In a compute-bound system, having more software threads than available hardware threads is generally counterproductive. There may be a few outlier cases where a bad algorithm may see better performance (I should say may work). An example might be a poorly written mesh filter where node progress is blocked by waiting for other node(s) to complete. A better way to write this type of program would be to use a tasking-based system where the software thread migrates from task to task, as opposed to having more threads.

Jim Dempsey

0 Kudos
levicki
Valued Contributor I
767 Views
I guess that some people get a kick out of testing whether a compiler conforms to published specifications :)
0 Kudos
SergeyKostrov
Valued Contributor II
767 Views
Hi Vladimir,

The "problem" was related to the OMP_STACKSIZE environment variable. By default it is set to 2MB for 32-bit platforms
in the Intel OpenMP library. I've changed OMP_STACKSIZE to a minimal value and the test application created significantly more OpenMP threads.
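
For anyone checking the numbers, here is a minimal C++ sketch that converts an OMP_STACKSIZE-style string to bytes. It follows the format described in the OpenMP specification (a number with an optional B, K, M, or G suffix, where no suffix means kilobytes). The function name is hypothetical, this is not how the Intel runtime parses the variable, and a particular runtime may accept additional forms.

```cpp
#include <cctype>
#include <cstdint>
#include <string>

// Hypothetical helper: parse "size", "sizeB", "sizeK", "sizeM" or "sizeG"
// into a byte count. Returns 0 for malformed input. Overflow is not checked.
inline std::uint64_t parse_omp_stacksize(const std::string& text)
{
    std::size_t i = 0;
    const std::size_t n = text.size();
    while (i < n && std::isspace(static_cast<unsigned char>(text[i]))) ++i;
    if (i == n || !std::isdigit(static_cast<unsigned char>(text[i]))) return 0;

    std::uint64_t value = 0;
    while (i < n && std::isdigit(static_cast<unsigned char>(text[i]))) {
        value = value * 10 + static_cast<std::uint64_t>(text[i] - '0');
        ++i;
    }
    while (i < n && std::isspace(static_cast<unsigned char>(text[i]))) ++i;

    std::uint64_t unit = 1024;  // no suffix means kilobytes
    if (i < n) {
        switch (std::toupper(static_cast<unsigned char>(text[i]))) {
            case 'B': unit = 1; break;
            case 'K': unit = 1024; break;
            case 'M': unit = 1024ull * 1024; break;
            case 'G': unit = 1024ull * 1024 * 1024; break;
            default:  return 0;
        }
        ++i;
    }
    while (i < n && std::isspace(static_cast<unsigned char>(text[i]))) ++i;
    return (i == n) ? value * unit : 0;
}
```

For example, `parse_omp_stacksize("2M")` yields 2,097,152 bytes. Dividing a 2GB user address space by that gives at most 1,024 stacks, which is consistent with thread creation failing a little earlier, at 982 threads, once code, data, and heaps are taken into account.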

Screenshots are enclosed.

Best regards,
Sergey
0 Kudos
SergeyKostrov
Valued Contributor II
767 Views
Screenshot 1 ( Task Manager - Processes):

0 Kudos
SergeyKostrov
Valued Contributor II
767 Views
Screenshot 2 ( Task Manager - Performance ):



You can see that the test application crashed as soon as all available memory was allocated.
0 Kudos
Vladimir_P_1234567890
767 Views
Good for you, Sergey,
I'm wondering whether you can find a practical application for your experiments.
--Vladimir
0 Kudos
SergeyKostrov
Valued Contributor II
657 Views
Hello Sergey,
try this example

  for (uiNumThreads = 1; uiNumThreads < 10000; uiNumThreads += 1)
  {
      h[uiNumThreads] = CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)thread_routine, (LPVOID)uiNumThreads, 0, 0);
      if (h[uiNumThreads] == NULL) {
          printf("Kernel object limit is %d\n", uiNumThreads);
          break;
      }
...


Hi Vladimir,

Here are a couple of questions:

How many threads did it create on your system?
Is it a 32-bit or 64-bit system?

By default your example creates Win32 threads with a 1MB stack size.

I'll provide the results of my tests obtained with my own Test-Case some time later. I also would be glad to see
your results for a 64-bit system!

Best regards,
Sergey

0 Kudos