Intel Software Engineers stated some time ago that Intel's implementation of OpenMP allows creating up to 16,384 threads.
I've just completed a test, and an OpenMP-based application compiled with Intel C++ Composer XE 12 Update 9 couldn't create
more than 981 OpenMP threads.
Error messages are as follows:
...
OMP: Error #136: Cannot create thread.
OMP: System error #8: Not enough storage is available to process this command.
OMP: Error #178: Function GetExitCodeThread() failed:
OMP: System error #6: The handle is invalid.
...
OpenMP support was enabled in the Visual Studio project: Generate Parallel Code (/openmp, equiv. to /Qopenmp).
My environment:
OS: Windows XP 32-bit
IDE: Visual Studio 2005 SP1
C++ compiler: Intel C++ Composer XE 2011 Update 9
Best regards,
Sergey
Yes.
It looks like you have reached the 2GB-per-process Windows limitation.
The total amount of allocated memory ( for thread stacks, etc. ) was significantly less than 2GB and I'll provide
exact numbers later.
Could you work with 64 bit version to get more threads working?
No.
Best regards,
Sergey
With 2GB
Subtract code size
Subtract static data
Subtract main thread initial stack
The remaining memory is in your initial heap
Prior to creating your threads you may perform allocations; subtract these from the amount of available memory as well.
Assume for example you have 1GB remaining.
Default thread stack limit is 1MB. Therefore 1000 threads could possibly be created in the remaining 1GB assuming they used no additional resources. *** and leaving 0 RAM for additional allocations ***
64-bit does not have this limitation.
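The arithmetic above can be sketched in a few lines of C++. This is only a rough upper bound under the assumptions stated in the post (1GB of address space left, every thread reserving exactly one default-sized stack and nothing else); the helper name `max_threads_by_address_space` is made up for illustration.

```cpp
#include <cstdint>

constexpr std::uint64_t MiB = 1024ull * 1024ull;
constexpr std::uint64_t GiB = 1024ull * MiB;

// Hypothetical helper: upper bound on the number of threads when each
// thread reserves `stack_bytes` of virtual address space and nothing else.
constexpr std::uint64_t max_threads_by_address_space(std::uint64_t free_bytes,
                                                     std::uint64_t stack_bytes) {
    return stack_bytes == 0 ? 0 : free_bytes / stack_bytes;
}

// 1 GiB remaining with 1 MiB stacks leaves room for roughly a thousand threads.
static_assert(max_threads_by_address_space(1 * GiB, 1 * MiB) == 1024,
              "1 GiB / 1 MiB");
```

Doubling the per-thread stack halves the bound, which is why the stack size matters far more than the memory the threads actually touch.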
Does your system have more than 981 logical processors?
If not, then why so many threads???
Jim Dempsey
I'll follow up on your posts some time later. Thank you for the feedback!
I'm simply overwhelmed by a number of different issues and little problems related to integrating the Intel C++ compiler with the project.
Best regards,
Sergey
I still can't resolve the problem. Here is a new Test-Case 2 that reproduces it:
[cpp]
// Test-Case 2 - Maximum number of OpenMP threads for Intel C++ compiler ( XE v12.1.3 )
...
uint uiNumThreads = 0;

// uiNumThreads = 512;     // No Errors: Created 512 threads
uiNumThreads = 981;        // No Errors: Created 981 threads
// uiNumThreads = 982;     // OMP: Error #136: Cannot create thread
// uiNumThreads = 1024;    // OMP: Error #136: Cannot create thread

omp_set_num_threads( uiNumThreads );

#pragma omp parallel for
for( int i = 0; i < 4096; i++ )
{
    int iValue = 2;
    // omp_get_thread_num() is zero-based, so add 1 for display
    printf( "Iteration: %4d - Thread %4d out of %4u\n", i, omp_get_thread_num() + 1, uiNumThreads );
}
...
[/cpp]
Could you forward my concerns to the Intel Engineering Team, please?
Best regards,
Sergey
Quoting Sergey Kostrov
The total amount of allocated memory ( for thread stacks, etc. ) was significantly less than 2GB and I'll provide
exact numbers later.
Here is a screenshot ( ~110MB allocated ):

983 - 2 ( default process threads of the test application ) = 981
In theory our RTL should work on a 4096-way SGI* UV 1000 on Windows (http://www.sgi.com/products/servers/uv/specs.html). Are there any volunteers to check? :)
--Vladimir
When your test program starts and runs up to the point just before OpenMP starts, your virtual memory address space is something like this (order may differ):
(4KB reserved) at 0x00000000
(static data) at +4KB
(code)
(initial heap)
(unmapped address) 2GB/3GB less above and below items
(reserved 4KB)
(main thread stack)
--------------------
0x80000000 or 0xC0000000 to 0xFFFFFFFF system address space of your virtual memory
If/when the heap expires prior to or following additional thread allocations, additional heaps are mapped/allocated/reserved from the unmapped address space (a portion thereof), assuming there is available address space.
Now then, when a new thread is allocated/created (the surmise part):
The O/S checks the unmapped address space to see if it has sufficient space for:
thread stack (default 1MB, you may specify differently)
guard page (4KB on x32)
optional thread context information (?KB)
These addresses come out of the virtual memory address space (assuming address space available)
*** Now then, until something is pushed onto a thread's stack, each untouched 4KB page of that stack holds only a reservation of virtual address space; it requires neither physical memory nor page file space. The first touch of such a page causes a page fault, and the O/S then maps the page (assuming page file space is available). A similar thing happens each time you add an additional heap (expand the heap).
What this means is your 981 threads have:
981x (default thread stack + 4KB guard) virtual address space consumed (~1GB)
981x (4KB touched stack + 4KB guard) RAM/pagefile space consumed (~8MB)
When the program attempts to allocate the 982nd thread there is no available virtual address space.
At least this is my assessment as to what you are observing.
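The two estimates above (reserved address space versus touched memory) can be checked with a small sketch. It assumes the figures used in this post: a 1MB default stack, a 4KB guard page, and one touched 4KB stack page per thread. `StackFootprint` and `footprint` are illustrative names, not part of any API.

```cpp
#include <cstdint>

constexpr std::uint64_t KiB = 1024;
constexpr std::uint64_t MiB = 1024 * KiB;

// Illustrative type: address space reserved vs. memory actually backed.
struct StackFootprint {
    std::uint64_t reserved_bytes;   // virtual address space consumed
    std::uint64_t committed_bytes;  // RAM/pagefile space consumed
};

// Each thread reserves its full stack plus one guard page, but only the
// touched stack pages and the guard page are counted as backed here.
constexpr StackFootprint footprint(std::uint64_t threads,
                                   std::uint64_t stack_bytes,
                                   std::uint64_t page_bytes,
                                   std::uint64_t touched_pages) {
    return StackFootprint{threads * (stack_bytes + page_bytes),
                          threads * (touched_pages * page_bytes + page_bytes)};
}

// 981 threads: about 985 MiB reserved, under 8 MiB actually backed.
static_assert(footprint(981, 1 * MiB, 4 * KiB, 1).reserved_bytes == 1032671232ull,
              "reserved");
static_assert(footprint(981, 1 * MiB, 4 * KiB, 1).committed_bytes == 8036352ull,
              "committed");
```

The gap between the two numbers is why a memory monitor can show a small allocation while thread creation still fails.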
As TimP pointed out, in OpenMP, creating more threads than you have logical processors is generally counterproductive.
Jim Dempsey
In theory our RTL should work on a 4096-way SGI* UV 1000 on Windows (http://www.sgi.com/products/servers/uv/specs.html). Are there any volunteers to check? :)
I would be glad to verify it.
I finally resolved it and my test application created more than 16,384 threads. The maximum number of threads I was able
to see was 18,623!
I'll provide more details later today.
Best regards,
Sergey
Thank you for the Test-Case.
Best regards,
Sergey
Yes, you can (by reducing the stack size) but on x32 what is the point?
In a compute bound system, more software threads than available hardware threads, is generally counterproductive. There may be a few outlier cases where a bad algorithm may see better performance (I should say may work). An example might be a poorly written mesh filter where node progress is blocked by waiting for other node(s) to complete. A better way to write this type of program would be to use a tasking based system where the software thread migrates from task to task as opposed to having more threads.
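As a minimal sketch of that advice, the requested thread count can be capped at the number of hardware threads before it is handed to the runtime. `choose_thread_count` is a hypothetical helper; `std::thread::hardware_concurrency()` is standard C++11 and may return 0 when the value is unknown, which is treated as 1 here.

```cpp
#include <algorithm>
#include <thread>

// Hypothetical helper: never ask for more software threads than there are
// hardware threads, and never ask for fewer than one.
unsigned choose_thread_count(unsigned requested, unsigned hardware_threads) {
    const unsigned hw  = std::max(hardware_threads, 1u);  // 0 means "unknown"
    const unsigned req = std::max(requested, 1u);
    return std::min(req, hw);
}

// Convenience overload that queries the machine it runs on.
unsigned choose_thread_count(unsigned requested) {
    return choose_thread_count(requested, std::thread::hardware_concurrency());
}
```

The result could then be passed to `omp_set_num_threads` in place of a hard-coded value such as 981.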
Jim Dempsey
The "problem" was related to the OMP_STACKSIZE environment variable. By default it is set to 2MB for 32-bit platforms
in the Intel OpenMP library. I changed OMP_STACKSIZE to a minimal value and the test application created significantly more OpenMP threads.
Screenshots are enclosed.
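To see why the stack size dominates, here is a rough sketch that converts an OMP_STACKSIZE-style string into bytes and derives an upper bound on the thread count for a 2GB user address space. It assumes the value format described in the OpenMP specification (a number with an optional B, K, M or G suffix, kilobytes when no suffix is given); how the Intel runtime validates or rounds the value is not covered. `parse_stack_size` and `max_threads` are hypothetical helpers, and overflow of very large values is not handled.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical parser: returns the size in bytes, or 0 if the text is malformed.
std::uint64_t parse_stack_size(const std::string& text) {
    std::size_t i = 0;
    auto skip_spaces = [&]() {
        while (i < text.size() && std::isspace(static_cast<unsigned char>(text[i]))) ++i;
    };

    skip_spaces();
    std::uint64_t value = 0;
    std::size_t digits = 0;
    while (i < text.size() && std::isdigit(static_cast<unsigned char>(text[i]))) {
        value = value * 10 + static_cast<std::uint64_t>(text[i] - '0');
        ++i;
        ++digits;
    }
    if (digits == 0) return 0;

    skip_spaces();
    std::uint64_t unit = 1024;  // no suffix: kilobytes
    if (i < text.size()) {
        switch (std::toupper(static_cast<unsigned char>(text[i]))) {
            case 'B': unit = 1; break;
            case 'K': unit = 1024; break;
            case 'M': unit = 1024ull * 1024; break;
            case 'G': unit = 1024ull * 1024 * 1024; break;
            default: return 0;
        }
        ++i;
    }
    skip_spaces();
    return i == text.size() ? value * unit : 0;
}

// Upper bound only: ignores code, heap, guard pages and other reservations.
std::uint64_t max_threads(std::uint64_t address_space_bytes, std::uint64_t stack_bytes) {
    return stack_bytes == 0 ? 0 : address_space_bytes / stack_bytes;
}
```

With a 2MB stack, `max_threads(2ull << 30, parse_stack_size("2M"))` gives 1024, so a practical ceiling a little below that, such as the 981 threads seen earlier, is consistent with the rest of the process using some of the address space.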
Best regards,
Sergey
You can see that the test application crashed as soon as all available memory was allocated.
for( uiNumThreads = 1; uiNumThreads < 10000; uiNumThreads += 1 )
{
    h[uiNumThreads] = CreateThread( NULL, 0, (LPTHREAD_START_ROUTINE)thread_routine, (LPVOID)uiNumThreads, 0, 0 );
    if( h[uiNumThreads] == NULL ) {
        printf( "Kernel object limit is %d\n", uiNumThreads );
        break;
    }
}
Hi Vladimir,
Here are a couple of questions:
How many threads did it create on your system?
Is it a 32-bit or 64-bit system?
By default your example creates Win32 threads with a 1MB stack size.
I'll provide results of my tests obtained with my own Test-Case some time later. I also would be glad to see
your results for a 64-bit system!
Best regards,
Sergey
