Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Relationship between number of threads in OpenMP application and memory used

SergeyKostrov
Valued Contributor II
1,775 Views

I needed to evaluate the memory requirements of an OpenMP application for different numbers of
threads in the parallel region. As a result of my R&D project I created this table:

# of threads    Memory used

       8            3.2 MB
      16            3.4 MB
      32            3.8 MB
      64            4.6 MB               * Limit for Microsoft's OpenMP DLLs
     128            6.2 MB
     256            9.4 MB
     512           15.8 MB
    1024           28.6 MB
    2048           54.2 MB
    4096          105.4 MB
    8192          207.8 MB
   16384          412.6 MB
   32768          822.2 MB               * Limit for Intel's OpenMP DLLs
   65536        1,641.4 MB   (1.64 GB)   ** Extrapolated
  131072        3,279.8 MB   (3.28 GB)   ** Extrapolated
  262144        6,556.6 MB   (6.56 GB)   ** Extrapolated

It clearly shows that on a 32-bit Windows platform up to 65,536 threads could be created in a simple OpenMP application.
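The measured rows are consistent with a simple linear model. As a rough sketch (the function names are hypothetical, and the two constants are fitted from the first two rows of the table rather than taken from any OpenMP runtime documentation), the fit can be checked like this:

```cpp
#include <cmath>
#include <cstddef>

// Measured rows from the table above: {number of threads, memory used in MB}.
struct Sample { int threads; double mb; };

static const Sample kMeasured[] = {
    {8, 3.2}, {16, 3.4}, {32, 3.8}, {64, 4.6}, {128, 6.2}, {256, 9.4},
    {512, 15.8}, {1024, 28.6}, {2048, 54.2}, {4096, 105.4},
    {8192, 207.8}, {16384, 412.6}, {32768, 822.2},
};
static const std::size_t kCount = sizeof(kMeasured) / sizeof(kMeasured[0]);

// Slope of the line through the first two measured rows (MB per thread).
double per_thread_mb() {
    return (kMeasured[1].mb - kMeasured[0].mb) /
           (kMeasured[1].threads - kMeasured[0].threads);
}

// Intercept of the same line: memory used with zero extra threads (MB).
double base_mb() {
    return kMeasured[0].mb - per_thread_mb() * kMeasured[0].threads;
}

// Linear model: fixed cost plus a constant cost per thread.
double predicted_mb(int threads) {
    return base_mb() + per_thread_mb() * threads;
}

// Largest absolute difference between the model and any measured row (MB).
double max_model_error_mb() {
    double worst = 0.0;
    for (std::size_t i = 0; i < kCount; ++i) {
        double err = std::fabs(predicted_mb(kMeasured[i].threads) - kMeasured[i].mb);
        if (err > worst) worst = err;
    }
    return worst;
}
```

With these inputs the fit gives about 3.0 MB of fixed cost plus about 0.025 MB (roughly 25 KB) per thread, and `predicted_mb(65536)` reproduces the extrapolated 1,641.4 MB row. This only describes the idle test case; it says nothing about an application that allocates private data per thread.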

A test case was based on the code from a post:

http://software.intel.com/en-us/forums/showthread.php?t=103375&o=a&s=lr

0 Kudos
29 Replies
SergeyKostrov
Valued Contributor II
1,188 Views
Please submit your results if you are able to verify the relationship between the number of threads in an OpenMP
application and the memory used.

Best regards,
Sergey
0 Kudos
Vladimir_P_1234567890
1,188 Views
Hi Sergey,

Don't you want to add some practical workload to your example like calculating pi or fibonacci numbers?
--Vladimir
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,188 Views
Sergey,

OpenMP is designed as a tasking system as opposed to a thread system.

In a tasking system you generally set up the application so that the number of software threads == number of hardware threads. Creating excess threads introduces excess thread context switching. OpenMP task switching occurs between the exit of a parallel region and the entry into the next parallel region. This can occur relatively rapidly, as the operating system is generally not involved (unless the time interval between the exit of one parallel region and the entry into the next exceeds a tunable threshold).

The only purpose of allocating more software threads than hardware threads is when (some) software threads may get blocked (for I/O or a lock). Note that waiting for locks tends to be compute bound, so oversubscription is counterproductive there.
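As a minimal sketch of that rule of thumb (the helper names `hardware_threads` and `choose_team_size` and the `oversubscription` parameter are hypothetical, not part of OpenMP), the team size can be capped at the hardware thread count unless threads are expected to block on I/O:

```cpp
#include <algorithm>
#include <thread>
#ifdef _OPENMP
#include <omp.h>
#endif

// Number of hardware threads; never returns less than 1.
unsigned hardware_threads() {
#ifdef _OPENMP
    return static_cast<unsigned>(std::max(1, omp_get_num_procs()));
#else
    unsigned n = std::thread::hardware_concurrency();  // may return 0 if unknown
    return n == 0 ? 1u : n;
#endif
}

// Cap the requested team size at hw * oversubscription.
// oversubscription == 1 means one software thread per hardware thread;
// a larger factor only makes sense when threads block on I/O.
unsigned choose_team_size(unsigned requested, unsigned hw,
                          unsigned oversubscription = 1) {
    unsigned cap = hw * std::max(1u, oversubscription);
    return std::max(1u, std::min(requested, cap));
}
```

In an OpenMP build the result could be passed to `omp_set_num_threads` before the first parallel region, for example `omp_set_num_threads(static_cast<int>(choose_team_size(1024, hardware_threads())));`.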

Jim Dempsey
0 Kudos
SergeyKostrov
Valued Contributor II
1,188 Views
Don't you want to add some practical workload to your example like calculating pi or fibonacci numbers?


Hi Vladimir,

I don't consider calculation of PI or Fibonacci numbers practical or useful in my case. Wasn't it done
before? Yes, and many times. Did I personally program that? Yes, many years ago, as a matter of
learning how recursion works. When somebody has free time it would be a nice programming exercise.

I would be more interested to see your results and compare them with my results. Thanks in advance.

Best regards,
Sergey

0 Kudos
Michael_K_Intel2
Employee
1,188 Views
Hi Sergey,

Comparing the memory consumption of an OpenMP program that does nothing is not very useful. As Jim pointed out, OpenMP is generally used in a model where the number of OpenMP threads matches the number of (logical) cores in the system. Oversubscription can be done, but is not very useful in most cases when programming OpenMP. Please do not confuse the "heavy-weight" threads that OpenMP (and other threading models like TBB) use with the light-weight threads of an OpenCL-style program.

In summary, OpenMP is very memory-efficient for large numbers of threads. A minimal thread in OpenMP just consumes a couple of bytes in memory (thread descriptors, some meta-data from the OpenMP runtime, and a small stack that contains a few function frames). The main memory consumption of an OpenMP thread comes from the private data (which can be as large as GBs if you allocate private arrays in an OpenMP region). Hence, without knowing the application and the memory demands of the parallel algorithm (as Vladimir pointed out), it does not make much sense to investigate the memory consumption.

Does that help?

Cheers,
-michael
0 Kudos
Vladimir_P_1234567890
1,188 Views
Sorry Sergey
I concur with Jim here and can't find a rationale for measuring the memory usage of an application that does not take HW concurrency into account and runs >1000x slower than the serial version.
Do you have one?
If you need more details on the OpenMP memory model I can point to the article "The OpenMP Memory Model" by Jay P. Hoeflinger and Bronis R. de Supinski, or the article "Complete Formal Specification of the OpenMP Memory Model" by Greg Bronevetsky and Bronis R. de Supinski.
thanks.
Vladimir
0 Kudos
SergeyKostrov
Valued Contributor II
1,188 Views

To all guys who responded: Thank you. Could you do a real evaluation?



Quoting Michael Klemm (Intel)
...
Comparing the memory consumption of an OpenMP program that does nothing is not very useful...

[SergeyK] This is exactly what I need to evaluate memory requirements.

...A minimal thread in OpenMP just consumes a couple of bytes in memory (thread descriptors, some meta-data from the OpenMP runtime,
and a small stack that contains a few function frames)...

[SergeyK] This is wrong and it depends on the implementation. Please take a look:


Default thread stack size for Intel OpenMP:

IA-32 architecture   : 2M
Intel 64 architecture: 4M

Default thread stack size for Microsoft OpenMP:

IA-32 architecture   : ~256KB
Intel 64 architecture: No data at the moment

As I stated, I would appreciate it if you spent a couple of minutes testing a real application, instead of
spending time on almost theoretical discussions, and provided some numbers. Thanks in advance.

Best regards,
Sergey

0 Kudos
Michael_K_Intel2
Employee
1,188 Views
Hi Sergey,

The 4M stack size is the maximum size that a thread's stack may grow to. The pages that back the stack are only allocated when you actually touch the data. Before that, the stack size might be close to zero (or one page of 4K).
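To illustrate the difference between reserved and committed stack memory, here is a small arithmetic sketch. It assumes 4 KB pages as mentioned above; the function names are hypothetical, and it ignores guard pages and runtime bookkeeping that a real system adds per thread.

```cpp
#include <cstdint>

const std::uint64_t kPageSize = 4096;  // assumed page size (4 KB)

// Address space reserved for stacks: every thread reserves its full stack size.
std::uint64_t reserved_bytes(std::uint64_t threads, std::uint64_t stack_reserve) {
    return threads * stack_reserve;
}

// Memory actually backed by pages: only the touched part of each stack,
// rounded up to whole pages.
std::uint64_t committed_bytes(std::uint64_t threads, std::uint64_t touched_per_thread) {
    std::uint64_t pages = (touched_per_thread + kPageSize - 1) / kPageSize;
    return threads * pages * kPageSize;
}
```

For example, 1024 threads with a 4 MB stack each reserve 4 GB of address space, but if each thread touches only 6000 bytes of its stack, `committed_bytes(1024, 6000)` is 8 MB (two pages per thread).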

Regarding a real application: I have applications whose stack usage is close to zero, and I have an application that needs about 250 MB of stack per thread. Without knowing which application type you're after, there is not much sense in us providing data, because this data will likely be just wrong for your application. My reply to you stated, in principle, that theoretical thoughts are useless. So, we're on the same page here :-).

Cheers,
-michael
0 Kudos
SergeyKostrov
Valued Contributor II
1,188 Views
Hi Michael,

I provided a link to a test case in my initial post. Please take a look as soon as you have time.

Best regards,
Sergey
0 Kudos
Bernard
Valued Contributor I
1,188 Views
>>>It clearly shows that on a 32-bit Windows platform up to 65,536 threads could be created in a simple OpenMP application.>>>

IIRC M. Russinovich's book "Windows Internals" states that the maximal number of threads cannot exceed 2^16.
0 Kudos
Bernard
Valued Contributor I
1,188 Views
>>>IIRC M. Russinovich's book "Windows Internals" states that the maximal number of threads cannot exceed 2^16.>>>

Sorry, I was wrong: it is the maximal number of GUI objects that cannot exceed 2^16. The maximum number of created threads probably depends on the available resources, i.e. each thread's stack. If the granularity is 64 KB, then the theoretical maximum number of created threads should be 31,250.
0 Kudos
SergeyKostrov
Valued Contributor II
1,188 Views
>>...If the granularity is 64 KB, then the theoretical maximum number of created threads should be 31,250...

Note: It has to be for a 32-bit platform.

There are actually so many things that affect that limit. Is 31,250 threads for:

- Windows, or Linux, or another OS?
- Debug or Release configuration?
- Intel, or Microsoft, or MinGW, or GCC C++ compilers?

Here are results of my testing with different C++ compilers on a 32-bit platform:

[cpp]
// Operating System: Windows XP 32-bit / Release configurations

// Stack size in bytes           // C++ compiler: MSC      BCC      MGW      ICC
// #define _STACK_SIZE 0         // Threads created:
// #define _STACK_SIZE 1024      //            30,548   30,575   30,716   30,533
// #define _STACK_SIZE 2048      //            30,548   30,575   30,716   30,533
// #define _STACK_SIZE 4096      //            30,548   30,575   30,716   30,533
// #define _STACK_SIZE 8192      //            30,548   30,575   30,716   30,533
// #define _STACK_SIZE 16384     //            30,548   30,575   30,716   30,533
// #define _STACK_SIZE 32768     //            30,548   30,575   30,716   30,533
// #define _STACK_SIZE 65536     //            30,548   30,575   30,716   30,533
// #define _STACK_SIZE 131072    //            15,735   15,750   15,823   15,727
// #define _STACK_SIZE 262144    //             7,985    7,995    8,032    7,980
// #define _STACK_SIZE 524288    //             4,019    4,027    4,047    4,017
   #define _STACK_SIZE 1048576   //             2,016    2,021    2,031    2,015
// #define _STACK_SIZE 2097152   //             1,006    1,011    1,015    1,005
// #define _STACK_SIZE 4194304   //               501      504      507      500
[/cpp]
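These counts are close to what a simple address-space bound predicts. Below is a sketch under two stated assumptions: the process has the default 2 GB of user address space of 32-bit Windows, and each stack reservation is rounded up to a 64 KB allocation granularity. The function name `max_threads_bound` is hypothetical.

```cpp
#include <cstdint>

const std::uint64_t kUserAddressSpace = 2ULL << 30;  // assumed: 2 GB user space
const std::uint64_t kGranularity      = 64ULL << 10; // assumed: 64 KB granularity

// Upper bound on the number of threads if nothing but thread stacks
// occupied the user address space.
std::uint64_t max_threads_bound(std::uint64_t stack_size) {
    // Round the requested reservation up to a whole number of granules,
    // with a minimum of one granule.
    std::uint64_t granules = (stack_size + kGranularity - 1) / kGranularity;
    if (granules == 0) granules = 1;
    return kUserAddressSpace / (granules * kGranularity);
}
```

Under these assumptions the bound is 32,768 threads for any stack size up to 65536 bytes, 2,048 for 1048576 bytes and 512 for 4194304 bytes, against measured MSC values of 30,548, 2,016 and 501. The shortfall is presumably the address space already taken by the executable, its DLLs, heaps and per-thread structures.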
0 Kudos
SergeyKostrov
Valued Contributor II
1,188 Views
>>...Here are results of my testing with different C++ compilers on a 32-bit platform...

Here is a test-case:

[cpp]
#if ( defined ( _WIN32_BCC ) || defined ( _WIN32_MGW ) )
    #define STACK_SIZE_PARAM_IS_A_RESERVATION 0x00010000
#endif

CrtPrintf( RTU("Sub-Test 39\n") );

RTuint uiStackSize;

// #define _STACK_SIZE 0
// #define _STACK_SIZE 1024
// #define _STACK_SIZE 2048
// #define _STACK_SIZE 4096
// #define _STACK_SIZE 8192
// #define _STACK_SIZE 16384
// #define _STACK_SIZE 32768
// #define _STACK_SIZE 65536
// #define _STACK_SIZE 131072
// #define _STACK_SIZE 262144
// #define _STACK_SIZE 524288
   #define _STACK_SIZE 1048576
// #define _STACK_SIZE 2097152
// #define _STACK_SIZE 4194304

uiStackSize = _STACK_SIZE;
if( uiStackSize == 0 )
    uiStackSize = 1048576;

RTuint uiNumOfThreads = 0;
HANDLE hThread = RTnull;
RTuint uiLastError;

while( RTtrue )
{
    hThread = ::CreateThread( RTnull, uiStackSize,
                              ( LPTHREAD_START_ROUTINE )ThreadRoutine, RTnull,
                              CREATE_SUSPENDED | STACK_SIZE_PARAM_IS_A_RESERVATION, RTnull );
    if( hThread == RTnull )
    {
        uiLastError = ::GetLastError();
        break;
    }
    uiNumOfThreads += 1;
}

CrtPrintf( RTU("Number of Win32 Threads created: %5ld with a Stack Size: %5ld\n"), uiNumOfThreads, ( RTint )_STACK_SIZE );
CrtPrintf( RTU("System Error : %5ld\n"), uiLastError );
[/cpp]
0 Kudos
Bernard
Valued Contributor I
1,188 Views
>>>Here are results of my testing with different C++ compilers on a 32-bit platform:>>>

Thanks for the results, very interesting. Yes, I agree with you, but I think that the maximal number of threads created by various C/C++ compilers on 32-bit Windows platforms should depend solely on the OS process and thread management API. Moreover, the OS must allocate and reserve user-mode and kernel-mode address space for the EPROCESS and ETHREAD structures. A lot of interesting information is contained in these structures. If you are interested I can post dumps of various EPROCESS and ETHREAD structures.
0 Kudos
SergeyKostrov
Valued Contributor II
1,188 Views
>>... I think that maximal number of threads created by various C/C++ compilers on 32-bit Win platforms should be dependent
>>solely on OS process and thread management API...

You could easily verify my results on your computer. My point of view is based on real results, and these numbers actually depend on the quality of code generation of a C/C++ compiler and on the number of dependent DLLs mapped into the address space of the test application. The MinGW and Borland C/C++ compilers create very compact binaries with a minimal number of dependent DLLs.

Take a look at the last set of numbers:

...
#define _STACK_SIZE 4194304
...
Number of threads created with MSC   = 501
Number of threads created with BCC   = 504
Number of threads created with MinGW = 507
Number of threads created with ICC   = 500

By the way, the MinGW C/C++ compiler for a Windows platform by design doesn't rely on some of Microsoft's CRT-like DLLs. Almost the same applies to the Borland C/C++ compiler. Unfortunately, the Intel C/C++ compiler's overhead is higher, and that is why it allowed only 500 threads to be created in the last test.
0 Kudos
Bernard
Valued Contributor I
1,188 Views
I have read an article written by M. Russinovich where he states that a 32-bit process can create at most 2048 threads. Link: http://blogs.technet.com/b/markrussinovich/archive/2009/07/08/3261309.aspx
0 Kudos
Bernard
Valued Contributor I
1,188 Views
>>>Number of threads created with MSC = 501 Number of threads created with BCC = 504 Number of threads created with MinGW = 507 Number of threads created with ICC = 500>>>

Don't you think that these results, which differ by only a few threads, could depend on the momentary state of the OS, which can vary between the test runs for the various compilers?

>>>By the way, MinGW C/C++ compiler for a Windows platform by design doesn't rely on some Microsoft's CRT-like DLLs>>>

But it must call kernel32.dll exports. Does MinGW replace the MSVCRT libraries with its own?
0 Kudos
Bernard
Valued Contributor I
1,188 Views
>>>My point of view is based on real results and these numbers actually depend on a quality of code generation of a C/C++ compiler and a number of dependent DLLs mapped to the address space of the test application>>>

Yes, but please also take into account that, although those compilers can produce more or less compact code, in the end the OS must create and manage those threads and also allocate space for, e.g., the internal ETHREAD structures needed to represent a thread, and this does not depend on the compiler being used. This also adds to the memory usage as the number of created threads grows. I forgot to add that all threads in a process share that process's address space, and the mapping of DLLs is done at the process address space granularity.
0 Kudos
SergeyKostrov
Valued Contributor II
1,188 Views
>>...Yes, but please take into account also that those compilers can produce more or less compact code...

Iliya,

I've done lots of development with these C/C++ compilers and I have been using them for a very long time. I really don't understand how somebody could talk about the quality of code generation of all these C/C++ compilers without using, testing, and analyzing the binaries for some test-cases.

Best regards,
Sergey
0 Kudos
Bernard
Valued Contributor I
1,067 Views
Sergey Kostrov wrote:

>>...Yes , but please take into account also that those compilers can produce more or less compact code...

Iliya,

I've done lots of development with these C/C++ compilers and I have been using them for a very long time. I really don't understand how somebody could talk about the quality of code generation of all these C/C++ compilers without using, testing, and analyzing the binaries for some test-cases.

Best regards,
Sergey

I agree with you, but there is one thing I cannot understand: how can one of those compilers affect the thread and process creation, management and tear-down mechanisms? I think that everything related to thread and process management is under the exclusive control of the OS, and without a global modification of the internal OS mechanisms a compiler will not be able to optimize its code for the maximum number of created threads. @Sergey, I respect your knowledge and I learn a lot by reading and discussing with you on this forum.
0 Kudos