I needed to evaluate the memory requirements of an OpenMP application with different numbers of
threads in the parallel region. As a result of my R&D project I created this table:
# of threads   Memory used
8              3.2 MB
16             3.4 MB
32             3.8 MB
64             4.6 MB       * Limit for Microsoft's OpenMP DLLs
128            6.2 MB
256            9.4 MB
512            15.8 MB
1024           28.6 MB
2048           54.2 MB
8192           207.8 MB
16384          412.6 MB
32768          822.2 MB     * Limit for Intel's OpenMP DLLs
65536          1,641.4 MB (1.64 GB)   ** Extrapolated
131072         3,279.8 MB (3.28 GB)   ** Extrapolated
262144         6,556.6 MB (6.56 GB)   ** Extrapolated
It clearly shows that on a 32-bit Windows platform up to 65,536 threads could be created in a simple OpenMP application.
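The rows of the table follow a straight line, so the extrapolated values can be reproduced with a small linear model. The sketch below is only an illustration: the function names are hypothetical, the intercept and slope are read off the measured rows (3.2 MB at 8 threads, 3.4 MB at 16 threads), and it assumes the trend stays linear beyond the measured range.

```c
/* Linear model read off the measured rows of the table.
   Assumption: the trend stays linear beyond 32768 threads. */
#define BASE_MB        3.0    /* intercept: footprint left when the per-thread part is removed */
#define PER_THREAD_MB  0.025  /* slope: (3.4 - 3.2) / (16 - 8) MB, i.e. 25.6 KB per thread */

/* Estimated memory use in MB for a given number of OpenMP threads. */
double estimated_memory_mb(unsigned long threads)
{
    return BASE_MB + PER_THREAD_MB * (double)threads;
}

/* Largest thread count whose estimated footprint fits into budget_mb. */
unsigned long max_threads_for_budget(double budget_mb)
{
    if (budget_mb <= BASE_MB) {
        return 0;
    }
    return (unsigned long)((budget_mb - BASE_MB) / PER_THREAD_MB);
}
```

With a budget of 2048 MB (assuming the default 2 GB user address space of a 32-bit Windows process), `max_threads_for_budget` lands between 65,536 and 131,072, which matches the table: the 65,536 row fits and the 131,072 row does not.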
A test case was based on the code from a post:
OpenMP is designed as a tasking system as opposed to a thread system.
In a tasking system you generally set up the application so that the number of software threads == number of hardware threads. Creating excess threads introduces excess thread context switching. OpenMP task switching occurs between the exit of a parallel region and the entry into the next parallel region. This can occur relatively rapidly, as the operating system is generally not involved (unless the time interval between the exit of one parallel region and the entry into the next exceeds a tunable threshold).
The only purpose of allocating more software threads than hardware threads is when (some) software threads may get blocked (for I/O or a lock). Note that waiting for locks tends to be compute bound, so oversubscription is counterproductive there.
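The "software threads == hardware threads" setup described above can be sketched as follows. This is a minimal illustration, not code from any application in this thread: the function names are hypothetical, and the `_OPENMP` guards are only there so the file still builds (with a single thread) when OpenMP is not enabled.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hardware threads visible to the OpenMP runtime; 1 when built without OpenMP. */
int hardware_threads(void)
{
#ifdef _OPENMP
    return omp_get_num_procs();
#else
    return 1;
#endif
}

/* Request one software thread per hardware thread, run a parallel region,
   and return the team size the runtime actually provided. */
int run_matched_region(void)
{
    int team_size = 1;
#ifdef _OPENMP
    omp_set_num_threads(hardware_threads());
#endif
    #pragma omp parallel
    {
        /* One thread records the team size; the others wait at the implicit barrier. */
        #pragma omp single
        {
#ifdef _OPENMP
            team_size = omp_get_num_threads();
#endif
        }
    }
    return team_size;
}
```

The runtime may hand out fewer threads than requested, so the returned team size is the number to trust, not the request.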
I don't consider the calculation of PI or Fibonacci numbers practical or useful in my case. Wasn't it done
before? Yes, and many times. Did I personally program that? Yes, many years ago, as a matter of
learning how recursion works. When somebody has free time, it would be a nice programming exercise.
I would be more interested to see your results and compare them with my results. Thanks in advance.
Comparing the memory consumption of an OpenMP program that does nothing is not very useful. As Jim pointed out, OpenMP is generally used in a model where the number of OpenMP threads matches the number of (logical) cores in the system. Oversubscription can be done, but is not very useful in most cases when programming OpenMP. Please do not confuse the "heavy-weight" threads that OpenMP (and other threading models like TBB) use with the light-weight threads of an OpenCL-style program.
In summary, OpenMP is very memory-efficient for a large number of threads. A minimal thread in OpenMP just consumes a couple of bytes in memory (thread descriptors, some meta-data from the OpenMP runtime, and a small stack that contains a few function frames). The main memory consumption of an OpenMP thread comes from the private data (which can be as large as GBs if you allocate private arrays in an OpenMP region). Hence, without knowing the application and the memory demands of the parallel algorithm (Wladimir pointed that out), it does not make much sense to investigate the memory consumption.
Does that help?
To all guys who responded: Thank you. Could you do a real evaluation?
Quoting Michael Klemm (Intel)
Default thread stack size for Intel OpenMP:
IA-32 architecture : 2M
Intel 64 architecture: 4M
Default thread stack size for Microsoft OpenMP:
IA-32 architecture : ~256KB
Intel 64 architecture: No data at the moment
As I stated, I would appreciate it if you spent a couple of minutes testing a real application, instead of
spending time on almost theoretical discussions, and provided some numbers. Thanks in advance.
The 4M stack size is the maximum size that a thread's stack may grow to. The pages that back the stack are only allocated when you actually touch the data. Before that, the stack size might be close to zero (or one page of 4K).
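The difference between the maximum stack size and the pages that are actually backed can be put into numbers. The sketch below is a simplified model with hypothetical function names: it assumes 4K pages (as mentioned above), treats the maximum stack size as reserved address space only, and ignores guard pages and any runtime bookkeeping.

```c
#define PAGE_BYTES 4096ULL  /* 4K page, as in the post above */

/* Address space set aside for stacks: every thread reserves the full maximum. */
unsigned long long reserved_stack_bytes(unsigned long threads,
                                        unsigned long long max_stack_bytes)
{
    return (unsigned long long)threads * max_stack_bytes;
}

/* Memory actually backed by pages: the touched bytes of each thread,
   rounded up to whole pages and capped at the maximum stack size. */
unsigned long long committed_stack_bytes(unsigned long threads,
                                         unsigned long long touched_bytes,
                                         unsigned long long max_stack_bytes)
{
    unsigned long long pages = (touched_bytes + PAGE_BYTES - 1) / PAGE_BYTES;
    unsigned long long bytes = pages * PAGE_BYTES;
    if (bytes > max_stack_bytes) {
        bytes = max_stack_bytes;
    }
    return (unsigned long long)threads * bytes;
}
```

For example, 64 threads with a 4M maximum reserve 256 MB of address space, but if each thread touches only 100 bytes of its stack, the model commits one page per thread, 256 KB in total.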
Regarding a real application: I have applications that are close to zero and I have an application that needs about 250 MB of stack per thread. Without knowing which application type you're after, there is not much sense in us providing data, because this data will likely be just wrong for your application. My reply to you stated, in principle, that theoretical thoughts are useless. So, we're on the same page here :-).
Sergey Kostrov wrote: I agree with you, but there is one thing I cannot understand: how can one of those compilers affect the mechanisms of thread and process creation, management, and teardown? I think that everything related to thread and process management is under the exclusive control of the OS, and without a global modification of the internal OS mechanisms a compiler will not be able to optimize its code for the maximum number of created threads. @Sergey I respect your knowledge and I learn a lot by reading and discussing with you on these forums.
>>...Yes, but please take into account also that those compilers can produce more or less compact code...
I've done lots of development with these C/C++ compilers and I've been using them for a very long time. I really don't understand how somebody could talk about the quality of code generation of all these C/C++ compilers without using, testing, and analyzing binaries for some test cases.