Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Relationship between number of threads in OpenMP application and memory used

SergeyKostrov
Valued Contributor II

I needed to evaluate the memory requirements of an OpenMP application with different numbers of
threads in the parallel region. As a result of my R&D project I created this table:

# of threads   Memory used

8              3.2 MB
16             3.4 MB
32             3.8 MB
64             4.6 MB       * Limit for Microsoft's OpenMP DLLs
128            6.2 MB
256            9.4 MB
512            15.8 MB
1024           28.6 MB
2048           54.2 MB
4096           105.4 MB
8192           207.8 MB
16384          412.6 MB
32768          822.2 MB     * Limit for Intel's OpenMP DLLs
65536          1,641.4 MB   1.64 GB ** Extrapolated
131072         3,279.8 MB   3.28 GB ** Extrapolated
262144         6,556.6 MB   6.56 GB ** Extrapolated

It clearly shows that on a 32-bit Windows platform up to 65,536 threads could be created in a simple OpenMP application.
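
The measured rows in the table happen to lie on a straight line of roughly 3.0 MB plus 0.025 MB per thread. The sketch below fits that line to the measured rows and applies it to the three larger thread counts. It is only an illustration of the arithmetic; the post does not say how the extrapolated rows were actually computed, and the names `fit_line` and `estimate_mb` are made up for this example.

```python
# Measured rows copied from the table: (number of threads, memory used in MB).
measured = [
    (8, 3.2), (16, 3.4), (32, 3.8), (64, 4.6), (128, 6.2),
    (256, 9.4), (512, 15.8), (1024, 28.6), (2048, 54.2),
    (4096, 105.4), (8192, 207.8), (16384, 412.6), (32768, 822.2),
]


def fit_line(points):
    """Least-squares fit of: memory_mb = base_mb + per_thread_mb * threads."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    per_thread_mb = sxy / sxx
    base_mb = mean_y - per_thread_mb * mean_x
    return base_mb, per_thread_mb


base_mb, per_thread_mb = fit_line(measured)


def estimate_mb(threads):
    """Estimated memory in MB for a given thread count, using the fitted line."""
    return base_mb + per_thread_mb * threads


if __name__ == "__main__":
    print(f"base: {base_mb:.2f} MB, per thread: {per_thread_mb:.4f} MB")
    for threads in (65536, 131072, 262144):
        print(f"{threads:>7} threads: {estimate_mb(threads):,.1f} MB")
```

Under this fit, the estimates for 65,536, 131,072 and 262,144 threads agree with the rows marked as extrapolated in the table.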

The test-case was based on the code from this post:

http://software.intel.com/en-us/forums/showthread.php?t=103375&o=a&s=lr

Bernard
Valued Contributor I
>>> code generation of all these C/C++ compilers without using, testing, analyzing, etc, of binaries for some test-cases?>>>
I was not talking about the quality of the code generated by those compilers. I simply could not understand how one of those compilers could affect internal OS structures and mechanisms without performing some kind of system-wide modification, for example hooking and intercepting the CreateThread function and rewriting the memory manager routines responsible for the memory allocation needed for thread creation.
SergeyKostrov
>>...without global-wide modification of the internal OS mechanism compiler will not be able to optimize its code for max
>>number of creating threads...

Case 1: C/C++ compiler A creates very compact binary code (with little overhead!). When this code is loaded into memory, it does not take up additional memory that could otherwise be used for stack allocation when new threads are created. Let's say 555 threads will be created.

Case 2: C/C++ compiler B creates less compact binary code (with lots of overhead!). When this code is loaded into memory, it takes up additional memory that could otherwise be used for stack allocation when new threads are created. Let's say 444 threads will be created.

So it is a 100% memory-related issue; take a look at the table I posted. If you don't believe me, try running the test-case I've provided with the MS and Intel C/C++ compilers and you will see how it works in a real environment. Check both test executables with MS Depends to see the differences in the number of dependent DLLs for the two compilers.

Modern C/C++ compilers have more overhead than legacy compilers, and some Intel C++ compiler users are complaining about it (including me). Borland and MinGW won "the race" because they have less overhead and depend on fewer DLLs. Nothing else explains why these "max thread numbers" are different.
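
The argument in Case 1 and Case 2 can be sketched as simple arithmetic: whatever the loaded binary and its DLLs occupy is no longer available for thread stacks. The snippet below is a rough model only. Every number in it is hypothetical (the address-space size, the per-thread stack reservation and both footprints are assumptions chosen for illustration, not measurements from this thread), and `max_threads` is a made-up helper name.

```python
def max_threads(address_space_mb, image_footprint_mb, stack_reserve_mb):
    """Rough upper bound on threads when each thread needs its own stack.

    address_space_mb:   memory the process can address (assumed value)
    image_footprint_mb: memory taken by the executable and its DLLs (assumed)
    stack_reserve_mb:   stack reserved for each new thread (assumed)
    """
    free_mb = address_space_mb - image_footprint_mb
    if free_mb <= 0:
        return 0
    return int(free_mb // stack_reserve_mb)


# Hypothetical inputs, for illustration only.
ADDRESS_SPACE_MB = 2048  # assumed address space available to the process
STACK_RESERVE_MB = 1     # assumed stack reservation per thread

# "Compiler A": compact binary with few dependent DLLs (assumed 40 MB).
compact = max_threads(ADDRESS_SPACE_MB, 40, STACK_RESERVE_MB)

# "Compiler B": larger binary with more dependent DLLs (assumed 150 MB).
bulky = max_threads(ADDRESS_SPACE_MB, 150, STACK_RESERVE_MB)

if __name__ == "__main__":
    print("compact binary:", compact, "threads")
    print("bulky binary:  ", bulky, "threads")
```

In this model the only thing that differs between the two cases is the footprint of the loaded image, so the compact binary always ends up with the higher thread count, which is the point the post is making.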
Bernard
@Sergey, thanks for the explanation. It seems that I completely misunderstood your post when you wrote about code compacting.
SergeyKostrov
Hi Iliya,
I'll create and upload a Visual Studio project with the test-case for 32-bit and 64-bit platforms. I hope it will help everybody clear up as many things as possible with regard to this subject.
Best regards, Sergey
PS: I really would like to see numbers for a 64-bit Windows 7 Professional OS!
Bernard
>>>I'll create and upload a Visual Studio project for 32-bit and 64-bit platforms with the test-case. I hope that it will help everybody to clear as many as possible things with regard to that subject. Best regards, Sergey PS: I really would like to see numbers for a 64-bit Windows 7 Professional OS!>>>
Thanks, Sergey. I will run your test case and post the results. Meanwhile I'm testing an FFT algorithm and I'm getting very strange results with the VS2010 compiler. For example, the FFT of 4096 sin function elements is completed in 160245 msec for 1e4 loop iterations; the same test compiled with the VS2010 compiler executes the same code in 140451 msec, and the Intel C/C++ compiler outperforms the VS2010 compiler at a whopping 905 msec per 1e4 loop iterations. Here is the link :
Bernard
>>> I really would like to see numbers for a 64-bit Windows 7 Professional OS!>>>
I have the 64-bit Win 7 Pro Russian edition installed as a VMware appliance. If you are interested, please prepare the test case and I will run it.
SergeyKostrov
>>... I'm testing a FFT algorithm and I'm having very strange results...
I simply would like to ask: 'Please don't pollute the thread with unrelated problems, issues, etc'. Thanks in advance. Sometimes I have the same problem!
Bernard
>>>I simply would like to ask 'Please don't pollute the thread with unrelated problems, issues, etc'.>>>
Sorry about that. I know I should have already created a new thread solely for FFT testing. I will do it today.