Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

OpenMP number of threads problem

Kazik
Beginner
538 Views

Hello,

I'm not an expert in parallelization, but since I have a 4-core Intel processor, I compile my program with the -openmp -parallel options, which keeps all 4 cores busy (I check this with htop on the Linux system I use). From time to time I can also run my programs on an 8-core machine, so I log into that machine, compile my code there (no environment variables limiting the maximum number of threads are set), and run the program. However, only one processor ever runs at 100%. Eight processes have clearly started, but seven of them are practically idle, each using about 2% of a core.

So here is my question: why is that? The code I compile is identical, both computers are configured in exactly the same way, and the Intel compiler version is the same. I'm really confused by this.

Any help would be appreciated,

--

Kind regards,

Kazik

0 Kudos
11 Replies
Michael_K_Intel2
Employee

Hi,

I have some questions to make it easier to track down your problem:

  • Did you use the -par-report command-line option to get a report on what the compiler auto-parallelized? If not, can you please do so and check whether the compiler reports any auto-parallelized code fragments?
  • Are there any OpenMP constructs (e.g. parallel for) in your code that would create threads? If not, and the compiler did not report any auto-parallelized code fragments, your program is most likely still running sequentially.

Can you please check this and get back to us?

Cheers,

-michael

Kazik
Beginner

Hi,

1) To answer the first question:

Yes, I do use -par-report and I get some info that some parts of my code have been vectorized/auto-parallelized. It looks like this:

Main.cpp(104): (col. 3) remark: LOOP WAS AUTO-PARALLELIZED.
Main.cpp(57): (col. 3) remark: LOOP WAS VECTORIZED.
Main.cpp(65): (col. 3) remark: LOOP WAS VECTORIZED.
Main.cpp(67): (col. 3) remark: LOOP WAS VECTORIZED.
Main.cpp(105): (col. 5) remark: LOOP WAS VECTORIZED.
Main.cpp(105): (col. 5) remark: LOOP WAS VECTORIZED.
1D_FFTW_Defs.cpp(30): (col. 3) remark: LOOP WAS VECTORIZED.
1D_FFTW_Defs.cpp(36): (col. 5) remark: BLOCK WAS VECTORIZED.
1D_FFTW_Defs.cpp(15): (col. 3) remark: LOOP WAS VECTORIZED.
realtimeEv_1Dsolitons_Defs.cpp(155): (col. 5) remark: LOOP WAS AUTO-PARALLELIZED.
realtimeEv_1Dsolitons_Defs.cpp(240): (col. 5) remark: LOOP WAS AUTO-PARALLELIZED.
realtimeEv_1Dsolitons_Defs.cpp(269): (col. 5) remark: LOOP WAS AUTO-PARALLELIZED.
realtimeEv_1Dsolitons_Defs.cpp(156): (col. 7) remark: LOOP WAS VECTORIZED.
realtimeEv_1Dsolitons_Defs.cpp(170): (col. 7) remark: LOOP WAS VECTORIZED.
realtimeEv_1Dsolitons_Defs.cpp(176): (col. 7) remark: LOOP WAS VECTORIZED.
realtimeEv_1Dsolitons_Defs.cpp(194): (col. 2) remark: LOOP WAS VECTORIZED.
realtimeEv_1Dsolitons_Defs.cpp(221): (col. 7) remark: BLOCK WAS VECTORIZED.
realtimeEv_1Dsolitons_Defs.cpp(244): (col. 2) remark: LOOP WAS VECTORIZED.
realtimeEv_1Dsolitons_Defs.cpp(270): (col. 7) remark: LOOP WAS VECTORIZED.
realtimeEv_1Dsolitons_Defs.cpp(287): (col. 7) remark: LOOP WAS VECTORIZED.
realtimeEv_1Dsolitons_Defs.cpp(298): (col. 7) remark: LOOP WAS VECTORIZED.
realtimeEv_1Dsolitons_Defs.cpp(315): (col. 7) remark: LOOP WAS VECTORIZED.
realtimeEv_1Dsolitons_Defs.cpp(156): (col. 7) remark: LOOP WAS VECTORIZED.
realtimeEv_1Dsolitons_Defs.cpp(244): (col. 2) remark: LOOP WAS VECTORIZED.
realtimeEv_1Dsolitons_Defs.cpp(270): (col. 7) remark: LOOP WAS VECTORIZED.
realtimeEv_1Dsolitons_Defs.cpp(17): (col. 3) remark: LOOP WAS VECTORIZED.

2) Hoping that I understood you correctly:

No, I don't use any explicit OpenMP constructs, i.e. I don't create threads within my program or anything like that. It is just pure C/C++ programming with no OpenMP directives.

Thank you for your reply; I still hope for a solution.

--

Kind regards,

Kazik

Michael_K_Intel2
Employee

Hi Kazik,

Can you confirm that the parallelized loops cover a large fraction of your program? It looks as though most parts of your program are still running sequentially. It may be a good idea to examine your program with the VTune performance analyzer and see what the 8 launched threads do while your program runs.

Would that be possible for you?

Cheers,

-michael

Kazik
Beginner

Hi,

OK, I'll ask my system administrator tomorrow whether he'd be happy to install the VTune performance analyzer, and I'll let you know. Anyway, wouldn't what you suggest be strange? If I understand it correctly, it would mean that one core of the 8-core machine is as powerful as all four cores of the 4-core machine, wouldn't it?

--

Best,

Kazik

Michael_K_Intel2
Employee

Hi Kazik,

That's a piece of information I missed when reading your text. :-( Please accept my apologies.

Please try setting OMP_NUM_THREADS explicitly on the 8-core machine. I have seen cases where something in the environment prevented the OpenMP runtime from utilizing all threads.

Cheers

-michael
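To see whether the OMP_NUM_THREADS suggestion above actually reaches the runtime, a small probe can help. This is a minimal sketch, not part of the original program: the names OmpReport, query_omp_runtime and print_omp_report are hypothetical, and it only uses standard OpenMP API calls (omp_get_max_threads, omp_get_num_procs, omp_get_num_threads). If the code is compiled without OpenMP support, the pragmas are ignored and every field reports 1.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: a snapshot of what the OpenMP runtime believes it may use.
struct OmpReport {
    const char* env_value;  // raw OMP_NUM_THREADS string, or nullptr if unset
    int max_threads;        // upper bound for the next parallel region
    int num_procs;          // processors visible to the runtime
    int team_size;          // threads that actually joined a parallel region
};

OmpReport query_omp_runtime() {
    OmpReport r;
    r.env_value = std::getenv("OMP_NUM_THREADS");
    r.max_threads = 1;
    r.num_procs = 1;
    r.team_size = 1;
#ifdef _OPENMP
    r.max_threads = omp_get_max_threads();
    r.num_procs = omp_get_num_procs();
    #pragma omp parallel
    {
        // Exactly one thread of the team records the team size.
        #pragma omp single
        r.team_size = omp_get_num_threads();
    }
#endif
    assert(r.team_size >= 1);  // a parallel region always has at least one thread
    return r;
}

void print_omp_report(const OmpReport& r) {
    std::printf("OMP_NUM_THREADS=%s\n", r.env_value ? r.env_value : "(unset)");
    std::printf("max threads: %d, visible processors: %d, team size: %d\n",
                r.max_threads, r.num_procs, r.team_size);
}
```

Calling print_omp_report(query_omp_runtime()) from main on both machines, built with the same -openmp option as the real program, would show whether the exported value is visible in the shell that launches the program and whether the runtime sees all 8 processors.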

Kazik
Beginner

I just wrote explicitly:

export OMP_NUM_THREADS=8

in the bash file but unfortunately it didn't help.

--

Regards,

Kazik

Om_S_Intel
Employee

The auto-parallelizer report says that certain portions of your code are parallelized. But those segments may run for only a short time while the rest of the code runs serially.

It would be better to introduce threads using OpenMP pragmas. The load needs to be balanced. You can use Intel Thread Profiler to see the details of what is happening in your application, including load imbalance.
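As an illustration of introducing threads with OpenMP pragmas, here is a minimal sketch of an explicitly parallelized loop. The function sum_of_squares and its chunk parameter are hypothetical and stand in for one of the loops in the real code, which is not shown in this thread. The schedule(dynamic, chunk) clause is one common way to even out the load when iterations take unequal time; schedule(static) is usually enough when they are uniform.

```cpp
#include <cassert>
#include <vector>

// Hypothetical workload: sums the squares of the input values.
// `chunk` is the number of consecutive iterations handed to a thread at a time.
double sum_of_squares(const std::vector<double>& x, int chunk) {
    assert(chunk > 0);  // OpenMP requires a positive chunk size
    double total = 0.0;
    const long n = static_cast<long>(x.size());
    // Each thread accumulates a private partial sum; the reduction clause
    // combines the partial sums when the loop finishes.
    #pragma omp parallel for reduction(+:total) schedule(dynamic, chunk)
    for (long i = 0; i < n; ++i) {
        total += x[i] * x[i];
    }
    return total;
}
```

Without OpenMP enabled at compile time the pragma is ignored and the loop runs serially with the same result, so the same source can be built both ways for comparison.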

Kazik
Beginner

OK, I'll have Thread Checker and VTune installed on Monday. Could you tell me what information from these programs I should provide to help you sort out the issue I've described?

@Om Sachan:

I still don't understand how, assuming your hypothesis that most of my code runs non-parallel is true, I can see all four cores of the four-core machine running at 100% each the whole time. The program takes about 12 hours, and CPU usage stays at 400% throughout. From this perspective, the fact that on the 8-core machine it's 100% of one core and about 1%-2% of each of the remaining 7 cores is a little surprising, isn't it? Please don't get me wrong: I'm not claiming that what you suggest is untrue, I would just like to understand what's going on.

--

Kind regards,

Kazik

aazue
New Contributor I
Hi,
I don't know whether your operating system is Unix or Microsoft Windows.
Go to the link below and build whichever of these files is appropriate:

vc_ficat9.cc.txt (Microsoft type)
gnu_ficat9.cc.txt (Unix Type)

http://software.intel.com/en-us/forums/showthread.php?t=70585&o=d&s=lr

With this sample you can see exactly how the threads run and how the run time can be divided by 2.
Remarks:
The problem is that ICC, VC2010, MinGW and GCC each drive the threads differently and give different results.
For serious work that has to pass engineering quality control, it is much
better to do the threading at a low level. (You can also use TBB, which is a little better.)
I think this product is not perfectly ready.
A good source program must work perfectly on every machine and every operating system
and give the same result.

Also, ICC is very slow (about twice the time) if you must use asynchronous (OpenMP) short-step tasks.
Just run the sample with ICC and compare it with the other compilers and you will see.

Kind regards
Dale_S_Intel
Employee

Kazik, am I correct in understanding that it takes about 12 hours on the four-core system? Does it take considerably longer on the 8-core system? That is, if it's really only running on 1 of the 8 processors, I'd expect it to take more than twice as long. Is that the case?

Offhand I don't know of any reason in the compiler that would prevent parallel operation on the 8-core system. Can you tell us a little more about the two systems? Might there be some environmental differences? Are they essentially identical except for the number of cores?

Thanks!

Dale
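The run-time comparison asked about above can also be made inside the program by comparing CPU time with wall-clock time. The following is a rough sketch under stated assumptions: Utilization and measure_utilization are hypothetical names, and it relies on std::clock reporting CPU time summed over all threads of the process, which is the usual behavior on Linux but not on every platform. Under that assumption the ratio of CPU time to wall time approximates the average number of busy cores, so a value near 4 on the 4-core machine and near 1 on the 8-core machine would match what htop shows.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

// Hypothetical result type for one measured section of code.
struct Utilization {
    double wall_seconds;  // elapsed real time
    double cpu_seconds;   // CPU time consumed by the whole process
    double busy_cores;    // cpu_seconds / wall_seconds, 0 if wall time is 0
};

// Runs `work` once and reports how much CPU time it used per second of wall time.
template <typename Work>
Utilization measure_utilization(Work&& work) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();

    work();

    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();

    Utilization u;
    u.wall_seconds = std::chrono::duration<double>(wall_end - wall_start).count();
    u.cpu_seconds = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    assert(u.wall_seconds >= 0.0);  // steady_clock never runs backwards
    u.busy_cores = (u.wall_seconds > 0.0) ? u.cpu_seconds / u.wall_seconds : 0.0;
    return u;
}
```

Wrapping the main computation in a lambda and passing it to measure_utilization on both machines would give the wall time Dale asks about and an independent estimate of how many cores were doing work.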

Grant_H_Intel
Employee
Kazik,

This problem usually has to do with environment settings or machine configuration. Please set the following environment variables in the shell in which you run the program on the 8-core machine, and post the output that is printed to standard error. Then we can probably tell you what is going on.

KMP_VERSION=true
KMP_SETTINGS=true
KMP_AFFINITY=verbose

Sorry for the late reply.
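One environment-related cause that such output could reveal is a restricted CPU affinity mask, although nothing in this thread confirms that it applies here. As a minimal sketch, the hypothetical function allowed_cpu_count below asks Linux how many CPUs the current process is allowed to run on. It assumes Linux with glibc (sched_getaffinity and the CPU_COUNT macro) and returns -1 on other platforms or on failure.

```cpp
#include <cassert>
#ifdef __linux__
#include <sched.h>
#endif

// Returns the number of CPUs the calling process may be scheduled on,
// or -1 if the platform is not Linux or the query fails.
int allowed_cpu_count() {
#ifdef __linux__
    cpu_set_t mask;
    CPU_ZERO(&mask);
    // pid 0 means "the calling process".
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        return -1;
    }
    const int n = CPU_COUNT(&mask);
    assert(n >= 1);  // a running process is allowed on at least one CPU
    return n;
#else
    return -1;
#endif
}
```

If this returned 1 on the 8-core machine, for example because the job was started under a launcher or batch system that pins it to a single core, all threads would share that core regardless of OMP_NUM_THREADS.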