Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Program speed and CPU usage with Core 2 Duo

marc_ba
Beginner
Hello everyone,

I have used OpenMP to speed up a program (video encoding) on a Core 2 Duo. This program has two execution behaviors that result in paradoxical measurements.

- Full speed encoding. The encoder encodes as many frames as possible in a given time period. Gains are very interesting using 2 cores (roughly 80% speedup), and I can see the CPU load rising to 100% during the whole encoding process (whereas it previously reached 50% when single-threaded). I was very happy with that :)
- Normal speed encoding. Another thread gives the encoder a frame to encode every 40 ms, resulting in 25 Hz video. What is really strange is that the single-threaded app took about 15% of total CPU (so 30% on the first core and roughly 0% on the other), while the multi-threaded (OpenMP) version takes about... 70% of the total CPU! I can see in the process explorer that most of that 70% is kernel time...

What is very paradoxical is that the multi-threaded version does go faster than the single-threaded one (as I said before, about an 80% improvement), but Windows is telling me that it takes 4-5 times more CPU on a Core 2 Duo!

Everything is normal on a multi-processor (not multi-core) architecture: in that case, the program takes the same amount of CPU in its single-threaded and multi-threaded versions (using normal speed encoding), and we also get roughly an 80% gain in full speed encoding. That's what I expected.

I tried to figure out what was going wrong using Intel Thread Profiler, but it tells me that my application is wonderfully parallelized and that I only spend 2% of the time in thread synchronization while 98% is usefully spent encoding (which is a very good ratio).

So my questions are:
1. Did you understand my problem? :)
2. Is the Windows CPU usage measurement reliable on a Core 2 Duo?
3. If the answer to 2 is yes, can someone explain to me how a program can take more CPU and go faster?! If the answer to 2 is no, does anyone know a reliable way to measure CPU usage on a Core 2 Duo?
4. Why is there such a difference between a Core 2 Duo and a multi-processor system? I know the architecture is not the same at all (cache sharing, etc.), but...

Any help appreciated ;)

Marc
7 Replies
jimdempseyatthecove
Honored Contributor III

Marc,

1) Yes I understand your problem - been there.

2) Windows only indicates the average CPU time used by your application, _including_ any overhead. A better judge of performance is to time the runtime of your application. If you do see too much overhead (you did), then you need to investigate your application. Use VTune.

3) It can run faster while using proportionally more CPU time because you had CPU time to spare to begin with (i.e., 15% of 2 CPUs, or 30% of 1). If your application were using 100% of one CPU, and you had the same overhead issue, then you might have seen a slowdown.

It looks like you may need to experiment with altering the OpenMP environment variables:

KMP_BLOCKTIME=nnn (milliseconds)
KMP_LIBRARY=throughput or turnaround

You will have a trade-off between how fast your application is versus how much time is available to run other programs on your system.
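
For example, the same settings can also be made programmatically, assuming the Intel OpenMP runtime (the kmp_* extensions are declared in Intel's omp.h; the value 200 below is just an illustration):

#include <omp.h>   // Intel's omp.h also declares the kmp_* extensions

void ConfigureOpenMP()
{
    // Programmatic equivalent of KMP_BLOCKTIME=nnn: how long (in ms) a
    // thread spin-waits after a parallel region before going to sleep.
    kmp_set_blocktime(200);          // 200 is just an example value

    // Programmatic equivalent of KMP_LIBRARY=turnaround
    kmp_set_library_turnaround();
}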

The other things to look at:

a) Are you trying to parallelize work units that are too small?
b) Can your application be parallelized from the outer levels inward, as opposed to from the inner levels outward?

By b) I mean...

If your application has multiple objects, and each object at some point has an inner loop that consumes most of the time, then there are two ways to parallelize (this portion of) the application: a) parallelize the inner loop, or b) parallelize the outer object-by-object loop.

If you have more objects than CPUs then b) might work better than a).
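
In C++ terms the two options look something like this (the counts and the processItem() function are made up for illustration):

extern void processItem(int obj, int i);   // illustrative work function
const int numObjects = 8, itemsPerObject = 10000;

void EncodeAll()
{
    // a) parallelize the inner loop: one fork/join per object,
    //    smaller work units, more scheduling overhead
    for (int obj = 0; obj < numObjects; ++obj) {
        #pragma omp parallel for
        for (int i = 0; i < itemsPerObject; ++i)
            processItem(obj, i);
    }

    // b) parallelize the outer object-by-object loop: one fork/join
    //    total, each thread works on whole objects
    #pragma omp parallel for
    for (int obj = 0; obj < numObjects; ++obj) {
        for (int i = 0; i < itemsPerObject; ++i)
            processItem(obj, i);
    }
}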

Jim Dempsey


marc_ba
Beginner
Jim,

Thanks a lot for your answer. The problem is not easy to describe and - I guess - not easy to understand or to answer :)

I used VTune to analyze what was going on. On the Core 2 Duo, the program spends most of its execution time in 4 functions: __kmp_x86_pause, __kmp_check_stack_overlap, __kmp_wait_sleep, and __kmp_yield. It should be noted that on a multi-processor system, these functions do not appear in the hotspots.

From what I understand, these 4 functions are present because the program suffers from very bad thread synchronization. What I really fail to understand is why these functions do not appear while executing the program in full speed mode?!

Maybe another important thing: I force the thread count by calling omp_set_num_threads() because I have to control the threading behavior (because of data dependencies in my parallelized loop).

I tried to play with KMP_BLOCKTIME and KMP_LIBRARY but I couldn't get any good results (by "good", I mean "different from what I previously had" ;)).

1. Do you know where these methods (__kmp_xxx) come from?
2. Do you know why they do not appear on a multi-processor system (while they do appear on multi-core systems)?

Thanks a lot,

Marc
jimdempseyatthecove
Honored Contributor III

The four functions you listed will get called (depending on the version of IVF) when:

a) Entering !$OMP CRITICAL sections
b) On older versions, there was contention for !$OMP ATOMIC (as well as REDUCTION)
c) Waiting at !$OMP BARRIER
d) Waiting on !$OMP ORDERED
e) Depending on the settings of KMP_BLOCKTIME and KMP_LIBRARY, these functions are called between the exit of one parallel construct and the entry to the next parallel construct.

Unfortunately, VTune will not show the address of the caller of these subroutines. This would be a nice feature for them to add some day (psst, Intel, are you reading this?).

If you are using more than one !$OMP CRITICAL section then name the sections. For mutually exclusive critical sections use the same name. For independent critical sections choose different names. If you use no name then all critical sections are mutually exclusive.
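
In C++ the same rule applies; the name goes in parentheses (the function names below are illustrative):

void UpdateStatistics();   // illustrative
void WriteFrame();         // illustrative

void Worker()
{
    #pragma omp critical(stats)   // excludes only other critical(stats) sections
    UpdateStatistics();

    #pragma omp critical          // unnamed: mutually exclusive with all other unnamed sections
    WriteFrame();
}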

If you can isolate where the problem is and fix it using different (OpenMP) programming techniques then do so.

On Windows you can use QueryPerformanceCounter to get a high precision runtime counter. You can insert some conditional compiled code to time the various sections and find the bottlenecks.

module YourModule
    ...
#ifdef _TimeOpenMP
    ! define a helper union
    type T_LARGE_INTEGER_OVERLAY
        union
            map
                type(T_LARGE_INTEGER) :: li
            end map
            map
                integer(8) :: i8
            end map
        end union
    end type T_LARGE_INTEGER_OVERLAY
    ! declare list of counters for your application
    type(T_LARGE_INTEGER_OVERLAY) :: SubOneCount
    type(T_LARGE_INTEGER_OVERLAY) :: SubTwoCount
    ...
#endif
    ...
end module YourModule
subroutine YourInit
    ...
#ifdef _TimeOpenMP
    ! clear performance counters
    SubOneCount.i8 = 0
    SubTwoCount.i8 = 0
    ...
#endif
    ...
end subroutine YourInit
subroutine YourExit
    ...
#ifdef _TimeOpenMP
    ! report performance counters
    write(*,*) 'SubOneCount', SubOneCount.i8
    write(*,*) 'SubTwoCount', SubTwoCount.i8
    ...
#endif
    ...
end subroutine YourExit
...
subroutine YourSubOne
    ...
#ifdef _TimeOpenMP
    type(T_LARGE_INTEGER_OVERLAY) :: PerformanceCountStart
    type(T_LARGE_INTEGER_OVERLAY) :: PerformanceCountEnd
    type(T_LARGE_INTEGER_OVERLAY) :: PerformanceCountElapsed
#endif
    ...
#ifdef _TimeOpenMP
    ! Time how long it takes to enter the following critical section
    ! Get test start time
    if (QueryPerformanceCounter(PerformanceCountStart.li) .eq. FALSE) STOP 'QueryPerformanceCounter'
#endif
!$OMP CRITICAL
#ifdef _TimeOpenMP
    ! insert this immediately following the entry of the critical section
    ! Get test end time
    if (QueryPerformanceCounter(PerformanceCountEnd.li) .eq. FALSE) STOP 'QueryPerformanceCounter'
    ! compute elapsed time
    PerformanceCountElapsed.i8 = PerformanceCountEnd.i8 - PerformanceCountStart.i8
    ! update max count if required
    SubOneCount.i8 = max(SubOneCount.i8, PerformanceCountElapsed.i8)
#endif
    ...
!$OMP END CRITICAL
    ...
end subroutine YourSubOne

Compile with the preprocessor with _TimeOpenMP defined (or not defined).
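
Since your encoder is C++, a minimal analogue of the above (a sketch, assuming Windows and the same _TimeOpenMP define; the SubOneMaxTicks name is made up) would be:

#ifdef _TimeOpenMP
#include <windows.h>
static LONGLONG SubOneMaxTicks = 0;   // worst wait observed entering the section
#endif

void YourSubOne()
{
#ifdef _TimeOpenMP
    LARGE_INTEGER start, end;
    QueryPerformanceCounter(&start);  // time how long it takes to enter
#endif
    #pragma omp critical
    {
#ifdef _TimeOpenMP
        QueryPerformanceCounter(&end);
        LONGLONG elapsed = end.QuadPart - start.QuadPart;
        if (elapsed > SubOneMaxTicks)
            SubOneMaxTicks = elapsed; // keep the maximum, as above
#endif
        // ... work inside the critical section ...
    }
}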

On a Windows-based system, consider using some of the Windows synchronization functions:

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dllproc/base/setcriticalsectionspincount.asp

If you get frustrated then you may need to seek outside professional help.

Jim Dempsey

marc_ba
Beginner
[IMPORTANT EDIT]
I found something interesting. I tried calling kmp_set_blocktime(0) to see what's going on... The fact is that it solves my problem, while lowering performance by a small (and acceptable) 5% at full speed.

Now, do you know what calling kmp_set_blocktime(0) means?
[/IMPORTANT EDIT]

Thanks again, Jim. It's a pleasure talking with you!

My program has only one parallel section, which consists of a very simple for loop:

#pragma omp parallel for private (mby)
for (int i = 1; i < N; i++)
{
    // Some heavy work that can be parallelized
}

FinalizeData(); // Some work that can't be parallelized

As you can see, nothing very difficult and nowhere to dig to discover the problem...

I'll try to see with Intel Support, thanks for your precious help.

jimdempseyatthecove
Honored Contributor III

The block time is the time delay spent in a test/compute loop after the exit of a parallel construct.

I imagine you are asking yourself, "Why would a sane programmer want to do this?"

Consider

#pragma omp parallel for private (mby)
for (int i = 1; i < N; i++)
{
    // Some heavy work that can be parallelized
}
#pragma omp parallel for private (mby)
for (int i = 1; i < N; i++)
{
    // Some different heavy work that can be parallelized
}

With block time set to 0, all the threads except the master thread will be suspended at the exit of the first loop, then immediately restarted at the beginning of the second loop. Thread suspension/resumption is costly. If your program runs the above once (or a few times), the wait time is insignificant.

With block time set to a large number, as a thread exits the first loop it burns CPU time waiting for the other threads in the team to complete the first loop. Then all the threads can enter the second loop running. This eliminates thread suspend/resume calls to the O/S.
Consider

for (int iteration = 1; iteration <= iterations; ++iteration) {
    #pragma omp parallel for private (mby)
    for (int i = 1; i < N; i++)
    {
        // Some heavy work that can be parallelized
    }
    #pragma omp parallel for private (mby)
    for (int i = 1; i < N; i++)
    {
        // Some different heavy work that can be parallelized
    }
}


When iterations is large, you can eliminate significant thread stop/start time by setting the block time to an appropriate value. The value you choose depends on the requirements at hand: your speed vs. time for other applications.

Also, as per my prior copious response, the CPU time gauge and chart of the Task Manager is not a complete assessment of the efficiency of your programming. Wall clock time is a better measure for your application. Reduced CPU time is a better measure of time available to other applications while your application takes more wall clock time to run.
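
As a sketch of measuring both side by side on Windows (runEncoder() is a stand-in for whatever you want to measure):

#include <windows.h>
#include <cstdio>

static void runEncoder() { /* ... the work being measured ... */ }

int main()
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);

    runEncoder();                      // stand-in for the work to measure

    QueryPerformanceCounter(&t1);

    FILETIME created, exited, kernel, user;
    GetProcessTimes(GetCurrentProcess(), &created, &exited, &kernel, &user);

    ULARGE_INTEGER k, u;               // widen FILETIME to 64-bit values
    k.LowPart = kernel.dwLowDateTime;  k.HighPart = kernel.dwHighDateTime;
    u.LowPart = user.dwLowDateTime;    u.HighPart = user.dwHighDateTime;

    printf("wall  : %.3f s\n", double(t1.QuadPart - t0.QuadPart) / freq.QuadPart);
    printf("user  : %.3f s\n", u.QuadPart / 1e7);   // FILETIME = 100 ns units
    printf("kernel: %.3f s\n", k.QuadPart / 1e7);
    return 0;
}
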
Jim Dempsey
marc_ba
Beginner

Jim,

My program has one (and only one) OpenMP for loop per image, with a fairly small iteration count (usually 2). I suppose that in that case, setting the block time to 0 is appropriate (as threads don't have to be resumed before the next picture is given to the encoder, i.e., about 40 ms later). In pseudo-code, that would look like:

do {
    pFrame = GetFrameFromWebcam();
    encodeFrame(pFrame);   // <= OpenMP code inside that function
    Sleep(40);
} while (1);

I don't rely on the CPU time gauge to evaluate program performance, but I really don't like seeing 65% CPU on a threaded program and 2% with its single-threaded version! And neither would our customers, who would be scared to see such a huge CPU consumption!

FYI, I tried to compile the same code with Visual .NET 2005, and the behavior is not the same:
- kmp_set_blocktime() is unsupported
- I can't reproduce the problem we're dealing with! The program's behavior looks much more "normal" to me... I don't know whether Microsoft's implementation is better, but the fact is that in my case, it solved the problem...

Kind regards

jimdempseyatthecove
Honored Contributor III

Marc,

If you only have 2 sections then consider using


call OMP_SET_NUM_THREADS(2)
c$OMP PARALLEL SECTIONS
c$OMP SECTION
call YourSectionSub(1)
c$OMP SECTION
call YourSectionSub(2)
c$OMP END PARALLEL SECTIONS

In your case, use the C++ #pragmas for this:
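
Something along these lines (a sketch mirroring the Fortran above):

#include <omp.h>

void YourSectionSub(int which);   // from the Fortran sketch above

void EncodeBothSections()
{
    omp_set_num_threads(2);
    #pragma omp parallel sections
    {
        #pragma omp section
        YourSectionSub(1);

        #pragma omp section
        YourSectionSub(2);
    }
}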

Jim
