Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

How do I profile a short period

mattsdad
Beginner
507 Views

I want to use the Thread Profiler to determine why a simulation runs 2000+ simulation seconds in 40 wall clock seconds then takes 60 wall clock seconds to do the next 30 simulation seconds.

How do I start the Profiler action at the transition? Then how do I end the Profiler action at the next transition?

PS. When I turn on the Call Graph data collectionmy simulation slows down by a factor of about 100X. Is that normal?

0 Kudos
11 Replies
jimdempseyatthecove
Honored Contributor III
507 Views

This looks like a memory leak. After 2000 iterations you consume all of resident RAM then start to swap virtual memory. Run the Task manager and watch your Page File usage.

And

Does your simulation use feed-back to govern the integration step size?

Depending on state of simulation are computational intensive features enabled?

I am sorry that I cannot answer your Thread Profiler (as I use AMD tools).

Jim Dempsey

0 Kudos
mattsdad
Beginner
507 Views

This is definately NOT due to page swapping. I am on a new machine with 4GB of ram. The extraneous processes are using 462MB of ram. The task manager performance monitor indicates that I am using520MB of ram throughout the whole execution of the simulation, a constant 72MB out of 4GB. No page swapping is occuring.

The task manager performance monitor also indicates that before the point of interest in the simulation,it isat 100% CPU usage on all four processors. After the point of interest, it drops to 25% spread over all four processors.

The issue is not page swapping.

0 Kudos
mattsdad
Beginner
507 Views

Does your simulation use feed-back to govern the integration step size?

No

Depending on state of simulation are computational intensive features enabled?

No. The whole simulation is computationally intensive.

I am running a multi-body simulation with the following code in the main routine. (Subroutine propagate, subroutine check_for_event_transition and all the subroutines that they call are pure subroutines.)

DO

WHILE(check_for_continuation (body, number_of_bodies, MAX_BODIES))

!$OMP PARALLEL DO

DO B = 1,number_of_bodies

DO WHILE (body(B)%exists .AND. .NOT. body(B)%events%event_pending &

.AND. body(B)%state_time_f < next_time_io_f)

CALL propagate (body(B))

CALL check_for_event_transition (body(B))

END DO

END DO

!$OMP END PARALLEL DO

CALL do_all_pending_event_transitions (body, number_of_bodies, MAX_BODIES)

CALL write_output (body, number_of_bodies, MAX_BODIES)

END DO

The time line of the simulation follows. (Where "start #" means that a new body is created and "end #" means the body ceases to exist.)

l2i = start 1 0

s1 = start 2 110

s2 = start 3 212.055

b1 = start 4 212.16

b2 = start 5 212.16

b3 = start 6 212.16

b4 = start 7 212.16

b5 = start 8 212.16

b6 = start 9 212.16

b7 = start 10 212.16

b8 = start 11 212.16

s1 = end 2 360.871

pbv = start 12 1162.07

s2 = end 3 2129.392

b3 = end 6 2149.685

b4 = end 7 2149.685

b2 = end 5 2149.7

b5 = end 8 2149.7

b1 = end 4 2149.72

b6 = end 9 2149.72

b7 = end 10 2149.73

b8 = end 11 2149.73

pbv = end 12 2155.234

l2i = end 1 2183.802

The multi-threaded version runs as expected until time 2149 seconds. All the processors are running consistently at or near 100%. At 2149 seconds in the simulation, the processor use drops to about 25% which is spread over all four processors but mostly on two. The first 2149 simulation seconds take about 30 seconds of CPU time. The last 34 simulation seconds take about 60 seconds of CPU time. Most notably there appears to be a six seconds between the end of body 6 and the end of body 7 which act symmetrically and end simultaneously. They end at the time when the threads stop acting in parallel. There is approximately six seconds between each of the body endings after that point eventhough the simulation time is not the same.

What should I look for in the thread profile to determine what is causing the failure to continue to use all four processors?

The best I can tell from the profile results is that all threads are spending most of their time (in the last 60 seconds of simulation) in a routine called NtWaitForSingleObject.

There are five tasks that appear in the profile, thread_0, thread_2, thread_3, thread_4, and thread_5, but there are only four processors. Threads 3, 4 and 5 each take about 8M mcs. Thread_0 takes about 238M mcs. These four threads all call into the simulation code. Thread_2 only calls WaitForSingleObject which only calls NtWaitForSingleObject. Thread_2 uses 1,499M mcs. That is more than6X as long as Thread_0 and almost 200X as long as the other threads. It never calls any of my code.

Is this normal?

Is there usually an extra thread created by OMP that takes up 85% of the CPU without calling any of the code?

Is there a setting that I can specify that prevents this?

0 Kudos
jimdempseyatthecove
Honored Contributor III
507 Views

Try using

!$OMP PARALLEL DO SCHEDULE(STATIC, 1)

STATIC Divides iterations into contiguous pieces by dividing the number of iterations by the number of threads in the team. Each piece is then dispatched to a thread before loop execution begins.

If

chunk is specified, iterations are divided into pieces of a size specified by chunk. The pieces are statically dispatched to threads in the team in a round-robin fashion in the order of the thread number.

I think you have 2 problems. The above may fix one of the problems.

If the computational work perbody is dissimilar then you do not want to parcel out (# bodies)/(# threads) bodies per thread. What you want then is for each thread to work on the next body. Example, assume you have 4 processors and 4 threads and 400 bodies. The default schedule would result in

bodies 1:100 being given the thread 1,
bodies 101:200 to thread 2,
bodies 201:300 to thread 3
bodies 301:400 to thread 4

The above distribution is preferred when all bodies require equal amount of processing time. Ask youself what happens when bodies 1:350 have virtualy no work to be done, while bodies 351:400 have much work to be done? You guessed it - thread 4 does all the work.

When the work is dissimilar per body, you want your distribution of work to threads to take this into consideration. However, depending on the circumstances chunk size of 1 might not be the best choice.

The second problem is not a deadlock per-se as the program eventually completes. Your problem has all the indicators of a contention problem for a resource. This might be a little tricker to find. Try inserting break points were the wait for single object fails. You can adjust the spinlock count such that you hit the break only after the problem starts to occure.

Jim Dempsey

0 Kudos
allyn_shell
Beginner
507 Views

What does spinlock mean?

0 Kudos
mattsdad
Beginner
507 Views

The SCHEDULE clause did not appear to do anything significant. The simulation slowed down at the same point.

I don't understand what I should be looking for when I get to the breakpoint (in the second recommendation). I do not have access to the WaitForSingleObject source code (only the disassembly).

0 Kudos
jimdempseyatthecove
Honored Contributor III
507 Views

You can search Microsoft's MSDN using Google. Description below

Since your runtime difference in the last section is more than number of threads time difference I would have to guess that one thread is holding the resource (Object of the WaitForSingleObject) and doing nothing (perhaps sleep) while the thread performing the 100% of one CPU is working on trying to obtaing the same object but without sleeping.

If you post your messaging routines something may stand out.

Jim Dempsey

WaitForSingleObject

The WaitForSingleObject function returns when the specified object is in the signaled state or the time-out interval elapses.

To enter an alertable wait state, use the WaitForSingleObjectEx function. To wait for multiple objects, use the WaitForMultipleObjects.

DWORD WaitForSingleObject(
  HANDLE hHandle,
  DWORD dwMilliseconds
);

Parameters

hHandle [in] Handle to the object. For a list of the object types whose handles can be specified, see the following Remarks section.

If this handle is closed while the wait is still pending, the function's behavior is undefined.

The handle must have the SYNCHRONIZE access right. For more information, see Standard Access Rights.

dwMilliseconds [in] Time-out interval, in milliseconds. The function returns if the interval elapses, even if the object's state is nonsignaled. If dwMilliseconds is zero, the function tests the object's state and returns immediately. If dwMilliseconds is INFINITE, the function's time-out interval never elapses.
0 Kudos
jimdempseyatthecove
Honored Contributor III
507 Views

I forgot to add...

Is there something special about Body 1 to account for why it is the last one to end?

0 Kudos
mattsdad
Beginner
507 Views

JimDempseyAtTheCove:
Is there something special about Body 1 to account for why it is the last one to end?

Yes, in most of our models, body 1 is the most enegertic body. All the other bodies are originally parts of body 1 that are divested along the way and usually are of lower energy.

Is that of significance for this issue?

0 Kudos
mattsdad
Beginner
507 Views

JimDempseyAtTheCove:
Since your runtime difference in the last section is more than number of threads time difference I would have to guess that one thread is holding the resource (Object of the WaitForSingleObject) and doing nothing (perhaps sleep) while the thread performing the 100% of one CPU is working on trying to obtaing the same object but without sleeping.

Is it possible the the unreported thread, thread_1, is involved?

JimDempseyAtTheCove:
If you post your messaging routines something may stand out.

I did not write any messaging routines. The only parallel code I added was the OMP PARALLEL DO statement and the corresponding OMP END PARALLEL DO statement. How would I go about getting the at the message passinginformation?

0 Kudos
jimdempseyatthecove
Honored Contributor III
507 Views

I would guess that you wrote or have access to the source of

subroutine propagate (body(B))
and
subroutine check_for_event_transition (body(B))

My guess is the WHILE loop calling those subroutines is not exiting by the threads NOT processing the last body (body 1). And that their checking for event transition is causing delays in execution of body 1.

Alternately body 1 computation time (as written) requires as much processing time as observed. In this case, at the end, there is nothing for the other threads to do. This is why you see only one processor active. What you need to examine is if you are using a critical section or mutex or other sychronizing object in a manner such that the 3 threads that have completed are not in a tight loop issuing "Are we done yet?", "Are we done yet?", "Are we done yet?", "Are we done yet?", "Are we done yet?", ... and which is interfereing with the execution of the last body (body 1).

I suggest placing a test and break point when you return the prior to last body (body 12?). Then see what the threads are doing. Note, you can make use of the Debug | Windows | Threads and then Freeze all threads except for the one you wish to examine. Rotate your examination, check each call stack. If you are in a system call, the call stack might be meaningless but you might be able to step out until you get back to Fortran code.

The additional thread is a monitoring thread. It should not interfere with the execution of your code.

Jim Dempsey

0 Kudos
Reply