Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Reserving a core for a thread

dondilworth
New Contributor II
4,384 Views
My multi-threaded program works, but I find that running with several cores actually takes longer than does a single-thread run. Here's what is going on:

1. Several threads are started. They each go into a DO WHILE loop, waiting for a flag in an indexed variable in a named COMMON block.

2. When the flag is set, the thread starts some serious number crunching, using data taken from another indexed set of variables. When the routine finishes, it sets another flag, also in an indexed array.

3. The calling program waits until all threads have finished, checking the flags in another DO WHILE loop, and then reads out the answers from the indexed variables. Then it starts over with another set of input data, and so on.

This should run faster with several threads running in parallel, but it doesn't. Possible reasons:

1. One of the threads is in a core being used by Windows for something else. It cannot complete its task until Windows lets it run, and the calling program has to wait for that.

2. The several threads do not in fact run in parallel. If they go individually, that would also be very slow.

Is there any way to ensure that my threads get their own cores, and they all run in parallel?
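For what it's worth, OpenMP itself has an answer to the pinning question: thread affinity is requested through environment variables set before the program starts, not through code. A minimal sketch, assuming an OpenMP 4.0 compiler (`OMP_PLACES`/`OMP_PROC_BIND` are standard 4.0; `KMP_AFFINITY` is an Intel-specific extension). Note that pinning only keeps your threads from migrating between cores; it does not stop Windows from scheduling other work on those cores.

```fortran
! Environment variables set before launch (Windows syntax):
!     set OMP_PLACES=cores       (one "place" per physical core)
!     set OMP_PROC_BIND=close    (bind threads to adjacent places)
!     set KMP_AFFINITY=verbose   (Intel extension: report the chosen binding)
program check_threads
    use omp_lib
    implicit none
    !$omp parallel
    print '(a,i0,a,i0,a)', 'thread ', omp_get_thread_num(), &
        ' of ', omp_get_num_threads(), ' is running'
    !$omp end parallel
end program check_threads
```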
43 Replies
dondilworth
New Contributor II
The word "broken" scares me.

I know about the overhead issue and the importance of putting time-consuming code in the parallel part. In my case that code takes nearly all of the time, depending on user input, so there is potentially a lot to gain. I time my runs by calling a routine that reads the current system clock time. Then, when the test is done, it calls it again and subtracts the two. So I know pretty well how much time is required if I use serial mode or the multithread option. My whole project amounts to over 600 MB of code. I can of course zip it and make it available, but first I want to be sure that I have not made a dumb mistake.

The subroutine that is called in the DO PARALLEL section uses only labeled common, indexed to the thread, and local variables. There are other common blocks, but the parallel routines do not write into them. Every routine that it calls, and so on, works the same way. So there are no race conditions and no routine should have to wait for any other. The order in which the threads are done makes no difference whatever. This is about as clean a piece of code (in that section) as you could want.

I'm bewildered by your comments about the USE statements. I never heard of OpenMP (or tried multicore programming) before about two weeks ago. So I'm bewildered by lots of things at this point. But I'm making serious progress: Just getting my debug version running correctly and multithreading correctly means I am very close. I tried taking those USE statements out, and the debug version ran about 10 times slower. So back in they went.

I shall put back the warn directives, however. You are correct about that. But I have a huge program and the warnings cover a great many pages. None of them are serious, mostly about jumping into and out of loops, so I thought to get rid of them. This is legacy spaghetti code like you never heard of, originating back when computers ran on vacuum tubes. A complete rewrite according to modern standards is out of the question.

Tell me, if I use the OpenMP parallel features, do I need to enable the recursive and reentrant options? I did that with the debug version, and execution was serial (judging by the timing), even though threads were created (according to the Threads pane in the debugger). When I disabled those options I got the speed improvement I was looking for. I have seen no documentation that addresses this issue. A newbie needs that kind of documentation.

Here are the data:

Debug, serial mode, test case: 1.26 seconds. Multicore with a do parallel of only one trip, 3.42 seconds. With 8 trips (for eight cores), 0.88 seconds. Nice.

Release mode: test case, serial mode, 0.28 seconds. Multithread, one trip, 0.48 seconds. Eight trips, 0.45 seconds. Not nice.

Why would the program work debug and not release? Here's my loop:

      IF (INDEX .EQ. NTHREADS) THEN    ! ALL LOADED; TRACE RAYS NOW
!$OMP PARALLEL SHARED( /MTRAYS/,/MTCOM/,/MTPTRACE/,/RAY/ ) IF (TEST)
!$OMP DO
         DO I = 1,INDEX
            CALL MTRAYTRACE(I)
         ENDDO
!$OMP END DO
!$OMP END PARALLEL
         GO TO 8801
      ENDIF

jimdempseyatthecove
Honored Contributor III
!$OMP PARALLEL SHARED( /MTRAYS/,/MTCOM/,/MTPTRACE/,/RAY/ ) IF (TEST)
!$OMP DO SCHEDULE(STATIC,1)
      DO I = 1,INDEX
         CALL MTRAYTRACE(I)
      ENDDO
!$OMP END DO
!$OMP END PARALLEL

The above schedule(static,1) means each thread takes one iteration at a time.
Without schedule(static,1), each thread takes adjacent chunks of the iteration space. As to what the chunk size is, this depends on factors external to the !$OMP DO... and may include environment variables as well as whether you specify a default scheduling mode (static, dynamic, ...).

When the iteration count is large and the computation time per iteration is relatively small, you would want each thread to take larger counts of iterations per loop trip.
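The difference between the two partitions can be seen with a tiny standalone sketch (the iteration count of 16 is just for illustration):

```fortran
! Sketch: print which thread runs which iteration under two schedules.
program schedule_demo
    use omp_lib
    implicit none
    integer :: i

    ! Default static schedule: contiguous chunks, roughly N/nThreads each.
    !$omp parallel do schedule(static)
    do i = 1, 16
        print '(a,i0,a,i0)', 'static:   i=', i, ' on thread ', omp_get_thread_num()
    end do
    !$omp end parallel do

    ! schedule(static,1): round-robin, thread k gets i = k+1, k+1+nThreads, ...
    !$omp parallel do schedule(static,1)
    do i = 1, 16
        print '(a,i0,a,i0)', 'static,1: i=', i, ' on thread ', omp_get_thread_num()
    end do
    !$omp end parallel do
end program schedule_demo
```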

Jim Dempsey
dondilworth
New Contributor II
Jim:

I tried the schedule(static,1), and it made no difference. The debug version is still faster in multithread mode, and the release version is still slower. Have you any other suggestions?
dondilworth
New Contributor II
Let me supply some more details. In my test case, I am tracing almost 30,000 rays through a lens of 19 elements. The parallel loop farms the job out to a raytrace program, eight rays at a time. When they are all finished, it loads the next set and starts the DO loop all over again with another eight. So there are thousands of trips through that section of code.

I don't know if that logic sits well with the mechanism that creates threads. I gather from an earlier post that once the threads are started they stay defined and just wait for another trip through the parallel loop. If that is the case, then there should not be a whole lot of overhead with all of these trips.

Correct me if I'm wrong, please. Is that a valuable clue?
IanH
Honored Contributor II
Quoting dondilworth
Let me supply some more details. In my test case, I am tracing almost 30,000 rays through a lens of 19 elements. The parallel loop farms the job out to a raytrace program, eight rays at a time. When they are all finished, it loads the next set and starts the DO loop all over again with another eight. So there are thousands of trips through that section of code.

This reads as if each thread is starting a separate program (a different exe?) - is that the case?
dondilworth
New Contributor II
Not at all. The parallel DO calls a Fortran subroutine, and I want eight of those calls to execute in parallel. When they all finish, I collect the answers, prepare input for another eight, and come down to the DO statement again. Then I want the next eight to run in parallel again, and so on. That subroutine gets called almost 30,000 times. It's always the same routine, in the same exe.
jimdempseyatthecove
Honored Contributor III
How much code is executed by:

CALL MTRAYTRACE(I)

Note, I mean by the subroutine itself, not the loop including the call.

Note, from my understanding you have

repeat ~30,000/nThreads times
    read nThreads rays
    parallel DO, nThreads ways
        CALL MTRAYTRACE(I)   ! once per thread
    end parallel DO
end repeat

If MTRAYTRACE is a relatively small work load, then the bulk of the time is waiting for the read-in.

Try:

real(8) :: t0, t1, readTime, mtrRayTraceTime(0:127)   ! thread numbers start at 0
...

readTime = 0.0
mtrRayTraceTime = 0.0
repeat ~30,000/nThreads times
    t0 = omp_get_wtime()
    read nThreads rays
    readTime = readTime + omp_get_wtime() - t0
    ! private(t1)
    parallel DO, nThreads ways
        t1 = omp_get_wtime()
        CALL MTRAYTRACE(I)   ! once per thread
        mtrRayTraceTime(omp_get_thread_num()) = &
            mtrRayTraceTime(omp_get_thread_num()) + omp_get_wtime() - t1
    end parallel DO
end repeat

write(*,*) "readTime ", readTime
do i = 0, omp_get_num_threads()-1
    write(*,*) i, mtrRayTraceTime(i)
end do

Jim Dempsey
IanH
Honored Contributor II
Hang on - 30000 iterations in 0.28 seconds? So each unit of work takes 9 microseconds? That's tiny.

The multithreading overhead would be swamping that.

You need to batch the job better. Try having each thread process something like 1000 rays (the more the better) before you collate results and restart the loop.
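A sketch of that batching, amortizing the parallel-region cost over a large block of rays. The batch size, loop bounds, and helper names (`NRAYS`, `load_rays`, `trace_one_ray`, `collect_results`) are hypothetical stand-ins for the original program, not its actual routines:

```fortran
! Sketch: process rays in large batches so each parallel region does
! enough work to swamp its synchronization overhead.
integer, parameter :: RAYS_PER_BATCH = 1000
integer :: first, last, i

do first = 1, NRAYS, RAYS_PER_BATCH
    last = min(first + RAYS_PER_BATCH - 1, NRAYS)
    call load_rays(first, last)            ! serial: stage input for the batch
    !$omp parallel do schedule(static)
    do i = first, last
        call trace_one_ray(i)              ! independent, ~9 microseconds each
    end do
    !$omp end parallel do
    call collect_results(first, last)      ! serial: read out the answers
end do
```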
dondilworth
New Contributor II
These are all cogent comments, but consider this: If I run a simpler problem, with the same number of rays but a smaller lens, then the time for overhead is exactly the same but the time in the raytrace is much shorter. Then the execution is indeed much faster. That tells me that, for a complicated problem, the raytrace is indeed taking most of the time, as it should for parallel processing to buy anything.

The fact that the debug version runs faster in multithread mode than in serial also tells me something. That version works. Why would the release version not? That's the whole issue, and so far none of the responders have addressed it.
jimdempseyatthecove
Honored Contributor III
What we are suggesting you do is (outline)

{!$omp parallel}        ! reduce the number of these regions
    {!$omp do}          ! reduce the number of these loops
        doWork          ! increase the ratio of doWork : !$omp
    {!$omp end do}
{!$omp end parallel}


Presumably all your input data, regardless of lens size, is present. You stated you read this from a file. Therefore, assume you were reading in 1 ray per thread and had 8 threads (for the sake of argument). We are suggesting you:

100x { read in 1 ray per thread for 8 threads }
!$omp parallel do schedule(static,1)
do i = 1, totalNumberOfRaysRead
    doWork(i)
end do
!$omp end parallel do

or


100x { read in 1 ray per thread for 8 threads }
!$omp parallel private(i, j)
nThreads = omp_get_num_threads()
iThread = omp_get_thread_num()
do i = 1, 100
    j = (i-1)*nThreads + iThread + 1   ! each thread works its own stripe of rays
    doWork(j)
end do
!$omp end parallel

or

double buffer your input (overlap reading with processing)

>>The fact that the debug version runs faster in multithread mode than in serial also tells me something.

Debug mode increases the processing time of your doWork(), and therefore increases the ratio of doWork : !$omp.

Jim Dempsey
dondilworth
New Contributor II
This is very interesting. You suggest that instead of doing a loop of eight passes, one ray each, all in parallel, I change the bookkeeping so I do a loop of 800 passes, executed 1/100 as many times as now. But since I have only eight cores, each thread would then have to trace 100 rays instead of just one. So each thread would be running its 100 in serial, not parallel.

Yes, this could potentially work, but it implies that the overhead of starting a parallel DO is huge and you want to do as much as you can while you are in there. That's the interesting part.

I'll give it a try, and let you know what I find.

dondilworth
New Contributor II
Here's what I got when I increased the DO loop count from 8 to 100, in the release version:

Serial mode only: 0.246 seconds
Parallel mode: 0.57 seconds

So the release version still takes longer if I run in multithread mode. In fact, those numbers are very similar to what happens with a loop count of 8. I don't see much effect from increasing it.

Even worse, the debug version now takes longer in parallel mode too:

Serial: 2.6 seconds
Parallel: 2.9 seconds

Recall that, with a loop count of 8, the numbers were 1.26 and 0.88. I don't know why the serial mode also takes longer; maybe because of larger array sizes and more memory faults. I see lots of disk activity when it runs.

If I run a very simple problem in release mode, where the loop count is the same but the time in the raytrace is very small, I get this:

Serial: 0.031 seconds
Parallel: 0.135

So the total overhead time is no more than 0.1 seconds. That's less than the time lost in the real parallel case. So the difference cannot be from overhead. Is Windows scheduling things in a way that one thread goes slowly? That would do it, since the program has to wait for them all to finish before moving on. The Resource monitor says that I have 103 threads going, while System has 157.
anthonyrichards
New Contributor III
I would suggest using a simple loop to run 30000 times.
What does your routine do if a ray doesn't trace?
Are you splitting rays (for ghost image analysis etc.)?
What types of curved surfaces are you trying to trace?
Are you sure you have the most efficient algorithm for finding intersections for each type of surface?
What about multiple intersections etc. for surfaces of degree >= 2?
What type of transformation matrices are you using to transform from one surface's local frame to the next?
Are you using 3-vectors or 4-vectors (to do rotations and shifts)?

I could go on...but will stop here!
dondilworth
New Contributor II
Of course all of those are considerations when one writes a code like this. But they were all addressed many years ago, and the code now is already much faster than the competition's. So those issues really do not bear on the present situation, which I have repeated many times above: Why does the release mode work more slowly in multithread mode than in serial mode? All my tests indicate that the bottleneck is not in the details of the lens or the raytrace algorithm, since the problem persists with all degrees of complexity. It appears to be some kind of scheduling problem in Windows -- which perhaps holds up one or more threads while the others are long finished.

If my guess is correct, and that is the problem, how does one fix it?
IanH
Honored Contributor II
Here's a hypothetical explanation - details depend on implementation that I'm not familiar with in the slightest, but at least it should give you some ideas.

At the end of the parallel do construct your program will sit around and wait for all the threads to catch up. If seven of your threads got there more or less immediately, but the eighth was pre-empted (something else in the system got that core for the timeslice), then the wall-clock time will be governed by the time it takes for that eighth thread to be rescheduled and finally complete.

Crudely (it depends on how the synchronisation in your program works, what else is happening in the system, what other threads are doing, etc.) scheduling will be on the basis of timeslices. A timeslice on Windows is of the order (it varies) of 10 milliseconds. If your unit of work is only 9 microseconds, then the potential overhead in synchronising the threads at the end of the loop could be rather large - you have eight threads, but for 99 percent of their time they sit around waiting for their colleague to finally get its work done. In the serial case, the one core just keeps plugging on, and with eight cores on the system it's not likely that it will ever be preempted.

(Even if the pre-empted timeslice aspect is a bit of a ruse - the message is the same - you want to minimise the amount of synchronisation that needs to happen - hence the suggestion to create much larger batches that can be processed independently.)

Note that it took almost three pages of forum posts to flush out some of the necessary detail. If you had a cutdown variant of your program that demonstrated the problem things would be progressing much faster. I'm still not clear on the relevant structure of your program.

Other notes:

- How many "real" cores do you have? Is it eight? Or is it four, with hyperthreading?

- For some problems the limiting aspect is memory access. Multithreading doesn't help much here - in fact it can make the problem worse.

- The total runtime of your program is still quite short. Could you construct an example that has a runtime of say 30 seconds or more? When a process starts in Windows there is a whole heap of things that go on - having a longer run time should eliminate that overhead. That will also help you understand whether the disk access that you are seeing is due to your program's calculations (if it is, then multithreading is far less likely to help, because disk access is so slow...).
dondilworth
New Contributor II
Yes, it's a messy problem. This program has been under development since the early 60s, and is probably older than many of the people posting here. (First attempts ran on vacuum tubes!) The source package is around 650 MB, and it's not practical to repackage it in a digestible size. I wish I could. I can zip it up and send the whole geshaeft to anyone brave enough to get into it, if that will do the trick. I understand that finding a problem requires a manageable test bench.

I have eight real cores. I am not familiar enough with the other things the OS is doing to venture a guess as to the interactions. I was hoping that the writers of OpenMP would have sorted those things out so that their product would work reliably on many systems. If the details of the OS are critical to the success of my project, then just dealing with what's happening on my PC -- even if successful, will not help my customers, who have other setups. This has to work all the time.

I will see about getting an example that runs for a much longer time. That is a wise suggestion.

I see that one of the schedule options is STATIC, CHUNK. But I have not seen any definition of what a chunk is. A single machine instruction, timeslice, trip through a subroutine, loop cycle, or what?

I do not think that all of the disk accessing I see is due to my program. Looking at the Resource Monitor, I see lots of disk activity all the time, whether I am running my code or just sitting there. Looks like Windows is doing its own homework or something. Anyway, I have 16 gigs of real memory, and I assume that the virtual-memory OS will keep all of my code and data in memory or cache, so there should be no disk overhead on my account.

Here's a clue: If I run my test case and run 10 trips at a time through the DO loop, in debug mode it takes 3.3 seconds. The Task Manager shows that four of my cores are doing nothing, while the other four peak up. If I run eight trips (hoping for one core per trip) all eight of the cores peak up. But the run still takes 3.3 seconds. Those figures are somewhat chaotic; lots of runs produce lots of numbers, some requiring more, but rarely less, time. Does this make sense?

I've tried all of the schedule options: static, dynamic, guided, runtime, and auto. They all score about the same, except the last two, which are slightly slower.

But I'm running out of things to try. My numbers show no benefit from OpenMP, in spite of weeks of effort and patient advice from many experts on this (multithreaded!) forum. Is it worthwhile to continue?
jimdempseyatthecove
Honored Contributor III
>>But I have not seen any definition of what a chunk is
(assuming 8 threads)

!$omp parallel do
do i = 1, 32000
    {first thread iterates i = 1:4000}
    {second thread iterates i = 4001:8000}
    ...

Each thread's iteration range is determined at beginning of loop.

In the above, the schedule defaults to static and "chunk" is number of iterations/number of threads

The above default works fine when

a) the amount of work per iteration is equal
b) the availability of threads is equal (other apps may be running and taking "your" time).

When work load or thread availability is at issue then you can program using dynamic scheduling.
Dynamic scheduling first partitions the iteration space into smaller chunks than n/nThreads. The first pass is on a first-come-first-serve availability of threads, then the chunk size is halved, picking off chunks first-come-first-serve, etc., until no iteration space remains. (You can specify a starting chunk size.)

schedule(static, nnn)

Says that each thread consumes the lesser of nnn or remaining iteration space.

You choose the technique that best serves your purpose.

>>> If I run eight trips (hoping for one core per trip) all eight of the cores peak up. But the run still takes 3.3 seconds

Then your serial code is taking the bulk of the time...
or your parallel code contains calls that internally serialize (for most of the processing time)

Do you have a profiler (Intel's VTune)?
If you do not, then you can get a 30-day trial version from Intel
Or... try AMD's CodeAnalyst, which can perform timer-based analysis on Intel processors.

Use a timer-based analysis (as opposed to event- or instruction-based).

Jim Dempsey

dondilworth
New Contributor II
Jim:

I want to thank you for the helpful reply. I found the comment "...parallel code contains calls that internally serialize" interesting. What kind of calls will internally serialize? I call my subroutine, it calls others, and so on. All of them change only local variables, and a given trip through the DO loop, with its string of calls, is supposed to be serial. I never planned on any further division of labor, once that trip starts. But I want lots of those trips in parallel.

I tried the suggestion to run a case that takes a great deal more time. That was telling. Here's what I got:

Serial mode: 12.2 seconds
Multi, DO index 1 to 1: 15 seconds
1 to 2: 9.6
1 to 3: 7.3
1 to 4: 7.7
1 to 5: 7.1
1 to 6: 6.7
1 to 7: 6.8
1 to 8: 17.8
1 to 9: 6.9
1 to 10: 6.6

This is in debug mode. Now it seems that the OpenMP feature is giving me a 50% speed increase! Wow.

But why on Earth would making the number of trips through the DO equal to the number of cores take longer than any other case? Eight cores, eight trips, takes 17.8 seconds. Ugh. Does that make sense?

I'll test the release version next. Stay tuned.
dondilworth
New Contributor II
I made an even slower example and ran it in release mode. Here is what happened:

Loop count = number of threads = N

Serial: 2.63 seconds
N = 1: 2.02
N = 2: 1.9
N = 4: 2.1
N = 8: 2.3

Loop = 10, threads = 2: 1.9
Loop = 10, threads = 8: 2.3
Loop = 10, threads = 10: 2.15

So the best case was using only two threads, not eight as I would have guessed. If I watch the Task Manager, I can see that, for this case, five of my eight cores were busy. Weird.
jimdempseyatthecove
Honored Contributor III
>>What kind of calls will internally serialize?

allocate/deallocate (even implicitly via temporary array)
rand (or any one at a time random number generator)
file I/O
screen I/O
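The allocate/deallocate case is usually fixed by hoisting per-thread workspace out of the hot loop. A minimal sketch, with assumed names (`nrays`, `trace_ray`, and the workspace size are illustrative, not the original program):

```fortran
! Sketch: heap allocation inside the hot loop serializes on the runtime's
! allocator lock. Allocating one workspace per thread, once, avoids that.
real(8), allocatable :: work(:)
integer :: i

!$omp parallel private(work, i)
allocate(work(1000))               ! once per thread, not once per ray
!$omp do schedule(static,1)
do i = 1, nrays
    call trace_ray(i, work)        ! reuses the thread-private buffer
end do
!$omp end do
deallocate(work)
!$omp end parallel
```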

>>1 to 8: 17.8 ~2x

This does seem odd

I am assuming you are using schedule(static,1) on system with 8 hardware threads

Note also that 9 and 10 do not take additional time???

Is the time you list

a) the time per thread per iteration
b) the time of the parallel do loop
c) the time of the outer loop (~30000) containing the parallel do

Jim Dempsey
jimdempseyatthecove
Honored Contributor III
>>Loop = 10, threads = 10: 2.15

So the best case was using only two threads, not eight as I would have guessed. If I watch the Task Manager, I can see that, for this case, five of my eight cores were busy. Weird.

Do not configure OpenMP for more threads than you have hardware threads.
An 8 core processor will have 8 hardware threads without HT, or 16 hardware threads with HT
(more if you are on MIC).

module mod_foo
    real :: array(30000)
end module mod_foo

...
use mod_foo
...
array = 0.5
iMaxThreads = omp_get_max_threads()
{if the above does not work due to a missing function in the library:}
!$omp parallel
iMaxThreads = omp_get_num_threads()
!$omp end parallel

do nThreads = 1, iMaxThreads
    t0 = omp_get_wtime()
    !$omp parallel num_threads(nThreads)
    !$omp do
    do i = 1, size(array)
        call doWork(i)
    end do
    !$omp end do
    !$omp end parallel
    t1 = omp_get_wtime()
    write(*,*) nThreads, t1-t0
end do

! dummy do-work subroutine
! *** that does not get optimized out by the compiler
subroutine doWork(i)
    use mod_foo
    integer :: i
    integer :: j
    do j = 1, size(array)
        if (sqrt(array(j)) .eq. real(i)) array(i) = array(i) + 1.0
    end do
end subroutine doWork


Note, the above is memory-intensive as opposed to compute-intensive, so performance will drop after 3-4 threads.
You can add compute statements to the loop, but be careful that the compiler does not remove the code if it appears that the results are not used.

See how the above performs.

Jim Dempsey