Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Reserving a core for a thread

dondilworth
New Contributor II
7,444 Views
My multi-threaded program works, but I find that running with several cores actually takes longer than a single-thread run. Here's what is going on:

1. Several threads are started. They each go into a DO WHILE loop, waiting for a flag in an indexed variable in a named COMMON block.

2. When the flag is set, the thread starts some serious number crunching, using data taken from another indexed set of variables. When the routine finishes, it sets another flag, also in an indexed array.

3. The calling program waits until all threads have finished, checking the flags in another DO WHILE loop, and then reads out the answers from the indexed variables. Then it starts over with another set of input data, and so on.

This should run faster with several threads running in parallel, but it doesn't. Possible reasons:

1. One of the threads is in a core being used by Windows for something else. It cannot complete its task until Windows lets it run, and the calling program has to wait for that.

2. The several threads do not in fact run in parallel. If they go individually, that would also be very slow.

Is there any way to ensure that my threads get their own cores, and they all run in parallel?
0 Kudos
43 Replies
IanH
Honored Contributor III
2,853 Views
Are you sure that your synchronisation method (spinning in a do while loop, waiting on a variable) works? Note that you'd need to look at the generated assembly in order to decide. I don't play at that low a level unless I've been a very bad boy, but inherently I doubt that you could write robust and efficient synchronisation methods in straight Fortran (pre-F2003) - typically there would be a call to an operating system API or some other library.

Others have already done a lot of the hard work in this area - have a look at OpenMP (a good starting point if you have a threaded/shared memory view of the world) or coarrays (part of the standard Fortran language as of F2008).
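To make that concrete, here is a minimal OpenMP sketch of the same fan-out-and-collect pattern done without any hand-rolled synchronisation. All names here are hypothetical stand-ins, not taken from your code:

```fortran
! Minimal sketch, assuming NRAYS independent ray computations.
! The OpenMP runtime creates the threads and performs the join at
! the end of the loop; no COMMON-block flags or DO WHILE spin
! loops are needed.
PROGRAM TRACE_SKETCH
  USE OMP_LIB
  IMPLICIT NONE
  INTEGER, PARAMETER :: NRAYS = 100
  DOUBLE PRECISION :: RESULTS(NRAYS)
  INTEGER :: I
!$OMP PARALLEL DO
  DO I = 1, NRAYS
     RESULTS(I) = DBLE(I)     ! stand-in for the real ray-trace work
  END DO
!$OMP END PARALLEL DO
  PRINT *, 'done, checksum =', SUM(RESULTS)
END PROGRAM TRACE_SKETCH
```

Each iteration writes only its own element of RESULTS, which is the same "each thread owns its own slice of an indexed array" discipline you describe, but the runtime handles start-up, scheduling and the final wait.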

How are you starting your threads? How many are there? What sort of machine are you running on (how many real cores does it have)? What is the serious number crunching - are the threads trying to read from/write to the same variables?
0 Kudos
dondilworth
New Contributor II
2,853 Views
Yes, it's a big issue. Threads are started with

DO J = 1,ISFLAGS(171) ! NUMBER TO AUTHORIZE
   ARG = -J ! FLAG WHAT TO DO NEXT
   MTSTATUS(J) = 1 ! FLAG IS STARTED
   THREADHANDLE = CREATETHREAD (SECURITY,STACK,MTFUNCTION,ARG,FLAGS,THREAD_ID)
   IRETLOG = SETTHREADPRIORITY(THREADHANDLE,THREAD_PRIORITY_ABOVE_NORMAL)
   IRETINT = RESUMETHREAD(THREADHANDLE)
   ITH = THREADHANDLE
   IF (ITH .EQ. 0) THEN
      ISFLAGS(171) = 0 ! DEFEAT ALTOGETHER
      CALL MPANIC('ERROR CREATING MULTIPLE THREADS')
      RETURN
   ENDIF
   ITHREADHANDLE(J) = ITH
ENDDO

Each pass through this loop creates a thread that starts at the top of MTFUNCTION. That routine calls another, called DOMTRAYTRACE, which waits with a DO WHILE loop until flagged to start processing. My intent is that the latter program will execute in the thread (and in the core of the thread) which called it.

The number of threads is constrained to be no more than the number of cores - 2. That leaves two for Windows to play with. So on my 8-core system there can be up to six.

Each thread uses its own set of data, placed in an indexed array by the caller, plus its own automatic variables. So no thread directly changes any data used by any other thread.

The number crunching involves tracing a light ray through a lens, and if there are many elements it takes a while. So if I want to trace 100 rays, I can start six of them running in parallel, collect the answers, start a new six, and so on. The right answers come back so each thread is doing its job with its own data.

Every subroutine involved in the operation of the threads is compiled with

USE IFMT
USE IFCORE

which (I hope) makes it run in the same thread as its caller. I have looked for details of the two USE directives, so I would better know what I am doing. But those terms are not in the help file.

The Windows Task Manager Performance tab shows my eight cores, normally most of them idle. When I start my threads, the requested number shoot up to nearly full utilization -- which is what you would expect from the DO WHILE loops. So it looks like they have started and are running as I planned.

But how can I be sure that the routines called in each thread actually execute in the same core and in parallel? I am using IVF Composer XE 2011.

In C++ one has the SetThreadAffinityMask() option to keep a thread in a single core. What does Fortran offer?
0 Kudos
jimdempseyatthecove
Honored Contributor III
2,853 Views
Have you tried OpenMP?

Using FPP, it should be relatively easy for you to experiment with the same source code compiled for OpenMP or for your threads system.

Pseudo code:

#ifndef _OPENMP
  call createYourThreads()
#endif
do while (fetchRays())
#ifndef _OPENMP
  call setYourGoFlags()
  call waitForAllToFinish()
#else
!$omp parallel do
  do i = 1, nRays
     call traceRay(i)
  end do
!$omp end parallel do
#endif
end do

Where traceRay(i) is called in your current code.

I wouldn't worry about reserving one thread for Windows.
It will float from core to core.

You can specify the number of threads for a parallel region in OpenMP.

Jim Dempsey
0 Kudos
dondilworth
New Contributor II
2,853 Views
Jim:

That's a good suggestion, but I have a question: I have implemented my common blocks for 10 cores, so I cannot do more than that number of loops in parallel. When the OpenMP parallel construct finishes, I collect the data from those 10, and then I presume the OS terminates all of the threads. Then for the next set, it has to create them all over again -- and the overhead then exceeds the savings. That is why I create the threads initially and then reuse them over and over.

Does OpenMP do anything like that? I mean, create some threads and then reuse them over again? If your proposed traceRay() exits, then the thread evaporates, right?
0 Kudos
JVanB
Valued Contributor II
2,853 Views

SetThreadAffinityMask is a Win32 API function, not a C++ one so you can use it in Fortran.
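A hedged sketch of what that call might look like from Intel Fortran follows. The KERNEL32 module shipped with the compiler declares the interface; the exact integer kinds of the handle and mask arguments should be checked against the compiler's Windows API module documentation, and the subroutine name here is made up for illustration:

```fortran
! Sketch only: pin the calling thread to one logical processor.
! SetThreadAffinityMask and GetCurrentThread are Win32 API calls
! declared in Intel Fortran's KERNEL32 module.
SUBROUTINE PIN_CURRENT_THREAD(CORE)
  USE KERNEL32, ONLY: SetThreadAffinityMask, GetCurrentThread
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: CORE   ! zero-based logical-processor index
  INTEGER :: MASK, RET
  MASK = ISHFT(1, CORE)         ! one mask bit per logical processor
  RET  = SetThreadAffinityMask(GetCurrentThread(), MASK)
  IF (RET .EQ. 0) PRINT *, 'SetThreadAffinityMask failed'
END SUBROUTINE PIN_CURRENT_THREAD
```

A zero return value means the call failed; a nonzero return is the thread's previous affinity mask.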

0 Kudos
jimdempseyatthecove
Honored Contributor III
2,853 Views
OpenMP works with a pool of threads.

PROGRAM YourProgram
  ... any data
  ...
  {optional call to omp_... function to query/set max threads}
  ! the following is the 1st time OpenMP is called
  !$OMP ...
  {hidden code does once-only allocation of thread pool}
  {note: subsequent nested levels may add additional pool(s)}

IOW the thread pool is created once (unless nested levels, and then only once again per nest level branch)

Subsequent to the first time call, the threads get re-used as your code enters parallel regions.

As your code exits a parallel region, there is an implicit join (unless parallel region exit is attributed with NOWAIT).

Upon exit of a parallel region, the (additional) threads either run something else in your application or, failing that, enter a spinwait (default 100ms-300ms) and will resume immediately should you enter another parallel region. Should the spinwait time expire before the next parallel region is entered, the thread suspends itself (but does not exit). You can change the spinwait time (KMP_BLOCKTIME environment variable or kmp_set... library call).

OpenMP should offer you everything you need (from what you describe).

Jim Dempsey
0 Kudos
TimP
Honored Contributor III
2,853 Views
For Intel OpenMP, the KMP_BLOCKTIME feature controls how long the thread pool persists after a parallel region is closed, default 200 (milliseconds). It's not a fully portable feature, although it's the same in all Intel OpenMP implementations.
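For example, to keep the pool spinning longer between closely spaced regions (note this is an Intel-specific extension, not standard OpenMP, so portable builds should guard or omit it):

```fortran
! Intel OpenMP extension, not portable: raise the spin-wait time so
! worker threads stay hot between closely spaced parallel regions.
CALL KMP_SET_BLOCKTIME(1000)   ! milliseconds
```

Setting KMP_BLOCKTIME=1000 in the environment before launching the program has the same effect without a code change.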
0 Kudos
dondilworth
New Contributor II
2,853 Views
These are all very pertinent replies, and I'm exploring what OpenMP can do. I have implemented a version much like the example by Jim Dempsey, and it runs as it should -- but is also slower than the single-thread version. Then I checked the Task Manager, to watch how busy my eight cores are. Most of them were doing nothing, even though I supposedly ran the loop for 10 cores.

So I probably need a directive to say how many cores to employ. I tried

MT = OMP_GET_NUM_PROCS()

and got a linker error. How does one get access to those functions? I want to be sure all my cores are running.
0 Kudos
TimP
Honored Contributor III
2,853 Views
Intel (and gnu) OpenMP default num_threads to the number of logical processors seen. omp_get_num_procs will work only with the USE OMP_LIB or equivalent. It will only confirm the number of logical processors, will not check or determine the number of threads running. If you have HyperThreading, you should try setting OMP_NUM_THREADS so as to try not more than 1 thread per core, and set KMP_AFFINITY to spread out the threads across cores (e.g. KMP_AFFINITY=compact,1,1 or KMP_AFFINITY=scatter). It's quite difficult to get OpenMP performance from HyperThreading on Windows.
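In code that could look like the following sketch; the division by 2 is an assumption that there are two logical processors per physical core, which should be verified for the machine in question:

```fortran
! Sketch: request one OpenMP thread per physical core on a
! HyperThreaded CPU, assuming 2 logical processors per core.
USE OMP_LIB
CALL OMP_SET_NUM_THREADS(MAX(1, OMP_GET_NUM_PROCS() / 2))
```

OMP_NUM_THREADS in the environment does the same without recompiling, and KMP_AFFINITY then controls how those threads are placed on cores.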
0 Kudos
jimdempseyatthecove
Honored Contributor III
2,853 Views
>> is also slower than the single-thread version

Can you post an outline of your code?
Include the OpenMP directives.
Include any code you wrote yourself for thread coordination
(it may still be in there from your prior coding attempt)

I will be out over this weekend but others may help.

Unless the work done in the parallel section is very short
(or you are calling functions performing serialization: allocate/deallocate, rand, R/W, ...)
the OpenMP version should be faster.

Jim Dempsey
0 Kudos
dondilworth
New Contributor II
2,853 Views
Okay, here's an outline:

SUBROUTINE TRANS(JRET)
USE IFMT
USE IFCORE
USE OMP_LIB
...

INDEX = 1
...

c NP = omp_get_num_procs() (This causes a link error:

error LNK2019: unresolved external symbol _omp_get_num_procs referenced in function _TRANS

so it is commented out.)

9499 (generate individual ray starting data)
...

IF (INDEX .LE. NCORES) THEN ! START THREAD
   ISFLAGS(177) = INDEX
   CALL ZASABR(JERR,IPRT,IOPD,ICAO,XEN,YEN,HBAR,MX,ICOL,OPD,D2,GBAR) ! LOAD INPUT DATA THERE

   IF (INDEX .EQ. NCORES) THEN ! ALL LOADED; TRACE RAYS NOW
!$OMP PARALLEL DO
      DO I = 1,INDEX
         CALL MTRAYTRACE(I) ! gets data from the indexed array filled by ZASABR()
      ENDDO
!$OMP END PARALLEL DO
      GO TO 8801 ! read out results and process them sequentially; then set INDEX = 1, start over at 9499
   ENDIF

   INDEX = INDEX + 1 ! can start yet more threads immediately
   GO TO 9499 ! SET UP NEXT RAY; comes back in above
ENDIF

There are two problems: first, why can't I link the omp_... routine? Is there another library I have to declare to the linker? Second, why are many of my cores idle?
0 Kudos
dondilworth
New Contributor II
2,853 Views
I've checked the Threads window in the debugger, and before I call any of the OMP routines, there is a Main Thread and two Worker Threads. After I get to the first !$OMP PARALLEL DO, the same three threads show up. I would expect as many threads as I have cores. Clearly, the OpenMP feature is not working.

In case it's useful, here are the command lines for Fortran and the linker:

/nologo /debug:full /debug:parallel /Oy- /I"Debug/" /recursive /reentrancy:none /extend_source:132 /warn:none /Qauto /align:rec4byte /align:commons /assume:byterecl /Qzero /fpconstant /iface:cvf /module:"Debug/" /object:"Debug/" /Fd"Debug\vc100.pdb" /traceback /check:all /libs:dll /threads /winapp /c

/OUT:".\Debug\SYNOPSYS200v14.exe" /VERSION:"14.0" /INCREMENTAL /NOLOGO /LIBPATH:"C:\Program Files (x86)\Intel\Composer XE 2011 SP1\\compiler\lib\ia32" /LIBPATH:"C:\Projects\U105136\OpenGL\freeglut 2.4.0 (compiled)" /LIBPATH:"C:\SYNOPSYSV14\Libraries(x64)\Static\MT" "mpr.lib" "SentinelKeys.lib" "wsock32.lib" "freeglut.lib" "Debug\SYNOPSYS200_lib_.lib" /NODEFAULTLIB:"LIBCMT.lib" /NODEFAULTLIB:"msvcrtd.lib" /NODEFAULTLIB:"msvcrt.lib" /MANIFEST /ManifestFile:".\Debug\SYNOPSYS200v14.exe.intermediate.manifest" /ALLOWISOLATION /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /DEBUG /PDB:"C:\SYNOPSYSV14\Debug\SYNOPSYS200v14.pdb" /SUBSYSTEM:WINDOWS /STACK:"3000000" /PGD:"C:\SYNOPSYSV14\Debug\SYNOPSYS200v14.pgd" /TLBID:1 /DYNAMICBASE /NXCOMPAT /MACHINE:X86 /ERRORREPORT:QUEUE
0 Kudos
dondilworth
New Contributor II
2,853 Views
Progress! I found a page that says I have to add VCOMPD.lib to the linker input. Now I can use the omp_... routines -- but I still don't get more than the usual three threads, even after the !$OMP PARALLEL DO.

So something's still wrong.
0 Kudos
dondilworth
New Contributor II
2,853 Views
Yes, I'm a newbie, and I didn't know about linking the library VCOMPD.lib, and also setting the compiler options to recognize the omp_ calls. So I got most things working: it makes lots of threads and they all run when I get to the !$OMP PARALLEL DO statement. So far, so good. And the code comes back with the right answers.

But here are my timing specs:

serial mode: 0.371 seconds
10 cores, 10 passes in the DO loop each sequence, which is run about 985 times: 0.546 seconds.

It's hard to believe there is so much overhead in the OpenMP routines. There is some added work in my code, of course; there are 36 assignment statements associated with each core going into the calculation, and several hundred coming out. But if I run a simpler problem, where the calculations are faster but the overhead is the same, I get

serial: 0.156
10 cores: 0.215

So the overhead has to be no more than 0.059 seconds, even if the parallel execution was exactly the same speed as the serial. None of this makes sense.

Is there lots of overhead just triggering each pass through a given thread? That might do it.
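One way to find out is to time the parallel region itself with omp_get_wtime, so the per-region overhead can be separated from the rest of each pass. This is a sketch; MTRAYTRACE and NRAYS stand in for the names used in the outline posted earlier:

```fortran
! Sketch: isolate the cost of one parallel region so per-region
! overhead can be measured directly against the serial loop.
DOUBLE PRECISION :: T0, T1
T0 = OMP_GET_WTIME()
!$OMP PARALLEL DO
DO I = 1, NRAYS
   CALL MTRAYTRACE(I)
END DO
!$OMP END PARALLEL DO
T1 = OMP_GET_WTIME()
PRINT *, 'parallel region took', T1 - T0, 'seconds'
```

Summing these times over all ~985 passes and comparing against the serial total shows whether the loss is per-region overhead or slower execution inside the region.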
0 Kudos
TimP
Honored Contributor III
2,853 Views
If you didn't set /Qopenmp (there's a prominent option for it in the Visual Studio GUI as well), your OpenMP directives should be reported with warnings and then ignored. That would also explain the link failure: without /Qopenmp the compiler doesn't pull in libiomp5, so your omp_... calls won't be resolved.
0 Kudos
Anonymous66
Valued Contributor I
2,853 Views
As Tim said, you need to set the option /Qopenmp. It is under Fortran > Language > Process OpenMP Directives in the properties menu. Without this option, the OpenMP directives will be ignored.
0 Kudos
dondilworth
New Contributor II
2,853 Views
I've made some real progress. I now have the debug version running my eight cores with eight threads, and my test case runs 1.56x faster with multithreads enabled than it does in serial mode. A key point was to not enable recursive routines and not generate reentrant code. (This by dumb trial-and-error, since I have not seen that information given anywhere.) This is great news!

But the release version still runs 1.6x slower in multithread mode, and I don't know why. Here are the command lines:

DEBUG: 1.56X FASTER

Fortran

/nologo /debug:full /debug:parallel /Oy- /I"Debug/" /reentrancy:none /extend_source:132 /Qopenmp /Qopenmp-report1 /warn:none /Qauto /align:rec4byte /align:commons /assume:byterecl /Qzero /fpconstant /iface:cvf /module:"Debug/" /object:"Debug/" /Fd"Debug\vc100.pdb" /traceback /check:all /libs:dll /threads /winapp /c

C++

/ZI /nologo /W1 /WX- /Od /Ot /Oy- /D "WIN32" /D "_DEBUG" /D "_WINDOWS" /D "_VC80_UPGRADE=0x0600" /Gm /EHsc /MTd /GS /Gy- /fp:precise /Zc:wchar_t /Zc:forScope /Fp".\Debug\SYNOPSYS200.pch" /Fa".\Debug" /Fo".\Debug" /Fd".\Debug" /FR".\Debug" /Gd /analyze- /errorReport:queue

Linker

/OUT:".\Debug\SYNOPSYS200v14.exe" /VERSION:"14.0" /INCREMENTAL /NOLOGO /LIBPATH:"C:\Program Files (x86)\Intel\Composer XE 2011 SP1\\compiler\lib\ia32" /LIBPATH:"C:\Projects\U105136\OpenGL\freeglut 2.4.0 (compiled)" /LIBPATH:"C:\SYNOPSYSV14\Libraries(x64)\Static\MT" "mpr.lib" "SentinelKeys.lib" "wsock32.lib" "freeglut.lib" "Debug\SYNOPSYS200_lib_.lib" "VCOMPD.LIB" /NODEFAULTLIB:"LIBCMT.lib" /NODEFAULTLIB:"msvcrtd.lib" /NODEFAULTLIB:"msvcrt.lib" /MANIFEST /ManifestFile:".\Debug\SYNOPSYS200v14.exe.intermediate.manifest" /ALLOWISOLATION /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /DEBUG /PDB:"C:\SYNOPSYSV14\Debug\SYNOPSYS200v14.pdb" /SUBSYSTEM:WINDOWS /STACK:"3000000" /PGD:"C:\SYNOPSYSV14\Debug\SYNOPSYS200v14.pgd" /TLBID:1 /DYNAMICBASE /NXCOMPAT /MACHINE:X86 /ERRORREPORT:QUEUE



RELEASE: 1.6X SLOWER

Fortran

/nologo /Oy- /Qipo /I"Release/" /reentrancy:none /extend_source:132 /Qopenmp /Qauto /align:rec4byte /align:commons /assume:byterecl /Qzero /fpconstant /iface:cvf /module:"Release/" /object:"Release/" /Fd"Release\vc100.pdb" /check:none /libs:dll /threads /winapp /c


C++

/Zi /nologo /W2 /WX- /O2 /Ot /Oy- /D "WIN32" /D "NDEBUG" /D "_WINDOWS" /D "_VC80_UPGRADE=0x0600" /GF /Gm- /EHsc /MT /GS /Gy- /fp:precise /Zc:wchar_t /Zc:forScope /GR /openmp /Fp".\Release\SYNOPSYS200.pch" /Fa".\Release" /Fo".\Release" /Fd".\Release" /FR".\Release" /Gd /analyze- /errorReport:queue

Linker

/OUT:".\Release\SYNOPSYS200v14.exe" /VERSION:"14.0" /INCREMENTAL /NOLOGO /LIBPATH:"C:\Program Files (x86)\Intel\Composer XE 2011 SP1\\compiler\lib\ia32" /LIBPATH:"C:\SYNOPSYSV14\OpenGL\freeglut 2.4.0 (compiled)" /LIBPATH:"C:\SYNOPSYSV14\Libraries(x64)\Static\MT" "ODBC32.LIB" "ODBCCP32.LIB" "mpr.lib" "SentinelKeys.lib" "wsock32.lib" "freeglut.lib" "VCOMP.lib" "Release\SYNOPSYS200_lib_.lib" /NODEFAULTLIB:"LIBCMTd.lib" /NODEFAULTLIB:"msvcrtd.lib" /NODEFAULTLIB:"msvcrt.lib" /MANIFEST /ManifestFile:".\Release\SYNOPSYS200v14.exe.intermediate.manifest" /ALLOWISOLATION /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /PDB:"C:\SYNOPSYSV14\Release\SYNOPSYS200v14.pdb" /SUBSYSTEM:WINDOWS /STACK:"3000000" /PGD:"C:\SYNOPSYSV14\Release\SYNOPSYS200v14.pgd" /TLBID:1 /DYNAMICBASE /NXCOMPAT /MACHINE:X86 /ERRORREPORT:QUEUE
0 Kudos
Anonymous66
Valued Contributor I
2,853 Views
Do you need the option /fp:precise in C++? This will disable a number of optimizations.

Also, is there a reason you are using /iface:cvf in Fortran? Are you calling libraries compiled with CVF?
0 Kudos
dondilworth
New Contributor II
2,853 Views
I need the /iface directive. The program was converted from CVF, and although there are no libraries compiled there, I have calling conventions built in all over the place that require that option. The /fp directive results in floating-point answers that are nearly the same as the CVF version, while the other options are often quite different.

I'm not sure the issue is optimization anyway. The program runs faster than the CVF version, even in serial mode. The issue is, why I cannot get the OpenMP services to work in release mode as well as in debug mode. The latter seems to work fine, and I get faster results running parallel than serial. But the release version runs more slowly in parallel mode than in serial, which suggests that it is actually running the parallel DO loop in serial, with extra overhead slowing it down even more.

So that's the problem. Can you think of any way to fix it?
0 Kudos
IanH
Honored Contributor III
2,806 Views
Quoting dondilworth
I've made some real progress. I now have the debug version running my eight cores with eight threads, and my test case runs 1.56x faster with multithreads enabled than it does in serial mode. A key point was to not enable recursive routines and not generate reentrant code....

AFAIK, use of those options means that your multithreaded program is now rather broken!

There's an overhead associated with multi-threading. Typically the amount of really independent work that can be done in parallel needs to be over a certain threshold before multithreading becomes worthwhile. Whether your notionally independent work is really independent, and whether there's enough of it to justify the overhead, can't be assessed by people not familiar with your code. If you could post examples, that would help us understand. Ideally those examples would be a cut-down, self-contained and compilable program.

Why do you have /warn:none on your debug build? Outside of certain special use situations the warnings given by the compiler with /warn:all are pretty relevant. Ignore them at your peril. I typically use /warn:all on both debug and release builds.

How are you timing your runs?

Previously you wrote:

Every subroutine involved in the operation of the threads is compiled with

USE IFMT
USE IFCORE

which (I hope) makes it run in the same thread as its caller.

Those USE statements don't change the execution of your program in a single-threaded/multithreaded sense at all. They simply make variable, type and procedure declarations available to your program in the scope that has the USE statement. Code that you subsequently write that uses those declarations then determines which thread runs what code.

0 Kudos
Reply