1. Several threads are started. They each go into a DO WHILE loop, waiting for a flag in an indexed variable in a named COMMON block.
2. When the flag is set, the thread starts some serious number crunching, using data taken from another indexed set of variables. When the routine finishes, it sets another flag, also in an indexed array.
3. The calling program waits until all threads have finished, checking the flags in another DO WHILE loop, and then reads out the answers from the indexed variables. Then it starts over with another set of input data, and so on.
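For concreteness, here is a minimal sketch of the pattern described above; the names (WORKERLOOP, MTCTRL, CRUNCH) are illustrative only, not from the actual program. One caution: flags that one thread polls while another thread writes them generally need the VOLATILE attribute, or an optimizing build may cache the flag in a register and spin forever.

      SUBROUTINE WORKERLOOP(ID)
C     ID identifies this worker's slot in the flag and data arrays
      INTEGER ID
      INTEGER GOFLAG(16), DONEFLAG(16)
      COMMON /MTCTRL/ GOFLAG, DONEFLAG
      VOLATILE GOFLAG, DONEFLAG          ! keep the optimizer from caching the flags
      DO WHILE (.TRUE.)
         DO WHILE (GOFLAG(ID) .EQ. 0)    ! spin until the caller raises this slot's flag
         ENDDO
         GOFLAG(ID) = 0
         CALL CRUNCH(ID)                 ! hypothetical routine: crunch this slot's input data
         DONEFLAG(ID) = 1                ! tell the caller this slot has finished
      ENDDO
      END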
This should run faster with several threads running in parallel, but it doesn't. Possible reasons:
1. One of the threads is in a core being used by Windows for something else. It cannot complete its task until Windows lets it run, and the calling program has to wait for that.
2. The several threads do not in fact run in parallel. If they run one after another, that would also be very slow.
Is there any way to ensure that my threads get their own cores, and they all run in parallel?
Others have already done a lot of the hard work in this area - have a look at OpenMP (a good starting point if you have a threaded/shared memory view of the world) or coarrays (part of the standard Fortran language as of F2008).
How are you starting your threads? How many are there? What sort of machine are you running on (how many real cores does it have)? What is the serious number crunching - are the threads trying to read from/write to the same variables?
      DO J = 1,ISFLAGS(171)             ! NUMBER TO AUTHORIZE
         ARG = -J                       ! FLAG WHAT TO DO NEXT
         MTSTATUS(J) = 1                ! FLAG IS STARTED
         THREADHANDLE = CREATETHREAD (SECURITY,STACK,MTFUNCTION,ARG,FLAGS,THREAD_ID)
         ITH = THREADHANDLE
         IF (ITH .EQ. 0) THEN           ! CHECK THE HANDLE BEFORE USING IT
            ISFLAGS(171) = 0            ! DEFEAT ALTOGETHER
            CALL MPANIC('ERROR CREATING MULTIPLE THREADS')
            RETURN
         ENDIF
         IRETLOG = SETTHREADPRIORITY(THREADHANDLE,THREAD_PRIORITY_ABOVE_NORMAL)
         IRETINT = RESUMETHREAD(THREADHANDLE)
         ITHREADHANDLE(J) = ITH
      ENDDO
Each pass through this loop creates a thread that starts at the top of MTFUNCTION. That routine calls another, DOMTRAYTRACE, which waits in a DO WHILE loop until flagged to start processing. My intent is that the latter routine will execute in the thread (and on the core of that thread) which called it.
The number of threads is constrained to be no more than the number of cores minus two. That leaves two for Windows to play with. So on my 8-core system there can be up to six.
Each thread uses its own set of data, placed in an indexed array by the caller, plus its own automatic variables. So no thread directly changes any data used by any other thread.
The number crunching involves tracing a light ray through a lens, and if there are many elements it takes a while. So if I want to trace 100 rays, I can start six of them running in parallel, collect the answers, start a new six, and so on. The right answers come back so each thread is doing its job with its own data.
Every subroutine involved in the operation of the threads is compiled with
USE IFMT
USE IFCORE
which (I hope) makes it run in the same thread as its caller. I have looked for details of those two USE statements so that I would better know what I am doing, but the terms are not in the help file.
The Windows Task Manager Performance tab shows my eight cores, normally most of them idle. When I start my threads, the requested number of cores shoots up to nearly full utilization -- which is what you would expect from the DO WHILE loops. So it looks as if the threads have started and are running as I planned.
But how can I be sure that the routines called in each thread actually execute in the same core and in parallel? I am using IVF Composer XE 2011.
In C++ one has the SetThreadAffinityMask() option to keep a thread in a single core. What does Fortran offer?
Using FPP, it should be relatively easy for you to experiment with the same source code compiled for OpenMP or for your threads system.
Pseudo code:
#ifndef _OPENMP
      call createYourThreads()
#endif
      do while (fetchRays())
#ifndef _OPENMP
      call setYourGoFlags()
      call waitForAllToFinish()
#else
!$omp parallel do
      do i = 1, nRays
         call traceRay(i)
      end do
!$omp end parallel do
#endif
      end do
Where traceRay(i) is the routine called in your current code.
I wouldn't worry about reserving one thread for Windows.
It will float from core to core.
You can specify the number of threads for a parallel region in OpenMP.
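A side note on building the FPP version: with ifort, /fpp runs the preprocessor and /Qopenmp defines _OPENMP, so the two builds would look roughly like this (the file name is just a placeholder):

      ifort /fpp /Qopenmp raytrace.f90   (_OPENMP defined: the OpenMP path compiles)
      ifort /fpp raytrace.f90            (_OPENMP undefined: the hand-threaded path compiles)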
Jim Dempsey
That's a good suggestion, but I have a question: I have implemented my common blocks for 10 cores, so I cannot do more than that number of loops in parallel. When the OpenMP parallel construct finishes, I collect the data from those 10, and then I presume the OS terminates all of the threads. Then for the next set, it has to create them all over again -- and the overhead then exceeds the savings. That is why I create the threads initially and then reuse them over and over.
Does OpenMP do anything like that? I mean, create some threads and then reuse them over again? If your proposed traceRay() exits, then the thread evaporates, right?
SetThreadAffinityMask is a Win32 API function, not a C++ one, so you can call it from Fortran as well.
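For example, here is a minimal sketch of pinning the calling thread to one core. The interfaces come from the KERNEL32 module, and HANDLE (a pointer-sized integer kind from IFWINTY) is used here for the mask, so verify the declared types against your compiler version:

      SUBROUTINE PINTOCORE(N)
C     Pin the calling thread to core N (0-based); illustrative only
      USE IFWINTY
      USE KERNEL32, ONLY: SetThreadAffinityMask, GetCurrentThread
      INTEGER, INTENT(IN) :: N
      INTEGER(HANDLE) :: OLDMASK
      OLDMASK = SetThreadAffinityMask(GetCurrentThread(), ISHFT(INT(1,HANDLE), N))
      IF (OLDMASK .EQ. 0) PRINT *, 'SetThreadAffinityMask failed'
      END SUBROUTINE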
PROGRAM YourProgram
      ... any data
      ...
      {optional call to an omp_... function to query/set max threads}
! the following is the first time OpenMP is called
!$OMP ...
      {hidden code does a once-only allocation of the thread pool}
      {note: subsequent nested levels may add additional pool(s)}
In other words, the thread pool is created once (unless you use nested parallelism, in which case only once more per nesting-level branch).
Subsequent to that first call, the threads get re-used as your code enters parallel regions.
As your code exits a parallel region, there is an implicit join (unless the parallel region exit is attributed with NOWAIT).
Upon exit of a parallel region, the (additional) threads either run something else in your application or, failing that, enter a spinwait (default 100ms-300ms) and resume immediately should you enter another parallel region. Should the spinwait time expire before the next parallel region is entered, the thread suspends itself (but does not exit). You can change the spinwait time (the KMP_BLOCKTIME environment variable or the kmp_set_blocktime library call).
OpenMP should offer you everything you need (from what you describe).
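For instance, one line either way, using Intel's extension to set the spinwait to zero:

      CALL KMP_SET_BLOCKTIME(0)   ! worker threads suspend immediately when a region ends

or, before launching the program:

      set KMP_BLOCKTIME=0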
Jim Dempsey
So I probably need a directive to say how many cores to employ. I tried
MT = OMP_GET_NUM_PROCS()
and got a linker error. How does one get access to those functions? I want to be sure all my cores are running.
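The usual recipe with Intel Fortran: the interfaces come from the OMP_LIB module, and the library that resolves the symbol at link time is pulled in by compiling and linking with /Qopenmp:

      USE OMP_LIB                 ! declares OMP_GET_NUM_PROCS and the other omp_* routines
      ...
      MT = OMP_GET_NUM_PROCS()    ! number of logical processors available

Without /Qopenmp the !$OMP directives are ignored and the omp_* references are left unresolved.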
Can you post an outline of your code?
Include the OpenMP directives.
Include any code you wrote yourself for thread coordination (it may still be in there from your prior coding attempt).
I will be out over this weekend but others may help.
Unless the work done in the parallel section is very short (or you are calling functions that serialize execution: allocate/deallocate, rand, read/write, ...), the OpenMP version should be faster.
Jim Dempsey
      SUBROUTINE TRANS(JRET)
      USE IFMT
      USE IFCORE
      USE OMP_LIB
      ...
      INDEX = 1
      ...
c     NP = omp_get_num_procs()
c     (commented out because it causes a link error:
c      error LNK2019: unresolved external symbol _omp_get_num_procs referenced in function _TRANS)
 9499 (generate individual ray starting data)
      ...
      IF (INDEX .LE. NCORES) THEN   ! START THREAD
         ISFLAGS(177) = INDEX
         CALL ZASABR(JERR,IPRT,IOPD,ICAO,XEN,YEN,HBAR,MX,ICOL,OPD,D2,GBAR) ! LOAD INPUT DATA THERE
         IF (INDEX .EQ. NCORES) THEN   ! ALL LOADED; TRACE RAYS NOW
!$OMP PARALLEL DO
            DO I = 1,INDEX
               CALL MTRAYTRACE(I)   ! gets data from the indexed array filled by ZASABR()
            ENDDO
!$OMP END PARALLEL DO
            GO TO 8801   ! read out results and process them sequentially; then set INDEX = 1, start over at 9499
         ENDIF
         INDEX = INDEX + 1   ! can start yet more threads immediately
         GO TO 9499   ! SET UP NEXT RAY; comes back in above
      ENDIF
There are two problems: first, why can't I link the omp_... routine? Is there another library I have to declare to the linker? Second, why are many of my cores idle?
In case it's useful, here are the command lines for Fortran and the linker:
/nologo /debug:full /debug:parallel /Oy- /I"Debug/" /recursive /reentrancy:none /extend_source:132 /warn:none /Qauto /align:rec4byte /align:commons /assume:byterecl /Qzero /fpconstant /iface:cvf /module:"Debug/" /object:"Debug/" /Fd"Debug\vc100.pdb" /traceback /check:all /libs:dll /threads /winapp /c
/OUT:".\Debug\SYNOPSYS200v14.exe" /VERSION:"14.0" /INCREMENTAL /NOLOGO /LIBPATH:"C:\Program Files (x86)\Intel\Composer XE 2011 SP1\\compiler\lib\ia32" /LIBPATH:"C:\Projects\U105136\OpenGL\freeglut 2.4.0 (compiled)" /LIBPATH:"C:\SYNOPSYSV14\Libraries(x64)\Static\MT" "mpr.lib" "SentinelKeys.lib" "wsock32.lib" "freeglut.lib" "Debug\SYNOPSYS200_lib_.lib" /NODEFAULTLIB:"LIBCMT.lib" /NODEFAULTLIB:"msvcrtd.lib" /NODEFAULTLIB:"msvcrt.lib" /MANIFEST /ManifestFile:".\Debug\SYNOPSYS200v14.exe.intermediate.manifest" /ALLOWISOLATION /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /DEBUG /PDB:"C:\SYNOPSYSV14\Debug\SYNOPSYS200v14.pdb" /SUBSYSTEM:WINDOWS /STACK:"3000000" /PGD:"C:\SYNOPSYSV14\Debug\SYNOPSYS200v14.pgd" /TLBID:1 /DYNAMICBASE /NXCOMPAT /MACHINE:X86 /ERRORREPORT:QUEUE
So something's still wrong.
But here are my timing specs:
serial mode: 0.371 seconds
10 cores, 10 passes through the DO loop per sequence, with the sequence run about 985 times: 0.546 seconds.
It's hard to believe there is so much overhead in the OpenMP routines. There is some added work in my code, of course; there are 36 assignment statements associated with each core going into the calculation, and several hundred coming out. But if I run a simpler problem, where the calculations are faster but the overhead is the same, I get
serial: 0.156
10 cores: 0.215
So the overhead has to be at least 0.059 seconds, and that assumes the parallel execution of the work is no faster than the serial; any real speedup would imply even more overhead. None of this makes sense.
Is there lots of overhead just triggering each pass through a given thread? That might do it.
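One way to pin that down would be to time the parallel region itself with OMP_GET_WTIME and compare against the same loop with the directives removed; a sketch, reusing the loop from the earlier outline:

      DOUBLE PRECISION T0, T1
      T0 = OMP_GET_WTIME()
!$OMP PARALLEL DO
      DO I = 1,INDEX
         CALL MTRAYTRACE(I)
      ENDDO
!$OMP END PARALLEL DO
      T1 = OMP_GET_WTIME()
      PRINT *, 'REGION TIME (S):', T1 - T0

That separates the per-region fork/join cost from the cost of the ray tracing itself.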
But the release version still runs 1.6x slower in multithread mode, and I don't know why. Here are the command lines:
DEBUG: 1.56X FASTER
F
/nologo /debug:full /debug:parallel /Oy- /I"Debug/" /reentrancy:none /extend_source:132 /Qopenmp /Qopenmp-report1 /warn:none /Qauto /align:rec4byte /align:commons /assume:byterecl /Qzero /fpconstant /iface:cvf /module:"Debug/" /object:"Debug/" /Fd"Debug\vc100.pdb" /traceback /check:all /libs:dll /threads /winapp /c
C++
/ZI /nologo /W1 /WX- /Od /Ot /Oy- /D "WIN32" /D "_DEBUG" /D "_WINDOWS" /D "_VC80_UPGRADE=0x0600" /Gm /EHsc /MTd /GS /Gy- /fp:precise /Zc:wchar_t /Zc:forScope /Fp".\Debug\SYNOPSYS200.pch" /Fa".\Debug" /Fo".\Debug" /Fd".\Debug" /FR".\Debug" /Gd /analyze- /errorReport:queue
Linker
/OUT:".\Debug\SYNOPSYS200v14.exe" /VERSION:"14.0" /INCREMENTAL /NOLOGO /LIBPATH:"C:\Program Files (x86)\Intel\Composer XE 2011 SP1\\compiler\lib\ia32" /LIBPATH:"C:\Projects\U105136\OpenGL\freeglut 2.4.0 (compiled)" /LIBPATH:"C:\SYNOPSYSV14\Libraries(x64)\Static\MT" "mpr.lib" "SentinelKeys.lib" "wsock32.lib" "freeglut.lib" "Debug\SYNOPSYS200_lib_.lib" "VCOMPD.LIB" /NODEFAULTLIB:"LIBCMT.lib" /NODEFAULTLIB:"msvcrtd.lib" /NODEFAULTLIB:"msvcrt.lib" /MANIFEST /ManifestFile:".\Debug\SYNOPSYS200v14.exe.intermediate.manifest" /ALLOWISOLATION /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /DEBUG /PDB:"C:\SYNOPSYSV14\Debug\SYNOPSYS200v14.pdb" /SUBSYSTEM:WINDOWS /STACK:"3000000" /PGD:"C:\SYNOPSYSV14\Debug\SYNOPSYS200v14.pgd" /TLBID:1 /DYNAMICBASE /NXCOMPAT /MACHINE:X86 /ERRORREPORT:QUEUE
RELEASE: 1.6X SLOWER
F
/nologo /Oy- /Qipo /I"Release/" /reentrancy:none /extend_source:132 /Qopenmp /Qauto /align:rec4byte /align:commons /assume:byterecl /Qzero /fpconstant /iface:cvf /module:"Release/" /object:"Release/" /Fd"Release\vc100.pdb" /check:none /libs:dll /threads /winapp /c
C++
/Zi /nologo /W2 /WX- /O2 /Ot /Oy- /D "WIN32" /D "NDEBUG" /D "_WINDOWS" /D "_VC80_UPGRADE=0x0600" /GF /Gm- /EHsc /MT /GS /Gy- /fp:precise /Zc:wchar_t /Zc:forScope /GR /openmp /Fp".\Release\SYNOPSYS200.pch" /Fa".\Release" /Fo".\Release" /Fd".\Release" /FR".\Release" /Gd /analyze- /errorReport:queue
Linker
/OUT:".\Release\SYNOPSYS200v14.exe" /VERSION:"14.0" /INCREMENTAL /NOLOGO /LIBPATH:"C:\Program Files (x86)\Intel\Composer XE 2011 SP1\\compiler\lib\ia32" /LIBPATH:"C:\SYNOPSYSV14\OpenGL\freeglut 2.4.0 (compiled)" /LIBPATH:"C:\SYNOPSYSV14\Libraries(x64)\Static\MT" "ODBC32.LIB" "ODBCCP32.LIB" "mpr.lib" "SentinelKeys.lib" "wsock32.lib" "freeglut.lib" "VCOMP.lib" "Release\SYNOPSYS200_lib_.lib" /NODEFAULTLIB:"LIBCMTd.lib" /NODEFAULTLIB:"msvcrtd.lib" /NODEFAULTLIB:"msvcrt.lib" /MANIFEST /ManifestFile:".\Release\SYNOPSYS200v14.exe.intermediate.manifest" /ALLOWISOLATION /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /PDB:"C:\SYNOPSYSV14\Release\SYNOPSYS200v14.pdb" /SUBSYSTEM:WINDOWS /STACK:"3000000" /PGD:"C:\SYNOPSYSV14\Release\SYNOPSYS200v14.pgd" /TLBID:1 /DYNAMICBASE /NXCOMPAT /MACHINE:X86 /ERRORREPORT:QUEUE
Also, is there a reason you are using /iface:cvf in Fortran? Are you calling libraries compiled with CVF?
I'm not sure the issue is optimization anyway. The program runs faster than the CVF version, even in serial mode. The issue is why I cannot get OpenMP to work as well in release mode as in debug mode. Debug seems to work fine, and I get faster results running parallel than serial. But the release version runs more slowly in parallel mode than in serial, which suggests that it is actually running the parallel DO loop serially, with extra overhead slowing it down even more.
So that's the problem. Can you think of any way to fix it?
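One quick check that might help: print the team size from inside a region in the release build (OMP_GET_THREAD_NUM and OMP_GET_NUM_THREADS come from OMP_LIB):

!$OMP PARALLEL
      IF (OMP_GET_THREAD_NUM() .EQ. 0) THEN
         PRINT *, 'THREADS IN REGION:', OMP_GET_NUM_THREADS()
      ENDIF
!$OMP END PARALLEL

If this prints 1, the release configuration is not actually creating a parallel team.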
AFAIK, use of those options means that your multithreaded program is now rather broken!
There's an overhead associated with multi-threading. Typically the amount of truly independent work that can be done in parallel needs to be above a certain threshold before multithreading becomes worthwhile. Whether your notionally independent work is really independent, and whether there's enough of it to justify the overhead, can't be assessed by people not familiar with your code. If you could post examples, that would help us understand. Ideally those examples would be a cut-down, self-contained, compilable program.
Why do you have /warn:none on your debug build? Outside of certain special-use situations, the warnings given by the compiler with /warn:all are pretty relevant. Ignore them at your peril. I typically use /warn:all on both debug and release builds.
How are you timing your runs?
Previously you wrote:
Every subroutine involved in the operation of the threads is compiled with
USE IFMT
USE IFCORE
which (I hope) makes it run in the same thread as its caller.
Those USE statements don't change the execution of your program in a single-threaded/multithreaded sense at all. They simply make variable, type, and procedure declarations available to your program in the scope that has the USE statement. The code that you subsequently write using those declarations then determines which thread runs what.
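For instance, nothing in

      USE IFCORE   ! makes run-time library declarations such as TRACEBACKQQ visible

causes anything to execute on any particular thread; it only brings names into scope. The same goes for IFMT.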
