Intel® Fortran Compiler

OpenMP in IVF & scratch files

maria
Beginner

I am experimenting with OpenMP for my company's software, which is written in Fortran. We use the IVF compiler.
After implementing OpenMP in our software, I made sure that the results are correct with OpenMP. But the problem is that OpenMP does not save any CPU time at all. When I checked where the problem was, I found that the program takes much more time reading and writing scratch files when OpenMP is enabled.


Our software can be simplified like this:
!$OMP parallel do
DO I = 1, N
CALL subroutineA
.....
CALL subroutineB
.....
CALL subroutineC
ENDDO

Inside subroutineA, a scratch file is opened and written. Then in subroutineC, the data in the scratch file is read. The part that reads the scratch file can be simplified as follows. Array X is allocated and deallocated inside subroutineC.
******************
J1 = 1
DO 10 I = 1, MAXI
   J2 = 2*I + J1 + ...
   READ (scratch file number) (X(J), J=J1,J2)
   J1 = J2 + 1
10 CONTINUE
******************

When I run the program without OpenMP, this section takes about 12 seconds of CPU time when MAXI = 6000. The scratch files are written in the same directory where we have the input files. When I run the same program with OpenMP, this section takes 45 seconds. It takes 30+ seconds longer just to read the scratch file! I have an Intel Core 2 Duo processor. That is why there is no CPU time saving when using OpenMP.

My question is: what went wrong? What is the proper way to improve OpenMP performance when reading from and writing to scratch files? Thank you.

Martyn_C_Intel
Employee
Do you really need to use scratch files? Couldn't you just use more memory and keep all the data in shared memory? If necessary, for synchronization, the call to subroutine A (which creates the data) could be in one OpenMP loop, and the calls to subroutines B and C (which consume the data) could be in a separate OpenMP loop; see the sketch below.
I'm assuming that each thread has its own scratch file. But I think the runtime library still does heavy synchronization to ensure the threads don't interfere with each other. Possibly this could be made more efficient, but it's far better to avoid I/O in parallel regions.
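A minimal sketch of that two-loop restructuring, assuming the per-iteration data fits in one shared array (the names work and LREC, and the argument lists, are illustrative, not from the original code):

REAL, ALLOCATABLE :: work(:,:)
ALLOCATE (work(LREC, N))            ! LREC = largest record one iteration produces

!$OMP PARALLEL DO
DO I = 1, N
   CALL subroutineA(work(:,I))      ! producer: fills its own column, no file I/O
END DO
!$OMP END PARALLEL DO

!$OMP PARALLEL DO
DO I = 1, N
   CALL subroutineB(work(:,I))
   CALL subroutineC(work(:,I))      ! consumer: reads memory instead of a scratch file
END DO
!$OMP END PARALLEL DO

The implicit barrier at the end of the first PARALLEL DO guarantees all the data exists before any thread starts consuming it.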
maria
Beginner

We use I/O because we have huge data in the computation, and the code was developed when memory was very expensive. These I/O paths are still useful today in some cases. I made sure that each thread has its own scratch file for writing and reading. The results are correct with OpenMP.

But I still don't understand why I/O can cause a significant slowdown under OpenMP. If there is no way to improve it, maybe we will give up on OpenMP. Please advise. Thanks, Maria

jimdempseyatthecove
Honored Contributor III
Modify and run:

DO iNumThreads = 1,2
write(*,*) "Number of threads = ", iNumThreads
!$OMP parallel do num_threads(iNumThreads)
DO I = 1, N
CALL subroutineA
.....
CALL subroutineB
.....
CALL subroutineC
ENDDO
ENDDO

Then see what the time is for 1 thread and for 2 threads (a timing sketch follows below).
Do not obtain the 1-thread run time by disabling OpenMP.
Obtain the time while running in an OpenMP parallel region with 1 thread.
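For example (a sketch; omp_get_wtime() returns elapsed wall-clock seconds, which is what matters here, not CPU time):

USE OMP_LIB
DOUBLE PRECISION :: t0, t1

DO iNumThreads = 1, 2
   t0 = omp_get_wtime()
!$OMP PARALLEL DO num_threads(iNumThreads)
   DO I = 1, N
      CALL subroutineA
      CALL subroutineB
      CALL subroutineC
   END DO
!$OMP END PARALLEL DO
   t1 = omp_get_wtime()
   WRITE (*,*) 'threads =', iNumThreads, '  elapsed seconds =', t1 - t0
END DO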

Report back your findings.

Other than the file read/write, do your subroutines use functions that require serialization (a random number generator comes to mind)?

Jim
Martyn_C_Intel
Employee
If you do Jim's test, I think it would still be helpful to run another test with OpenMP disabled as well. (Removing /Qopenmp causes the non-threadsafe version of the Fortran library to be linked.)
In a multithreaded app, the RTL has to protect against concurrent calls from different threads. You know that each thread accesses a different unit, but the library probably doesn't.

Perhaps there would be less synchronization overhead if you used buffered I/O (the default is unbuffered).
This is /assume:buffered_io.

If the extra time really is all coming from reading and writing the scratch files, and this can't be eliminated, you may want to consider moving the I/O outside the parallel region and doing just the computations in parallel. Or, since you already have separate scratch files for each thread, your app might map better onto MPI than onto OpenMP: MPI processes are independent, and each one would have its own copy of the Fortran library that did not need to worry about the others.
maria
Beginner

Jim & Martyn,

Thank you so much for your answers.

First of all, how do I use buffered I/O? What does '/assume:buffered_io' mean? What should I do?
I would like to give it a try.

Following your suggestions, I did three tests:
(1) Test 1, with OpenMP disabled.
(2) Test 2, with OpenMP, iNumThreads = 1.
(3) Test 3, with OpenMP, iNumThreads = 2.

The numerical results of (1), (2), and (3) are the same, which means the code was changed correctly.
The run times of (1) and (2) are almost the same. However, (3) takes even longer to run, instead of giving the time savings I expected.

There are two major subroutines that use most of the CPU time of this program. Both subroutines do a lot of reading from and writing to scratch files, as well as a lot of algebra. In one subroutine, I was able to separate the clock time of one scratch-file read from the rest of the subroutine.
In (1) or (2), the reading part takes about 11-13 seconds each time, while the rest of the subroutine takes about 3-4 seconds. However, in (3), the reading part takes 30-46 seconds each time while the rest of the program takes about the same time as in (1) and (2).

When I started to compile with OpenMP, I ran into a linking problem. The advice I got was to add LIBCMT.lib to 'Ignore Specific Library'. Does this have anything to do with performance?

Last, in our program most of the arrays are allocatable arrays that are allocated and deallocated inside each subroutine. There are only three arrays that have to be allocated outside the parallel region: A(M,M), B(M,M1), C(M,M1,M2), where M can range from 100 to 10,000 in general and M1, M2 < 10. In the test case reported here, M = 6000+. To make these arrays thread independent, they were changed to A(M,M,iNumThreads), B(M,M1,N), C(M,M1,M2,N), where N is the number of cycles in the DO loop of the parallel region. Do you think the change to these three arrays could somehow slow the program down significantly?

I am trying to think of everything that could cause the OpenMP performance issue, since eliminating the scratch files would be a huge task for us if we go that way, and it is the last thing we want to do. Also, we want to make sure OpenMP will work after we eliminate the scratch files.

Many thanks.

TimP
Honored Contributor III
Have you looked at the section

Efficient Use of Record Buffers and Disk I/O

in the ifort documentation? /assume:buffered_io and the other topics related to your question are explained there. While that section doesn't mention OpenMP explicitly, the point that you should familiarize yourself with buffered I/O is well taken.
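For reference, a sketch of the two ways to request it (the unit number and BLOCKSIZE value are illustrative; BUFFERED and BLOCKSIZE are Intel Fortran OPEN extensions):

! globally, at compile time:
!     ifort /Qopenmp /assume:buffered_io mycode.f90
! or per unit, in the OPEN statement:
OPEN (UNIT=11, STATUS='SCRATCH', FORM='UNFORMATTED', &
      BUFFERED='YES', BLOCKSIZE=1048576)   ! ~1 MB I/O buffer for this unit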
Steven_L_Intel1
Employee
"Ignore specific library" is almost always the wrong solution for any problem. Why did you add that?

As Jim and Martyn have said, doing I/O in multiple threads requires the library to do synchronization which hurts performance. If you are seeing unacceptable performance with I/O in threads, look for ways to confine the I/O to a single thread.

I doubt that /assume:buffered_io will help here - it will reduce the number of OS write operations but won't eliminate the synchronization overhead.
jimdempseyatthecove
Honored Contributor III
I have some suggestions for you:

If you are not doing this already, use unformatted WRITE/READ of the entire array as binary data, and increase the buffer size on the I/O unit (see the sketch below). This will reduce the number of serialization operations and improve performance.
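For example, a sketch transferring each thread's whole array in a single unformatted record rather than element by element (iunit and X are illustrative names):

! writer (e.g. in subroutineA):
WRITE (iunit) X                  ! entire array, one record, one library call
! reader (e.g. in subroutineC), after REWIND (iunit):
READ (iunit) X                   ! one transfer instead of a loop of small reads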

RE: To make these arrays thread independent, they were changed to A(M,M,iNumThreads), B(M,M1,N), C(M,M1,M2,N)

consider using subroutines with dummy arguments A(m,m), B(M,M1), C(M,M1,M2) and passing the array slice in; verify that the compiler does not generate code that uses a temporary. A sketch follows below.
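A minimal sketch of the slice-passing idea (processA is a hypothetical name; because the slice Acomposite(:,:,NTHRD) is contiguous, the compiler should pass its address rather than make a temporary copy):

SUBROUTINE processA(A, m)
   INTEGER, INTENT(IN) :: m
   REAL, INTENT(INOUT) :: A(m,m)      ! plain 2D dummy: no third subscript inside
   ! ... work on A(i,j) ...
END SUBROUTINE processA

! at the call site, inside the parallel loop:
CALL processA(Acomposite(:,:,NTHRD), m)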

Or convert the module/COMMON data to THREADPRIVATE pointers:

REAL, POINTER :: A(:,:), B(:,:), C(:,:,:)

Then each thread, inside the major loop, can test once for a NULL pointer and, if it is NULL, initialize the pointer to the proper slice of the larger-dimensioned array, or, if using an array of arrays, initialize the pointer to the proper array.

There is a little overhead in dereferencing the address of the array descriptor in the thread-private area, but this is offset by eliminating the overhead of the third index operation. Also, when passing a thread-private array (descriptor) to a subroutine, the subroutine does not incur the overhead of dereferencing the thread-private pointer (that is done once, at the call).

By using pointers in place of thread numbers, you can now have an array of arrays. On small-memory systems you may have to write arrays out and read them back due to memory constraints. On large-memory systems, the array of arrays can be enlarged so that you have no writes to and reads from temporary files. The choice between temp files and larger in-memory arrays is isolated from the processing loops, which only receive pointers to complete arrays. The code is agnostic as to whether the data came through a temp file or resided in memory.
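A sketch of the array-of-arrays variant (names illustrative); each thread's matrix is independently allocated, so per-thread sizes can differ and slices of one giant array are no longer needed:

TYPE MatrixHolder
   REAL, POINTER :: A(:,:)
END TYPE MatrixHolder
TYPE(MatrixHolder), ALLOCATABLE :: perThread(:)

ALLOCATE (perThread(nThreads))
DO n = 1, nThreads
   ALLOCATE (perThread(n)%A(m,m))   ! could be written out/read back if memory is tight
END DO
! each thread's code then touches only perThread(myThread)%A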

In later posts (after you implement some of these suggestions) we can discuss how to separate the temp-file creation from the temp-file consumption and from the processing of the data within the temp files.

Jim Dempsey


maria
Beginner

Our software is compiled with several dynamic libraries attached. One of them is written in C++ and provided by another company, while the others are written in Fortran. Our compiler is IVF 10.1. The build is OK without OpenMP. However, with OpenMP, when I compile our software without ignoring the library, I get the errors below. I also attached the BuildLog.htm file here. When I ignore LIBCMT.lib, everything is fine. Any idea what's wrong?

Error 1    error LNK2005: __invoke_watson already defined in MSVCRT.lib(MSVCR80.dll)    LIBCMT.lib
Error 2    error LNK2005: __amsg_exit already defined in MSVCRT.lib(MSVCR80.dll)    LIBCMT.lib
Error 3    error LNK2005: __initterm_e already defined in MSVCRT.lib(MSVCR80.dll)    LIBCMT.lib
Error 4    error LNK2005: _exit already defined in MSVCRT.lib(MSVCR80.dll)    LIBCMT.lib
Error 5    error LNK2005: __exit already defined in MSVCRT.lib(MSVCR80.dll)    LIBCMT.lib
Error 6    error LNK2005: __cexit already defined in MSVCRT.lib(MSVCR80.dll)    LIBCMT.lib
Error 7    error LNK2005: __configthreadlocale already defined in MSVCRT.lib(MSVCR80.dll)    LIBCMT.lib
Error 8    error LNK2005: __encode_pointer already defined in MSVCRT.lib(MSVCR80.dll)    LIBCMT.lib
Error 9    error LNK2005: __decode_pointer already defined in MSVCRT.lib(MSVCR80.dll)    LIBCMT.lib
Error 10   error LNK2005: ___xi_a already defined in MSVCRT.lib(cinitexe.obj)    LIBCMT.lib
Error 11   error LNK2005: ___xi_z already defined in MSVCRT.lib(cinitexe.obj)    LIBCMT.lib
Error 12   error LNK2005: ___xc_a already defined in MSVCRT.lib(cinitexe.obj)    LIBCMT.lib
Error 13   error LNK2005: ___xc_z already defined in MSVCRT.lib(cinitexe.obj)    LIBCMT.lib
Error 14   error LNK2005: __XcptFilter already defined in MSVCRT.lib(MSVCR80.dll)    LIBCMT.lib
Error 15   error LNK2005: __unlock already defined in MSVCRT.lib(MSVCR80.dll)    LIBCMT.lib
Error 16   error LNK2005: __lock already defined in MSVCRT.lib(MSVCR80.dll)    LIBCMT.lib
Error 17   error LNK2005: _mainCRTStartup already defined in MSVCRT.lib(crtexe.obj)    LIBCMT.lib
Error 18   error LNK2005: ___set_app_type already defined in MSVCRT.lib(MSVCR80.dll)    LIBCMT.lib
Error 21   fatal error LNK1169: one or more multiply defined symbols found    Release\wamit.exe

jimdempseyatthecove
Honored Contributor III
Use 'Ignore Specific Library' for LIBCMT.LIB.

The problem arises from the C++ library being linked (lib'd) together with components of LIBCMT.LIB, while other libs or objs were built with a dependency on LIBCMT.LIB. The linker does not know that LIBCMT.LIB (or a version thereof) was brought in with another library.

I see this a lot. Sometimes it is a different library.
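(For reference, a sketch of where this setting lives; the IDE option maps to the MS linker's /NODEFAULTLIB switch, with the library name taken from your error messages:)

IDE:           Linker > Input > Ignore Specific Library:  LIBCMT.lib
command line:  link ... /NODEFAULTLIB:LIBCMT.lib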

Jim
Steven_L_Intel1
Employee
As Jim says, this is what we call "Mixed C Library Syndrome". The ideal solution is to build all the sources with the same specification for the runtime libraries: static vs. DLL, debug vs. non-debug. Fortran code doesn't care if you switch library types at link time, but C/C++ code does - if you link against the "wrong" library type, you'll get link errors for missing symbols.

My preference would be to change the Fortran build to specify what the other library was built with. But if ignoring LIBCMT works for you, then that's ok as a second choice.
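(A sketch of what "same specification" means in practice; the exact options depend on which runtime the C++ vendor used - here assuming the DLL, multithreaded runtime:)

cl /MD mycode.cpp                        ! C++ side: DLL runtime
ifort /libs:dll /threads myfortran.f90   ! Fortran side: matching DLL, threaded libraries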
jimdempseyatthecove
Honored Contributor III
Steve,

It is not a case of errors for missing symbols; rather, it is a case of errors for duplicate symbols. Many of the system libraries are built with

"if you link me, you must also link default library xxxx"

When the program is linked with a user .lib file or with a 3rd-party library that contains library xxxx, then you get the duplicate symbols.

The programmer cannot simply request "Ignore All Default Libraries", since then they would need to specify tens of system library files individually.

The problem, as I see it, is that

"if you link me, you must also link default library xxxx"

should be changed to

"if you link me, and if you get undefined symbols, then try to find them in default library xxxx"
(also parsing this in command-line order, then in the given path order, repeatedly as new undefined symbols are discovered)

But this is an MS linker/librarian problem, not an IVF problem.

Jim Dempsey
Steven_L_Intel1
Employee
Jim,

We're talking past each other here.

In the case of multiple C libraries, yes, you'll get the duplicate symbol definition issues. Part of that is because MS C/C++ uses different external names for its global variables, such as errno, depending on whether you compile with the /MT (static) or /MD (DLL) library option. This in turn causes two different sets of modules to be found in the C libraries resulting in the multiple definition errors.

My comment about missing symbols is that if you use "ignore library", you may get undefined symbol errors because the C code wants one set of these globals but the library you linked to has the other.

It isn't a case of library search rules - it's because of the way Microsoft chose to structure its C library and use different names for global symbols.

Linking to a library, by itself, is harmless. If there are no unresolved symbols, the library will be ignored. Where you get into trouble is pulling in a module from library A which defines symbols X and Y, and then a reference to Z is satisfied from a module in library B which defines Y and Z. The linker will then complain about the duplicate symbol Y.
jimdempseyatthecove
Honored Contributor III
>>Linking to a library, by itself, is harmless. If there are no unresolved symbols, the library will be ignored. Where you get into trouble is pulling in a module from library A which defines symbols X and Y, and then a reference to Z is satisfied from a module in library B which defines Y and Z. The linker will then complain about the duplicate symbol Y.

In this instance the case needs further qualification.

3rdPartyMumblePhrase.lib contains V1.2.3 of the LIBCMT functions

Some other component of application says "Link with LIBCMT.LIB"
e.g.
#pragma comment(lib,"LIBCMT")
or
#pragma comment(linker, "/INCLUDE:someSymbolFromLIBCMT.LIB")

The link order may affect the behavior (error reporting)

IOW

This is not a case of conflicting symbol names in general; rather, it is a case of a specific conflict between the expected, required, or supplied version of a "standard" library function. An example is when a 3rd-party library contains "printf" in an _ASSERT and that library contains a copy of the standard library printf functions and data. A potential fix is to rearrange the link order or the LIB path order (or use "Ignore Library: LIBCMT.LIB").

The root cause (IMHO) is sloppy coding or a compromise made to attain better link times.

Jim
Steven_L_Intel1
Employee
Nice ideas. Unfortunately, the MS linker does not offer that level of control. You can name symbols, and you can name libraries, but you can't, as far as I know, choose which library a given symbol comes from.
maria
Beginner

I will report back after I have made the changes you suggested.

maria
Beginner
Jim,

We already use unformatted reads and writes in our code.
I also changed the I/O buffering with /assume:buffered_io. This did help our code.

As for changing array A(m,m) to a pointer A(m,m), I got confused. In our code, array A(m,m) is initialized and changed in subroutineA and changed again (many times) in subroutineB. I thought pointers were for referring to target data, not for data that is to be changed. Is this correct? How should I do it? Thanks.

jimdempseyatthecove
Honored Contributor III
>>As for changing array A(m,m) to a pointer A(m,m), I got confused. In our code, array A(m,m) is initialized and changed in subroutineA and changed again (many times) in subroutineB. I thought pointers were for referring to target data, not for data that is to be changed. Is this correct? How should I do it? Thanks.

Earlier you wrote:

To make these arrays thread independent, they were changed to A(M,M,iNumThreads), B(M,M1,N), C(M,M1,M2,N), where N is the number of cycles in the DO loop of the parallel region.

This says that you have a 3-dimensional array whose rightmost dimension is the thread number.

3-dimensional arrays are slower to access than 2-dimensional arrays. My suggestion was to use a 2-dimensional pointer that points to a slice of your 3-dimensional array:

! in a module
REAL, ALLOCATABLE, TARGET :: Acomposite(:,:,:) ! m, n, nThreads

type YourThreadContext
sequence ! so the type may appear in COMMON
REAL, POINTER :: A(:,:) ! m, n
! other thread context data here
end type YourThreadContext

type(YourThreadContext) :: ThreadContext
COMMON /CONTEXT/ ThreadContext
!$OMP THREADPRIVATE(/CONTEXT/)
...
! in INIT
allocate(Acomposite(m,n,nThreads))
!$OMP PARALLEL
ThreadContext%A => Acomposite(:,:,omp_get_thread_num()+1) ! thread-private 2D pointer into the 3D array
!$OMP END PARALLEL

....
! anywhere in the program

ThreadContext%A(I,J) = ... ! references this thread's slice of Acomposite as a 2D array

There are alternate ways of declaring thread-private storage; the above is one such way.

Caution: omp_get_thread_num() returns the 0-based team member number within the current thread team, not a cardinal thread number for the application. Only when NOT using nested OpenMP parallel regions will the team member number equal the application-wide thread number. With nested levels you will get team member number collisions. To fix this, carry a cardinal thread number in the thread-private storage, as sketched below.
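A sketch of one way to do that, assuming a shared counter in a module (names illustrative); each thread claims a unique application-wide number the first time it arrives:

! in a module:
INTEGER :: globalCounter = 0             ! shared across all threads
INTEGER, SAVE :: myNumber = -1           ! each thread's cardinal number
!$OMP THREADPRIVATE(myNumber)
...
! executed by every thread before it uses its number:
IF (myNumber < 0) THEN
!$OMP CRITICAL (assignNumber)
   globalCounter = globalCounter + 1
   myNumber = globalCounter
!$OMP END CRITICAL (assignNumber)
END IF
! myNumber is now unique across teams, unlike omp_get_thread_num()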

Jim


maria
Beginner
Jim,

Thanks for your suggestions. Instead of using them exactly, I am doing the following:

---------------------------------------------------------
allocate (A(m,m,nThreadNum))

!$OMP parallel do private(NTHRD)
DO I = 1, NP
   NTHRD = omp_get_thread_num() + 1
   CALL subroutineA (A(1,1,NTHRD), ...)
   CALL subroutineB (A(1,1,NTHRD), ...)
ENDDO

while inside subroutineA and subroutineB:
A(I,J) = ...
---------------------------------------------------------
I have done this and the results are correct. Since inside the subroutines the arrays are all two-dimensional, does this count as using two-dimensional arrays instead of three-dimensional ones?
One more question:
if I use your suggestion, there will be a target array Acomposite(m,m,nThreadNum) and a pointer array A(m,m). The number m tends to be large; does that mean additional memory is required for the pointer?

jimdempseyatthecove
Honored Contributor III
>>Since inside the subroutines the arrays are all two-dimensional, does this count as using two-dimensional arrays instead of three-dimensional ones?

Yes.

Be cautious when using omp_get_thread_num() while running nested parallel regions:

call omp_set_nested(.true.) ! enable nested parallel regions
!$OMP parallel sections num_threads(2)
! first section (the first !$OMP SECTION is implicit)
!$OMP parallel do private(NTHRD) num_threads(4)
DO I = 1, NP
   NTHRD = omp_get_thread_num() + 1
   CALL subroutineA (A(1,1,NTHRD), ...)
   CALL subroutineB (A(1,1,NTHRD), ...)
ENDDO
!$OMP END parallel do
!$OMP SECTION
! second section
!$OMP parallel do private(NTHRD) num_threads(4)
DO I = 1, NP
   NTHRD = omp_get_thread_num() + 1
   CALL subroutineC (A(1,1,NTHRD), ...)
   CALL subroutineD (A(1,1,NTHRD), ...)
ENDDO
!$OMP END parallel do
!$OMP END parallel sections

On an 8-core machine (8 hardware threads), the above parallel sections construct establishes 2 teams: one team of 4 threads processing the loop in the first section, and a second team of 4 threads processing the loop in the second section.

**** Here is the kicker:
**** each thread team has team member numbers 0:3, so in the example above the two teams would collide on the same slices of A.

This caution should be kept in mind when writing code that uses omp_get_thread_num().
I think it should have been called omp_get_team_member_number().

>>if I use your suggestion, there will be a target array Acomposite(m,m,nThreadNum) and a pointer array A(m,m). The number m tends to be large; does that mean additional memory is required for the pointer?

There is no additional memory beyond the ~40-byte array descriptor,

*** provided the compiler does not make a temporary copy of the data. As you have sketched your code, it should not.
==================================================
The dummy-argument method essentially creates the array descriptor (pointer) for you, again assuming the compiler does not create a temporary copy of the data.

Note: if you have many subroutine levels, each passing the array descriptor, passing the pointer might yield faster code (depending on how the interfaces are declared).

Jim Dempsey