Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Equivalence statements and compile times

David_P_1
Beginner
1,185 Views

Hello,

I'm somewhat new to Fortran and I have several questions. I'm currently trying to integrate a generated Fortran codebase so that it operates on a segment of memory passed to it from a C++ environment. One of my goals is to change the Fortran codebase as little as possible. I've gotten several small-scale examples of this approach working fine with the C_PTR type and a derived type, and also by passing a byte array, but the problems arise with the compile times once I use the full size of the memory to be shared. The memory I'm trying to share is about 15 MB, with about 800,000 variables of different types mapped against it. I'm fairly sure my approach to mapping these variables onto the generated codebase is the cause of the long compile times, which run for days. I also imagine I'm not following best practices for this problem.
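For context, the no-copy variant of what I'm describing looks roughly like this. This is only a sketch; the names (`attach_memory`, `shm_bytes`) are illustrative, not from my actual code:

```fortran
! Receive a raw pointer from C++ and view it as a Fortran byte array
! without copying.  The C++ side calls attach_memory(ptr, nbytes) once.
module shared_mem
   use, intrinsic :: iso_c_binding
   implicit none
   integer(c_int8_t), pointer :: shm_bytes(:) => null()
contains
   subroutine attach_memory(p, nbytes) bind(c, name='attach_memory')
      type(c_ptr), value :: p
      integer(c_size_t), value :: nbytes
      ! Map the C++ buffer onto a Fortran pointer array; no copy occurs.
      call c_f_pointer(p, shm_bytes, [nbytes])
   end subroutine attach_memory
end module shared_mem
```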

For my first method, I created a large derived type containing all the variables; it is passed into the Fortran library through an exposed library function from an identically defined struct in the C++ environment. To make that derived type compatible with the generated Fortran codebase, I created a file that is #include'd so that the preprocessor replaces variable names that are #define'd to refer to members of the large shared derived type. This compilation runs for days and I have never seen it complete.
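A minimal sketch of this first method, with illustrative names (my real struct has ~800,000 members):

```fortran
! A bind(c) derived type mirroring the C++ struct, plus one shared
! instance that the generated code's variables are aliased onto.
module shared_state
   use, intrinsic :: iso_c_binding
   implicit none
   type, bind(c) :: state_t          ! layout must match the C++ struct
      real(c_double) :: alt
      real(c_double) :: vel
      integer(c_int) :: mode
      ! ... hundreds of thousands more members in the real code ...
   end type state_t
   type(state_t), pointer :: g => null()
end module shared_state

! A separate include file, #include'd into each generated source file,
! lets the preprocessor rewrite bare names into references through g:
!   #define alt  g%alt
!   #define vel  g%vel
!   #define mode g%mode
```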

For my second method, I defined a large byte array in a module, along with an equivalence statement for every variable in the library to be mapped against it. I pass the byte array from C++ through an exposed library method; the passed array is copied to the global array, and once all operations in that subroutine are finished, the global module array is copied back to the subroutine's argument so that changes are reflected in the passed C++ byte array. This compilation also runs for days. I tried to reduce the compile times by using the multiprocess compiler option and breaking the array into 5 separate arrays, but that doesn't seem to have much impact. I'm also concerned about the runtime efficiency of this method, since every time I interact with the Fortran library a large array copy takes place, instead of operating on a reference to the memory as in the previous method.
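A sketch of this second method, again with illustrative names. Note that equivalencing variables of different kinds onto a byte array like this is a widely supported extension rather than standard Fortran:

```fortran
! Every variable is overlaid on one module-level byte array.
module shared_block
   implicit none
   integer(1) :: buf(15000000)        ! ~15 MB backing store
   real(8)    :: alt                  ! two example overlaid variables
   integer(4) :: mode
   equivalence (buf(1), alt)
   equivalence (buf(9), mode)
   ! ... one EQUIVALENCE per mapped variable in the real code ...
end module shared_block

! The exposed entry point copies the caller's buffer in and out:
subroutine run_step(bytes) bind(c)
   use iso_c_binding, only: c_int8_t
   use shared_block
   implicit none
   integer(c_int8_t), intent(inout) :: bytes(15000000)
   buf = bytes                        ! copy in
   ! ... generated code operates on alt, mode, ... here ...
   bytes = buf                        ! copy out
end subroutine run_step
```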

To reduce compilation times, mostly in the second method, I've tried placing equivalence statements in the individual Fortran source files only for the variables each one uses, but this isn't allowed due to restrictions on using equivalence outside the module where the global array is defined. To get around that, I've tried declaring the passed array as COMMON, but it seems COMMON isn't allowed on subroutine arguments either. I could define a common array that is used in every source file, with equivalence statements only as needed, but I would still have to copy that array back and forth as described earlier, which worries me.
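The COMMON variant I'm considering would look something like this (names illustrative; this again leans on the kind-mixing equivalence extension):

```fortran
! The backing array lives in a named common block, so each source file
! declares only the variables it uses and overlays them locally.
subroutine step()
   implicit none
   integer(1) :: buf(15000000)
   real(8)    :: alt
   common /shm/ buf
   equivalence (buf(1), alt)   ! legal here: buf is in COMMON, not a dummy
   alt = alt + 1.0d0
end subroutine step
```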

There might be an obvious solution to all this, but I'm pretty new, and the documentation/examples relating to this use case are a little sparse. I can provide some example code showing what I'm trying to do at a small scale, if that would help with any suggestions.

Thanks,

David

 

37 Replies
andrew_4619
Honored Contributor II
An example

module stuff
 implicit none
 real(8) :: var1
 Logical(1) :: var2
 Logical(1) :: var3
 character(len=20) :: var4
 Logical(4) :: var5
 character(len=20) :: var6
 real(8) :: var7
 character(len=20) :: var8
 Logical(1) :: var9
 ! ... many thousands more declarations ...
 Logical(4) :: var99993
 character(len=20) :: var99994
 integer(4) :: var99995
 Logical(1) :: var99996
 integer(4) :: var99997
 real(8) :: var99998
 real(4) :: var99999
 real(8) :: var100000
end module stuff

So the logical conclusion is that with 100,000 variables the compile takes "forever", but with 10 × 10,000 lines it takes 10 × a few seconds.

andrew_4619
Honored Contributor II

I had a further coffee break and made 10 modules, each with 10,000 variables, and put them all in the same source file. It took a handful of minutes to compile. A second source file containing a module that USEs all 10 modules compiles almost instantly.

So the answer is that compile time is not linear in the number of variables, and there is a practical limit (with respect to time) on the number of variables (or maybe lines of declarations) in a module. The solution for the OP, if he continues along the same lines, is to split the problem into chunks of a manageable size.
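The splitting I describe above might look like this (module and variable names are illustrative):

```fortran
! Ten small modules compile quickly; one aggregator module re-exports
! them, so client code still just USEs "stuff" as before.
module stuff_part1
   implicit none
   real(8) :: var1
   ! ... variables 1 through 10000 ...
end module stuff_part1

! ... stuff_part2 through stuff_part10 defined the same way ...

module stuff
   use stuff_part1
   ! use stuff_part2 ... use stuff_part10
   implicit none
end module stuff
```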

It might be worthwhile for someone at Intel to look at this for a short while, just to check that the "infinite" compile time is not down to the compiler doing something a bit dumb that could be enhanced in a future version maybe.....

jimdempseyatthecove
Honored Contributor III

Good work Andrew.

While this may not be a problem with human written code, it certainly can be a problem with machine written code. This is not a new problem as I have seen this before in these forums. Though the earlier reports related to the code section as opposed to the data section of the machine written code.

Jim Dempsey

mecej4
Honored Contributor III

Almost thirty years ago, I encountered a similar problem on the then-new Unix-386 using the Green Hills Fortran compiler. A program had a large array and ran fine until I decided to add a DATA statement in which only the first few elements of the array were set. The change caused the compile+link time to blow up.

JVanB
Valued Contributor II

I tried the code in Quote #21, and with lots = 100000 it took ifort 16.0 about 3 minutes to compile lots.f90 to lots.obj and stuff.mod; with lots = 200000, about 15 minutes. Processor: Intel(R) Core(TM) i7-4770 CPU @ 3.40 GHz, 4 cores, 8 logical processors, 16 GB RAM.

Compilation only used one core and perhaps 500 MB RAM. Are there hash table collisions?

Edit: Oh my, that post seems to have really aged my avatar...

andrew_4619
Honored Contributor II
Interesting, yours is much faster. The machine I used was a similar spec (I will check that, and also the build options). I also compiled inside VS.
jimdempseyatthecove
Honored Contributor III

>>Oh my, that post seems to have really aged my avatar...

Are you sure it is not you that has aged? That's been my experience.

RO, are you compiling off of SSD?

Jim 

JVanB
Valued Contributor II

Quote #27: I compile with ifort /nologo /c bigeq.f90 from command line. Note that I had to edit that code in Quote #21 to add a right parenthesis to line 7.

Quote #28: Couldn't be me. That's Oscar Wilde's experience. My PC does have an SSD in it, but I don't think it's being used for this compilation. Think about how small the data structures are for lots = 200000.

I tried changing the variable names to base 36 to see if that would help with hash table collisions, but it didn't. What really helped was compiling lots.f90 in gfortran. Andy Vaught once told me that g95 uses red-black trees for its data structures, so access should be consistently O(N*LOG(REAL(N))) and in fact lots.f90 compiles in just a couple of seconds with lots = 200000. Probably the gfortran developers left well enough alone in this area. So my guess is that ifort is building some big data structure with hash tables or worse and access is growing as O(N**2).
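For what it's worth, the O(N**2) guess can be sanity-checked against the two timings reported above. The quadratic constant below is simply fitted to the N = 100000 data point (my assumption, not a compiler measurement); pure quadratic scaling predicts about 12 minutes for N = 200000, versus the roughly 15 minutes observed, so the growth is at least quadratic:

```fortran
! Fit t = c*N**2 to the 100,000-variable timing (3 minutes) and see
! what it predicts for 200,000 variables.
program quad_check
   implicit none
   real(8) :: c
   c = 3.0d0 / 1.0d5**2               ! minutes per variable**2
   print '(a,f6.1,a)', 'predicted t(200000) = ', c*2.0d5**2, ' min'
end program quad_check
```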

andrew_4619
Honored Contributor II

Repeat Offender wrote:
Quote #27: I compile with ifort /nologo /c bigeq.f90 from command line. Note that I had to edit that code in Quote #21 to add a right parenthesis to line 7.
Yes, I edited that after posting and must have deleted the ")". I have now corrected #19.

Yes, I can think of many reasons why ifort chokes on this. For the OP with 800,000 vars there is clearly going to be a problem....

FortranFan
Honored Contributor II

andrew_4619 wrote:

.. For the OP with 800000 vars there is clearly going to be a problem....

As you suggested upthread, a divide-and-conquer strategy to do the memory mapping only as needed and also with a just-in-time (JIT) strategy might be something OP wishes to consider.

800,000 variables is really, really "lots"!!! I wonder whether they are all "global" in scope or whether many of them are automatic scalar variables, in which case a consideration will be whether they all really need to be mapped to memory set aside by a caller. It'll be really interesting to see a snippet of the generated code and get some information on the nature of these 800,000 variables in terms of global/local/static, scalar/arrays, etc.!

jimdempseyatthecove
Honored Contributor III

The code appears to be computer generated where what would normally be in arrays is now in scalars (with sequential suffixes).

Jim Dempsey

JVanB
Valued Contributor II

I thought the sequential order might make a difference, so I modified the code in Quote #21 to write out the suffixes in forward, backward, bit-reversed, and gray code order:

program make_lots_of_vars
    implicit none
    integer :: lots
    integer, parameter :: ntypes = 6
    character(len=*), parameter :: gfmt_txt='(A)' 
    character(20) :: gfmt_var
    character(len=*), parameter :: gpd='   ' 
    character(len=*), parameter :: fmt_file = '(A,i0,A)'
    integer :: iunf, iunb, iunr, iung, nbits, ndig, istat,l1, itype
    real    :: rty
    character(20) :: filename, varname

    write(*,'(a)',advance='no') 'Enter the number of variables:> '
    read(*,*) lots
    nbits = bit_size(lots)-leadz(lots)
    ndig = ishft(nbits+3,-2)
    write(gfmt_var,'(*(g0))') '(A,Z',ndig,'.',ndig,')'
    
    CALL RANDOM_SEED()    ! seeds using time/date
    write(filename,fmt_file) 'lots',lots,'f.f90'
    open(newunit=iunf, file=filename, status='replace', iostat = istat)
    if (istat /= 0 ) stop 'forward open fail'
    write(filename,fmt_file) 'lots',lots,'b.f90'
    open(newunit=iunb, file=filename, status='replace', iostat = istat)
    if (istat /= 0 ) stop 'backward open fail'
    write(filename,fmt_file) 'lots',lots,'r.f90'
    open(newunit=iunr, file=filename, status='replace', iostat = istat)
    if (istat /= 0 ) stop 'bit-reversed open fail'
    write(filename,fmt_file) 'lots',lots,'g.f90'
    open(newunit=iung, file=filename, status='replace', iostat = istat)
    if (istat /= 0 ) stop 'gray code open fail'
    
    write(iunf,gfmt_txt) 'module stuff'
    write(iunf,gfmt_txt) gpd//'implicit none'
    write(iunb,gfmt_txt) 'module stuff'
    write(iunb,gfmt_txt) gpd//'implicit none'
    write(iunr,gfmt_txt) 'module stuff'
    write(iunr,gfmt_txt) gpd//'implicit none'
    write(iung,gfmt_txt) 'module stuff'
    write(iung,gfmt_txt) gpd//'implicit none'
    
    do l1 = 1 , lots
       CALL RANDOM_NUMBER(rty)
       rty = rty * real(ntypes) 
       itype = int( rty ) + 1
       selectcase (itype)
           case(1)
              write(iunf,gfmt_var) gpd//'Logical(1) :: var',l1  
              write(iunb,gfmt_var) gpd//'Logical(1) :: var',lots+1-l1
              write(iunr,gfmt_var) gpd//'Logical(1) :: var',br(l1,nbits)
              write(iung,gfmt_var) gpd//'Logical(1) :: var',ieor(l1,ishft(l1,-1))
           case(2)
              write(iunf,gfmt_var) gpd//'Logical(4) :: var',l1
              write(iunb,gfmt_var) gpd//'Logical(4) :: var',lots+1-l1
              write(iunr,gfmt_var) gpd//'Logical(4) :: var',br(l1,nbits)
              write(iung,gfmt_var) gpd//'Logical(4) :: var',ieor(l1,ishft(l1,-1))
           case(3)
              write(iunf,gfmt_var) gpd//'real(4) :: var',l1 
              write(iunb,gfmt_var) gpd//'real(4) :: var',lots+1-l1
              write(iunr,gfmt_var) gpd//'real(4) :: var',br(l1,nbits)
              write(iung,gfmt_var) gpd//'real(4) :: var',ieor(l1,ishft(l1,-1))
           case(4)
              write(iunf,gfmt_var) gpd//'real(8) :: var',l1 
              write(iunb,gfmt_var) gpd//'real(8) :: var',lots+1-l1
              write(iunr,gfmt_var) gpd//'real(8) :: var',br(l1,nbits)
              write(iung,gfmt_var) gpd//'real(8) :: var',ieor(l1,ishft(l1,-1))
           case(5)
              write(iunf,gfmt_var) gpd//'integer(4) :: var',l1 
              write(iunb,gfmt_var) gpd//'integer(4) :: var',lots+1-l1
              write(iunr,gfmt_var) gpd//'integer(4) :: var',br(l1,nbits)
              write(iung,gfmt_var) gpd//'integer(4) :: var',ieor(l1,ishft(l1,-1))
           case(6)
              write(iunf,gfmt_var) gpd//'character(len=20) :: var',l1  
              write(iunb,gfmt_var) gpd//'character(len=20) :: var',lots+1-l1
              write(iunr,gfmt_var) gpd//'character(len=20) :: var',br(l1,nbits)
              write(iung,gfmt_var) gpd//'character(len=20) :: var',ieor(l1,ishft(l1,-1))
           case default
              write(iunf,gfmt_var) gpd//'real(4) :: var',l1 
              write(iunb,gfmt_var) gpd//'real(4) :: var',lots+1-l1
              write(iunr,gfmt_var) gpd//'real(4) :: var',br(l1,nbits)
              write(iung,gfmt_var) gpd//'real(4) :: var',ieor(l1,ishft(l1,-1))
           end select
    enddo
    write(iunf,gfmt_txt) 'end module stuff'   
    !write(iunf,gfmt_txt) 'program use_stuff' 
    !write(iunf,gfmt_txt) gpd//'use stuff'    
    !write(iunf,gfmt_txt) gpd//'implicit none'
    !write(iunf,gfmt_txt) 'end program use_stuff'
    close(iunf)   
    write(iunb,gfmt_txt) 'end module stuff'   
    close(iunb)   
    write(iunr,gfmt_txt) 'end module stuff'   
    close(iunr)   
    write(iung,gfmt_txt) 'end module stuff'   
    close(iung)   

    CONTAINS
        function br(i,n)
            integer br
            integer n, i
            integer j
            br = iany([( ishft( ibits(i,j,1) ,n-1-j ) ,j=0,n-1)])
        end function br
end program make_lots_of_vars

But the compile times for lots100000f.f90, lots100000b.f90, lots100000r.f90, and lots100000g.f90 were all about the same: just slightly under 3 minutes.

andrew_4619
Honored Contributor II

Repeat Offender wrote:
But the compile times for lots100000f.f90, lots100000b.f90, lots100000r.f90, and lots100000g.f90 were all about the same: just slightly under 3 minutes.

 

These things do create an interesting diversion at times! I guess the compiler is not doing anything clever with indexed tables, just looping around a list of variables each time a new one is added, a list that keeps getting longer; and maybe it loops through the complete source many times as well.

jimdempseyatthecove
Honored Contributor III

>>so the logical conclusion is with 100000 the compile takes "forever" but with 10 x 10000 lines it takes 10 x a few seconds

This leads me to think that the compiler uses a 16-bit hash. Collisions will become aggravated at ~85% of 65,536 (~55,700) entries.

Jim Dempsey

andrew_4619
Honored Contributor II

I see where you are coming from now. It would not be surprising if the index method in the compiler was not designed to work well with stupidly large numbers of variables, though.

jimdempseyatthecove
Honored Contributor III

If the issue is indeed too small a hash key, the fix would be trivial, though it might be wise on 32-bit systems to use the smaller key.

I do not see any reason why the code could not use a VTable and start out with a 16-bit key, then if necessary convert to 24-bit, then when necessary convert to 32-bit, ...

Note, performing multi-file IPO on 32-bit system can be memory intensive, and as such using memory conservative internal structures is desired.

Jim Dempsey

FortranFan
Honored Contributor II

andrew_4619 wrote:

.. the compiler was not designed to work well with stupidly large numbers of variables though. 

Indeed it's a case of "stupidly large numbers of variables" with no rhyme or reason as to why it really needs to be that way.

Hopefully, a member of the Intel Fortran team will read this and will consider updating their documentation on compiler limits to include some blurb covering situations like this:

https://software.intel.com/en-us/fortran-compiler-18.0-developer-guide-and-reference-compiler-limits
