Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

I/O slower in Intel Fortran 2016

SHUTAO_L_
Beginner
2,895 Views

Hi,

I recently updated Intel Fortran from 2013 SP1 to 2016 and found that the I/O speed of 2016 is much slower than that of 2013. Reading a large text file (3.67 GB) takes about 2 minutes with the program compiled by Intel Fortran 2013 and about 4 minutes with the one compiled by Intel Fortran 2016. I used the same code and the same compiler options. The compiler options are:

/nologo /MP /O2 /Ob0 /assume:buffered_io /heap-arrays0 /I"./header/com" /assume:nocc_omp /f77rtl /fpscomp:nolibs /warn:none /Qsave /names:uppercase /iface:cvf /module:"x64\Release64\\" /object:"x64\Release64\\" /Fd"x64\Release64\vc120.pdb" /check:none /libs:dll /threads /c

The option "/assume:buffered_io" has been turned on. Do you have any suggestions for fixing this problem?

By the way, Intel Fortran 2016 has a memory leak when doing file I/O. It is fixed in 2016 Update 1. It cost me a lot of time to find. I hope you can provide us with a stable version of the Intel Fortran compiler.

0 Kudos
19 Replies
Steven_L_Intel1
Employee

Would you please provide us with the source program that shows the problem? You haven't given us enough details to be able to help you.

Yes, the resource leak is fixed in Update 1.

SHUTAO_L_
Beginner

I used the Performance and Diagnostics tools in VS 2013 to check the performance and found that the slowdown comes from the statement READ( CARD, '(8E10.0)', ..... ). See the attached picture. It takes almost three times as long as the READ from the file. Would you please check it?

Steven_L_Intel1
Employee

There's nothing I can do with a picture of a code fragment. Are you really reading a multi-gigabyte file with formatted I/O?

mecej4
Honored Contributor III

The code does unnecessary formatted reads from an internal file. There are two nested DO loops in the code shown in the image. An 80-character line is read from unit NDAT, and then an internal read is done with format 8E10.1 into stress_ele(1:8). There is a similar read in the outer DO loop. All except the very last of these formatted READs can be skipped, and I expect the I/O burden to become considerably smaller if this is done. In other words, why waste time decoding formatted data when you are going to overwrite the same memory locations with newer values without ever using the previously read values?
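This suggestion might be sketched as follows. It is a minimal sketch only: the names NDAT, CARD, and STRESS_ELE come from the thread, but the loop bounds NEL and NLINE are assumptions standing in for the code shown in the image. Every card is still read from the file as raw text, but the expensive internal formatted decode is done only once, for the last card, since earlier values are never used:

```fortran
! Sketch only: NEL and NLINE are assumed loop bounds; NDAT, CARD and
! STRESS_ELE are the names mentioned in this thread.
      DO IEL = 1, NEL
         DO ILINE = 1, NLINE
            READ (NDAT, '(A80)', ERR=960, END=9000) CARD   ! cheap raw read
         END DO
      END DO
! Decode only the final card; the earlier cards were overwritten unused.
      READ (CARD, '(8E10.1)', ERR=960) (STRESS_ELE(J), J=1,8)
```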

jimdempseyatthecove
Honored Contributor III

The READ(NDAT,...) CARD has relatively low computational cost: it is a copy while scanning for the end of line. I do not think that the I/O wait time (when this occurs) is accounted for in the % runtime. The READ(CARD,..., on the other hand, has relatively high overhead. In addition to scanning the same amount of data inside CARD as was read into CARD, it has the additional overhead of parsing the data for tokens and then converting the text into (I assume) reals. The fact that this takes 2x to 2.5x as long is not unreasonable.

Jim Dempsey

LRaim
New Contributor I

Ref. last comment by Jim Dempsey.

"The fact that this takes 2x to 2.5x as long is not unreasonable"

What ??  So ... I will wait a couple of years to move from Intel Fortran 2015  to 2016

Steven_L_Intel1
Employee

The image you posted tells us nothing about the program as a whole. If the program does very little itself, I/O time will tend to dominate.  If you'll provide us a test case we can run, we'll be glad to investigate. So far, all we can do is guess.

SHUTAO_L_
Beginner

To Steve:

Our customers sometimes need to import large files. The older version, compiled with Intel Fortran 2013, is very fast. The problem comes from Intel Fortran 2016.

To mecej4: Thanks for your suggestion. If Intel does not fix it in Intel Fortran 2016, I will have to read the data from the file directly. But that is risky, because some lines beginning with "$" may be comment lines, not data.

To Jim,

READ(CARD was very fast with the older compiler. I do not know what happened.

Steven_L_Intel1
Employee

We can't fix something without an example to look at.

I constructed my own example of reading 8 10-character reals using E10.0 format from a character variable, repeated 100000 times. I compared 15.0 and 16.0, and the times were nearly identical, though they were also very small (less than 0.2 seconds to read that many values).

Whatever problem you're having with performance is somewhere else. I think you are being misled by what you see in Visual Studio, which is why I am asking for an actual test case.

andrew_4619
Honored Contributor III
program fred
    implicit none
    integer, parameter :: nrec = 100000
    character(len=80)  :: gcard(nrec)  
    integer :: l1, J
    real(4) :: rcard(8,nrec)
    real(4) :: delta_t, t0
    
    CALL RANDOM_NUMBER(rcard)
    rcard = rcard*10000. ! populate with random reals from 0 to 10000 range
    do l1 = 1 , nrec     ! populate text buffers 
        write(gcard(l1), '(8F10.0)' ) rcard(:,l1) !(rcard(J,l1),j=1,8)
    enddo
    
    t0 = SECNDS (0.0)
    do l1 = 1 , nrec     ! now read the text buffers
        read(gcard(l1), '(8E10.0)',err=2560 ) (rcard(J,l1),j=1,8)
    enddo
    delta_t = SECNDS (t0)
    write(*,*) 'Delta Time (S)', delta_t
    2560 continue ! test the data exists
        read(*,*) l1
        write(*,*) rcard(:,l1)
    goto 2560 
end program 

I made a quick test program that times 100000 lines of internal reads with the format you use. For an x32 release build there was no difference between compilers 14.0.3.202 and 16.0.1.146; the time varies from run to run, but the range was 0.9 to 1.5 seconds in both cases. I guess the run-to-run differences are due to the background load on the PC not being constant.

andrew_4619
Honored Contributor III

I notice Steve did something similar... Perhaps the OP can run the test program.

 

jimdempseyatthecove
Honored Contributor III

Luigi R. wrote:

Ref. last comment by Jim Dempsey.

"The fact that this takes 2x to 2.5x as long is not unreasonable"

What ??  So ... I will wait a couple of years to move from Intel Fortran 2015  to 2016

This is for the formatted internal read that converts the text of 8 real variables. The read into CARD is unformatted text with no conversions. When the entire record is inside the I/O buffer (tunable), the READ is very fast. This happens most of the time.

Now then, if you are saying that the internal read with an earlier compiler was faster than an internal read with the newer compiler, why not post the .jpeg of the other compiler's statistics? Please use the same input file and compiler options.

Steve,

There was a different post a few weeks ago where an internal read was causing a memory leak. I know that there is nothing in this thread for you to go on, but do you recall the particulars of that thread?

Jim Dempsey

Steven_L_Intel1
Employee

Jim, that was https://software.intel.com/en-us/forums/topic/591225 and dealt with internal WRITE (not sure about READ) leaking a handle, not memory. This was fixed in Update 1.

Greg_T_
Valued Contributor I

I'll contribute my two cents.  Our software reads large finite element analysis (FEA) result files, usually 1 GB to 10 GB, for several different FEA solvers and file formats.  Using Intel Fortran XE2016 the large file reading runs very quickly, so I think it would be worth investigating further and not just reverting to XE2015.

Regards,
Greg

SHUTAO_L_
Beginner

Thanks for your efforts. I have been too busy recently. Now I have prepared the code for your test. The test program runs in about 48 seconds with the libifcoremd.dll from Intel Fortran 2013, but about 132 seconds with the libifcoremd.dll from Intel Fortran 2016: more than 2x slower. Please make sure the test program is run with each of the two DLL files. Attached please find the source code and a run-time comparison on my machine. I hope you can find the reason and fix it.

Steven_L_Intel1
Employee

Thanks for providing a test program. It isn't doing what we thought.

I can see the difference, but I am having difficulty identifying what the key factor is. In one environment, if I set /reentrancy:none, which tells the I/O library not to worry about being thread-safe, the 15.0 speed returns; but if I do this in Visual Studio it has no effect. It is an unusual program that does quite this much I/O and nothing else. On my several-year-old system I see times of 25 and 50 seconds. Even 50 seconds to process ten million records doesn't seem unreasonable.

I will ask the developers if they can figure out what changed - I do know that we have had issues with thread-safety and maybe the fixes for that resulted in more "locking" of data structures. I was able to bring the time down considerably by reading 10 cards at a time from the file into an array of "cards" and then reading the data from each array element, but that may not be a technique you can use. Any particular reason you're using the character variable as an intermediate step? Why not just read from the file directly? If you replace the read into CARD with:

READ(NDAT,'(8E10.0)',err=960,end=9000)(stress_ele(J),J=1,8)

It goes twice as fast.
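The batching idea described above might look something like this. It is a sketch under assumptions: the batch size of 10 and the variable names are illustrative, and the end-of-file handling for a partial final batch that a real program needs is omitted. One formatted READ from the file fills ten 80-character cards (format reversion of the single A80 descriptor advances to a new record for each array element), and each card is then decoded from memory:

```fortran
! Sketch: one file READ pulls in 10 records; the per-READ library
! overhead is paid once per batch instead of once per line.
      CHARACTER(LEN=80) :: CARDS(10)
      READ (NDAT, '(A80)', ERR=960, END=9000) CARDS
      DO K = 1, 10
         READ (CARDS(K), '(8E10.0)', ERR=960) (STRESS_ELE(J), J=1,8)
      END DO
```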

mecej4
Honored Contributor III

Steve Lionel (Intel) wrote:

Any particular reason you're using the character variable as an intermediate step? Why not just read from the file directly? If you replace the read into CARD with:

READ(NDAT,'(8E10.0)',err=960,end=9000)(stress_ele(J),J=1,8)

It goes twice as fast.

Shutao L said in #9 that some of the data lines start with '$' and contain comments.
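A sketch of how the intermediate CARD can be kept for that '$' test while each line is still decoded only once (the names NDAT, CARD, and STRESS_ELE are taken from the thread; the statement label is illustrative):

```fortran
! Sketch: keep the cheap raw read into CARD so comment lines can be
! detected, and pay the formatted-decode cost only for data lines.
  100 READ (NDAT, '(A80)', ERR=960, END=9000) CARD
      IF (CARD(1:1) .EQ. '$') GO TO 100        ! skip '$' comment lines
      READ (CARD, '(8E10.0)', ERR=960) (STRESS_ELE(J), J=1,8)
```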

SHUTAO_L_
Beginner

Steve Lionel (Intel) wrote:

Thanks for providing a test program. It isn't doing what we thought.

I can see the difference, but I am having difficulty identifying what the key is. In one environment if I set /reentrancy:none, which tells the I/O library to not worry about being thread-safe, the 15.0 speed returns, but if I do this in Visual Studio it has no effect. It is an unusual program that does quite this much I/O and nothing else. On my several-year-old system I see times of 25 and 50 seconds. Even 50 seconds to process ten million records doesn't seem unreasonable.


I wonder why it is so much faster on your machine. Did you use different compile options? The following are the compile options I used:

/nologo /O2 /Ob0 /assume:buffered_io /assume:nocc_omp /f77rtl /Qsave /module:"x64\Release\\" /object:"x64\Release\\" /libs:dll /threads /c

and:

/OUT:"x64\Release\TestIntelIO2008.exe" /INCREMENTAL:NO /NOLOGO /MANIFEST /MANIFESTFILE:"D:\dir_codes\TestIntelIO\TestIntelIO\x64\Release\TestIntelIO2008.exe.intermediate.manifest" /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /SUBSYSTEM:CONSOLE /LARGEADDRESSAWARE /IMPLIB:"D:\dir_codes\TestIntelIO\TestIntelIO\x64\Release\TestIntelIO2008.lib"

Steven_L_Intel1
Employee

I didn't play with anything more than /reentrancy (and added /libs:dll /threads). Everything else was default. I may have a faster system overall than you (though my system is really quite old: a Nehalem-generation processor and no SSD).
