Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Performance of Defined Input/Output Procedure

hentall_maccuish__ja
New Contributor II
2,843 Views

Hello,

I’m new to defined input/output procedures, and the one I have written is eating up 80% of my runtime. I am saving a lot of data to disk, but this still seems disproportionate to me: the program generates and processes all of this data as well, so merely writing it to disk taking 80% of the runtime doesn’t seem right. I can see why my I/O procedure would induce a lot of loops and be slow; however, being new to defined I/O procedures, I’m not sure what I can do about it. Any suggestions would be greatly appreciated. The derived type I am trying to save, with its I/O routines, is below.

Thanks,

Jamie

    module Policy
    implicit none

    type sparseCOOType
        integer :: col
        integer :: row
        real    :: val
    end type sparseCOOType

    type policyType
        type(sparseCOOType), allocatable :: COO(:)
    contains
        procedure :: write_sample => write_container_sample_impl
        procedure :: read_sample  => read_container_sample_impl
        generic   :: write(unformatted) => write_sample
        generic   :: read(unformatted)  => read_sample
    end type policyType

    contains

    subroutine write_container_sample_impl(this, unit, iostat, iomsg)
        class(policyType), intent(in) :: this
        integer, intent(in)           :: unit
        integer, intent(out)          :: iostat
        character(*), intent(inout)   :: iomsg
        integer :: i

        write(unit, iostat=iostat, iomsg=iomsg) size(this%COO)
        do i = 1, size(this%COO)
            write(unit, iostat=iostat, iomsg=iomsg) this%COO(i)%col
            write(unit, iostat=iostat, iomsg=iomsg) this%COO(i)%row
            write(unit, iostat=iostat, iomsg=iomsg) this%COO(i)%val
        end do
    end subroutine write_container_sample_impl

    subroutine read_container_sample_impl(this, unit, iostat, iomsg)
        class(policyType), intent(inout) :: this
        integer, intent(in)              :: unit
        integer, intent(out)             :: iostat
        character(*), intent(inout)      :: iomsg
        integer :: i, sizeCOO

        read(unit, iostat=iostat, iomsg=iomsg) sizeCOO
        allocate(this%COO(sizeCOO))

        do i = 1, sizeCOO
            read(unit, iostat=iostat, iomsg=iomsg) this%COO(i)%col
            read(unit, iostat=iostat, iomsg=iomsg) this%COO(i)%row
            read(unit, iostat=iostat, iomsg=iomsg) this%COO(i)%val
        end do
    end subroutine read_container_sample_impl

    end module Policy
1 Solution
Arjen_Markus
Honored Contributor I
2,577 Views

Your current procedure writes and reads all the data elements in turn. But the sparse matrix COO is an array of a simple derived type that just contains three scalars. Such a type can be written directly via the default facilities. So, you could write your policy object "this" via:

write(lun) this%COO

That will make the writing (and similarly the reading) much faster.
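A minimal sketch of the two procedures rewritten with whole-array transfers (names follow the original post; the early returns on iostat and the deallocate guard are additions here, not part of the original code):

```fortran
subroutine write_container_sample_impl(this, unit, iostat, iomsg)
    class(policyType), intent(in) :: this
    integer, intent(in)           :: unit
    integer, intent(out)          :: iostat
    character(*), intent(inout)   :: iomsg

    ! One record for the size, one for the whole COO array: the runtime
    ! transfers the array in a single call instead of three calls per element.
    write(unit, iostat=iostat, iomsg=iomsg) size(this%COO)
    if (iostat /= 0) return
    write(unit, iostat=iostat, iomsg=iomsg) this%COO
end subroutine write_container_sample_impl

subroutine read_container_sample_impl(this, unit, iostat, iomsg)
    class(policyType), intent(inout) :: this
    integer, intent(in)              :: unit
    integer, intent(out)             :: iostat
    character(*), intent(inout)      :: iomsg
    integer :: sizeCOO

    read(unit, iostat=iostat, iomsg=iomsg) sizeCOO
    if (iostat /= 0) return
    ! Guard against reading into an object that was already populated.
    if (allocated(this%COO)) deallocate(this%COO)
    allocate(this%COO(sizeCOO))
    read(unit, iostat=iostat, iomsg=iomsg) this%COO
end subroutine read_container_sample_impl
```

This works because sparseCOOType has only intrinsic, non-allocatable components, so the default unformatted transfer can handle the whole array.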

 


30 Replies
hentall_maccuish__ja
New Contributor II
918 Views

Thank you Arjen. I thought I couldn't do this because COO is itself allocatable, but I re-read the passage of the standard that had been quoted to me and made me think that (reposted below), and I now think I would only need defined I/O if one of the components of COO were allocatable. Is that right?

That should speed it up a fair amount. My guess is that the bigger problem is the number of calls coming from the array of policy objects, but I will definitely use this.

“If a list item of derived type in an unformatted input/output statement is not processed by a defined input/output procedure (12.6.4.8), and if any subobject of that list item would be processed by a defined input/output procedure, the list item is treated as if all of the components of the object were specified in the list in component order (7.5.4.7); those components shall be accessible in the scoping unit containing the data transfer statement and shall not be pointers or allocatable. If a derived-type list item is not processed by a defined input/output procedure and is not treated as a list of its individual components, all the subcomponents of that list item shall be accessible in the scoping unit containing the data transfer statement and shall not be pointers or allocatable.”

mecej4
Honored Contributor III
913 Views

The key point to note is that if the I/O list contains only objects of a derived type whose components are all of intrinsic types, you do not need to write defined I/O procedures.
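A minimal self-contained illustration of this point (the type and file name here are hypothetical, not from the thread): a derived type whose components are all intrinsic scalars can appear directly in an unformatted transfer, whole array at a time.

```fortran
program demo
    implicit none
    type point
        integer :: i, j
        real    :: v
    end type point
    type(point) :: a(3), b(3)
    integer :: lun

    a = [point(1, 2, 1.0), point(3, 4, 2.0), point(5, 6, 3.0)]

    open(newunit=lun, file='points.bin', form='unformatted', &
         access='sequential', status='replace')
    write(lun) a          ! whole array in one record; no defined I/O needed
    rewind(lun)
    read(lun) b
    close(lun, status='delete')

    if (all(b%i == a%i) .and. all(b%v == a%v)) print *, 'round trip ok'
end program demo
```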

hentall_maccuish__ja
New Contributor II
727 Views

Thanks Arjen, this simple change cut the runtime in half! Down to 4.5 hours

andrew_4619
Honored Contributor II
695 Views

Does setting "buffered" on the open statement help?
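For reference, BUFFERED= is an Intel Fortran extension to the OPEN statement, not standard Fortran; a sketch of the suggestion (unit and file name are placeholders):

```fortran
! Intel Fortran extension: accumulate sequential writes in an in-memory
! buffer and flush in larger chunks, reducing the number of system calls.
open(newunit=lun, file='policy.bin', form='unformatted', &
     access='sequential', buffered='yes')
```

The same effect can be requested globally with the -assume buffered_io compiler option or the FORT_BUFFERED environment variable.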

hentall_maccuish__ja
New Contributor II
672 Views

Hi Andrew,

Thanks, that's a good suggestion. I'm currently locked out of my development environment but will try this when I get back in, in a couple of weeks.

Bernard
Valued Contributor I
726 Views

Hello,

I'm late to this discussion, and it seems that you have already received a satisfactory answer.

The consumption of 80% of your runtime is not surprising. Your code calls the Fortran runtime library function for_write_seq three times in a tight loop. You don't state the OS, but on Windows for_write_seq would probably call WriteFile in NTDLL, which in turn calls Zw/NtWriteFile in NTOSKRNL, operating in kernel space and invoking the low-level disk access driver hierarchy. Large overhead is to be expected in this case, and a more accurate breakdown can be obtained with the VTune profiler.

Reading your post, I'm not sure whether the kernel overhead of writing the data to disk was included in that 80% figure.

--Bernard

 
hentall_maccuish__ja
New Contributor II
670 Views

Hello Bernard,

The 80% figure came out of the VTune profiler. If memory serves, the time was consumed by the write statements inside the defined I/O routine, so I guess by the writes themselves and not kernel overhead; or is there a better way to check this in the VTune profiler?

Thanks,

Jamie

Bernard
Valued Contributor I
667 Views

Hi Jamie,

I presume that the 80% was the total execution time, covering both the user-mode and the kernel-mode parts. I'm sure the kernel-mode part will dominate the execution time anyway. By default the VTune GUI sets the counters to count both user and kernel event triggers, so it is probably, as mentioned, the total value.

Personally, I use the VTune CLI, where I can tweak the performance events and their modifiers more. I think you can do the same in the GUI version by creating a custom analysis and choosing the relevant performance events, mainly the fixed-counter events.

"is there a better way to check this in VTune profiler?"

Maybe a hotspot analysis with kernel-code cycle counting and call-stack collection would suffice to visualize the distribution of the load across user and kernel modules.
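A sketch of such a run with the VTune CLI (the program name and result directory are placeholders, and knob availability can vary by VTune version):

```shell
# Hardware-sampling hotspots collection with call stacks, so time can be
# attributed to user-mode vs. kernel-mode modules in the report.
vtune -collect hotspots -knob sampling-mode=hw \
      -knob enable-stack-collection=true \
      -result-dir vtune_results -- ./myprogram

# Summarize where the cycles went.
vtune -report hotspots -result-dir vtune_results
```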


hentall_maccuish__ja
New Contributor II
660 Views

Thanks, I'll look into this and give it a try when I get access back to my development environment (I'm locked out for a few weeks).

Bernard
Valued Contributor I
654 Views

Hi,

Looking back at your example, I came to the conclusion that the 80% mentioned may cover only the user-mode processing time, with most of the time spent in for_write_seq and further down in WriteFile, without even crossing the kernel boundary.
