Intel® Fortran Compiler

int8 vs int32 for Derived Data Type Component

Scott_Boyce
New Contributor I

This is to start a general discussion to get opinions on using integer types other than the standard integer(int32) type. I frequently have derived data types that contain flags that usually span from 0 to 10. Usually I use the default integer, but that seems silly when the flag fits within the range of one byte, or when I could even use BTEST with IBSET and IBCLR to have each bit represent a flag.

The following is an example of this (note this is not meant to be good code, just to illustrate the flag). This derived data type could be declared with dimension(:), making the integer size contribute significantly to the overall array size.

 

use, intrinsic:: iso_fortran_env, only: int8, int16, int32, int64

type dynamic_int
   integer:: int_type = 0             ! 0: null; >0 allocated-> 1: int8, 2: int16, 3: int32, 4: int64
   class(*), pointer:: pnt => null()
end type

type dynamic_int8
   integer(int8):: int_type = 0_int8  ! 0: null; >0 allocated-> 1: int8, 2: int16, 3: int32, 4: int64
   class(*), pointer:: pnt => null()
end type

type dynamic_char
   character:: int_type = ' '  ! ' ': null; non-blank allocated-> '1': int8, '2': int16, '3': int32, '4': int64
   class(*), pointer:: pnt => null()
end type

type(dynamic_int ), dimension(1024):: dyn1
type(dynamic_int8), dimension(1024):: dyn2
type(dynamic_char), dimension(1024):: dyn3

 

 

My question is, is there an issue with using int8 over the default integer (int32), or even using a single-byte character? I was always told that int32 is better because of alignment, but I am not sure how that would affect a derived data type that is then made into an array. I've never had much success with testing/timing this, so I thought I'd ask the greater community for their opinions. My interest is in speed, in addition to saving on memory space.
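For what it's worth, one way to see what the compiler actually does (padding included) is the storage_size intrinsic. A minimal sketch of my own, not part of the original question, that reports the bytes each element would occupy:

program check_element_size
   use, intrinsic:: iso_fortran_env, only: int8
   implicit none

   type dynamic_int
      integer:: int_type = 0
      class(*), pointer:: pnt => null()
   end type

   type dynamic_int8
      integer(int8):: int_type = 0_int8
      class(*), pointer:: pnt => null()
   end type

   type(dynamic_int ):: a
   type(dynamic_int8):: b

   ! storage_size() reports bits per array element, so any padding the
   ! compiler adds to align the pointer component shows up here.
   print *, 'dynamic_int  element bytes:', storage_size(a)/8
   print *, 'dynamic_int8 element bytes:', storage_size(b)/8
end program check_element_size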

 

Also, this derived data type might frequently have its int_type component changed or copied to another array. I was not sure if int8 vs int32 would affect operations on the component. For example:

use, intrinsic:: iso_fortran_env, only: int8, int16, int32, int64

type dynamic_int
   integer:: int_type = 0             ! 0: null; >0 allocated-> 1: int8, 2: int16, 3: int32, 4: int64
   class(*), pointer:: pnt => null()
end type

type(dynamic_int), dimension(1024):: dyn1
type(dynamic_int), dimension(1024):: dyn2

integer:: i

! assume dyn1 has been set to values
do i=1, 1024
   dyn2(i)%int_type = dyn1(i)%int_type
   dyn2(i)%pnt      => dyn1(i)%pnt
end do

 

-------------------------------------------------------------------------------------------------------------------

 

Lastly, would there be any benefit to using bit operations? I have used them a little for my random number generators, but I was not sure if there was any speed penalty, since they are relatively new to Fortran (compared to C/C++).

For example, the following are methods for checking if a number is odd:

 

integer:: num
logical:: is_odd

num = 3

! All three tests check if num is odd

is_odd = btest(num, 0)     ! odd number if least significant bit is set  
is_odd = iand(num,1) == 1
is_odd = modulo(num, 2) == 1

 

Is there any speed benefit to using bit operations over the modulo? To be honest, I mostly use this for fast remainder calculations, such as:

 

use, intrinsic:: iso_fortran_env, only: int32, int64
integer:: rem
integer(int64):: val

val = 17_int64

rem = int( iand(val, 7_int64), int32) ! Get remainder of VAL/8 (valid for non-negative VAL)
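For what it's worth, the iand form only stands in for modulo when the divisor is a power of two and the value is non-negative; a quick self-check along those lines (my own sketch, not part of the original post):

use, intrinsic:: iso_fortran_env, only: int64
integer(int64):: val

! Compare the bit mask against modulo over a small non-negative range.
do val = 0_int64, 100_int64
   if ( iand(val, 7_int64) /= modulo(val, 8_int64) ) print *, 'mismatch at', val
end do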

 

 

Thanks in advance for the comments from the group,

 

Scott

Steve_Lionel
Honored Contributor III

First, my general advice - don't waste your time micro-optimizing stuff like this. If you find performance isn't what you want, use a profiler such as Intel VTune to find out where the program is spending the time - it's rarely where you think it is.

I recommend using 32-bit integers here unless you will be using so many of them that memory bandwidth starts to become an issue. Alignment DOES matter. I know that in the past every byte counted with small memory systems, but that isn't an issue nowadays.

As for an odd/even test, I'd expect the bit test to be faster, but better is to write readable, maintainable code. Theoretically, a smart compiler could notice a modulo-2 test and optimize that into a bit test - I don't know if the Intel compiler does that. But, again, I wouldn't spend any time worrying about this unless it became an issue down the road.

FortranFan
Honored Contributor II

@Scott_Boyce wrote:

.. My interest is in speed, in addition to saving on memory space...


@Scott_Boyce ,

Given what you write above re: speed, consider a design other than one involving run-time polymorphism, particularly unlimited polymorphic types.

jimdempseyatthecove
Honored Contributor III

Consider that if you are using polymorphism it comes with two costs: performance (it needs code to disambiguate classes) and memory (it needs class identifiers). The benefit may be the ability to write abstract code.

Please note, at present we do not have templates. Therefore, each procedure implementation may need SELECT TYPE/SELECT CLASS blocks and individual code sections. Programming in this manner does not yield abstract code.
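To illustrate that cost with the dynamic_int type from the original post, just reading the value back out of the class(*) pointer already needs one branch per supported kind. A sketch of my own (it assumes dynamic_int is accessible from a host module and that d%pnt is associated):

function get_as_int32(d) result(val)
   use, intrinsic:: iso_fortran_env, only: int8, int16, int32, int64
   type(dynamic_int), intent(in):: d
   integer(int32):: val

   select type (p => d%pnt)
   type is (integer(int8))
      val = int(p, int32)
   type is (integer(int16))
      val = int(p, int32)
   type is (integer(int32))
      val = p
   type is (integer(int64))
      val = int(p, int32)   ! may overflow for large values
   class default
      val = 0
   end select
end function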

An alternative is, for each class, to write your procedure (subroutine or function) header (e.g. the subroutine statement with its dummy declarations) but then use INCLUDE "bodyOfProcedure.i90". IOW the procedure code is the same for each class, but the types are different.

Then you use a generic interface to tie the procedures to a generic name.
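A rough sketch of that pattern (the file and routine names here are mine, not from the post): the kind-specific wrappers differ only in their declarations, the algorithm text lives once in the include file, and the generic name dispatches at compile time.

! The file sort_body.i90 would hold the algorithm written purely in terms
! of the dummy argument ARR (local declarations first, then executable
! statements), so the same text compiles for every integer kind.

module sorting
   use, intrinsic:: iso_fortran_env, only: int32, int64
   implicit none

   interface sort                      ! generic name ties the specifics together
      module procedure sort_int32, sort_int64
   end interface

contains

   subroutine sort_int32(arr)
      integer(int32), intent(inout):: arr(:)
      include "sort_body.i90"          ! identical body text for every kind
   end subroutine

   subroutine sort_int64(arr)
      integer(int64), intent(inout):: arr(:)
      include "sort_body.i90"
   end subroutine

end module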

Jim Dempsey

Scott_Boyce
New Contributor I

I agree with you about the slowdown with polymorphism and typically use interfaces to make code that has specific types. An example is my sorting routine that handles all the major types, but also has a "wild" sort that requires the user to supply routines to aid in sorting unknown objects ( https://code.usgs.gov/fortran/bif/-/tree/main/src/sort ).

 

I just picked that as a simple example because I often have derived types with integer flags that rarely need more than a byte of space. It just seemed silly to use 4 bytes of space when only 1 is needed, especially if it's something that will become an array, so the extra three bytes start to add up. I know memory is not much of an issue now, but I still find people tend to bloat their codes, which results in slower execution because of memory bus limitations. I just found it hard to come up with any meaningful tests (or really the time to truly test it) that would indicate whether it's better to use the number of bytes needed compared to the standard int32.

FortranFan
Honored Contributor II

@Scott_Boyce wrote:

.. I often have derived types with integer flags that rarely need more than a byte of space. It just seemed silly to use 4 bytes of space when only 1 is needed. Especially, if its something that will become an array so the extra three bytes start to add up. I know memory is not much of an issue now, but I still find people tend to bloat their codes that results in slower execution cause of memory buss limitations. ..


As mentioned upthread, alignment matters with components of derived types, particularly with Intel Fortran. Chances are high that any apparent gains in terms of "space" will be lost to padding that can come about due to alignment. Thus you would need to be very careful to get it right, if that is possible at all. So it is up to you to decide whether it's worth the effort.

 

jimdempseyatthecove
Honored Contributor III

>>alignment matters with components of derived types

This (generally) can be accounted for by ordering the alignment-sensitive components by size (large to small); this applies to scalars. These (load-critical) sizes are typically powers of two. Arrays need a little more care. For example, REAL :: VEC(3) has a byte size of 12 (4 * 3).

This can mean that the order of variables in the UDT might not be the visually preferred order, e.g. a single-byte condition/state code that you might prefer to see first in the UDT may computationally be best placed last in the UDT.
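A small illustration of that point (the component names are my own, and the exact padding is compiler dependent):

use, intrinsic:: iso_fortran_env, only: int8, int32, real64

! "Visually preferred" order, 1-byte state flag first: a compiler that
! naturally aligns components will typically pad this to roughly 24 bytes
! per element (7 bytes after FLAG plus trailing padding after N).
type state_flag_first
   integer(int8) :: flag
   real(real64)  :: value
   integer(int32):: n
end type

! Large-to-small order: typically 16 bytes per element (trailing pad only).
type state_flag_last
   real(real64)  :: value
   integer(int32):: n
   integer(int8) :: flag
end type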

Another consideration is to place computationally close values physically close, such as to be located in the same or adjacent cache lines.

Think of the UDT as a shipping container as opposed to the manifest of what is inside the container. The objectives could be prioritized:

1) speed of load/unload (performance)

2) packing density (size)

3) inspection ease (in logical descriptive order of the manifest).

Some may prioritize in a different order.

Jim Dempsey

 

 

 

Scott_Boyce
New Contributor I

Thanks for the info.

Do you mean the number of bytes should be a power of two, or the array dimensions? I have read elsewhere that it's best to have array dimensions divisible by 32. Given a power of two, it would seem that an INT16 would be about the same as an INT32, and both would be better than an INT8.

For the packing of a data type, does the ordering matter, or do you mean it's best to have the number of bytes be a power of 2?

For a lot of my data types, I tend to nest lots of vectors (1D arrays). Would it be best to move vectors that are used together next to each other in the data type? For example, an excerpt from https://code.usgs.gov/modflow/mf-owhm/-/blob/main/src/fmp/crop_data.f90#L40 :

type crop_prop
    integer:: n=z, id, ld
    !
    integer,         dimension(:,:),allocatable:: rc !(2,NCROP) defines row, col
    integer,         dimension(:),  allocatable:: t_concept !0 - no t, 1 root pressure, 2 linear uptake, 3 no anoxia
    !
    double precision,dimension(:),  allocatable:: root
    double precision,dimension(:),  allocatable:: frac
    !
    double precision,dimension(:),  allocatable:: kc
    double precision,dimension(:),  allocatable:: cf
    double precision,dimension(:),  allocatable:: cu
    !
    double precision,dimension(:),  allocatable:: ftr  
    double precision,dimension(:),  allocatable:: fei
    !
    double precision,dimension(:),  allocatable:: demand
end type

The components that are operated on together a lot should be closer (such as cu = kc + cf). Also, the arrays are usually all dimensioned to the same length, so would it improve speed if I ensure they are padded to the nearest power of 2, or to a length divisible by 32? The dimensions can get quite large depending on the region simulated (say around 10 to 20 million per array).

 

Steve_Lionel
Honored Contributor III

Ideally you want each member of a derived type to be "naturally aligned" - at an offset that is a multiple of its size. But when you have pointer and allocatable components, what's in the type is the descriptor, so the type is not relevant. You should try to have these at offsets that are multiples of eight bytes. However, the Intel compiler will, in the absence of SEQUENCE, add padding as needed.

In your example above, if we ignore the allocatable attributes, you have five 32-bit integers, so each of the following things will start misaligned at offset 20, which would not be good. If we include the allocatable attributes, then the descriptors start out misaligned in the absence of padding.

I suggest you research the concepts of "AOS vs. SOA" - that is, Array of Structures vs. Structures of Arrays. Depending on access pattern, one approach can be better than another.
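Applied to the flag example from the start of the thread, the two layouts look roughly like this (my own sketch; the any_ptr wrapper type is made up here):

! Array of Structures: each element carries the flag and the pointer, so a
! sweep over just the flags also strides across all the pointer descriptors.
type dyn_aos
   integer:: int_type = 0
   class(*), pointer:: pnt => null()
end type

type(dyn_aos):: dyn(1024)

! Structure of Arrays: each component is its own contiguous array, so a
! sweep over int_type touches only tightly packed integers.
type any_ptr
   class(*), pointer:: p => null()
end type

type dyn_soa
   integer,       allocatable:: int_type(:)
   type(any_ptr), allocatable:: pnt(:)
end type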

jimdempseyatthecove
Honored Contributor III

Allocatable arrays (and allocatable scalars and allocatable UDTs) are not inside the UDT; their array descriptor (or the equivalent of a pointer) is there, and their data come off the heap.

The alignments we were discussing relate to data elements contained inside the UDT.

The alignment of the allocatables "depends".

Default alignment is a characteristic of the heap manager. Minimally it would be the size of a pointer (4 or 8 bytes); however, usually it is two of these (IOW a node header consisting of a pointer and a size as intptr_t). Expect a default of 16 bytes, *** but this can differ ***.

You can impose a desired alignment for allocatables (as well as local variables) through the use of an attribute. But please note that you cannot attribute alignment to non-allocatable components within a UDT, as the UDT may or may not be aligned, or, in the case of an array of UDTs, sizeof(UDT) might not be a multiple of the largest item you desire to have aligned.
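With Intel Fortran, the attribute Jim refers to is spelled as a directive; a rough sketch (mine, not Jim's; check the current compiler documentation for the exact behavior with allocatables):

integer:: n
real(8), allocatable:: kc(:), cf(:), cu(:)
! Request 64-byte (cache-line) alignment for the data these descriptors
! will point at once allocated (Intel-specific directive).
!DIR$ ATTRIBUTES ALIGN : 64 :: kc, cf, cu

n = 1024
allocate( kc(n), cf(n), cu(n) )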

 

type yourUDT
  real(16) :: X  ! place largest here
  real(16) :: VecX(3) ! each element same sizeof as X
  real(8) :: Y   ! next largest here
  real(8) :: vecY(3,3)
  integer(8) :: i8
  real(4) :: R
  real(4) :: VecR(9)
  integer(4) :: i4
  logical(4) :: l4
  integer(2) :: i2
  logical(2) :: l2
  real(2) :: r2 ! if this is ever supported
  integer(1) :: i1
  logical(1) :: l1
  character(len=1234) :: buff
end type yourUDT

 

You can align a (stack) local UDT, a module UDT, and a scalar allocatable UDT; an array of UDTs can only enforce alignment on the 1st element. The alignment of the remaining UDTs of the array depends on sizeof(UDT). So you may need to count bytes and then add a pad variable.

Also, you may want to arrange the variables not only in descending size, but also keeping in mind computational locality, to improve the cache hit ratio.

 

type Body
  real(8) :: mass        ! keep these 7 variables together
  real(8) :: velocity(3) ! in the same cache line
  real(8) :: position(3) ! Note, acceleration in local variable
  real(8) :: somethingElse
  real(8) :: Omega(3)    ! Angular velocity vector in next cache line
  ...
end type Body

 

As to whether that yields a tad better performance, you would need to run VTune to get the statistics.

Jim Dempsey

 
