Re: technique to recognize difference between integers and real

davinci · ‎05-03-2009

I read data files which may contain integers -or- reals.
I don't know this in advance, so I read the data as characters and add a decimal point before an internal read.
That is, after reading I see what I have got and then treat every number as real from there on,
whereas I would like to keep the distinction between integer and real.

The data processing must be extremely fast, so I cannot split my whole program into parts...
"if number = integer then process large parts of the program this way..."
"if number = real then process large parts of the program that way..."
I want to process these numbers in the correct way without these "if"s.

Is there a (binary) datatype- or pointer-technique to solve this problem ?
Something like

type BothNum
integer(4) :: IntNum
real(4):: RealNum
integer(4), pointer :: PointNum ! pointing to the correct datatype as soon as I read and parsed a number
end type BothNum

Thanks, Clemens (davinci)

jimdempseyatthecove · ‎05-03-2009

Clemens,

Are the "process large parts of the program this way" expressing the same algorithm?

That is to say, one path using integers and the other using reals, such that the results for input 1234 (taking the integer path) are the same as for input 1234. (taking the floating-point path), but where the integer path executes faster?

What is the largest integer?
What is the largest real?
What is the smallest real?
Are there negative values in your input other than for termination?
Does your input contain 0 (or 0.0) other than for termination?

Can you make the integer/real determination once, then run two separate paths through your program (reducing all the "if number ==..." tests to one test for the life of the number)?

Jim Dempsey

davinci · ‎05-05-2009

Jim, thanks for your thoughts,

There is only one huge algorithm, which processes both integer(4) and real(4).
Either type may turn up unexpectedly on any input line, so it has to be recognized while reading and parsing the input.

Integer(4) and real(4) only; both can be negative too, and neither serves as a termination value.

The algorithm is very complex and will be in a further development stage for a while.
There is no way to determine the int/real situation for large parts of the program; the mixed solution serves the whole program.
Copying the algorithm into two parts (int/real) is therefore difficult; the input is the problem.

Hence my thought to read in a uniform way (see the type construct in my previous suggestion, or some other binary form), then parse the input, "point" to the correct datatype, and do the subsequent processing through this pointer mechanism.

It is the processing part after reading the data which is too complex to split in an integer and real part.

OK, I know this question is something for Donald Knuth.
(I followed his lectures quite a while ago, but forgot to ask this question.)
It's an interesting one for the theme "Algorithms & Data Structures in Fortran".

Cheers Clemens (Davinci).

jimdempseyatthecove · ‎05-05-2009

Clemens,

What is the range of the valid integer numbers?
Do 0 and 0.0 produce the same results (i.e. are they interchangeable)?

Except for 0 having the same binary pattern as 0.0 (for same-sized numbers, integer(4)/real(4) or integer(8)/real(8)), normal-range floating-point numbers have rather large absolute values when their bits are viewed as integers.

For REAL(4), anything that is not a "funny" number (NaN, infinities, denormalized numbers, etc.) will, when viewed as an integer, have an absolute magnitude .ge. 2**23.

So, if your integer numbers stay below this magnitude, then a union of the INTEGER(4) and REAL(4) can be used in place of your three-variable structure. This reduces data size and improves cache hit ratios.

type compositeVar
union
map
integer :: asINT4
end map
map
real :: asREAL4
end map
end union
end type compositeVar

Make an inline logical function to determine whether the variable is integer or real:

logical function isReal(v)
type(compositeVar)::v
isReal = (iand((v%asINT4+Z'00800000'), Z'FF000000') .ne. 0)
end function isReal

The above may be faster than isReal = (iabs(v%asINT4) .ge. Z'00800000'),
but you can try both ways.

Run rigorous tests to verify that your full input range does not exceed these limitations.

Then in your main code

if(isReal(VAR)) then
! do real part using VAR%asREAL4
else
! do integer part VAR%asINT4
endif

If 0 and 0.0 produce different results and need to be separated,
then do 0.0 and -0.0 produce the same results?
If so, when you read 0.0, set the integer part to Z'80000000'; this is a float -0.0.
Do not try to set VAR%asREAL4 = -0.0, as this will set it to 0.0.

You can also make inline conversion functions

real function asReal(v)
type(compositeVar)::v
! Fortran returns a function value by assigning to the function name
if(isReal(v)) then
asReal = v%asREAL4
else
asReal = REAL(v%asINT4)
endif
end function asReal

and the other one for asINT4
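
UNION/MAP is a vendor extension, so here is a rough sketch of the same idea in standard Fortran using TRANSFER, including the integer conversion. It is only an illustration under stated assumptions: the names composite_sketch, composite_var, from_int, from_real, is_real, as_real and as_int are hypothetical, integers are assumed to satisfy -2**23 <= i < 2**23, both integer 0 and real 0.0 end up as bit pattern 0 and are reported as integer, and the range comparison is a variant of the bit test above, not the same expression.

```fortran
module composite_sketch
  use, intrinsic :: iso_fortran_env, only: int32, real32
  implicit none
  private
  public :: composite_var, from_int, from_real, is_real, as_real, as_int

  ! Hypothetical stand-in for the UNION/MAP type: one 32-bit cell holding
  ! either a small integer or the raw bit pattern of a REAL(4).
  type :: composite_var
    integer(int32) :: bits = 0_int32
  end type composite_var

  integer(int32), parameter :: two23 = 2_int32**23

contains

  ! Store an integer; the caller must guarantee -2**23 <= i < 2**23.
  elemental function from_int(i) result(v)
    integer(int32), intent(in) :: i
    type(composite_var) :: v
    v%bits = i
  end function from_int

  ! Store the bit pattern of a real (0.0 becomes bit pattern 0).
  elemental function from_real(x) result(v)
    real(real32), intent(in) :: x
    type(composite_var) :: v
    v%bits = transfer(x, 0_int32)
  end function from_real

  ! True when the bit pattern lies outside the assumed integer range.
  elemental logical function is_real(v)
    type(composite_var), intent(in) :: v
    is_real = (v%bits >= two23) .or. (v%bits < -two23)
  end function is_real

  ! Value as REAL(4), converting when an integer is stored.
  elemental function as_real(v) result(x)
    type(composite_var), intent(in) :: v
    real(real32) :: x
    if (is_real(v)) then
      x = transfer(v%bits, 0.0_real32)
    else
      x = real(v%bits, real32)
    end if
  end function as_real

  ! Value as INTEGER(4); a stored real is truncated toward zero.
  elemental function as_int(v) result(i)
    type(composite_var), intent(in) :: v
    integer(int32) :: i
    if (is_real(v)) then
      i = int(transfer(v%bits, 0.0_real32), int32)
    else
      i = v%bits
    end if
  end function as_int

end module composite_sketch
```

With this sketch, is_real(from_int(123)) is false and is_real(from_real(123.0)) is true, so 123 and 123. stay distinguishable in a single 4-byte cell.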

Jim Dempsey

jimdempseyatthecove · ‎05-05-2009

Clemens,

Here is an alternate suggestion, although it is unconventional it will get the speed you want.

Take your current subroutine, e.g. FOO.F90, and copy it to FOO.INC

Convert FOO.INC into a polymorphic source file whose identity is changed by use of the Fortran Preprocessor (FPP):
#define xxx yyy
and
#ifdef ...

With one define, the single source file compiles into the integer path
With the second define, the single source file compiles into the REAL path.

You can now use FPP to create both variants:

subroutine foo(v)
use mod_composite
if(isREAL(v)) then
call fooREAL(v%asREAL4)
else
call fooINT(v%asINT4)
endif
end subroutine foo

subroutine fooREAL(v)
real(4) :: v
#define USE_REAL
#include "FOO.INC"
#undef USE_REAL
end subroutine fooREAL

subroutine fooINT(v)
integer(4) :: v
#define USE_INT
#include "FOO.INC"
#undef USE_INT
end subroutine fooINT

Jim Dempsey

WSinc · ‎05-06-2009

Maybe I am overly simplifying this- but assuming they come in as CHARACTER types first:

Why not look for a decimal point FIRST?

If the numbers are all integral (whether written as real or integer), you can safely assume they can be treated as integers, which would be fastest. You can process the file once to either add or not add a decimal point if they look like integers.

So one pass through the file would be sufficient. If they are mixed, some with a decimal point and some without, then it's safest to assume they are ALL real. If their magnitude is quite large you can use REAL (kind=8) or (kind=16).

Otherwise treat them as ALL integer.

Do we know the range? That would determine the KIND= part of the integer statement.

I had a similar problem when I worked for JPL, but I just treated everything as REAL (kind=8).

WSinc · ‎05-06-2009

Here is a little routine that does what I suggested.
It is VERY fast. I assume a 40 character input record.

[cpp]      subroutine real_or_int(rec)
      implicit NONE
      integer (kind=8)inum
      integer (KIND=2) i
      real (kind=8) xnum
      character (len=40) rec
101 format(I40)
102 format(G40.12)
      do i=1,40
        if(rec(i:i).eq.'.')go to 20
      end do
! here if NO decimal point
      read(rec,101)inum
      print *,inum
      return
! here if dec pt. found
20  read(rec,102)xnum
     print *,xnum
     end subroutine
[/cpp]

jimdempseyatthecove · ‎05-06-2009

From my understanding of the original poster's request, the data file has a mix of integers and reals.

Depending on whether the input is integer (sans . and sans maybe E+nn) or real (with . and maybe E+nn), the program behaves entirely differently.

The user does not wish to write two separate subroutines (or collections of subroutines), as this makes maintenance difficult. The two code paths are similar but not the same (from my understanding). The user stated (per query from me) that different results occur when the input has an integer as opposed to a real of the same numeric value (e.g. 123 and 123.). At least that is what I asked, and different results is the response I think I got.

If the numbers through differing code paths do produce the same results, then the user may find better performance by making all input REAL(4) and doing away with the numerous flow control statements. His code might be able to pick up additional SSE instruction usage and make up the difference in speed and then some.

Jim Dempsey

WSinc · ‎05-07-2009

That's why I think his best approach is to do a one-pass conversion so that ALL the numbers are of a single type.

If none of the quantities have a fractional part, the quickest conversion would be to INTEGERS.

If the numbers appear in the file as BINARY (not as character data) then telling integers from floating point could be very difficult.

For example, take: Z'13422006'
Is that a very small floating point, or a very large integer?

The application that it's USED for would have to answer that question. If the quantity represents some MEASUREMENT (e.g. a physical dimension, atmospheric pressure, position coordinates) then it should be floating point.

Obviously, if it's given as character data and a D or E appears, it must be floating point.
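
As a rough sketch of that character-level check (the module and function names are hypothetical, and it assumes each token holds exactly one number with no other text), a single SCAN over the token catches a decimal point as well as a D or E exponent letter:

```fortran
module token_kind_sketch
  implicit none
  private
  public :: looks_real

contains

  ! True when the token contains a decimal point or an exponent letter,
  ! i.e. when it is written as a real rather than as a plain integer.
  pure logical function looks_real(token)
    character(len=*), intent(in) :: token
    looks_real = scan(token, '.EeDd') > 0
  end function looks_real

end module token_kind_sketch
```

A token such as 123 is then classed as integer, while 123., 1E5 and 1d0 are classed as real, before any numeric READ is attempted.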

jimdempseyatthecove · ‎05-11-2009

It may help if you examine this document

http://steve.hollasch.net/cgindex/coding/ieeefloat.html

Except for 0.0 and -0.0, all non-"funny" floating-point numbers (that is to say, everything other than denormalized numbers) have a nonzero exponent field. For REAL(4) there is an 8-bit exponent, just below (right of) the sign bit. Therefore

(IAND(iVal, Z'7F800000') .ne. 0) == REAL
(IAND(iVal, Z'7F800000') .eq. 0) == INTEGER .or. (real value of 0.0)
(IAND(iVal, Z'7F800000') .eq. 0) .and. (IAND(iVal, Z'007FFFFF') .ne. 0) == INTEGER

As explained earlier, if the path through your code with iValue=0 produces the same result as the path with rValue=0.0, then you can use the middle test above to disambiguate the numbers (and use integer 0 as a substitute for 0.0).
This test is very fast, since IAND is an intrinsic and will result in a single machine instruction.
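
A small sketch of the exponent-mask test in standard Fortran follows. The names exponent_mask_sketch, exponent_bits_set and real_bits are hypothetical, and IEEE single precision with two's-complement integers is assumed. It also makes one caveat easy to check: a small negative integer such as -5 has all of its upper bits set, so the plain mask test reports it as REAL unless negative integers are handled separately.

```fortran
module exponent_mask_sketch
  use, intrinsic :: iso_fortran_env, only: int32, real32
  implicit none
  private
  public :: exponent_bits_set, real_bits

  ! The 8 exponent bits of an IEEE single-precision value.
  integer(int32), parameter :: exp_mask = int(Z'7F800000', int32)

contains

  ! True when any exponent bit is set in the 32-bit pattern.
  elemental logical function exponent_bits_set(bits)
    integer(int32), intent(in) :: bits
    exponent_bits_set = iand(bits, exp_mask) /= 0
  end function exponent_bits_set

  ! Raw bit pattern of a REAL(4), viewed as an INTEGER(4).
  elemental function real_bits(x) result(bits)
    real(real32), intent(in) :: x
    integer(int32) :: bits
    bits = transfer(x, 0_int32)
  end function real_bits

end module exponent_mask_sketch
```

Here exponent_bits_set(real_bits(1.5)) is true and exponent_bits_set(10000) is false, while exponent_bits_set(-5) is also true, which is why the negative range needs its own treatment.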

The original post, and follow-ups, indicated that inputs of 123 and 123.0 produced different results (different code paths taken).

Please note: study the code _before_ you make an assumption that 123 and 123.0 (and by extension all integral REALs) are equivalent. 123 could be entered when it is known that the code path should not or will not produce a fractional part, whereas 123.0 could be entered when it is known that the code path may produce, or may be permitted to produce, a fractional part. It would be presumptuous to assume differently.

Jim Dempsey

jimdempseyatthecove · ‎05-11-2009

And as explained earlier, the integer range must lie within 23 bits of numeric range (with special considerations made for negative integer numbers).

Jim Dempsey

Steven_L_Intel1 · ‎05-11-2009

When I read Clemens' original post, it read to me as if he was always reading the value into a REAL. (I would suggest REAL(8) otherwise some large integers won't convert accurately.) I thought he then wanted to take a different code path if the value read was an integer.

If you're reading a text representation, there is no need to try to figure out, from the bits, whether there is a fractional part or not. (I had a customer ask about reading binary data and trying to figure this out - as others have covered, there is no guaranteed way to do this unless you have strict limits on the values.)

If I was doing this, I'd simply read the value with a Gsomething.0 format (where "something" is the width of the string being read) into a double precision real. I'd then ask the musical question (x == aint(x)), i.e. whether the value has no fractional part; the FRACTION intrinsic is not the right tool here, since it returns the model mantissa and is zero only for x = 0. If the test is true, then the value can be converted to integer (with INT) and manipulated that way, otherwise leave it as a real. You might want to also test exponent(x) to make sure it is not too large (>31?) for an integer(4).
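
A minimal sketch of that read-then-classify step might look like the following. The names classify_value_sketch and read_number are hypothetical, and the text is assumed to be non-empty and to hold a single number. The sketch tests x == aint(x) for "no fractional part", because the FRACTION intrinsic returns the model mantissa rather than the digits after the decimal point, and it compares against huge() instead of inspecting the exponent.

```fortran
module classify_value_sketch
  use, intrinsic :: iso_fortran_env, only: int32, real64
  implicit none
  private
  public :: read_number

contains

  ! Read one number from text into a double precision real and report
  ! whether it is integral and representable as an INTEGER(4).
  subroutine read_number(text, x, fits_int, ios)
    character(len=*), intent(in) :: text
    real(real64), intent(out) :: x
    logical, intent(out) :: fits_int
    integer, intent(out) :: ios
    character(len=32) :: fmt

    x = 0.0_real64
    fits_int = .false.

    ! Build a Gw.0 edit descriptor whose width is the length of the text.
    write (fmt, '(a,i0,a)') '(G', len(text), '.0)'
    read (text, fmt, iostat=ios) x
    if (ios /= 0) return

    ! Integral value that also fits the INTEGER(4) range.
    fits_int = (x == aint(x)) .and. (abs(x) <= real(huge(1_int32), real64))
  end subroutine read_number

end module classify_value_sketch
```

Note that this classifies by value, so 123 and 123.0 both come back with fits_int true; it does not preserve how the number was written in the file.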

On the other hand, this all seems more complicated than I think is warranted for the hoped-for performance gains, but I don't know the application.

Les_Neilson · ‎05-11-2009

I must admit I held off adding to this thread, as I got confused about the requirements.

My first impression was that Clemens wanted to :

Read First_Number
If (First_Number == INTEGER) then
read all other numbers as INTEGER
call Integer_Process_Path
else
read all other numbers as REAL
call Real_Process_Path
endif

without duplicating the memory space occupied by the data (hence his equivalence type).

But others' replies suggested that the data was mixed integer and real, in which case the problem is more complex and the test has to be done for each number read.

Perhaps Clemens could indicate what he wants to do in a pseudocode style as above?

Les

davinci · ‎05-21-2009

All,

thanks for your fast responses and apologies for my very late reaction, I had to be offline for a while.
To formulate my question more generally:

Many algorithms give the same output, but with different datatypes.
These character output matrices have a mix of integers and reals in every possible combination per record.
Every record has a fixed size (100 columns of ints and reals).
Value sizes are moderate (-10000...0...+10000); zero is a value too.

My program has to process every matrix and record in the same way.
This program is large, complex, still in development.

Your suggestions to determine the input type first and run separate int/real parts aren't feasible with these combinations.
Billsincl's suggestion to convert (character) integers to reals and treat everything as real-only is what I did.

This is a workaround; actually I need to distinguish between int and real.
Hence my search for some combined datatype, processing the proper datatype part on the fly.
I'm going to experiment with Jim's suggestions (Type-Union-Map...) and post the results.

Thanks all,
Clemens (davinci)
