Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

Speed penalty when mixing real and real8 or double

Wee_Beng_T_
Beginner
Hi,
I'm writing a CFD code. I used to declare real8 for all real variables.
However I found that it's taking too much memory now that my grid size has increased.
I am thinking of only using real8 when I need to assemble the matrix for the iterative matrix solver, while using real for the rest.
Will there be a speed penalty when mixing real and real8 variables? I will most likely be multiplying real and real8 variables in a 2D real8 array.
I read somewhere that changing real to real8 has no effect, but the reverse will have an effect. However, that was for some other, older Fortran compiler.
Not sure if I should stick to using real8 for all variables...
Also, does using dsin instead of sin make a difference? I read that nowadays the compiler will automatically switch depending on whether the data is real or real8.
Thanks!
6 Replies
mriedman
Novice

For the second question: you don't need to care about the intrinsic functions. The compiler will select the correct one for you; just use the generic sin.
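A minimal sketch of that generic resolution (variable names are illustrative):

```fortran
program generic_sin
  implicit none
  real(4) :: xs = 1.0_4
  real(8) :: xd = 1.0_8
  ! The generic SIN resolves to the single-precision intrinsic here...
  print *, sin(xs)
  ! ...and to the double-precision intrinsic here; no explicit DSIN needed.
  print *, sin(xd)
end program generic_sin
```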

The first one is difficult, as there are various instruction scenarios to consider. I wouldn't be surprised if the compiler refuses to do SSE vectorization when doing mixed precision. I suggest you use -vec-report to check that for the kernel loops of your solver.

But the most important thing is: can you really afford to reduce precision, with all the consequences that will have (solver convergence behaviour, result deviation, quality assurance)?

Ron_Green
Moderator
Yes, mixing datatypes can have an impact on performance, no question, especially if you have version 11.1 or older compilers. Those older compilers will not vectorize loops where you are mixing datatypes in the candidate loops. In 12.x there were improvements in the vectorizer that allow more vectorization in these mixed-datatype cases, so for sure you need to get the 12.1 compiler. Use the -guide option; its advice should help you determine where you are losing vectorization.

I'd run the original -r8 code through the compiler with -vec-report and save that output to a file. Then try the new modified code with -vec-report. Diff the two files to see where you are losing vectorization. Then use -guide on those loops to see what the compiler doesn't like about them and what suggestions it has. Of course, focus only on the hot loops (you may need to use the new -profile-loops option to profile the code and find your hotspots).

You'll want the appropriate -x option of course.
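A sketch of that workflow on the command line (the file name, output names, and the -xSSE4.2 target are illustrative; substitute the -x option that matches your hardware):

```shell
# Baseline: everything promoted to double with -r8; capture the vectorization report.
ifort -O2 -xSSE4.2 -r8 -vec-report2 solver.f90 -o solver_r8 2> vec_r8.txt

# Modified mixed-precision code, same report settings.
ifort -O2 -xSSE4.2 -vec-report2 solver.f90 -o solver_mixed 2> vec_mixed.txt

# Compare the two reports to find loops that stopped vectorizing.
diff vec_r8.txt vec_mixed.txt

# Ask the compiler for tuning advice on the problem loops.
ifort -O2 -xSSE4.2 -guide solver.f90

# Profile loop-level hotspots so you only chase the loops that matter.
ifort -O2 -xSSE4.2 -profile-loops=all solver.f90 -o solver_prof
```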

The impact will depend on how much your code was able to use vectorization.

Of course, data fetching, data cache effects could come into play. But with less data to move around with single precision, I can only see goodness in the change.

ron
jimdempseyatthecove
Honored Contributor III
>>I am thinking of only using real8 when I need to assemble the matrix for the iterative matrix solver, while others using real.

See if your program lends itself to

{real(4), real(4),...real(4)} (batch 1)
{real(4), real(4),...real(4)} (batch 2)
...
{real(4), real(4),...real(4)} (batch n)

and where your computational process can be:
do i = 1, nBatches
  call convertToReal8(batchAsReal8, batch(i))
  call processBatch()
  call convertToReal4(batch(i), batchAsReal8)
end do

When processBatch() uses the data in the converted array more than once then this might be a viable option.

Don't forget about possible convergence issues, as mentioned in an earlier post. If convergence is detected wholly within processBatch(), without the data being exported to real(4) and re-imported as real(8), then you should see little effect. However, when convergence is detected after the data has been exported to real(4) and re-imported as real(8), you will have to take into consideration the effects of the rounding (due to the precision difference).
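A minimal sketch of the batch pattern described above (the batch count, batch length, and the processBatch body are illustrative placeholders):

```fortran
program batch_convert
  implicit none
  integer, parameter :: nBatches = 4, batchLen = 1000
  real(4) :: batch(batchLen, nBatches)   ! data stored compactly in single precision
  real(8) :: batchAsReal8(batchLen)      ! one batch widened for computation
  integer :: i

  call random_number(batch)

  do i = 1, nBatches
     batchAsReal8 = real(batch(:, i), 8) ! convert once on the way in
     call processBatch(batchAsReal8)     ! reuse the real(8) copy many times here
     batch(:, i) = real(batchAsReal8, 4) ! convert once on the way out
  end do

contains
  subroutine processBatch(b)
    real(8), intent(inout) :: b(:)
    ! placeholder for the real work; all arithmetic stays in real(8)
    b = b * 2.0_8
  end subroutine processBatch
end program batch_convert
```

The conversion cost is paid once per batch rather than once per use, which is what makes the pattern worthwhile when the data is reused.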

Jim Dempsey
Wee_Beng_T_
Beginner
Thanks ppl!
I will try out Ronald's suggestions to try to find the differences.
I don't really understand Jim's answers. Are you saying I need to add some subroutines for the conversion from real to real8 and vice versa?
Basically, I create a real8 nxn matrix mat.
Then I:
do j = 1, n
  do i = 1, n
    mat(i,j) = dist*a*b*c
  end do
end do
dist, a, b and c can be real or real8 variables. I just feel that some variables, like dist (distance), do not need to be real8, so declaring them as real saves storage space.
Thanks!
jimdempseyatthecove
Honored Contributor III
>>I don't really understand Jim's answers. Are you saying I need to add some subroutines for the conversion from real to real8 and vice versa?

No. I said this only pays off if the data is reused several times.

Assume for example you have

n large-ish matrices in real(4)

And you wish to perform a transformation via matrix multiply by

one or several large-ish matrices in real(4)

And you want the better precision of the DOT products by using REAL(8) in computation.
(Note: in the n x n situation, each cell of each input array gets referenced n times, IOW reused many times.)

In this case, it would be best to convert each matrix from real(4) to real(8), perform the matrix multiply (use MKL if you have it), then convert the results back. This is much better than converting each element from real(4) to real(8), producing an element-by-element product, summing into a real(8) temporary, then storing/converting the sum from real(8) back to real(4) in the results array. Note that although the source code will look simpler with the conversion in the second method, the performance will be significantly faster with conversion of the whole arrays, because that makes the code more suitable for vectorization.
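A sketch of the whole-array approach (sizes are illustrative, and the intrinsic matmul stands in for an MKL gemm call):

```fortran
program widen_then_multiply
  implicit none
  integer, parameter :: n = 64
  real(4) :: a4(n,n), b4(n,n), c4(n,n)  ! compact single-precision storage
  real(8) :: a8(n,n), b8(n,n), c8(n,n)  ! working copies for the dot products

  call random_number(a4)
  call random_number(b4)

  a8 = real(a4, 8)     ! convert each matrix once
  b8 = real(b4, 8)
  c8 = matmul(a8, b8)  ! all n accumulations per cell happen in real(8)
  c4 = real(c8, 4)     ! convert the result back once
end program widen_then_multiply
```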

RE:

do j=1,n
do i=1,n
mat(i,j) = dist*a*b*c
end do
end do

In the above, the compiler optimizations in a Release build should lift this expression outside the loop (including the conversion from real(4) to real(8)).

However, should the variables in the expression vary within the loop, then the conversion code is forced inside the loop (thus making the code run slower).

In the above simplification you are saving 16 bytes of memory for the data storage, but you may be adding that back as code for the conversions.
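A sketch of the hoisted form the optimizer effectively produces when dist, a, b and c are loop-invariant (tmp is an illustrative name; the compiler performs this transformation itself):

```fortran
! Convert and multiply once, outside the loop, then fill the matrix.
real(8) :: tmp
tmp = real(dist,8) * real(a,8) * real(b,8) * real(c,8)
do j = 1, n
   do i = 1, n
      mat(i,j) = tmp
   end do
end do
```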

Jim Dempsey
Wee_Beng_T_
Beginner
Thanks Jim, I have a clearer picture.
Guess there's lots of stuff to do to ensure I get the speed improvement and the correct answer!