Intel® Fortran Compiler

Performance - Double Precision on 32-bit vs 64-bit

TinTin_9
Beginner
I just got my new Intel quad-core Xeon 45nm workstation with 8 GB RAM, and I was very eager to check how my Fortran code performs on this new system. I compiled my code using quad precision, first on a 32-bit interface and then on a 64-bit interface, and as expected I did see a speed increase of 35% when using quad precision with 64-bit.

But I had no or insignificant speed increase when I used double precision floating-point arithmetic on both 32-bit and 64-bit. I was hoping, and had also read on the Intel website, that I would get a significant performance boost using 64-bit computing power for my double precision and quad precision floating-point arithmetic. Am I missing something here, some lack of proper coding technique for the 64-bit architecture, or some other information? It might also be that some of my compiler options need to be changed; I am using the default maximum speed (/O2) option right now.

Thanks all, Nittin
TimP
Honored Contributor III
The main purpose of 64-bit mode is to support a larger address space than 32-bit mode, not to raise performance of applications which perform well in 32-bit mode. The most common reason for a performance boost with 64-bit recompilation is in taking advantage of additional registers, as the quad precision support library may do. If you have complicated enough loops for loop transformations to offer an advantage, /O3 may make a difference. If you are using an old compiler, an upgrade may help.
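
For concreteness, a sketch of that experiment on the command line (the source file name is made up; /O2 is the default "maximize speed" level the original post mentions):

rem A sketch: /O2 is the default maximize-speed level; /O3 additionally
rem enables high-level loop transformations such as interchange and blocking.
ifort /O2 mysolver.f90
ifort /O3 mysolver.f90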
TinTin_9
Beginner
Thanks for the reply. I am already using the latest version of the IVF compiler on VS 2008. I understand your point now, but the /O3 option didn't have much impact either. Are there any specific programming techniques for a 64-bit application versus a 32-bit one? I know you cannot tell me everything, but if you know of any reference of any sort on this, it would be really helpful.

Thanks, Nittin
Steven_L_Intel1
Employee
There's no difference with floating point between 32-bit and 64-bit mode, assuming that on IA-32 you are using /QxW or a switch to enable SSE2 or higher. As Tim says, the 64-bit environment gives you a larger virtual memory address space. There are also additional registers which can help some, especially with integer calculations.

There is not much special to do, other than being careful when calling OS and library routines that expect address-sized integer arguments.
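
As a minimal sketch of that point (assuming a compiler with the standard ISO_C_BINDING module), an integer that must hold an address can be declared so that it matches the pointer size of whichever build you make:

!-----------------------------------------
program address_sized
   use, intrinsic :: iso_c_binding, only: c_intptr_t
   implicit none
   ! 32 bits in a 32-bit build, 64 bits in a 64-bit build; pass integers
   ! of this kind to OS/library routines that expect an address-sized value.
   integer(c_intptr_t) :: handle
   handle = 0_c_intptr_t
   print *, 'address-sized integer width in bits:', bit_size(handle)
end program address_sized
!-----------------------------------------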
TinTin_9
Beginner
Okay, now I understand this issue better.

A little off-topic question I have: the 'do loop' given below takes 2 seconds to compute if I don't use OpenMP directives.

But using OpenMP as shown below, it takes 60 seconds! Any idea why such a regression in performance? I have a Core 2 Duo processor.

i, s, m, j_val, b, and track(:,:) are integers.
Perm_Mat is a double precision, big, dense 25 MB matrix,
and h, h1, oneby2, and oneby6 are double precision constants.

C(s:m,4) is double precision.

It's basically computing the 4 coefficients of a spline polynomial interpolation:
!-----------------------------------------
!$OMP PARALLEL DO PRIVATE(I) DEFAULT(SHARED)
do i = s, m

   j_val = track(j_imd(i,body),4) + b

   C(i,1) = Perm_Mat(j_val,1)
   C(i,2) = ( Perm_Mat(j_val+1,1) - Perm_Mat(j_val,1) )*h1 - &
            h*oneby6*( 2.00d0*Perm_Mat(j_val,2) + Perm_Mat(j_val+1,2) )
   C(i,3) = Perm_Mat(j_val,2)*oneby2
   C(i,4) = ( Perm_Mat(j_val+1,2) - Perm_Mat(j_val,2) )*h1*oneby6

end do
!$OMP END PARALLEL DO
!-----------------------------------------

Thanks Steve and all

TimP
Honored Contributor III
The most evident problem is that you left j_val as a shared variable, so you have a race condition and can't expect consistent results. The slowness is due to the bad cache behavior this causes: both threads keep writing the same shared location, so its cache line bounces between the cores.
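
For reference, a sketch of the corrected directive; the loop body is exactly as posted, only the PRIVATE clause changes:

!-----------------------------------------
!$OMP PARALLEL DO PRIVATE(i, j_val) DEFAULT(SHARED)
do i = s, m
   ! j_val is now a per-thread temporary: no race, no shared write
   j_val = track(j_imd(i,body),4) + b
   C(i,1) = Perm_Mat(j_val,1)
   C(i,2) = ( Perm_Mat(j_val+1,1) - Perm_Mat(j_val,1) )*h1 - &
            h*oneby6*( 2.00d0*Perm_Mat(j_val,2) + Perm_Mat(j_val+1,2) )
   C(i,3) = Perm_Mat(j_val,2)*oneby2
   C(i,4) = ( Perm_Mat(j_val+1,2) - Perm_Mat(j_val,2) )*h1*oneby6
end do
!$OMP END PARALLEL DO
!-----------------------------------------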
TinTin_9
Beginner
Thanks, I see your point. I also checked some material on the web regarding OpenMP.

I guess I have to rewrite my code. I tried following the tips here on a small part of my code and, guess what, I got a performance boost of 43%. That's impressive (for that part), and enough to motivate me to change my whole code.

Thanks, Tim, and all, for the great help!
jimdempseyatthecove
Honored Contributor III

One other suggestion to improve performance

I noticed on my Q6600 (which has SSE3 but not SSE4.1) that if I compile optimized for SSE4.1 and requiring SSE4.1 (as opposed to optimized for SSE3 and requiring SSE3), the compiler produces better SSE3 code: 15% faster for my application. Out of 750 source files, only one produced code using an SSE4.1 instruction (and so would fail to run with an illegal-instruction error). For that source file only, I compile optimized for SSE3 and requiring SSE3. The rest (except for the main program) are compiled for SSE4.1. The main program has to be compiled with what is available on your platform, as there is a sanity check in the IVF startup code.
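
In command-line terms, a sketch of that arrangement (using the option spellings of that compiler generation, /QxS for SSE4.1 and /QxP for SSE3, as in Tim's reply below; the file names are invented):

rem Most of the 750 files: compiled with SSE4.1 optimization enabled.
ifort /c /QxS solver.f90 mesh.f90
rem The one file that actually picked up an SSE4.1 instruction:
ifort /c /QxP offender.f90
rem The main program, matched to the CPU the binary must run on:
ifort /QxP main.f90 solver.obj mesh.obj offender.obj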

As a note to any Intel people reading this: I recommend adding an option like /warn_sse4 that emits a warning when SSE4.1 instructions are emitted. That way, other users could take advantage of the better SSE3 optimizations performed with SSE4.1 enabled.

An alternative: an option, used in conjunction with "optimized for SSE3 and requires SSE3", that does the following: compile using the SSE4.1 optimizations with the warning enabled; if an SSE4.1 instruction is detected, abort the compilation and restart it using the older SSE3-only optimizations. Compilation would only take longer when SSE4.1 instructions are actually used. In my case, 1 file out of 750 would have compiled twice: a good tradeoff against the 15% performance boost for the SSE3 instruction sequences.

Jim Dempsey

TimP
Honored Contributor III
Jim,
I don't think we can make useful generalizations from your most recent comment. I have noticed that certain optimizations, which were reserved for -QxS until recently, are becoming available with -QxP in the latest compiler updates. Until recently, -QxP, in principle, optimized for the Prescott CPU of 4 years ago, and didn't take advantage of all of the more recent developments in the compiler. None of this necessarily works any one way in 100% of examples.
To be specific, where the compiler vectorizes unaligned memory accesses, it sometimes uses full cache line unrolling, using scalar loads across the cache line boundary, so as to avoid straddling cache lines with movups. In my examples, all cases which do this under -QxT or -QxS do it with the latest compilers under -QxP as well. The extra code expansion could sometimes aggravate instruction cache or TLB misses. That problem might occur more often on certain early models in the Prescott series.
Several of my colleagues on the application side have agreed in requesting that older options should perform well on the current CPUs, while maintaining the documented instruction set compatibility with the older CPUs. That appears to fit with your desire.
By a year from now, many more CPUs will be on the market which perform better with unaligned movups, so the trend toward full cache line unrolling will have to reverse when optimizing for those CPUs. It is already possible that the same optimizations don't work consistently for Core 2 Duo and Penryn, or for desktop and laptop CPUs.
Proliferation of compiler switches is not entirely productive.
Steven_L_Intel1
Employee
In the next major release, we are doing a MAJOR rationalization of instruction set options. Gone are the confusing letters that are meaningless except to those who memorize Intel code names. We're also doing away with the "tune for" options, which have meant less and less in recent years. I like what I am seeing and I think you will too. As Tim says, newer processors handle things such as unaligned loads and stores better.