autoparallelization with test code maxes all cpus but gives zero speedup

nooj · 02-25-2011

Hi, I am testing autoparallelization on ifort v. 11.1 20100401. (I know this is not the latest version. Our department will not upgrade the license until it expires in the Fall.)

I've tried using the test problem on http://software.intel.com/en-us/articles/automatic-parallelization-with-intel-compilers/ :

PROGRAM TEST
PARAMETER (N=10000000)
REAL A, C(N)
DO I = 1, N
  A = 2*I - 1
  C(I) = SQRT(A)
ENDDO
PRINT *, N, C(1), C(N)
END

The code is being run on a Dell PowerEdge T610 running Ubuntu Linux, with two six-core, hyperthreaded 3.33 GHz Xeon processors and 24 GB of shared memory.

When I compile the program with
ifort -traceback $< -o test-serial

the code runs fine. It uses 100% of one cpu and exits normally.

I then compare that to
ifort -traceback $< -o test-parallel -parallel -par-report3

I see the following output when compiling:
ifort -traceback test.f90 -o test-parallel -parallel -par-report3
procedure: test
procedure: test
test.f90(4): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.

This time, the program uses 100% of all cpus and exits normally. The problem is that the parallelized version exits in the same amount of time as the serial version!
(parallel cputime) = (num cores) * (serial cputime).

This makes me sad. I have seen the same behavior on a multi-core Mac Pro. How can I utilize autoparallelization?

- Nooj

Ron_Green · 02-25-2011

Well, I can clear up one point quickly: your license is NOT for a version. Licenses give you access to support, and this includes access to the latest compilers. Your department need only go to https://registrationcenter.intel.com and get the latest compilers.

But getting back to your problem - how are you measuring time? Some timers add up all the individual core times, so a parallel run would look like zero speedup. Also, what resolution is the timer? But more to the point, I just ran this code and it's trivial - it runs in a fraction of a second. Process startup and teardown are going to dominate, as the computation is down in the noise. Plus, the non-parallelized version does not have to set up a thread pool and then tear it down. The cost of simply starting the program on the system is larger than the cost of the little loop - 10 million loop iterations on a cpu that can do 2 gigacycles per second - noise.

You will find a number of threads in this forum about timing and synthetic, homegrown 'benchmarks'. Bottom line - the only true way to test -parallel is to either use a high-precision timer like the one in IPP OR use a real application that takes more than a few seconds to run.
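
As a rough sketch of the timing point, here is the same loop wrapped in two timers. It assumes the standard SYSTEM_CLOCK and CPU_TIME intrinsics rather than the IPP timer mentioned above, and the program and variable names (time_loop, tick_start and so on) are made up for illustration. SYSTEM_CLOCK measures elapsed wall-clock time; CPU_TIME may report CPU time summed over all threads in a threaded run, which is the kind of number that makes a parallel build look like zero speedup.

```fortran
program time_loop
  implicit none
  integer, parameter :: n  = 10000000
  integer, parameter :: i8 = selected_int_kind(18)   ! 64-bit integers for clock ticks
  integer, parameter :: dp = kind(1.0d0)
  real, allocatable :: c(:)
  real :: a, cpu_start, cpu_end
  integer(i8) :: tick_start, tick_end, tick_rate
  integer :: i

  allocate(c(n))   ! heap allocation, so the large array does not depend on stack size

  ! Wall-clock timer: elapsed real time, independent of thread count.
  call system_clock(tick_start, tick_rate)
  ! CPU timer: processor time used, possibly summed over all threads.
  call cpu_time(cpu_start)

  do i = 1, n
    a = 2*i - 1
    c(i) = sqrt(a)
  end do

  call cpu_time(cpu_end)
  call system_clock(tick_end)

  print *, n, c(1), c(n)
  print *, 'wall-clock seconds: ', real(tick_end - tick_start, dp) / real(tick_rate, dp)
  print *, 'cpu seconds:        ', cpu_end - cpu_start
end program time_loop
```

Building this once without and once with -parallel -par-report3, as in the original post, and comparing the wall-clock line between the two runs is the comparison that matters. With only 10 million iterations both numbers will still be tiny, so a larger N or a repeated loop is likely needed before any difference rises above the noise.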

ron