STL: efficiency of vector<T>::operator[] on IA64

schorscherl · ‎09-16-2005

Hi all,

Using a recent version of icpc (9.0-024), I have done some
benchmarking with the vector triad:

for(j=0; j<...; j++) {
  for(i=0; i<...; i++) a[i]=b[i]+c[i]*d[i];
  dummy(a,b,c,d);
}

The performance of this benchmark is best (and as expected,
judging from the hardware specs of the machine) if
a, b, c and d are simple double[] arrays. Using STL vectors,
performance on IA64 breaks down by almost a factor of 5.
On EM64T this effect is much weaker: STL drops by about
30% there. Compiler options for IA64 were:

-O3 -g -openmp -Ob2 -fno-alias -fno-exceptions

Dropping -openmp improves things a little, but a factor of 4
remains. Why does the IA64 compiler cope so badly with STL vector
access?
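
For reference, a minimal sketch of the two inner-loop variants being compared. The function names and signatures are illustrative assumptions, not the actual benchmark code, and the outer repetition loop and the dummy() call are left out:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain-array triad: a[i] = b[i] + c[i] * d[i]
void triad_array(double* a, const double* b, const double* c,
                 const double* d, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        a[i] = b[i] + c[i] * d[i];
}

// Same kernel, but every access goes through vector<double>::operator[]
void triad_vector(std::vector<double>& a, const std::vector<double>& b,
                  const std::vector<double>& c, const std::vector<double>& d) {
    const std::size_t n = a.size();
    assert(b.size() == n && c.size() == n && d.size() == n);
    for (std::size_t i = 0; i < n; ++i)
        a[i] = b[i] + c[i] * d[i];
}
```

Both functions compute the same result; only the way the compiler sees the memory accesses differs.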

Thanks in advance,
Georg.

TimP · ‎09-18-2005

You haven't presented enough information to comment intelligently. The first place to look for an answer would be in your opt_report.

schorscherl · ‎09-19-2005

Ok, you are right of course, but in the presence of inlining it is
quite hard to map the swp reports to the correct loops. Anyway, here
is what the compiler says about the STL loop:

---------------------------------------------------------------------------
Swp report for loop at line 172 in _Z9stl_triadii in file /usr/include/g++/bits/stl_iterator.h

Loop at line 631: unrolled

Resource II = 7
Recurrence II = 1
Minimum II = 7
Scheduled II = 7

Estimated GCS II = 12

Percent of Resource II needed by arithmetic ops = 86%
Percent of Resource II needed by memory ops = 86%
Percent of Resource II needed by floating point ops = 43%

Number of stages in the software pipeline = 14
---------------------------------------------------------------------------

(Why it thinks the loop is in stl_iterator.h I have no idea, but
it is the correct loop.)
This does not look too bad, and the performance drop compared to
the vanilla triad is now only a factor of 2.5. I arrived at this by placing

#pragma unroll(6)
#pragma ivdep

in front of the loop (both are needed; leaving out either of the two
pragmas does not help). Unrolling alone did not help because the compiler
reported loop-carried dependencies and refused to swp. Unrolling by 6 or 8
gives the best results; anything above or below is worse.
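
To make the placement concrete, here is a sketch of how the two pragmas sit in front of the STL loop. The function name is hypothetical and the surrounding code is assumed; compilers that do not know these pragmas will typically just ignore them, possibly with a warning:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the STL triad loop discussed above.
// The caller must guarantee that a does not overlap b, c or d,
// since ivdep tells the compiler to ignore assumed dependencies.
void stl_triad_pragmas(std::vector<double>& a, const std::vector<double>& b,
                       const std::vector<double>& c,
                       const std::vector<double>& d) {
    const std::size_t n = a.size();
    assert(b.size() == n && c.size() == n && d.size() == n);
#pragma unroll(6)
#pragma ivdep
    for (std::size_t i = 0; i < n; ++i)
        a[i] = b[i] + c[i] * d[i];
}
```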

For comparison, this is the swp report for the vanilla triad:

---------------------------------------------------------------------------
Swp report for loop at line 201 in _Z9std_triadii in file numa-ctor.cc

Loop at line 202: unrolled loadpair-ver-1

Resource II = 8
Recurrence II = 1
Minimum II = 8
Scheduled II = 8

Estimated GCS II = 11

Percent of Resource II needed by arithmetic ops = 75%
Percent of Resource II needed by memory ops = 88%
Percent of Resource II needed by floating point ops = 50%

Number of stages in the software pipeline = 3
---------------------------------------------------------------------------
---------------------------------------------------------------------------
Swp report for loop at line 201 in _Z9std_triadii in file numa-ctor.cc

Loop at line 202: unrolled loadpair-ver-2

Resource II = 10
Recurrence II = 1
Minimum II = 10
Scheduled II = 11

Estimated GCS II = 14

Percent of Resource II needed by arithmetic ops = 80%
Percent of Resource II needed by memory ops = 90%
Percent of Resource II needed by floating point ops = 40%

Number of stages in the software pipeline = 2
---------------------------------------------------------------------------

Nevertheless, the factor of 2.5 remains. Are there any other steps
I could take apart from the pragmas?

Thanks,
Georg.

TimP · ‎09-19-2005

The compiler is attempting to be helpful, by pointing out where it found the iterator code. You might look for typical STL problems there, such as using (++pointer != pointer_one_past_end) as the loop continuation condition, without a check for sane initial values, such as (pointer < pointer_one_past_end). A loop set up for an unambiguous loop count improves optimization.
Unroll by 8 is a typical strategy for issuing software prefetch at reasonable intervals, without predication.
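
As an illustration of the two loop shapes mentioned above, here is a sketch with hypothetical function names. The first loop ends on a != comparison against a one-past-the-end pointer, as iterator code usually does; the second states the trip count explicitly. Whether the difference matters for a particular compiler version is an assumption to check against the opt report:

```cpp
#include <cassert>
#include <cstddef>

// Iterator-style loop: the trip count is only implied by the
// "not equal to end" test, and nothing checks that a <= a_end.
void triad_not_equal(double* a, double* a_end, const double* b,
                     const double* c, const double* d) {
    for (; a != a_end; ++a, ++b, ++c, ++d)
        *a = *b + *c * *d;
}

// Counted loop: the trip count n is explicit and known on entry.
void triad_counted(double* a, const double* b, const double* c,
                   const double* d, std::ptrdiff_t n) {
    assert(n >= 0);
    for (std::ptrdiff_t i = 0; i < n; ++i)
        a[i] = b[i] + c[i] * d[i];
}
```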

schorscherl · ‎09-27-2005

Hi,

I have found out that the compiler can optimize the loop perfectly
with OpenMP switched off if a, b, c, d are not full vector
objects but iterators addressing the first element. With OpenMP
activated, the compiler silently refuses to parallelize the loop.
I'll submit a bug report on Premier.
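
A sketch of what this workaround looks like, with a hypothetical function name and assumed signatures: the iterators are taken once outside the loop and indexed inside it, instead of calling operator[] on the vector objects themselves.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

void triad_via_iterators(std::vector<double>& va,
                         const std::vector<double>& vb,
                         const std::vector<double>& vc,
                         const std::vector<double>& vd) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(va.size());
    assert(vb.size() == va.size() && vc.size() == va.size() &&
           vd.size() == va.size());
    // Iterators addressing the first element of each vector
    std::vector<double>::iterator a = va.begin();
    std::vector<double>::const_iterator b = vb.begin();
    std::vector<double>::const_iterator c = vc.begin();
    std::vector<double>::const_iterator d = vd.begin();
    // Random-access iterators support indexing directly
    for (std::ptrdiff_t i = 0; i < n; ++i)
        a[i] = b[i] + c[i] * d[i];
}
```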

Thanks,
Georg.

Message Edited by schorscherl on 09-27-2005 11:32 AM