Intel® Fortran Compiler

help understanding fp performance with L2 cache

rosing
Beginner
I'm trying to optimize some Fortran 90 code using the 8.1 compiler on Itanium. The kernel is about 2000 lines long and runs over a spatial grid of cells; at each cell it does Newton iterations until something converges.

When we first measured the performance we were getting 2-3% of peak (flops/(cycles*4), since there are 4 fp units: 2 adders and 2 multipliers) and noticed lots of fp pipeline stalls. I vectorized the code so it would run over a number of cells at a time, which raised the performance to about 15% of peak with a vector length of about 100. So making the vector longer makes it easier for the compiler to get performance. However, once the vector gets too big not everything fits in L2 cache and the performance starts dropping off rather drastically. Unfortunately we have problems where the vector length has been pushed down to around 25 (more data to work on).
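To make the restructuring concrete, here is a stripped-down sketch of the blocked form (a toy equation, x**2 = c, stands in for the real per-cell system, and the convergence test is simplified to a single blockwise check):

subroutine newton_block(x, c, nvec, tol, maxit)
  implicit none
  integer, intent(in)    :: nvec, maxit
  real(8), intent(in)    :: tol, c(nvec)
  real(8), intent(inout) :: x(nvec)
  real(8) :: res(nvec)
  integer :: i, it

  do it = 1, maxit
     ! one Newton update across the whole block of cells;
     ! no dependences between iterations, so the loop can pipeline
     do i = 1, nvec
        res(i) = x(i)*x(i) - c(i)
        x(i)   = x(i) - res(i) / (2.0d0*x(i))
     end do
     if (maxval(abs(res)) < tol) exit
  end do
end subroutine newton_block

Here nvec is the vector length discussed above; the inner loop over cells is long and dependence-free, which seems to be what lets the compiler do better.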

First question: any suggestions?
Second question: Since there is no L1 cache for fp and the L2 cache has a 6 clock cycle latency, how can I get better than 1/6th the performance of the processor? I realize I need more locality, but that goes against the need for vectorization. Since there's no L1, I have to fit everything in registers. How many registers are there, and how do I get the compiler to make better use of them?

Thanks.
TimP
Honored Contributor III
I posted a response to the first part of your question. May try again if it doesn't show up.

2nd question: Don't try to persuade the compiler to make better use of registers until you have looked at what it is doing already. When your loop is software pipelined, the compiler typically allows about 9 cycles for each L2 cache access to complete without stalling, by scheduling the pipelined operations accordingly. So, assuming you have engaged software prefetching, you pay the cache access penalty only on the first trip through the loop.
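For example (spellings from memory, so check them against the 8.1 documentation; kernel.f90 is just a placeholder for your source file):

ifort -c -O3 -opt_report -opt_report_phase ecg_swp kernel.f90

The ecg_swp phase report should tell you which loops were software pipelined, so you can see what the compiler is already doing before you try to second-guess it.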
TimP
Honored Contributor III
Easy ways to deal with cache misses:

Take a look at -O3, which is the easy way to get software prefetch with the 8.1 compiler, and try the associated opt_report options.

For short loops, the 9.0 option -mP2OPT_hlo_pref_initial_vals=N may be effective; N is the total number of cache lines you want fetched prior to each loop, and 6 may be a good starting point for single precision.
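For instance, with the 9.0 compiler (this is an internal switch, so treat the exact behaviour as approximate; kernel.f90 is a placeholder):

ifort -c -O3 -mP2OPT_hlo_pref_initial_vals=6 kernel.f90

That asks for 6 cache lines to be fetched ahead of each loop, as described above.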

My usual verbose reply hasn't shown up.