Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

## Access Time to the L1 Data Cache

Beginner
621 Views
Hi,

I have a small piece of code for which I am analyzing the clock cycles required.
The pseudo code is as follows:

```c
loop() {
    a0 = random() % 256; a1 = random() % 256;
    a2 = random() % 256; a3 = random() % 256;

    b0 = random() % 256; b1 = random() % 256;
    b2 = random() % 256; b3 = random() % 256;

    start = timestamp()  // using the rdtsc instruction
    x0 = array1[a0]; y0 = array2[b0];
    x1 = array1[a1]; y1 = array2[b1];
    x2 = array1[b2]; y2 = array2[a2];
    x3 = array1[b3]; y3 = array2[a3];
    stop = timestamp()

    totaltime += (stop - start);
}
```

1. array1 and array2 are globally defined unsigned char arrays of 256 bytes each.
2. I run this loop 2^23 times and find the average time.
3. Then I plot a graph with the value of a0 on the x-axis and the average time taken on the y-axis.
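For concreteness, here is a minimal C sketch of this kind of measurement. The function name, fill values, and iteration count are my own; `__rdtsc()` is the GCC/Clang intrinsic for the rdtsc instruction, and `volatile` keeps the compiler from optimizing the loads away:

```c
#include <stdint.h>
#include <stdlib.h>
#include <x86intrin.h>   /* __rdtsc(), x86 only */

static unsigned char array1[256], array2[256];

/* Time `iters` pairs of random one-byte loads and return the
   average rdtsc cycle count per timed window. */
double average_access_cycles(long iters)
{
    uint64_t totaltime = 0;

    for (int i = 0; i < 256; i++)            /* touch both arrays once */
        array1[i] = array2[i] = (unsigned char)i;

    for (long i = 0; i < iters; i++) {
        int a0 = rand() % 256, b0 = rand() % 256;

        uint64_t start = __rdtsc();
        volatile unsigned char x0 = array1[a0];
        volatile unsigned char y0 = array2[b0];
        uint64_t stop = __rdtsc();

        (void)x0; (void)y0;
        totaltime += stop - start;
    }
    return (double)totaltime / (double)iters;
}
```

Note that back-to-back rdtsc reads measure the whole window, including the timing overhead itself, not the individual loads.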

I find that the average time taken is highly dependent on the value of a0.
I am unable to find a reason for this. Does this mean that the access time to
the cache depends on the value of the address being accessed?

I am using an Intel Core 2 Duo with a 32 KB L1 data cache, 8-way associativity,
and a 64-byte line size.

Any help in this regard would be greatly appreciated.

Regards

Chet
3 Replies
Honored Contributor III
Did you write random()?

If not, there may be some undesirable runtime interaction issues with your program.
Obtain or write your own pseudo-random number generator that is short and does not make system calls.
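One common choice that fits this description is Marsaglia's xorshift32, sketched here (seed value and names are my own; any nonzero seed works):

```c
#include <stdint.h>

/* xorshift32: tiny pseudo-random generator, no system calls.
   Shift constants are from Marsaglia's "Xorshift RNGs" paper. */
static uint32_t rng_state = 2463534242u;   /* any nonzero seed */

static uint32_t xorshift32(void)
{
    uint32_t x = rng_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return rng_state = x;
}
```

Usage inside the loop would then be, e.g., `a0 = xorshift32() % 256;`.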

Align array1 at 2048 bytes (or at least 512), and place array2 immediately after array1.

Now you have consistent memory placement for array1 and array2 and both are within the same VM page.
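One way to get that placement in C, assuming GCC/Clang attribute syntax (struct and variable names are mine), is to pack both arrays into one aligned struct:

```c
#include <stdint.h>

/* Pack the two arrays back-to-back and align the pair to 2048 bytes,
   so both fit within a single 4 KiB virtual memory page. */
struct aligned_arrays {
    unsigned char array1[256];
    unsigned char array2[256];   /* starts immediately after array1 */
} __attribute__((aligned(2048)));

static struct aligned_arrays buf;
```

Because the struct is 512 bytes and 2048-byte aligned, it can never straddle a 4 KiB page boundary.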

Perform the per-a0 summation within your inner tight loop:

totalsPer_a0[a0] += totaltime

Note, this will add 256*8 bytes to the L1 cache consumption. If the totals array is in the same VM page as the other two arrays, this will put less pressure on your TLB cache.
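As a sketch, the per-a0 logging could look like this (function and array names are mine; a counts array is added so a true per-a0 average can be formed afterwards):

```c
#include <stdint.h>

static uint64_t totalsPer_a0[256];   /* 256 * 8 = 2 KiB of extra L1 footprint */
static uint64_t countsPer_a0[256];   /* samples per a0 value, for averaging */

/* Call inside the timed loop, after stop = timestamp(). */
static void log_sample(int a0, uint64_t cycles)
{
    totalsPer_a0[a0] += cycles;
    countsPer_a0[a0] += 1;
}
```

The average for each x-axis point is then `totalsPer_a0[a0] / countsPer_a0[a0]`.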

Cache access may be more effective when subsequent reads lie within the same cache line and when subsequent reads are located at a higher address. Therefore you will find a variance in time not with the placement of a0, but with the relative displacement of the other indexes (relative to each other as well as to a0).

I wouldn't be surprised if your chart were the summation of two saw-tooth charts. One with a period of 64 and the other with a period of 256.

Jim Dempsey
Beginner
Thanks for your reply Jim. I have a few questions to ask.

1. 'Cache access may be more effective when subsequent reads lie within the same cache line'

Are you considering a cache line which is not currently present in the cache, so that the first read will be a miss and the second will be a hit?
Or are you talking about a cache line which is already present in the cache, in which case both accesses will be hits?

2. 'Cache access may be more effective ........ when subsequent reads are located at higher address.'

Are you hinting at the data prefetching feature of the L1D cache? If so, shouldn't decrementing the address also cause the prefetch to happen? Also, since the memory accesses are at random locations, the prefetcher would never get triggered.

3. 'I wouldn't be surprised if your chart were the summation of two saw-tooth charts. One with a period of 64 and the other with a period of 256.'

The plots don't seem to have any particular trend. Why do you say that saw-tooth periods would exist?

-Chet

Honored Contributor III
>>'Cache access may be more effective when subsequent reads lay within the same cache line ' Are you considering a cache line which is not currently present in cache. So the first read will be a miss and the second will be a hit or are you talking about a cache line which is already present in the cache. In which case both accesses will be hits.

Pseudo code: a0 is initially random()%256; then on the 2nd and later iterations of the loop,

```c
a0 = (a0 & ~63) + (random() % 64);  // i.e. only the low 6 bits vary
...
xx = array1[a0];
```

Not that the above is how you coded it, but it is an example of what I meant.
The above assumes the prior cache line of array1[a0] has not been evicted.

>>2. 'Cache access may be more effective ........ when subsequent reads are located at higher address.'
Are you hinting at the data prefetching feature of the L1D cache. If so, then shouldn't decrementing the address also cause the prefetch to happen ? Also since the memory accesses are at random locations, so the prefetcher would never get triggered.

This depends on the cache design. When the data is not in cache:

- Some designs fetch only the cache line in which the data resides (or both lines, in the case of a split access)
- Some designs fetch that line plus the following line
- Some designs fetch two cache lines aligned at 2x the cache line size

In your code snip, after the 1st iteration the data should be in cache, so this section would not apply to your situation. That is, unless your data were organized such that they required more TLB entries than were available. I think the smallest TLB on current x86 is 8 entries (IOW, at least up to 8 virtual memory pages can be mapped at once). Worst case:

    Code loop spanning pages:    2  =  2
    Stack spanning pages:       +2  =  4
    Array1 spanning pages:      +2  =  6
    Array2 spanning pages:      +2  =  8

And if a0 were global data, +1 page = 9.
Then add to this the page(s) for your data logging.

On a system with TLB size of 8 you could see significant overhead.
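The page arithmetic above can be checked with a small helper (the function name is mine, and it assumes 4 KiB pages):

```c
#include <stdint.h>
#include <stddef.h>

/* Count the 4 KiB virtual memory pages touched by a buffer of `len`
   bytes starting at `addr` (i.e. TLB entries needed to cover it). */
size_t pages_spanned(uintptr_t addr, size_t len)
{
    const uintptr_t page = 4096;
    uintptr_t first = addr / page;
    uintptr_t last  = (addr + len - 1) / page;
    return (size_t)(last - first + 1);
}
```

A 256-byte array needs one entry if page-aligned, but two if it happens to straddle a page boundary, which is where the "spanning pages: 2" worst case comes from.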

>>3. 'I wouldn't be surprised if your chart were the summation of two saw-tooth charts. One with a period of 64 and the other with a period of 256.'
The plots don't seem to have any particular trend. Why do you say that saw-tooth periods would exist?

They may have a shape; it could be more a pointy tooth (/\ or \/) over pointy tooth, as opposed to a saw tooth (/| or |/) over saw tooth.

Assuming each of your array index sequences is random (the code outline said they were).
Assuming that a subsequent same-cache-line access has different timing than a different-cache-line access.
Assuming that a next-cache-line access has different timing than a prior-cache-line access.
Assuming that a forward cache line access (beyond the next line) has different timing than a backwards reference.
IOW, placement matters.

However, your one-byte-element arrays of size 256 might make the variance small.

When a0 is 0 on iteration n, the probability is high (1 - 1/256) that on iteration n+1 a0 will be larger.
When a0 is 255 on iteration m, the probability is high that on iteration m+1 a0 will be smaller.
As a0 approaches the midpoint, 127/128, the probability approaches 0.5.
When a0 > 63 and < 192, the probability is high (3/4) that the next iteration's data will be within 0-1 cache lines
... with respect to placement within the cache line.
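The 3/4 figure can be verified exhaustively over all 256 equally likely next indexes (helper name is mine; 64-byte lines assumed):

```c
/* For a given a0, return the probability that a uniformly random next
   index in 0..255 lands in the same 64-byte cache line as a0 or in
   one of its two immediate neighbours. */
double prob_within_one_line(int a0)
{
    int line = a0 / 64;
    int hits = 0;
    for (int next = 0; next < 256; next++) {
        int d = next / 64 - line;
        if (d >= -1 && d <= 1)
            hits++;
    }
    return hits / 256.0;
}
```

For a0 in the two middle lines (64..191) three of the four lines qualify, giving 3/4; for a0 in an end line only two lines qualify, giving 1/2.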

Depending on the processor design, the processor may use a MUX to select 1 byte out of the 64-byte cache line, or it may use a barrel shifter. The MUX method may have a fixed time regardless of position, whereas the barrel shifter may vary in timing depending on the number of shifts. On older-generation processors you had a smaller internal data path and a combination of MUX and barrel shifter (or swapper) was involved; therefore as the index varied within the cache line, so did the instruction time.

I would expect some pattern to emerge. Note you will have some noise in your data due to three additional random indexes. The juxtaposition between indexes as well as individual displacements between iterations will have an effect on the timing. Data logging this though may trash your statistics.

Jim Dempsey