>>'Cache access may be more effective when subsequent reads lie within the same cache line.' Are you considering a cache line which is not currently present in the cache, so the first read will be a miss and the second will be a hit? Or are you talking about a cache line which is already present in the cache, in which case both accesses will be hits?
Pseudo code: a0 is initially random()%256; on the 2nd and later iterations of the loop,
a0 = (a0 & ~63) + (random() % 64)    // i.e. only the low 6 bits vary, so a0 stays within the same 64-byte-aligned block
...
xx = array1[a0]
Not that the above is how you coded it; it is just an example of what I meant. The above assumes the prior cache line of array1[a0] has not been evicted.
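If it helps, here is a minimal compilable sketch of that idea (my own illustration, not your code): only the low 6 bits of the index vary, so every read after the first should hit the same 64-byte line. The use of __rdtscp for timing, the iteration count, and the array size are my assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <x86intrin.h>                      /* __rdtscp */

static uint8_t array1[256];

int main(void)
{
    unsigned aux;
    volatile uint8_t xx;
    int a0 = random() % 256;                /* first access: may miss */
    for (int i = 0; i < 1000; i++) {
        a0 = (a0 & ~63) + (random() % 64);  /* only low 6 bits vary: same cache line */
        uint64_t t0 = __rdtscp(&aux);
        xx = array1[a0];                    /* should hit after the 1st iteration */
        uint64_t t1 = __rdtscp(&aux);
        printf("%d %llu\n", a0, (unsigned long long)(t1 - t0));
    }
    return 0;
}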
>>2. 'Cache access may be more effective ... when subsequent reads are located at higher addresses.'
Are you hinting at the data prefetching feature of the L1D cache? If so, shouldn't decrementing the address also cause the prefetch to happen? Also, since the memory accesses are at random locations, the prefetcher would never get triggered.
This depends on the cache design. When the data is not in cache (see the probe sketched after this list):
Some designs fetch only the cache line in which the data resides (or both lines, in the case of a split access)
Some designs fetch that line plus the following line
Some designs fetch two cache lines aligned at 2x the cache line size
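For what it's worth, here is a rough way to probe for adjacent-line fetch on your own machine (my sketch, not part of your code; it assumes x86 with clflush available): flush three neighbouring lines, demand-load line N, then time reads of line N+1 and line N-1. A large timing difference between the two gives a hint about the fetch policy.

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>                      /* _mm_clflush, _mm_mfence, __rdtscp */

#define LINE 64
static uint8_t buf[16 * LINE] __attribute__((aligned(LINE)));

static uint64_t timed_read(volatile uint8_t *p)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    uint8_t v = *p;                         /* the read being timed */
    uint64_t t1 = __rdtscp(&aux);
    (void)v;
    return t1 - t0;
}

int main(void)
{
    volatile uint8_t *line_n    = buf + 8 * LINE;
    volatile uint8_t *line_next = buf + 9 * LINE;
    volatile uint8_t *line_prev = buf + 7 * LINE;
    for (int i = 0; i < 10; i++) {
        _mm_clflush((const void *)line_n);  /* evict line N and its neighbours */
        _mm_clflush((const void *)line_next);
        _mm_clflush((const void *)line_prev);
        _mm_mfence();
        uint8_t v = *line_n;                /* demand miss on line N */
        (void)v;
        _mm_mfence();
        printf("N+1: %3llu cycles   N-1: %3llu cycles\n",
               (unsigned long long)timed_read(line_next),
               (unsigned long long)timed_read(line_prev));
    }
    return 0;
}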
In your code snippet, after the 1st iteration the data should be in cache, so this section would not apply to your situation. That is, unless your data were organized such that they required more TLB entries than were available. I think the smallest TLB on current x86 processors is 8 entries (IOW at least up to 8 virtual memory pages can be mapped at once). Worst case:
Code loop spanning pages     2 = 2
Stack spanning pages        +2 = 4
Array1 spanning pages       +2 = 6
Array2 spanning pages       +2 = 8
And if a0 were global data  +1 = 9
Then add to this the page(s) for your data logging. On a system with a TLB of 8 entries you could see significant overhead.
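As a sanity check on that tally, you can compute how many pages each of your objects actually spans from its address and size. A simple helper (my addition; 4 KiB pages assumed, and the example addresses are made up for illustration):

#include <stdio.h>
#include <stdint.h>

#define PAGE 4096u

static unsigned pages_spanned(uintptr_t addr, size_t len)
{
    uintptr_t first = addr / PAGE;          /* page of the first byte */
    uintptr_t last  = (addr + len - 1) / PAGE;  /* page of the last byte */
    return (unsigned)(last - first + 1);
}

int main(void)
{
    /* a 256-byte array that happens to straddle a page boundary -> 2 pages */
    printf("array1: %u page(s)\n", pages_spanned(0x1000 - 128, 256));
    /* the same array placed well inside one page -> 1 page */
    printf("array1: %u page(s)\n", pages_spanned(0x2000 + 512, 256));
    return 0;
}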
>>3. 'I wouldn't be surprised if your chart were the summation of two saw-tooth charts. One with a period of 64 and the other with a period of 256.'
The plots don't seem to have any particular trend. Why would you say that the saw-tooth periods exist?
They may have a shape; it could be more of a pointy tooth /\ or \/ over pointy tooth, as opposed to a saw tooth /| or |/ over saw tooth.
Assuming each of your array index sequences were random (your code outline said they were),
Assuming that same-cache-line access has different timing than different-cache-line access,
Assuming that next-cache-line access has different timing than prior-cache-line access,
Assuming that forward cache line access (beyond the next line) has different timing than a backwards reference,
IOW placement matters.
However, your one-byte-element array of size 256 (four 64-byte cache lines) might make the variance small:
When a0 is 0 on iteration n, the probability is high that on iteration n+1 a0 will be larger (1 - 1/256).
When a0 is 255 on iteration m, the probability is high that on iteration m+1 a0 will be smaller (255/256).
As a0 approaches the mid point (127/128), the probability approaches 0.5.
When a0 > 63 and < 192, the probability is high that the next iteration's data will be within 0-1 cache lines (3/4).
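That 3/4 figure is easy to verify by brute force (my addition; it simply enumerates all 256 equally likely next indices for each a0 in the middle two lines):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    long hits = 0, total = 0;
    for (int a0 = 64; a0 < 192; a0++) {     /* a0 in the middle two cache lines */
        int line = a0 / 64;
        for (int next = 0; next < 256; next++) {
            if (abs(next / 64 - line) <= 1) /* next index within 0-1 lines of a0's line */
                hits++;
            total++;
        }
    }
    printf("P(next within 0-1 lines) = %.4f\n", (double)hits / total);  /* prints 0.7500 */
    return 0;
}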
Now, with respect to placement within the cache line:
Depending on the processor design, the processor may use a MUX to select 1 byte out of the 64-byte cache line, or it may use a barrel shifter to select that byte. The MUX method may have a fixed time regardless of position, whereas the barrel shifter may vary in timing depending on the number of shifts. On older-generation processors you had a smaller internal data path, and a combination of MUX and barrel shifter (or byte swapper) was involved; therefore, as the index varied within the cache line, so did the instruction time.
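If you want to isolate that effect, you could time repeated L1D hits at each byte offset within a single resident line. A rough sketch (my assumption on method; the loop overhead is included in the measurement, so only relative differences across offsets matter):

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>                      /* __rdtscp */

#define LINE 64
#define REPS 100000
static volatile uint8_t line[LINE] __attribute__((aligned(LINE)));

int main(void)
{
    unsigned aux;
    for (int off = 0; off < LINE; off++) {
        uint8_t warm = line[off];           /* warm the line into L1D */
        (void)warm;
        uint64_t t0 = __rdtscp(&aux);
        for (int r = 0; r < REPS; r++) {
            uint8_t v = line[off];          /* repeated L1D hits at this byte offset */
            (void)v;
        }
        uint64_t t1 = __rdtscp(&aux);
        printf("offset %2d: %.2f cycles/load\n", off, (double)(t1 - t0) / REPS);
    }
    return 0;
}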
I would expect some pattern to emerge. Note that you will have some noise in your data due to the three additional random indexes; the juxtaposition between indexes, as well as the individual displacements between iterations, will have an effect on the timing. Data logging this, though, may trash your statistics.
Jim Dempsey