Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Avoid cache writing on read?

yotamhc
Beginner
Hi,
I have a scenario in which my code occasionally has to access some data in main memory, which evicts other, frequently needed data from the cache. The new data is needed only once.
Is there an instruction or optimization that avoids writing the value read from main memory into the cache, so that the old values stay there instead of being thrown away? That is, load the value from main memory into a register, but not into the cache.

Thanks!
7 Replies
Patrick_F_Intel1
Employee
1,327 Views
Hello Yotamhc,
There is an instruction, 'prefetchnta', which tries to reduce cache evictions.
On recent Intel processors, prefetchnta brings a line from memory into the L1 data cache (and not into the other cache levels).
On older processors prefetchnta would bring the data into the L2 (and not into other cache levels).
You can read about prefetchnta in the Intel 64 and IA-32 Architectures Optimization Reference Manual.
The Intel manuals are available at http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html.
Searching the PDF for 'prefetchnta', you'll see that the manual describes how the instruction's behavior has changed over time, but it doesn't spell out the behavior on current processors; the current behavior is as described above.
This history leads some to question the wisdom of putting software-based prefetches into code, since the behavior can change from one processor generation to the next.
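
For reference, the instruction is exposed through a compiler intrinsic, so no assembly is needed. A minimal sketch (the function name and pointer are placeholders):

    #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */

    /* Hint that the cache line holding p should be fetched with the
     * non-temporal hint, minimizing pollution of the other cache
     * levels. This is only a hint; the processor may ignore it. */
    static inline void prefetch_once(const void *p)
    {
        _mm_prefetch((const char *)p, _MM_HINT_NTA);
    }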

There is also the idea that you have to schedule your software prefetches.
See section 7.6.6 'Software Prefetch Scheduling Distance' of the above manual.
You don't want to prefetch the data too far ahead or it may get kicked out before you use it.
You also don't want to prefetch the data too late or the hardware prefetcher may have already fetched the data.
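
As a rough illustration of scheduling distance, here is a sketch of a streaming loop that prefetches a fixed number of elements ahead. The distance below is made up; the right value depends on memory latency and the work done per iteration, as section 7.6.6 discusses:

    #include <stddef.h>
    #include <xmmintrin.h>

    #define PF_DIST 256   /* elements ahead (16 cache lines); illustrative only */

    float sum_stream(const float *a, size_t n)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < n; i++) {
            /* Fetch the line we will need PF_DIST iterations from now,
             * with the non-temporal hint so it doesn't evict hot data. */
            if (i + PF_DIST < n)
                _mm_prefetch((const char *)&a[i + PF_DIST], _MM_HINT_NTA);
            sum += a[i];
        }
        return sum;
    }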

The instruction is easy enough to try (especially with intrinsics), so give it a shot and see whether it speeds up your code. HPC codes tend to use software prefetching, but they generally know that the array they are streaming through is larger than the cache.
If the data will sometimes already be in the cache, you should also check the performance of that case.
Let me know if you have questions,
Pat
yotamhc
Beginner
Hello Patrick and thank you for the detailed response!

I understand that this instruction hints to the CPU to prefetch the data into the L1 data cache only (assuming I use a Core i7) and not into the L2. However, the prefetch scheduling distance is somewhat of a problem for me, as I do not know in advance that I am about to fetch a specific address that will not be used again (by the time I know I need the data, I already have to read it). Will that make prefetchnta a problem in this scenario? Will it have any effect? If not, do you see any other option?

Thanks,
Yotam
Patrick_F_Intel1
Employee

Let me see if I understand correctly, Yotam.
Usually the prefetchnta instruction is most useful if:
1) you have a large array (say, bigger than the last-level cache), and
2) you are accessing the array sequentially, and
3) you don't need to reuse the data in the array anytime soon (so the data is likely to be kicked out anyway), and
4) you have other data that you don't want kicked out of the cache.

It sounds like you are saying either 1) or 2) is not true for your case.
Is that right?

Unless all of these conditions hold, it will be hard to get a performance boost with prefetchnta.
Pat

yotamhc
Beginner
I think (2) is not true in my case: I have a large array (which represents a state machine). Each time, I access a state and find the next state to go to; then I load the next state and do the same.
There is a small set of states in which I land most of the time, and occasionally I have to go to some other state. I cannot know which until I read the next-state pointer, and by that point I already have to read the next state.
I found that as the number of visits to such "non-popular" states goes up, I get a much higher L2 cache miss rate and overall performance drops rapidly. I thought that tuning my code to keep the popular states in the L2 might help with this issue. Do you see any way to do this?
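
Roughly, the traversal looks like this sketch (table layout and names invented); each load depends on the result of the previous one, so there is no address available to prefetch ahead of time:

    #include <stddef.h>
    #include <stdint.h>

    /* Walk the state machine: next_table holds 256 next-state entries
     * per state, indexed by the current input byte. */
    uint32_t run(const uint32_t *next_table, uint32_t state,
                 const uint8_t *input, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            state = next_table[(size_t)state * 256 + input[i]];
        return state;
    }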

Thanks a lot!
Yotam
Patrick_F_Intel1
Employee
(Accepted solution)

Hmmm... if you don't know the next-state pointer until you read it, then prefetchnta isn't going to help.

Is there any way to reduce the amount of data fetched per state, so that when you go to the unpopular states less data gets retrieved?
For instance, if there are just a handful of fields usually used per state, can you put them on the same (64-byte) cache line? Or reorder the layout to put as much of the most frequently used data as possible on the minimum number of cache lines?
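
As an illustration of that layout idea, here is a sketch with invented field names: the fields touched on every transition are packed into one 64-byte line, while rarely needed data lives in a separate "cold" structure that is fetched only on demand:

    #include <stdint.h>

    /* Hot data: everything needed for a typical transition, exactly
     * one 64-byte cache line (14*4 + 4 + 4 = 64 bytes). */
    struct state_hot {
        uint32_t next[14];    /* most common next-state indices */
        uint32_t flags;
        uint32_t cold_index;  /* where to find the rest, if ever needed */
    };

    /* Cold data: touched rarely, so its cache misses are infrequent. */
    struct state_cold {
        uint64_t visit_count;
        char     name[56];
    };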

This next idea is a long shot but...
Try to make sure that the distance between state machines isn't a power of 2. Say the distance between state machines is 4KB. A 32KB, 8-way L1 data cache can hold only 8 cache lines whose addresses are separated by exactly 4KB, because all such lines map to the same cache set.
Similarly, a 256KB L2 can hold only 64 cache lines whose addresses satisfy (address mod 4KB) == 0.
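
A sketch of the padding idea with made-up sizes: if each state machine's table were exactly 4KB, every table would start in the same L1 set, so one extra cache line of padding per table spreads the starting addresses across sets:

    #define NUM_MACHINES 16     /* illustrative */
    #define TABLE_BYTES  4096   /* the problematic power-of-2 stride */
    #define LINE_BYTES   64

    /* 4160-byte stride: consecutive tables' starting lines now map
     * to different cache sets instead of piling into one. */
    static unsigned char tables[NUM_MACHINES][TABLE_BYTES + LINE_BYTES];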

Also, if this is a multi-threaded app (on a system with more than one processor), you probably want to check that you aren't getting false sharing.
A symptom of false sharing is that, as you increase the number of threads, the bandwidth used goes up A LOT. This may indicate that more than one thread is hitting the same cache line, so the line bounces back and forth between sockets.
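
If false sharing does turn up, the usual fix is to give each thread's hot data its own cache line; a minimal C11 sketch (names invented):

    #include <stdint.h>

    #define MAX_THREADS 8   /* illustrative */

    /* _Alignas(64) pads and aligns each element to a full cache line,
     * so one thread's writes don't invalidate the line holding
     * another thread's counter. */
    struct per_thread {
        _Alignas(64) uint64_t counter;
    };

    static struct per_thread counters[MAX_THREADS];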
Pat

yotamhc
Beginner
Thanks, I was afraid I was about to get into cache-line alignment. Your tips will be a big help.

Can I suggest a feature for Intel's chip designers? ;)
SergeyKostrov
Valued Contributor II

Hi, I have two questions:

- How big is the data set?
- In what programming language did you implement your application?

Best regards,
Sergey
