Beginner

Events for analyzing the impact of two threads

Hi, I want to instrument two threads that run on a machine with a shared L2 cache. The two threads belong to one process: one is the main thread, the other is a prefetch thread. I want to measure how many L2 cache lines the main thread benefits from via the prefetch thread. In other words, how many of the cache lines the main thread uses were brought into L2 by the prefetch thread. For example, if the prefetch thread prefetches 10 cache lines and the main thread uses 4 of them, I want to obtain the numbers 4 and 10.
Any suggestions?
Employee

Quoting - explore_zjx

As I understand it, your working mode is: one "master" thread does the data processing, and another "slave" thread does the data prefetching.

Below are my suggestions:
1) First of all, you have to account for the new workload (extra code) of the "slave" thread, which should run in parallel with the "master" thread.
2) You have to consider the cost of the sync objects used for communication between the "master" thread and the "slave" thread, and the "master" thread's wait time.
For these two points, you can use Intel Thread Profiler to measure.

Finally, you can use event-based sampling data collection to learn:
1) L2 cache misses without the "slave" thread
2) L2 cache misses with the "slave" thread
(Don't worry about L2 cache misses in the "slave" thread; their latency is hidden on the other core...)

Regards, Peter
Beginner



Hi, Peter,
Your understanding is partially right.
As you said, the master thread has its own load instructions, so it is not necessary for the master to wait for the slave thread. If the master's data is not in L2, it will load it by itself; the slave thread is only for prefetching.
So, to be exact, I want to measure the prefetch accuracy.
Employee

Quoting - explore_zjx


Yes, now I understand more: no sync, no waiting, not even any communication between the two threads??? I mean, the main thread should send a signal to the slave thread to prepare the next bunch of data.

I already provided a method to measure L2 misses in my previous post.
Beginner



Yes, the slave thread does synchronize with the main thread periodically. I want to know about the L2_LD.SELF.ANY.S_STATE event. After the slave thread has loaded the data into L2 and the main thread then uses it, what is the MESI state of that cache line? Can I use the L2_LD.SELF.DEMAND.S_STATE event, which counts how many times cache lines in the Shared state are accessed, that is, when other caches share the cache line? In short, I want to know the state of the L2 cache line.
Employee

Quoting - explore_zjx


To measure L2 cache misses for the main thread, please use MEM_LOAD_RETIRED.L2_LINE_MISS or L2_LINES_IN.SELF.DEMAND for data loads - refer to the good article http://assets.devx.com/goparallel/18027.pdf by Dr. David Levinthal. The L2 hit count is "MEM_LOAD_RETIRED.L1D_LINE_MISS - MEM_LOAD_RETIRED.L2_LINE_MISS".

Again, run with and without the slave thread, get the event counts, and compare them; then you will know whether the slave thread benefits the main thread.

Hope it helps.

Regards, Peter