Problem with cache misses (intel core i7)

alef_dos · ‎06-06-2009

I have an Intel Core i7.

I run a program that multiplies two matrices for different sizes (200 to 5000). I use 4 threads with affinity (1 thread per core). The graph of time per iteration grows quickly up to a size of 2000 (like a step) and then levels off. This is due to cache misses.

So I measured the cache misses (2nd and 3rd level) with VTune for sizes 200 and 2000. The problem is that the percentage of cache misses (2nd and 3rd level) is higher for 200 than for 2000.

I have followed the guide for the Intel Core i7, but I don't know what I'm doing wrong.
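
As a rough check on why matrix size matters here, the working set can be compared with the cache sizes. This is only a sketch: the element type is not given in the post, so 8-byte doubles are assumed, and the cache sizes (256 KiB L2 per core, 8 MiB shared L3) are those of the first-generation Core i7, which is also an assumption about the exact model.

```python
# Rough working-set estimate for C = A * B with square n x n matrices.
# Assumptions (not stated in the post): 8-byte double elements, and
# cache sizes of 256 KiB L2 per core and 8 MiB shared L3.
BYTES_PER_ELEMENT = 8
L2_BYTES = 256 * 1024
L3_BYTES = 8 * 1024 * 1024


def working_set_bytes(n: int, matrices: int = 3) -> int:
    """Bytes needed to hold `matrices` dense n x n matrices."""
    return matrices * n * n * BYTES_PER_ELEMENT


for n in (200, 2000):
    size = working_set_bytes(n)
    print(f"n={n}: {size} bytes, fits in L3: {size <= L3_BYTES}")
```

Under these assumptions the three 200x200 matrices fit in L3 while the 2000x2000 ones do not.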

Thomas_W_Intel · ‎06-07-2009

Can you post the absolute values for the L2 and L3 cache misses (# of samples and sample-after value)? I also suggest that you double-check that you are accessing the data in the correct order to use all data in a cache line and verify that you don't have false sharing between your threads. Furthermore, I would verify (using VTune) what the hardware prefetchers are doing with your data access pattern.
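
To illustrate the point about access order, here is a small sketch that counts how many distinct cache lines are touched when walking one row versus one column of a row-major matrix. The 8-byte element size, the 64-byte line size and the line-aligned start address are assumptions for the example, not details from the original program.

```python
# Count distinct cache lines touched when walking one row vs. one column
# of a row-major n x n matrix. Assumptions: 8-byte elements, 64-byte cache
# lines, and a matrix that starts on a cache-line boundary.
ELEMENT_BYTES = 8
LINE_BYTES = 64


def lines_touched(n: int, along_row: bool) -> int:
    """Distinct cache lines touched by n consecutive accesses."""
    stride = ELEMENT_BYTES if along_row else n * ELEMENT_BYTES
    return len({(i * stride) // LINE_BYTES for i in range(n)})


n = 200
print("row walk:   ", lines_touched(n, along_row=True))
print("column walk:", lines_touched(n, along_row=False))
```

A row walk uses every element of each line it loads, while a column walk loads a new line for every element, which is why the inner loop of a matrix multiplication should run along the contiguous dimension where possible.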

Kind regards
Thomas

alef_dos · ‎06-07-2009

The first column is the event count and the second is the number of samples. I use only the event counts to calculate the result percentage. Thanks for the help.

For 200x200

For 2nd level:

Event                                      Events         Samples
MEM_LOAD_RETIRED.LLC_UNSHARED_HIT          137025         9135
MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM    690217         40601
CPU_CLK_UNHALTED.THREAD                    1431000064     530

Result (%): 3.904397659

For 3rd level:

Event                                      Events         Samples
MEM_LOAD_RETIRED.LLC_MISS                  1237228        26324
CPU_CLK_UNHALTED.THREAD                    1431000064     530

Result (%): 15.56261566

For 2000x2000

For 2nd level:

Event                                      Events         Samples
MEM_LOAD_RETIRED.LLC_UNSHARED_HIT          3596748        87718
MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM    2715848        87608
CPU_CLK_UNHALTED.THREAD                    12957299712    4799

Result (%): 2.522585255

For 3rd level:

Event                                      Events         Samples
MEM_LOAD_RETIRED.LLC_MISS                  4071015        62631
CPU_CLK_UNHALTED.THREAD                    12957299712    4799

Result (%): 5.655365827
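
The posted result percentages can be reproduced by weighting each event count with a cycle cost and dividing by the clockticks. The weights used in this sketch (35, 74 and 180 cycles) are inferred from the numbers in this post, so treat them as an assumption. With them, each result reads as an estimated share of cycles spent on those loads rather than as a miss rate.

```python
# Reproduce the posted result percentages from the raw event counts.
# The cycle weights are inferred from the posted numbers (an assumption),
# not quoted from the thread.
CLK = "CPU_CLK_UNHALTED.THREAD"
L2_WEIGHTS = {
    "MEM_LOAD_RETIRED.LLC_UNSHARED_HIT": 35,
    "MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM": 74,
}
L3_WEIGHTS = {"MEM_LOAD_RETIRED.LLC_MISS": 180}


def cycle_share_percent(events: dict, weights: dict) -> float:
    """Weighted event cycles as a percentage of unhalted clockticks."""
    weighted = sum(events[name] * cost for name, cost in weights.items())
    return 100.0 * weighted / events[CLK]


run_200 = {
    "MEM_LOAD_RETIRED.LLC_UNSHARED_HIT": 137025,
    "MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM": 690217,
    "MEM_LOAD_RETIRED.LLC_MISS": 1237228,
    CLK: 1431000064,
}
run_2000 = {
    "MEM_LOAD_RETIRED.LLC_UNSHARED_HIT": 3596748,
    "MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM": 2715848,
    "MEM_LOAD_RETIRED.LLC_MISS": 4071015,
    CLK: 12957299712,
}

for label, run in (("200x200", run_200), ("2000x2000", run_2000)):
    print(label,
          round(cycle_share_percent(run, L2_WEIGHTS), 4),
          round(cycle_share_percent(run, L3_WEIGHTS), 4))
```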

Dny · ‎06-22-2009

Hi,
VTune changes the SAV (sample-after value) for different workloads, so comparing the number of samples across workloads is not correct.
To compare an event across workloads, calculate the total number of events for each run and compare those totals.
Total # of events = # of samples * SAV
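
A minimal sketch of that formula, using the 2000x2000 sample counts and SAVs that appear later in this thread as input (the function name is just for illustration):

```python
def total_events(samples: int, sav: int) -> int:
    """Total # of events = # of samples * SAV."""
    return samples * sav


# 2000x2000 run, figures as posted later in the thread
llc_miss = total_events(samples=80742, sav=61)
clockticks = total_events(samples=6472, sav=2_700_000)
print(llc_miss, clockticks)
```

Only these totals are comparable between runs; the raw sample counts depend on whatever SAV was chosen for each run.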

Hope this will help you.
Thanks,

Regards,
Dny

alef_dos · ‎06-23-2009

So the total number of events shown when the program is run is not correct?

I did that and the result is still the same. For example:

200x200 (3rd level cache)

MEM_LOAD_RETIRED.LLC_MISS:

Samples: 41207
SAV: 41
Total: 1730694

CPU_CLK_UNHALTED.THREAD:

Samples: 577
SAV: 2700000
Total: 1557900000

Result: 19%

2000x2000 (3rd level cache)

MEM_LOAD_RETIRED.LLC_MISS:

Samples: 80742
SAV: 61
Total: 4925262

CPU_CLK_UNHALTED.THREAD:

Samples: 6472
SAV: 2700000
Total: 17474400000

Result: 5%

Thomas_W_Intel · ‎06-29-2009

As sampling is a statistical method, there is always some variance in your measurements. It is recommended to take no more than 1000 samples per second so that sampling does not disturb the measurement too much. To get statistically relevant data, you should collect at least a few thousand samples.

Your sample-after value for last-level cache misses is very low. Consequently, you have a lot of samples for cache misses compared to the samples for clock-ticks, especially in the 200x200 case.

I suggest that you disable the calibration run and use a fixed sample-after value. A good way to choose SAVs is to adjust them to the impact of the event that you are measuring. For clock-ticks, a SAV of 2,000,000 is fine, because you will get about 1000 samples per second. For LLC misses, a SAV of 10,000 is appropriate, because a cache miss lasts a few hundred cycles. Again, you should get about 1000 samples per second if LLC misses are a problem in your application. (You will get much fewer samples, if LLC misses are not a problem, but then you don't care about the exact value anyway, right?)
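
To get a feel for these numbers, the sketch below relates an event rate and a SAV to the resulting sample rate, and shows how many samples the 200x200 totals posted above would yield with the suggested fixed SAVs. The 2 GHz clock rate is an assumption chosen to match the "about 1000 samples per second" figure; the helper names are made up for the example.

```python
def samples_per_second(events_per_second: float, sav: int) -> float:
    """Sampling rate produced by a given event rate and sample-after value."""
    return events_per_second / sav


def expected_samples(total_events: int, sav: int) -> int:
    """Whole samples collected for a run with the given event total."""
    return total_events // sav


# Assumed clock rate of about 2 GHz with the suggested clock-tick SAV.
print(samples_per_second(2e9, 2_000_000))

# Event totals of the 200x200 run as posted in this thread.
clockticks_total = 1557900000
llc_miss_total = 1730694
print(expected_samples(clockticks_total, 2_000_000))
print(expected_samples(llc_miss_total, 10_000))
```

With a SAV of 10,000 the same 200x200 run would produce only a few hundred LLC-miss samples instead of tens of thousands, which keeps the sampling overhead much lower.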

Kind regards
Thomas

Anders_Ø_ · ‎03-03-2016

Dear Intel

For Moore and the 4 nanometer enigma, on 137025 faith leap Savant in 2, (simple arithmetic) suggests 1, 10, 100, 1000, 10000 etc.

Best wishes Anders Øbro Ravn validationx2