Problem with cache misses (Intel Core i7)

alef_dos
Beginner
I have an Intel Core i7.

I run a program that multiplies two matrices for different sizes (200 to 5000). I use 4 threads with affinity (one thread per core). The plot of time per iteration rises steeply up to size 2000 (like a step) and then levels off. This is due to cache misses.

So I measured the cache misses (2nd and 3rd level) with VTune for sizes 200 and 2000. The problem is that the percentage of cache misses (2nd and 3rd level) is higher for 200 than for 2000.

I have followed the guide for the Intel Core i7, but I don't know what I'm doing wrong.
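
One way to sanity-check the cache-miss explanation is to compare the memory footprint of the matrices with the cache sizes. The sketch below is only a rough estimate under assumptions that the post does not state: 8-byte elements, three n x n matrices (two inputs and one result), 256 KiB of L2 per core, and 8 MiB of shared L3. Substitute the real values for the CPU and program in question.

```python
# Assumptions (not stated in the post): 8-byte elements, three n x n matrices,
# 256 KiB L2 per core, 8 MiB shared L3.
ELEMENT_BYTES = 8
L2_BYTES = 256 * 1024
L3_BYTES = 8 * 1024 * 1024


def working_set_bytes(n, matrices=3, element_bytes=ELEMENT_BYTES):
    """Total bytes occupied by `matrices` dense n x n matrices."""
    return matrices * n * n * element_bytes


def smallest_fitting_level(n):
    """Name the smallest assumed cache level that holds the whole working set."""
    size = working_set_bytes(n)
    if size <= L2_BYTES:
        return "L2"
    if size <= L3_BYTES:
        return "L3"
    return "memory"


for n in (200, 2000, 5000):
    print(n, working_set_bytes(n), smallest_fitting_level(n))
```

Under these assumptions the 200x200 case fits in the last-level cache while the 2000x2000 case does not, so the two sizes exercise the memory hierarchy very differently.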
Thomas_W_Intel
Employee
Can you post the absolute values for the L2 and L3 cache misses (# of samples and sample-after value)? I also suggest that you double-check that you are accessing the data in the correct order to use all data in a cache line and verify that you don't have false sharing between your threads. Furthermore, I would verify (using VTune) what the hardware prefetchers are doing with your data access pattern.

Kind regards
Thomas
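
The point about access order can be illustrated by counting how many cache lines one pass of the innermost loop touches in the second matrix. This is a minimal sketch, not the poster's code: it assumes row-major storage, 8-byte elements, 64-byte cache lines, and a matrix that starts on a cache-line boundary, and the function name is made up for illustration.

```python
# Assumptions: row-major storage, 8-byte elements, 64-byte cache lines,
# and a matrix B that starts on a cache-line boundary.
ELEMENT_BYTES = 8
LINE_BYTES = 64


def lines_touched_in_b(n, order):
    """Count distinct cache lines of B read by one pass of the innermost loop.

    order "ijk": the inner loop runs over k, reading B[k][j] down a column.
    order "ikj": the inner loop runs over j, reading B[k][j] along a row.
    """
    if order == "ijk":
        indices = [(k, 0) for k in range(n)]  # j fixed at 0, k varies
    elif order == "ikj":
        indices = [(0, j) for j in range(n)]  # k fixed at 0, j varies
    else:
        raise ValueError("order must be 'ijk' or 'ikj'")
    lines = {((k * n + j) * ELEMENT_BYTES) // LINE_BYTES for k, j in indices}
    return len(lines)


print(lines_touched_in_b(200, "ijk"))
print(lines_touched_in_b(200, "ikj"))
```

With the column-wise inner loop every access lands on a different cache line, so only one element of each loaded line is used before moving on. With the row-wise inner loop all eight elements of each line are consumed.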
alef_dos
Beginner
The first column is the event count and the second is the number of samples. I use only the event counts to calculate the percentages. Thanks for the help.


For 200x200

Cache misses, 200x200
                                          Events        Samples
2nd level:
MEM_LOAD_RETIRED.LLC_UNSHARED_HIT         137025        9135
MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM   690217        40601
CPU_CLK_UNHALTED.THREAD                   1431000064    530
Result (%)                                3.904397659
3rd level:
MEM_LOAD_RETIRED.LLC_MISS                 1237228       26324
CPU_CLK_UNHALTED.THREAD                   1431000064    530
Result (%)                                15.56261566


For 2000x2000

Cache misses, 2000x2000
                                          Events        Samples
2nd level:
MEM_LOAD_RETIRED.LLC_UNSHARED_HIT         3596748       87718
MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM   2715848       87608
CPU_CLK_UNHALTED.THREAD                   12957299712   4799
Result (%)                                2.522585255
3rd level:
MEM_LOAD_RETIRED.LLC_MISS                 4071015       62631
CPU_CLK_UNHALTED.THREAD                   12957299712   4799
Result (%)                                5.655365827
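
The result percentages in the tables above are not plain miss ratios. They can be reproduced by weighting each event count with a per-event cycle cost and dividing by CPU_CLK_UNHALTED.THREAD, which makes them an estimate of the share of cycles spent on those loads. The weights of 35, 74, and 180 cycles used in this sketch are an assumption: they do not appear in the post and were inferred because they match the posted figures.

```python
# Cycle cost per event: ASSUMED, inferred by matching the posted percentages.
COST = {
    "LLC_UNSHARED_HIT": 35,
    "OTHER_CORE_L2_HIT_HITM": 74,
    "LLC_MISS": 180,
}


def cycle_share_percent(event_counts, clockticks):
    """Estimated percentage of cycles attributed to the given load events."""
    weighted = sum(COST[name] * count for name, count in event_counts.items())
    return 100.0 * weighted / clockticks


# (LLC_UNSHARED_HIT, OTHER_CORE_L2_HIT_HITM, LLC_MISS, clockticks) as posted.
runs = {
    "200x200": (137025, 690217, 1237228, 1431000064),
    "2000x2000": (3596748, 2715848, 4071015, 12957299712),
}

for label, (unshared, other_core, llc_miss, clk) in runs.items():
    second = cycle_share_percent(
        {"LLC_UNSHARED_HIT": unshared, "OTHER_CORE_L2_HIT_HITM": other_core}, clk
    )
    third = cycle_share_percent({"LLC_MISS": llc_miss}, clk)
    print(f"{label}: 2nd level {second:.2f}%, 3rd level {third:.2f}%")
```

Read this way, a lower percentage for 2000x2000 does not mean fewer misses. The absolute LLC_MISS count is about 3.3 times higher than for 200x200, but the clocktick count grew by about 9 times, so the share per cycle drops.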



Dny
Beginner
Hi,
For different workloads, VTune changes the SAV (sample-after value), so comparing the number of samples across workloads is not meaningful.
To compare an event across workloads, calculate the total number of events for each workload and compare those totals.
Total # of events = # of Samples * SAV

Hope this will help you.
Thanks,

Regards,
Dny
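
The formula above can be tried on the LLC_MISS figures from the first set of measurements in this thread. The SAVs were not posted for those runs; the values 47 and 65 used here are implied by dividing the posted event counts by the posted sample counts, so treat them as derived rather than reported.

```python
def total_events(samples, sav):
    """Estimate the total event count from the sample count and the SAV."""
    return samples * sav


# LLC_MISS sample counts as posted; SAVs derived as events / samples.
small_samples, small_sav = 26324, 47  # 200x200
large_samples, large_sav = 62631, 65  # 2000x2000

sample_ratio = large_samples / small_samples
event_ratio = total_events(large_samples, large_sav) / total_events(
    small_samples, small_sav
)
print(f"sample ratio {sample_ratio:.2f}, event ratio {event_ratio:.2f}")
```

Because the SAV differs between the two runs, the ratio of samples understates the ratio of events, which is why the comparison has to be made on totals.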
alef_dos
Beginner
So the total number of events shown when the program is run is not correct?

I did that and the result is still the same. For example:

200x200 (3rd level cache)

MEM_LOAD_RETIRED.LLC_MISS:

Samples: 41207
SAV:41
Total:1730694

CPU_CLK_UNHALTED.THREAD:

Samples: 577
SAV:2700000
Total:1557900000

Result: 19%

2000x2000 (3rd level cache)

MEM_LOAD_RETIRED.LLC_MISS:

Samples: 80742
SAV:61
Total:4925262

CPU_CLK_UNHALTED.THREAD:

Samples: 6472
SAV:2700000
Total:17474400000


Result: 5%
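
The figures above can be cross-checked by recomputing each total as samples times SAV. The sketch below does that and then recomputes the two results, assuming a cost of 180 cycles per LLC miss. That weight is not stated in the post; it is inferred because it matches the posted percentages.

```python
# (event, run): (samples, SAV, posted total), copied from the post above.
posted = {
    ("LLC_MISS", "200x200"): (41207, 41, 1730694),
    ("CLK", "200x200"): (577, 2_700_000, 1_557_900_000),
    ("LLC_MISS", "2000x2000"): (80742, 61, 4925262),
    ("CLK", "2000x2000"): (6472, 2_700_000, 17_474_400_000),
}


def mismatches(rows):
    """Return the keys whose posted total differs from samples * SAV."""
    return [
        key
        for key, (samples, sav, total) in rows.items()
        if samples * sav != total
    ]


print(mismatches(posted))

# ASSUMED cost of 180 cycles per LLC miss (not stated in the post).
LLC_MISS_COST = 180
for run in ("200x200", "2000x2000"):
    misses = posted[("LLC_MISS", run)][2]
    clk = posted[("CLK", run)][2]
    print(run, round(100 * LLC_MISS_COST * misses / clk, 1))
```

Three of the four totals match exactly. The posted 200x200 LLC_MISS total of 1730694 equals 41207 x 42 rather than 41207 x 41, so either the SAV or the total was mistyped; the post does not show which.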

Thomas_W_Intel
Employee
As sampling is a statistical method, there is always some variance in your measurements. It is recommended to collect no more than about 1000 samples per second so that sampling does not disturb the measurement too much. To get statistically relevant data, you should have at least a few thousand samples.

Your sample-after value for last-level cache misses is very low. As a result, you have far more samples for cache misses than for clockticks, especially in the 200x200 case.

I suggest that you disable the calibration run and use a fixed sample-after value. A good way to choose SAVs is to adjust them to the impact of the event that you are measuring. For clockticks, a SAV of 2,000,000 is fine, because you will get about 1000 samples per second. For LLC misses, a SAV of 10,000 is appropriate, because a cache miss lasts a few hundred cycles. Again, you should get about 1000 samples per second if LLC misses are a problem in your application. (You will get far fewer samples if LLC misses are not a problem, but then you don't care about the exact value anyway, right?)

Kind regards
Thomas
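
The relationship between event rate, SAV, and sampling rate in the advice above can be written down directly. This is a rough sketch with hypothetical helper names; the 2 GHz clock is an assumption, chosen because it is the rate at which a SAV of 2,000,000 gives about 1000 samples per second.

```python
def samples_per_second(events_per_second, sav):
    """Expected sampling rate for an event occurring at the given rate."""
    return events_per_second / sav


def sav_for_target(events_per_second, target_samples_per_second=1000):
    """SAV that yields roughly the target sampling rate (at least 1)."""
    return max(1, round(events_per_second / target_samples_per_second))


# ASSUMED 2 GHz core clock, implied by SAV 2,000,000 at ~1000 samples/s.
CLOCK_HZ = 2_000_000_000
print(samples_per_second(CLOCK_HZ, 2_000_000))

# A SAV of 10,000 reaches 1000 samples/s at 10 million LLC misses per second.
print(sav_for_target(10_000_000))

# Ratio of LLC-miss samples to clocktick samples in the posted 200x200 run.
print(41207 / 577)
```

If clockticks in the posted 200x200 run were sampled at roughly 1000 per second, the last ratio suggests that LLC misses were sampled about 71 times as often, far above the recommended rate.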
Anders_Ø_
Beginner
Dear Intel

For Moore and the 4nanometer enigma, on 137025 faith leap Savant in 2, (simple arethmetrics) suggests 1,10,100,1000,10000 etc.

Best wishes Anders Øbro Ravn validationx2
