I have an Intel Core i7.
I run a program that multiplies two matrices for different sizes (200..5000). I use 4 threads with affinity (1 thread per core). The graph of time per iteration grows quickly up to size 2000 (like a step) and then levels off. This is due to cache misses.
So I measured the cache misses (2nd and 3rd level) with VTune for sizes 200 and 2000. The problem is that the cache-miss percentages (2nd and 3rd level) for 200 are higher than for 2000.
I have followed the guide for the Intel Core i7, but I don't know what I'm doing wrong.
6 Replies
Quoting - alef_dos
I have an Intel Core i7.
I run a program that multiplies two matrices for different sizes (200..5000). I use 4 threads with affinity (1 thread per core). The graph of time per iteration grows quickly up to size 2000 (like a step) and then levels off. This is due to cache misses.
So I measured the cache misses (2nd and 3rd level) with VTune for sizes 200 and 2000. The problem is that the cache-miss percentages (2nd and 3rd level) for 200 are higher than for 2000.
I have followed the guide for the Intel Core i7, but I don't know what I'm doing wrong.
Can you post the absolute values for the L2 and L3 cache misses (# of samples and sample-after value)? I also suggest that you double-check that you are accessing the data in the correct order to use all data in a cache line and verify that you don't have false sharing between your threads. Furthermore, I would verify (using VTune) what the hardware prefetchers are doing with your data access pattern.
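The access-order point can be sketched with a toy example (an illustration of the idea, not the poster's actual code). With row-major storage, the classic i-j-k loop order walks one matrix down a column (strided access), while the i-k-j order streams rows, so every element of a fetched cache line gets used. Python lists do not expose cache behaviour, but the reordering itself is the same:

```python
# Toy matrices (N kept small so this runs quickly).
N = 64
A = [[(i + j) % 7 for j in range(N)] for i in range(N)]
B = [[(i * j) % 5 for j in range(N)] for i in range(N)]

# i-j-k order: the inner loop reads B[k][j] for increasing k,
# i.e. a column walk through B (strided in row-major storage).
C1 = [[0] * N for _ in range(N)]
for i in range(N):
    for j in range(N):
        s = 0
        for k in range(N):
            s += A[i][k] * B[k][j]
        C1[i][j] = s

# i-k-j order: the inner loop reads B[k][j] for increasing j,
# i.e. a row walk, touching consecutive elements of one cache line.
C2 = [[0] * N for _ in range(N)]
for i in range(N):
    for k in range(N):
        a = A[i][k]
        row_b = B[k]
        row_c = C2[i]
        for j in range(N):
            row_c[j] += a * row_b[j]

assert C1 == C2  # same product, cache-friendlier access order
```

Both orders compute the same product; only the memory access pattern differs.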
Kind regards
Thomas
Quoting - Thomas Willhalm (Intel)
Can you post the absolute values for the L2 and L3 cache misses (# of samples and sample-after value)? I also suggest that you double-check that you are accessing the data in the correct order to use all data in a cache line and verify that you don't have false sharing between your threads. Furthermore, I would verify (using VTune) what the hardware prefetchers are doing with your data access pattern.
Kind regards
Thomas
The first column is events and the second is samples. I used only the events to calculate the percentages. Thanks for the help.
For 200x200:

| Cache misses (200x200) | Events | Samples |
| --- | --- | --- |
| 2nd level: | | |
| MEM_LOAD_RETIRED.LLC_UNSHARED_HIT | 137025 | 9135 |
| MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM | 690217 | 40601 |
| CPU_CLK_UNHALTED.THREAD | 1431000064 | 530 |
| Result (%) | 3.904397659 | |
| 3rd level: | | |
| MEM_LOAD_RETIRED.LLC_MISS | 1237228 | 26324 |
| CPU_CLK_UNHALTED.THREAD | 1431000064 | 530 |
| Result (%) | 15.56261566 | |
For 2000x2000:

| Cache misses (2000x2000) | Events | Samples |
| --- | --- | --- |
| 2nd level: | | |
| MEM_LOAD_RETIRED.LLC_UNSHARED_HIT | 3596748 | 87718 |
| MEM_LOAD_RETIRED.OTHER_CORE_L2_HIT_HITM | 2715848 | 87608 |
| CPU_CLK_UNHALTED.THREAD | 12957299712 | 4799 |
| Result (%) | 2.522585255 | |
| 3rd level: | | |
| MEM_LOAD_RETIRED.LLC_MISS | 4071015 | 62631 |
| CPU_CLK_UNHALTED.THREAD | 12957299712 | 4799 |
| Result (%) | 5.655365827 | |
Quoting - alef_dos
The first column is events and the second is samples. I used only the events to calculate the percentages. Thanks for the help.
[event tables quoted from the previous post]
Hi,
For different workloads, VTune changes the SAV (sample-after value), so comparing the number of samples across workloads is not valid.
If you want to compare an event across workloads, you need to calculate the total number of events and compare those:
Total # of events = # of samples * SAV
Hope this helps.
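As a quick sketch of the scaling rule, using the 2000x2000 figures that appear later in this thread (the normalisation at the end is one possible metric, not the one used in the thread):

```python
def total_events(samples, sav):
    """Total events = samples * sample-after value (SAV)."""
    return samples * sav

# 2000x2000 third-level figures quoted later in this thread:
llc_miss = total_events(80742, 61)        # MEM_LOAD_RETIRED.LLC_MISS
clocks = total_events(6472, 2_700_000)    # CPU_CLK_UNHALTED.THREAD

print(llc_miss)   # 4925262
print(clocks)     # 17474400000

# One possible way to normalise across workloads:
misses_per_kilotick = 1000 * llc_miss / clocks
print(round(misses_per_kilotick, 3))      # LLC misses per 1000 clockticks
```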
Thanks,
Regards,
Dny
Quoting - Dny
Hi,
For different workloads, VTune changes the SAV (sample-after value), so comparing the number of samples across workloads is not valid.
If you want to compare an event across workloads, you need to calculate the total number of events and compare those:
Total # of events = # of samples * SAV
Hope this will help you.
Thanks,
Regards,
Dny
So the total event counts that VTune shows when the program is run are not correct?
I did the calculation and the result is still the same. For example:
200x200 (3rd-level cache):
MEM_LOAD_RETIRED.LLC_MISS: samples 41207, SAV 41, total 1730694
CPU_CLK_UNHALTED.THREAD: samples 577, SAV 2700000, total 1557900000
Result: 19%

2000x2000 (3rd-level cache):
MEM_LOAD_RETIRED.LLC_MISS: samples 80742, SAV 61, total 4925262
CPU_CLK_UNHALTED.THREAD: samples 6472, SAV 2700000, total 17474400000
Result: 5%
Quoting - alef_dos
So the total event counts that VTune shows when the program is run are not correct?
I did the calculation and the result is still the same. For example:
[sample counts, SAVs, and totals quoted from the previous post]
As sampling is a statistical method, there is always some variance in your measurements. It is recommended to collect no more than about 1000 samples per second, so that the measurement itself is not disturbed too much. To get statistically relevant data, you should have at least a few thousand samples.
Your sample-after value for last-level cache misses is very low. On the other hand, you have a lot of samples for cache misses compared to the samples for clockticks, especially in the 200x200 case.
I suggest that you disable the calibration run and use a fixed sample-after value. A good way to choose SAVs is to adjust them to the impact of the event you are measuring. For clockticks, a SAV of 2,000,000 is fine, because you will get about 1000 samples per second. For LLC misses, a SAV of 10,000 is appropriate, because a cache miss lasts a few hundred cycles. Again, you should get about 1000 samples per second if LLC misses are a problem in your application. (You will get far fewer samples if LLC misses are not a problem, but then you don't care about the exact value anyway, right?)
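The rule of thumb above can be sanity-checked with simple arithmetic. The 2 GHz clock rate and 200-cycle miss cost below are illustrative assumptions, not measured values:

```python
# Rough arithmetic behind the SAV rules of thumb.
clock_hz = 2_000_000_000  # assume a ~2 GHz core (illustrative)

# Clockticks: one sample is taken every SAV events.
sav_clk = 2_000_000
clk_samples_per_sec = clock_hz / sav_clk
print(clk_samples_per_sec)   # ~1000 samples per second

# LLC misses: if each miss costs ~200 cycles and misses dominate,
# at most about clock_hz / 200 misses can occur per second, so:
miss_cost_cycles = 200
sav_llc = 10_000
llc_samples_per_sec = (clock_hz / miss_cost_cycles) / sav_llc
print(llc_samples_per_sec)   # ~1000 samples per second in the worst case
```

In other words, both SAVs are chosen so that even in the worst case the sampling rate stays near the recommended 1000 samples per second.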
Kind regards
Thomas