Hello,
I am a new MSc student working on Intel Xeon Phi KNC. For my current project, I am running the NAS Parallel Benchmarks BT benchmark on a Xeon Phi coprocessor with two different affinity patterns, both set through KMP_AFFINITY with the explicit type.
Scenario 1: 40 threads with 4 threads per physical core (all 4 hardware threads on each of 10 physical cores).
Results from VTune Amplifier:
Elapsed time: 73 seconds
L2_DATA_READ_MISS_CACHE_FILL: 76,150,000
L2_DATA_WRITE_MISS_CACHE_FILL: 616,750,000
L2_DATA_READ_MISS_MEM_FILL: 1,862,300,000
L2_DATA_WRITE_MISS_MEM_FILL: 1,964,750,000
Scenario 2: the same benchmark with the same 40 threads, but with one thread per physical core (40 physical cores in total).
Results from VTune Amplifier:
Elapsed time: 48 seconds
L2_DATA_READ_MISS_CACHE_FILL: 272,800,000
L2_DATA_WRITE_MISS_CACHE_FILL: 104,200,000
L2_DATA_READ_MISS_MEM_FILL: 1,524,350,000
L2_DATA_WRITE_MISS_MEM_FILL: 2,548,200,000
Scenario 2 has substantially more write misses filled from memory, around 580 million more than scenario 1, yet it finishes in significantly less time. I have also read that KNC reaches its peak performance only with more than one thread per core, so I expected the first configuration to be faster and am confused about why it takes so much longer. What might be the reason for this? I would really appreciate your help!
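For reference, the two placements described above could be written as explicit KMP_AFFINITY strings along the following lines. This is only a sketch: the actual proclists used in the runs are not given in the post, and the helper names `knc_proclist` and `kmp_affinity` are hypothetical. It assumes the usual KNC logical CPU numbering, in which core k (counting from 0) owns logical CPUs 4k+1 through 4k+4 and logical CPU 0 belongs to the last core.

```python
def knc_proclist(n_threads, threads_per_core):
    """Logical CPU ids for n_threads placed threads_per_core per physical core.

    Assumption: core k (0-based) owns logical CPUs 4k+1 .. 4k+4, and logical
    CPU 0 (on the last core) is left unused.
    """
    if not 1 <= threads_per_core <= 4:
        raise ValueError("KNC has 4 hardware threads per core")
    if n_threads % threads_per_core != 0:
        raise ValueError("n_threads must be a multiple of threads_per_core")
    n_cores = n_threads // threads_per_core
    return [4 * core + 1 + t
            for core in range(n_cores)
            for t in range(threads_per_core)]


def kmp_affinity(cpus):
    """Format a KMP_AFFINITY value that pins one thread to each listed CPU."""
    proclist = ",".join(str(c) for c in cpus)
    return f"granularity=fine,proclist=[{proclist}],explicit"


# Scenario 1: 40 threads packed onto 10 cores, 4 threads per core.
scenario_1 = kmp_affinity(knc_proclist(40, 4))
# Scenario 2: 40 threads spread over 40 cores, 1 thread per core.
scenario_2 = kmp_affinity(knc_proclist(40, 1))

if __name__ == "__main__":
    print("export KMP_AFFINITY=" + repr(scenario_1))
    print("export KMP_AFFINITY=" + repr(scenario_2))
```

Under these assumptions scenario 1 uses logical CPUs 1 through 40, while scenario 2 uses every fourth logical CPU starting at 1.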
As I mentioned in the other forum (https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/714924#comment-1898880), you need to evaluate whether this code is likely to be compute-limited or memory-access limited. The total traffic for the two cases above is almost the same, but the compute resources are substantially different (10 cores versus 40 cores). Interpreting these results as indicating a compute-bound application seems like a reasonable start.
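The "almost the same total traffic" observation can be checked by summing the four counters from each scenario. The sketch below does that and also converts the totals to an average fill rate, assuming that each counted event corresponds to one 64-byte cache line and that the VTune elapsed time covers the same interval as the counts. The dictionary keys are shortened labels for the four events in the question, not official event names.

```python
CACHE_LINE_BYTES = 64  # assumption: one cache line moved per counted fill

# Counter values and elapsed times as reported in the question.
scenarios = {
    "scenario 1 (4 threads/core, 10 cores)": {
        "elapsed_s": 73,
        "counts": {
            "read_miss_cache_fill": 76_150_000,
            "write_miss_cache_fill": 616_750_000,
            "read_miss_mem_fill": 1_862_300_000,
            "write_miss_mem_fill": 1_964_750_000,
        },
    },
    "scenario 2 (1 thread/core, 40 cores)": {
        "elapsed_s": 48,
        "counts": {
            "read_miss_cache_fill": 272_800_000,
            "write_miss_cache_fill": 104_200_000,
            "read_miss_mem_fill": 1_524_350_000,
            "write_miss_mem_fill": 2_548_200_000,
        },
    },
}


def total_fills(counts):
    """Total number of L2 miss fills, from other caches and from memory."""
    return sum(counts.values())


def avg_fill_rate_gb_s(counts, elapsed_s):
    """Average fill traffic in GB/s over the whole run."""
    return total_fills(counts) * CACHE_LINE_BYTES / elapsed_s / 1e9


if __name__ == "__main__":
    for name, s in scenarios.items():
        fills = total_fills(s["counts"])
        rate = avg_fill_rate_gb_s(s["counts"], s["elapsed_s"])
        print(f"{name}: {fills:,} fills, about {rate:.2f} GB/s on average")
```

With these inputs the totals differ by under 2 percent (4,519,950,000 versus 4,449,550,000 fills), and the average rates come out at roughly 4 and 6 GB/s. If the assumptions hold, rates that low are more consistent with a compute-limited run than with one saturating memory.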
Hi John,
I checked the compute throughput for both cases: it shows 2430 Mops for the first case and 4079 Mops for the second. So what you said seems correct; the compute resources available to the two runs are different. Thank you for your suggestions!
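To put the reported figures side by side, the sketch below compares the speedup implied by the elapsed times with the one implied by the Mops figures, and then divides each Mops figure by the number of physical cores used. Only the numbers quoted in this thread are used; the variable names are illustrative.

```python
# Figures quoted in the thread.
elapsed_s = {"scenario 1": 73, "scenario 2": 48}
mops = {"scenario 1": 2430, "scenario 2": 4079}
cores = {"scenario 1": 10, "scenario 2": 40}

# Speedup of scenario 2 over scenario 1, measured two ways.
speedup_by_time = elapsed_s["scenario 1"] / elapsed_s["scenario 2"]
speedup_by_mops = mops["scenario 2"] / mops["scenario 1"]

# Throughput delivered by each physical core in use.
mops_per_core = {name: mops[name] / cores[name] for name in mops}
per_core_gain = mops_per_core["scenario 1"] / mops_per_core["scenario 2"]

if __name__ == "__main__":
    print(f"speedup by elapsed time: {speedup_by_time:.2f}x")
    print(f"speedup by Mops:         {speedup_by_mops:.2f}x")
    for name, value in mops_per_core.items():
        print(f"{name}: {value:.0f} Mops per core")
    print(f"per-core gain from 4 threads/core: {per_core_gain:.2f}x")
```

Scenario 2 is about 1.5x to 1.7x faster overall while using 4x as many cores. Per core, the 4-threads-per-core run delivers roughly 2.4x the throughput of the 1-thread-per-core run, which fits the advice that KNC cores need more than one thread to perform well: scenario 1 is slower mainly because it uses only a quarter of the cores.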
Also, when these KMP_AFFINITY modes are used CPU #1 and CPU#2 share L2,
The initial question says "KNC". It is on KNL that cores are grouped into tiles, with the two cores of a tile sharing an L2 cache; on KNC each core has its own L2.
