Hello,
I am a new MSc student working on Intel Xeon Phi KNC. For my current project, I am running NAS parallel benchmark BT on xeon phi coprocessor with two different affinity patterns using KMP_AFFINITY explicit
scenario 1: running with 40 threads with 4 threads per each physical core (using 4 virtual cores per core and total of 10 physical cores )
Results from VTune Amplifier:
Elapsed time: 73 seconds
L2_DATA_READ_MISS_CACHE_FILL: 76,150,000
L2_DATA_WRITE_MISS_CACHE_FILL: 616,750,000
L2_DATA_READ_MISS_MEM_FILL: 1,862,300,000
L2_DATA_WRITE_MISS_MEM_FILL: 1,964,750,000
Scenario 2: the same benchmark with the same 40 threads, but one thread per physical core (40 physical cores in total).
Results from VTune Amplifier:
Elapsed time: 48 seconds
L2_DATA_READ_MISS_CACHE_FILL: 272,800,000
L2_DATA_WRITE_MISS_CACHE_FILL: 104,200,000
L2_DATA_READ_MISS_MEM_FILL: 1,524,350,000
L2_DATA_WRITE_MISS_MEM_FILL: 2,548,200,000
Although scenario 2 has significantly more write misses (around 580 million more than scenario 1), it completes in significantly less time. I have also read that KNC performance peaks when using more than one thread per core, so I am confused about why the first configuration takes so much longer; I expected it to be faster. What might be the reason? I would really appreciate your help!
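For readers trying to reproduce the two pinnings: the proclists below are a sketch based on the commonly reported KNC OS-processor numbering (logical CPUs 1-4 on core 0, 5-8 on core 1, and so on, with CPU 0 on the last core). That mapping is an assumption and should be verified against /proc/cpuinfo on the coprocessor before use.

```python
# Sketch: generate KMP_AFFINITY=explicit proclists for the two scenarios.
# ASSUMPTION: KNC numbering where core c owns logical CPUs 4c+1 .. 4c+4
# (verify on your card; CPU 0 usually sits on the last core).

def packed_proclist(n_cores, threads_per_core):
    """Scenario 1 style: fill each core's hardware threads before moving on."""
    return [4 * c + t for c in range(n_cores) for t in range(1, threads_per_core + 1)]

def scattered_proclist(n_cores):
    """Scenario 2 style: one thread per core, first hardware thread of each."""
    return [4 * c + 1 for c in range(n_cores)]

s1 = packed_proclist(10, 4)    # 40 threads on 10 physical cores
s2 = scattered_proclist(40)    # 40 threads on 40 physical cores

print('KMP_AFFINITY="explicit,granularity=fine,proclist=[%s]"'
      % ",".join(map(str, s1)))
print('KMP_AFFINITY="explicit,granularity=fine,proclist=[%s]"'
      % ",".join(map(str, s2)))
```

Note that KMP_AFFINITY=compact produces a placement like scenario 1 and KMP_AFFINITY=scatter (or balanced, on Xeon Phi) a placement like scenario 2, without hand-written proclists.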
As I mentioned in the other forum (https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/714924#comment-1898880), you need to evaluate whether this code is likely to be compute-limited or memory-access limited. The total traffic for the 2 cases above is almost the same, but the compute resources are substantially different. Interpreting these results as indicating a compute-bound application seems like a reasonable start.
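To make this point concrete, here is a back-of-the-envelope check using the counts posted above. Assuming each *_MEM_FILL event corresponds to one 64-byte cache line moved to or from GDDR (and that VTune's displayed counts are scaled event totals), both runs imply only a few GB/s of DRAM traffic, far below KNC's measured STREAM bandwidth of roughly 150-170 GB/s, which supports the compute-bound interpretation.

```python
LINE_BYTES = 64  # one cache line per L2 miss fill (assumption)

def dram_bw_gbs(read_mem_fills, write_mem_fills, seconds):
    """Approximate DRAM bandwidth implied by the *_MEM_FILL counts."""
    return (read_mem_fills + write_mem_fills) * LINE_BYTES / seconds / 1e9

bw1 = dram_bw_gbs(1_862_300_000, 1_964_750_000, 73)  # scenario 1
bw2 = dram_bw_gbs(1_524_350_000, 2_548_200_000, 48)  # scenario 2
print(f"scenario 1: ~{bw1:.1f} GB/s, scenario 2: ~{bw2:.1f} GB/s")
# prints: scenario 1: ~3.4 GB/s, scenario 2: ~5.4 GB/s
```

Since neither run comes close to saturating memory bandwidth, the runtime difference is better explained by how the 40 threads share core execution resources than by memory traffic.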
Hi John,
I checked the compute metrics for both cases: 2430 Mops for the first case and 4079 Mops for the second. So what you said seems correct; their effective compute rates are quite different. Thank you for your suggestions!
Also, when these KMP_AFFINITY modes are used, CPU #1 and CPU #2 share an L2 cache.

The initial question says "KNC". It is KNL where tiles are used and two cores share an L2 cache; on KNC, each core has its own private L2.
