Performance of kmp_affinity explicit on Intel Xeon Phi KNC

jaisimha_sai_k_
Beginner

Hello,

I am a new MSc student working with the Intel Xeon Phi (KNC). For my current project, I am running the NAS Parallel Benchmarks BT workload on the Xeon Phi coprocessor with two different affinity patterns using KMP_AFFINITY=explicit.

Scenario 1: running with 40 threads, 4 threads per physical core (4 hardware threads per core, 10 physical cores in total).

Results from VTune Amplifier:

Elapsed time: 73 seconds

L2_DATA_READ_MISS_CACHE_FILL: 76,150,000
L2_DATA_WRITE_MISS_CACHE_FILL: 616,750,000
L2_DATA_READ_MISS_MEM_FILL: 1,862,300,000
L2_DATA_WRITE_MISS_MEM_FILL: 1,964,750,000

Scenario 2: running the same benchmark with the same 40 threads, but with one thread per physical core (40 physical cores in total).

Results from VTune Amplifier:

Elapsed time: 48 seconds

L2_DATA_READ_MISS_CACHE_FILL: 272,800,000
L2_DATA_WRITE_MISS_CACHE_FILL: 104,200,000
L2_DATA_READ_MISS_MEM_FILL: 1,524,350,000
L2_DATA_WRITE_MISS_MEM_FILL: 2,548,200,000

Although scenario 2 has significantly more write misses, around 580 million more than scenario 1, it completes in significantly less time. I have also read that KNC performance peaks when more than one thread per core is used, so I am confused about why the first configuration takes so much longer; I expected it to be faster. What might be the reason for this? I would really appreciate your help!
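For concreteness, the two runs can be expressed as environment settings like the following. This is only a sketch: the proclist values assume the common KNC numbering where core N's four hardware threads appear as logical CPUs 4N+1 through 4N+4, which should be verified with cpuinfo on the actual card.

```shell
#!/bin/sh
# Sketch of the two placements. Assumes core N's four hardware threads
# are logical CPUs 4N+1..4N+4; verify with cpuinfo on the card.

export OMP_NUM_THREADS=40

# Scenario 1: 40 threads packed 4-per-core onto 10 physical cores.
export KMP_AFFINITY="explicit,granularity=thread,proclist=[1-40]"
echo "scenario 1: $KMP_AFFINITY"

# Scenario 2: 40 threads spread 1-per-core over 40 physical cores,
# i.e. the first hardware thread of each core: 1, 5, 9, ..., 157.
list=""
for core in $(seq 0 39); do
  cpu=$((4 * core + 1))
  list="${list:+$list,}$cpu"
done
export KMP_AFFINITY="explicit,granularity=thread,proclist=[$list]"
echo "scenario 2: $KMP_AFFINITY"
```

The settings must be exported before the OpenMP runtime initializes, i.e. before the benchmark starts.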

4 Replies
McCalpinJohn
Honored Contributor III

As I mentioned in the other forum (https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/714924#comment-1898880), you need to evaluate whether this code is likely to be compute-limited or memory-access limited.  The total traffic for the 2 cases above is almost the same, but the compute resources are substantially different.  Interpreting these results as indicating a compute-bound application seems like a reasonable start.
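A quick back-of-the-envelope check supports this: converting the *_MEM_FILL counts above to approximate DRAM traffic, assuming each fill moves one 64-byte cache line (and ignoring prefetch traffic that these events do not capture), gives only a few GB/s in both runs, a small fraction of what a KNC card can sustain:

```shell
#!/bin/sh
# Rough DRAM traffic estimate from the *_MEM_FILL counters in this thread,
# assuming one 64-byte cache line per fill (prefetch traffic not counted).
awk 'BEGIN {
  line = 64
  gb1 = (1862300000 + 1964750000) * line / 1e9   # scenario 1 total, GB
  gb2 = (1524350000 + 2548200000) * line / 1e9   # scenario 2 total, GB
  printf "scenario 1: %.0f GB, %.1f GB/s\n", gb1, gb1 / 73
  printf "scenario 2: %.0f GB, %.1f GB/s\n", gb2, gb2 / 48
}'
```

Similar total traffic with very different elapsed times, neither close to saturating memory bandwidth, points at compute rather than memory as the limiter.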

jaisimha_sai_k_
Beginner

Hi John,

I looked at the compute throughput for both cases: 2,430 Mops for the first case and 4,079 Mops for the second. So what you said seems correct; their compute resources are different. Thank you for your suggestions!
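As a sanity check on those figures, the throughput ratio between the two runs closely tracks the observed wall-clock speedup:

```shell
#!/bin/sh
# Compare the Mops ratio to the wall-clock speedup (figures from this thread).
awk 'BEGIN {
  printf "Mops ratio (case 2 / case 1): %.2f\n", 4079 / 2430
  printf "Speedup    (73 s / 48 s):     %.2f\n", 73 / 48
}'
```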

SergeyKostrov
Valued Contributor II
>>...two different affinity patterns using KMP_AFFINITY explicit...

If you try KMP_AFFINITY set to scatter or balanced, you should get the highest performance numbers, similar to scenario 2. Also, when these KMP_AFFINITY modes are used, CPU #1 and CPU #2 share the L2 cache, CPU #1 has exclusive use of its L1, and so on (execute cpuinfo for more details if you are using a Linux OS). The high number of cache misses in both of your cases indicates a problem with the test case; if this is OpenMP processing, then the data partitioning is not done correctly.
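A minimal sketch of those settings; adding the verbose modifier makes the OpenMP runtime print the actual thread-to-CPU bindings at startup, so the placement can be checked:

```shell
#!/bin/sh
# Let the OpenMP runtime place threads instead of an explicit proclist.
export OMP_NUM_THREADS=40

# scatter: round-robin threads across physical cores first.
export KMP_AFFINITY="scatter,granularity=fine,verbose"

# balanced (Xeon Phi specific): spread across cores, but keep consecutive
# thread IDs on neighbouring hardware threads of the same core.
export KMP_AFFINITY="balanced,granularity=fine,verbose"
echo "$KMP_AFFINITY"
```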
James_C_Intel2
Employee

>>Also, when these KMP_AFFINITY modes are used CPU #1 and CPU#2 share L2,

The initial question says "KNC". It is KNL where tiles are used and two cores share an L2 cache; on KNC each core has its own private L2.
