Performance of kmp_affinity explicit on Intel Xeon Phi KNC

jaisimha_sai_k_
Beginner

Hello,

I am a new MSc student working with the Intel Xeon Phi (KNC). For my current project, I am running the NAS Parallel Benchmarks BT workload on the Xeon Phi coprocessor with two different affinity patterns using KMP_AFFINITY=explicit.

Scenario 1: running with 40 threads, 4 threads per physical core (4 hardware threads per core, 10 physical cores in total).

Results from VTune Amplifier:

Elapsed time: 73 seconds

L2_DATA_READ_MISS_CACHE_FILL: 76,150,000
L2_DATA_WRITE_MISS_CACHE_FILL: 616,750,000
L2_DATA_READ_MISS_MEM_FILL: 1,862,300,000
L2_DATA_WRITE_MISS_MEM_FILL: 1,964,750,000

Scenario 2: running the same benchmark with the same 40 threads, but with one thread per physical core (40 physical cores in total).

Results from VTune Amplifier:

Elapsed time: 48 seconds

L2_DATA_READ_MISS_CACHE_FILL: 272,800,000
L2_DATA_WRITE_MISS_CACHE_FILL: 104,200,000
L2_DATA_READ_MISS_MEM_FILL: 1,524,350,000
L2_DATA_WRITE_MISS_MEM_FILL: 2,548,200,000

Although scenario 2 has significantly more write misses, around 580 million more than scenario 1, it completes in significantly less time. I have also read that KNC performance peaks when more than one thread per core is used, so I am confused about why the first configuration takes so much longer; I expected it to be faster. What might be the reason for this? I would really appreciate your help!
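For concreteness, the two runs can be expressed as environment settings like the following. This is only a sketch: the proclist values assume the common KNC numbering where core N's four hardware threads appear as logical CPUs 4N+1 through 4N+4, which should be verified with cpuinfo on the actual card.

```shell
#!/bin/sh
# Sketch of the two placements. Assumes core N's four hardware threads
# are logical CPUs 4N+1..4N+4; verify with cpuinfo on the card.

export OMP_NUM_THREADS=40

# Scenario 1: 40 threads packed 4-per-core onto 10 physical cores.
export KMP_AFFINITY="explicit,granularity=thread,proclist=[1-40]"
echo "scenario 1: $KMP_AFFINITY"

# Scenario 2: 40 threads spread 1-per-core over 40 physical cores,
# i.e. the first hardware thread of each core: 1, 5, 9, ..., 157.
list=""
for core in $(seq 0 39); do
  cpu=$((4 * core + 1))
  list="${list:+$list,}$cpu"
done
export KMP_AFFINITY="explicit,granularity=thread,proclist=[$list]"
echo "scenario 2: $KMP_AFFINITY"
```

The settings must be exported before the OpenMP runtime initializes, i.e. before the benchmark starts.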

4 Replies
McCalpinJohn
Honored Contributor III

As I mentioned in the other forum (https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/714924#comment-1898880), you need to evaluate whether this code is likely to be compute-limited or memory-access limited.  The total traffic for the 2 cases above is almost the same, but the compute resources are substantially different.  Interpreting these results as indicating a compute-bound application seems like a reasonable start.
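A quick back-of-the-envelope check supports this: converting the *_MEM_FILL counts above to approximate DRAM traffic, assuming each fill moves one 64-byte cache line (and ignoring prefetch traffic that these events do not capture), gives only a few GB/s in both runs, a small fraction of what a KNC card can sustain:

```shell
#!/bin/sh
# Rough DRAM traffic estimate from the *_MEM_FILL counters in this thread,
# assuming one 64-byte cache line per fill (prefetch traffic not counted).
awk 'BEGIN {
  line = 64
  gb1 = (1862300000 + 1964750000) * line / 1e9   # scenario 1 total, GB
  gb2 = (1524350000 + 2548200000) * line / 1e9   # scenario 2 total, GB
  printf "scenario 1: %.0f GB, %.1f GB/s\n", gb1, gb1 / 73
  printf "scenario 2: %.0f GB, %.1f GB/s\n", gb2, gb2 / 48
}'
```

Similar total traffic with very different elapsed times, neither close to saturating memory bandwidth, points at compute rather than memory as the limiter.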

jaisimha_sai_k_
Beginner

Hi John,

I looked at the compute throughput for both cases: 2,430 Mops for the first case and 4,079 Mops for the second. So what you said seems correct; their compute resources are different. Thank you for your suggestions!
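As a sanity check on those figures, the throughput ratio between the two runs closely tracks the observed wall-clock speedup:

```shell
#!/bin/sh
# Compare the Mops ratio to the wall-clock speedup (figures from this thread).
awk 'BEGIN {
  printf "Mops ratio (case 2 / case 1): %.2f\n", 4079 / 2430
  printf "Speedup    (73 s / 48 s):     %.2f\n", 73 / 48
}'
```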

SergeyKostrov
Valued Contributor II
>>...two different affinity patterns using KMP_AFFINITY explicit...

If you try KMP_AFFINITY set to scatter or balanced, you should get the highest performance numbers, similar to scenario 2. Also, when these KMP_AFFINITY modes are used, CPU #1 and CPU #2 share the L2 cache, CPU #1 has exclusive use of its L1, and so on (execute cpuinfo for more details if you are using a Linux OS). The high number of cache misses in both of your cases indicates a problem with the test case; if this is OpenMP processing, then the data partitioning is not done correctly.
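A minimal sketch of those settings; adding the verbose modifier makes the OpenMP runtime print the actual thread-to-CPU bindings at startup, so the placement can be checked:

```shell
#!/bin/sh
# Let the OpenMP runtime place threads instead of an explicit proclist.
export OMP_NUM_THREADS=40

# scatter: round-robin threads across physical cores first.
export KMP_AFFINITY="scatter,granularity=fine,verbose"

# balanced (Xeon Phi specific): spread across cores, but keep consecutive
# thread IDs on neighbouring hardware threads of the same core.
export KMP_AFFINITY="balanced,granularity=fine,verbose"
echo "$KMP_AFFINITY"
```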
James_C_Intel2
Employee

>>Also, when these KMP_AFFINITY modes are used CPU #1 and CPU#2 share L2,

The initial question says "KNC". It is KNL where tiles are used and two cores share an L2 cache; on KNC each core has its own private L2.
