Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Performance of kmp_affinity explicit on Intel Xeon Phi

jaisimha_sai_k_
Beginner

Hello,

I am a new MSc student working with the Intel Xeon Phi. For my current project, I am running the NAS Parallel Benchmark BT on a Xeon Phi coprocessor with two different affinity patterns, using KMP_AFFINITY=explicit.

Scenario 1: 40 threads, with 4 threads per physical core (4 hardware threads on each of 10 physical cores).

Results from VTune Amplifier:
Elapsed time: 73 seconds
L2_DATA_READ_MISS_CACHE_FILL: 76,150,000
L2_DATA_WRITE_MISS_CACHE_FILL: 616,750,000
L2_DATA_READ_MISS_MEM_FILL: 1,862,300,000
L2_DATA_WRITE_MISS_MEM_FILL: 1,964,750,000

Scenario 2: the same benchmark with the same 40 threads, but with one thread per physical core (40 physical cores in total).

Results from VTune Amplifier:
Elapsed time: 48 seconds
L2_DATA_READ_MISS_CACHE_FILL: 272,800,000
L2_DATA_WRITE_MISS_CACHE_FILL: 104,200,000
L2_DATA_READ_MISS_MEM_FILL: 1,524,350,000
L2_DATA_WRITE_MISS_MEM_FILL: 2,548,200,000

Although scenario 2 has a significant number of write misses, around 580 million more than scenario 1, it takes significantly less time to complete. I am confused about why the first configuration takes so much longer; I expected it to be faster. What might be the reason for this? I would really appreciate your help!
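(A minimal sketch for double-checking where the explicit pinning actually places each thread; this is illustrative and not from the original post. It assumes the Intel compiler and a native run on the coprocessor with the same KMP_AFFINITY=explicit,proclist=[...] setting used for the benchmark; the file name check_affinity.c is a placeholder.)

#define _GNU_SOURCE        /* needed for sched_getcpu() */
#include <stdio.h>
#include <sched.h>
#include <omp.h>

int main(void)
{
    /* Each OpenMP thread reports the logical CPU it is running on, so the
     * explicit proclist can be compared against the intended mapping.
     * On KNC the OS typically numbers logical CPUs 1-4 on physical core 0,
     * 5-8 on core 1, and so on. */
    #pragma omp parallel
    {
        printf("thread %2d of %2d on logical CPU %3d\n",
               omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
    }
    return 0;
}

Compiling with something like icc -qopenmp -mmic check_affinity.c -o check_affinity and running it under the benchmark's environment settings would show whether the 40 threads really land on 10 or on 40 physical cores.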

5 Replies
TimP
Honored Contributor III

I haven't seen any experts on NPB frequenting this forum.  As you appear to be concentrating on MIC, that forum https://software.intel.com/en-us/forums/intel-many-integrated-core appears to be a better bet for follow-up.

I don't know whether the current MIC, KNL, is considered better suited to BT than the past KNC was. KNL is designed to achieve full performance with 1 thread per core, while KNC normally peaked at 2 or 3 threads per core. That is not to say that multiple threads per core could be expected to match the performance of the same number of threads spread out one per core.

I guess you may be running the OpenMP version of BT. You really don't give enough detail here to answer your questions, some of which you might have answered yourself by reading the literature about your processor. The explicit type in KMP_AFFINITY doesn't do much on its own beyond enabling some of the more important modifiers (such as proclist). Did you achieve performance gains over the KMP_HW_SUBSET defaults together with OMP_PROC_BIND=close?

jaisimha_sai_k_
Beginner

Hi Tim,

I am using KNC, and I read that KNC performance peaks when using more than one thread per core, so I expected scenario 1 to give better performance; what is happening here is confusing. I am running the OpenMP version of BT and didn't set KMP_HW_SUBSET or OMP_PROC_BIND. I tried to pin the threads to cores using KMP_AFFINITY=explicit. Isn't this sufficient for binding threads to cores?

McCalpinJohn
Honored Contributor III

It would help to specify which version and which problem size are being run. We could probably make a pretty good guess from looking at the timings, but it would be a lot easier if the experimental details were provided up front (especially for such a well-known benchmark -- it does not take many words to make a clear specification).

There have been plenty of analyses of the NAS BT benchmark over the years.  The paper introducing the OpenMP version is a good example of an informative high-level analysis (https://www.nas.nasa.gov/assets/pdf/techreports/1999/nas-99-011.pdf).  In particular, Figure 3 in that paper shows the scalability of the major functions in the code and discusses how the routines with the least cache-friendly memory access patterns (e.g., "rhsz") show the poorest scaling.

It is very hard to comment on the interaction of the code with the caches unless we know which problem size is being run.  The Class A, B, and C sizes of the BT benchmark use grid sizes of 64^3, 102^3, and 162^3, respectively, and the BT benchmark is known to have fairly strong idiosyncrasies with scaling (e.g., https://www.researchgate.net/publication/261130301_Performance_Evaluation_of_NAS_Parallel_Benchmarks_on_Intel_Xeon_Phi), with different idiosyncrasies for the different problem sizes.

There is relatively little difference in overall data traffic beyond the L2 in the results posted above -- the sums of the 4 categories are within 2%.   This suggests that looking at the memory traffic is not the right place to look for the performance difference.   One very important difference between the two tests is in the number of cores available for instruction execution.  The first case must execute the instructions for all 40 threads on 10 physical cores (each limited to 2 instructions per cycle), while the second case has 40 cores to execute the instructions.

If the values above are interpreted as total counts of cache-line transactions, then the first case is only moving 4 GB/s and the second case is only moving 6 GB/s.   Since one KNL core can move more than 10 GB/s, this strongly suggests that the problem is compute-limited, and looking at these memory access values is irrelevant.
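(To make that arithmetic explicit, a quick illustrative calculation, assuming, as stated above, that each counted fill event corresponds to one 64-byte cache line; this sketch is not from the original post.)

#include <stdio.h>

int main(void)
{
    const double line_bytes = 64.0;

    /* Scenario 1: 40 threads on 10 cores, 73 s elapsed */
    double fills1 = 76150000.0 + 616750000.0 + 1862300000.0 + 1964750000.0;
    /* Scenario 2: 40 threads on 40 cores, 48 s elapsed */
    double fills2 = 272800000.0 + 104200000.0 + 1524350000.0 + 2548200000.0;

    printf("scenario 1: %.1f GB/s\n", fills1 * line_bytes / 73.0 / 1e9);  /* ~4.0 GB/s */
    printf("scenario 2: %.1f GB/s\n", fills2 * line_bytes / 48.0 / 1e9);  /* ~5.9 GB/s */
    return 0;
}

Both figures are far below what the coprocessor's memory subsystem can sustain, which supports the conclusion that the runs are limited by instruction execution rather than by memory traffic.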

jaisimha_sai_k_
Beginner

I am using NPB version 3.0, the BT benchmark, Class A, with a grid size of 64^3. What I am wondering is how the first case can have that many cache misses when the threads are sharing the local L2 cache. Maybe what I am thinking is wrong; I don't understand!

jaisimha_sai_k_
Beginner

Hi John

I looked at the compute rates for both cases: 2430 Mop/s for the first case and 4079 Mop/s for the second. So what you said seems correct; the available computational resources are different. Thank you for your suggestions!
