Solved: Optimizing Performance for the C API of Intel-Tensorflow

ETrau1 · ‎07-10-2019

I am currently experimenting with the Intel optimizations for TensorFlow because I am very impressed by the inference throughput shown here (https://software.intel.com/en-us/articles/maximize-tensorflow-performance-on-cpu-considerations-and-recommendations-for-inference).

As I learned from the article, the following options are useful for maximizing TensorFlow throughput during inference:

-> inter_op_parallelism = 2 (setting intra_op_parallelism to the # of physical cores seems to be recommended only for real-time inference)

-> using NCHW data format

-> setting several NUMA parameters

-> using KMP_AFFINITY=granularity=fine,verbose,compact,1,0

-> using KMP_BLOCKTIME=1 (I use a non-CNN network)

However, I have to use TensorFlow's C API within my project. I found out how to set inter_op_parallelism and intra_op_parallelism using the experimental header of the C API. Still, I cannot figure out how to apply the other recommended optimizations while using the C API.

I would be very grateful if somebody could give me a hint about whether that is possible at all and if yes, where to start.

Thanks in advance and best regards,

Elias Trautner

Nathan_G_Intel · ‎07-10-2019

@ETrau1 KMP_AFFINITY and KMP_BLOCKTIME are environment variables. Simply set them in the shell with "export <>".
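For a program that links the C API, the variables only need to be present in the process environment before the program starts. A minimal Python sketch of a launcher that exports them for a child process follows. The values are the ones quoted in the question above; the helper names are illustrative and not part of any TensorFlow or Intel tool.

```python
import os
import subprocess

# Values quoted in the question above; tune them for your own workload.
KMP_SETTINGS = {
    "KMP_AFFINITY": "granularity=fine,verbose,compact,1,0",
    "KMP_BLOCKTIME": "1",
}


def build_env(extra, base=None):
    """Return a copy of the base environment with the extra variables applied."""
    env = dict(os.environ if base is None else base)
    env.update(extra)
    return env


def run_with_settings(command, extra=KMP_SETTINGS):
    """Run command (a list of strings) with the extra variables exported.

    The child process sees the variables exactly as if they had been set
    with "export" in the shell that started it.
    """
    return subprocess.run(command, env=build_env(extra), check=True)
```

Calling run_with_settings(["./my_inference"]) would start a hypothetical binary named ./my_inference with both KMP variables set; the binary name is an assumption for illustration only.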

NCHW refers to the input data format, i.e. how you have stored/loaded your data in memory. It is not a setting.

NUMA is only relevant if you have a NUMA-enabled system. There is a great write-up here: https://www.thegeekdiary.com/centos-rhel-how-to-find-if-numa-configuration-is-enabled-or-disabled/

ETrau1 · ‎07-18-2019

Hello @NathanG_intel,

Thanks for your reply; I only just noticed it and have now tried some settings. For me, only the KMP settings improved inference performance, while NUMA (although it is enabled) and OMP_NUM_THREADS made inference slower. However, I would like to apply the KMP settings. My remaining question concerns execution on multi-node systems (e.g. 20 nodes with 24 cores each).

Should I set intra_op_parallelism_threads to 24*20 or to 24 (the physical cores on one node)?

Thanks!

Nathan_G_Intel · ‎07-19-2019

@ETrau1 can you run "lscpu" in your shell and paste the output into the forum?

ETrau1 · ‎07-22-2019

On each single node of our so-called large queue, entering "lscpu" gives me:

[elias@node120 ~]$ lscpu

Architecture: x86_64

CPU op-mode(s): 32-bit, 64-bit

Byte Order: Little Endian

CPU(s): 24

On-line CPU(s) list: 0-23

Thread(s) per core: 1

Core(s) per socket: 12

Socket(s): 2

NUMA node(s): 2

Vendor ID: GenuineIntel

CPU family: 6

Model: 63

Model name: Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz

Stepping: 2

CPU MHz: 1199.975

CPU max MHz: 3100,0000

CPU min MHz: 1200,0000

BogoMIPS: 4599.62

Virtualization: VT-x

L1d cache: 32K

L1i cache: 32K

L2 cache: 256K

L3 cache: 30720K

NUMA node0 CPU(s): 0-11

NUMA node1 CPU(s): 12-23

Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt xsave avx f16c rdrand lahf_lm abm epb intel_ppin ssbd ibrs ibpb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts

I do not know a way to run lscpu for the entire queue, but for our simulations we want to use 1, 2, 5, 10 and 20 of these nodes in turn. I have access to 24 cores per node (2 sockets * 12 cores), meaning I can use 24, 48, 120, 240 and 480 cores for the inference.
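The core counts above can be derived directly from the lscpu fields. The following is a small Python sketch, assuming the "Key: value" layout shown in the output above; the function names are illustrative.

```python
def parse_lscpu(text):
    """Turn the 'Key: value' lines of lscpu output into a dict of strings."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip blank lines and anything without a colon
            info[key.strip()] = value.strip()
    return info


def physical_cores(info):
    """Return (physical cores per socket, physical cores per machine)."""
    per_socket = int(info["Core(s) per socket"])
    sockets = int(info["Socket(s)"])
    return per_socket, per_socket * sockets


# Excerpt of the lscpu output posted above.
SAMPLE = """\
CPU(s): 24
Thread(s) per core: 1
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
"""

info = parse_lscpu(SAMPLE)
per_socket, per_machine = physical_cores(info)

# Total cores for the planned runs on 1, 2, 5, 10 and 20 machines.
totals = [n * per_machine for n in (1, 2, 5, 10, 20)]
print(per_socket, per_machine, totals)  # 12 24 [24, 48, 120, 240, 480]
```

Multiplying cores per socket by sockets counts physical cores only, so the result stays correct on machines where "Thread(s) per core" is greater than 1.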

Nathan_G_Intel · ‎07-23-2019

@ETrau1 TensorFlow has known NUMA performance issues across nodes, so we recommend launching 1 process per socket. In the case of inference, you can launch multiple processes (1 for each socket). For intersocket execution, we recommend setting inter_op_threads and MKL_NUM_THREADS to the "# of physical cores", NOT the # of available hyperthreads. In your case, the number of physical cores per socket is 12 and the number of NUMA nodes is 2.

ETrau1 · ‎07-24-2019

I am just using TensorFlow for inference purposes. I assumed the intra_op_threads should be 12 in this case...? And inter_op still 2 as suggested on the optimization pages?

So I would assume it would be best to use intra_op = 12 and inter_op =2?

Nathan_G_Intel · ‎07-24-2019

@ETrau1 Yes, you are correct. I had a typo in my response, writing "inter" instead of "intra". I apologize. Your statement "best to use intra_op = 12 and inter_op =2" is correct. You can always fine-tune empirically if you have the time for testing, but this is our general recommendation (directly from our engineering dev teams).
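Besides the experimental header mentioned in the question, the C API function TF_SetConfig accepts a serialized ConfigProto, so the two thread settings can also be supplied as a few raw bytes. The Python sketch below hand-encodes those bytes for intra_op = 12 and inter_op = 2. It assumes that intra_op_parallelism_threads is field 2 and inter_op_parallelism_threads is field 5 of ConfigProto; please check these numbers against the config.proto of your TensorFlow version before relying on them.

```python
def encode_varint(value):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    if value < 0:
        raise ValueError("only non-negative values are handled in this sketch")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def varint_field(field_number, value):
    """Encode one varint field: tag (field_number << 3, wire type 0), then value."""
    return encode_varint(field_number << 3) + encode_varint(value)


# ASSUMED ConfigProto field numbers; verify against your config.proto.
INTRA_OP_FIELD = 2
INTER_OP_FIELD = 5


def thread_config(intra_op, inter_op):
    """Return the serialized bytes that carry only the two thread settings."""
    return (varint_field(INTRA_OP_FIELD, intra_op)
            + varint_field(INTER_OP_FIELD, inter_op))


config = thread_config(intra_op=12, inter_op=2)
print(config.hex())  # 100c2802
```

Under the stated assumption, these four bytes could be copied into a uint8_t array in the C program and handed to TF_SetConfig together with their length.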

ETrau1 · ‎07-24-2019

OK, thanks for the quick reply. So I can conclude that I can keep intra_op=12 and inter_op=2 regardless of the number of nodes I use, right?

Nathan_G_Intel · ‎07-24-2019

@ETrau1 Yes, this is a per-node recommendation. You do not want to execute across multiple NUMA nodes with a single program, as you will run into poor performance. Instead, launch a single inference program stream on each node simultaneously, using the recommended inter/intra op threads. Also, don't forget to set OMP_NUM_THREADS as well. This is the most straightforward way to get strong performance. Alternatively, you can think about launching more than one program stream per node (still without spreading a stream's execution across nodes) to pull out more performance. You will have to measure the optimal number of streams per node empirically. We have a whitepaper if you want to try this advanced execution (https://www.intel.ai/solutions/best-known-methods-for-scaling-deep-learning-with-tensorflow-on-intel-xeon-processor-based-clusters/). See the "Multi-Stream Inference on the Trained Model" section for more details.
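The one-stream-per-NUMA-node layout described above could be sketched as follows. This is only an illustration under several assumptions: it uses the numactl tool (not mentioned in this thread) with its --cpunodebind and --membind options to pin each stream, it sets OMP_NUM_THREADS to the 12 physical cores per node, and ./my_inference is a hypothetical binary name.

```python
import shlex


def stream_plan(numa_nodes, cores_per_node, command):
    """Return one (env, argv) pair per NUMA node, each pinned to its own node."""
    plans = []
    for node in range(numa_nodes):
        # Assumption: one OpenMP thread per physical core of the node.
        env = {"OMP_NUM_THREADS": str(cores_per_node)}
        argv = [
            "numactl",
            f"--cpunodebind={node}",  # run only on this node's CPUs
            f"--membind={node}",      # allocate only from this node's memory
        ] + list(command)
        plans.append((env, argv))
    return plans


def as_shell_line(env, argv):
    """Render one plan entry as a background shell command."""
    exports = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    return f"{exports} {shlex.join(argv)} &"


# 2 NUMA nodes with 12 physical cores each, as in the lscpu output above.
for env, argv in stream_plan(2, 12, ["./my_inference"]):
    print(as_shell_line(env, argv))
```

Each printed line starts one stream in the background. Running more than one stream per node would mean repeating a node with a smaller core count, which has to be measured empirically as noted above.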