I read the article. It mentioned that 2nd-generation instructions such as AVX512_VNNI are optimized for neural networks.
I ran one of the INT8 models from the IntelAI models repository:
https://github.com/IntelAI/models/tree/master/benchmarks
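For reference, I launch the benchmark roughly like this (placeholders in angle brackets; the exact flags are documented in the repository's README):

cd models/benchmarks
python launch_benchmark.py \
    --model-name <model> \
    --precision int8 \
    --mode inference \
    --framework tensorflow \
    --docker-image docker.io/intelaipg/intel-optimized-tensorflow:latest \
    --in-graph <pretrained_int8_graph.pb>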
Here is my environment:
- Docker: docker.io/intelaipg/intel-optimized-tensorflow:latest
- CPU info
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping: 7
CPU MHz: 1838.080
BogoMIPS: 5000.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-23,48-71
NUMA node1 CPU(s): 24-47,72-95
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
I expected the neural network to run with the 2nd-generation instructions (AVX512_VNNI),
but TensorFlow reports that the following optimized instructions are used:
AVX512F, AVX2, FMA
Is the docker image the optimized version for running neural networks?
How can I find out whether AVX512_VNNI is used or not?
How can I compile the code provided by IntelAI with the 2nd-generation Intel instructions?
Which docker image should I use to run the program?
Thanks in advance
Hi,
I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations: AVX2 AVX512F FMA
The message shown above is not a concern for the Intel optimization for TensorFlow, since either MKL-DNN or MKL performs dynamic dispatch at runtime to take advantage of the latest instruction set supported by your hardware.
For MKL-DNN, it will show:
dnnl_verbose,info,DNNL v1.1.0 (commit 5be2cfea21ec6d1d29f52600553baff53e30aedb)
dnnl_verbose,info,Detected ISA is Intel AVX-512 with Intel DL Boost
For MKL, it will show:
MKL_VERBOSE Intel(R) MKL 2019.0 Update 3 Product build 20190125 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.10GHz lp64 intel_thread
If you see the above messages, then VNNI will be used at runtime.
Use the following commands to show the verbose messages:
export MKLDNN_VERBOSE=1   (or: export DNNL_VERBOSE=1)
export MKL_VERBOSE=1
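For example, a quick way to check at runtime (the script name is a placeholder for your own workload):

MKLDNN_VERBOSE=1 python <your_script.py> 2>&1 | grep -i "detected isa"
# expected: ...Detected ISA is Intel AVX-512 with Intel DL Boost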
As for performance, have you tried environment variables like KMP_AFFINITY and/or OMP_NUM_THREADS?
These environment variables affect performance. They are set automatically in some of the docker images; in others you have to set them yourself.
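For example, a minimal sketch (thread counts depend on your machine; with 2 sockets x 24 cores you would use 48 physical cores):

export OMP_NUM_THREADS=48                                 # number of physical cores
export KMP_AFFINITY=granularity=fine,verbose,compact,1,0  # pin threads to cores
export KMP_BLOCKTIME=1                                    # commonly recommended for inference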
All docker images released under docker.io/intelaipg/intel-optimized-tensorflow should have VNNI support enabled.
Hi,
May I know which model you tested? Please also let me know your steps.
1. If you run the benchmark with the environment variable DNNL_VERBOSE set to 1, you will see messages like the following at the beginning of the verbose output. If VNNI is supported, it will be shown in these messages:
dnnl_verbose,info,DNNL v1.1.0 (commit 5be2cfea21ec6d1d29f52600553baff53e30aedb)
dnnl_verbose,info,Detected ISA is Intel AVX-512 with Intel DL Boost
2. You don't need to compile the code. DNNL dispatches code at runtime automatically.
3. Please use the docker image mentioned on the GitHub page, like https://github.com/IntelAI/models/tree/master/benchmarks/image_recognition/tensorflow/resnet50,
e.g. gcr.io/deeplearning-platform-release/tf-cpu.1-15
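A rough sketch of pulling and entering that image with verbose logging enabled (the mounted path is a placeholder):

docker pull gcr.io/deeplearning-platform-release/tf-cpu.1-15
docker run -it --rm \
    -e MKLDNN_VERBOSE=1 -e DNNL_VERBOSE=1 \
    -v /path/to/models:/workspace \
    gcr.io/deeplearning-platform-release/tf-cpu.1-15 bash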
Hi,
I run "wide & deep" model.
The Int8 Model can't run on docker "gcr.io/deeplearning-platform-release/tf-cpu.1-15"
Some error occurs.
I choose docker "docker.io/intelaipg/intel-optimized-tensorflow:latest", which I think it's the last version of optimized-tensorflow with MKL-DNN
It shows some messages:
I tensorflow/core/platform/cpu_feature_guard.cc:145] This TensorFlow binary is optimized with Intel(R) MKL-DNN to use the following CPU instructions in performance critical operations: AVX2 AVX512F FMA
To enable them in non-MKL-DNN operations, rebuild TensorFlow with the appropriate compiler flags.
I think it works with the instruction optimizations, but I can't see the messages you mentioned:
dnnl_verbose,info,DNNL v1.1.0 (commit 5be2cfea21ec6d1d29f52600553baff53e30aedb)
dnnl_verbose,info,Detected ISA is Intel AVX-512 with Intel DL Boost
I also tried "export DNNL_VERBOSE=1", but it doesn't work.
Is there anything wrong?
Would you provide a docker image for us to run "wide & deep" with VNNI?
Thank you very much~
Hi,
The dataset is large, and it takes time to download.
Could you please try MKLDNN_VERBOSE=1 instead?
I'll investigate this issue once I've got the dataset downloaded.
Thank you.
Hi Lin ChiungLiang,
Two items that may help:
1) The message output by the CPU feature guard is helpful. It means that the binary was compiled with GCC flags that use AVX instructions, but, to allow the container to work on the greatest number of systems possible, it was not compiled with *static* AVX2, AVX512, or AVX512_VNNI instructions in the Eigen library, which would cause TensorFlow in that container to crash when run on older systems.
However, MKL-DNN detects CPU features at run time and adjusts accordingly. Thus, when TensorFlow loads the MKL-DNN library, AVX512_VNNI instructions will be used if they are available on that system (a quick hardware check follows at the end of this post).
2) The TensorFlow version in the docker.io/intelaipg/intel-optimized-tensorflow:latest container is TensorFlow 1.15, which uses MKL-DNN version 0.x. If you want to see the verbose output in that version, as suggested above, you need to set MKLDNN_VERBOSE=1. DNNL_VERBOSE=1 will only work once MKL-DNN 1.x has been integrated into TensorFlow.
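As a sanity check, you can also confirm that the host itself exposes the instruction set, e.g. with:

grep -m1 -o avx512_vnni /proc/cpuinfo   # prints avx512_vnni if the CPU supports it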
Hi Robison,
Thanks for your information.
1) Is there any command to check the version of MKL-DNN?
2) As I mentioned, I'd like to run the INT8 "wide & deep" model. Would you please let me know which docker image I should use?
3) I found that when I run the model, it can't fully utilize the CPUs.
I'm sure all CPUs are used, but I don't know why their utilization is still low, about 30%~40%.
Lots of thanks
Hi,
I'm still checking with the dev team about the CPU usage. Probably the workload of this task is not large enough.
Alternatively, you may wish to try environment variables like KMP_AFFINITY (https://software.intel.com/en-us/articles/maximize-tensorflow-performance-on-cpu-considerations-and-recommendations-for-inference).
1) Please check the first MKL-DNN verbose message. MKL-DNN versions newer than 0.18 print their version information as the first verbose message (see the example after this list).
2) The docker image mentioned on the GitHub page works for the INT8 version of this model. Please use that docker image.
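For example, to capture just that first verbose message (the script name is a placeholder):

MKLDNN_VERBOSE=1 python <your_script.py> 2>&1 | grep -m1 "mkldnn_verbose,info"
# e.g. mkldnn_verbose,info,Intel MKL-DNN v0.20.3 (commit N/A)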
Hi,
Any update?
I'm still waiting for your reply.
I tried the docker image you mentioned on GitHub, but it isn't the optimized one:
it only used the AVX512F optimized instructions,
and the performance of the model on that image is even worse (longer computation time) than on the image I mentioned,
docker.io/intelaipg/intel-optimized-tensorflow:latest,
which used the AVX512F, AVX2, and FMA instructions.
Please help me check which docker image is the best.
Lots of thanks
Hi,
As mentioned above, please set MKLDNN_VERBOSE=1 (or DNNL_VERBOSE=1) and MKL_VERBOSE=1, then check the verbose output for the "Detected ISA is Intel AVX-512 with Intel DL Boost" message. For performance, please also try KMP_AFFINITY and/or OMP_NUM_THREADS.
Hi,
Thanks for your response.
Finally, I saw the messages when I enabled the flags you mentioned:
mkldnn_verbose,info,Intel MKL-DNN v0.20.3 (commit N/A)
mkldnn_verbose,info,Detected ISA is Intel AVX-512 with Intel DL Boost
I have some questions:
1) If the flags are not set to 1, does the program run without VNNI, or does it just run without printing the information?
The results are interesting: when I enable the flags, the computation time increases a little.
I think the program just runs without printing when the flags are disabled, so running with printing enabled makes the results slightly worse.
2) About CPU utilization:
I set the number of intra-op threads equal to the number of physical cores and the number of inter-op threads to 2,
but the utilization rate can't reach 100%.
If you have any comment, please let me know
Lots of thanks
Hi,
1) Exactly, your understanding is correct. The environment variable only controls whether the message is printed or not.
2) It is possible for CPU usage not to reach 100%, depending on the use case. Are you satisfied with the performance?
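If you want to experiment further, the benchmark launcher also exposes the thread settings directly; a sketch (flag names as in the IntelAI benchmarks README, so please verify them there):

python launch_benchmark.py <other flags as before> \
    --num-intra-threads 48 \
    --num-inter-threads 2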
Hi,
The performance is good.
I just want to know how to get the best result.
Thanks for your help.
Hi,
For simple methods, you can refer to the article linked earlier on maximizing TensorFlow performance on CPU.
For more advanced tuning, you need to profile the execution to see which parts take the longest time, and improve them accordingly; one quick approach is sketched below.
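For instance, the MKL-DNN verbose log itself can serve as a rough profiler: each "exec" line ends with the primitive's execution time in milliseconds, so you can sort by that last field to find the slowest primitives (a sketch; the script name is a placeholder):

MKLDNN_VERBOSE=1 python <your_script.py> 2>&1 \
    | grep "mkldnn_verbose,exec" \
    | awk -F, '{print $NF, $0}' \
    | sort -g -r | head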
Hi,
Could you please confirm whether the solution provided was helpful?
Hi,
Yes, thanks for your help.
Hi,
Thanks for the confirmation. We are closing this thread. Feel free to open a new thread if you have any further queries.
