Solved: Running an int8 PyTorch model with AVX512_VNNI

lin__chiungliang · ‎02-06-2020

Hi,

I tried to convert a floating-point model to an int8 model using PyTorch.

The results are shown below:

.local/lib/python3.6/site-packages/torch/quantization/observer.py:172: UserWarning: Must run observer before calling calculate_qparams.                           Returning default scale and zero point.
  Returning default scale and zero point.")
---------> Run FP model
Sequential(
  (0): Linear(in_features=64, out_features=1000, bias=True)
  (1): ReLU()
  (2): Linear(in_features=1000, out_features=5, bias=True)
  (3): ReLU()
)
Size (MB): 0.280785
elapsed time 0.4727492332458496
---------> Run Int8 model
Sequential(
  (0): QuantizedLinear(
    in_features=64, out_features=1000, scale=1.0, zero_point=0
    (_packed_params): LinearPackedParams()
  )
  (1): QuantizedReLU()
  (2): QuantizedLinear(
    in_features=1000, out_features=5, scale=1.0, zero_point=0
    (_packed_params): LinearPackedParams()
  )
  (3): QuantizedReLU()
)
Size (MB): 0.074412
elapsed time 0.5817153453826904

I built a two-layer Linear model, applied static quantization, and then ran inference with both models.

The int8 parameters are about 1/4 the size of the fp32 parameters, as I expected.
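The roughly 4x reduction can be sanity-checked from the layer shapes alone. The sketch below is a back-of-the-envelope estimate, not a description of how `print_size_of_model` works. It assumes fp32 parameters take 4 bytes each, int8 weights take 1 byte each, and that the quantized layers keep their biases in fp32 (an assumption about the FBGEMM backend). Serialization overhead is ignored, so measured sizes should come out slightly larger. `estimate_bytes` is a hypothetical helper name.

```python
# Layer shapes from the printed model: Linear(64 -> 1000), Linear(1000 -> 5).
LAYERS = [(64, 1000), (1000, 5)]


def estimate_bytes(layers, weight_bytes, bias_bytes):
    """Estimate parameter storage for a stack of Linear layers."""
    total = 0
    for in_features, out_features in layers:
        total += in_features * out_features * weight_bytes  # weight matrix
        total += out_features * bias_bytes                  # bias vector
    return total


fp32_bytes = estimate_bytes(LAYERS, weight_bytes=4, bias_bytes=4)
# Assumption: int8 weights, biases left in fp32.
int8_bytes = estimate_bytes(LAYERS, weight_bytes=1, bias_bytes=4)

print(f"fp32 estimate: {fp32_bytes / 1e6:.6f} MB")
print(f"int8 estimate: {int8_bytes / 1e6:.6f} MB")
print(f"ratio: {int8_bytes / fp32_bytes:.3f}")
```

Under these assumptions the estimates are 280020 and 73020 bytes, close to the reported 0.280785 MB and 0.074412 MB.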

But the int8 run time is slightly longer (0.58 vs. 0.47 elapsed).

Here is my torch environment:

PyTorch built with:
  - GCC 7.3
  - Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,

I found that when I run the floating-point version, SGEMM is used to accelerate the computation.

But when I run the int8 version, no acceleration library appears to be used.

How can I run an optimized int8 model?

Many thanks.

JananiC_Intel · ‎02-07-2020

Hi,

Thanks for reaching out to us.

May I know which tool was used to convert the floating-point model to int8?

lin__chiungliang · ‎02-09-2020

Hi,

I wrote the code by following the steps in the PyTorch documentation.

import copy
import torch

# params
num_iter = 10000
batch_size = 10
L1, L2, L3 = 64, 1000, 5

...

# setup
model = torch.nn.Sequential(
    torch.nn.Linear(L1,L2),
    torch.nn.ReLU(),
    torch.nn.Linear(L2,L3),
    torch.nn.ReLU(),
)

model.eval()

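# Quantize the input tensor directly (quint8, fixed scale and zero point)
q_x = torch.quantize_per_tensor(x, scale=1e-3, zero_point=128, dtype=torch.quint8)
# Static quantization of a copy of the model with the FBGEMM backend
q_model = copy.deepcopy(model)
q_model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
torch.quantization.prepare(q_model, inplace=True)  # insert observers
torch.quantization.convert(q_model, inplace=True)  # swap in quantized modules
q_model.eval()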

# Baseline: print, measure size, and time the fp32 model
print('---------> Run FP model')
print(model)
print_size_of_model(model)
time_model_evaluation(model, x, num_iter)

# Same measurements for the quantized model, fed the quantized input
print('---------> Run Int8 model')
print(q_model)
print_size_of_model(q_model)
time_model_evaluation(q_model, q_x, num_iter)
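For reference, the affine mapping behind `torch.quantize_per_tensor` and the observer warning in the first post can be sketched in plain Python. This is a simplified illustration, not PyTorch's implementation: the helper names are hypothetical, and `choose_qparams` assumes a basic min/max rule over an unsigned 8-bit range. It illustrates two points. With `scale=1e-3` and `zero_point=128`, `quint8` can only represent inputs in roughly [-0.128, 0.127]. And a scale is normally derived from observed values, which is the step the warning reports as not having run, leaving the default `scale=1.0, zero_point=0` seen in the printed int8 model.

```python
QMIN, QMAX = 0, 255  # representable range of an unsigned 8-bit value


def quantize(x, scale, zero_point):
    """Map a float to an unsigned 8-bit integer, saturating at the range ends."""
    q = round(x / scale) + zero_point
    return max(QMIN, min(QMAX, q))


def dequantize(q, scale, zero_point):
    """Map an 8-bit integer back to the float it stands for."""
    return (q - zero_point) * scale


def choose_qparams(observed):
    """Hypothetical min/max rule: pick scale and zero_point covering the data."""
    lo = min(min(observed), 0.0)  # the range must include zero
    hi = max(max(observed), 0.0)
    scale = (hi - lo) / (QMAX - QMIN) or 1.0  # fall back to 1.0 for all-zero data
    zero_point = QMIN - round(lo / scale)
    return scale, max(QMIN, min(QMAX, zero_point))


# Parameters from the post: inputs at or above about 0.127 saturate at 255.
print(quantize(0.05, 1e-3, 128), quantize(1.0, 1e-3, 128))

# Default parameters from the printed int8 model: values round to whole numbers.
print(quantize(0.4, 1.0, 0), quantize(0.6, 1.0, 0))

# Parameters derived from observed data instead of defaults.
print(choose_qparams([0.0, 2.55]))
```

If the real inputs or activations fall outside the representable window, accuracy suffers regardless of speed, so it may be worth checking the value ranges before comparing timings.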

JananiC_Intel · ‎02-09-2020

Hi,

Thanks for the update. Try the commands below to log the Intel MKL-DNN primitives that are executed, along with basic statistics such as execution time and primitive parameters.

export MKLDNN_VERBOSE=1   # or: export DNNL_VERBOSE=1
export MKL_VERBOSE=1
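The same flags can also be set for a single run from Python rather than exported in the shell. A minimal sketch, assuming the variables need to be in the environment before the process that loads the libraries starts; `verbose_env` and `run_with_verbose` are hypothetical helper names, and the child command here is only a stand-in for the script that runs the model.

```python
import os
import subprocess
import sys


def verbose_env(base=None):
    """Return a copy of the environment with the verbose flags set."""
    env = dict(os.environ if base is None else base)
    env["MKLDNN_VERBOSE"] = "1"  # MKL-DNN v0.x name
    env["DNNL_VERBOSE"] = "1"    # name used after the rename to DNNL
    env["MKL_VERBOSE"] = "1"     # Intel MKL calls such as SGEMM
    return env


def run_with_verbose(argv):
    """Run a command with verbose logging enabled and return its output as text."""
    result = subprocess.run(
        argv,
        env=verbose_env(),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Stand-in child process; replace with the script that runs the model.
    child = [sys.executable, "-c", "import os; print(os.environ['MKL_VERBOSE'])"]
    print(run_with_verbose(child))
```

Capturing the output this way also makes it easy to save the log for later comparison between the fp32 and int8 runs.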

You can refer to the link below:

https://software.intel.com/en-us/forums/intel-optimized-ai-frameworks/topic/843478

lin__chiungliang · ‎02-09-2020

Hi,

I tried the flags.

- Floating-point model

MKL_VERBOSE shows that SGEMM is performed.

- Int8 model

No messages.

lin__chiungliang · ‎02-09-2020

Hi,

I ran a CNN in the same environment and got the following messages:

mkldnn_verbose,info,Intel MKL-DNN v0.21.1 (commit 7d2fd500bc78936d1d648ca713b901012f470dbc)
mkldnn_verbose,info,Detected ISA is Intel AVX-512 with AVX512BW, AVX512VL, and AVX512DQ extensions
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_nchw out:f32_nChw16c,num:1,64x1000x128x128,259.319
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_oihw out:f32_OIhw16i16o,num:1,100x1000x3x3,0.246094
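Verbose logs like this get long quickly, so it can help to aggregate them per primitive. The sketch below is a minimal parser, assuming each `exec` line is comma-separated with the primitive kind in the third field and the execution time in milliseconds in the last field (an assumption about the MKL-DNN v0.x verbose format). `summarize_verbose` is a hypothetical helper name.

```python
from collections import defaultdict


def summarize_verbose(lines):
    """Sum the reported time per primitive kind over mkldnn_verbose exec lines."""
    totals = defaultdict(float)
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) < 4 or fields[0] != "mkldnn_verbose" or fields[1] != "exec":
            continue  # skip info lines and unrelated output
        try:
            totals[fields[2]] += float(fields[-1])
        except ValueError:
            continue  # last field was not a number; ignore the line
    return dict(totals)


log = """\
mkldnn_verbose,info,Intel MKL-DNN v0.21.1 (commit 7d2fd500bc78936d1d648ca713b901012f470dbc)
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_nchw out:f32_nChw16c,num:1,64x1000x128x128,259.319
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_oihw out:f32_OIhw16i16o,num:1,100x1000x3x3,0.246094
"""

print(summarize_verbose(log.splitlines()))
```

Listing the primitive kinds that appear is a quick way to see which layer types actually went through MKL-DNN in a given run.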

It seems that

1) MKLDNN is turned on

2) MKLDNN supports nn.Conv2d, but not nn.Linear

Am I right?

Thanks in advance.

JananiC_Intel · ‎02-10-2020

Hi,

Thanks for the immediate response.

Yes, you are right: MKL-DNN is turned on and supports the convolutional layer. Regarding int8 acceleration and MKL-DNN support for the linear layer, we will check with a subject-matter expert (SME) and get back to you soon.

JananiC_Intel · ‎02-11-2020

Hi,

For the int8 model, no MKL-DNN log output is displayed because your quantization uses FBGEMM (Facebook GEneral Matrix Multiplication) rather than MKL-DNN (q_model.qconfig = torch.quantization.get_default_qconfig('fbgemm')). Note that this does not affect VNNI; FBGEMM supports VNNI too.
From the MKL-DNN output of the CNN run, we observed that no VNNI was detected on the CPU. So VNNI is not used in the int8 model, which is why it is slower. Please use ‘lscpu’ to check whether the CPU supports VNNI.
Also, the linear layer is supported by MKL-DNN.

Thanks.

lin__chiungliang · ‎02-11-2020

Hi,

1. For FBGEMM, is there any log to confirm that VNNI or FBGEMM is turned on?

When I tried a machine with AVX512_VNNI, the int8 model is still faster than the fp32 model.

FP32: SGEMM

INT8: no message

2. Last time, I ran the CNN on a machine that does not support VNNI.

I have since changed machines. The CPU information for the new one is below:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping:            7
CPU MHz:             2038.616
BogoMIPS:            5000.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-47
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
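The `Flags` line is long enough that `avx512_vnni` is easy to miss by eye. The sketch below checks for it programmatically, assuming a Linux system where `/proc/cpuinfo` exposes a `flags` line (the same parsing also accepts the `Flags:` line from `lscpu` output). `cpu_flags` and `has_vnni` are hypothetical helper names.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from /proc/cpuinfo-style or lscpu-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "flags":
            return set(value.split())
    return set()


def has_vnni(cpuinfo_text):
    """True if the avx512_vnni flag is listed as a whole word."""
    return "avx512_vnni" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("avx512_vnni present:", has_vnni(f.read()))
    except OSError:
        print("/proc/cpuinfo is not available on this system")
```

Comparing whole flag names rather than searching for a substring avoids false matches against similarly named flags such as `avx512vl`.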

I confirmed that when I ran the CNN on the machine with VNNI, it used AVX512_VNNI:

mkldnn_verbose,info,Intel MKL-DNN v0.21.1 (commit 7d2fd500bc78936d1d648ca713b901012f470dbc)
mkldnn_verbose,info,Detected ISA is Intel AVX-512 with Intel DL Boost

JananiC_Intel · ‎02-11-2020

Hi,

Thanks for the update.

We are not aware of a verbose mode in FBGEMM. FBGEMM is enabled by default, and quantization in PyTorch works with VNNI.

Hope we clarified your queries. Can we close the case?

lin__chiungliang · ‎02-11-2020

Hi,

OK

Thanks for your help.

JananiC_Intel · ‎02-11-2020

Hi,

Thanks for the confirmation. We are closing the case. Feel free to open a new thread if you face any further issues.