Hi,
I tried to convert a floating-point model to an int8 model using PyTorch.
The results are shown below:
.local/lib/python3.6/site-packages/torch/quantization/observer.py:172: UserWarning: Must run observer before calling calculate_qparams. Returning default scale and zero point.
---------> Run FP model
Sequential(
  (0): Linear(in_features=64, out_features=1000, bias=True)
  (1): ReLU()
  (2): Linear(in_features=1000, out_features=5, bias=True)
  (3): ReLU()
)
Size (MB): 0.280785
elapsed time 0.4727492332458496
---------> Run Int8 model
Sequential(
  (0): QuantizedLinear(
    in_features=64, out_features=1000, scale=1.0, zero_point=0
    (_packed_params): LinearPackedParams()
  )
  (1): QuantizedReLU()
  (2): QuantizedLinear(
    in_features=1000, out_features=5, scale=1.0, zero_point=0
    (_packed_params): LinearPackedParams()
  )
  (3): QuantizedReLU()
)
Size (MB): 0.074412
elapsed time 0.5817153453826904
I built a two-layer Linear model, applied static quantization, and then ran inference with both models.
The parameter size is about 1/4 of the original, which matches my expectation.
But the run time is a little longer than with the FP32 model.
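For reference, the "Size (MB)" and "elapsed time" numbers above come from two small helpers; this is only a minimal sketch of how I define them (modeled on the model-size and timing helpers in the PyTorch quantization tutorial; the temp-file name and the no_grad timing loop are just my choices):

import os
import time
import torch

def print_size_of_model(model):
    # Serialize the state_dict to a temporary file and report its size on disk.
    torch.save(model.state_dict(), "temp.p")
    print("Size (MB):", os.path.getsize("temp.p") / 1e6)
    os.remove("temp.p")

def time_model_evaluation(model, x, num_iter):
    # Time num_iter forward passes with autograd disabled.
    start = time.time()
    with torch.no_grad():
        for _ in range(num_iter):
            model(x)
    print("elapsed time", time.time() - start)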
Here is my torch environment:
PyTorch built with:
  - GCC 7.3
  - Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.21.1 (Git Hash 7d2fd500bc78936d1d648ca713b901012f470dbc)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
I found that when I run the floating-point version, SGEMM is used to accelerate the computation.
But when I run the int8 version, no acceleration library seems to be used.
How can I run an optimized int8 model?
Thanks a lot.
Hi,
I wrote the code following the steps in the documentation provided by PyTorch:
import copy
import torch

# params
num_iter = 10000
batch_size = 10
L1, L2, L3 = 64, 1000, 5
...

# setup
model = torch.nn.Sequential(
    torch.nn.Linear(L1, L2),
    torch.nn.ReLU(),
    torch.nn.Linear(L2, L3),
    torch.nn.ReLU(),
)
model.eval()

# Quantize
q_x = torch.quantize_per_tensor(x, scale=1e-3, zero_point=128, dtype=torch.quint8)
q_model = copy.deepcopy(model)
q_model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
torch.quantization.prepare(q_model, inplace=True)
torch.quantization.convert(q_model, inplace=True)
q_model.eval()

#
print('---------> Run FP model')
print(model)
print_size_of_model(model)
time_model_evaluation(model, x, num_iter)

#
print('---------> Run Int8 model')
print(q_model)
print_size_of_model(q_model)
time_model_evaluation(q_model, q_x, num_iter)
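Looking at the UserWarning again, I suspect the observers never see any data: static quantization normally needs a calibration pass between prepare and convert so that real scale/zero_point values are computed instead of the defaults (scale=1.0, zero_point=0 in the printout above). A sketch of the step I think is missing, where calib_batches is just a placeholder for representative inputs:

torch.quantization.prepare(q_model, inplace=True)

# Calibration: run representative inputs through the observer-instrumented model
# so that per-layer scale and zero_point can be estimated.
with torch.no_grad():
    for calib_x in calib_batches:   # calib_batches: placeholder for real data
        q_model(calib_x)

torch.quantization.convert(q_model, inplace=True)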
Hi,
Thanks for the update. Try out the commands below so that you can see the execution of Intel MKL-DNN primitives and collect basic statistics such as execution time and primitive parameters.
export MKLDNN_VERBOSE=1   (or) export DNNL_VERBOSE=1
export MKL_VERBOSE=1
You can refer to the link below:
https://software.intel.com/en-us/forums/intel-optimized-ai-frameworks/topic/843478
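If it is more convenient to set these from inside the script, a sketch that sets the same variables from Python (assuming the libraries read the environment at load time, which we have not verified here; exporting them in the shell as above is the safer option):

import os

# Enable verbose output for MKL-DNN and MKL before torch is imported.
os.environ["MKLDNN_VERBOSE"] = "1"   # or "DNNL_VERBOSE" on newer builds
os.environ["MKL_VERBOSE"] = "1"

import torch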
Hi,
I ran a CNN in the same environment and got the following messages:
mkldnn_verbose,info,Intel MKL-DNN v0.21.1 (commit 7d2fd500bc78936d1d648ca713b901012f470dbc)
mkldnn_verbose,info,Detected ISA is Intel AVX-512 with AVX512BW, AVX512VL, and AVX512DQ extensions
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_nchw out:f32_nChw16c,num:1,64x1000x128x128,259.319
mkldnn_verbose,exec,reorder,simple:any,undef,in:f32_oihw out:f32_OIhw16i16o,num:1,100x1000x3x3,0.246094
It seems that
1) MKL-DNN is turned on
2) MKL-DNN supports nn.Conv2d, but not nn.Linear
Am I right?
Thanks in advance.
Hi,
Thanks for the immediate response.
Yes, you are right: MKL-DNN is turned on and supports the convolutional layer. Regarding acceleration of the int8 model and MKL-DNN support for the linear layer, we will contact the SME and get back to you soon.
Hi,
- For the int8 model, no MKL-DNN log output is displayed because you are using Facebook GEneral Matrix Multiplication (FBGEMM) for your model quantization, not MKL-DNN (q_model.qconfig = torch.quantization.get_default_qconfig('fbgemm')). Note that this does not affect VNNI; the FBGEMM solution supports VNNI too.
- From the MKL-DNN output of the CNN, we observed that no VNNI is detected on the CPU, so no VNNI is used in the int8 model. Hence your int8 model is slower. Please use 'lscpu' to check whether the CPU supports VNNI (a quick check is also sketched below).
- Also, the linear layer is supported by MKL-DNN.
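A quick programmatic check for VNNI (Linux only; it simply looks for the same avx512_vnni flag that lscpu reports):

# Equivalent to: lscpu | grep -o avx512_vnni
with open("/proc/cpuinfo") as f:
    print("avx512_vnni supported:", "avx512_vnni" in f.read())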
Thanks.
Hi,
1. For FBGEMM, is there any log to confirm that VNNI or FBGEMM is turned on? (The only check I could find is sketched at the end of this post.)
When I tried a machine with AVX512_VNNI, the int8 model is now faster than the fp32 model, but the verbose output only shows:
FP32: SGEMM
INT8: no message
2. Last time, I ran the CNN on a machine that does not support VNNI.
I have since changed machines; here is the CPU information:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Stepping:            7
CPU MHz:             2038.616
BogoMIPS:            5000.00
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-47
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
I confirmed that when I ran the CNN on the machine with VNNI, AVX512_VNNI was used:
mkldnn_verbose,info,Intel MKL-DNN v0.21.1 (commit 7d2fd500bc78936d1d648ca713b901012f470dbc)
mkldnn_verbose,info,Detected ISA is Intel AVX-512 with Intel DL Boost
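Regarding question 1 above, the only backend-level confirmation I could find from Python is the quantized engine setting; a minimal check (this shows which backend the quantized ops dispatch to, but it is not a per-op log like MKL-DNN's verbose output):

import torch

# Backends compiled into this build, and the one currently used by quantized ops.
print(torch.backends.quantized.supported_engines)  # e.g. ['none', 'fbgemm', 'qnnpack']
print(torch.backends.quantized.engine)             # e.g. 'fbgemm'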
