I'm testing how much faster inference can get.
I've already tested compression algorithms using Intel NNCF.
When I checked PyTorch-related information on the web, I found that preprocessing the input can also make inference a bit faster. Two examples are introduced there.
1). max_length:
Limits the input length to max_length to make the input data lighter.
from transformers import BertTokenizer

MAX_LENGTH = 512
tokenizer = BertTokenizer.from_pretrained("hoge_pretrain")
data = tokenizer.encode_plus(
    TEXT,
    add_special_tokens=True,
    max_length=MAX_LENGTH,
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)
2). do_not_pad
This method can be used when inferring with batch_size == 1. Normally, padding of the input data is required for batch inference, but with batch_size == 1 inference can run without padding.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("hoge_pretrain")
data = tokenizer.encode_plus(
    TEXT,
    add_special_tokens=True,
    max_length=512,
    padding="do_not_pad",
    truncation=True,
    return_tensors="pt",
)
These methods are for text/language-related inference.
Do you know whether any similar preprocessing technique exists for inference in image classification tasks?
Best regards!
Hi Timosy,
Thanks for sharing this information with us.
For preprocessing in OpenVINO™, we usually resize the input image, convert the colour format, convert U8 to FP32 precision, and change the layout. You can refer to Optimize Preprocessing for more details.
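For illustration, a minimal sketch of those typical steps with the PrePostProcessor API could look like the one below; the model path, colour formats, and device name are placeholders, so please adapt them to your own model.

from openvino.runtime import Core, Layout, Type
from openvino.preprocess import PrePostProcessor, ResizeAlgorithm, ColorFormat

core = Core()
model = core.read_model("model.xml")  # placeholder IR path

ppp = PrePostProcessor(model)
# Describe the tensor the application will actually provide:
# U8 data, BGR colour order, NHWC layout, arbitrary spatial size.
ppp.input().tensor() \
    .set_element_type(Type.u8) \
    .set_color_format(ColorFormat.BGR) \
    .set_spatial_dynamic_shape() \
    .set_layout(Layout('NHWC'))
# Tell OpenVINO what the model itself expects.
ppp.input().model().set_layout(Layout('NCHW'))
# Preprocessing steps: resize to the model's spatial size, convert colour,
# and convert U8 to FP32; the NHWC -> NCHW layout change is added implicitly.
ppp.input().preprocess() \
    .resize(ResizeAlgorithm.RESIZE_LINEAR) \
    .convert_color(ColorFormat.RGB) \
    .convert_element_type(Type.f32)
model = ppp.build()

compiled_model = core.compile_model(model, 'CPU')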
In addition, using model caching can help speed things up by minimizing the model's read and load time, because the application can load the saved file and does not need to perform the preprocessing steps again. You can refer to Use Case - Integrate and Save Preprocessing Steps Into IR for more information.
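As a hedged example, model caching can be switched on like this (the cache directory, IR path, and device name are assumptions):

from openvino.runtime import Core

core = Core()
# Point the runtime at a cache folder: the first compile_model() call saves
# the compiled blob there, and later runs load it instead of recompiling.
core.set_property({"CACHE_DIR": "./model_cache"})

model = core.read_model("model.xml")  # placeholder IR path
# Caching takes effect on devices that support compiled-model import/export.
compiled_model = core.compile_model(model, "CPU")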
Regards,
Peh
Thanks for the useful information above.
I have another question. I'm currently testing an INT8 model compressed with NNCF. Is it possible to feed integer data (an image or whatever) to the compressed model instead of float data? If that is possible, inference should get even faster, though accuracy might drop.
Does a function to handle such a conversion exist in the OpenVINO framework?
I tried the preprocessing below with the INT8 model.
# https://docs.openvino.ai/2022.1/openvino_docs_OV_UG_Preprocessing_Overview.html
from openvino.preprocess import PrePostProcessor
from openvino.runtime import Layout, Type

ppp = PrePostProcessor(ir_model)
# No index/name is needed if the model has one input.
# N=1, C=3, H=224, W=224
ppp.input().model().set_layout(Layout('NCHW'))
# Mean/scale normalization left out; this is just a speed test.
# ppp.input().preprocess() \
#     .mean([0.5029, 0.4375, 0.3465]) \
#     .scale([0.2818, 0.2659, 0.2629])
# First define the data type of the input tensor.
ppp.input().tensor().set_element_type(Type.u8)
# Then define the preprocessing step.
# ppp.input().preprocess().convert_element_type(Type.f32)
ppp.input().preprocess().convert_element_type(Type.u8)
# Model expects shape {1, 3, 480, 640}; convert the NHWC layout to NCHW.
ppp.input().preprocess().convert_layout([0, 3, 1, 2])
print(f'Dump preprocessor: {ppp}')
# Apply the preprocessing steps to the model.
ir_model = ppp.build()
The dump I got is as follows:
Dump preprocessor: Input "input.0":
User's input tensor: {1,2048,2048,3}, [N,H,W,C], u8
Model's expected tensor: {1,3,2048,2048}, [N,C,H,W], f32
Pre-processing steps (2):
convert type (u8): ({1,2048,2048,3}, [N,H,W,C], u8) -> ({1,2048,2048,3}, [N,H,W,C], u8)
convert layout (0,3,1,2): ({1,2048,2048,3}, [N,H,W,C], u8) -> ({1,3,2048,2048}, [N,C,H,W], u8)
Implicit pre-processing steps (1):
convert type (f32): ({1,3,2048,2048}, [N,C,H,W], u8) -> ({1,3,2048,2048}, [N,C,H,W], f32)
The inference speed did not improve.
Am I mistaken about something?
Best regards
Hi Timosy,
First and foremost, we don't have this conversion option in the Model Optimizer.
Next, preprocessing in OpenVINO™ is meant to make the input data fit the neural network model's input tensor exactly; it is not used to increase the inference speed. It is recommended to use model caching if inference speed is critical for you.
Regards,
Peh
Thanks for your comments. I'd like to confirm my understanding.
Simply speaking, there is currently no support for feeding an "INT" tensor into an INT8 model to make inference faster than with a general "Float" tensor input.
Is this correct?
Hi Timosy,
Yes, you are correct. This GitHub discussion might be useful.
During quantization, the process inserts an operation called FakeQuantize into the model graph.
During runtime, these FakeQuantize layers convert the input to the convolution layer into INT8. For example, if the next convolutional layer has INT8 weights, then the input to that layer will also be converted to INT8. The precision further on, however, depends on the next operation: if the next operation requires a full-precision format, then the inputs will be reconverted to full precision during runtime.
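If you would like to check this on your own model, a small sketch like the one below (the quantized IR path is a placeholder) lists the FakeQuantize operations that were inserted:

from openvino.runtime import Core

core = Core()
model = core.read_model("quantized_model.xml")  # placeholder path to the quantized IR

# Each FakeQuantize node marks a point where activations or weights are
# quantized to low precision at runtime.
fq_ops = [op for op in model.get_ops() if op.get_type_name() == "FakeQuantize"]
print(f"FakeQuantize operations found: {len(fq_ops)}")
for op in fq_ops[:5]:
    print(op.get_friendly_name())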
Regards,
Peh
Hi Timosy,
This thread will no longer be monitored since we have provided answers and suggestions. If you need any additional information from Intel, please submit a new question.
Regards,
Peh