I was wondering if anybody has a suggestion about this. I have a model in which the first few layers are mapped to the CPU and the last layers are mapped to an NCS2 device.
The pipeline performs inference with the CPU part using 8-bit precision and the NCS2 part using FP16 precision.
How could I dequantize the 8-bit values to feed them to the floating-point network at runtime?
Or, alternatively, how could I quantize the FP16 values from one of the models to feed them into an 8-bit model at runtime?
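In case it helps frame the question: a runtime precision switch like this usually amounts to affine (de)quantization. Below is a minimal NumPy sketch of that idea; the `scale` and `zero_point` values are purely illustrative, not taken from any real model's quantization parameters.

```python
import numpy as np

# Hypothetical affine (de)quantization sketch; scale and zero_point
# are illustrative values, not from a real quantized model.
def dequantize(q, scale, zero_point):
    """Map int8 values back to FP16: x = scale * (q - zero_point)."""
    return (q.astype(np.float16) - np.float16(zero_point)) * np.float16(scale)

def quantize(x, scale, zero_point):
    """Map floating-point values to int8: q = round(x / scale) + zero_point."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

# Round-trip: int8 activations -> FP16 tensor -> back to int8
q = np.array([-128, 0, 127], dtype=np.int8)
x = dequantize(q, scale=0.05, zero_point=0)
q2 = quantize(x, scale=0.05, zero_point=0)
```

The real question is whether the toolkit exposes the per-tensor scale/zero-point so this can be done between the two sub-models.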
I am not sure whether any software implementation is available to perform these precision switches during inference for models that are not mapped to the same device.
Thanks for any ideas you might have. Maybe this is not possible after all.
Thanks for reaching out.
You can check the documentation about the Heterogeneous plugin; it lets you run inference in the precision required for each layer in your model by setting a primary device and a fallback device as backup.
For example, if you use FP16 IR files, you can set the primary device to MYRIAD (for the NCS2); if you then use the CPU as the fallback device, it will automatically convert FP16 to FP32 when necessary for the CPU.
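A rough sketch of what that looks like with the Inference Engine Python API (the model filenames are placeholders, and the `try`/`except` guard is only there so the sketch does not require OpenVINO to be installed):

```python
# Hypothetical HETERO sketch: primary device MYRIAD (NCS2) with CPU fallback.
try:
    from openvino.inference_engine import IECore
except ImportError:
    IECore = None  # OpenVINO not installed; sketch only

# Fallback order: layers unsupported on MYRIAD run on the CPU instead.
DEVICE = "HETERO:MYRIAD,CPU"

def load_hetero(model_xml, weights_bin):
    """Load an FP16 IR and let the HETERO plugin split it across devices."""
    ie = IECore()
    net = ie.read_network(model=model_xml, weights=weights_bin)
    return ie.load_network(network=net, device_name=DEVICE)

# exec_net = load_hetero("model.xml", "model.bin")  # placeholder filenames
```

The device order in the `HETERO:` string matters: the first device is tried for each layer, and later devices are the fallback.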
If you have additional questions, let us know.