Intel® Distribution of OpenVINO™ Toolkit
Community assistance about the Intel® Distribution of OpenVINO™ toolkit, OpenCV, and all computer vision-related topics on Intel® platforms.

The inference result is not stable: sometimes it is correct, sometimes it becomes NaN

longfei98
Beginner

Hello, I am currently optimizing my own model with the OpenVINO Python API. When testing inference on a large amount of data, I find that some inputs produce results that are all NaN. But if I run inference on those same inputs again, the results are correct. In other words, the results are not stable.

Some tests I ran:

1. Model A: I first hit this issue with model A. It is a generic UNet with conv and deconv layers, and to improve performance I pruned some channels with the network slimming method. I then converted the PyTorch model to IR with FP16 and tested it (a rough sketch of the per-sample check is at the end of this post), which is where I found this issue. At first I thought it might be caused by the data type, but after changing to FP32 the same issue happened. Then I thought it might be caused by the pruning, so I ran the next test.

2. Model B: this is the same model without pruning. Similarly, I converted it to IR with FP16 and ran the tests. The issue seems harder to reproduce than with model A, but after testing a large amount of data it appeared again.

Now I am confused. Could anyone give any suggestions or ideas? Thank you very much.

 

 

OpenVINO version: 2020
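
For context, a rough sketch of the kind of per-sample NaN check involved, using the 2020-era IECore Python API (the IR paths, conversion commands, and input shape below are placeholders, not the actual model):

import numpy as np
from openvino.inference_engine import IECore

# Placeholder IR, e.g. produced via:
#   torch.onnx.export(model, dummy_input, "model.onnx")
#   python mo.py --input_model model.onnx --data_type FP16
MODEL_XML = "model.xml"
MODEL_BIN = "model.bin"

ie = IECore()
net = ie.read_network(model=MODEL_XML, weights=MODEL_BIN)
exec_net = ie.load_network(network=net, device_name="CPU")

input_name = next(iter(net.inputs))    # net.input_info on newer releases
output_name = next(iter(net.outputs))

# Placeholder input: a single sample with an assumed NCHW layout.
sample = np.random.rand(1, 1, 128, 128).astype(np.float32)

result = exec_net.infer(inputs={input_name: sample})[output_name]
if np.isnan(result).any():
    print("NaN detected in the output for this sample")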

 

 

longfei98
Beginner

Can anyone help? Thank you very much.

Iffa_Intel
Moderator

Greetings,


Have you tried cutting the dataset down to see whether the large dataset is causing the issue?

You may refer here:

https://docs.openvinotoolkit.org/2020.1/_docs_Workbench_DG_Download_and_Cut_Datasets.html



Sincerely,

Iffa




longfei98
Beginner

Thanks for your reply. 

I infer each data sample one by one. After the prediction for one sample is done, the process is killed and restarted to predict the next sample (roughly along the lines of the sketch below). So although I test a large amount of data, I do not predict everything at the same time.

Also, the dataset I used is local data, not an open-source dataset.
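
A sketch of that per-sample isolation (the worker script name and data layout here are placeholders, not the actual pipeline):

import subprocess
from pathlib import Path

DATA_DIR = Path("dataset")   # placeholder folder of single-sample files
WORKER = "infer_one.py"      # hypothetical script: loads the IR, infers one file, prints "NAN" if any NaN is found

for sample_path in sorted(DATA_DIR.glob("*.npy")):
    # Each sample gets a fresh Python process, so no state carries over between runs.
    proc = subprocess.run(["python", WORKER, str(sample_path)],
                          capture_output=True, text=True)
    if "NAN" in proc.stdout:
        print(f"NaN result for {sample_path}")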

longfei98
Beginner

I am wondering whether it is caused by memory overflow.

I use OpenVINO on CPU for inference. Is it possible to get a NaN result if the CPU memory usage is full?

I have run some tests to verify this assumption, but have not reproduced it yet.
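
One way to check this assumption is to log memory usage around each inference, for example with psutil (a sketch; the IR paths and the input are placeholders):

import numpy as np
import psutil
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")  # placeholder IR paths
exec_net = ie.load_network(network=net, device_name="CPU")
input_name = next(iter(net.inputs))
output_name = next(iter(net.outputs))

proc = psutil.Process()
sample = np.random.rand(1, 1, 128, 128).astype(np.float32)  # placeholder input

# Compare memory usage around the inference with whether the output contains NaN.
rss_before_mb = proc.memory_info().rss / 1e6
result = exec_net.infer(inputs={input_name: sample})[output_name]
rss_after_mb = proc.memory_info().rss / 1e6

print(f"RSS {rss_before_mb:.0f} -> {rss_after_mb:.0f} MB, "
      f"system memory used: {psutil.virtual_memory().percent}%, "
      f"NaN in output: {bool(np.isnan(result).any())}")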

Iffa_Intel
Moderator

Generally, NaN ("Not a Number") indicates an exception that usually occurs when an expression results in a number that cannot be represented.


Could you share your model with me to try it out, if possible?

Also, could you clarify which model, topology, and OpenVINO demo you used to run this inference?



Sincerely,

Iffa


longfei98
Beginner

Sorry for my late reply.

This is a commercially used model, so I'm afraid I cannot share it with you.

The backbone I used is a UNet structure from nnUNet (https://github.com/MIC-DKFZ/nnUNet) for segmentation. There are some self-defined layers, and I am not sure whether they are the cause.

I will try to create a similar model with a public dataset and reproduce the issue; that model can then be shared with you.

By the way, has anyone met a similar issue before? My colleague also hits this issue with a different model. It happens randomly and is hard to reproduce. Really strange.

Iffa_Intel
Moderator

Generally,


If you are using the supported models and feed them the correct inputs, these kinds of issues should not appear.

Let's say your program expects to receive numbers as input but receives strings instead.

That would definitely produce an error.


You may refer here: https://docs.openvinotoolkit.org/latest/openvino_docs_MO_DG_prepare_model_convert_model_Convert_Model_From_TensorFlow.html


This documentation lists the model types and the officially validated supported topologies.

This is really important.
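
A quick sanity check along these lines is to validate the input array against the shape and type the IR expects before calling infer (a sketch; the file paths are placeholders):

import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")  # placeholder IR paths
input_name = next(iter(net.inputs))
expected_shape = net.inputs[input_name].shape

sample = np.load("sample.npy")  # placeholder input file

# The input must be numeric, finite, and match the shape the IR expects.
assert list(sample.shape) == list(expected_shape), \
    f"shape mismatch: got {sample.shape}, expected {expected_shape}"
assert np.issubdtype(sample.dtype, np.floating), \
    f"input must be a float array, got dtype {sample.dtype}"
assert np.isfinite(sample).all(), "input already contains NaN/Inf values"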


Sincerely,

Iffa



Iffa_Intel
Moderator

Greetings,


Intel will no longer monitor this thread since we have provided a solution. If you need any additional information from Intel, please submit a new question.


Sincerely,

Iffa

