Solved: Issue with TLS2.0 Tools

DarkHorse · ‎03-01-2021

Hello Expert,

I would like to confirm whether this error message is related to a memory or system limitation when the customer uses the TLS 2.0 tools. They managed to train a model with TLS 2.0 on an input data set of 50 images, but ran into problems as the input data set grew.

The hardware they are using is an Intel Xeon server with 32 GB of RAM (no GPU).

Hard Disk : 500GB

Core: 16

Processor: Intel Xeon Silver 4110

OS: Ubuntu 18.04.5 LTS

Latest: EII V2.4

Config page :

Training steps : 5000

Learning Rate : 0.001

Batch Size : 1

Momentum optimizer

Value : 0.9

Min dimension [FRCN Detector]/height [SSD detector] = 150

Max dimension [FRCN Detector]/width [SSD detector] = 300

Model : ssd_mobilenet_v1

Datasets : No of images = 1000

Detection : object detection

Annotation : Box with 8 points

Sample object labelled as shown in attachment "Object Labeled.jpeg"

Success case: We labelled 50 such images, generated the model, and ran an inference check that gave correct results, all through the TLS web portal.

Failure case: Following the above check, we labelled 1400 images. During training we encountered a "Training Error, Please Check Administrator" error; a screenshot of the error message is attached as "Error_Message.jpeg". I have also attached the log file "dic2.log", which is the output of docker logs <container id of tlscore:2.0>.

The error messages are :

[2021-02-24 15:19:11,612: ERROR/MainProcess] Process 'ForkPoolWorker-25' pid:13727 exited with 'signal 9 (SIGKILL)'

[2021-02-24 15:19:22,660: INFO/MainProcess] Terminating cbdcdb6b-f57a-47c3-a9b5-e69408f6c76e (Signals.SIGTERM)
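A worker exiting with SIGKILL is what the Linux out-of-memory killer typically produces, although the log line itself does not say who sent the signal. As a minimal sketch for pulling such lines out of a longer log, assuming the worker-exit format quoted above (the helper name `find_killed_workers` is hypothetical):

```python
import re

# Pattern for the worker-exit line quoted above (format taken from the log excerpt).
WORKER_EXIT = re.compile(
    r"Process '(?P<worker>[^']+)' pid:(?P<pid>\d+) exited with "
    r"'signal (?P<signum>\d+) \((?P<signame>\w+)\)'"
)


def find_killed_workers(log_text):
    """Return (worker, pid, signal name) for every worker-exit line found."""
    return [
        (m.group("worker"), int(m.group("pid")), m.group("signame"))
        for m in WORKER_EXIT.finditer(log_text)
    ]


# The two lines from the log excerpt; only the first is a worker exit.
sample = (
    "[2021-02-24 15:19:11,612: ERROR/MainProcess] Process 'ForkPoolWorker-25' "
    "pid:13727 exited with 'signal 9 (SIGKILL)'\n"
    "[2021-02-24 15:19:22,660: INFO/MainProcess] Terminating "
    "cbdcdb6b-f57a-47c3-a9b5-e69408f6c76e (Signals.SIGTERM)\n"
)
print(find_killed_workers(sample))
```

Running this over the full output of docker logs would show how often, and for which workers, the kill occurs.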

Please take a look at the attached files:

Dic2.log: Error messages from the run with the 1400-image data set

Success.log: Log for the 50-image input data set, which was trained successfully with TLS 2.0

I then asked the customer to increase the input data set incrementally to find the threshold at which this error message appears, and they saw the error again with an input data set of 135 images.

Please refer to the attached files:

Success_130.log: Log for input data set of 130 images

Failure_135.log: Error log for input data set of 135 images

I did a quick check, and this error message appears to be related to a memory / hardware limitation, so my recommendation is to reduce the size of the input data set. Please let me know whether this recommendation to the customer is correct before I proceed.

Thanks,

Allen

WengWai_C_Intel · ‎03-02-2021

@DarkHorse

Yes, you are right that the failure is related to the hardware memory configuration. Since you are already running with batch size 1, can you try reducing the max dimension from 300 to 150 for the training?
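As a rough illustration of why a smaller max dimension can help, the sketch below compares the raw size of one input image at the two settings. It assumes 3-channel float32 input and takes the SSD height/width as 150 x 300 before and 150 x 150 after; `input_bytes` is a hypothetical helper, and real training memory also includes activations, weights and the data pipeline, which this does not model.

```python
def input_bytes(height, width, channels=3, bytes_per_value=4):
    """Bytes for one image tensor, assuming float32 values and RGB channels."""
    return height * width * channels * bytes_per_value


before = input_bytes(150, 300)  # current config: height 150, width 300
after = input_bytes(150, 150)   # suggested config: max dimension reduced to 150

# Halving the width halves the number of input values per image.
print(before, after, before / after)
```

Convolutional feature maps generally shrink along with the input, so the saving is not limited to the input tensor, but the exact figure depends on the model and framework.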

By the way, a few points about the info and log files provided earlier need clarification. For SSDMobileV1 training, the annotated data should be bounding boxes, but your post mentions 8 points. So first, please make sure the dataset is annotated with bounding boxes, which have only 4 points. Secondly, the error message in failure_135.log shows a segmentation training error, not an SSDMobileV1 training error. We may need your help to verify this information so that we can better understand the issue you are facing.
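To illustrate the difference between an 8-point outline and a bounding box, here is a minimal sketch that collapses a list of polygon vertices into an axis-aligned box. The input format (a list of (x, y) tuples) and the function name `polygon_to_bbox` are assumptions for illustration only; the way TLS stores annotations is not shown in this thread, and relabelling in the web UI may still be required.

```python
def polygon_to_bbox(points):
    """Collapse polygon vertices [(x, y), ...] to (xmin, ymin, xmax, ymax)."""
    if not points:
        raise ValueError("polygon has no points")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical 8-point outline of an object.
octagon = [(10, 20), (30, 10), (50, 10), (70, 20),
           (70, 40), (50, 50), (30, 50), (10, 40)]
print(polygon_to_bbox(octagon))
```

The resulting box is the smallest axis-aligned rectangle that contains all eight points, described by its four corners.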

thanks!

DarkHorse · ‎03-04-2021

Hello @WengWai_C_Intel ,

The customer has redone the whole setup following the recommendation, and below is what they observed:

1) They are able to train with multiple combinations of image counts (100, 200, 500) and steps (100, 500, 1000).

2) When the number of steps was increased to 2000, they got the same segmentation error; please find the error messages attached.

3) They also noticed that the Redis logs keep complaining about a read-only directory, and hence BGSAVE is failing.

12:M 04 Mar 2021 06:04:16.061 * Background saving started by pid 21941

21941:C 04 Mar 2021 06:04:16.062 # Failed opening the RDB file dump.rdb (in server root dir /) for saving: Read-only file system

12:M 04 Mar 2021 06:04:16.162 # Background saving error

4) I looked in the user guide but could not find how to transfer the labelled images from one TLS setup to another.
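For item 3, the Redis lines show the snapshot file dump.rdb being written to the server root directory /, which is reported as read-only. A minimal sketch for extracting the file, directory and reason from such lines, assuming the message format quoted above (`rdb_failures` is a hypothetical helper name):

```python
import re

# Pattern for the RDB save failure line quoted in item 3.
RDB_FAILURE = re.compile(
    r"Failed opening the RDB file (?P<file>\S+) "
    r"\(in server root dir (?P<dir>\S+)\) for saving: (?P<reason>.+)"
)


def rdb_failures(log_text):
    """Return (file, directory, reason) for each failed RDB save attempt."""
    return [
        (m["file"], m["dir"], m["reason"].strip())
        for m in RDB_FAILURE.finditer(log_text)
    ]


redis_log = (
    "12:M 04 Mar 2021 06:04:16.061 * Background saving started by pid 21941\n"
    "21941:C 04 Mar 2021 06:04:16.062 # Failed opening the RDB file dump.rdb "
    "(in server root dir /) for saving: Read-only file system\n"
    "12:M 04 Mar 2021 06:04:16.162 # Background saving error\n"
)
print(rdb_failures(redis_log))
```

Whether this Redis failure is connected to the training error is not established in the logs shared here.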

Thanks.

WengWai_C_Intel · ‎03-09-2021

@DarkHorse

The error with the bigger dataset is related to memory resources, so increasing the system memory should help. In the current version, the annotated data is stored in the database container and is not available for external transfer. However, if you want multiple users to label the same dataset, several users can log in to the same TLS system through the web UI and each do their labeling on the same dataset. When a user refreshes the web UI, the data labeled by other users will be reflected there as well. Hope this info helps.
