How will I get results from neural network model training in batch processing mode and analyze the training? I have created a neural network model and used qsub to submit it for execution in my DevCloud account, but I don't know how to store the training results for analyzing what the model has learned. I need help in this regard.
Hi,
Thank you for posting in Intel Communities.
To submit jobs to Intel DevCloud for oneAPI in batch mode, you need to create a job script containing the following lines, which will be run on the requested node via qsub:
source /opt/intel/oneapi/setvars.sh
source activate <conda_env_name>
python <program_file>.py
The first line sources the Intel oneAPI environment variables, the second activates the conda environment (default or custom) in which the code should run, and the third runs the Python code.
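Since a batch job only captures what your script prints, it helps if the training script also writes its metrics to a file so they can be analyzed after the job ends. Below is a minimal sketch; the loop, loss values, and file name `training_metrics.csv` are illustrative placeholders (in a real script the loss would come from your TensorFlow or PyTorch training loop):

```python
# Sketch of a training script that persists per-epoch metrics so they
# survive the batch job. The loss values below are synthetic stand-ins
# for the values a real training loop would produce.
import csv

def train_and_log(metrics_path="training_metrics.csv", epochs=5):
    history = []
    loss = 1.0
    for epoch in range(1, epochs + 1):
        loss *= 0.5  # placeholder for one real training epoch
        history.append((epoch, loss))
        # This line is captured in the batch output file run.sh.o<job_id>
        print(f"epoch {epoch} loss {loss:.4f}")
    # Also save the metrics to a CSV file for later offline analysis
    with open(metrics_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["epoch", "loss"])
        writer.writerows(history)
    return history

history = train_and_log()
```

The CSV file lands in the job's working directory, so after the job completes you can load it (e.g. with pandas or matplotlib) to inspect how the loss evolved.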
Reference for job script files:
Reference for the official neural network training samples using TensorFlow and PyTorch can be obtained from the below links:
The jobs can be submitted using the qsub command as shown below:
$ qsub -l nodes=1:gpu:ppn=2 -d . run.sh
Once the above command is run, a Job ID will be created. The status of the submitted jobs can be monitored using the qstat command.
After the execution is complete, two files will be generated: run.sh.o<job_id> (output file) and run.sh.e<job_id> (error file). All standard output generated by the script/program is saved to the output file, and all standard error output is saved to the error file. These files can be used to analyze the status of the neural network training or inference submitted in batch mode.
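If your training script prints its metrics in a regular format, you can extract them from the run.sh.o<job_id> output file with a short script. A hedged sketch, assuming the script printed lines of the form "epoch <n> loss <value>" (the sample log text below is synthetic, not from an actual job):

```python
# Sketch: extract per-epoch loss values from a batch output file such as
# run.sh.o<job_id>, assuming lines of the form "epoch <n> loss <value>".
import re

def parse_losses(log_text):
    """Return [(epoch, loss), ...] for every matching line in the log."""
    pattern = re.compile(r"epoch\s+(\d+)\s+loss\s+([0-9.eE+-]+)")
    return [(int(e), float(l)) for e, l in pattern.findall(log_text)]

# Synthetic example of what a captured output file might contain
sample_log = """\
Job started
epoch 1 loss 0.9123
epoch 2 loss 0.6541
epoch 3 loss 0.5010
Job finished
"""

losses = parse_losses(sample_log)
best_epoch, best_loss = min(losses, key=lambda p: p[1])
print(f"best epoch: {best_epoch}, loss: {best_loss}")
```

In practice you would read the real file, e.g. `parse_losses(open("run.sh.o12345").read())`, and then plot or tabulate the extracted values to see what the model has learned.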
Alternatively, if you prefer an interactive environment, you can use the JupyterLab feature provided by Intel DevCloud for oneAPI to perform model training and inference and analyze the results visually. The JupyterLab environment in Intel DevCloud for oneAPI can be accessed using the link:
https://jupyter.oneapi.devcloud.intel.com/
Reference for JupyterLab sample: https://github.com/oneapi-src/oneAPI-samples/tree/master/AI-and-Analytics/Getting-Started-Samples/Intel_Extension_For_TensorFlow_GettingStarted
If this resolves your issue, please accept this as a solution; this will help others with a similar issue. Do get back to us with more information if we misunderstood your query or if you are still facing issues with batch submission.
Thanks.
Regards,
Sreedevi
Hi,
We have not heard back from you. Could you please give us an update? Is your issue resolved?
Regards,
Sreedevi
Thanks for providing the help. Yes, I got it.
Hi,
Glad to know that your issue is resolved. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.
Regards,
Sreedevi