Solved: Error while executing python code: killed python file

Vaneet_Aggarwal · 07-26-2021

I'm trying to run a Python file on DevCloud. The job script job.sh is as follows:

#!/bin/bash
source /opt/intel/inteloneapi/setvars.sh  > /dev/null 2>&1
python master.py

I submit it with the following command from a Mac terminal:

qsub -l nodes=1:xeon:batch:ppn=2 -d . job.sh

The job ran for about 3 hours and produced two output files: job.sh.e934264 and job.sh.o934264.

The job.sh.e934264 file is as follows:

2021-07-26 03:49:45.014693: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /glob/development-tools/versions/oneapi/2021.3/inteloneapi/vpl/2021.4.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/tbb/2021.3.0/env/../lib/intel64/gcc4.8:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/rkcommon/1.6.1/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ospray_studio/0.7.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ospray/2.6.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/openvkl/0.13.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/oidn/1.4.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/mpi/2021.2.0//libfabric/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/mpi/2021.2.0//lib/release:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/mpi/2021.2.0//lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/mkl/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/itac/2021.3.0/slib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ipp/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ippcp/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ipp/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/embree/3.13.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/dnnl/2021.3.0/cpu_dpcpp_gpu_dpcpp/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/debugger/10.1.2/gdb/intel64/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/debugger/10.1.2/libipt/intel64/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/debugger/10.1.2/dep/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/dal/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib/x64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib/emu:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib/oclfpga/host/linux64/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib/oclfpga/linux64/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/compiler/lib/intel64_lin:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ccl/2021.3.0/lib/cpu_gpu_dpcpp
2021-07-26 03:49:45.014777: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-07-26 03:49:50.062319: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /glob/development-tools/versions/oneapi/2021.3/inteloneapi/vpl/2021.4.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/tbb/2021.3.0/env/../lib/intel64/gcc4.8:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/rkcommon/1.6.1/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ospray_studio/0.7.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ospray/2.6.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/openvkl/0.13.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/oidn/1.4.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/mpi/2021.2.0//libfabric/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/mpi/2021.2.0//lib/release:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/mpi/2021.2.0//lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/mkl/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/itac/2021.3.0/slib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ipp/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ippcp/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ipp/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/embree/3.13.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/dnnl/2021.3.0/cpu_dpcpp_gpu_dpcpp/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/debugger/10.1.2/gdb/intel64/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/debugger/10.1.2/libipt/intel64/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/debugger/10.1.2/dep/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/dal/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib/x64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib/emu:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib/oclfpga/host/linux64/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib/oclfpga/linux64/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/compiler/lib/intel64_lin:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ccl/2021.3.0/lib/cpu_gpu_dpcpp
2021-07-26 03:49:50.062403: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-07-26 03:49:50.062449: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (s001-n061): /proc/driver/nvidia/version does not exist
2021-07-26 03:49:50.062948: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-07-26 03:52:31.660446: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-07-26 03:52:31.679568: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 3400000000 Hz
/var/spool/torque/mom_priv/jobs/934264.v-qsvr-1.aidevcloud.SC: line 4: 110188 Killed                  python master.py

job.sh.o934264 is:


########################################################################
#      Date:           Mon 26 Jul 2021 03:49:38 AM PDT
#    Job ID:           934264.v-qsvr-1.aidevcloud
#      User:           u65358
# Resources:           neednodes=1:xeon:batch:ppn=2,nodes=1:xeon:batch:ppn=2,walltime=06:00:00
########################################################################


########################################################################
# End of output for job 934264.v-qsvr-1.aidevcloud
# Date: Mon 26 Jul 2021 06:52:21 AM PDT
########################################################################

The expected output wasn't produced; the job was killed instead. Can someone please help me with this? Thanks
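The last line of job.sh.e934264 is the shell reporting that python master.py was ended by a kill signal (SIGKILL), not by a Python exception, which is why there is no traceback. The thread does not establish what sent the signal. As a minimal sketch, assuming a POSIX system, the same condition can be detected from Python by checking a child's return code; the helper names describe_exit and run_and_describe are made up for this illustration.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Translate a subprocess return code into a short human-readable reason."""
    if returncode < 0:
        # On POSIX, a negative return code means the child was ended by that signal.
        return f"terminated by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


def run_and_describe(argv):
    """Run a command, wait for it, and report how it ended."""
    completed = subprocess.run(argv)
    return describe_exit(completed.returncode)


if __name__ == "__main__":
    # Stand-in child that sends SIGKILL to itself (for illustration only).
    child = [sys.executable, "-c",
             "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
    print(run_and_describe(child))
```

Calling run_and_describe([sys.executable, "master.py"]) from a small wrapper script would record the same information for the real job.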

AbhijeetJ_Intel · 07-27-2021

Hi,

Thank you for reaching out.

To analyze the issue on our end, could you please share the master.py file, if that is OK, or a minimal reproducer?

In the meantime, could you also try running the job with an increased walltime?

You can increase the walltime as follows:

Syntax:

-l walltime=<time>

Requests the maximum wall clock time the job may run. The time format is either seconds or [[hh:]mm:]ss. By default, the maximum wall clock time is set by the queue parameter resources_default.walltime. Requesting less wall clock time than the default increases the likelihood that your job will start earlier, thanks to the scheduler's backfill policy. Requesting more than the default still cannot get you more than the queue parameter resources_max.walltime allows. To query the queue parameters, run the following command on the head node:

qmgr -c "list queue batch"

Example:

$ echo sleep 1000 | qsub -l walltime=00:30:00
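To make the [[hh:]mm:]ss format concrete, here is a small sketch that converts such a value to seconds. The function name walltime_to_seconds is hypothetical and not part of any scheduler tooling. The demo reuses it on the values from job.sh.o934264 shown earlier in the thread: the walltime=06:00:00 limit, and the start and end clock times.

```python
def walltime_to_seconds(walltime):
    """Convert an "ss", "mm:ss" or "hh:mm:ss" string to a number of seconds."""
    parts = walltime.strip().split(":")
    if not 1 <= len(parts) <= 3 or not all(part.isdigit() for part in parts):
        raise ValueError(f"not a [[hh:]mm:]ss value: {walltime!r}")
    seconds = 0
    for part in parts:
        # Shift the running total up one base-60 place, then add this field.
        seconds = seconds * 60 + int(part)
    return seconds


if __name__ == "__main__":
    limit = walltime_to_seconds("06:00:00")  # walltime listed under Resources
    # Start and end clock times from the same output file (same day).
    elapsed = walltime_to_seconds("06:52:21") - walltime_to_seconds("03:49:38")
    print(limit, elapsed)  # 21600 10963
```

By this arithmetic the job ran for roughly 3 hours against a 6-hour limit, which suggests, without proving it, that the walltime limit was not what ended the job.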

For more information, please refer to this link:

https://devcloud.intel.com/oneapi/documentation/advanced-queue/

Regards

Abhijeet

Vaneet_Aggarwal · 07-30-2021

Hey, thanks for the reply, but increasing or decreasing the walltime doesn't help. It results in the same error.
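Since changing the walltime makes no difference, one thing that may be worth ruling out is memory growth, which is a common (though here unconfirmed) reason for a process to receive a kill signal. The sketch below, which assumes a Unix-like system where the standard resource module is available, logs the process's peak resident memory to stderr so that it ends up in the job's .e file. The names peak_rss_mib and log_peak_memory are invented for this example.

```python
import resource
import sys


def peak_rss_mib():
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kibibytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_peak_memory(tag, stream=sys.stderr):
    """Write a one-line peak-memory report and return the text written."""
    message = f"[mem] {tag}: peak RSS {peak_rss_mib():.1f} MiB"
    print(message, file=stream, flush=True)
    return message


if __name__ == "__main__":
    log_peak_memory("startup")
    data = [0] * 5_000_000  # allocate something so the peak moves
    log_peak_memory("after allocation")
```

Calling log_peak_memory at a few points in ml_main(), or from inside the data generator, would show whether memory keeps climbing before the job is killed.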

Vaneet_Aggarwal · 07-30-2021

Here's master.py for your reference:

# from downloadData import download_all_data
# from unzipAllGz import unzipAll
# from parseSequence import parseAllFastqs
from ML import ml_main

## Get data

# samplesRange = (0, 30) # Eg. use first thirty samples only

# download files *.filt.fastq.gz
# samples, files = download_all_data(samplesRange)
# unzip *.filt.fastq.gz into *.filt.fastq
# unzipAll(samples, files)
# parse and save *.filt.fastq as *.bin
# parseAllFastqs(samples, files)

## Machine learning part
ml_main()

Also, here is ML.py, which master.py uses:

import tensorflow as tf
# import keras
import numpy as np
# import matplotlib as plt
from trainingData import trainDataGenerator

def make_model():
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.LSTM(64, return_sequences=True, input_shape=(None, 4)))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.LSTM(10, return_sequences=False))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(5, activation = tf.nn.softmax))
    model.summary()
    model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9), loss = tf.keras.losses.CategoricalCrossentropy())
    return model

##    Model: "sequential"
##    _________________________________________________________________
##    Layer (type)                 Output Shape              Param #   
##    =================================================================
##    lstm (LSTM)                  (None, None, 64)          17664     
##    _________________________________________________________________
##    dropout (Dropout)            (None, None, 64)          0         
##    _________________________________________________________________
##    lstm_1 (LSTM)                (None, 10)                3000      
##    _________________________________________________________________
##    dropout_1 (Dropout)          (None, 10)                0         
##    _________________________________________________________________
##    dense (Dense)                (None, 5)                 55        
##    =================================================================
##    Total params: 20,719
##    Trainable params: 20,719
##    Non-trainable params: 0
##    _________________________________________________________________

def ml_main():
    model = make_model()

    ## train model
    # trainDataGenerator() is the generator function
    print("test print!!!")
    model.fit_generator(trainDataGenerator(), verbose = 1, epochs = 1)
    model.save('./models/42files10epochs')
    #...
    ## test model

AbhijeetJ_Intel · 08-04-2021

Hi,

Could you please share the exact steps to run the file and the requirements file for the environment, so that we can try to reproduce the issue on our end and assist you better?

Regards

Abhijeet

Vaneet_Aggarwal · 08-05-2021

The files are attached.

I used the following command to submit job.sh:

qsub -l nodes=1:xeon:batch:ppn=2 -d . job.sh

AbhijeetJ_Intel · 08-19-2021

Hi,

We tried to reproduce the issue on our end but got the following error:

Traceback (most recent call last):
  File "master.py", line 4, in <module>
    from ML import ml_main
  File "/home/u72280/Files/ML.py", line 27, in <module>
    from trainingData import trainDataGenerator
ModuleNotFoundError: No module named 'trainingData'

Could you please tell us which environment you are using to run your job?

Are you using a conda environment?

To list the environments available on DevCloud, use the following command:

conda info --envs

To activate an environment on DevCloud, use:

source activate env

or

conda activate env

where env is the name of the environment.

Could you please try adding the environment activation command to your job.sh, for example:

#!/bin/bash
source /opt/intel/inteloneapi/setvars.sh > /dev/null 2>&1
source activate tensorflow
python master.py
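To confirm from inside the job which interpreter is actually running and whether the modules from the traceback can be found, a short check along these lines could be run before master.py. This is only a sketch: the helper name missing_modules is made up, and the module list is taken from the imports shown in this thread.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that cannot be located on sys.path."""
    # find_spec looks a module up without importing it.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # The interpreter path shows which environment the job really uses.
    print("interpreter:", sys.executable)
    missing = missing_modules(["tensorflow", "numpy", "trainingData"])
    print("missing modules:", missing or "none")
```

If trainingData is reported missing, the file trainingData.py is probably not in the directory the job runs from; if tensorflow is missing, the environment was likely not activated.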

Regards

Abhijeet

AbhijeetJ_Intel · 08-26-2021

Hi,

Is your issue resolved?

Please let us know if the issue still persists.

Regards

Abhijeet

Vaneet_Aggarwal · 08-27-2021

Hi, the issue was on my side and not because of devcloud. Thanks

AbhijeetJ_Intel · 08-30-2021

Hi,

Glad to know that your issue is resolved. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.

Regards

Abhijeet
