I'm trying to run a Python file on DevCloud. The job script, job.sh, is as follows:
#!/bin/bash
source /opt/intel/inteloneapi/setvars.sh > /dev/null 2>&1
python master.py
I submit it from the Mac terminal with the following command:
qsub -l nodes=1:xeon:batch:ppn=2 -d . job.sh
The job ran for about three hours and produced two output files: job.sh.e934264 and job.sh.o934264.
The job.sh.e934264 file is as follows:
2021-07-26 03:49:45.014693: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /glob/development-tools/versions/oneapi/2021.3/inteloneapi/vpl/2021.4.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/tbb/2021.3.0/env/../lib/intel64/gcc4.8:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/rkcommon/1.6.1/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ospray_studio/0.7.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ospray/2.6.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/openvkl/0.13.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/oidn/1.4.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/mpi/2021.2.0//libfabric/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/mpi/2021.2.0//lib/release:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/mpi/2021.2.0//lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/mkl/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/itac/2021.3.0/slib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ipp/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ippcp/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ipp/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/embree/3.13.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/dnnl/2021.3.0/cpu_dpcpp_gpu_dpcpp/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/debugger/10.1.2/gdb/intel64/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/debugger/10.1.2/libipt/intel64/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/debugger/10.1.2/dep/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/dal/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib/x64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib/emu:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib/oclfpga/host/linux64/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib/oclfpga/linux64/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/compiler/lib/intel64_lin:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ccl/2021.3.0/lib/cpu_gpu_dpcpp
2021-07-26 03:49:45.014777: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-07-26 03:49:50.062319: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /glob/development-tools/versions/oneapi/2021.3/inteloneapi/vpl/2021.4.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/tbb/2021.3.0/env/../lib/intel64/gcc4.8:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/rkcommon/1.6.1/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ospray_studio/0.7.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ospray/2.6.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/openvkl/0.13.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/oidn/1.4.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/mpi/2021.2.0//libfabric/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/mpi/2021.2.0//lib/release:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/mpi/2021.2.0//lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/mkl/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/itac/2021.3.0/slib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ipp/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ippcp/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ipp/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/embree/3.13.0/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/dnnl/2021.3.0/cpu_dpcpp_gpu_dpcpp/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/debugger/10.1.2/gdb/intel64/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/debugger/10.1.2/libipt/intel64/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/debugger/10.1.2/dep/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/dal/2021.3.0/lib/intel64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib/x64:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib/emu:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib/oclfpga/host/linux64/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/lib/oclfpga/linux64/lib:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/compiler/2021.3.0/linux/compiler/lib/intel64_lin:/glob/development-tools/versions/oneapi/2021.3/inteloneapi/ccl/2021.3.0/lib/cpu_gpu_dpcpp
2021-07-26 03:49:50.062403: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-07-26 03:49:50.062449: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (s001-n061): /proc/driver/nvidia/version does not exist
2021-07-26 03:49:50.062948: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-07-26 03:52:31.660446: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-07-26 03:52:31.679568: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 3400000000 Hz
/var/spool/torque/mom_priv/jobs/934264.v-qsvr-1.aidevcloud.SC: line 4: 110188 Killed python master.py
job.sh.o934264 is:
########################################################################
# Date: Mon 26 Jul 2021 03:49:38 AM PDT
# Job ID: 934264.v-qsvr-1.aidevcloud
# User: u65358
# Resources: neednodes=1:xeon:batch:ppn=2,nodes=1:xeon:batch:ppn=2,walltime=06:00:00
########################################################################
########################################################################
# End of output for job 934264.v-qsvr-1.aidevcloud
# Date: Mon 26 Jul 2021 06:52:21 AM PDT
########################################################################
The expected output was not produced, and the job ended with the "Killed" message shown above. Can someone please help me with this? Thanks.
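A note on the last line of the error file: a bare "Killed" reported by the shell means the python process received SIGKILL. One common cause is the node running out of memory, though nothing in this thread confirms that this is what happened here. As a rough diagnostic, and only a sketch, the snippet below reports the process's peak memory using the standard-library resource module. The helper name peak_memory_mb is made up for illustration and is not part of master.py or ML.py.

```python
import resource
import sys


def peak_memory_mb():
    """Return this process's peak resident set size in megabytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


# Hypothetical usage: print this between stages of a long job and flush,
# so the last value written before a kill still reaches the job's .o file.
print(f"peak memory so far: {peak_memory_mb():.1f} MB", flush=True)
```

If the reported value keeps climbing toward the node's memory limit before the job dies, memory is a likely suspect; if it stays flat, the cause is probably elsewhere.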
Hi,
Thank you for reaching out.
So that we can analyze the issue on our end, could you please share the master.py file, if that is OK, or a minimal reproducer?
In the meantime, could you also try running the job with an increased walltime?
You can increase the walltime as below:
Syntax:
-l walltime=<time>
Requests the maximum wall clock time the job may run. The time format is either plain seconds or [[hh:]mm:]ss. By default, the maximum wall clock time is set by the queue parameter resources_default.walltime. Requesting less wall clock time than the default increases the likelihood that your job will start earlier, thanks to the scheduler's backfill policy. Requesting more than the default still cannot exceed the value of the queue parameter resources_max.walltime. To query the queue parameters, run the following command on the head node:
qmgr -c "list queue batch"
Example:
$ echo sleep 1000 | qsub -l walltime=00:30:00
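To make the time format concrete, here is a small sketch that converts a walltime string to seconds. The function name walltime_to_seconds is hypothetical and is not part of qsub or Torque; it only illustrates how the [[hh:]mm:]ss form reads.

```python
def walltime_to_seconds(walltime):
    """Convert a walltime string in [[hh:]mm:]ss form (or plain seconds) to seconds."""
    parts = walltime.strip().split(":")
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unsupported walltime format: {walltime!r}")
    seconds = 0
    # Each field to the left is worth 60 times the field to its right.
    for part in parts:
        seconds = seconds * 60 + int(part)
    return seconds


print(walltime_to_seconds("00:30:00"))  # 1800
print(walltime_to_seconds("06:00:00"))  # 21600
```

For reference, the .o file posted earlier shows walltime=06:00:00 (21600 seconds), while the job started at 03:49:38 and ended at 06:52:21, roughly three hours later.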
For more information, please refer to this link:
https://devcloud.intel.com/oneapi/documentation/advanced-queue/
Regards
Abhijeet
Hey, thanks for the reply but increasing or decreasing the walltime doesn't help. It results in the same error.
Here's master.py for your reference:
# from downloadData import download_all_data
# from unzipAllGz import unzipAll
# from parseSequence import parseAllFastqs
from ML import ml_main
## Get data
# samplesRange = (0, 30) # Eg. use first thirty samples only
# download files *.filt.fastq.gz
# samples, files = download_all_data(samplesRange)
# unzip *.filt.fastq.gz into *.filt.fastq
# unzipAll(samples, files)
# parse and save *.filt.fastq as *.bin
# parseAllFastqs(samples, files)
## Machine learning part
ml_main()
Also, here is ML.py, which is used in master.py:
import tensorflow as tf
# import keras
import numpy as np
# import matplotlib as plt
from trainingData import trainDataGenerator
def make_model():
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.LSTM(64, return_sequences=True, input_shape=(None, 4)))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.LSTM(10, return_sequences=False))
    model.add(tf.keras.layers.Dropout(0.3))
    model.add(tf.keras.layers.Dense(5, activation = tf.nn.softmax))
    model.summary()
    model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9), loss = tf.keras.losses.CategoricalCrossentropy())
    return model
## Model: "sequential"
## _________________________________________________________________
## Layer (type) Output Shape Param #
## =================================================================
## lstm (LSTM) (None, None, 64) 17664
## _________________________________________________________________
## dropout (Dropout) (None, None, 64) 0
## _________________________________________________________________
## lstm_1 (LSTM) (None, 10) 3000
## _________________________________________________________________
## dropout_1 (Dropout) (None, 10) 0
## _________________________________________________________________
## dense (Dense) (None, 5) 55
## =================================================================
## Total params: 20,719
## Trainable params: 20,719
## Non-trainable params: 0
## _________________________________________________________________
def ml_main():
    model = make_model()
    ## train model
    # trainDataGenerator() is the generator function
    print("test print!!!")
    model.fit_generator(trainDataGenerator(), verbose = 1, epochs = 1)
    model.save('./models/42files10epochs')
    #...
    ## test model
Hi,
Could you please share the exact steps to run the file, along with the requirements file for the environment, so that we can try to reproduce the issue on our end and assist you better?
Regards
Abhijeet
The files are attached.
I used the following command to submit job.sh:
qsub -l nodes=1:xeon:batch:ppn=2 -d . job.sh
Hi,
We tried to reproduce the issue on our end but got the following error:
Traceback (most recent call last):
  File "master.py", line 4, in <module>
    from ML import ml_main
  File "/home/u72280/Files/ML.py", line 27, in <module>
    from trainingData import trainDataGenerator
ModuleNotFoundError: No module named 'trainingData'
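A quick way to check whether a module such as trainingData is visible to the interpreter is sketched below. It uses only the standard library; the helper name module_available is made up for this example. Run it from the same directory, and with the same python, as master.py.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if a top-level module called `name` can be found on the import path."""
    return importlib.util.find_spec(name) is not None


# Module names taken from the traceback above.
for name in ("ML", "trainingData"):
    status = "found" if module_available(name) else "NOT found"
    print(f"{name}: {status}")

# The directories Python searches, in order.
print("Search path:")
for entry in sys.path:
    print("  ", entry)
```

If trainingData is reported as not found, the file trainingData.py is most likely missing from the working directory or from every directory on the search path.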
Could you please tell us which environment you are using to run your job?
Are you using any conda environment?
To see the list of available environments on the DevCloud, use the following command:
conda info --envs
To activate an environment on the DevCloud, use:
source activate env
or
conda activate env
where env is the name of the environment.
Could you please try adding the environment activation command to your job.sh? For example:
#!/bin/bash
source /opt/intel/inteloneapi/setvars.sh > /dev/null 2>&1
source activate tensorflow
python master.py
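To confirm which interpreter and conda environment the job actually picks up, a short check like the one below could be placed at the top of master.py. This is only a sketch: describe_environment is a hypothetical helper, and it assumes that conda sets CONDA_DEFAULT_ENV and CONDA_PREFIX when an environment is activated.

```python
import os
import sys


def describe_environment():
    """Collect basic facts about the Python interpreter and active conda environment."""
    return {
        "python_executable": sys.executable,
        "python_version": sys.version.split()[0],
        # These variables are set by conda on activation; absent otherwise.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV", "<not set>"),
        "conda_prefix": os.environ.get("CONDA_PREFIX", "<not set>"),
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}", flush=True)
```

The printed values end up in the job's .o file, so you can compare them with what you see in an interactive session.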
Regards
Abhijeet
Hi,
Is your issue resolved?
Please let us know if the issue still persists.
Regards
Abhijeet
Hi, the issue was on my side and not because of devcloud. Thanks
Hi,
Glad to know that your issue is resolved. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.
Regards
Abhijeet
