I am training a neural network and using keras.callbacks to create checkpoints, but I can't see the .hdf5 files in my working directory. How can I get them from the compute node to my login node? Please help me out.
- Tags:
- General Support
Hi,
Thanks for reaching out to us.
The home folder is NFS-shared between the login node and the compute nodes, so files written on a compute node are also available on the login node.
Could you please check whether your program sets the path for saving the .hdf5 files into the working directory?
If you are submitting it as a job, then add cd $PBS_O_WORKDIR (PBS_O_WORKDIR holds the absolute path of the directory from which the qsub command was run) in the job script.
If you are still facing the issue, please share a sample script or example.
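As a minimal sketch of this (assuming the job is submitted with qsub, so that PBS_O_WORKDIR is present in the job's environment), you can also build an absolute checkpoint path directly in Python:
import os
from keras.callbacks import ModelCheckpoint
# Fall back to the current directory when not running under PBS
workdir = os.environ.get("PBS_O_WORKDIR", os.getcwd())
filepath = os.path.join(workdir, "weights.best.hdf5")
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1,
                             save_best_only=True, mode='max')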
m1.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
filepath = "weights.best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]
info.append(m1.fit([X1, X2], y, epochs=5, batch_size=128, callbacks=callbacks_list, class_weight=class_weights))
m1.save('Cascade_MulEpoch.h5')
This is the code that I am using for saving the model. I am using TensorFlow 1.13 and Keras 2.x.
Hi,
Could you please check whether your model files are saved in the home directory?
cd /home/uxxxxx
To save the model into the working directory, please give the complete path in the code:
m1.save('/home/uxxxxx/path to working directory/Cascade_MulEpoch.h5')
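As a quick additional check, you can print the directory the script is actually running in, since that is where a relative filename such as weights.best.hdf5 will be written. A minimal sketch using only the standard library:
import os
print("Relative paths will be saved under:", os.getcwd())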
Please let us know if the issue still persists.
I am already using cd PBS_O_WORKDIR in the job script. Should I use the full filepath in the save function?
I am facing one more problem.
When I use the checkpoints, I can't see the training progress of the model (i.e., the time for each epoch and which epoch is currently running) from the info.append command. Why is that?
Hi,
By default, PBS scripts execute in your home directory, not the directory from which they were submitted. The following line places you in the directory from which the job was submitted:
cd $PBS_O_WORKDIR
The $ is missing in your command.
Could you please confirm whether the issue is resolved now?
You can check the logs of your running job as follows. For output logs, use qpeek:
qpeek -o <JOB_ID>
For error logs:
qpeek -e <JOB_ID>
To get the JOB_ID, use the command: qstat
Hope this clarifies your query. If not, please elaborate so that we get more clarity on the issue.
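One more note on seeing per-epoch progress in batch jobs: the default Keras progress bar (verbose=1) is designed for interactive terminals, and its carriage-return updates can look garbled or delayed in the redirected log that qpeek reads. Passing verbose=2 prints one summary line per epoch instead. A sketch of the change, reusing the names from your snippet:
# verbose=2 prints one line per epoch, which suits non-interactive job logs
info.append(m1.fit([X1, X2], y, epochs=5, batch_size=128,
                   callbacks=callbacks_list, class_weight=class_weights,
                   verbose=2))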
Sorry, I didn't use the $ symbol when I commented here, but I am using it in my job script, and I am already using qpeek and qstat for checking my outputs and job ID. Still, I cannot see the weight files in my home directory or anywhere else.
Hi,
We tried a sample code using Keras 2.2.4 and were able to save the model file successfully in the specified path.
Could you please share the workload you are trying, so that we can verify it from our end?
optimizer = optimizers.SGD(lr=0.08)
fold = os.listdir("Dataset/Image_Data/HG/")
fold.sort(key=str.lower)
for path in fold:
    print(path)
    path = "Dataset/Image_Data/HG/" + path
    p = os.listdir(path)
    p.sort(key=str.lower)
    arr = []
    # Reading from 4 images and creating 4-channel slice-wise data
    for i in range(len(p)):
        if i != 4:
            p1 = os.listdir(path + '/' + p[i])
            p1.sort()
            img = sitk.ReadImage(path + '/' + p[i] + '/' + p1[-1])
            arr.append(sitk.GetArrayFromImage(img))
        else:
            p1 = os.listdir(path + '/' + p[i])
            img = sitk.ReadImage(path + '/' + p[i] + '/' + p1[0])
            Y_labels = sitk.GetArrayFromImage(img)
    data = np.zeros((Y_labels.shape[1], Y_labels.shape[0], Y_labels.shape[2], 4))
    for i in range(Y_labels.shape[1]):
        data[i, :, :, 0] = arr[0][:, i, :]
        data[i, :, :, 1] = arr[1][:, i, :]
        data[i, :, :, 2] = arr[2][:, i, :]
        data[i, :, :, 3] = arr[3][:, i, :]
    print(data.shape)
    info = []
    # Creating patches for each slice and training (slice-wise)
    for i in range(data.shape[0]):
        d = data_gen(data, Y_labels, i, 1)
        if len(d) != 0:
            y = np.zeros((d[2].shape[0], 1, 1, 5))
            for j in range(y.shape[0]):
                y[j, :, :, d[2][j]] = 1
            X1 = d[0]
            X2 = d[1]
            class_weights = class_weight.compute_class_weight('balanced',
                                                              np.unique(d[2]),
                                                              d[2])
            print('slice no:' + str(i))
            m1.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
            filepath = "weights.best.hdf5"  # This is not working
            checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
            callbacks_list = [checkpoint]
            info.append(m1.fit([X1, X2], y, epochs=5, batch_size=128, callbacks=callbacks_list, class_weight=class_weights))
m1.save('Cascade_MulEpoch.h5')  # This line is working
Hi,
Thanks for sharing the code snippet.
It would be great if you could attach the complete scripts in a folder, along with the steps to run them, so that we can easily reproduce the issue from our end.
Hi,
Thank you for sharing the workload.
We were unable to try out your code due to some package dependency and dataset issues.
However, we observed that you are facing an issue while saving intermediate checkpoints.
We tried a sample code to save intermediate checkpoints from our end and were able to save the checkpoint files after every epoch.
Please find below the code snippet which we used to train and save the intermediate checkpoints:
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
filepath = "checkpoint/weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=False, mode='max')
callbacks_list = [checkpoint]
info.append(model.fit_generator(train_generator,
                                steps_per_epoch=100,
                                epochs=5,
                                validation_data=validation_generator,
                                validation_steps=10,
                                callbacks=callbacks_list))
model.save("savedModels_new/tensorflow.keras_C_C_C_MP_28_right_test_set.h5")
Could you please change your code accordingly and let us know if you face any issues?
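Two details in this snippet are worth checking against your code. First, ModelCheckpoint does not create missing directories, so a relative path like checkpoint/... only works if that folder already exists. Second, a val_accuracy value only exists when validation data is supplied to fit; with save_best_only=True and no monitored value to compare, Keras skips saving the checkpoint. A minimal sketch of both fixes (validation_split=0.1 is just an example value):
import os
# ModelCheckpoint will not create this directory on its own
os.makedirs("checkpoint", exist_ok=True)
# Hold out part of the training data so val_accuracy is computed;
# without it, ModelCheckpoint has nothing to compare and skips saving
info.append(m1.fit([X1, X2], y, epochs=5, batch_size=128,
                   callbacks=callbacks_list, class_weight=class_weights,
                   validation_split=0.1))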
Thanks a lot for the snippet. I'll take a look using it and then reply here.
Hi,
Could you please let us know whether your issue got resolved?
Thanks
Hello, and thanks for the information so far, but my issue is not yet solved. I'll try something else and check it out. You can close the thread now.
Hi Praburam,
Thanks for your response. As mentioned, since we were able to save the checkpoints on DevCloud, it doesn't seem to be a DevCloud issue; it could be due to the settings in your code.
We would suggest you cross-check the following:
1) Save the intermediate checkpoints in a separate folder, with a different name for each model:
filepath="checkpoint/weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5"
2) Try running a sample on your local machine to verify that your code settings work.
As you suggested, we will be closing this thread for now. Please feel free to open a new thread if you face any further issues.
