Can't view checkpoint files in the Home directory

I am training a neural network and using keras.callbacks to create checkpoints, but I can't see the .hdf5 files in my working directory. How can I send them from the compute node to my login node? Please help me out.

16 Replies

Moderator

Hi,

Thanks for reaching out to us. 

The home folder is NFS-shared between the login node and the compute nodes, so files written on the compute nodes are also available on the login node.

Could you please check whether your program sets the path for saving the .hdf5 file to the working directory?

If you are submitting it as a job, then use PBS_O_WORKDIR (the absolute path of the directory from which the qsub command was run) in the job script.

If you are still facing the issue, please share a sample script or example.

 

m1.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
filepath = "weights.best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
callbacks_list = [checkpoint]
info.append(m1.fit([X1, X2], y, epochs=5, batch_size=128, callbacks=callbacks_list, class_weight=class_weights))
m1.save('Cascade_MulEpoch.h5')

This is the code that I am using for saving the model. I am using TensorFlow 1.13 and Keras 2.x.

 

Moderator

Hi,

Could you please check whether your model files are saved in the home directory:

cd /home/uxxxxx

To save the model into the working directory, please give the complete path in the code:

m1.save('/home/uxxxxx/path to working directory/Cascade_MulEpoch.h5')

Please let us know if the issue still persists.
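
The full-path suggestion above can be sketched in Python as follows. This is a minimal sketch, not part of the original reply: `output_path` is a hypothetical helper, and it assumes the job was submitted with qsub so that PBS_O_WORKDIR is set (it falls back to the current directory otherwise). It also creates the target folder first, on the assumption that the save call will not create missing directories.

```python
import os


def output_path(filename, subdir=""):
    """Build an absolute path for a model or checkpoint file.

    Hypothetical helper: uses the PBS submission directory when it is
    available, otherwise the current working directory.
    """
    base = os.environ.get("PBS_O_WORKDIR", os.getcwd())
    directory = os.path.join(base, subdir)
    # Create the folder up front so that saving does not fail on a missing directory.
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, filename)


# Example: paths that could be passed to ModelCheckpoint(...) and m1.save(...).
print(output_path("weights.best.hdf5", "checkpoint"))
print(output_path("Cascade_MulEpoch.h5"))
```

Passing the returned strings instead of bare file names removes any dependence on the directory the job happens to start in.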

 

I am already using cd PBS_O_WORKDIR in the job script. Should I use the full file path in the save function?

I am facing one more problem

When I use the checkpoints, I can't see the training progress of the model, i.e. the time for each epoch and which epoch is running, from the info.append command. Why is that?

Moderator

Hi,

By default, PBS scripts execute in your home directory, not the directory from which they were submitted. The following line places you in the directory from which the job was submitted.   

cd $PBS_O_WORKDIR 

The $ is missing in your command.

Could you please confirm whether this issue is resolved now?

You can check the logs of your running job as follows: For output logs, use qpeek:

qpeek -o <JOB_ID>

For error logs:

qpeek -e <JOB_ID>

To get the JOB_ID, use the command: qstat

Hope this clarifies your query. If not, could you please elaborate so that we can understand the issue more clearly?
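
To see in the qpeek output where relative file names such as "weights.best.hdf5" actually resolve, the training script could print its working directory at start-up. The following is a small sketch under that assumption; `describe_working_dir` is a hypothetical helper and not something provided by PBS or Keras.

```python
import os


def describe_working_dir(relative_name="weights.best.hdf5"):
    """Report where a relative file name would be written by this process."""
    return {
        "cwd": os.getcwd(),
        # None when the script is not running under a PBS job.
        "PBS_O_WORKDIR": os.environ.get("PBS_O_WORKDIR"),
        "resolved": os.path.abspath(relative_name),
    }


# Printed lines end up in the job's output log, which qpeek -o displays.
for key, value in describe_working_dir().items():
    print(key, "=", value)
```

If "cwd" and "PBS_O_WORKDIR" differ, the cd line in the job script did not take effect for this process.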

Sorry, I didn't use the $ symbol when I commented here, but I am using it in my job script, and I am already using qpeek and qstat to check my outputs and job ID. Still, I cannot see the weight files in my home directory or anywhere else.

Moderator

Hi,

We tried a sample code using Keras 2.2.4 and were able to save the model file successfully in the specified path.

Could you please share the workload you are trying, so that we can verify it on our end?

 

# Imports assumed from the names used below (not shown in the original post).
import os
import numpy as np
import SimpleITK as sitk
from keras import optimizers
from keras.callbacks import ModelCheckpoint
from sklearn.utils import class_weight

# m1 (the model) and data_gen are defined elsewhere in the full script.
optimizer = optimizers.SGD(lr=0.08)

fold = os.listdir("Dataset/Image_Data/HG/")
fold.sort(key=str.lower)

for path in fold:
    print(path)
    path = "Dataset/Image_Data/HG/" + path
    p = os.listdir(path)
    p.sort(key=str.lower)
    arr = []

    # Reading from 4 images and creating 4 channel slice-wise
    for i in range(len(p)):
        if i != 4:
            p1 = os.listdir(path + '/' + p[i])
            p1.sort()
            img = sitk.ReadImage(path + '/' + p[i] + '/' + p1[-1])
            arr.append(sitk.GetArrayFromImage(img))
        else:
            p1 = os.listdir(path + '/' + p[i])
            img = sitk.ReadImage(path + '/' + p[i] + '/' + p1[0])
            Y_labels = sitk.GetArrayFromImage(img)
    data = np.zeros((Y_labels.shape[1], Y_labels.shape[0], Y_labels.shape[2], 4))
    for i in range(Y_labels.shape[1]):
        data[i, :, :, 0] = arr[0][:, i, :]
        data[i, :, :, 1] = arr[1][:, i, :]
        data[i, :, :, 2] = arr[2][:, i, :]
        data[i, :, :, 3] = arr[3][:, i, :]
    print(data.shape)
    info = []

    # Creating patches for each slice and training (slice-wise)
    for i in range(data.shape[0]):
        d = data_gen(data, Y_labels, i, 1)
        if len(d) != 0:
            y = np.zeros((d[2].shape[0], 1, 1, 5))
            for j in range(y.shape[0]):
                y[j, :, :, d[2]] = 1
            X1 = d[0]
            X2 = d[1]
            class_weights = class_weight.compute_class_weight('balanced',
                                                              np.unique(d[2]),
                                                              d[2])
            print('slice no:' + str(i))
            m1.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
            filepath = "weights.best.hdf5"  # This is not working
            checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose=1, save_best_only=True, mode='max')
            callbacks_list = [checkpoint]
            info.append(m1.fit([X1, X2], y, epochs=5, batch_size=128, callbacks=callbacks_list, class_weight=class_weights))
            m1.save('Cascade_MulEpoch.h5')  # This line is working

Moderator

Hi,

Thanks for sharing the code snippet.

It would be great if you could attach the complete scripts in a folder, along with the steps to run them, so that we can easily reproduce the issue on our end.

I have attached the training script and the dataset for reference.

 

Moderator

Hi,

Thank you for sharing the workload.

We were unable to run your code due to some package dependency and dataset issues.

However, we observed that you are facing an issue while saving intermediate checkpoints.

We tried a sample code to save intermediate checkpoints on our end and were able to save the checkpoint files after every epoch.

Please find below the code snippet that we used to train and save the intermediate checkpoints.
 

model.compile(optimizer = optimizer, loss = 'categorical_crossentropy', metrics = ['accuracy'])
filepath="checkpoint/weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_accuracy', verbose = 1, save_best_only=False, mode='max')
callbacks_list = [checkpoint]
info.append(model.fit_generator(train_generator,
        steps_per_epoch=100,
        epochs=5,
        validation_data=validation_generator,
        validation_steps=10,
        callbacks= callbacks_list))
model.save("savedModels_new/tensorflow.keras_C_C_C_MP_28_right_test_set.h5")

Could you please change your code accordingly and let us know if you face any issues?
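
The snippet above differs from the code posted earlier in two ways that matter to ModelCheckpoint: it passes validation_data to training, and it sets save_best_only=False. The following is a simplified, standalone sketch in plain Python (not the Keras source) of the decision a checkpoint callback is assumed to make at the end of each epoch. `should_save` is a hypothetical helper, and the assumption is that a best-only checkpoint is skipped with a warning when the monitored key is missing from the epoch logs, which is the case for 'val_accuracy' when no validation data is given. Whether this is what happened in the original job was not confirmed in this thread.

```python
import warnings


def should_save(logs, monitor="val_accuracy", best=float("-inf"), save_best_only=True):
    """Return (save_now, new_best) for one epoch.

    Simplified model of a 'max' mode checkpoint: logs is the dict of
    metrics reported for the epoch that just finished.
    """
    if not save_best_only:
        # Every epoch is written, regardless of the monitored value.
        return True, best
    current = logs.get(monitor)
    if current is None:
        # The monitored metric was never computed, so nothing is saved.
        warnings.warn("Can save best model only with %s available, skipping." % monitor)
        return False, best
    if current > best:
        return True, current
    return False, best


# Training without validation data: only training metrics are logged.
print(should_save({"loss": 0.5, "accuracy": 0.8}))
# Training with validation data: the monitored key is present.
print(should_save({"loss": 0.5, "accuracy": 0.8, "val_accuracy": 0.7}))
```

Depending on the Keras version, the logged key may be 'val_acc' rather than 'val_accuracy', so printing the keys of the history returned by fit after a short run is a quick way to check which name to monitor.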

 

Thanks a lot for the snippet. I'll try it out and then reply here.

Moderator

Hi,

Could you please let us know whether your issue has been resolved?

 

Thanks

Hello, and thanks for the information so far, but my issue is not yet solved. I'll try something else and check it out. You can now close the thread.

Moderator

Hi Praburam,

Thanks for your response. As mentioned, since we were able to save the checkpoints on DevCloud, this doesn't seem to be a DevCloud issue; it could be due to the settings in your code.

We would suggest that you cross-check the following:

1) Save the intermediate checkpoints in a separate folder, with a different name for each model:

            filepath="checkpoint/weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5"

2) Try running a sample on your local machine to verify that your code settings work.
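
The filepath in item 1 is a format template rather than a fixed name. As a rough illustration, assuming the callback fills the placeholders with the epoch number and that epoch's logged metrics through ordinary Python string formatting, the resulting file names can be previewed without Keras; `checkpoint_name` is a hypothetical helper written for this sketch.

```python
TEMPLATE = "checkpoint/weights-improvement-{epoch:02d}-{val_accuracy:.2f}.hdf5"


def checkpoint_name(template, epoch, logs):
    """Fill a checkpoint file name template from the epoch number and metrics."""
    # Keys in logs that the template does not mention are simply ignored.
    return template.format(epoch=epoch, **logs)


print(checkpoint_name(TEMPLATE, 3, {"loss": 0.2, "val_accuracy": 0.9137}))
# prints: checkpoint/weights-improvement-03-0.91.hdf5
```

If the logs do not contain val_accuracy (for example, when no validation data is passed), the same formatting step raises a KeyError, which is another reason to confirm that the monitored metric is actually being computed.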

 

As you suggested, we will close this thread for now. Please feel free to open a new thread if you face any further issues.
