Solved: Connected to GPU node but CUDA is not available

ffa · ‎11-21-2022

I used the following command to connect interactively to a GPU node, but torch.cuda.is_available() returns False:

qsub -I -l nodes=1:gpu:ppn=2

Please help!

RemyaP_Intel · ‎11-22-2022

Hi,

Thank you for posting in Intel Communities.

What you see is the expected result. Unfortunately, CUDA is not supported in Intel DevCloud.

If that answers your question, can we go ahead and close this case?

Regards,

Remya Premdas

ffa · ‎11-22-2022

How can we utilize the gpu for training models then?

RemyaP_Intel · ‎11-25-2022

Hi,

Intel AI Analytics framework does not support GPU utilization. You may utilize the Xeon scalable multi-core CPUs for training the models.

Regards,

Remya Premdas

ffa · ‎11-25-2022

I have heard that we can use Intel Extension for PyTorch to run on the GPU through the "xpu" device, but I am not able to get it to work. Can you please help with it?

RemyaP_Intel · ‎12-01-2022

Hi,

You can find more details regarding how Intel Extension for PyTorch provides easy GPU acceleration for Intel discrete GPUs here - https://intel.github.io/intel-extension-for-pytorch/cpu/latest/

Here are some code examples:

https://github.com/intel/intel-extension-for-pytorch/tree/xpu-master

https://intel.github.io/intel-extension-for-pytorch/xpu/latest/tutorials/examples.html

Regards,

Remya Premdas
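
For readers following the XPU suggestion, here is a minimal sketch of a runtime device check. It is not taken from the linked examples. It assumes that the GPU build of Intel Extension for PyTorch exposes torch.xpu.is_available() once the extension has been imported; pick_device is a hypothetical helper name, and the function falls back to "cpu" whenever a backend cannot be confirmed.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda", "xpu", or "cpu" depending on what this environment offers."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch itself is not installed
    try:
        import torch
    except Exception:
        return "cpu"  # installed but not importable

    if torch.cuda.is_available():
        return "cuda"

    # Assumption: importing the extension's GPU build registers torch.xpu.
    if importlib.util.find_spec("intel_extension_for_pytorch") is not None:
        try:
            import intel_extension_for_pytorch  # noqa: F401
        except Exception:
            return "cpu"

    xpu = getattr(torch, "xpu", None)
    if xpu is not None and hasattr(xpu, "is_available") and xpu.is_available():
        return "xpu"
    return "cpu"


device = pick_device()
print("Using", device.upper())
```

On a node without a supported GPU this prints "Using CPU", which makes the fallback explicit instead of failing later at model.to(device).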

ffa · ‎12-02-2022

It is slower than using CPU. Am I doing something wrong? Are there any drivers to install?

Here is my code:

import argparse
import numpy as np
import torch
from torch import nn, optim
import intel_extension_for_pytorch as ipex
from torchvision import datasets, models, transforms
from PIL import Image
import sys
import warnings
warnings.filterwarnings("ignore")

parser = argparse.ArgumentParser(description = "Train a new neural network on a dataset.")

parser.add_argument("data_dir", type = str, help = "Dataset for the network to train on.")

parser.add_argument("--arch", type = str, default = "resnet18",
                    help = "Available architectures: resnet18, densenet161, alexnet")

parser.add_argument("--epochs", type = int, default = 10,
                    help = "Number of epochs.")

parser.add_argument("--gpu", action = "store_true", help = "Train on a GPU device.")

parser.add_argument("--hidden_units", type = int, default = 256,
                    help = "Number of hidden units.")

parser.add_argument("--learning_rate", type = float, default = 0.003,
                    help = "Learning rate to use for the model.")

parser.add_argument("--save_dir", type = str, default = "./",
                    help = "Location to save your model after training.")

args_in = parser.parse_args()


# if args_in.gpu:
# try:
# assert torch.cuda.is_available() == True
# device = "cuda"
# print("Using CUDA..")
# except AssertionError:
# answer = input("GPU is not available on this device, use CPU? (yes, no): ")

# if answer.lower() == "yes":
# device = "cpu"
# print("Using CPU..")
# elif answer.lower() == "no": 
# print("Terminating..")
# sys.exit()
# else:
# print("Invalid option selected, terminating..")
# sys.exit()
# else:
# device = "cpu"
# print("Using CPU..")

# print("Loading data..")

device = "xpu"
print("Using", device.upper())

data_dir = args_in.data_dir[:-1] if args_in.data_dir[-1] == "/" else args_in.data_dir
train_dir = data_dir + '/train'
valid_dir = data_dir + '/valid'
test_dir = data_dir + '/test'

train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomRotation(30),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])])
val_test_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])])

train_dataset = datasets.ImageFolder(train_dir, transform = train_transforms)
val_dataset = datasets.ImageFolder(valid_dir, transform = val_test_transforms)
test_dataset = datasets.ImageFolder(test_dir, transform = val_test_transforms)

trainloader = torch.utils.data.DataLoader(train_dataset, batch_size = 64, shuffle = True)
valloader = torch.utils.data.DataLoader(val_dataset, batch_size = 64, shuffle = True)
testloader = torch.utils.data.DataLoader(test_dataset, batch_size = 64, shuffle = True)

print("Building model..")


if args_in.arch == "resnet18":

    model = models.resnet18(pretrained=True)

    for params in model.parameters():
        params.requires_grad = False

    classifier = nn.Sequential(
        nn.Linear(512, args_in.hidden_units),
        nn.ReLU(),
        nn.Dropout(p=0.25),
        nn.Linear(args_in.hidden_units, 102),
        nn.LogSoftmax(dim=1)
    )

    model.fc = classifier
    optimizer = optim.Adam(model.fc.parameters(), lr=args_in.learning_rate)

elif args_in.arch == "densenet161":

    model = models.densenet161(pretrained=True)

    for params in model.parameters():
        params.requires_grad = False

    classifier = nn.Sequential(
        nn.Linear(2208, args_in.hidden_units),
        nn.ReLU(),
        nn.Dropout(p=0.25),
        nn.Linear(args_in.hidden_units, 102),
        nn.LogSoftmax(dim=1)
    )

    model.classifier = classifier
    optimizer = optim.Adam(model.classifier.parameters(), lr=args_in.learning_rate)

elif args_in.arch == "alexnet":

    model = models.alexnet(pretrained=True)

    for params in model.parameters():
        params.requires_grad = False

    classifier = nn.Sequential(
        nn.Linear(9216, args_in.hidden_units),
        nn.ReLU(),
        nn.Dropout(p=0.25),
        nn.Linear(args_in.hidden_units, 102),
        nn.LogSoftmax(dim=1)
    )

    model.classifier = classifier
    optimizer = optim.Adam(model.classifier.parameters(), lr=args_in.learning_rate)

else:
    print("Architecture is not available!")
    sys.exit()

criterion = nn.NLLLoss()
epochs = args_in.epochs
steps = 0
train_losses, test_losses = [], []
running_loss = 0
print_every = 1

model.to(device);

print("Training model..")

for e in range(epochs):
    for images, labels in trainloader:

        steps += 1

        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()

        logps = model(images)
        loss = criterion(logps, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

        if steps % print_every == 0:
            test_loss = 0
            accuracy = 0
            model.eval()

            with torch.no_grad():
                for images, labels in valloader:

                    images, labels = images.to(device), labels.to(device)

                    logps = model(images)
                    loss = criterion(logps, labels)
                    test_loss += loss.item()

                    ps = torch.exp(logps)
                    top_p, top_class = ps.topk(1, dim=1)
                    equals = top_class == labels.view(*top_class.shape)
                    accuracy += torch.mean(equals.type(torch.FloatTensor))

            model.train()
            train_losses.append(running_loss/print_every)
            test_losses.append(test_loss/len(valloader))

            print("Epochs: {}/ {}..".format(e+1, epochs),
                  "Train loss: {:.3f}..".format(running_loss/print_every),
                  "Test loss: {:.3f}..".format(test_loss/len(valloader)),
                  "Accuracy: {:.3f}..".format(accuracy/len(valloader)))

            running_loss = 0
print("Model trained")

print("Testing data..")
with torch.no_grad():

    accuracy = 0
    model.eval()
    for (images, labels) in testloader:

        (images, labels) = (images.to(device), labels.to(device))

        logps = model(images)
        loss = criterion(logps, labels)

        ps = torch.exp(logps)
        (top_p, top_class) = ps.topk(1, dim=1)
        equals = top_class == labels.view(*top_class.shape)
        accuracy += torch.mean(equals.type(torch.FloatTensor))

print("Accuracy on test data: {}".format(accuracy/len(testloader)))

print("Saving model..")

model.class_to_idx = train_dataset.class_to_idx
checkpoint = {
    'epochs': epochs,
    'label_mapping': model.class_to_idx,
    'model_arch': args_in.arch,
    'hidden_units': args_in.hidden_units,
    'model_state_dict': model.state_dict(),
    'optim_state_dict': optimizer.state_dict()
}

# Strip a trailing slash from save_dir (the check must look at save_dir, not data_dir)
save_dir = args_in.save_dir[:-1] if args_in.save_dir[-1] == "/" else args_in.save_dir
torch.save(checkpoint, save_dir + '/checkpoint-' + args_in.arch + '.pth')

print("Model saved successfully.")
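
As a side note on the checkpoint path built at the end of the script, here is a small sketch of an alternative that avoids the manual trailing-slash handling. checkpoint_path is a hypothetical helper name, not part of the script above; it relies only on os.path.join from the standard library.

```python
import os


def checkpoint_path(save_dir: str, arch: str) -> str:
    """Build the checkpoint file path; os.path.join copes with a trailing slash."""
    return os.path.join(save_dir, "checkpoint-" + arch + ".pth")


print(checkpoint_path("./", "resnet18"))
```

The result could then be passed straight to torch.save(checkpoint, checkpoint_path(args_in.save_dir, args_in.arch)).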

Also I keep getting the following error:

[CRITICAL ERROR] Kernel '_ZTSZZN2at15AtenIpexTypeXPU17dpcppMemoryScale2IffEEvPT_PKT0_mfdENKUlRN2cl4sycl7handlerEE_clESA_EUlNS8_4itemILi1ELb1EEEE_' removed due to usage of FP64 instructions unsupported by the targeted hardware. Running this kernel may result in unexpected results.

Please help.

RemyaP_Intel · ‎12-08-2022

Hi,

Could you please share the link to the source you based this code on? Also, please try the sample code, following the steps given, and let us know if you are still getting any errors.

Regards,

Remya Premdas

ffa · ‎12-08-2022

I tried the sample example on both CPU and GPU and did not encounter the kernel error.

CPU appears to be faster than GPU.

train-cpu.py

import torch
import torchvision
import time
############# code changes ###############
# import intel_extension_for_pytorch as ipex
############# code changes ###############

LR = 0.001
DOWNLOAD = False
DATA = 'datasets/cifar10/'

transform = torchvision.transforms.Compose([
    torchvision.transforms.Resize((224, 224)),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
print("Loading data..")
train_dataset = torchvision.datasets.CIFAR10(
    root=DATA,
    train=True,
    transform=transform,
    download=DOWNLOAD,
)
train_loader = torch.utils.data.DataLoader(
    dataset=train_dataset,
    batch_size=128
)

print("Defining model..")
model = torchvision.models.resnet50()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr = LR, momentum=0.9)
model.train()
#################################### code changes ################################
model = model.to("cpu")
# model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.float32)
#################################### code changes ################################

print("Training the model..")
start = time.time()
for batch_idx, (data, target) in enumerate(train_loader):
    batch_start = time.time()
    ########## code changes ##########
    data = data.to("cpu")
    target = target.to("cpu")
    ########## code changes ##########
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
    batch_end = time.time()
    final_time = batch_end - batch_start
    print(batch_idx, final_time)
end = time.time()
print("Done! Total time =", end - start)

print("Saving the model!")
torch.save({
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
}, 'checkpoint.pth')

Output

(pytorch) u178728@s019-n005:~/iefp$ python train-cpu.py
Loading data..
Defining model..
Training the model..
0 14.459826469421387
1 13.667452096939087
2 13.634796857833862
3 13.60639500617981
4 13.862899541854858
5 13.84148621559143
6 13.908369302749634
7 13.900601387023926
8 14.252801179885864
9 13.971882820129395
10 13.960126399993896
11 13.995576620101929
12 13.973973751068115
13 14.029194355010986
14 13.97255802154541
15 13.985599994659424
16 13.95111346244812
17 14.342556238174438
18 14.030582427978516
19 14.013816118240356
20 14.008814334869385
21 14.047874927520752
22 13.99599814414978
23 14.034571647644043
24 14.028615474700928
25 14.386399507522583
26 14.104341983795166
27 14.081377983093262
28 14.128713607788086
29 14.0845787525177
30 14.076815605163574
31 14.024810075759888
32 14.055947303771973
33 14.045863628387451
34 14.361424446105957
35 14.067027568817139
36 14.039616346359253
37 14.151290893554688
38 14.132614612579346
39 14.100401878356934
40 14.10266399383545
41 14.128549575805664
42 14.498270988464355
43 14.082086324691772
44 14.067831754684448
45 14.110792398452759
46 14.032299757003784
47 14.120846033096313
48 14.047749519348145
49 14.165565490722656
50 14.175102233886719
51 14.409415245056152
52 14.14152717590332
53 14.127228736877441
54 14.123661041259766
55 14.11213731765747
56 14.085555791854858
57 14.1154043674469
^C
Traceback (most recent call last):
File "/home/u178728/iefp/train-cpu.py", line 50, in <module>
loss.backward()
File "/glob/development-tools/versions/oneapi/2022.3.1/oneapi/intelpython/latest/envs/pytorch/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/glob/development-tools/versions/oneapi/2022.3.1/oneapi/intelpython/latest/envs/pytorch/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
KeyboardInterrupt

(pytorch) u178728@s019-n005:~/iefp$

I am using the default pytorch conda environment.

train-gpu.py

import torch
import torchvision
import time
############# code changes ###############
import intel_extension_for_pytorch as ipex
############# code changes ###############

LR = 0.001
DOWNLOAD = False
DATA = 'datasets/cifar10/'

transform = torchvision.transforms.Compose([
    torchvision.transforms.Resize((224, 224)),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
print("Loading data..")
train_dataset = torchvision.datasets.CIFAR10(
    root=DATA,
    train=True,
    transform=transform,
    download=DOWNLOAD,
)
train_loader = torch.utils.data.DataLoader(
    dataset=train_dataset,
    batch_size=128
)

print("Defining model..")
model = torchvision.models.resnet50()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr = LR, momentum=0.9)
model.train()
#################################### code changes ################################
model = model.to("xpu")
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.float32)
#################################### code changes ################################

print("Training the model..")
start = time.time()
for batch_idx, (data, target) in enumerate(train_loader):
    batch_start = time.time()
    ########## code changes ##########
    data = data.to("xpu")
    target = target.to("xpu")
    ########## code changes ##########
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
    batch_end = time.time()
    final_time = batch_end - batch_start
    print(batch_idx, final_time)
end = time.time()
print("Done! Total time =", end - start)

print("Saving the model!")
torch.save({
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
}, 'checkpoint.pth')
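
A note on the per-batch timing used in both scripts: wrapping a step in time.time() calls measures only what has finished on the host by the time the second call runs. The sketch below assumes that work sent to an accelerator may be queued asynchronously, so it waits for the device before reading the clock. timed_step is a hypothetical helper; passing torch.xpu.synchronize as sync_fn is an assumption about the GPU build of Intel Extension for PyTorch and was not verified in this thread.

```python
import time
from typing import Callable, Optional


def timed_step(step_fn: Callable[[], None],
               sync_fn: Optional[Callable[[], None]] = None) -> float:
    """Run step_fn once and return the elapsed seconds.

    If sync_fn is given, it is called before starting and before stopping
    the timer so that queued device work is included in the measurement.
    """
    if sync_fn is not None:
        sync_fn()  # drain anything queued before the timer starts
    start = time.perf_counter()
    step_fn()
    if sync_fn is not None:
        sync_fn()  # wait for the work launched by step_fn
    return time.perf_counter() - start


# Stand-in for one training step; on a real run step_fn would do
# zero_grad / forward / backward / optimizer.step.
elapsed = timed_step(lambda: time.sleep(0.01))
print(f"step took {elapsed:.3f} s")
```

For the CPU script no sync_fn is needed, so the two scripts can share the same helper and be compared on equal terms.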

Output:

(iefp) u178728@s019-n004:~/iefp$ python train.py
Loading data..
Defining model..
/home/u178728/iefp/lib/python3.9/site-packages/intel_extension_for_pytorch/frontend.py:277: UserWarning: pending the optimization for LSTM
warnings.warn("pending the optimization for LSTM")
Training the model..
0 91.93736839294434

It has been stuck at iteration 0 for more than half an hour.

I am using a venv environment.

u178728@s019-n005:~/iefp$ source bin/activate
(iefp) u178728@s019-n005:~/iefp$ pip list
Package                     Version
--------------------------- -------------------
intel-extension-for-pytorch 1.10.200+gpu
numpy                       1.23.4
Pillow                      9.3.0
pip                         22.3.1
setuptools                  58.1.0
torch                       1.10.0a0+git3d5f2d4
torchvision                 0.11.0+cpu
typing_extensions           4.4.0
(iefp) u178728@s019-n005:~/iefp$ which python
/home/u178728/iefp/bin/python
(iefp) u178728@s019-n005:~/iefp$ python --version
Python 3.9.13 :: Intel Corporation
(iefp) u178728@s019-n005:~/iefp$
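
The same version information can be collected from inside Python, which is handy when a script is run under different environments (the conda "pytorch" environment for the CPU run, the venv for the GPU run). This is a minimal sketch using only the standard library; installed_versions is a hypothetical helper name, and the package names are the ones shown in the pip list output above.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["torch", "torchvision", "intel-extension-for-pytorch"])
for name, version in report.items():
    print(f"{name:30s} {version or 'not installed'}")
```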

I don't know what I am doing wrong, please help!

RemyaP_Intel · ‎12-12-2022

Hi,

As per the installation document, Intel® Data Center GPU Flex Series 170 is required. Unfortunately, we do not have this in DevCloud. Could you please try this sample on a machine where the hardware and software requirements are met?

Regards,

Remya Premdas

ffa · ‎12-12-2022

Hi Remya,

Thanks for your help! Unfortunately, I do not have a machine that meets the hardware requirements.

ffa · ‎12-12-2022

However, I do see some nodes in the Intel® Developer Cloud for the Edge with Intel® Data Center GPU Flex 170. Can I use those?

https://www.intel.com/content/www/us/en/developer/tools/devcloud/edge/hardware-workloads.html

RemyaP_Intel · ‎12-14-2022

Hi,

The node you are trying to use is an Edge DevCloud node, which is not accessible from oneAPI DevCloud. You will have to request Edge DevCloud access and then connect to that node.

Regards,

Remya Premdas

ffa · ‎12-16-2022

I am not able to get a node for more than 20 minutes. Please help!

JesusE_Intel · ‎12-19-2022

Hi ffa,

The max walltime for academia accounts is 20 minutes on Intel Developer Cloud for the Edge.

Regards,

Jesus
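
Given the 20-minute walltime stated above, one way to keep a training job useful is to stop early and save before the limit is reached. The sketch below is an illustration only: WalltimeBudget and run_until_budget are hypothetical names, the reserve of 120 seconds is an assumed safety margin, and do_step and save stand in for a real training step and a real checkpoint call.

```python
import time
from typing import Callable, Iterable


class WalltimeBudget:
    """Tracks a deadline that leaves reserve_s seconds free for saving."""

    def __init__(self, limit_s: float, reserve_s: float = 0.0) -> None:
        self.deadline = time.monotonic() + limit_s - reserve_s

    def expired(self) -> bool:
        return time.monotonic() >= self.deadline


def run_until_budget(steps: Iterable[int],
                     budget: WalltimeBudget,
                     do_step: Callable[[int], None],
                     save: Callable[[int], None]) -> int:
    """Run steps until the budget expires, then save; return the steps completed."""
    done = 0
    for step in steps:
        if budget.expired():
            break
        do_step(step)
        done += 1
    save(done)  # always save, whether we finished or ran out of time
    return done


# 20-minute limit from the thread, with an assumed 2-minute reserve for saving.
budget = WalltimeBudget(limit_s=20 * 60, reserve_s=120)
completed = run_until_budget(range(3), budget,
                             do_step=lambda s: None,
                             save=lambda n: print("saving after", n, "steps"))
```

A later job could then load the saved state and continue from the recorded step count.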

ffa · ‎12-19-2022

Hi,

Thank you Remya and Jesus. My query has been answered. You can close this thread.

Regards,

Aniket

JesusE_Intel · ‎12-20-2022

If you need any additional information, please submit a new question as this thread will no longer be monitored.