oneAPI Registration, Download, Licensing and Installation
Support for Getting Started questions related to download, installation, and licensing for Intel oneAPI Toolkits and software development tools.

'ccl' issue with Multi-GPU AI Training (Data-Parallel) with Intel® Extension for PyTorch

RogerT
Beginner

Hi All,

I have been experimenting with Multi-GPU AI Training (Data-Parallel) with Intel® Extension for PyTorch as per the video here:
https://www.youtube.com/watch?v=3A8AVsNNHOg

I have set up multiple environments, but I have not been successful in getting the example shown in the video to work.

I have narrowed the issue down to a simple example, shown below:

...

print("rank: ", rank)
print("world_size: ",world_size)

device = "xpu:{}".format(rank)
print("device: ",device)
torch.xpu.set_device(device)

dist.init_process_group( backend='ccl', rank=rank, world_size=world_size)

model = nn.Linear(10, 10)
modelondevice = model.to(device)
print("model moved to device: ",device)

ddp_model = DDP(modelondevice, device_ids=[device])
print("model added to ddp model for: ",device)

...
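
For reference, the elided parts above are just the usual setup, roughly as follows (a sketch only; the exact code is in the attached script, and how rank/world_size are derived may differ slightly):

import os

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

import intel_extension_for_pytorch  # registers the 'xpu' device
import oneccl_bindings_for_pytorch  # registers the 'ccl' backend

# rank / world_size come from the launcher
# (torchrun sets RANK / WORLD_SIZE; Intel MPI sets PMI_RANK / PMI_SIZE)
rank = int(os.environ.get("RANK", os.environ.get("PMI_RANK", 0)))
world_size = int(os.environ.get("WORLD_SIZE", os.environ.get("PMI_SIZE", 1)))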


No matter how I run it (from the command line via torchrun or via a Slurm batch script), I get the same error when calling 'ddp_model = DDP(modelondevice, device_ids=[device])':

...
2024:06:26-12:33:14:(18833) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2024:06:26-12:33:14:(18833) |CCL_INFO| process launcher: hydra, local_proc_idx: -1, local_proc_count: -1
...
2024:06:26-12:33:17:(18835) |CCL_ERROR| base_thread.cpp:22 start: error while creating worker thread #0 pthread_create returns 22
2024:06:26-12:33:17:(18835) |CCL_ERROR| exec.cpp:122 start_workers: condition workers.back()->start(cpu_affinity, mem_affinity) == ccl::status::success failed
...


I have tried multiple settings of the following environment variables (ZE_AFFINITY_MASK, CCL_PROCESS_LAUNCHER, CCL_ATL_TRANSPORT, CCL_WORKER_COUNT, CCL_WORKER_AFFINITY, CCL_WORKER_MEM_AFFINITY), but the error persists.


Could someone please advise me on how to resolve this issue?


Roger


Environment:
export CCL_ZE_IPC_EXCHANGE="sockets"
export ZE_FLAT_DEVICE_HIERARCHY="COMPOSITE"
export ZE_AFFINITY_MASK="0, 1, 2, 3"
export CCL_PROCESS_LAUNCHER="hydra"
export CCL_ATL_TRANSPORT="mpi"
#export CCL_WORKER_COUNT="4"
#export CCL_WORKER_AFFINITY="auto"
#export CCL_WORKER_MEM_AFFINITY="auto"
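
For completeness, the launch commands look roughly like this (ddp_test.py is a placeholder name for the attached script; the Slurm batch script wraps essentially the same command):

# single node, one process per GPU
torchrun --standalone --nnodes=1 --nproc_per_node=4 ddp_test.py

# or, matching CCL_PROCESS_LAUNCHER=hydra / CCL_ATL_TRANSPORT=mpi, via Intel MPI
mpirun -n 4 python ddp_test.py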


Packages:
torch==2.0.1a0
torchvision==0.15.2a0
intel-extension-for-pytorch==2.0.120+xpu
oneccl-bind-pt==2.0.200
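
These were installed roughly as follows (the extra index URL is, as far as I can tell, the standard Intel XPU wheel channel):

python -m pip install torch==2.0.1a0 torchvision==0.15.2a0 intel-extension-for-pytorch==2.0.120+xpu oneccl-bind-pt==2.0.200 --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/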


Hardware:
2x Intel(R) Xeon(R) Platinum 8468 (formerly codenamed Sapphire Rapids) (96 cores in total)
1024 GiB RAM
4x Intel(R) Data Center GPU Max 1550 GPUs (formerly codenamed Ponte Vecchio) (128 GiB GPU RAM each)
Xe-Link 4-way GPU interconnect within the node
Quad-rail NVIDIA (Mellanox) HDR200 InfiniBand interconnect


See the attachments for the full code, submission script, outputs, and environment details.

2 Replies
Vipin_Singh1
Moderator

Hi Roger, we would like to inform you that we are routing your query to the dedicated team for further assistance.


RogerT
Beginner