Multi-GPU run

breyerml · ‎11-04-2020

Is it possible to use multiple GPUs on the devcloud?

My current qsub script looks like:

#!/bin/bash

#PBS -N parameter_test_friedman
#PBS -l nodes=4:gpu
#PBS -l cput=6:00:00

# setup env
...

# run code
mpirun -n 4 ./prog [options]

The idea is that one GPU is assigned to every MPI rank.

Therefore I wrote the following custom device selector:

int sycl_lsh::device_selector::operator()([[maybe_unused]] const sycl_lsh::sycl::device& device) const {
    #if SYCL_LSH_TARGET == SYCL_LSH_TARGET_CPU
        // TODO: implement correctly
        return sycl::cpu_selector{}.operator()(device);
    #else

        #if SYCL_LSH_TARGET == SYCL_LSH_TARGET_NVIDIA
            const std::string_view platform_name = "NVIDIA CUDA";
        #elif SYCL_LSH_TARGET == SYCL_LSH_TARGET_AMD
            const std::string_view platform_name = "AMD";
        #elif SYCL_LSH_TARGET == SYCL_LSH_TARGET_INTEL
            const std::string_view platform_name = "Intel";
        #endif

        // get platform associated with the current device
        auto platform = device.get_platform();
        // check if we are currently on a NVIDIA platform as requested
        if (detail::contains_substr(platform.get_info<sycl::info::platform::name>(), platform_name)) {
            auto device_list = platform.get_devices();
            // check whether the current platform has enough devices to satisfy the requested number of slots
            if (device_list.size() < static_cast<std::size_t>(comm_.size())) {
                throw std::runtime_error(fmt::format("Found {} devices, but need {} devices to satisfy the requested number of slots!",
                                         device_list.size(), comm_.size()));
            }

            // select current device, if the current device is the ith device in the list given the current MPI rank is i
            if (detail::compare_devices(device_list[comm_.rank()], device) && device_list[comm_.rank()].is_gpu()) {
                return 100;
            }
        }
        // never choose current device otherwise
        return -1;

    #endif
}

However, the device selector always reports only one suitable device (if I target an Intel GPU), and hence an exception is thrown.

Do I have to change the qsub script or do I have to somehow change my custom device_selector?

AbhishekD_Intel · ‎11-05-2020

Hi,

Thanks for reaching out to us.

Yes, you can use multiple iGPUs on Devcloud.

With Intel MPI, you can assign an iGPU to each of your ranks.

I tried the sample below, which uses a custom device selector to select an Intel GPU, and ran it with DPC++ and MPI. I did not see any exceptions or errors with this code.

I also cannot see any problem with the qsub script you are using.

Please try the sample and check the screenshot below for more details.

#include <CL/sycl.hpp>
#include <mpi.h>
#include <iostream>
#include <string>
#define SIZE 10


class custom_selector : public sycl::device_selector {
 public:
  std::string search;
  custom_selector(std::string toFind) : sycl::device_selector(), search{toFind} {}
  // Score a device: 100 for a GPU whose name contains `search`, -1 (never select) otherwise
  int operator()(const sycl::device& device) const override {
      std::string device_name = device.get_info<sycl::info::device::name>();
      if ( device_name.find( search ) != std::string::npos ) {
        if ( device.get_info<sycl::info::device::device_type>() == sycl::info::device_type::gpu ) {
          return 100;
        }
      }

      return -1;
  }
};

int main(int argc, char** argv) {

        MPI_Init(NULL, NULL);

        int world_size;
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        // Get the rank of the process
        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        // Get the name of the processor
        char processor_name[MPI_MAX_PROCESSOR_NAME];
        int name_len;
        MPI_Get_processor_name(processor_name, &name_len);


        {
                std::string search="Intel";
                custom_selector selector{search};
                sycl::queue myQueue(selector);
                {
                        std::cout<<"Device Name: "<<myQueue.get_device().get_info<sycl::info::device::name>() <<" On: "<<processor_name<<" node, rank "<<world_rank<<" out of "<<world_size<<" processors\n";
                        sycl::range<1> a_size{SIZE};
                        myQueue.submit([&](sycl::handler& cgh) {
                                cgh.parallel_for<class my_selector>(a_size, [=](sycl::item<1> item) {
                                        size_t idx = item.get_linear_id();
                                        //==============Your code Logic==================
                                });
                        });
                }
                myQueue.wait();

        }


        // Finalize the MPI environment.
        MPI_Finalize();
}

Output: (screenshot not preserved)

It may be that you are selecting a non-Intel GPU that is not available on the node. In that case you will get an exception like "No device of a requested type available".

To list all available devices on every node, you can try the following code sample.

#include <CL/sycl.hpp>
#include <mpi.h>
#include <cstdio>
#include <iostream>
#define SIZE 10

int main(int argc, char** argv) {

        MPI_Init(NULL, NULL);

        int world_size;
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        // Get the rank of the process
        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        // Get the name of the processor
        char processor_name[MPI_MAX_PROCESSOR_NAME];
        int name_len;
        MPI_Get_processor_name(processor_name, &name_len);

        printf("On %s node, rank %d out of %d processors\n", processor_name, world_rank, world_size);


        {
                sycl::range<1> a_size{SIZE};

                auto platforms = sycl::platform::get_platforms();

                for (auto &platform : platforms) {

                        std::cout << "-------------------------------------------------------------------------\nPlatform: "
                                << platform.get_info<sycl::info::platform::name>()
                                << std::endl;

                        auto devices = platform.get_devices();
                        for (auto &device : devices ) {
                                std::cout << " Device: "
                                        << device.get_info<sycl::info::device::name>()
                                        << std::endl<<"\n";

                        }
                }
        }


        // Finalize the MPI environment.
        MPI_Finalize();
}

Hope this will help you to solve your issue.

Warm Regards,

Abhishek

breyerml · ‎11-09-2020

Hi,

thanks for the reply.

I tried your code snippet and indeed it works (and I now know my mistake).

One question regarding your first code snippet: how does it guarantee that each MPI rank gets exactly one GPU?

breyerml · ‎11-10-2020

I have a follow-up question.

Given the qsub script:

#!/bin/bash

#PBS -N test_run
#PBS -l nodes=4:gpu
#PBS -l cput=24:00:00

mpirun -n 1 $HOME/test/a.out

Why does the code snippet

std::stringstream ss;
ss << "All GPUs on rank " << world_rank << "\n";
auto gpus = sycl::device::get_devices(sycl::info::device_type::gpu);
ss << "Sizes: " << gpus.size() << "\n";
for (auto& device : gpus) {
    ss << device.get_info<sycl::info::device::name>() << "\n";
}

print

All GPUs on rank 0
Sizes: 2
Intel(R) Gen9 HD Graphics NEO
Intel(R) Gen9

?

Shouldn't it print "Intel(R) Gen9 HD Graphics NEO" 4 times since I requested 4 GPUs?
Or does it print "Intel(R) Gen9 HD Graphics NEO" once for each type instead of once for each distinct device?

AbhishekD_Intel · ‎11-13-2020

Hi,

Regarding your first question (how does it guarantee that each MPI rank gets exactly one GPU?):

When you launch one process per node (-n 4 -ppn 1), each of your ranks is guaranteed to have exactly one iGPU, because an iGPU is tied to a particular node.

But if you launch multiple processes on a single node (see the screenshot above), the iGPU of that node is shared across those processes.

Regarding your second question: with a single process you get a GPU count of 2 because two backends are available, OpenCL and Level Zero, and the GPU is listed once for each of them. So before running the above code, specify the backend so that your code only searches that particular backend.

Try the commands below:

export SYCL_BE=PI_LEVEL0

mpirun -n 1 ./<executable>
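
To see why one physical GPU can show up twice, here is a minimal sketch in plain C++ (no SYCL required). `DeviceEntry` and `filter_by_backend` are hypothetical names, the backend labels are illustrative, and which reported name belongs to which backend is an assumption. It only models the idea that the runtime lists one entry per backend and device, so restricting the search to one backend leaves one entry per GPU.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical stand-in for what the runtime reports: one entry per
// (backend, device) pair, not one entry per physical device.
struct DeviceEntry {
    std::string backend;  // illustrative label, e.g. "OpenCL" or "Level-Zero"
    std::string name;     // device name as reported through that backend
};

// Keep only the entries that are exposed through the requested backend.
std::vector<DeviceEntry> filter_by_backend(const std::vector<DeviceEntry>& all,
                                           const std::string& backend) {
    std::vector<DeviceEntry> result;
    std::copy_if(all.begin(), all.end(), std::back_inserter(result),
                 [&backend](const DeviceEntry& e) { return e.backend == backend; });
    return result;
}
```

If the two entries from the output above are modelled as one "OpenCL" entry and one "Level-Zero" entry, filtering for "Level-Zero" leaves a single entry, which is the effect that setting SYCL_BE is meant to have.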

Hope this will solve your issue.

Please let us know if you have any issues while executing the above commands.

Warm Regards,

Abhishek

breyerml · ‎11-13-2020

Thanks for your answer.

So it's only possible to specify one GPU per MPI rank because I would launch one MPI process per node and one node contains only one GPU?

How could I do this, if one node contains multiple GPUs?
For example, say one node contains 4 GPUs. Then I would need to launch 4 MPI processes, one for each GPU on the node. In this case, how can I specify one MPI rank per GPU?
(for example, in the case of NVIDIA GPUs I currently use a hack with CUDA_VISIBLE_DEVICES)

(I need to launch one MPI rank per GPU since hipSYCL doesn't currently support multiple GPUs.)

AbhishekD_Intel · ‎11-20-2020

Hi,

If the node contains only one GPU, then all ranks on that node share that GPU; if you launch a single process on each such node, each process gets one GPU.

If the node contains more than one GPU, you can launch as many processes as there are GPUs so that each process gets one GPU. In this case, you have to query all available devices, keep track of their IDs, and assign those IDs to particular processes as needed.
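
The bookkeeping described above can be sketched in plain C++ as follows. This is only an illustration under assumptions: `local_rank` and `device_index` are hypothetical helpers, and `hostnames[r]` is assumed to hold the host name of global rank r (in an MPI program it could be collected from MPI_Get_processor_name on every rank). The idea is to index devices by a rank's position on its own node rather than by its global rank.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Node-local rank: the number of lower-numbered global ranks on the same host.
// hostnames[r] is assumed to be the host name of global rank r.
std::size_t local_rank(const std::vector<std::string>& hostnames, std::size_t rank) {
    if (rank >= hostnames.size()) {
        throw std::out_of_range("rank is not part of the communicator");
    }
    std::size_t local = 0;
    for (std::size_t r = 0; r < rank; ++r) {
        if (hostnames[r] == hostnames[rank]) {
            ++local;
        }
    }
    return local;
}

// Device ID for a process: the i-th process on a node takes the i-th device.
// Throws if the node has fewer devices than processes placed on it.
std::size_t device_index(std::size_t local, std::size_t device_count) {
    if (local >= device_count) {
        throw std::runtime_error("more processes on this node than devices");
    }
    return local;
}
```

For four ranks placed on hosts {"n1", "n1", "n2", "n2"} with two GPUs per node, ranks 0 and 2 would map to device 0 and ranks 1 and 3 to device 1 on their respective nodes.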

For a sample, refer to my code below, in which I query all devices available on a particular node. I tried this sample on a node with two GPUs.

I used existing functions from the dpct namespace for simplicity.

You can try modifying it according to your usage.

#include <CL/sycl.hpp>
#include <dpct/device.hpp>
#include <mpi.h>
#define SIZE 10

int main(int argc, char** argv) {


        int devCount;
        devCount = dpct::dev_mgr::instance().device_count();
        printf("There are %d devices.\n", devCount);

        MPI_Init(NULL, NULL);

        int world_size;
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        // Get the rank of the process
        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        // Get the name of the processor
        char processor_name[MPI_MAX_PROCESSOR_NAME];
        int name_len;
        MPI_Get_processor_name(processor_name, &name_len);

        {
                // Look up the device whose index equals this rank (world_rank must be smaller than the device count)
                dpct::device_info devProp;
                dpct::dev_mgr::instance().get_device( world_rank ).get_device_info(devProp);

                sycl::queue q;
                q = dpct::dev_mgr::instance().get_device( world_rank ).default_queue();
                std::cout<<"On "<<processor_name
                        <<" node, rank "<<world_rank
                        <<" out of "<<world_size
                        <<" processes, Device name: "<<q.get_device().get_info<sycl::info::device::name>()<<"\n";

        }


        // Finalize the MPI environment.
        MPI_Finalize();
}

Output: (screenshot not preserved)

Hope the provided details will help you.

Warm Regards,

Abhishek

breyerml · ‎11-24-2020

Thanks for your reply. That should theoretically work (if I find a way to compare two devices for equality).

However, I've got a problem testing your code.

Do I have to do something special in order to use the dpct namespace?

It compiles but as soon as I use a function from that namespace like

dpct::dev_mgr::instance().device_count();

the executable runs forever and never completes.

AbhishekD_Intel · ‎11-26-2020

Hi,

Please check your code: have you included the headers #include <CL/sycl.hpp>, #include <dpct/device.hpp>, and #include <mpi.h>?

Please refer to the screenshot above for the compilation command. I have tried the same code in multiple environments without getting any errors, so please check your environment. I also tried it on multiple nodes of Devcloud, and it runs fine.

If you are getting this problem on Devcloud, try using a different node; it should work.

Warm Regards,

Abhishek

AbhishekD_Intel · ‎11-26-2020

Hi,

I also tried it on a quad_gpu node. Please refer to the screenshot below for more details.

Warm Regards,

Abhishek

breyerml · ‎11-29-2020

I'm currently using an MWE to test the code:

#include <CL/sycl.hpp>
#include <dpct/device.hpp>

#include <iostream>

int main() {

    std::cout <<  dpct::dev_mgr::instance().device_count() << std::endl;

    return 0;
}

The code is compiled using:

dpcpp -I /glob/development-tools/versions/oneapi/beta10/inteloneapi/mpi/2021.1-beta10/include -L /glob/development-tools/versions/oneapi/beta10/inteloneapi/mpi/2021.1-beta10/lib/release -L /glob/development-tools/versions/oneapi/beta10/inteloneapi/mpi/2021.1-beta10/lib -Xlinker --enable-new-dtags -Xlinker -rpath -Xlinker /glob/development-tools/versions/oneapi/beta10/inteloneapi/mpi/2021.1-beta10/lib/release -Xlinker -rpath -Xlinker /glob/development-tools/versions/oneapi/beta10/inteloneapi/mpi/2021.1-beta10/lib -lmpicxx -lmpifort -lmpi -ldl -lrt -lpthread main.cpp

Running the resulting executable on a node obtained by

qsub -I -l nodes=1:gpu:ppn=2

results in

...:~/test$ time ./a.out
5

real	0m3.561s
user	0m1.854s
sys	0m0.821s

which is OK (although 3.5 s isn't that great, I think).

However, if I try to run the same code (compiled on the new node) on another node

qsub -I -l nodes=1:quad_gpu:ppn=2

no output is generated even after 5min.

AbhishekD_Intel · ‎12-02-2020

Hi,

There is a problem with some quad_gpu nodes, and we are already working on it.

So try using another quad_gpu node for your workload, or use a dual_gpu node.

Let us know if we can close this thread as we have given solutions to all of your issues.

Warm Regards,

Abhishek

AbhishekD_Intel · ‎12-16-2020

Hi,

As we have resolved your multiple issues, we are no longer monitoring this thread.

Please post a new thread if you have any other issues.

Warm Regards,

Abhishek