Solved: My DPC++ program can't run on an Intel ATS-P GPU. Why?

PcDack1 · ‎04-24-2022

The DPC++ code is very simple: it declares a local-memory array, sets every element to 0, and then issues a local-memory barrier.

 #include <CL/sycl.hpp>
 using namespace sycl;
 #define WRAP_SIZE 32
 int main(){
    sycl::gpu_selector selector;
    queue exec_queue(selector);
    int num_blocks=128;
    int num_threads=256;
    int casBeg=0;
    int casEnd=2;
    exec_queue.submit([&](sycl::handler& cgh)
    {
        sycl::stream out{ 4096, 128, cgh };
        auto sharedmem = sycl::accessor<int, 1, sycl::access_mode::read_write, sycl::access::target::local>(11, cgh);
        cgh.parallel_for(
            sycl::nd_range<1>(num_blocks * num_threads, num_threads),
            [=](sycl::nd_item<1> item_ct1) [[intel::reqd_sub_group_size(WRAP_SIZE)]] {
                int blkId = item_ct1.get_group(0);
                int tid = item_ct1.get_local_id(0);
                int stride = item_ct1.get_local_range().get(0);
                out << "inter\n";
                if (tid == 0)
                    for (int i = 0; i < 11; ++i)
                        sharedmem[i] = 0;
                item_ct1.barrier(sycl::access::fence_space::local_space);
            });
    }).wait();
    return 0;
 }

The build command is

dpcpp -DMKL_ILP64 -lmkl_sycl -lmkl_intel_ilp64 -lmkl_tbb_thread -lmkl_core -pthread -std=c++17 -O0 -o <project_name> <code_name>.cpp

The compiled program works on the P630 GPU, but it does not work on the Intel ATS-P GPU (NDA devcloud). Why? How can I fix it? Thx
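Editor's note: the launch in the code above uses nd_range<1>(num_blocks * num_threads, num_threads) together with reqd_sub_group_size(32). A minimal host-side sketch of the sanity checks that apply to such a launch is shown below. The helper check_launch and the struct LaunchCheck are hypothetical names introduced for illustration, not part of SYCL or DPC++, and the list of supported sub-group sizes is assumed to come from the device (for example from the sycl::info::device::sub_group_sizes query). It does not claim to identify the cause of the problem in this thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical host-side helper, not part of SYCL or DPC++.
struct LaunchCheck {
    bool divisible;                    // global size is a multiple of the work-group size
    std::size_t work_groups;           // number of work-groups in the launch
    std::size_t sub_groups_per_group;  // rounded up when the last sub-group is partial
    bool sub_group_supported;          // requested sub-group size is in the device's list
};

// Assumes local_size and sub_group_size are both greater than zero.
inline LaunchCheck check_launch(std::size_t global_size,
                                std::size_t local_size,
                                std::size_t sub_group_size,
                                const std::vector<std::size_t>& supported_sizes) {
    const bool supported =
        std::find(supported_sizes.begin(), supported_sizes.end(),
                  sub_group_size) != supported_sizes.end();
    return {global_size % local_size == 0,
            global_size / local_size,
            (local_size + sub_group_size - 1) / sub_group_size,
            supported};
}
```

For the values in the post (global size 128 * 256, work-group size 256, sub-group size 32) this gives 128 work-groups of 8 sub-groups each. Whether 32 is a supported sub-group size has to be checked against the list reported by the actual device.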

HemanthCH_Intel · ‎04-26-2022

Hi,

Thank you for posting in Intel Communities.

Could you please provide the DPC++ version and OS details so that we can reproduce the issue on our end?

Use the below command to find the DPC++ version:

source /opt/intel/oneapi/setvars.sh
dpcpp --version

Thanks & Regards,

Hemanth

PcDack1 · ‎04-27-2022

Hi,

I am running on the NDA devcloud. I do not know the OS details. The result of uname -a is

Linux s013-n002 5.10.54+prerelease2606 #1 SMP Fri Jan 7 14:21:53 PST 2022 x86_64 x86_64 x86_64 GNU/Linux

The result of /proc/version is

Linux version 5.10.54+prerelease2606 (ubit@fm6pudocker153) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #1 SMP Fri Jan 7 14:21:53 PST 2022

The result of dpcpp --version is

Intel(R) oneAPI DPC++/C++ Compiler 2022.0.0 (2022.0.0.20211123)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /nda/development-tools/versions/oneapi/2022.1.0.nda/oneapi/compiler/2022.0.1-prerelease/linux/bin-llvm

Thx

HemanthCH_Intel · ‎04-29-2022

Hi,

We are working on this internally and will get back to you soon.

Thanks & Regards,

Hemanth

Subarnarek_G_Intel · ‎05-04-2022

Hi,

I don't see any MKL calls in the code. Please help me understand why you have used the MKL libraries in the build command. It works perfectly without the MKL flags.

Regards,

Subarna

PcDack1 · ‎05-04-2022

Hi Subarna,

I have a project that uses some MKL functions (e.g. gemm and LU functions), but the project does not run correctly on the ATS-P card. While debugging it, I found that the problem appears when the MKL flags are added. To describe the problem simply, I reduced it to the simplest code that still shows it. What is very confusing is that the program no longer runs correctly once the MKL flags are added.

Best Regards

Subarnarek_G_Intel · ‎05-12-2022

Hi Li,

This is not a generic issue. Refer to this code, for example: https://github.com/oneapi-src/oneAPI-samples/blob/master/Libraries/oneMKL/matrix_mul_mkl. It works perfectly fine on ATS. What is the error that you got? Can you use the makefile there as the starting point for your build?

Regards,

Subarna
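Editor's note: when asked for "the error that you got", the message carried by the exception is usually the most useful thing to post. The sketch below shows the general rethrow-and-catch pattern for turning a list of captured exceptions into printable messages. It uses std::vector<std::exception_ptr> as a stand-in for the exception list a SYCL asynchronous handler receives; that substitution, and the helper name collect_messages, are assumptions made so the example compiles without a SYCL toolchain. Since sycl::exception derives from std::exception in SYCL 2020, catching std::exception also covers it.

```cpp
#include <exception>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: converts captured exceptions into their messages.
// std::vector<std::exception_ptr> stands in for a SYCL exception list here.
inline std::vector<std::string> collect_messages(
        const std::vector<std::exception_ptr>& exceptions) {
    std::vector<std::string> messages;
    for (const std::exception_ptr& e : exceptions) {
        try {
            if (e) std::rethrow_exception(e);  // skip empty pointers
        } catch (const std::exception& ex) {
            messages.emplace_back(ex.what());  // covers std::exception and subclasses
        } catch (...) {
            messages.emplace_back("unknown exception type");
        }
    }
    return messages;
}
```

Catching only one exception type in a handler lets every other type escape it, so a catch-all branch like the last one above is a cheap safeguard while debugging.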

PcDack1 · ‎05-12-2022

Hi Subarna,

Thank you for your answer. First, following your hint, I ran the USM example of oneMKL gemm (the devcloud path is ~/Intel_oneAPI_MKL_Training/00_GEMM/ and the script is run_gemm_usm.sh). It does not work on ATS-P (NDA devcloud); the error message is CL_INVALID_DEVICE. Second, I added my code (pasted below) after the sample code you gave. It still does not work on ATS-P when compiled at optimization level O0, which we usually need in order to debug the code.

My code snippet is:

    queue.submit([&](sycl::handler& cgh)
    {
        sycl::stream out{ 4096, 128, cgh };
        cgh.parallel_for(
            sycl::nd_range<1>(num_blocks * num_threads, num_threads),
            [=](sycl::nd_item<1> item_ct1) [[intel::reqd_sub_group_size(32)]] {
                int stride = item_ct1.get_local_range().get(0);
                item_ct1.barrier(sycl::access::fence_space::local_space);
            });
    }).wait();

Regards,

Dack

Subarnarek_G_Intel · ‎05-29-2022

Hi Li,

https://jupyter.oneapi.devcloud.intel.com/user/u29878/lab/tree/Intel_oneAPI_MKL_Training/00_GEMM/02_GEMM_DPCPP_USM.ipynb works fine for me. I think you did not follow the instruction to uncomment certain lines of code.

"Uncomment _one_ of the following three lines to select a device." - look for this line.

Regards,

Subarna

PcDack1 · ‎05-30-2022

Hi Subarna,

Thank you for your answer. The first problem was an oversight on my part. How can the second problem be explained?

Regards,

Dack

Subarnarek_G_Intel · ‎05-30-2022

Hi Li,

I am getting a bit confused now. What are you trying to achieve with this thread? Is the issue that the MKL flags are not being used properly, or that the code snippet with the barrier call does not work when you put it inside the MKL sample? If it is the latter, I need to understand exactly where in that code you placed the snippet.

PcDack1 · ‎05-30-2022

Hi Subarna,

Code

#include <iostream>
#include <vector>
#include <CL/sycl.hpp>          //# sycl namespace
#include "oneapi/mkl/blas.hpp"  //# oneMKL DPC++ interface for BLAS functions

//# The following project performs matrix multiplication using oneMKL / DPC++ with buffers.
//# We will execute the simple operation A * B = C
//# The matrix B is set equal to the identity matrix such that A * B = A * I
//# After performing the computation, we will verify A * I = C -> A = C

namespace mkl = oneapi::mkl;  //# shorten mkl namespace

int main() {

    //# dimensions
    int m = 3, n = 3, k = 3;
    //# leading dimensions
    int ldA = 3, ldB = 3, ldC = 3;
    //# scalar multipliers
    double alpha = 1.0, beta = 1.0;
    //# transpose status of matrices
    mkl::transpose transA = mkl::transpose::nontrans;
    mkl::transpose transB = mkl::transpose::nontrans;
    //# matrix data
    std::vector<double> A = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0};
    std::vector<double> B = {1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0};
    std::vector<double> C = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0};

    //### Step 1 - Observe the definition of an asynchronous exception handler.
    //# This function object will later be supplied to the queue.
    //# It is designed to handle errors thrown while device code executes.
    auto async_handler = [](sycl::exception_list exceptions) {
        for (std::exception_ptr const &e : exceptions) {
            try {
                std::rethrow_exception(e);
            }
            catch (sycl::exception const &e) {
                std::cout << "Caught asynchronous SYCL exception: " << e.what() << std::endl;
            }
        }
    };

    //### Step 2 - Create a device object.
    //# Device selectors are used to specify the type of a device.
    //# Uncomment _one_ of the following three lines to select a device.
    // sycl::device device = sycl::device(sycl::default_selector());  //# default_selector returns a device based on a performance heuristic
    // sycl::device device = sycl::device(sycl::cpu_selector());      //# cpu_selector returns a cpu device
    sycl::device device = sycl::device(sycl::gpu_selector());      //# gpu_selector returns a gpu device
    std::cout << "Device: " << device.get_info<sycl::info::device::name>() << std::endl;

    //### Step 3 - Create a queue object.
    //# A queue accepts a single device, and optionally, an exception handler.
    //# Uncomment the following line to initialize a queue with our device and handler.
    sycl::queue queue(device, async_handler);

    //### Step 4 - Create buffers to hold our matrix data.
    //# Buffer objects can be constructed given a container
    //# Observe the creation of buffers for matrices A and B.
    //# Try and create a third buffer for matrix C called C_buffer.
    //# The solution is shown in the hidden cell below.
    sycl::buffer A_buffer(A);
    sycl::buffer B_buffer(B);
    /* define C_buffer here */
    sycl::buffer C_buffer(C);

    //### Step 5 - Execute gemm operation.
    //# Here, we need only pass in our queue and other familiar matrix multiplication parameters.
    //# This includes the dimensions and data buffers for matrices A, B, and C.
    mkl::blas::gemm(queue, transA, transB, m, n, k, alpha, A_buffer, ldA, B_buffer, ldB, beta, C_buffer, ldC);

    //# we cannot explicitly transfer memory to/from the device when using buffers
    //# that is why we must use this operation to ensure result data is returned to the host
    queue.wait_and_throw();  //# block until operation completes, throw any errors

    //### Step 6 - Observe creation of accessors to retrieve data from A_buffer and C_buffer.
    sycl::host_accessor A_acc(A_buffer, sycl::read_only);
    sycl::host_accessor C_acc(C_buffer, sycl::read_only);

    int status = 0;

    // verify C matrix using accessor to observe values held in C_buffer
    std::cout << std::endl;
    std::cout << "C = " << std::endl;
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            if (A_acc[i*m+j] != C_acc[i*m+j]) status = 1;
            std::cout << C_acc[i*m+j] << " ";
        }
        std::cout << std::endl;
    }
    std::cout << std::endl;
    int num_blocks=128;
    int num_threads=256;
    queue.submit([&](sycl::handler& cgh)
    {
        sycl::stream out{ 4096, 128, cgh };
//         auto sharedmem = sycl::accessor<int, 1, sycl::access_mode::read_write, sycl::access::target::local>(11, cgh);
        cgh.parallel_for(
            sycl::nd_range<1>(num_blocks * num_threads, num_threads),
            [=](sycl::nd_item<1> item_ct1) [[intel::reqd_sub_group_size(32)]] {
                out<<"inter\n";
                int warpSize = item_ct1.get_sub_group().get_local_range()[0];
                item_ct1.barrier(sycl::access::fence_space::local_space);
            });
    }).wait();
    status == 0 ? std::cout << "Verified: A = C" << std::endl : std::cout << "Failed: A != C" << std::endl;
    return status;
}

Compile

dpcpp lab/dpcpp_gemm_buffers.cpp -fsycl-device-code-split=per_kernel -DMKL_ILP64 -I$MKLROOT/include -L$MKLROOT/lib/intel64 -lmkl_sycl -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lsycl -lOpenCL -lpthread -lm -ldl -O0

Result:

Question:

There is printing in kernel, but actually there is no printed.

Subarnarek_G_Intel · ‎05-31-2022

Hi Li,

Can you explain what you meant by "There is printing in kernel, but actually there is no printed."

Regards,

Subarna

PcDack1 · ‎05-31-2022

Dear Subarna,

The correct result should be

Device:Intel Graphics [0x020a]

C=
1 2 3
4 5 6
7 8 9

inter
inter
...

verified:A=C

I don't see the "inter" being printed out.

Thx
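Editor's note: one thing worth checking when stream output is missing is how much text the kernel tries to print compared with the sizes passed to sycl::stream. In the code above, sycl::stream out{ 4096, 128, cgh } sets a total buffer size of 4096 characters and a per-work-item buffer size of 128, while 128 * 256 work-items each stream "inter\n". As far as the SYCL 2020 specification describes it, what happens to characters beyond these limits is implementation-defined, so different devices may behave differently. The sketch below only does the arithmetic; StreamBudget and check_stream_budget are hypothetical names, the count covers payload characters only, and nothing in the thread confirms that this is the cause on ATS-P.

```cpp
#include <cstddef>

// Hypothetical helper, not part of SYCL: compares the characters a kernel
// tries to stream against the two sizes given to the sycl::stream constructor.
struct StreamBudget {
    std::size_t demanded_chars;  // payload characters written by all work-items
    bool fits_per_item;          // one work-item stays within the work-item buffer size
    bool fits_total;             // all output stays within the total buffer size
};

constexpr StreamBudget check_stream_budget(std::size_t total_buffer_size,
                                           std::size_t work_item_buffer_size,
                                           std::size_t work_items,
                                           std::size_t chars_per_item) {
    return {work_items * chars_per_item,
            chars_per_item <= work_item_buffer_size,
            work_items * chars_per_item <= total_buffer_size};
}
```

With the values from the post, check_stream_budget(4096, 128, 128 * 256, 6) reports a demand of 196608 characters: each work-item fits its own buffer, but the total is far above 4096. Shrinking the launch or printing from a single work-item is an easy way to rule this out.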

Subarnarek_G_Intel · ‎06-01-2022

Hi Li,

It gets displayed for me.

## u29878 is compiling oneMKL_introduction Module0 -- gemm with buffers - 1 of 3 dpcpp_gemm_buffers.cpp

Device: Intel(R) UHD Graphics P630 [0x3e96]

C =
1 2 3
4 5 6
7 8 9

inter
....

PcDack1 · ‎06-01-2022

Hi Subarna,

Thanks. All my problems occur on the ATS-P graphics card (on devcloud). You have only tested it on the P630 graphics card.

PcDack1 · ‎06-06-2022

Hi All,

First of all, I apologize for not describing the problem precisely. After repeated testing, my team and I found that the problem can be described in a simpler way (it is related to sycl::stream). To make the problem clearer, I created a new post to describe it. The current post will no longer be maintained.

Thx all.

Thanks & Regards,

Dack

Subarnarek_G_Intel · ‎06-07-2022

This issue has been resolved and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.