The DPC++ code is very simple: it declares a local array, sets its elements to 0, and then calls a local-memory barrier.
#include <CL/sycl.hpp>
using namespace sycl;
#define WRAP_SIZE 32
int main(){
    sycl::gpu_selector selector;
    queue exec_queue(selector);
    int num_blocks=128;
    int num_threads=256;
    int casBeg=0;
    int casEnd=2;
    exec_queue.submit([&](sycl::handler& cgh)
    {
        sycl::stream out{ 4096, 128, cgh };
        auto sharedmem = sycl::accessor<int, 1, sycl::access_mode::read_write, sycl::access::target::local>(11, cgh);
        cgh.parallel_for(
            sycl::nd_range<1>(num_blocks * num_threads, num_threads),
            [=](sycl::nd_item<1> item_ct1) [[intel::reqd_sub_group_size(WRAP_SIZE)]] {
                int blkId = item_ct1.get_group(0);
                int tid = item_ct1.get_local_id(0);
                int stride = item_ct1.get_local_range().get(0);
                out<<"inter\n";
                if (tid == 0)
                    for (int i = 0; i < 11; ++i)
                        sharedmem[i] = 0;
                item_ct1.barrier(sycl::access::fence_space::local_space);
            });
    }).wait();
    return 0;
}
The build command is:
dpcpp -DMKL_ILP64 -lmkl_sycl -lmkl_intel_ilp64 -lmkl_tbb_thread -lmkl_core -pthread -std=c++17 -O0 -o <project_name> <code_name>.cpp
The compiled program works on the P630 GPU but does not work on the Intel ATS-P GPU on the NDA DevCloud. Why, and how can I fix it? Thanks.
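A minimal sketch, in plain C++ with no SYCL dependency, of the launch arithmetic implied by the code above: how many work-items run, how many sub-groups each work-group contains, and how many characters the kernel tries to stream compared with the first argument of sycl::stream (the total buffer size). The helper names are hypothetical and exist only for this illustration. Whether output beyond the stream sizes is kept or dropped is, to my understanding, left to the implementation, so treat this as arithmetic only and not as an explanation of the ATS-P behavior.

```cpp
#include <cassert>
#include <cstddef>

// Values copied from the posted code.
constexpr std::size_t kNumBlocks = 128;             // work-groups (num_blocks)
constexpr std::size_t kNumThreads = 256;            // work-items per work-group (num_threads)
constexpr std::size_t kSubGroupSize = 32;           // WRAP_SIZE
constexpr std::size_t kStreamTotalBytes = 4096;     // first sycl::stream argument
constexpr std::size_t kStreamWorkItemBytes = 128;   // second sycl::stream argument
constexpr std::size_t kMessageBytes = sizeof("inter\n") - 1;  // characters per work-item

// Hypothetical helper: total number of work-items in the nd_range.
inline std::size_t global_size(std::size_t groups, std::size_t local) {
    return groups * local;
}

// Hypothetical helper: sub-groups per work-group, rounded up.
inline std::size_t sub_groups_per_group(std::size_t local, std::size_t sub_group) {
    assert(sub_group > 0);
    return (local + sub_group - 1) / sub_group;
}

// Hypothetical helper: characters streamed if every work-item writes once.
inline std::size_t stream_demand(std::size_t work_items, std::size_t bytes_each) {
    return work_items * bytes_each;
}
```

With these values, global_size(kNumBlocks, kNumThreads) is 32768 work-items, each work-group holds 8 sub-groups, and the kernel asks to stream 196608 characters in total, 48 times the 4096-byte stream size requested. Each individual message (6 characters) is well under the 128-byte per-work-item size.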
Hi,
Thank you for posting in Intel Communities.
Could you please provide the DPC++ version and OS details so that we can reproduce the issue on our end?
Use the commands below to find the DPC++ version:
source /opt/intel/oneapi/setvars.sh
dpcpp --version
Thanks & Regards,
Hemanth
Hi,
I am running on the NDA DevCloud, so I do not know the OS details. The output of uname -a is:
Linux s013-n002 5.10.54+prerelease2606 #1 SMP Fri Jan 7 14:21:53 PST 2022 x86_64 x86_64 x86_64 GNU/Linux
The content of /proc/version is:
Linux version 5.10.54+prerelease2606 (ubit@fm6pudocker153) (gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #1 SMP Fri Jan 7 14:21:53 PST 2022
The output of dpcpp --version is:
Intel(R) oneAPI DPC++/C++ Compiler 2022.0.0 (2022.0.0.20211123)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /nda/development-tools/versions/oneapi/2022.1.0.nda/oneapi/compiler/2022.0.1-prerelease/linux/bin-llvm
Thanks
Hi,
We are working on this internally and will get back to you soon.
Thanks & Regards,
Hemanth
Hi,
I don't see any MKL calls in the code. Could you help me understand why you used the MKL libraries in the build command? The code works perfectly without the MKL flags.
Regards,
Subarna
Hi Subarna,
I have a project that uses some MKL functions (e.g. gemm and LU functions), but the project does not run correctly on the ATS-P card. While debugging it, I found that the problem appears when the MKL flags are added. To describe the problem simply, I reduced it to the smallest code that shows it. What is very confusing is that the program does not run correctly as soon as the MKL flags are added.
Best Regards
Hi Li,
This is not a generic issue. For example, the sample at https://github.com/oneapi-src/oneAPI-samples/blob/master/Libraries/oneMKL/matrix_mul_mkl works perfectly fine on ATS. What error did you get? Could you use the makefile from that sample as a starting point for your build?
Regards,
Subarna
Hi Subarna,
Thank you for your answer. First, based on your hint, I ran the USM example of oneMKL gemm (the DevCloud path is ~/Intel_oneAPI_MKL_Training/00_GEMM/ and the script is run_gemm_usm.sh). It doesn't work on ATS-P (NDA DevCloud); the error message is CL_INVALID_DEVICE. Second, I added my code (pasted below) after the sample code you gave. It still doesn't work on ATS-P at optimization level -O0, which we usually need in order to debug the code.
My code snippet is:
queue.submit([&](sycl::handler& cgh)
{
    sycl::stream out{ 4096, 128, cgh };
    cgh.parallel_for(
        sycl::nd_range<1>(num_blocks * num_threads, num_threads),
        [=](sycl::nd_item<1> item_ct1) [[intel::reqd_sub_group_size(32)]] {
            int stride = item_ct1.get_local_range().get(0);
            item_ct1.barrier(sycl::access::fence_space::local_space);
        });
}).wait();
Regards,
Dack
Hi Li,
https://jupyter.oneapi.devcloud.intel.com/user/u29878/lab/tree/Intel_oneAPI_MKL_Training/00_GEMM/02_GEMM_DPCPP_USM.ipynb works fine for me. I think you didn't follow the instruction to uncomment certain lines of code.
Look for this line: "Uncomment _one_ of the following three lines to select a device."
Regards,
Subarna
Hi Subarna,
Thank you for your answer. The first problem was an oversight on my part. As for the second problem, how can it be explained?
Regards,
Dack
Hi Li,
I am getting a bit confused now. What are you trying to achieve in this thread? Is the issue that the MKL flags are not being applied properly, or that the code snippet with the barrier call that you put inside the MKL sample doesn't work? If it is the latter, I need to understand exactly where in that code you placed the snippet.
Hi Subarna,
Code:
#include <iostream>
#include <vector>
#include <CL/sycl.hpp> //# sycl namespace
#include "oneapi/mkl/blas.hpp" //# oneMKL DPC++ interface for BLAS functions
//# The following project performs matrix multiplication using oneMKL / DPC++ with buffers.
//# We will execute the simple operation A * B = C
//# The matrix B is set equal to the identity matrix such that A * B = A * I
//# After performing the computation, we will verify A * I = C -> A = C
namespace mkl = oneapi::mkl; //# shorten mkl namespace
int main() {
    //# dimensions
    int m = 3, n = 3, k = 3;
    //# leading dimensions
    int ldA = 3, ldB = 3, ldC = 3;
    //# scalar multipliers
    double alpha = 1.0, beta = 1.0;
    //# transpose status of matrices
    mkl::transpose transA = mkl::transpose::nontrans;
    mkl::transpose transB = mkl::transpose::nontrans;
    //# matrix data
    std::vector<double> A = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0};
    std::vector<double> B = {1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0};
    std::vector<double> C = {0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0};
    //### Step 1 - Observe the definition of an asynchronous exception handler.
    //# This function object will later be supplied to the queue.
    //# It is designed to handle errors thrown while device code executes.
    auto async_handler = [](sycl::exception_list exceptions) {
        for (std::exception_ptr const &e : exceptions) {
            try {
                std::rethrow_exception(e);
            }
            catch (sycl::exception const &e) {
                std::cout << "Caught asynchronous SYCL exception: " << e.what() << std::endl;
            }
        }
    };
    //### Step 2 - Create a device object.
    //# Device selectors are used to specify the type of a device.
    //# Uncomment _one_ of the following three lines to select a device.
    // sycl::device device = sycl::device(sycl::default_selector()); //# default_selector returns a device based on a performance heuristic
    // sycl::device device = sycl::device(sycl::cpu_selector()); //# cpu_selector returns a cpu device
    sycl::device device = sycl::device(sycl::gpu_selector()); //# gpu_selector returns a gpu device
    std::cout << "Device: " << device.get_info<sycl::info::device::name>() << std::endl;
    //### Step 3 - Create a queue object.
    //# A queue accepts a single device, and optionally, an exception handler.
    //# Uncomment the following line to initialize a queue with our device and handler.
    sycl::queue queue(device, async_handler);
    //### Step 4 - Create buffers to hold our matrix data.
    //# Buffer objects can be constructed given a container
    //# Observe the creation of buffers for matrices A and B.
    //# Try and create a third buffer for matrix C called C_buffer.
    //# The solution is shown in the hidden cell below.
    sycl::buffer A_buffer(A);
    sycl::buffer B_buffer(B);
    /* define C_buffer here */
    sycl::buffer C_buffer(C);
    //### Step 5 - Execute gemm operation.
    //# Here, we need only pass in our queue and other familiar matrix multiplication parameters.
    //# This includes the dimensions and data buffers for matrices A, B, and C.
    mkl::blas::gemm(queue, transA, transB, m, n, k, alpha, A_buffer, ldA, B_buffer, ldB, beta, C_buffer, ldC);
    //# we cannot explicitly transfer memory to/from the device when using buffers
    //# that is why we must use this operation to ensure result data is returned to the host
    queue.wait_and_throw(); //# block until operation completes, throw any errors
    //### Step 6 - Observe creation of accessors to retrieve data from A_buffer and C_buffer.
    sycl::host_accessor A_acc(A_buffer, sycl::read_only);
    sycl::host_accessor C_acc(C_buffer, sycl::read_only);
    int status = 0;
    // verify C matrix using accessor to observe values held in C_buffer
    std::cout << std::endl;
    std::cout << "C = " << std::endl;
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            if (A_acc[i*m+j] != C_acc[i*m+j]) status = 1;
            std::cout << C_acc[i*m+j] << " ";
        }
        std::cout << std::endl;
    }
    std::cout << std::endl;
    int num_blocks=128;
    int num_threads=256;
    queue.submit([&](sycl::handler& cgh)
    {
        sycl::stream out{ 4096, 128, cgh };
        // auto sharedmem = sycl::accessor<int, 1, sycl::access_mode::read_write, sycl::access::target::local>(11, cgh);
        cgh.parallel_for(
            sycl::nd_range<1>(num_blocks * num_threads, num_threads),
            [=](sycl::nd_item<1> item_ct1) [[intel::reqd_sub_group_size(32)]] {
                out<<"inter\n";
                int warpSize = item_ct1.get_sub_group().get_local_range()[0];
                item_ct1.barrier(sycl::access::fence_space::local_space);
            });
    }).wait();
    status == 0 ? std::cout << "Verified: A = C" << std::endl : std::cout << "Failed: A != C" << std::endl;
    return status;
}
Compile
dpcpp lab/dpcpp_gemm_buffers.cpp -fsycl-device-code-split=per_kernel -DMKL_ILP64 -I$MKLROOT/include -L$MKLROOT/lib/intel64 -lmkl_sycl -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lsycl -lOpenCL -lpthread -lm -ldl -O0
Result:
Question:
The kernel contains a print statement, but nothing is actually printed.
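To separate the GEMM check from the missing stream output, the matrix part of the code above can be reproduced on the host. The following is a minimal sketch in plain C++, not the oneMKL implementation: reference_gemm and thread_example_result are hypothetical names, and column-major storage is an assumption (with B set to the identity and square 3x3 matrices, the result is the same in row-major order).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only reference for C = alpha * A * B + beta * C without transposes.
// Assumption: column-major storage, element (i, j) of A is A[i + j * ldA].
inline void reference_gemm(std::size_t m, std::size_t n, std::size_t k,
                           double alpha,
                           const std::vector<double>& A, std::size_t ldA,
                           const std::vector<double>& B, std::size_t ldB,
                           double beta,
                           std::vector<double>& C, std::size_t ldC) {
    assert(A.size() >= ldA * k && B.size() >= ldB * n && C.size() >= ldC * n);
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            double acc = 0.0;
            for (std::size_t p = 0; p < k; ++p) {
                acc += A[i + p * ldA] * B[p + j * ldB];
            }
            C[i + j * ldC] = alpha * acc + beta * C[i + j * ldC];
        }
    }
}

// Runs the reference with the same data as the posted code and returns C.
inline std::vector<double> thread_example_result() {
    const std::vector<double> A = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0};
    const std::vector<double> B = {1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0};
    std::vector<double> C(9, 0.0);
    reference_gemm(3, 3, 3, 1.0, A, 3, B, 3, 1.0, C, 3);
    return C;
}
```

Because B is the identity and C starts at zero, thread_example_result() returns the same nine values as A, which matches the "C =" block that the program prints before the kernel with the stream is submitted.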
Hi Li,
Can you explain what you meant when you said that the kernel prints but nothing is actually printed?
Regards,
Subarna
Dear Subarna,
The correct result should be:
Device:Intel Graphics [0x020a]
C=
1 2 3
4 5 6
7 8 9
inter
inter
...
verified:A=C
I don't see "inter" being printed.
Thanks
Hi Li,
It gets displayed for me.
## u29878 is compiling oneMKL_introduction Module0 -- gemm with buffers - 1 of 3 dpcpp_gemm_buffers.cpp
Device: Intel(R) UHD Graphics P630 [0x3e96]
C =
1 2 3
4 5 6
7 8 9
inter
inter
inter
....
Hi Subarna,
Thanks. All my problems are on the ATS-P graphics card (on the DevCloud). You have only tested on the P630 graphics card.
Hi All,
First of all, I apologize for not describing the problem precisely. After repeated testing, my team and I found that the problem can be described more simply (it is related to sycl::stream). I have created a new post that describes it. This post will no longer be maintained.
Thanks, all.
Thanks & Regards,
Dack
This issue has been resolved and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.