Hi,
I am currently trying to process a vast amount of data and want to try offloading the work to the GPU. Two questions:
1. Does DPC++ support C++20? Also, is there a way to upgrade the GCC version on the DevCloud? It is currently GCC 7. I am unable to use the default execution policy (the compiler complains that it cannot find the <execution> header). Likewise, std::reduce is not found in namespace std, even though I have included the <numeric> header as documented at https://en.cppreference.com/w/cpp/algorithm/reduce. My aim is to accumulate a list of values (which can be combined out of order, hence std::reduce), possibly using a DPC++ execution policy (either the default one or one suited to the selector). Is this possible?
2. Most of the examples I have come across deal with adding two separate vectors/arrays and copying the result into an output array/buffer. Is it possible to pass a streaming set of values to the selected device, invoke an operation over the array, and pass a single value back efficiently? I would appreciate pointers to any such examples.
Thanks,
Siril.
Hi,
Thanks for posting in Intel communities.
>>Doesn't dpc++ support C++20 ?
Yes, DPC++ supports C++20 features. C++17 is enabled by default; to enable C++20 features, use the compiler option -std=c++20 (Linux) or /Qstd=c++20 (Windows).
Please refer to the below link for more details:
>> Also is there a way to upgrade the gcc version on the devcloud. Currently its gcc 7
You are checking the GCC version on the login node. On the compute nodes you will get GCC 9.3.
Please request a compute node using the command below:
qsub -I -l nodes=1:xeon:ppn=2
You cannot upgrade the GCC version on the DevCloud yourself.
>>possibly trying to use the dpcpp execution policy (either default or the one suited for the selector). Is this possible?
You can use device execution policies via the oneAPI DPC++ Library (oneDPL).
Please refer to the below link for more details:
Regarding your second query, could you please elaborate on your use case? What do you mean by passing a streaming set of values?
Thanks & Regards,
Noorjahan.
Thanks a lot, Noorjahan. I was able to build with C++20 (GCC 9.4 from the Xeon node).
However, when running std::reduce on a vector of doubles on the GPU node, it complained that the type is not supported. It worked fine when given a vector of uint_fast64_t (unsigned long).
Regarding my second question: can we have a shared memory (A) between the CPU and the GPUs, where I update incoming prices from the market, apply a transformation on the GPU, and put the results into a different shared memory (B)? Can this happen without invoking parallel_for every single time, i.e. with the transformation triggered on the GPU automatically whenever a new value appears in the shared memory? I would also like A and B to be arrays of memory sized to the number of GPUs, so that every GPU works on its own region and outputs its own value.
Basically, on a CPU core this can be a function running on a core-bound thread, with an index into the array, that checks for updates, processes the input value, and puts the output into another memory location. Can the same be achieved on a GPU?
Thanks,
Siril.
Hi,
Glad to know that your issue is resolved.
>>Can we have a shared memory (A) between the cpu and the gpus
Yes, we can create shared memory between the CPU and a device using Intel's Unified Shared Memory (USM) model.
You can use malloc_shared, so that the data can be accessed on the host or the device through the same pointer.
Please refer to the below link for more memory allocation details:
https://oneapi-src.github.io/DPCPP_Reference/usm.html
You can also refer to Data Parallel C++ Text Book for more details
https://link.springer.com/chapter/10.1007/978-1-4842-5574-2_6
Thanks & Regards,
Noorjahan.
Thanks Noorjahan.
My question was basically checking: 1. Are double types supported on GPUs? 2. Can a GPU core be used the way we use CPU cores, i.e. set affinity for a thread to a particular core where it runs a thread function (a loop that checks for input, processes it, and writes the output to a different memory location)? I guess we can't use a GPU core that way; please correct me if I am wrong.
So the only way to use the GPU from DPC++ would be to get the gpu_selector and use it to execute a parallel_for over a function with a set of inputs passed as an array (or to use the DPC++ STL algorithms directly with a DPC++ execution policy). Please correct me if that is not the only option.
Thanks,
Siril.
Hi,
By default, the GPU OpenCL driver only exposes the double-precision floating-point extension on devices where double precision is supported.
If it is not supported then we do have double-precision floating-point emulation.
Please refer to the below link for more details:
Yes, you are correct regarding your second query.
GPU cores cannot be used the same way as CPU cores. The only way to use the GPU is to get the gpu_selector and use it to execute a parallel_for, as you have mentioned.
You can control the SYCL* or OpenMP* threads on multiple CPU cores using Environment Variables as mentioned in the below link.
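For the CPU side, the standard OpenMP affinity variables are a common starting point (a sketch; the exact placement behavior depends on your OpenMP runtime):

```shell
# Pin OpenMP host threads to cores, one thread per core, neighbors close:
export OMP_NUM_THREADS=4
export OMP_PLACES=cores
export OMP_PROC_BIND=close
```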
Thanks & Regards,
Noorjahan.
Hi,
We haven't heard back from you. Could you please provide an update on your issue?
Thanks & Regards,
Noorjahan.
Thanks a lot, Noorjahan. All my queries have been answered.
Thanks,
Siril.
Hi,
Thanks for the confirmation!
As this issue has been resolved, we will no longer respond to this thread. If you need any additional information, please post a new question.
Thanks & Regards,
Noorjahan.