topic Re: Re:confused over the different parallel for in Intel® oneAPI DPC++/C++ Compiler

confused over the different parallel for

NewtonDynamics — Tue, 24 Jan 2023 19:48:22 GMT

hi, I am trying to write code that I previously wrote for cuda,

but so far, I still find the learning curve is a hard as it was two years ago.

This code is a straight counting sort that works perfectly on VS cpp and cuda

here is a very base function that, I can't get to work as expected.

template <class T, class ndEvaluateKey, int exponentRadix> void SyclCountItems(sycl::queue& queue, buffer<T>& src, buffer<int>& scansBuffer) { ndAssert((1 << exponentRadix) <= D_COUNTING_SORT_LOCAL_BLOCK_SIZE); queue.submit([&](sycl::handler& handler) { //ndEvaluateKey evaluator; int arraySize = src.size(); int workGroupSize = 1 << exponentRadix; int workGroupCount = (arraySize + workGroupSize - 1) / workGroupSize; sycl::accessor<T, 1> srcAccessor(src, handler); sycl::accessor<int, 1> scanAccessor(scansBuffer, handler); sycl::local_accessor<int, 1> counters(D_COUNTING_SORT_LOCAL_BLOCK_SIZE, handler); sycl::stream out(1024, 256, handler); handler.parallel_for(nd_range<1>{ {workGroupCount}, {workGroupSize}}, [=](nd_item<1> item) { id<1> localId = item.get_local_id(); counters[localId] = localId; item.barrier(); id<1> flatId = item.get_global_id(); scanAccessor[flatId] = counters[localId]; id<1> globalId = item.get_global_id(); out << "flatId: " << flatId << "group:" << item.get_global_id() << " local : " << localId << sycl::endl; }); #endif }); }

the log produces this:

flatId: {0}group:{0} local : {0}
flatId: {1}group:{1} local : {1}
flatId: {2}group:{2} local : {2}
flatId: {3}group:{3} local : {3}

but I expect something like,

flatId: {0}group:{0} local : {0}

flatId: {1}group:{0} local : {1}

flatId: {9}group:{2} local : {2}

flatId: {8}group:{1} local : {1}

flatId: {3}group:{0} local : {3}

flatId: {8}group:{2} local : {6}

flatId: {9}group:{1} local : {0}

flatId: {7}group:{0} local : {7}

flatId: {10}group:{1} local : {3}

....

and if I read that scansBuffer, I should get an array like

0, 1, 2, 3, 4, .... , workgrupoSize

0, 1, 2, 3, 4, .... , workgrupoSize

...

but instead, I get this:

- m_cpuBuffer2 { size=65536 } StlVector<int> [capacity] 65536 unsigned __int64 + [allocator] {...} std::_Compressed_pair<StlAllocator<int>,std::_Vector_val<std::_Simple_types<int>>,1> [0] 0 int [1] 0 int [2] 0 int [3] 0 int [4] 0 int [5] 0 int [6] 0 int [7] 0 int [8] 0 int [9] 1 int [10] 2 int [11] 3 int [12] 0 int [13] 0 int [14] 0 int [15] 0 int [16] 0 int [17] 1 int [18] 2 int [19] 3 int [20] 0 int [21] 0 int [22] 0 int [23] 0 int [24] 0 int

not even in the ballpark and I do not see any patern

It is almost like the paraller_for only schedule on work item per work group.

and somehow a nested loop is necessary, by is some I do not see any example on the

book Data Parallel C++, which in the was written the same Intel engineer how design this.

again this code work perfectly in CUDA.

Re: confused over the different parallel for

NewtonDynamics — Tue, 24 Jan 2023 20:24:45 GMT

one way I can get this above code working is if I write like this.

template <class T, class ndEvaluateKey, int exponentRadix> void SyclCountItems(sycl::queue& queue, sycl::buffer<T>& src, sycl::buffer<int>& scansBuffer) { ndAssert((1 << exponentRadix) <= D_COUNTING_SORT_LOCAL_BLOCK_SIZE); queue.submit([&](sycl::handler& handler) { ndEvaluateKey evaluator; int arraySize = src.size(); int workGroupSize = 1 << exponentRadix; int workGroupCount = (arraySize + workGroupSize - 1) / workGroupSize; sycl::range<1> workGroupSizeRange(workGroupSize); sycl::range<1> workGroupCountRange(workGroupCount); sycl::accessor srcAccessor(src, handler); sycl::accessor scanAccessor(scansBuffer, handler); sycl::local_accessor<int, 1> counters(D_COUNTING_SORT_LOCAL_BLOCK_SIZE, handler); handler.parallel_for_work_group(workGroupCountRange, workGroupSizeRange, [=](sycl::group<1> group) { sycl::id<1> groupId = group.get_group_id(); int base = groupId * workGroupSize; group.parallel_for_work_item([&](sycl::h_item<1> item) { sycl::id<1> localId = item.get_local_id(); counters[localId] = 0; }); group.parallel_for_work_item([&](sycl::h_item<1> item) { sycl::id<1> localId = item.get_local_id(); int srcIndex = base + localId; int scanIndex = evaluator.GetCount(srcAccessor[srcIndex]); sycl::atomic_ref<int, sycl::memory_order::relaxed, sycl::memory_scope::work_item> atomicIndex(counters[scanIndex]); atomicIndex++; }); group.parallel_for_work_item([&](sycl::h_item<1> item) { sycl::id<1> localId = item.get_local_id(); scanAccessor[base + localId] = counters[localId]; }); }); }); }

which is rather very elegant, but the part that I am dubious about is, the code is scheduling separate kernels.

for example, what happens at the end of a

group.parallel_for_work_item([&](sycl::h_item<1> item)

the code does not seem to allow using a barriers.

Re: confused over the different parallel for

NewtonDynamics — Wed, 25 Jan 2023 00:12:20 GMT

here is some information related about this problem with local shared memory and registers.

imagine you get a sequence, of number and you want to get the prefix sum scan

ex:

3, 4, 1, 0, 0, 0, 0, 0

for simplicity the count is a power of two.

the expected result should be.

0, 3, 7, 8, 8, 8, 8, 8

using Cuda, this code accomplice that:

template <typename BufferItem, typename SortKeyPredicate> __global__ void ndCudaCountingSortCountShuffleItemsInternal(const BufferItem* src, BufferItem* dst, unsigned* histogram, unsigned size, SortKeyPredicate GetSortKey, unsigned prefixKeySize) { __shared__ unsigned cacheItemCount[D_COUNTING_SORT_BLOCK_SIZE / 2 + D_COUNTING_SORT_BLOCK_SIZE + 1]; const unsigned blockId = blockIdx.x; const unsigned threadId = threadIdx.x; for (int i = 1; i < prefixKeySize; i = i << 1) { const unsigned countSum = cacheItemCount[prefixBase + threadId] + cacheItemCount[prefixBase - i + threadId]; __syncthreads(); cacheItemCount[prefixBase + threadId] = countSum; __syncthreads(); } }

the intermediate result will be offset by the word_group / 2, therefore the result is.

0, 0, 0, 0, 0, 3, 7, 8, 8, 8, 8, 8

I am trying to write sycl equivalent, and this is what I get to running,

template <class T, class ndEvaluateKey, int exponentRadix> void SyclMergeBuckects(sycl::queue& queue, sycl::buffer<T>& src, sycl::buffer<T>& dst, sycl::buffer<int>& scansBuffer) { queue.submit([&](sycl::handler& handler) { ndEvaluateKey evaluator; int arraySize = src.size(); int workGroupSize = 1 << exponentRadix; int workGroupCount = (arraySize + workGroupSize - 1) / workGroupSize; sycl::range<1> workGroupSizeRange(workGroupSize); sycl::range<1> workGroupCountRange(workGroupCount); sycl::local_accessor<int, 1> cacheItemCount(D_COUNTING_SORT_LOCAL_BLOCK_SIZE / 2 + D_COUNTING_SORT_LOCAL_BLOCK_SIZE + 1, handler); //sycl::stream out(1024, 256, handler); handler.parallel_for_work_group(workGroupCountRange, workGroupSizeRange, [=](sycl::group<1> group) { for (int i = 1; i < workGroupSize; i = i << 1) { group.parallel_for_work_item([&](sycl::h_item<1> item) { int localId = item.get_local_id(); int countSum = cacheItemCount[prefixBase + localId] + cacheItemCount[prefixBase - i + localId]; group.mem_fence(); cacheItemCount[prefixBase + localId] = countSum; group.mem_fence(); }); } }); }); }

but that produces this sequence:

cacheItemCount {...} const {0, 0, 0, 0, 0, 0, 3, 7, 11, 15, 22, 30, 0} int[13]

so somehow, there is a race condition. and the values in register count is invalid.

I did not see any function other than group.mem_fence to make sure the date is all read before the local buffer is overwritten.

Thanks

Re:confused over the different parallel for

SeshaP_Intel — Wed, 25 Jan 2023 12:05:16 GMT

Hi,

Thank you for posting in Intel Communities.

Could you please provide both (DPC++ and CUDA) complete reproducer codes so that we can investigate this issue more from our end?

Thanks and Regards,

Pendyala Sesha Srinivas

Re: Re:confused over the different parallel for

NewtonDynamics — Thu, 26 Jan 2023 18:50:44 GMT

sure. Here is a very simplified version of the problem with running code examples.

here is a sycl hello world program

template <class T, class ndEvaluateKey, int exponentRadix> void SyclMergeBuckects(sycl::queue& queue, sycl::buffer<T>& src, sycl::buffer<T>& dst, sycl::buffer<int>& scansBuffer) { queue.submit([&](sycl::handler& handler) { sycl::accessor dstAccessor(dst, handler); sycl::stream out(4096, 256, handler); sycl::range<1> range(8); handler.parallel_for(range, [=](sycl::id<1> id) { out << id << sycl::endl; }); }); }

the function above generates this sequence.

{1}
{5}
{0}
{6}
{4}
{2}
{7}
{3}

the problem is that the simplistic kernel does not allow for data synchronization using work_group of some application defined size

so the same function si now written using the nd_range instead.

template <class T, class ndEvaluateKey, int exponentRadix> void SyclMergeBuckects(sycl::queue& queue, sycl::buffer<T>& src, sycl::buffer<T>& dst, sycl::buffer<int>& scansBuffer) { queue.submit([&](sycl::handler& handler) { sycl::accessor dstAccessor(dst, handler); sycl::stream out(4096, 256, handler); sycl::range<1> localSize(4); sycl::range<1> globalSize(2); handler.parallel_for(sycl::nd_range<1>(globalSize, localSize), [=](sycl::nd_item<1> item) { sycl::id<1> id_x = item.get_local_id(0); sycl::id<1> id_y = item.get_global_id(0); size_t width = item.get_local_range(0); size_t index = id_y * width + id_x; out << index << sycl::endl; }); }); }

I expect the above kernel to generate a very similar sequance, but instead produces this

0
3

it seems it schedule two work groups, which is fine, but each group only has one id, when it should have 4.

below is the cuda equivalent.

__global__ void ndCudaMerge() { const unsigned threadId = threadIdx.x; const unsigned index = threadId + blockDim.x * blockIdx.x; printf("cuda %d\n", index); } //this is called with this ndCudaMerge << <2, 4 >> > (); cudaDeviceSynchronize();

the cuda kernel generates this sequence

cuda 0
cuda 1
cuda 2
cuda 3
cuda 4
cuda 5
cuda 6
cuda 7

so the questin is how do I read the work_group items, from each group?

Re: confused over the different parallel for

NewtonDynamics — Thu, 26 Jan 2023 20:20:42 GMT

sure, here are the samples.

this is s hello word kind of kernel that I need to convert to and nd_range kerenel.

template <class T, class ndEvaluateKey, int exponentRadix> void SyclMergeBuckects(sycl::queue& queue, sycl::buffer<T>& src, sycl::buffer<T>& dst, sycl::buffer<int>& scansBuffer) { queue.submit([&](sycl::handler& handler) { sycl::accessor dstAccessor(dst, handler); sycl::stream out(4096, 256, handler); sycl::range<1> range(8); handler.parallel_for(range, [=](sycl::id<1> id) { out << id << sycl::endl; }); }); }

that kernel produces the fallowing sequence

{1}
{5}
{0}
{6}
{4}
{2}
{7}
{3}

the nd_range version is this:

template <class T, class ndEvaluateKey, int exponentRadix> void SyclMergeBuckects(sycl::queue& queue, sycl::buffer<T>& src, sycl::buffer<T>& dst, sycl::buffer<int>& scansBuffer) { queue.submit([&](sycl::handler& handler) { sycl::accessor dstAccessor(dst, handler); sycl::stream out(4096, 256, handler); sycl::range<1> localSize(4); sycl::range<1> globalSize(2); handler.parallel_for(sycl::nd_range<1>(globalSize, localSize), [=](sycl::nd_item<1> item) { sycl::id<1> id_x = item.get_local_id(0); sycl::id<1> id_y = item.get_global_id(0); size_t width = item.get_local_range(0); size_t index = id_y * width + id_x; out << index << sycl::endl; }); }); }

but that kernel produces this sequence.

0
3

here is the working Cuda equivalent.

which generates:

cuda 0
cuda 1
cuda 2
cuda 3
cuda 4
cuda 5
cuda 6
cuda 7

so my question is how to I write a proper nd_range kernel that allows me to iterate by the specified work_group size.

I tried almost all possible combinations and they all fail.

what am I missing?

Re: Re:confused over the different parallel for

NewtonDynamics — Thu, 26 Jan 2023 21:31:34 GMT

well, I am not sure why the code same does show up, I posted it twice already.

Re: Re:confused over the different parallel for

NewtonDynamics — Thu, 26 Jan 2023 22:56:57 GMT

Alright since the code sniped does shows on the post, I placed in a file and attached it to this post.

I have spent over two weeks try to resolve this problem, I have been programing professionally for almost 35 years,

I bought the book on Parallel programming, I read the Chronos spec on SYCL, I browsed many of the popular sight and

notone seems to have this problem, so it most I who is doing something wrong, but for the light of me, I can't determine what.

The test sample I send are copied almost verbatim from snipe of code from Intel own documentation.

Intel docs

any help with this would be greatly appreciated.

Julio Jerez

Thanks.

Re: confused over the different parallel for

SeshaP_Intel — Wed, 01 Feb 2023 10:44:13 GMT

Hi,

Please find the below DPC++ code using nd_range kernel for the corresponding CUDA code.

#include <CL/sycl.hpp> using namespace sycl; void ndCudaMerge(nd_item<3> item1, const stream &out) { const unsigned threadId = item1.get_local_id(2); const unsigned index = threadId + item1.get_local_range(2) * item1.get_group(2); out << "DPCPP "<< index <<"\n"; } int main() { queue q; q.submit([&](handler &cgh) { stream out(64 * 1024, 80, cgh); cgh.parallel_for(nd_range<3>(range<3>(1, 1, 2) * range<3>(1, 1, 4),range<3>(1, 1, 4)), [=](nd_item<3> item1) { ndCudaMerge(item1, out); }); }).wait(); return 0; }

Thanks and Regards,

Pendyala Sesha Srinivas

Re: confused over the different parallel for

NewtonDynamics — Wed, 01 Feb 2023 22:06:22 GMT

awesome !!

that worked very nice.

thank you very much.

Julio

Re:confused over the different parallel for

SeshaP_Intel — Thu, 02 Feb 2023 07:13:27 GMT

Hi,

Thanks for accepting the solution. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.

Thanks and Regards,

Pendyala Sesha Srinivas