Hi, I am trying to port code that I previously wrote for CUDA, but so far I find the learning curve as hard as it was two years ago.
The code is a straightforward counting sort that works perfectly in VS C++ and CUDA.
Here is a very basic function that I can't get to work as expected.
template <class T, class ndEvaluateKey, int exponentRadix>
void SyclCountItems(sycl::queue& queue, sycl::buffer<T>& src, sycl::buffer<int>& scansBuffer)
{
ndAssert((1 << exponentRadix) <= D_COUNTING_SORT_LOCAL_BLOCK_SIZE);
queue.submit([&](sycl::handler& handler)
{
//ndEvaluateKey evaluator;
int arraySize = src.size();
int workGroupSize = 1 << exponentRadix;
int workGroupCount = (arraySize + workGroupSize - 1) / workGroupSize;
sycl::accessor<T, 1> srcAccessor(src, handler);
sycl::accessor<int, 1> scanAccessor(scansBuffer, handler);
sycl::local_accessor<int, 1> counters(D_COUNTING_SORT_LOCAL_BLOCK_SIZE, handler);
sycl::stream out(1024, 256, handler);
handler.parallel_for(sycl::nd_range<1>{ {workGroupCount}, {workGroupSize} }, [=](sycl::nd_item<1> item)
{
id<1> localId = item.get_local_id();
counters[localId] = localId;
item.barrier();
id<1> flatId = item.get_global_id();
scanAccessor[flatId] = counters[localId];
out << "flatId: " << flatId << "group:" << item.get_global_id() << " local : " << localId << sycl::endl;
});
});
}
the log produces this:
flatId: {0}group:{0} local : {0}
flatId: {1}group:{1} local : {1}
flatId: {2}group:{2} local : {2}
flatId: {3}group:{3} local : {3}
but I expect something like,
flatId: {0}group:{0} local : {0}
flatId: {1}group:{0} local : {1}
flatId: {9}group:{2} local : {2}
flatId: {8}group:{1} local : {1}
flatId: {3}group:{0} local : {3}
flatId: {8}group:{2} local : {6}
flatId: {9}group:{1} local : {0}
flatId: {7}group:{0} local : {7}
flatId: {10}group:{1} local : {3}
....
and if I read that scansBuffer, I should get an array like
0, 1, 2, 3, 4, ..., workGroupSize - 1
0, 1, 2, 3, 4, ..., workGroupSize - 1
0, 1, 2, 3, 4, ..., workGroupSize - 1
...
but instead, I get this:
- m_cpuBuffer2 { size=65536 } StlVector<int>
[capacity] 65536 unsigned __int64
+ [allocator] {...} std::_Compressed_pair<StlAllocator<int>,std::_Vector_val<std::_Simple_types<int>>,1>
[0] 0 int
[1] 0 int
[2] 0 int
[3] 0 int
[4] 0 int
[5] 0 int
[6] 0 int
[7] 0 int
[8] 0 int
[9] 1 int
[10] 2 int
[11] 3 int
[12] 0 int
[13] 0 int
[14] 0 int
[15] 0 int
[16] 0 int
[17] 1 int
[18] 2 int
[19] 3 int
[20] 0 int
[21] 0 int
[22] 0 int
[23] 0 int
[24] 0 int
Not even in the ballpark, and I do not see any pattern.
It is almost as if the parallel_for only schedules one work item per work group,
and somehow a nested loop is necessary, but I do not see any example of that in the
book Data Parallel C++, which was written by the same Intel engineers who designed this.
Again, this code works perfectly in CUDA.
One way I can get the above code working is by writing it like this.
template <class T, class ndEvaluateKey, int exponentRadix>
void SyclCountItems(sycl::queue& queue, sycl::buffer<T>& src, sycl::buffer<int>& scansBuffer)
{
ndAssert((1 << exponentRadix) <= D_COUNTING_SORT_LOCAL_BLOCK_SIZE);
queue.submit([&](sycl::handler& handler)
{
ndEvaluateKey evaluator;
int arraySize = src.size();
int workGroupSize = 1 << exponentRadix;
int workGroupCount = (arraySize + workGroupSize - 1) / workGroupSize;
sycl::range<1> workGroupSizeRange(workGroupSize);
sycl::range<1> workGroupCountRange(workGroupCount);
sycl::accessor srcAccessor(src, handler);
sycl::accessor scanAccessor(scansBuffer, handler);
sycl::local_accessor<int, 1> counters(D_COUNTING_SORT_LOCAL_BLOCK_SIZE, handler);
handler.parallel_for_work_group(workGroupCountRange, workGroupSizeRange, [=](sycl::group<1> group)
{
sycl::id<1> groupId = group.get_group_id();
int base = groupId * workGroupSize;
group.parallel_for_work_item([&](sycl::h_item<1> item)
{
sycl::id<1> localId = item.get_local_id();
counters[localId] = 0;
});
group.parallel_for_work_item([&](sycl::h_item<1> item)
{
sycl::id<1> localId = item.get_local_id();
int srcIndex = base + localId;
int scanIndex = evaluator.GetCount(srcAccessor[srcIndex]);
// the increment must be atomic across the whole work group, not just this work item
sycl::atomic_ref<int, sycl::memory_order::relaxed, sycl::memory_scope::work_group, sycl::access::address_space::local_space> atomicIndex(counters[scanIndex]);
atomicIndex++;
});
group.parallel_for_work_item([&](sycl::h_item<1> item)
{
sycl::id<1> localId = item.get_local_id();
scanAccessor[base + localId] = counters[localId];
});
});
});
}
which is rather elegant, but the part that I am dubious about is that the code seems to be scheduling separate kernels.
For example, what happens at the end of a
group.parallel_for_work_item([&](sycl::h_item<1> item) { ... })
call? The code does not seem to allow using barriers.
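For comparison, the shape I was expecting to write is a single nd_range kernel with explicit barriers between the three phases; a minimal sketch (untested, reusing the same accessors, counters and evaluator declared in the command group above) would be:
handler.parallel_for(sycl::nd_range<1>(sycl::range<1>(workGroupCount * workGroupSize), sycl::range<1>(workGroupSize)), [=](sycl::nd_item<1> item)
{
	int localId = item.get_local_id(0);
	int globalId = item.get_global_id(0);

	// phase 1: clear the local histogram
	counters[localId] = 0;
	item.barrier(sycl::access::fence_space::local_space);

	// phase 2: count the keys of this work group into the local histogram
	int scanIndex = evaluator.GetCount(srcAccessor[globalId]);
	sycl::atomic_ref<int, sycl::memory_order::relaxed, sycl::memory_scope::work_group, sycl::access::address_space::local_space> atomicIndex(counters[scanIndex]);
	atomicIndex++;
	item.barrier(sycl::access::fence_space::local_space);

	// phase 3: write the per-group histogram out to global memory
	scanAccessor[globalId] = counters[localId];
});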
Here is some related information about this problem with local (shared) memory and registers.
Imagine you are given a sequence of numbers and you want to compute its prefix sum (scan),
ex:
3, 4, 1, 0, 0, 0, 0, 0
For simplicity the count is a power of two.
The expected result should be:
0, 3, 7, 8, 8, 8, 8, 8
Using CUDA, this code accomplishes that:
template <typename BufferItem, typename SortKeyPredicate>
__global__ void ndCudaCountingSortCountShuffleItemsInternal(const BufferItem* src, BufferItem* dst, unsigned* histogram, unsigned size, SortKeyPredicate GetSortKey, unsigned prefixKeySize)
{
__shared__ unsigned cacheItemCount[D_COUNTING_SORT_BLOCK_SIZE / 2 + D_COUNTING_SORT_BLOCK_SIZE + 1];
const unsigned blockId = blockIdx.x;
const unsigned threadId = threadIdx.x;
// prefixBase (half the work-group size, see the note below) and the load of the
// counts into cacheItemCount are elided from this excerpt
for (int i = 1; i < prefixKeySize; i = i << 1)
{
const unsigned countSum = cacheItemCount[prefixBase + threadId] + cacheItemCount[prefixBase - i + threadId];
__syncthreads();
cacheItemCount[prefixBase + threadId] = countSum;
__syncthreads();
}
}
the intermediate result will be offset by workGroupSize / 2, therefore the result is:
0, 0, 0, 0, 0, 3, 7, 8, 8, 8, 8, 8
I am trying to write the SYCL equivalent, and this is what I got running:
template <class T, class ndEvaluateKey, int exponentRadix>
void SyclMergeBuckects(sycl::queue& queue, sycl::buffer<T>& src, sycl::buffer<T>& dst, sycl::buffer<int>& scansBuffer)
{
queue.submit([&](sycl::handler& handler)
{
ndEvaluateKey evaluator;
int arraySize = src.size();
int workGroupSize = 1 << exponentRadix;
int workGroupCount = (arraySize + workGroupSize - 1) / workGroupSize;
sycl::range<1> workGroupSizeRange(workGroupSize);
sycl::range<1> workGroupCountRange(workGroupCount);
sycl::local_accessor<int, 1> cacheItemCount(D_COUNTING_SORT_LOCAL_BLOCK_SIZE / 2 + D_COUNTING_SORT_LOCAL_BLOCK_SIZE + 1, handler);
//sycl::stream out(1024, 256, handler);
handler.parallel_for_work_group(workGroupCountRange, workGroupSizeRange, [=](sycl::group<1> group)
{
	// per the note above, the intermediate scan is offset by half the work-group size;
	// zeroing of the low half and loading of cacheItemCount are elided from this excerpt
	int prefixBase = workGroupSize / 2;
	for (int i = 1; i < workGroupSize; i = i << 1)
{
group.parallel_for_work_item([&](sycl::h_item<1> item)
{
int localId = item.get_local_id();
int countSum = cacheItemCount[prefixBase + localId] + cacheItemCount[prefixBase - i + localId];
group.mem_fence();
cacheItemCount[prefixBase + localId] = countSum;
group.mem_fence();
});
}
});
});
}
but that produces this sequence:
cacheItemCount {...} const
{0, 0, 0, 0, 0, 0, 3, 7, 11, 15, 22, 30, 0} int[13]
so somehow there is a race condition, and the values read into the countSum registers are invalid.
I did not see any function other than group.mem_fence to make sure the data is all read before the local buffer is overwritten.
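For what it is worth, the pattern maps naturally onto an nd_range kernel with an explicit barrier standing in for __syncthreads(); a minimal sketch (untested, assuming the lower half of cacheItemCount has already been zeroed and the counts loaded, and that the nd_range launch itself behaves) would be:
handler.parallel_for(sycl::nd_range<1>(sycl::range<1>(workGroupCount * workGroupSize), sycl::range<1>(workGroupSize)), [=](sycl::nd_item<1> item)
{
	int localId = item.get_local_id(0);
	int prefixBase = workGroupSize / 2;
	for (int i = 1; i < workGroupSize; i = i << 1)
	{
		// every work item reads its two operands before anyone overwrites them
		int countSum = cacheItemCount[prefixBase + localId] + cacheItemCount[prefixBase - i + localId];
		item.barrier(sycl::access::fence_space::local_space);
		// after the barrier it is safe to store the partial sums
		cacheItemCount[prefixBase + localId] = countSum;
		item.barrier(sycl::access::fence_space::local_space);
	}
});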
Thanks
Hi,
Thank you for posting in Intel Communities.
Could you please provide both (DPC++ and CUDA) complete reproducer codes so that we can investigate this issue more from our end?
Thanks and Regards,
Pendyala Sesha Srinivas
Sure. Here is a very simplified version of the problem, with running code examples.
Here is a SYCL hello-world program:
template <class T, class ndEvaluateKey, int exponentRadix>
void SyclMergeBuckects(sycl::queue& queue, sycl::buffer<T>& src, sycl::buffer<T>& dst, sycl::buffer<int>& scansBuffer)
{
queue.submit([&](sycl::handler& handler)
{
sycl::accessor dstAccessor(dst, handler);
sycl::stream out(4096, 256, handler);
sycl::range<1> range(8);
handler.parallel_for(range, [=](sycl::id<1> id)
{
out << id << sycl::endl;
});
});
}
the function above generates this sequence.
{1}
{5}
{0}
{6}
{4}
{2}
{7}
{3}
The problem is that this simplistic kernel does not allow for data synchronization using a work_group of some application-defined size,
so the same function is now written using an nd_range instead.
template <class T, class ndEvaluateKey, int exponentRadix>
void SyclMergeBuckects(sycl::queue& queue, sycl::buffer<T>& src, sycl::buffer<T>& dst, sycl::buffer<int>& scansBuffer)
{
queue.submit([&](sycl::handler& handler)
{
sycl::accessor dstAccessor(dst, handler);
sycl::stream out(4096, 256, handler);
sycl::range<1> localSize(4);
sycl::range<1> globalSize(2);
handler.parallel_for(sycl::nd_range<1>(globalSize, localSize), [=](sycl::nd_item<1> item)
{
sycl::id<1> id_x = item.get_local_id(0);
sycl::id<1> id_y = item.get_global_id(0);
size_t width = item.get_local_range(0);
size_t index = id_y * width + id_x;
out << index << sycl::endl;
});
});
}
I expected the above kernel to generate a very similar sequence, but instead it produces this:
0
3
It seems it schedules two work groups, which is fine, but each group only has one work item, when it should have 4.
Below is the CUDA equivalent.
__global__ void ndCudaMerge()
{
const unsigned threadId = threadIdx.x;
const unsigned index = threadId + blockDim.x * blockIdx.x;
printf("cuda %d\n", index);
}
//this is called with this
ndCudaMerge<<<2, 4>>>();
cudaDeviceSynchronize();
the cuda kernel generates this sequence
cuda 0
cuda 1
cuda 2
cuda 3
cuda 4
cuda 5
cuda 6
cuda 7
So the question is: how do I read the work-group items from each group?
Well, I am not sure why the code sample does not show up; I have posted it twice already.
Alright, since the code snippet does not show up in the post, I placed it in a file and attached it to this post.
I have spent over two weeks trying to resolve this problem. I have been programming professionally for almost 35 years,
I bought the book on parallel programming, I read the Khronos spec on SYCL, I browsed many of the popular sites, and
no one seems to have this problem, so it is most likely I who am doing something wrong, but for the life of me, I can't determine what.
The test samples I sent are copied almost verbatim from snippets of code in Intel's own documentation.
Any help with this would be greatly appreciated.
Julio Jerez
Thanks.
Sure, here are the samples; they are the same hello-world kernel, its nd_range version, and the CUDA equivalent posted above.
So my question is: how do I write a proper nd_range kernel that allows me to iterate by the specified work-group size?
I tried almost all possible combinations and they all fail.
What am I missing?
Hi,
Please find below the DPC++ code using an nd_range kernel that corresponds to your CUDA code.
#include <CL/sycl.hpp>
using namespace sycl;
void ndCudaMerge(nd_item<3> item1, const stream &out)
{
const unsigned threadId = item1.get_local_id(2);
const unsigned index = threadId + item1.get_local_range(2) * item1.get_group(2);
out << "DPCPP "<< index <<"\n";
}
int main()
{
queue q;
q.submit([&](handler &cgh) {
stream out(64 * 1024, 80, cgh);
cgh.parallel_for(nd_range<3>(range<3>(1, 1, 2) * range<3>(1, 1, 4),range<3>(1, 1, 4)),
[=](nd_item<3> item1) {
ndCudaMerge(item1, out);
});
}).wait();
return 0;
}
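A 1-D version, closer to the kernels earlier in this thread, would presumably look like the following sketch; note that the first range of the nd_range is the total number of work items (2 groups x 4 items) and the second is the work-group size:
#include <CL/sycl.hpp>
using namespace sycl;
int main()
{
	queue q;
	q.submit([&](handler &cgh) {
		stream out(64 * 1024, 80, cgh);
		// global range = 2 groups * 4 work items, local range = 4 work items per group
		cgh.parallel_for(nd_range<1>(range<1>(2 * 4), range<1>(4)),
			[=](nd_item<1> item1) {
				const unsigned threadId = item1.get_local_id(0);
				const unsigned index = threadId + item1.get_local_range(0) * item1.get_group(0);
				out << "DPCPP " << index << "\n";
			});
	}).wait();
	return 0;
}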
Thanks and Regards,
Pendyala Sesha Srinivas
Awesome!!
That worked very nicely.
Thank you very much.
Julio
Hi,
Thanks for accepting the solution. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.
Thanks and Regards,
Pendyala Sesha Srinivas
