Hi, I am trying to port code that I previously wrote for CUDA, but so far I find the learning curve as hard as it was two years ago.
The code is a straightforward counting sort that works perfectly in VS C++ and CUDA.
Here is a very basic function that I can't get to work as expected.
template <class T, class ndEvaluateKey, int exponentRadix>
void SyclCountItems(sycl::queue& queue, sycl::buffer<T>& src, sycl::buffer<int>& scansBuffer)
{
ndAssert((1 << exponentRadix) <= D_COUNTING_SORT_LOCAL_BLOCK_SIZE);
queue.submit([&](sycl::handler& handler)
{
//ndEvaluateKey evaluator;
int arraySize = src.size();
int workGroupSize = 1 << exponentRadix;
int workGroupCount = (arraySize + workGroupSize - 1) / workGroupSize;
sycl::accessor<T, 1> srcAccessor(src, handler);
sycl::accessor<int, 1> scanAccessor(scansBuffer, handler);
sycl::local_accessor<int, 1> counters(D_COUNTING_SORT_LOCAL_BLOCK_SIZE, handler);
sycl::stream out(1024, 256, handler);
handler.parallel_for(sycl::nd_range<1>{ {workGroupCount}, {workGroupSize} }, [=](sycl::nd_item<1> item)
{
id<1> localId = item.get_local_id();
counters[localId] = localId;
item.barrier();
id<1> flatId = item.get_global_id();
scanAccessor[flatId] = counters[localId];
out << "flatId: " << flatId << "group:" << item.get_global_id() << " local : " << localId << sycl::endl;
});
});
}
the log produces this:
flatId: {0}group:{0} local : {0}
flatId: {1}group:{1} local : {1}
flatId: {2}group:{2} local : {2}
flatId: {3}group:{3} local : {3}
but I expect something like,
flatId: {0}group:{0} local : {0}
flatId: {1}group:{0} local : {1}
flatId: {9}group:{2} local : {2}
flatId: {8}group:{1} local : {1}
flatId: {3}group:{0} local : {3}
flatId: {8}group:{2} local : {6}
flatId: {9}group:{1} local : {0}
flatId: {7}group:{0} local : {7}
flatId: {10}group:{1} local : {3}
....
and if I read that scansBuffer, I should get an array like
0, 1, 2, 3, 4, ..., workGroupSize - 1
0, 1, 2, 3, 4, ..., workGroupSize - 1
0, 1, 2, 3, 4, ..., workGroupSize - 1
...
but instead, I get this:
- m_cpuBuffer2 { size=65536 } StlVector<int>
[capacity] 65536 unsigned __int64
+ [allocator] {...} std::_Compressed_pair<StlAllocator<int>,std::_Vector_val<std::_Simple_types<int>>,1>
[0] 0 int
[1] 0 int
[2] 0 int
[3] 0 int
[4] 0 int
[5] 0 int
[6] 0 int
[7] 0 int
[8] 0 int
[9] 1 int
[10] 2 int
[11] 3 int
[12] 0 int
[13] 0 int
[14] 0 int
[15] 0 int
[16] 0 int
[17] 1 int
[18] 2 int
[19] 3 int
[20] 0 int
[21] 0 int
[22] 0 int
[23] 0 int
[24] 0 int
Not even in the ballpark, and I do not see any pattern.
It is almost as if the parallel_for only schedules one work item per work group,
and somehow a nested loop is necessary, but I do not see any example of that in the
book Data Parallel C++, which was written by the same Intel engineers who designed this.
Again, this code works perfectly in CUDA.
One way I can get the above code working is by writing it like this.
template <class T, class ndEvaluateKey, int exponentRadix>
void SyclCountItems(sycl::queue& queue, sycl::buffer<T>& src, sycl::buffer<int>& scansBuffer)
{
ndAssert((1 << exponentRadix) <= D_COUNTING_SORT_LOCAL_BLOCK_SIZE);
queue.submit([&](sycl::handler& handler)
{
ndEvaluateKey evaluator;
int arraySize = src.size();
int workGroupSize = 1 << exponentRadix;
int workGroupCount = (arraySize + workGroupSize - 1) / workGroupSize;
sycl::range<1> workGroupSizeRange(workGroupSize);
sycl::range<1> workGroupCountRange(workGroupCount);
sycl::accessor srcAccessor(src, handler);
sycl::accessor scanAccessor(scansBuffer, handler);
sycl::local_accessor<int, 1> counters(D_COUNTING_SORT_LOCAL_BLOCK_SIZE, handler);
handler.parallel_for_work_group(workGroupCountRange, workGroupSizeRange, [=](sycl::group<1> group)
{
sycl::id<1> groupId = group.get_group_id();
int base = groupId * workGroupSize;
group.parallel_for_work_item([&](sycl::h_item<1> item)
{
sycl::id<1> localId = item.get_local_id();
counters[localId] = 0;
});
group.parallel_for_work_item([&](sycl::h_item<1> item)
{
sycl::id<1> localId = item.get_local_id();
int srcIndex = base + localId;
int scanIndex = evaluator.GetCount(srcAccessor[srcIndex]);
// the increment must be atomic across the whole work group, not just this work item
sycl::atomic_ref<int, sycl::memory_order::relaxed, sycl::memory_scope::work_group, sycl::access::address_space::local_space> atomicIndex(counters[scanIndex]);
atomicIndex++;
});
group.parallel_for_work_item([&](sycl::h_item<1> item)
{
sycl::id<1> localId = item.get_local_id();
scanAccessor[base + localId] = counters[localId];
});
});
});
}
which is rather elegant, but the part that I am dubious about is that the code seems to be scheduling separate kernels.
For example, what happens at the end of a
group.parallel_for_work_item([&](sycl::h_item<1> item) { ... })
call? The code does not seem to allow using barriers.
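For comparison, the shape I was expecting to write is a single nd_range kernel with explicit barriers between the three phases; a minimal sketch (untested, reusing the same accessors, counters and evaluator declared in the command group above) would be:
handler.parallel_for(sycl::nd_range<1>(sycl::range<1>(workGroupCount * workGroupSize), sycl::range<1>(workGroupSize)), [=](sycl::nd_item<1> item)
{
	int localId = item.get_local_id(0);
	int globalId = item.get_global_id(0);

	// phase 1: clear the local histogram
	counters[localId] = 0;
	item.barrier(sycl::access::fence_space::local_space);

	// phase 2: count the keys of this work group into the local histogram
	int scanIndex = evaluator.GetCount(srcAccessor[globalId]);
	sycl::atomic_ref<int, sycl::memory_order::relaxed, sycl::memory_scope::work_group, sycl::access::address_space::local_space> atomicIndex(counters[scanIndex]);
	atomicIndex++;
	item.barrier(sycl::access::fence_space::local_space);

	// phase 3: write the per-group histogram out to global memory
	scanAccessor[globalId] = counters[localId];
});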
Here is some related information about this problem with local (shared) memory and registers.
Imagine you are given a sequence of numbers and you want to compute its prefix sum (scan),
ex:
3, 4, 1, 0, 0, 0, 0, 0
For simplicity the count is a power of two.
The expected result should be:
0, 3, 7, 8, 8, 8, 8, 8
Using CUDA, this code accomplishes that:
template <typename BufferItem, typename SortKeyPredicate>
__global__ void ndCudaCountingSortCountShuffleItemsInternal(const BufferItem* src, BufferItem* dst, unsigned* histogram, unsigned size, SortKeyPredicate GetSortKey, unsigned prefixKeySize)
{
__shared__ unsigned cacheItemCount[D_COUNTING_SORT_BLOCK_SIZE / 2 + D_COUNTING_SORT_BLOCK_SIZE + 1];
const unsigned blockId = blockIdx.x;
const unsigned threadId = threadIdx.x;
// prefixBase (half the work-group size, see the note below) and the load of the
// counts into cacheItemCount are elided from this excerpt
for (int i = 1; i < prefixKeySize; i = i << 1)
{
const unsigned countSum = cacheItemCount[prefixBase + threadId] + cacheItemCount[prefixBase - i + threadId];
__syncthreads();
cacheItemCount[prefixBase + threadId] = countSum;
__syncthreads();
}
}
the intermediate result will be offset by workGroupSize / 2, therefore the result is:
0, 0, 0, 0, 0, 3, 7, 8, 8, 8, 8, 8
I am trying to write the SYCL equivalent, and this is what I got running:
template <class T, class ndEvaluateKey, int exponentRadix>
void SyclMergeBuckects(sycl::queue& queue, sycl::buffer<T>& src, sycl::buffer<T>& dst, sycl::buffer<int>& scansBuffer)
{
queue.submit([&](sycl::handler& handler)
{
ndEvaluateKey evaluator;
int arraySize = src.size();
int workGroupSize = 1 << exponentRadix;
int workGroupCount = (arraySize + workGroupSize - 1) / workGroupSize;
sycl::range<1> workGroupSizeRange(workGroupSize);
sycl::range<1> workGroupCountRange(workGroupCount);
sycl::local_accessor<int, 1> cacheItemCount(D_COUNTING_SORT_LOCAL_BLOCK_SIZE / 2 + D_COUNTING_SORT_LOCAL_BLOCK_SIZE + 1, handler);
//sycl::stream out(1024, 256, handler);
handler.parallel_for_work_group(workGroupCountRange, workGroupSizeRange, [=](sycl::group<1> group)
{
	// per the note above, the intermediate scan is offset by half the work-group size;
	// zeroing of the low half and loading of cacheItemCount are elided from this excerpt
	int prefixBase = workGroupSize / 2;
	for (int i = 1; i < workGroupSize; i = i << 1)
{
group.parallel_for_work_item([&](sycl::h_item<1> item)
{
int localId = item.get_local_id();
int countSum = cacheItemCount[prefixBase + localId] + cacheItemCount[prefixBase - i + localId];
group.mem_fence();
cacheItemCount[prefixBase + localId] = countSum;
group.mem_fence();
});
}
});
});
}
but that produces this sequence:
cacheItemCount {...} const
{0, 0, 0, 0, 0, 0, 3, 7, 11, 15, 22, 30, 0} int[13]
so somehow there is a race condition, and the values read into the countSum registers are invalid.
I did not see any function other than group.mem_fence to make sure the data is all read before the local buffer is overwritten.
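For what it is worth, the pattern maps naturally onto an nd_range kernel with an explicit barrier standing in for __syncthreads(); a minimal sketch (untested, assuming the lower half of cacheItemCount has already been zeroed and the counts loaded, and that the nd_range launch itself behaves) would be:
handler.parallel_for(sycl::nd_range<1>(sycl::range<1>(workGroupCount * workGroupSize), sycl::range<1>(workGroupSize)), [=](sycl::nd_item<1> item)
{
	int localId = item.get_local_id(0);
	int prefixBase = workGroupSize / 2;
	for (int i = 1; i < workGroupSize; i = i << 1)
	{
		// every work item reads its two operands before anyone overwrites them
		int countSum = cacheItemCount[prefixBase + localId] + cacheItemCount[prefixBase - i + localId];
		item.barrier(sycl::access::fence_space::local_space);
		// after the barrier it is safe to store the partial sums
		cacheItemCount[prefixBase + localId] = countSum;
		item.barrier(sycl::access::fence_space::local_space);
	}
});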
Thanks
Hi,
Thank you for posting in Intel Communities.
Could you please provide both (DPC++ and CUDA) complete reproducer codes so that we can investigate this issue more from our end?
Thanks and Regards,
Pendyala Sesha Srinivas
Sure. Here is a very simplified version of the problem, with running code examples.
Here is a SYCL hello-world program:
template <class T, class ndEvaluateKey, int exponentRadix>
void SyclMergeBuckects(sycl::queue& queue, sycl::buffer<T>& src, sycl::buffer<T>& dst, sycl::buffer<int>& scansBuffer)
{
queue.submit([&](sycl::handler& handler)
{
sycl::accessor dstAccessor(dst, handler);
sycl::stream out(4096, 256, handler);
sycl::range<1> range(8);
handler.parallel_for(range, [=](sycl::id<1> id)
{
out << id << sycl::endl;
});
});
}
the function above generates this sequence.
{1}
{5}
{0}
{6}
{4}
{2}
{7}
{3}
The problem is that this simplistic kernel does not allow for data synchronization using a work_group of some application-defined size,
so the same function is now written using an nd_range instead.
template <class T, class ndEvaluateKey, int exponentRadix>
void SyclMergeBuckects(sycl::queue& queue, sycl::buffer<T>& src, sycl::buffer<T>& dst, sycl::buffer<int>& scansBuffer)
{
queue.submit([&](sycl::handler& handler)
{
sycl::accessor dstAccessor(dst, handler);
sycl::stream out(4096, 256, handler);
sycl::range<1> localSize(4);
sycl::range<1> globalSize(2);
handler.parallel_for(sycl::nd_range<1>(globalSize, localSize), [=](sycl::nd_item<1> item)
{
sycl::id<1> id_x = item.get_local_id(0);
sycl::id<1> id_y = item.get_global_id(0);
size_t width = item.get_local_range(0);
size_t index = id_y * width + id_x;
out << index << sycl::endl;
});
});
}
I expected the above kernel to generate a very similar sequence, but instead it produces this:
0
3
It seems it schedules two work groups, which is fine, but each group only has one work item, when it should have 4.
Below is the CUDA equivalent.
__global__ void ndCudaMerge()
{
const unsigned threadId = threadIdx.x;
const unsigned index = threadId + blockDim.x * blockIdx.x;
printf("cuda %d\n", index);
}
//this is called with this
ndCudaMerge<<<2, 4>>>();
cudaDeviceSynchronize();
the cuda kernel generates this sequence
cuda 0
cuda 1
cuda 2
cuda 3
cuda 4
cuda 5
cuda 6
cuda 7
So the question is: how do I read the work-group items from each group?
Well, I am not sure why the code sample does not show up; I have posted it twice already.
Alright, since the code snippet does not show up in the post, I placed it in a file and attached it to this post.
I have spent over two weeks trying to resolve this problem. I have been programming professionally for almost 35 years,
I bought the book on parallel programming, I read the Khronos spec on SYCL, I browsed many of the popular sites, and
no one seems to have this problem, so it is most likely I who am doing something wrong, but for the life of me, I can't determine what.
The test samples I sent are copied almost verbatim from snippets of code in Intel's own documentation.
Any help with this would be greatly appreciated.
Julio Jerez
Thanks.
Sure, here are the samples; they are the same hello-world kernel, its nd_range version, and the CUDA equivalent posted above.
So my question is: how do I write a proper nd_range kernel that allows me to iterate by the specified work-group size?
I tried almost all possible combinations and they all fail.
What am I missing?
Hi,
Please find below the DPC++ code using an nd_range kernel that corresponds to your CUDA code.
#include <CL/sycl.hpp>
using namespace sycl;
void ndCudaMerge(nd_item<3> item1, const stream &out)
{
const unsigned threadId = item1.get_local_id(2);
const unsigned index = threadId + item1.get_local_range(2) * item1.get_group(2);
out << "DPCPP "<< index <<"\n";
}
int main()
{
queue q;
q.submit([&](handler &cgh) {
stream out(64 * 1024, 80, cgh);
cgh.parallel_for(nd_range<3>(range<3>(1, 1, 2) * range<3>(1, 1, 4),range<3>(1, 1, 4)),
[=](nd_item<3> item1) {
ndCudaMerge(item1, out);
});
}).wait();
return 0;
}
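A 1-D version, closer to the kernels earlier in this thread, would presumably look like the following sketch; note that the first range of the nd_range is the total number of work items (2 groups x 4 items) and the second is the work-group size:
#include <CL/sycl.hpp>
using namespace sycl;
int main()
{
	queue q;
	q.submit([&](handler &cgh) {
		stream out(64 * 1024, 80, cgh);
		// global range = 2 groups * 4 work items, local range = 4 work items per group
		cgh.parallel_for(nd_range<1>(range<1>(2 * 4), range<1>(4)),
			[=](nd_item<1> item1) {
				const unsigned threadId = item1.get_local_id(0);
				const unsigned index = threadId + item1.get_local_range(0) * item1.get_group(0);
				out << "DPCPP " << index << "\n";
			});
	}).wait();
	return 0;
}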
Thanks and Regards,
Pendyala Sesha Srinivas
Awesome!!
That worked very nicely.
Thank you very much.
Julio
Hi,
Thanks for accepting the solution. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.
Thanks and Regards,
Pendyala Sesha Srinivas
