<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Re:confused over the different parallel for in Intel® oneAPI DPC++/C++ Compiler</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/confused-over-the-different-parallel-for/m-p/1450497#M2751</link>
    <description>&lt;P&gt;well, I am not sure why the code sample does not show up; I posted it twice already.&lt;/P&gt;</description>
    <pubDate>Thu, 26 Jan 2023 21:31:34 GMT</pubDate>
    <dc:creator>NewtonDynamics</dc:creator>
    <dc:date>2023-01-26T21:31:34Z</dc:date>
    <item>
      <title>confused over the different parallel for</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/confused-over-the-different-parallel-for/m-p/1449727#M2742</link>
      <description>&lt;P&gt;hi, I am trying to write code that I previously wrote for CUDA,&lt;/P&gt;
&lt;P&gt;but so far, I still find the learning curve as hard as it was two years ago.&lt;/P&gt;
&lt;P&gt;This code is a straight counting sort that works perfectly in VS C++ and CUDA.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;here is a very basic function that I can't get to work as expected.&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;template &amp;lt;class T, class ndEvaluateKey, int exponentRadix&amp;gt;
void SyclCountItems(sycl::queue&amp;amp; queue, buffer&amp;lt;T&amp;gt;&amp;amp; src, buffer&amp;lt;int&amp;gt;&amp;amp; scansBuffer)
{
	ndAssert((1 &amp;lt;&amp;lt; exponentRadix) &amp;lt;= D_COUNTING_SORT_LOCAL_BLOCK_SIZE);
	queue.submit([&amp;amp;](sycl::handler&amp;amp; handler)
	{
		//ndEvaluateKey evaluator;
		int arraySize = src.size();
		int workGroupSize = 1 &amp;lt;&amp;lt; exponentRadix;
		int workGroupCount = (arraySize + workGroupSize - 1) / workGroupSize;

		sycl::accessor&amp;lt;T, 1&amp;gt; srcAccessor(src, handler);
		sycl::accessor&amp;lt;int, 1&amp;gt;  scanAccessor(scansBuffer, handler);
		sycl::local_accessor&amp;lt;int, 1&amp;gt; counters(D_COUNTING_SORT_LOCAL_BLOCK_SIZE, handler);

		sycl::stream out(1024, 256, handler);
		handler.parallel_for(nd_range&amp;lt;1&amp;gt;{ {workGroupCount}, {workGroupSize}}, [=](nd_item&amp;lt;1&amp;gt; item)
		{
			id&amp;lt;1&amp;gt; localId = item.get_local_id();
			counters[localId] = localId;
			item.barrier();
			
			id&amp;lt;1&amp;gt; flatId = item.get_global_id();
			scanAccessor[flatId] = counters[localId];

			id&amp;lt;1&amp;gt; globalId = item.get_global_id();
			out &amp;lt;&amp;lt; "flatId: " &amp;lt;&amp;lt; flatId &amp;lt;&amp;lt; "group:" &amp;lt;&amp;lt; item.get_global_id() &amp;lt;&amp;lt; "  local : " &amp;lt;&amp;lt; localId &amp;lt;&amp;lt; sycl::endl;
		});
	});
}
&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;the log produces this:&amp;nbsp;&lt;/P&gt;
&lt;P&gt;flatId: {0}group:{0} local : {0}&lt;BR /&gt;flatId: {1}group:{1} local : {1}&lt;BR /&gt;flatId: {2}group:{2} local : {2}&lt;BR /&gt;flatId: {3}group:{3} local : {3}&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;but I expect&amp;nbsp; something like,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;flatId: {0}group:{0} local : {0}&lt;/P&gt;
&lt;P&gt;flatId: {1}group:{0} local : {1}&lt;/P&gt;
&lt;P&gt;flatId: {9}group:{2} local : {2}&lt;/P&gt;
&lt;P&gt;flatId: {8}group:{1} local : {1}&lt;/P&gt;
&lt;P&gt;flatId: {3}group:{0} local : {3}&lt;/P&gt;
&lt;P&gt;flatId: {8}group:{2} local : {6}&lt;/P&gt;
&lt;P&gt;flatId: {9}group:{1} local : {0}&lt;/P&gt;
&lt;P&gt;flatId: {7}group:{0} local : {7}&lt;/P&gt;
&lt;P&gt;flatId: {10}group:{1} local : {3}&lt;/P&gt;
&lt;P&gt;....&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;and if I read that scansBuffer, I should get an array like&lt;BR /&gt;&lt;BR /&gt;0, 1, 2, 3, 4, .... , workGroupSize - 1&lt;/P&gt;
&lt;P&gt;0, 1, 2, 3, 4, ....&amp;nbsp;, workGroupSize - 1&amp;nbsp;&lt;/P&gt;
&lt;P&gt;0, 1, 2, 3, 4, ....&amp;nbsp;, workGroupSize - 1&lt;/P&gt;
&lt;P&gt;...&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;but instead, I get this:&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;-		m_cpuBuffer2	{ size=65536 }	StlVector&amp;lt;int&amp;gt;
		[capacity]	65536	unsigned __int64
+		[allocator]	{...}	std::_Compressed_pair&amp;lt;StlAllocator&amp;lt;int&amp;gt;,std::_Vector_val&amp;lt;std::_Simple_types&amp;lt;int&amp;gt;&amp;gt;,1&amp;gt;
		[0]	0	int
		[1]	0	int
		[2]	0	int
		[3]	0	int
		[4]	0	int
		[5]	0	int
		[6]	0	int
		[7]	0	int
		[8]	0	int
		[9]	1	int
		[10]	2	int
		[11]	3	int
		[12]	0	int
		[13]	0	int
		[14]	0	int
		[15]	0	int
		[16]	0	int
		[17]	1	int
		[18]	2	int
		[19]	3	int
		[20]	0	int
		[21]	0	int
		[22]	0	int
		[23]	0	int
		[24]	0	int
&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;not even in the ballpark, and I do not see any pattern&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;P&gt;It is almost as if parallel_for only schedules one work item per work group,&lt;/P&gt;
&lt;P&gt;and somehow a nested loop is necessary, but I do not see any such example in the&lt;/P&gt;
&lt;P&gt;book Data Parallel C++, which was written by the same Intel engineers who designed this.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;again, this code works perfectly in CUDA.&lt;/P&gt;</description>
      <pubDate>Tue, 24 Jan 2023 19:48:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/confused-over-the-different-parallel-for/m-p/1449727#M2742</guid>
      <dc:creator>NewtonDynamics</dc:creator>
      <dc:date>2023-01-24T19:48:22Z</dc:date>
    </item>
    <item>
      <title>Re: confused over the different parallel for</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/confused-over-the-different-parallel-for/m-p/1449738#M2743</link>
      <description>&lt;P&gt;one way I can get the above code working is to write it like this.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;template &amp;lt;class T, class ndEvaluateKey, int exponentRadix&amp;gt;
void SyclCountItems(sycl::queue&amp;amp; queue, sycl::buffer&amp;lt;T&amp;gt;&amp;amp; src, sycl::buffer&amp;lt;int&amp;gt;&amp;amp; scansBuffer)
{
	ndAssert((1 &amp;lt;&amp;lt; exponentRadix) &amp;lt;= D_COUNTING_SORT_LOCAL_BLOCK_SIZE);
	queue.submit([&amp;amp;](sycl::handler&amp;amp; handler)
	{
		ndEvaluateKey evaluator;
		int arraySize = src.size();
		int workGroupSize = 1 &amp;lt;&amp;lt; exponentRadix;
		int workGroupCount = (arraySize + workGroupSize - 1) / workGroupSize;
		sycl::range&amp;lt;1&amp;gt; workGroupSizeRange(workGroupSize);
		sycl::range&amp;lt;1&amp;gt; workGroupCountRange(workGroupCount);

		sycl::accessor srcAccessor(src, handler);
		sycl::accessor scanAccessor(scansBuffer, handler);
		sycl::local_accessor&amp;lt;int, 1&amp;gt; counters(D_COUNTING_SORT_LOCAL_BLOCK_SIZE, handler);

		handler.parallel_for_work_group(workGroupCountRange, workGroupSizeRange, [=](sycl::group&amp;lt;1&amp;gt; group)
		{
			sycl::id&amp;lt;1&amp;gt; groupId = group.get_group_id();
			int base = groupId * workGroupSize;
			group.parallel_for_work_item([&amp;amp;](sycl::h_item&amp;lt;1&amp;gt; item)
			{
				sycl::id&amp;lt;1&amp;gt; localId = item.get_local_id();
				counters[localId] = 0;
			});

			group.parallel_for_work_item([&amp;amp;](sycl::h_item&amp;lt;1&amp;gt; item)
			{
				sycl::id&amp;lt;1&amp;gt; localId = item.get_local_id();
				int srcIndex = base + localId;
				int scanIndex = evaluator.GetCount(srcAccessor[srcIndex]);
				sycl::atomic_ref&amp;lt;int, sycl::memory_order::relaxed, sycl::memory_scope::work_item&amp;gt; atomicIndex(counters[scanIndex]);
				atomicIndex++;
			});

			group.parallel_for_work_item([&amp;amp;](sycl::h_item&amp;lt;1&amp;gt; item)
			{
				sycl::id&amp;lt;1&amp;gt; localId = item.get_local_id();
				scanAccessor[base + localId] = counters[localId];
			});
		});

	});
}
&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;which is rather elegant, but&amp;nbsp;the part that I am dubious about is that the code seems to be scheduling separate kernels.&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;for example, what happens at the end of a&amp;nbsp;&lt;/P&gt;
&lt;P&gt;group.parallel_for_work_item([&amp;amp;](sycl::h_item&amp;lt;1&amp;gt; item)&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;the code does not seem to allow using barriers.&lt;/P&gt;
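&lt;P&gt;As far as the SYCL specification describes hierarchical parallelism, each parallel_for_work_item ends with an implicit work-group barrier, so the three phases above should run one after another inside a single kernel rather than as separate kernels. A sequential host version is a handy cross-check for what those phases compute. The sketch below is plain C++, not SYCL; the name count_items_reference is hypothetical, and it assumes every key is already a bucket index in the range 0 to groupSize - 1 (the role evaluator.GetCount plays above).&lt;/P&gt;

```cpp
// Sequential reference (plain C++, hypothetical names) for per-group counting.
// src holds groupCount * groupSize keys, each assumed to lie in 0 .. groupSize - 1.
// scans receives one histogram of groupSize counters per work-group.
void count_items_reference(const int* src, int* scans, int groupCount, int groupSize)
{
    for (int group = 0; group != groupCount; ++group)
    {
        const int base = group * groupSize;

        // Phase 1: clear this group's counters.
        for (int id = 0; id != groupSize; ++id)
        {
            scans[base + id] = 0;
        }

        // Phase 2: one increment per item (an atomic increment on the device).
        for (int id = 0; id != groupSize; ++id)
        {
            scans[base + src[base + id]] += 1;
        }

        // Phase 3 of the kernel copies the local counters to scansBuffer;
        // here the counters already live in scans, so nothing is left to do.
    }
}
```
&lt;P&gt;Comparing the buffer read back from the device against this reference, group by group, shows quickly whether a phase started before the previous one had finished.&lt;/P&gt;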
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 24 Jan 2023 20:24:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/confused-over-the-different-parallel-for/m-p/1449738#M2743</guid>
      <dc:creator>NewtonDynamics</dc:creator>
      <dc:date>2023-01-24T20:24:45Z</dc:date>
    </item>
    <item>
      <title>Re: confused over the different parallel for</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/confused-over-the-different-parallel-for/m-p/1449819#M2744</link>
      <description>&lt;P&gt;here is some more information about this problem, which involves local shared memory and registers.&lt;/P&gt;
&lt;P&gt;imagine you have a sequence of numbers and you want to compute its prefix sum scan&lt;/P&gt;
&lt;P&gt;ex:&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;3, 4, 1, 0, 0, 0, 0, 0&amp;nbsp;&lt;/P&gt;
&lt;P&gt;for simplicity the count is a power of two.&lt;/P&gt;
&lt;P&gt;the expected result should be.&lt;/P&gt;
&lt;P&gt;0, 3, 7, 8, 8, 8, 8, 8&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;using CUDA, this code accomplishes that:&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;template &amp;lt;typename BufferItem, typename SortKeyPredicate&amp;gt;
__global__ void ndCudaCountingSortCountShuffleItemsInternal(const BufferItem* src, BufferItem* dst, unsigned* histogram, unsigned size, SortKeyPredicate GetSortKey, unsigned prefixKeySize)
{
	__shared__  unsigned cacheItemCount[D_COUNTING_SORT_BLOCK_SIZE / 2 + D_COUNTING_SORT_BLOCK_SIZE + 1];

	const unsigned blockId = blockIdx.x;
	const unsigned threadId = threadIdx.x;
	
	for (int i = 1; i &amp;lt; prefixKeySize; i = i &amp;lt;&amp;lt; 1)
	{
		const unsigned countSum = cacheItemCount[prefixBase + threadId] + cacheItemCount[prefixBase - i + threadId];
		__syncthreads();

		cacheItemCount[prefixBase + threadId] = countSum;
		__syncthreads();
	}
}
&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;the intermediate result will be offset by work_group / 2, therefore the result is:&lt;/P&gt;
&lt;P&gt;0, 0, 0, 0, 0, 3, 7, 8, 8, 8, 8, 8&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I am trying to write the SYCL equivalent, and this is what I have running so far:&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;template &amp;lt;class T, class ndEvaluateKey, int exponentRadix&amp;gt;
void SyclMergeBuckects(sycl::queue&amp;amp; queue, sycl::buffer&amp;lt;T&amp;gt;&amp;amp; src, sycl::buffer&amp;lt;T&amp;gt;&amp;amp; dst, sycl::buffer&amp;lt;int&amp;gt;&amp;amp; scansBuffer)
{
	queue.submit([&amp;amp;](sycl::handler&amp;amp; handler)
	{
		ndEvaluateKey evaluator;
		int arraySize = src.size();
		int workGroupSize = 1 &amp;lt;&amp;lt; exponentRadix;
		int workGroupCount = (arraySize + workGroupSize - 1) / workGroupSize;
		sycl::range&amp;lt;1&amp;gt; workGroupSizeRange(workGroupSize);
		sycl::range&amp;lt;1&amp;gt; workGroupCountRange(workGroupCount);
		sycl::local_accessor&amp;lt;int, 1&amp;gt; cacheItemCount(D_COUNTING_SORT_LOCAL_BLOCK_SIZE / 2 + D_COUNTING_SORT_LOCAL_BLOCK_SIZE + 1, handler);

		//sycl::stream out(1024, 256, handler);
		handler.parallel_for_work_group(workGroupCountRange, workGroupSizeRange, [=](sycl::group&amp;lt;1&amp;gt; group)
		{
			for (int i = 1; i &amp;lt; workGroupSize; i = i &amp;lt;&amp;lt; 1)
			{
				group.parallel_for_work_item([&amp;amp;](sycl::h_item&amp;lt;1&amp;gt; item)
				{
					int localId = item.get_local_id();
					int countSum = cacheItemCount[prefixBase + localId] + cacheItemCount[prefixBase - i + localId];
					group.mem_fence();

					cacheItemCount[prefixBase + localId] = countSum;
					group.mem_fence();
				});
			}
		});
	});
}
&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;but that produces this sequence:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;cacheItemCount	{...}	const 
{0, 0, 0, 0, 0, 0, 3, 7, 11, 15, 22, 30, 0}	int[13]
&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;so somehow there is a race condition, and the values held in registers (countSum) are invalid.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I did not see any function other than group.mem_fence to make sure the data is all read before the local buffer is overwritten.&lt;/P&gt;
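&lt;P&gt;For reference, the read-then-write ordering that the two __syncthreads() calls enforce in the CUDA loop can be emulated on the host. The sketch below is plain C++, not SYCL; the name emulated_scan and the 64-entry temporary are assumptions made only for illustration. Each pass reads every sum into a temporary before any slot is overwritten, which is what a work-group barrier between the read and the write would have to guarantee on the device. For the input 3, 4, 1, 0, 0, 0, 0, 0 it yields the inclusive sums 3, 7, 8, 8, 8, 8, 8, 8, which is the expected sequence quoted earlier shifted by one slot.&lt;/P&gt;

```cpp
// Host-side emulation (plain C++, not SYCL or CUDA) of the scan loop above.
// cache layout: count / 2 zeros of padding, followed by count input values.
// count must be a power of two no larger than 64.
void emulated_scan(int* cache, int count)
{
    const int prefixBase = count / 2; // first slot after the zero padding
    int registers[64];                // stands in for each work-item's countSum

    for (int i = 1; i != count; i *= 2)
    {
        // Read phase: every work-item computes its sum from the old values.
        for (int id = 0; id != count; ++id)
        {
            registers[id] = cache[prefixBase + id] + cache[prefixBase - i + id];
        }
        // On the device a work-group barrier belongs here, so that no
        // work-item overwrites a slot another one still has to read.

        // Write phase: only now are the sums stored back.
        for (int id = 0; id != count; ++id)
        {
            cache[prefixBase + id] = registers[id];
        }
        // A second barrier belongs here before the next pass reads again.
    }
}
```
&lt;P&gt;If the device result differs from this emulation, as the sequence shown above does, some work-item most likely wrote its slot before a neighbour had finished reading it.&lt;/P&gt;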
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 25 Jan 2023 00:12:20 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/confused-over-the-different-parallel-for/m-p/1449819#M2744</guid>
      <dc:creator>NewtonDynamics</dc:creator>
      <dc:date>2023-01-25T00:12:20Z</dc:date>
    </item>
    <item>
      <title>Re:confused over the different parallel for</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/confused-over-the-different-parallel-for/m-p/1449953#M2747</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thank you for posting in Intel Communities.&lt;/P&gt;&lt;P&gt;Could you please provide both (DPC++ and CUDA) complete reproducer codes so that we can investigate this issue more from our end?&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks and Regards,&lt;/P&gt;&lt;P&gt;Pendyala Sesha Srinivas&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Wed, 25 Jan 2023 12:05:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/confused-over-the-different-parallel-for/m-p/1449953#M2747</guid>
      <dc:creator>SeshaP_Intel</dc:creator>
      <dc:date>2023-01-25T12:05:16Z</dc:date>
    </item>
    <item>
      <title>Re: Re:confused over the different parallel for</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/confused-over-the-different-parallel-for/m-p/1450447#M2749</link>
      <description>&lt;P&gt;sure. Here is a very simplified version of the problem with running code examples.&lt;/P&gt;
&lt;P&gt;here is a sycl hello world program&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;template &amp;lt;class T, class ndEvaluateKey, int exponentRadix&amp;gt;
void SyclMergeBuckects(sycl::queue&amp;amp; queue, sycl::buffer&amp;lt;T&amp;gt;&amp;amp; src, sycl::buffer&amp;lt;T&amp;gt;&amp;amp; dst, sycl::buffer&amp;lt;int&amp;gt;&amp;amp; scansBuffer)
{
	queue.submit([&amp;amp;](sycl::handler&amp;amp; handler)
	{
		sycl::accessor dstAccessor(dst, handler);
		sycl::stream out(4096, 256, handler);

		sycl::range&amp;lt;1&amp;gt; range(8);
		handler.parallel_for(range, [=](sycl::id&amp;lt;1&amp;gt; id)
		{
			out &amp;lt;&amp;lt; id &amp;lt;&amp;lt; sycl::endl;
		});
	});
}&lt;/LI-CODE&gt;
&lt;P&gt;the function above generates this sequence.&lt;/P&gt;
&lt;P&gt;{1}&lt;BR /&gt;{5}&lt;BR /&gt;{0}&lt;BR /&gt;{6}&lt;BR /&gt;{4}&lt;BR /&gt;{2}&lt;BR /&gt;{7}&lt;BR /&gt;{3}&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;the problem is that this simple kernel does not allow for data synchronization using a work_group of some application-defined size,&amp;nbsp;&lt;/P&gt;
&lt;P&gt;so the same function is now written using nd_range instead.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;template &amp;lt;class T, class ndEvaluateKey, int exponentRadix&amp;gt;
void SyclMergeBuckects(sycl::queue&amp;amp; queue, sycl::buffer&amp;lt;T&amp;gt;&amp;amp; src, sycl::buffer&amp;lt;T&amp;gt;&amp;amp; dst, sycl::buffer&amp;lt;int&amp;gt;&amp;amp; scansBuffer)
{
	queue.submit([&amp;amp;](sycl::handler&amp;amp; handler)
	{
		sycl::accessor dstAccessor(dst, handler);

		sycl::stream out(4096, 256, handler);

		sycl::range&amp;lt;1&amp;gt; localSize(4);
		sycl::range&amp;lt;1&amp;gt; globalSize(2);
		handler.parallel_for(sycl::nd_range&amp;lt;1&amp;gt;(globalSize, localSize), [=](sycl::nd_item&amp;lt;1&amp;gt; item)
		{
			sycl::id&amp;lt;1&amp;gt; id_x = item.get_local_id(0);
			sycl::id&amp;lt;1&amp;gt; id_y = item.get_global_id(0);
			size_t width = item.get_local_range(0);
			size_t index = id_y * width + id_x;
			out &amp;lt;&amp;lt; index &amp;lt;&amp;lt; sycl::endl;
		});
	});
}
&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I expect the above kernel to generate a very similar sequence, but instead it produces this:&amp;nbsp;&lt;/P&gt;
&lt;P&gt;0&lt;BR /&gt;3&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;it seems to schedule two work groups, which is fine, but each group only has one id, when it should have 4.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;below is the cuda equivalent.&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;__global__ void ndCudaMerge()
{
	const unsigned threadId = threadIdx.x;
	const unsigned index = threadId + blockDim.x * blockIdx.x;
	printf("cuda %d\n", index);
}

//this is called with this 

	ndCudaMerge&amp;lt;&amp;lt;&amp;lt;2, 4&amp;gt;&amp;gt;&amp;gt;();
	cudaDeviceSynchronize();&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;the&amp;nbsp; cuda kernel generates this sequence&lt;/P&gt;
&lt;P&gt;cuda 0&lt;BR /&gt;cuda 1&lt;BR /&gt;cuda 2&lt;BR /&gt;cuda 3&lt;BR /&gt;cuda 4&lt;BR /&gt;cuda 5&lt;BR /&gt;cuda 6&lt;BR /&gt;cuda 7&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;so the question is: how do I read the work_group items from each group?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 26 Jan 2023 18:50:44 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/confused-over-the-different-parallel-for/m-p/1450447#M2749</guid>
      <dc:creator>NewtonDynamics</dc:creator>
      <dc:date>2023-01-26T18:50:44Z</dc:date>
    </item>
    <item>
      <title>Re: confused over the different parallel for</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/confused-over-the-different-parallel-for/m-p/1450472#M2750</link>
      <description>&lt;P&gt;sure, here are the samples.&lt;/P&gt;
&lt;P&gt;this is a hello world kind of kernel that I need to convert to an nd_range kernel.&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;template &amp;lt;class T, class ndEvaluateKey, int exponentRadix&amp;gt;
void SyclMergeBuckects(sycl::queue&amp;amp; queue, sycl::buffer&amp;lt;T&amp;gt;&amp;amp; src, sycl::buffer&amp;lt;T&amp;gt;&amp;amp; dst, sycl::buffer&amp;lt;int&amp;gt;&amp;amp; scansBuffer)
{
	queue.submit([&amp;amp;](sycl::handler&amp;amp; handler)
	{
		sycl::accessor dstAccessor(dst, handler);
		sycl::stream out(4096, 256, handler);

		sycl::range&amp;lt;1&amp;gt; range(8);
		handler.parallel_for(range, [=](sycl::id&amp;lt;1&amp;gt; id)
		{
			out &amp;lt;&amp;lt; id &amp;lt;&amp;lt; sycl::endl;
		});
	});
}
&lt;/LI-CODE&gt;
&lt;P&gt;that kernel produces the following sequence&lt;/P&gt;
&lt;P&gt;{1}&lt;BR /&gt;{5}&lt;BR /&gt;{0}&lt;BR /&gt;{6}&lt;BR /&gt;{4}&lt;BR /&gt;{2}&lt;BR /&gt;{7}&lt;BR /&gt;{3}&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;the nd_range version is this:&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;template &amp;lt;class T, class ndEvaluateKey, int exponentRadix&amp;gt;
void SyclMergeBuckects(sycl::queue&amp;amp; queue, sycl::buffer&amp;lt;T&amp;gt;&amp;amp; src, sycl::buffer&amp;lt;T&amp;gt;&amp;amp; dst, sycl::buffer&amp;lt;int&amp;gt;&amp;amp; scansBuffer)
{
	queue.submit([&amp;amp;](sycl::handler&amp;amp; handler)
	{
		sycl::accessor dstAccessor(dst, handler);

		sycl::stream out(4096, 256, handler);

		sycl::range&amp;lt;1&amp;gt; localSize(4);
		sycl::range&amp;lt;1&amp;gt; globalSize(2);
		handler.parallel_for(sycl::nd_range&amp;lt;1&amp;gt;(globalSize, localSize), [=](sycl::nd_item&amp;lt;1&amp;gt; item)
		{
			sycl::id&amp;lt;1&amp;gt; id_x = item.get_local_id(0);
			sycl::id&amp;lt;1&amp;gt; id_y = item.get_global_id(0);
			size_t width = item.get_local_range(0);
			size_t index = id_y * width + id_x;
			out &amp;lt;&amp;lt; index &amp;lt;&amp;lt; sycl::endl;
		});
	});
}&lt;/LI-CODE&gt;
&lt;P&gt;but that kernel produces this sequence.&lt;/P&gt;
&lt;P&gt;0&lt;BR /&gt;3&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;here is the working Cuda equivalent.&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;__global__ void ndCudaMerge()
{
	const unsigned threadId = threadIdx.x;
	const unsigned index = threadId + blockDim.x * blockIdx.x;
	printf("cuda %d\n", index);
}

//this is called with this 

	ndCudaMerge&amp;lt;&amp;lt;&amp;lt;2, 4&amp;gt;&amp;gt;&amp;gt;();
	cudaDeviceSynchronize();&lt;/LI-CODE&gt;
&lt;P&gt;which generates:&lt;/P&gt;
&lt;P&gt;cuda 0&lt;BR /&gt;cuda 1&lt;BR /&gt;cuda 2&lt;BR /&gt;cuda 3&lt;BR /&gt;cuda 4&lt;BR /&gt;cuda 5&lt;BR /&gt;cuda 6&lt;BR /&gt;cuda 7&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;so my question is: how do I write a proper nd_range kernel that allows me to iterate by the specified work_group size?&lt;/P&gt;
&lt;P&gt;I tried almost all possible combinations and they all fail.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;what am I missing?&lt;/P&gt;</description>
      <pubDate>Thu, 26 Jan 2023 20:20:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/confused-over-the-different-parallel-for/m-p/1450472#M2750</guid>
      <dc:creator>NewtonDynamics</dc:creator>
      <dc:date>2023-01-26T20:20:42Z</dc:date>
    </item>
    <item>
      <title>Re: Re:confused over the different parallel for</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/confused-over-the-different-parallel-for/m-p/1450497#M2751</link>
      <description>&lt;P&gt;well, I am not sure why the code sample does not show up; I posted it twice already.&lt;/P&gt;</description>
      <pubDate>Thu, 26 Jan 2023 21:31:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/confused-over-the-different-parallel-for/m-p/1450497#M2751</guid>
      <dc:creator>NewtonDynamics</dc:creator>
      <dc:date>2023-01-26T21:31:34Z</dc:date>
    </item>
    <item>
      <title>Re: Re:confused over the different parallel for</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/confused-over-the-different-parallel-for/m-p/1450518#M2752</link>
      <description>&lt;P&gt;Alright, since the code snippet does not show in the post, I placed it in a file and attached it to this post.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I have spent over two weeks trying to resolve this problem. I have been programming professionally for almost 35 years,&lt;/P&gt;
&lt;P&gt;I bought the book on parallel programming, I read the Khronos spec on SYCL, I browsed many of the popular sites, and&lt;/P&gt;
&lt;P&gt;no one seems to have this problem, so it must be I who is doing something wrong, but for the life of me, I can't determine what.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The test samples I sent are copied almost verbatim from snippets of code in Intel's own documentation.&lt;/P&gt;
&lt;P&gt;&lt;A href="https://www.intel.com/content/www/us/en/developer/articles/training/programming-data-parallel-c.html#gs.oaeyft" target="_self"&gt;Intel docs&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;any help with this would be greatly appreciated.&lt;/P&gt;
&lt;P&gt;Julio Jerez&lt;/P&gt;
&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Thu, 26 Jan 2023 22:56:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/confused-over-the-different-parallel-for/m-p/1450518#M2752</guid>
      <dc:creator>NewtonDynamics</dc:creator>
      <dc:date>2023-01-26T22:56:57Z</dc:date>
    </item>
    <item>
      <title>Re: confused over the different parallel for</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/confused-over-the-different-parallel-for/m-p/1452155#M2756</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Please find below the DPC++ code, written as an nd_range kernel, that corresponds to your CUDA code.&lt;/P&gt;
&lt;LI-CODE lang="cpp"&gt;#include &amp;lt;CL/sycl.hpp&amp;gt;
using namespace sycl;

void ndCudaMerge(nd_item&amp;lt;3&amp;gt; item1, const stream &amp;amp;out)
{
    const unsigned threadId = item1.get_local_id(2);
    const unsigned index = threadId + item1.get_local_range(2) * item1.get_group(2);

    out &amp;lt;&amp;lt; "DPCPP "&amp;lt;&amp;lt; index &amp;lt;&amp;lt;"\n";
}

int main() 
{
    queue q;
    q.submit([&amp;amp;](handler &amp;amp;cgh) {
        stream out(64 * 1024, 80, cgh);
        cgh.parallel_for(nd_range&amp;lt;3&amp;gt;(range&amp;lt;3&amp;gt;(1, 1, 2) * range&amp;lt;3&amp;gt;(1, 1, 4),range&amp;lt;3&amp;gt;(1, 1, 4)),
                         [=](nd_item&amp;lt;3&amp;gt; item1) {
                             ndCudaMerge(item1, out); 
                         });
    }).wait();

    return 0;
}&lt;/LI-CODE&gt;
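&lt;P&gt;Note that in this version the first nd_range argument is the total number of work-items (2 groups of 4, so 8) and the second is the work-group size; the global range is expected to be a multiple of the local range. The relation between the ids can be sketched on the host in plain C++. The helper names below are hypothetical and are not part of the SYCL API:&lt;/P&gt;

```cpp
// Hypothetical host-side helpers (not SYCL API) mirroring 1-D nd_range arithmetic.

// Total number of work-items to pass as the first nd_range argument,
// the counterpart of gridDim.x * blockDim.x in CUDA.
int global_range(int groupCount, int localSize)
{
    return groupCount * localSize;
}

// Work-group a work-item belongs to, matching what get_group reports.
int group_of(int globalId, int localSize)
{
    return globalId / localSize;
}

// Position inside the work-group, matching what get_local_id reports.
int local_id_of(int globalId, int localSize)
{
    return globalId % localSize;
}
```
&lt;P&gt;With 2 groups of 4, global_range gives 8, and for every global id the product of its group and the local size plus its local id reproduces that global id, which is the index the CUDA kernel prints.&lt;/P&gt;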
&lt;P&gt;Thanks and Regards,&lt;/P&gt;
&lt;P&gt;Pendyala Sesha Srinivas&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 01 Feb 2023 10:44:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/confused-over-the-different-parallel-for/m-p/1452155#M2756</guid>
      <dc:creator>SeshaP_Intel</dc:creator>
      <dc:date>2023-02-01T10:44:13Z</dc:date>
    </item>
    <item>
      <title>Re: confused over the different parallel for</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/confused-over-the-different-parallel-for/m-p/1452382#M2759</link>
      <description>&lt;P&gt;awesome !!&lt;/P&gt;
&lt;P&gt;that worked very nicely.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;thank you very much.&lt;/P&gt;
&lt;P&gt;Julio&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 01 Feb 2023 22:06:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/confused-over-the-different-parallel-for/m-p/1452382#M2759</guid>
      <dc:creator>NewtonDynamics</dc:creator>
      <dc:date>2023-02-01T22:06:22Z</dc:date>
    </item>
    <item>
      <title>Re:confused over the different parallel for</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/confused-over-the-different-parallel-for/m-p/1452530#M2762</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks for accepting the solution. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.&lt;/P&gt;&lt;P&gt;&lt;BR /&gt;&lt;/P&gt;&lt;P&gt;Thanks and Regards,&lt;/P&gt;&lt;P&gt;Pendyala Sesha Srinivas&lt;/P&gt;&lt;BR /&gt;</description>
      <pubDate>Thu, 02 Feb 2023 07:13:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-DPC-C-Compiler/confused-over-the-different-parallel-for/m-p/1452530#M2762</guid>
      <dc:creator>SeshaP_Intel</dc:creator>
      <dc:date>2023-02-02T07:13:27Z</dc:date>
    </item>
  </channel>
</rss>

