Re: ND-Range kernels vs SingleThread?

SBioo · ‎08-23-2018

Dear All,

I got a question with regard to choosing between ND-Range and Single Thread kernels. Intel FPGA best practices stresses out that it is always preferred to choose Single Thread model kernels than ND-Range. I have already tried several real and synthetic kernels. Also many real applications. I can claim that in 95% of the situations, NDRange kernels perform much faster compared to Single Thread mode. Looking at the compiler report, I can see that single thread modes kernels are scheduled with Initiation Interval of 1. Even the operational frequencies are not that much different. Besides, I also make sure memory access patterns are defined to be able to fully coalesced.

I'm using SDK 16.0 with a Nallatech p385a FPGA card. I'm really hustling to explain the huge performance difference, and even convince myself that what the Intel documentation claims is not totally true.

So my question is, does dynamic interleaving of threads in ND-Range mode is much more powerful compared to static scheduling of single thread kernels? Also, is it ok to claim that kernel with a single level of parallelism are always better to be developed using NDRange mode, and kernels with no parallelism and high degree of dependencies are good for single thread? In addition, is there any scenario that a kernel can be written as both ND-Range and SingleThread, while SingleThread outperforms the other?

I do appreciate if you clarify what are the reasons for my observations.

Best,

Saman

HRZ · ‎08-26-2018

Single Work-item is indeed the "preferred" method but most certainly not "always" preferred. e.g. in kernels with non-pipelineable loops, random memory accesses, or cases where the memory access and compute parts are in separate loops, NDRange kernels are preferred. However, I have to say that with basic optimizations, it is probably easier to get good performance out of NDRange compared to Single Work-item. On the other hand, the NDRange model will never allow maximizing the potential of the FPGA (cannot infer shift registers, cannot resolve dependencies other than by relying on barriers, operating frequency is limited due to Block RAM double-pumping, no user control over the number of simultaneous work-groups, etc.)

Have you checked the report to make sure your memory accesses are actually coalesced at compile-time? You can clearly see in the "System Viewer" tab that the size of the ports to memory get larger when correct coalescing happens. Also note that you MUST use SIMD in NDRange or loop unrolling in Single Work-item to enable compile-time coalescing; without these, no actual parallelism will exist in the design and no memory access coalescing will be performed (there is no run-time coalescing).

Dynamic thread-scheduling of NDRange kernels is preferred over the static scheduling of Single Work-item if the design is not pipelineable, since the former can potentially achieve a lower average initiation interval. Other than that, if it is possible to achieve an II of one in a Single Work-item kernel, I don't see why the NDRange equivalent would be faster at all, let alone "much faster". If you post some of your code examples (both NDRange and Single Work-item), I might be able to tell you why the NDRange is faster and how you can possibly fix the Single Work-item equivalent.

Since this topic requires a lot of discussion and I have already written a whole thesis chapter on this, I will just attach the relevant chapters of my thesis instead of putting everything here. Chapter 3 includes performance model and in-depth discussion on differences between the two programming models and when and why one should be preferred over the other. Chapter 4 includes multiple benchmarks developed and optimized both in NDRange and Single Work-item and compared with respect to performance alongside with discussion as to why the performance differences exist.

SBioo · ‎09-04-2018

Thanks for the answer,

Here is the kind of code I'm compiling and running on the FPGA:

#ifdef INT_PRECISION
 
#define DTYPE int
 
#elif SINGLE_PRECISION
 
#define DTYPE float
 
#elif DOUBLE_PRECISION
 
#pragma OPENCL EXTENSION cl_khr_fp64: enable
 
#define DTYPE double
 
#endif
 
 
#ifdef FPGA_NDRANGE
 
__attribute__((reqd_work_group_size(256, 1, 1)))
 
__attribute__((num_simd_work_items(16)))
 
__attribute__((num_compute_units(NUM_COMPUTE_UNITS)))
 
#endif
 
 
__kernel void S1119 (__global DTYPE* restrict AA,
 
					    __global const DTYPE* restrict BB,
 
                                            const int lllX
#ifdef FPGA_SINGLE
                                           ,const int lllY)
#else
 )
#endif
{
#ifdef GPU
	const int gid = get_global_id(0);
	const int size = get_global_size(0);
 
	for (int i = 1; i < lllX; i++) {
  	        AA[i*size+gid] = AA[(i-1)*size+gid] + BB[i*size+gid];
	}
#endif
 
#ifdef FPGA_NDRANGE
	const int gid = get_global_id(0);
	const int size = get_global_size(0);
 
	#pragma unroll UNROLL_FACTOR
	for (int i = 1; i < lllX; i++) {
		AA[i*size+gid] = AA[(i-1)*size+gid] + BB[i*size+gid];
	}
#endif
 
#ifdef FPGA_SINGLE
        for (int i = 1; i < lllX; i++) {
  	     #pragma ivdep
             #pragma unroll UNROLL_FACTOR
  	     for (int j = 0; j < lllY; j++) {
			AA[i*lllY+j] = AA[(i-1)*lllY+j] + BB[i*lllY+j];
	     }
       }
#endif
 
}

As you can see, there is a potential for memory coalescing for both FPGA_NDRANGE and FPGA_SINGLE.

HRZ · ‎09-06-2018

In this example, there are multiple reasons why the Single Work-item kernel is going to be slower than the NDRange one:

1- The outer loop in the Single Work-item kernel is not pipelined at all. Yes, the report says its II is one, but it executes serially over the inner loop which means your Single Work-item kernel is going to be extremely slow unless the trip count of the outer loop is very small and the trip count of the inner loop is very large. The NDRange kernel will achieve much better performance here due to dynamic scheduling at run-time.

2- Even though there is potential for memory access coalescing in the single work-item kernel as you mentioned, based on the report, the compiler is not actually coalescing the accesses. I have seen multiple patterns where the compiler simply refuses to coalesce the accesses even though they are clearly consecutive, this is one of those cases. The problem here is that there are two accesses to the same external buffer in the loop which, if coalesced, could potentially overlap with each other. In such cases, the compiler instead creates multiple 32-bit ports (for float) to memory, resulting in a huge amount of contention on the memory bus and very poor memory throughput. You might be able to get the accesses to coalesce correctly if you perform the memory accesses outside of the compute loop.

3- Since your NDRange kernel uses both SIMD and unrolling, while the Single Work-item kernel uses only unrolling with a factor the same as the NDRange kernel, you will have a 16 times higher degree of parallelism in the NDRange kernel which would give it an edge over the Single Work-item equivalent. However, you should note that the memory accesses in the NDRange kernel are only consecutive in the SIMD direction and the unrolling will result in multiple non-coalesced accesses.

SBioo · ‎09-06-2018

Thanks for the response.

One question: How can I make sure that the outer-loop of the single-thread mode kernel is pipelined every single iteration? How the structure of the code should be changed to serve that purpose?

In addition, how can I change the kernel to make sure the memory access for the single-thread kernel is fully coalesced?

Thanks,

HRZ · ‎09-07-2018

Well, it is not possible to give general guidelines that work for every case. Correct pipelining depends on loop-carried and load/store dependencies (either on local or global memory buffers). You should pay attention to the loop analysis part of the report and then try to reconstruct your code to resolve the dependency. However, dependencies are not necessarily always resolvable. One very useful approach to handling both of these types of dependencies is loop blocking. With blocking, since the bound of the inner loop become compile-time constant, the compiler can perform extra optimizations to handle loop-carried dependencies. In your current code example, the dependency is to the previous row (assume "i" is looping over rows) and hence, the dependency distance is relatively large. You can avoid the dependency by blocking the loop on "i" and using a large block size. In this case the compiler will analyze the dependency distance since the trip count of the blocked loop is known and insert enough stages into the pipeline to avoid the dependency and allow correct pipelining. Of course in this case you will still have an unpipelineable loop over the blocks but that loop will have a very small trip count and its negative effect will be minimized. Fixing your code example requires some work. You can take a look at the transformation I have performed from v1 to v5 of this benchmark as an example of how to resolve such dependencies:

https://github.com/zohourih/rodinia_fpga/tree/master/opencl/pathfinder

Though that code is quite a bit more complex than your example.

Regarding coalescing, mostly you just need to make sure the accesses are consecutive over the SIMD/unroll direction. However, as seen in your example, that is not always enough. I have encountered multiple cases where I could simply not get the compiler to coalesce accesses that were obviously coalesceable. However, in your case, I made a quick transformation that seems to at least allow correct coalescing:

#define UNROLL_FACTOR 16
 
__kernel void SWI(__global DTYPE* restrict AA, __global const DTYPE* restrict BB, const int lllX, const int lllY)
{
	for (int i = 1; i < lllX; i++)
	{
		int exit = (lllY % UNROLL_FACTOR == 0) ? (lllY / UNROLL_FACTOR) : (lllY / UNROLL_FACTOR) + 1;
		#pragma ivdep
		for (int j = 0; j < exit; j++)
		{
			float a[UNROLL_FACTOR];
 
			#pragma unroll
			for (int k = 0; k < UNROLL_FACTOR; k++)
			{
				int j_real = j * UNROLL_FACTOR + k;
				a[k] = AA[(i-1)*lllY+j_real];
			}
 
			#pragma unroll
			for (int k = 0; k < UNROLL_FACTOR; k++)
			{
				int j_real = j * UNROLL_FACTOR + k;
				if (j_real < lllY)
				{
					AA[i*lllY+j_real] = a[k] + BB[i*lllY+j_real];
				}
			}
		}
	}
}

I basically detached the two accesses to the AA buffer. However, this required that I perform manual loop unrolling rather than rely on the compiler's unroll pragma. In general, I would advise against performing partial unrolling in Single Work-item kernels using the pramga; manual partial unrolling as I did above pretty much always achieves better results. Note that if you merge the two fully-unrolled loops, you will again get non-coalesced accesses. It is worth noting that the mod and division operations are relatively cheap in hardware if the unroll factor is a power of two. Furthermore, "exit" can be calculated on the host and passed to the kernel as an argument to save some area on the FPGA.

In the end, I have to say that for such examples, it is certainly much easier to get good performance using an NDRange kernel; however, with correct optimization, the Single Work-item equivalent will usually result in better performance. Unfortunately, the learning curve for optimizing Single Work-item kernels is relatively steep and requires a lot of experience and knowledge of how the compiler works.

SBioo · ‎09-09-2018

Thanks for the response,

Just one more question. You said compiler reports can indicate whether it is able to coalesce memory accesses or not. I'm using SDK 16.0-pro and I can't see any report about memory coalescing. Could you please help me in which file you see these information?

Thanks,

Saman

HRZ · ‎09-09-2018

I don't remember what report was available in v16.0 but in v16.1 and above an HTML report is available which includes a "System Viewer" tab that shows the size, type and number of ports to external memory. You can check that part of the report to see if the accesses are being coalesced correctly or not. If you get a few wide ports under the presence of unrolling/SIMD, then coalescing is working correctly. However, if you get a lot of narrow ports, then coalescing is not being performed for some reason. In the older versions of the report you could also tell whether coalescing is working or not based on the reported number of memory ports; coalesced accesses count as one port.

SBioo · ‎09-09-2018

Thanks much,

For the blocking thing, I barely can understand the thing you have done in the line you provided into the Rodinia benchmark. I've checked your papers, but haven't found anything specific to that. Do you have any paper that discuss it, or any other simpler example?

Thanks,

Saman

HRZ · ‎09-10-2018

There are some descriptions in the "README_fpga.md" file for each version and more description in the snippet of my thesis I attached above. However, understanding the transformation is going to be difficult unless you are familiar with the benchmark. I recommend searching for general examples of loop blocking and starting from simple examples until you get to more complex ones. The following document includes code examples of multiple HLS-based transformations including loop blocking:

https://arxiv.org/abs/1805.08288

SBioo · ‎09-19-2018

Dear HRZ,

I have gone through the tips you have given me on advanced techniques of optimization of HLS codes. As a result I have applied some of the techniques to optimize the code I have given you before. This time I am writing my kernel as a single-thread mode kernel. For the first step, I'm applying blocking of the code (Based on what I've learnt on the net). After that, since there was some dependency, I have applied another technique to interleave computation between different inputs (The inputs are basically iterations of the outermost loop). Here is my developed code:

#ifdef INT_PRECISION
#define DTYPE int
#elif SINGLE_PRECISION
#define DTYPE float
#elif DOUBLE_PRECISION
#pragma OPENCL EXTENSION cl_khr_fp64: enable
#define DTYPE double
#endif
 
__kernel void S1119 (__global DTYPE* restrict AA,
					   __global const DTYPE* restrict BB,
                    const int lllX
                    ,const int lllY)
{
	
	int exit = lllY / BLOCK_SIZE;
 
	for (int i = 0; i < exit; i+=4) {
 
		int i_real[4];
 
		i_real[0] = i*BLOCK_SIZE;
		i_real[1] = (i+1)*BLOCK_SIZE;
		i_real[2] = (i+2)*BLOCK_SIZE;
		i_real[3] = (i+3)*BLOCK_SIZE;
 
		// start processing
    	       for (int j = 1; j < lllX; j++) {
 
			DTYPE BB_SR[BLOCK_SIZE][4];
			DTYPE AA_SR[BLOCK_SIZE][4];
 
			if (j == 1) {
				#pragma unroll
				for (int k = 0; k < BLOCK_SIZE; k++) {
					for (int ii = 0; ii < 4; ii++)
						AA_SR[k][ii] = AA[i_real[ii]+k];
				}
			}
 
			#pragma ivdep
			for (int ii = 0; ii < 4; ii++){
	
				#pragma ivdep
				#pragma unroll
				for (int k = 0; k < BLOCK_SIZE; k++) {
					BB_SR[k][ii] = BB[j*lllY+k+i_real[ii]];
				}
		
    			        #pragma ivdep
      			        #pragma unroll UNROLL_FACTOR
				for (int k = 0; k < BLOCK_SIZE; k++) {
					AA_SR[k][ii] = AA_SR[k][ii] * BB_SR[k][ii];
				}
 
				#pragma unroll
				for (int k = 0; k < BLOCK_SIZE; k++) {
					AA[j*lllY+k+i_real[ii]] = AA_SR[k][ii];
				}
			}
		}
	
	}
}

Now my question is, The block that I'm doing the computation cannot be fully unrolled, and I can only partially unroll it (defined with UNROLL_FACTOR). Since I'm interleaving computation of various inputs, I don't know why that happens. I just wanna know how compiler takes care of this situation.

In addition, except from memory access re-arrangement, what other kinds of obvious optimizations can be applied to unroll more and achieve II=1?

Thanks

HRZ · ‎10-17-2018

Sorry for the [very] late reply, I have not been able to get time to look at your new code yet. Have you been able to make any more progress?