tde_m
Novice

Is HyperFlex Optimization disabled by Non-Aligned LSU?

Hello,

 

I'm trying to understand why HyperFlex optimization is disabled in the following toy example.

 

#pragma OPENCL EXTENSION cl_intel_channels : enable

#define BLOCK_ROWS 8
#define BLOCK_COLS 8

channel float channel_A[BLOCK_COLS] __attribute__((depth(32)));

__kernel void readA(__global volatile const float * restrict A, const unsigned int N, const unsigned int M)
{
    const uint BlocksN = 1 + (int)((N - 1) / BLOCK_ROWS);
    const uint BlocksM = 1 + (int)((M - 1) / BLOCK_COLS);

    for (int bi = 0; bi < BlocksN; bi++) {
        for (int bj = 0; bj < BlocksM; bj++) {
            for (int i = 0; i < BLOCK_ROWS; i++) {
                // reading offset
                const int offset = (bi * BLOCK_ROWS + i) * M + bj * BLOCK_COLS;
                // dummy offset that will leave hyper opt ON
                // const int offset = 0;
                #pragma unroll
                for (int j = 0; j < BLOCK_COLS; j++) {
                    float value = A[offset + j];
                    write_channel_intel(channel_A[j], value);
                }
            }
        }
    }
}

__kernel void dummy(__global volatile float * restrict A, const unsigned int N)
{
    for (int i = 0; i < N; i++) {
        #pragma unroll
        for (int j = 0; j < BLOCK_COLS; j++) {
            A[i] = read_channel_intel(channel_A[j]);
        }
    }
}

 

In the example, I want to read a matrix block by block, with multiple reads per cycle (the unrolled innermost loop). A dummy kernel receives the data, for the sake of producing a report.

 

If I compile this with aoc 19.1 targeting a Stratix 10, the report states that Hyper-Optimization is disabled "because of the Load Operation that does not support it". The corresponding LSU is generated as bursted and non-aligned.

 

If I simplify the accesses to matrix A (for example by using an offset equal to zero), the LSU is generated as bursted and Hyper-Optimization is ON.

 

Could it be because of the non-aligned LSU? And if so, is there some way to prevent the compiler from inferring the accesses as non-aligned?

 

Thanks,

 

 

 

4 Replies
MEIYAN_L_Intel
Employee

Hi,

I am looking into this.

Thanks

HRZ
Valued Contributor II

I am sure that with 18.1.x, the HyperFlex optimization was automatically disabled for bursted non-aligned LSUs. I never tested 19.0 or 19.1, but in 19.2 and above this no longer seems to happen. You can infer aligned coalesced ports if, instead of coalescing accesses through loop unrolling, you use OpenCL vector variables (or a struct whose single member is an array with as many elements as the unroll factor), so that you are technically just reading (and writing) one wide value each loop iteration. Of course, this is only feasible if all of your memory accesses have a minimum alignment equal to the width of the coalesced port, i.e. your offset in bytes must always be a multiple of the port size in bytes.
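A minimal sketch of that idea for the code in question (the `float8` pointer, the single wide channel, and the assumption that M has been padded to a multiple of BLOCK_COLS are all mine, not part of the original code):

```c
channel float8 channel_A __attribute__((depth(32)));

__kernel void readA_vec(__global volatile const float8 * restrict A,
                        const unsigned int N, const unsigned int M) {
    // M is assumed to be a multiple of BLOCK_COLS (pad the rows on the host
    // otherwise), so every float8 index below is exact and every load is a
    // single aligned 32-byte access.
    const uint BlocksN = 1 + (N - 1) / BLOCK_ROWS;
    const uint BlocksM = 1 + (M - 1) / BLOCK_COLS;
    for (uint bi = 0; bi < BlocksN; bi++)
        for (uint bj = 0; bj < BlocksM; bj++)
            for (uint i = 0; i < BLOCK_ROWS; i++) {
                // index in units of float8, not float
                float8 value = A[(bi * BLOCK_ROWS + i) * (M / BLOCK_COLS) + bj];
                write_channel_intel(channel_A, value);
            }
}
```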

tde_m
Novice

This happens with 19.2 as well, or at least it returns the same warning message and disables Hyper-Optimization.

 

Regarding your suggestion of using vector variables (or custom data types): I can see that this would work, but it makes it harder to handle the case in which the matrix dimensions are not a multiple of the vector width.

HRZ
Valued Contributor II

I stand corrected; it is actually 19.3+ that supports the HyperFlex optimization for non-aligned LSUs. However, both 19.3 and 19.4 generate a new warning on your code:

warning: test.cl:21:21: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering
warning: test.cl:34:13: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering

It seems they don't like the unrolled channel operation, even though it doesn't create multiple call sites. A quick transformation as follows solves the issue:

#pragma OPENCL EXTENSION cl_intel_channels : enable

#define BLOCK_ROWS 8
#define BLOCK_COLS 8

typedef struct {
    float data[BLOCK_COLS];
} CHAN_WIDTH;

channel CHAN_WIDTH channel_A __attribute__((depth(32)));

__kernel void readA(__global volatile const float * restrict A, const unsigned int N, const unsigned int M)
{
    const uint BlocksN = 1 + (int)((N - 1) / BLOCK_ROWS);
    const uint BlocksM = 1 + (int)((M - 1) / BLOCK_COLS);

    for (int bi = 0; bi < BlocksN; bi++) {
        for (int bj = 0; bj < BlocksM; bj++) {
            for (int i = 0; i < BLOCK_ROWS; i++) {
                // reading offset
                const int offset = (bi * BLOCK_ROWS + i) * M + bj * BLOCK_COLS;
                // dummy offset that will leave hyper opt ON
                // const int offset = 0;
                CHAN_WIDTH temp;
                #pragma unroll
                for (int j = 0; j < BLOCK_COLS; j++) {
                    temp.data[j] = A[offset + j];
                }
                write_channel_intel(channel_A, temp);
            }
        }
    }
}

__kernel void dummy(__global volatile float * restrict A, const unsigned int N)
{
    for (int i = 0; i < N; i++) {
        CHAN_WIDTH temp = read_channel_intel(channel_A);
        #pragma unroll
        for (int j = 0; j < BLOCK_COLS; j++) {
            A[i] = temp.data[j];
        }
    }
}

i.e., instead of using multiple narrow channels, just use one wide channel so that the channel operations can be moved outside of the unrolled loops.

 

>but this will render more difficult to handle the case in which the sizes of the matrix are not a multiple of the used vector data type.

Indeed, you will have to manually pad the input row by row on the host if the row size is not a multiple of the vector size, but implementing this will not be too difficult, and there is no other way to get fully aligned accesses anyway. You absolutely want to avoid unaligned accesses on Intel FPGAs since they will kill your memory performance (reference: https://arxiv.org/abs/1910.06726).
