Solved: Load/store cannot be vectorised - local memory

GRodr25 · ‎03-31-2020

Hello,

I'm having some trouble with local memory and SIMD in a matrix transpose kernel I'm adapting from GPU. The code:

#define TILE_DIM 4
 
__attribute__((reqd_work_group_size(TILE_DIM, TILE_DIM, 1)))
			__attribute__((num_simd_work_items(TILE_DIM)))
__kernel void MatTranspose(__global float* restrict dest,
                                            __global float* restrict src)
{
    __local float tile[TILE_DIM][TILE_DIM];
 
    int tx = get_local_id(0);
    int ty = get_local_id(1);
    int bx = get_group_id(0);
    int by = get_group_id(1);
    int x = bx * TILE_DIM + tx;
    int y = by * TILE_DIM + ty;
    int width = get_num_groups(0) * TILE_DIM;
 
    for(int j = 0; j < TILE_DIM; j += TILE_DIM) {
        tile[ty + j][tx] = src[(y + j) * width + x];
    }
 
    barrier(CLK_LOCAL_MEM_FENCE);
 
    x = by * TILE_DIM + tx;
    y = bx * TILE_DIM + ty;
 
    for(int j = 0; j < TILE_DIM; j += TILE_DIM) {
        dest[(y + j) * width + x] = tile[tx][ty + j];
    }
}

The compiler warns:

Compiler Warning: Vectorized kernel contains loads/stores that cannot be vectorized. This might reduce performance.

I don't see why these memory operations cannot be vectorised. I know the problem is caused by local memory because replacing the accesses to tile by a constant both in lines 17 and 26 solves the issue.

Moreover I was expecting tile to be split in 4 memory banks so that 4 load/store operations occur at the same time, but only one is generated. Forcing this to occur with the attribute numbanks results in pipelined never stall memory, when it should be burst coalesced (i guess this comes from the warning) as global memory accesses are contiguous in every cycle.

Could someone explain me what is going on, please?

P.D: Although this kernel can be implemented avoiding SIMD and achieve a good performance this is part of some benchmarking work I'm undertaking, so I want it to remain vectorised.

HRZ · ‎04-03-2020

That compiler warning in particular is a very misleading warning and it does not always point to an actual problem in your code. Looking at the report, both the load from and the store to global memory are coalesced into 128-bit accesses which points to correct vectorization. The local buffer "tile" is also replicated by 28 times to provide fully-parallel non-stallable accesses. 4 times of it is because your code has 4 non-coalescable reads on line 28, and one coalescable write on line 19 (each Block RAM has two ports, writes are connected to all replicas while reads are connected to one, resulting in a replication factor of 4 for 4 reads and one write). The buffer is also replicated by 7 extra times to support 7 work-groups running concurrently in the same compute unit; this latter replication factor is a compiler decision that cannot be overridden by the user. All in all there is nothing wrong with your code and I would say you can safely ignore the warning.

View solution in original post

HRZ · ‎04-03-2020

That compiler warning in particular is a very misleading warning and it does not always point to an actual problem in your code. Looking at the report, both the load from and the store to global memory are coalesced into 128-bit accesses which points to correct vectorization. The local buffer "tile" is also replicated by 28 times to provide fully-parallel non-stallable accesses. 4 times of it is because your code has 4 non-coalescable reads on line 28, and one coalescable write on line 19 (each Block RAM has two ports, writes are connected to all replicas while reads are connected to one, resulting in a replication factor of 4 for 4 reads and one write). The buffer is also replicated by 7 extra times to support 7 work-groups running concurrently in the same compute unit; this latter replication factor is a compiler decision that cannot be overridden by the user. All in all there is nothing wrong with your code and I would say you can safely ignore the warning.

GRodr25 · ‎04-03-2020

Hello HRZ.

Thank you for your answer, it was really helpful. Just one more question: I'm not sure I understand the 7 work-groups running concurrently in the same compute unit part. AFAIK a work-group maps to a single CU. I wondered in the past about what happened when the pipeline had more stages than the size of the work-group, but no one gave me the answer. Will the first stages remain idle whilst the current work-group hasn't finished yet or will a new work-group enter the pipeline? I assume what happens is the latter for performance and that's the reason why the buffer is replicated 7 times, so that up to 7 work-groups can be running concurrently and the pipeline is full at every moment. Am I right?

Edit: the numbers you are providing don't match my report's. For tile, mine says there is a single bank replicated twice and each replica has 5 private copies. I'm using AOC 19.3. Are you using the same compiler version?

Thank you

HRZ · ‎04-04-2020

Yes, essentially, on top of the work-item pipelining within the same work-group, the compiler also pipelines multiple work-groups within the same CU to keep the CU as busy as possible. I think the number "7" is calculated by dividing the CU pipeline depth by the work-group size.

I generated the report with both v16.1.2 and 19.4, both gave me the same replication factors. Can you archive and attach your report folder so that I would take a look at it?

GRodr25 · ‎04-26-2020

Hello again,

Why are you saying the reads on line 28 are not coalescable? And if you look at the LSUs latencies it seems the load unit generated for line 21 has a latency of 144 (in my experience this will lead to really poor performance), whilst the store unit has a latency of 2. As you pointed out the LSUs width is 128 bits, so it seems vectorisation is working. Moreover, both LSUs are burst-coalesced type. What is going on here? Why does the load unit have such a high latency?

Thank you.

GRodr25 · ‎04-04-2020

Yes, sure. I attached the report to this message. I really appreciate your help.

HRZ · ‎04-04-2020

I made a small mistake, my Arria 10 compile with v16.1.2 did indeed give the same results as your report, but I didn't pay proper attention to the numbers and thought the results were the same as my compile with v19.4 on Stratix 10. On Arria 10, the compiler chooses a "bandwidth" of 64 bits, resulting in four reads and two 64-bit writes which, coupled with double-pumping, results in a replication factor of two for parallel accesses. It can be forced to 4 reads and one write by adding "__attribute__((memory, bankwidth(4*TILE_DIM)))", but that seems to slightly increase area usage and that is likely why the compiler is opting for a bandwidth of 64 bits. On Stratix 10, however, Block RAM double-pumping is not supported for some magical reason and hence, the compiler has to choose a "bandwidth" size of 128 bits to reduce number of writes to one or else it would be impossible to have stall-free accesses with two writes without double-pumping. In this case the replication factor will be what I mentioned in my first post.

AnilErinch_A_Intel · ‎04-30-2020

Hi All,

I appreciate the fruitful discussions happened here.

Thanks and Regards

Anil