
NDRange Kernels Global Memory Write Pattern

Altera_Forum
Honored Contributor II

I've come up with some basic questions that I'd like to discuss:

 

1) I still don't understand how SIMD is implemented on the FPGA. On a GPU, work-items assigned to SIMD lanes execute concurrently. Is it the same on the FPGA, or are they just interleaved? For example, with SIMD 16, are 16 work-items scheduled to a compute unit and executed in an interleaved fashion?

 

2) If SIMD on the FPGA is not as parallel as on a GPU, are 16 work-items scheduled to a compute unit at once, which then waits for them to finish? Or can the next work-items still come in and be pushed into the pipeline?

 

3) Imagine a case where every work-item writes just one value to global memory at the end of its execution, at the index given by that work-item's global ID. With many compute units and a SIMD width of 16, many write operations will be issued each clock cycle with non-contiguous addresses (based on my understanding). This seems inefficient for high-performance memory access. Does that mean kernels designed for GPUs are unsuitable for FPGAs with regard to their memory access pattern?

 

4) Does the LSU (load-store unit) perform memory coalescing? In other words, does it have some kind of buffer that receives memory write operations and then flushes them to memory after grouping them into contiguous blocks of data?
Altera_Forum
Honored Contributor II

1) On the FPGA, by default, a deep pipeline is created, so work-items can go in and come out every clock cycle. Optionally, the single pipeline can be vectorized to bring more work-items in simultaneously, or the entire pipeline can be duplicated to handle different work-groups simultaneously.
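
To make that concrete, here is a minimal sketch (the kernel name, body, and attribute values are illustrative, not from this thread) of how both options are requested with Intel FPGA OpenCL kernel attributes:

    // Hedged sketch: vectorize the pipeline 16-wide and also replicate it twice.
    // reqd_work_group_size must be evenly divisible by num_simd_work_items.
    __attribute__((num_simd_work_items(16)))        // widen the datapath (SIMD)
    __attribute__((num_compute_units(2)))           // duplicate the whole pipeline
    __attribute__((reqd_work_group_size(256, 1, 1)))
    __kernel void scale(__global const float *restrict in,
                        __global float *restrict out,
                        const float k)
    {
        size_t gid = get_global_id(0);
        out[gid] = k * in[gid];                     // illustrative body
    }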

 

2) It depends on the implementation. Without vectorization, one work-item goes in and one comes out every clock cycle. With vectorization, all 16 can be processed in parallel. The tradeoff is always performance vs. FPGA resource use.

 

3) You would not want to do this. It is better to use a barrier to synchronize all the work-items and then write all the work-item data to global memory in one shot.
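
A hedged sketch of that pattern (the kernel, the helper some_work, and the group size of 256 are all illustrative): each work-item stages its result in local memory, the whole group synchronizes, and the writes then leave as one contiguous, coalescible burst:

    float some_work(size_t gid)            // stand-in for the real per-item computation
    {
        return (float)gid * 2.0f;
    }

    __attribute__((reqd_work_group_size(256, 1, 1)))
    __kernel void staged_write(__global float *restrict out)
    {
        __local float stage[256];          // one slot per work-item in the group
        size_t lid = get_local_id(0);
        size_t gid = get_global_id(0);

        stage[lid] = some_work(gid);       // each work-item computes its result

        barrier(CLK_LOCAL_MEM_FENCE);      // all 256 work-items synchronize here

        out[gid] = stage[lid];             // contiguous per-group stores to global memory
    }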

 

4) Yes. The compiler will coalesce memory accesses where it can. If it can't, the optimization report will indicate which implementation was selected and why coalescing could not be performed.
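
For intuition, two illustrative kernels (not from this thread): unit-stride accesses, where consecutive work-items touch consecutive addresses, are easy to coalesce into wide bursts, while strided accesses generally are not:

    // Coalescible: work-item i accesses element i (unit stride).
    __kernel void good(__global const float *restrict in,
                       __global float *restrict out)
    {
        size_t i = get_global_id(0);
        out[i] = in[i];
    }

    // Hard to coalesce: a large stride scatters the addresses.
    __kernel void bad(__global const float *restrict in,
                      __global float *restrict out,
                      const int stride)
    {
        size_t i = get_global_id(0);
        out[i * stride] = in[i * stride];
    }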
Altera_Forum
Honored Contributor II

Thanks very much, sstrell,

 

Here are some more questions I have:

 

1) If the SIMD value really duplicates the pipelines inside a compute unit, then there should be a significant difference in resource usage between SIMD 8 and SIMD 16. Looking at the resource usage, though, increasing the SIMD value does not necessarily increase the amount of resources used. Does that mean SIMD pipeline replication is efficient in terms of area growth?

 

3) Can you elaborate more on this? Doesn't a barrier introduce a severe performance penalty, though?

 

4) So you are saying there is no runtime mechanism for memory coalescing, and it's all done at compile time. Is that true? Because I think for GPUs it's the other way around.
Altera_Forum
Honored Contributor II

1) No. You should try vectorization (num_simd_work_items) first, before CU replication (num_compute_units). Both use more resources, but num_simd_work_items will use fewer.

 

3) The penalty of syncing work-items before performing a memory access is much smaller than that of constant accesses to global memory. Again, check the optimization report and use the profiler to see the effects on your design.

 

4) All pipeline hardware is created during the offline compile, so the choice of load/store units is made at that point as well, including whether coalescing can be performed or not.
Altera_Forum
Honored Contributor II

1) Alright, I totally understand the difference between num_simd_work_items and num_compute_units. What I don't understand is how SIMD is implemented to achieve parallelism with such low resource consumption. By low I mean really, really low: I barely see any increase or decrease in area when playing with the value of num_simd_work_items. That's why I came to the conclusion that num_compute_units achieves real parallelism while num_simd_work_items just interleaves work-items one after the other.

 

2) Can you elaborate more on the barrier implementation? I believe that even after all the work-items in the work-group hit the barrier, they still have to perform their write operations one after the other. I doubt that after hitting the barrier all 256 work-items in the work-group can execute the write instruction in the OpenCL code at once.
Altera_Forum
Honored Contributor II

1) With vectorization, all the control logic is shared, minimizing the extra resources required, so more work-items go into the pipeline in parallel. It is still pipelined, so if you vectorize to, say, 16, you can put 16 work-items in each clock cycle and get 16 work-items out each clock cycle. If the pipeline is not vectorized, work-items go into the pipeline one at a time, but you still get a constant throughput of 1 work-item per clock cycle, assuming no stalls in the pipeline.
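
To put rough numbers on that (an illustrative calculation; the 300 MHz clock is an assumed figure, not from this thread): a scalar pipeline at 300 MHz retires at most 300 million work-items per second, while a SIMD-16 pipeline at the same clock retires at most 16 x 300 M = 4.8 billion work-items per second, paying mainly for 16 copies of the datapath since the control logic is shared.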

 

2) No, the barrier works just as you say. All work-items pause at the barrier, which is usually placed before a memory operation. The hardware is implemented to handle this. Remember, you're creating FPGA hardware, which is customized based on the way you write the code, so if you use a barrier and then perform a memory operation, LSUs are selected to handle it in the most efficient way possible (coalescing). Thus, in most cases, you also want to specify a required or maximum number of work-items, since this tells the compiler how to organize the LSUs most efficiently.
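
A small sketch of those hints (the sizes and kernel names are illustrative; max_work_group_size is, as I understand it, the Intel FPGA OpenCL attribute for giving an upper bound):

    // Exact work-group size known at compile time:
    __attribute__((reqd_work_group_size(256, 1, 1)))
    __kernel void fixed_wg(__global float *restrict out)
    {
        out[get_global_id(0)] = 0.0f;
    }

    // Only an upper bound known:
    __attribute__((max_work_group_size(256)))
    __kernel void bounded_wg(__global float *restrict out)
    {
        out[get_global_id(0)] = 0.0f;
    }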
Altera_Forum
Honored Contributor II

Hi, 

 

I have compiled the same kernel several times, changing only the SIMD number (1, 2, 4, 8, 16), and watched the resource utilization percentages. Here is the result I got for each compilation:

 

16 SIMD:

+----------------------------------------+---------------------------+
; Estimated Resource Usage Summary                                   ;
+----------------------------------------+---------------------------+
; Resource                               ; Usage                     ;
+----------------------------------------+---------------------------+
; Logic utilization                      ; 33%                       ;
; ALUTs                                  ; 13%                       ;
; Dedicated logic registers              ; 20%                       ;
; Memory blocks                          ; 12%                       ;
; DSP blocks                             ; 6%                        ;
+----------------------------------------+---------------------------+

8 SIMD, 4 SIMD, 2 SIMD, and 1 SIMD: the Estimated Resource Usage Summary is identical to the table above, down to every percentage.

 

I still don't see any difference in utilization. Can you please elaborate on how different SIMD sizes change the area usage? It's still not clear to me how things are mapped deep inside the FPGA.
Altera_Forum
Honored Contributor II

You should post your kernel code; what you are observing is very likely because something in your code is preventing vectorization (e.g. work-item-ID-dependent branching) and hence SIMD is not being applied at all. If that is indeed the case, you will see a message from the compiler about this in the log.
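
For reference, an illustrative kernel (not the poster's code) showing the kind of work-item-ID-dependent branch that can block SIMD vectorization; the compile log will say whether SIMD was actually applied:

    __attribute__((num_simd_work_items(16)))
    __attribute__((reqd_work_group_size(256, 1, 1)))
    __kernel void divergent(__global float *restrict out)
    {
        size_t gid = get_global_id(0);
        if (gid % 2 == 0)           // ID-dependent branch: lanes diverge,
            out[gid] = 1.0f;        // so the compiler may refuse to vectorize
        else
            out[gid] = 2.0f;
    }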

Altera_Forum
Honored Contributor II

Here is the kernel I compile: 

 

__attribute__((num_compute_units(5)))
__attribute__((num_simd_work_items(16)))
__attribute__((reqd_work_group_size(256, 1, 1)))
__kernel void WGSXMAPIXLLXOPS16(const __global float *restrict GIn,
                                __global float *restrict GOut,
                                const float M, const float N, const float P)
{
    const int XGL = get_global_id(0);
    const int XGRid = get_group_id(0);
    const int XGRnum = get_num_groups(0);
    const int XLSize = get_local_size(0);
    const int XLid = get_local_id(0);

    // Just private variables
    float MF = (float) M;
    float NF = (float) N;
    float PF = (float) P;

    // Start of a new level of for loop
    long baseIndex1 = XGRid * XLSize * 2 + XLid;

    float temp1 = 1.0f;
    float temp2 = 1.0f;
    float temp3 = 1.0f;
    float temp4 = 1.0f;
    float tempOut;

    // 16 chained single-precision multiply-adds
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;
    temp1 += temp1 * MF;

    tempOut = temp1 + temp2 + temp3 + temp4;
    GOut[XGL] = tempOut;    // one per-work-item write, indexed by global ID
}

 

Basically I use it for some performance measurements, and I made sure no optimization is going to happen.
Altera_Forum
Honored Contributor II

You're not even using GIn. You're just doing math on constant values; tempOut is the same value for all work-items. Am I missing something here?

Altera_Forum
Honored Contributor II

Yes, I'm not reading anything from memory, since I want to rule out memory effects as much as possible. But I don't understand why tempOut would be the same for all work-items! It's a private variable in the kernel. Besides, none of the operations are performed on compile-time constants, so the compiler cannot optimize them out. I have checked the Verilog code to make sure all the FMAs are there.

Altera_Forum
Honored Contributor II

M, N, and P are all constants, and MF is basically the same as M. You're doing the same math on the same inputs for all work-items. Your calculations don't depend on the work-item ID, the group ID, or anything else, so tempOut should always be the same for every work-item.

Altera_Forum
Honored Contributor II

My assumption was that the compiler could not aggressively optimize my kernel away. Now, if you're saying the compiler is smart enough to recognize the constant behavior of my kernel, what would its effect be on further choices of the SIMD and CU factors, and on any optimization in general?

Altera_Forum
Honored Contributor II

There are most likely no further optimizations possible. Does the optimization report say anything? I'd be surprised if it did.

Altera_Forum
Honored Contributor II

I don't see any evidence of optimization in the report. Considering all this, is vectorization still happening in the kernel? I still believe the compiler cannot optimize away kernels like the one I've provided; it can only optimize the logic of the code itself.

Altera_Forum
Honored Contributor II

SIMD vectorization applies to the data passed into the kernel; it only benefits performance when your input data can be vectorized.

In your code only the constant M is passed in, and it can't be vectorized; I would guess that's why the resource usage is the same.

If your goal is parallel execution like on a GPU, you should experiment with the compute unit settings, though it's still not quite the same as a GPU in some respects.

Bottom line: you can launch kernels separately under different kernel names and different queues; that way it's definitely parallel. :P
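
A hedged host-side sketch of that last suggestion (ctx, dev, kernelA, kernelB, and the sizes are hypothetical placeholders): two kernels enqueued on separate in-order queues can execute concurrently on the device:

    cl_int err;
    cl_command_queue q0 = clCreateCommandQueue(ctx, dev, 0, &err);
    cl_command_queue q1 = clCreateCommandQueue(ctx, dev, 0, &err);

    size_t gsize = 4096, lsize = 256;
    // Each queue drives its own kernel pipeline on the FPGA.
    clEnqueueNDRangeKernel(q0, kernelA, 1, NULL, &gsize, &lsize, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q1, kernelB, 1, NULL, &gsize, &lsize, 0, NULL, NULL);

    clFinish(q0);
    clFinish(q1);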
Altera_Forum
Honored Contributor II

The compiler does NOT optimize out the computation (it actually isn't smart enough to do that), and it is indeed vectorizing your kernel; however, the compiler can easily tell that your computation does not depend on the work-item ID and hence does NOT vectorize the computation. It does vectorize the write to memory, which does depend on the work-item ID. In this case the compiler creates the logic so that the computation is only done once, but the result is copied back SIMD times (in a single coalesced write). The reason why you don't see any difference in logic utilization is that the difference is so small it does not show up in the "percentage" values. If you check the actual numbers in the HTML report, there are small differences in LUT and FF utilization. The difference is, of course, only caused by the line with the memory write; the resource utilization for the rest of the lines is exactly the same. Furthermore, you can clearly see in the system viewer that the write port is getting wider.
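
Conceptually, what is being described is something like the following illustrative C sketch of one SIMD-16 group (not the actual generated hardware):

    // Illustrative only: how one SIMD-16 group effectively behaves.
    void simd16_group(float M, float *GOut, int group_base)
    {
        // Work-item-invariant math: laid down once in hardware, not 16 times.
        float result = 1.0f;
        for (int i = 0; i < 16; ++i)
            result += result * M;              // the chain of FMAs from the kernel

        // The vectorized part: one wide, coalesced store of 16 copies.
        for (int lane = 0; lane < 16; ++lane)
            GOut[group_base + lane] = result;
    }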
