
How to add the number of work items in flight for the NDRange kernel?


Since an NDRange kernel is implemented as a work-item-based pipeline on the FPGA, if I understand it correctly, the maximum number of work items in flight should be determined by the complexity (i.e., the number of stages) of the kernel, right?


Take the following kernel code as an example (from the beginning of Chapter 4 of Intel's Best Practices Guide):

__kernel void add(__global int *a, __global int *b, __global int *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}

The compiler generates a 3-stage pipeline for it:

1) Two Load units (load a and b simultaneously)

2) One Add unit

3) One Store unit


So for this 3-stage pipeline, at most 3 work items can be in flight, no matter how many work items are specified in the host code. If we want more in-flight work items, we have to add more computation or operations that will be translated into extra stages. Do I understand this correctly?


Since a deeper pipeline provides more parallelism, if my understanding above is correct, a simple kernel with few operations actually cannot benefit much from the NDRange implementation (no matter how many work items are used or specified), right?



Valued Contributor II

The number of work-items that can be in flight simultaneously depends on the pipeline depth; even though you see only three units in the report, the total length of the pipeline should be on the order of 50-200 stages, which would allow the same number of work-items to be pipelined at the same time. Note that if you want work-item parallelism, you should use SIMD; by default, work-items are only pipelined in NDRange kernels.


Thank you, HRZ.


Actually, I did not compile this example code; I just read the description in Intel's Best Practices Guide of how hardware pipeline stages are generated for a given kernel. The guide provides many similar simple examples to help people understand how pipeline parallelism can be achieved.


I'm still curious why the compiler turns the single statement "c[gid] = a[gid]+b[gid];" into a pipeline on the order of 50-200 stages. It seems that the guide does not mention such implicit stages. Could you provide more details?

Valued Contributor II

The latency of most operations on the FPGA is higher than one cycle, to allow a reasonable operating frequency. For the particular case of external memory accesses, the latency is on the order of a few hundred cycles. Generally, the compiler generates a pipeline deep enough to absorb the majority of external memory stalls while accommodating all the necessary operations at a specific target operating frequency (240 MHz by default). If you check the "System viewer" tab of the HTML report, you can find the latency of each block in your code and calculate the total pipeline depth by adding up all the latency values.


I see. Nice explanations! I just looked at the "System viewer" tab of the HTML report and it indeed shows the latency of each block in my code. Good info!


Thanks again!


One more question: is the purpose of unrolling a loop to increase the depth of the pipeline (for a single work-item kernel), rather than to turn the unrolled iterations into a SIMD circuit (truly parallel execution)? If so, then in the NDRange version, since the loop cannot be pipelined the way it is in a single work-item kernel, putting "#pragma unroll" before a loop cannot bring any benefit (but adds some extra area), right? (Note: by saying "the loop cannot be pipelined" above, I mean its iterations cannot be pipelined; instead, the loop is treated as a whole and forms the pipeline together with the rest of the code, so the loop becomes a single stage. In that case, there would be no difference between unrolling and not unrolling the loop. This is just my understanding.)


BTW, I'm curious how the compiler can still unroll a loop whose bound is a run-time value, for example "while (i < n) { i++; /* do sth. */ }" (assume n is not changed in the loop body). If n is pretty large, there will not be enough area for the compiler to unroll the loop. (Please correct me if I understand this incorrectly.)



Valued Contributor II

Actually, loop unrolling has a similar effect to that of SIMD: it allows multiple loop iterations (rather than work-items) to be executed in parallel. For single work-item kernels, unrolling is the main method of achieving parallelism (the others are using multiple kernels in different queues, or automatic kernel replication using the autorun attribute). For NDRange kernels you have SIMD, unrolling and compute-unit replication. The difference between SIMD and unrolling in this case is that SIMD enables multiple work-items to be executed in parallel, while unrolling allows each work-item to execute multiple loop iterations in parallel. Indeed, loop unrolling can have a big effect on the performance of NDRange kernels if each work-item has to go through a loop with a large trip count. This benchmark is probably a good example (compare v0 and v2):
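
A sketch of this distinction (my own illustration, not compiled here; the attribute values and loop body are made up): num_simd_work_items vectorizes the datapath across work-items, while #pragma unroll parallelizes the loop iterations inside each work-item.

```c
/* Illustrative NDRange kernel: SIMD across work-items, unrolling
   within each work-item. A minimal sketch, not code from this thread. */
__attribute__((reqd_work_group_size(64, 1, 1)))  /* required when using SIMD */
__attribute__((num_simd_work_items(8)))          /* 8 work-items in lock-step */
__kernel void sum4(__global const float * restrict in,
                   __global float * restrict out)
{
    int gid = get_global_id(0);
    float acc = 0.0f;
    #pragma unroll               /* 4 iterations executed in parallel */
    for (int i = 0; i < 4; i++)
        acc += in[gid * 4 + i];
    out[gid] = acc;
}
```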


Though, unrolling loops in NDRange kernels can have a negative effect on performance if it results in non-coalesced memory ports.


With respect to unrolling loops with unknown bounds, the compiler cannot fully unroll such loops, but partial unrolling is possible; in that case the compiler will create a branch inside the loop to avoid going out of bounds. Partial unrolling of loops with unknown bounds is likely not a good idea, since the branch results in inefficient area utilization; a better method is to perform manual loop unrolling as described in Section 3-5 of this document:


Thanks for the benchmark link. v2 uses a lot of loop unrolling and should be faster than v0 (I have also been reading your SC16 paper but have not finished it yet).


It sounds like loop unrolling is used for parallel execution in both single work-item and NDRange kernels. So it cannot increase the number of stages (pipeline depth) in a loop for a single work-item kernel, right? For example, assume a kernel contains only one loop with a constant bound N; after it is completely unrolled, the iterations disappear and no pipeline exists any more (a truly parallel circuit is generated).


One relevant topic is manual unrolling. Please see the following three code snippets. Do you believe the compiler will generate the same circuit for all of them? (Assume a single work-item kernel.) I have doubts about code 3, which is just a sequence of ordinary statements; without "#pragma unroll" or "for", it may not be compiled into a parallel circuit.

Code 1:

#pragma unroll
for(i = 0; i < 10; i++)
    a[i] = b[i] + c[i];

Code 2:

#pragma unroll
for(i = 0; i < 10; i += 2) {
    a[i]   = b[i]   + c[i];
    a[i+1] = b[i+1] + c[i+1];
}

Code 3:

a[0] = b[0] + c[0];
a[1] = b[1] + c[1];
...
a[9] = b[9] + c[9];


By saying "automatic kernel replication using the autorun attribute", what do you mean by "replication"? As far as I know, "autorun" makes the kernel keep being launched automatically (and repeatedly), like an infinite loop. Unlike compute-unit replication, there should be only one copy of an "autorun" kernel. Am I missing something?


I've been reading your thesis since you pointed me to it in another post. I noticed that you posted the third version just a few days ago. Is there any big change between it and the second version?


In Sec., you mention that 3-4 b) is worse than 3-5.


"Even though the resulting optimized loop can be partially unrolled by using the unroll pragma to further improve the performance, doing so will break the shift register optimization and requires that the size of the shift register is increased further to accommodate for the unrolling. With large unroll factors, this method can result in large area overhead to implement the shift register."


By "partially unrolled by using the unroll pragma", do you mean the outer loop (since your "shifting" loop and "final reduction" loop have fixed loop bounds, they do not need partial unrolling)?


Since the "shifting" loop in both 3-4 b) and 3-5 has a fixed bound ("FADD_LATENCY"), why do you say "the size of the shift register" needs to be increased? Likewise, I'm confused by "With large unroll factors", because the "FADD_LATENCY" bound is fixed. Sorry, I may not fully understand the quoted description.


Two typos I found:


Page 19: "on the memory buss", "buss" should be "bus"

Page 20 : "read form the head", "form" should be "from"



Valued Contributor II

Actually, most of the unrolling in the v2 version of LUD is not used in practice, due to the increase in ports to local buffers, which results in port sharing. The main performance difference is caused by the unrolling in the internal kernel. Details of the performance difference are given in paragraph 2 of Section in my thesis. By the way, the content of the SC paper is out of date by now; the thesis has the most up-to-date results.


Loop unrolling does increase the pipeline depth, but not in proportion to the unroll factor. My assumption is that the increase in depth comes from the increase in circuit complexity, which requires more registers to be inserted into the pipeline to meet the target frequency, rather than being a direct result of the unrolling. Note that the SC paper incorrectly states that pipeline depth increases in proportion to the unroll factor; this has been corrected in the thesis (Section 3.1.2). If you fully unroll a loop, you will still have a shallow pipeline whose depth accommodates one instance of the loop body, but whose width accommodates all the unrolled loop iterations.


Regarding your code snippets, you can put them in the compiler and check ( ;) ); all three examples generate the exact same circuit with the exact same latency and area utilization. The compiler is actually smart enough to parallelize code segments that do not depend on each other.


Regarding replication of autorun kernels, check “Section 12.4.1. Customization of Replicated Kernels Using the get_compute_id() Function” of the Programming Guide. This feature is very useful for creating rings or systolic arrays of processing elements.


Regarding the recent thesis update, I just fixed a few typos here and there, no new content. I am already done with the thesis. ;)


Regarding the example in Section, indeed the quote refers to the outer loop; every other loop is already fully unrolled. The problem with the required shift register size is not general. That example deals with the specific case of unrolling a reduction loop, which requires shift register inference to achieve an II of one in the first place (paragraph one of Section in the thesis and Section 5.1.5 of the Best Practices Guide). If the outer loop is partially unrolled using #pragma unroll, the latency of the reduction operation will increase and hence a bigger shift register will be required. With manual unrolling, this problem is avoided. The problem does not exist for a standard loop that does not involve a reduction; however, it is always best to avoid partial unrolling with #pragma unroll unless the loop bound is known and is a multiple of the unroll factor.


And thank you for pointing out the typos, now I need to submit a v4. :D


P.S. This forum really needs a proper means of quoting...


Sorry for the late reply. In the past few days, I was not working on my research.


It's so nice that you point out that the SC paper is out of date now and some statements there are not correctly stated. I'll focus on your thesis :)


Your comments about "partial loop unrolling" helped me a lot. For my zfp project, I use various methods to replace a loop bound that is a run-time variable with a constant value. I show what I did below:

#pragma unroll
for(int i = 0; i < n; i++)
    /* do sth. */

This is the original loop; its "#pragma unroll" causes partial loop unrolling. What I did:

1) If n is actually a constant value N, I just change n to N. If profiling shows that n never changes at run time, I do the same thing.

2) If n is not constant at run time and is relatively large, I rewrite the loop like your 3-5:


int len = n / N, r = n % N;
for(int i = 0; i < len; i++) {
    #pragma unroll
    for(int j = 0; j < N; j++)
        /* do sth. */
}
for(int i = 0; i < r; i++)
    /* do the remaining work */

3) If profiling shows that n has an upper bound, I replace n with a constant M. Here M > n, but the extra iterations do not produce wrong results.


With these rewrites, I got a 2x speedup.


Now, two new questions come to mind, because I want to extract more concurrency from multiple work-items:


1) Since work items execute the kernel in a pipelined fashion, can I assume that only one work item is executing a given statement at any point in time while the kernel runs? For example, for a statement "var = var + 1" (assume the initial value is 1), if only one work item executes it at any point in time, we will get a result of N+1 (say we have N work items). This would implicitly implement a mutex lock.


2) Currently, once a work item finishes executing the kernel, it leaves the pipeline. I was wondering whether we can implement a thread-pool model like the one in CPU-based programming, where we pre-create N threads and, when new data arrives, assign it to an idle thread (or actively pick an idle thread to handle it).

The background is: my data is large but the global memory budget is limited, so I have to use a loop in the host code to transfer partial data to FPGA global memory and launch the kernel many times. For each launch, assuming N work items, we can only process part of the data. If N is pretty large, many work items quit the pipeline early and are not fully utilized. If they could stand by after finishing their previous data, we could keep feeding data to the kernel and get longer streaming processing.

Do you think it is possible to do so with currently OpenCL FPGA mechanism?


Thank you!

Valued Contributor II

I would say your approach to unrolling the loops is correct.


1) Indeed, the loop iterations are run sequentially and there is an implicit guarantee of data consistency. This guarantee comes from the compiler's loop dependency analysis. In many cases, however, the compiler will detect a loop-carried dependency; in such cases it will either increase the loop II to resolve the dependency or, if it cannot resolve it, serialize the loop. Of course, this only applies to single work-item kernels. For the "var = var + 1" example, each iteration depends on the execution of the previous one due to reuse of the same variable on both sides of the statement. If the operation can be performed in one clock cycle (e.g. for integer values), the compiler can resolve the dependency without increasing the II. However, if the operation takes more than one clock cycle (e.g. for floating-point values), the II will increase by the latency of the operation. This is essentially the "reduction" problem I mentioned in the previous replies, and it can be resolved by inferring a shift register.


2) Probably the best approach to solve the problem in your case would be to use the host channel/pipe feature, if your board supports it. In that case, you can stream data directly from the host to the kernel and process it all using the same set of work-items, and the channel enables implicit synchronization. In general, however, as long as each chunk of data you pass to the FPGA is large enough, the overhead of work-items finishing early will be quite small. You can also try implementing global memory double-buffering to overlap computation on each chunk with the PCI-E transfer of the next one. In such a streaming application, your performance bottleneck will likely be the PCI-E transfer rather than the actual computation. I am not sure if creating a pool of threads is at all possible using the OpenCL standard.
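
The double-buffering could be sketched on the host side roughly as follows. This is my own illustration, not code from the thread: buf, xfer_q, exec_q and the chunk layout are all assumptions, and error checking is omitted.

```c
#include <CL/cl.h>
#include <stddef.h>

/* Assumed setup: buf[2] are pre-created cl_mem buffers of `chunk` bytes;
   xfer_q and exec_q are two in-order command queues on the same context
   (two queues so that transfers and kernels can actually overlap);
   kernel k takes the input buffer as argument 0. */
void process_chunks(cl_command_queue xfer_q, cl_command_queue exec_q,
                    cl_kernel k, cl_mem buf[2],
                    const char *data, size_t chunk, size_t num_chunks)
{
    cl_event write_done[2], run;

    /* preload chunk 0 */
    clEnqueueWriteBuffer(xfer_q, buf[0], CL_FALSE, 0, chunk,
                         data, 0, NULL, &write_done[0]);

    for (size_t i = 0; i < num_chunks; i++) {
        int cur = (int)(i & 1), nxt = cur ^ 1;
        size_t gsz = chunk / sizeof(double);   /* illustrative NDRange size */

        /* run the kernel on the current chunk once its transfer is done */
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf[cur]);
        clEnqueueNDRangeKernel(exec_q, k, 1, NULL, &gsz, NULL,
                               1, &write_done[cur], &run);

        /* meanwhile, transfer the next chunk into the other buffer */
        if (i + 1 < num_chunks)
            clEnqueueWriteBuffer(xfer_q, buf[nxt], CL_FALSE, 0, chunk,
                                 data + (i + 1) * chunk, 0, NULL,
                                 &write_done[nxt]);

        clWaitForEvents(1, &run);   /* wait before reusing buf[cur] */
        clReleaseEvent(run);
        clReleaseEvent(write_done[cur]);
    }
}
```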


For 1), I ran tests and it turned out that I cannot get the effect of a mutex lock on a shared variable among work items. The test kernel code I used is as follows:

#define MAX_SEG (2048)
#define attr_max_wg __attribute__((max_work_group_size(MAX_SEG)))

attr_max_wg
__kernel void test1(buf_addr_space double * restrict buf_in)
{
    size_t gid = get_global_id(0);
    size_t gsize = get_global_size(0);
    local double share_var;
    if(gid == 0)
        share_var = 1;
    else
        share_var++;
    if(gid == gsize - 1)
        buf_in[0] = share_var;
}

attr_max_wg
__kernel void test2(buf_addr_space double * restrict buf_in)
{
    size_t gid = get_global_id(0);
    buf_in[0]++;
}

test1 and test2 are two different attempts to implement a mutex lock via pipelined work items: test1 uses a local shared variable "share_var" and test2 uses "buf_in[0]" directly. The kernel is compiled as NDRange. If I launch it with 8 work items and the initial value of buf_in[0] is set to 1 in the host code, emulation gives me 8 and 9 for test1 and test2 respectively, which is what I want. However, the hardware .aocx gave me 1 and 2, respectively. So it looks like no strict execution order is guaranteed, and all work items seem to finish at the same time (where is the pipeline order?). So hardware execution cannot implement exclusive access to a shared variable through the pipeline mechanism. I may be wrong or have missed something.


2) Unfortunately, host pipes are not supported on the Intel HARP platform yet. Maybe double-buffering is the only choice.


Thank you!

Valued Contributor II

1) The consistency I mentioned above is only for loop iterations in single work-item kernels. Work-items in an NDRange kernel can, and will, be executed out of order. The only means of achieving data consistency in NDRange kernels is local memory barriers. However, you will not be able to get share_var = *num_work-items* from the first code snippet even with barriers, unless you use a switch-case and add as many barriers as there are threads in each case, to make sure every previous thread has already updated the shared variable; this would effectively sequentialize the computation. It might be easier to achieve your purpose using other parallel programming techniques, like waiting on a shared flag. The second code snippet is functionally incorrect from the point of view of the OpenCL standard, since the standard does not guarantee global memory consistency except at the end of kernel execution; hence, you should not use global memory to implement a shared variable. You can, however, use atomic memory operations, and you will get correct results in that case, but it will be extremely slow.
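
For completeness, the atomic variant would look roughly like this (my own sketch, not compiled here; OpenCL 1.x core atomics only cover 32-bit integers, so the shared counter must be an int rather than a double):

```c
/* Sketch: a shared counter via atomics. atomic_inc returns the old
   value; expect heavily serialized, very slow execution on the FPGA. */
__kernel void count_positive(__global const double * restrict in,
                             __global int * restrict counter)
{
    size_t gid = get_global_id(0);
    if(in[gid] > 0.0)
        atomic_inc(counter);
}
```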


2) I believe the Harp system should support shared memory between the CPU and the FPGA, eliminating the need for double-buffering unless your data cannot fit on the host memory.


Thanks for your detailed clarification!


Currently I'm trying to implement an on-chip cache, but things are complex, to be honest. The model is:

I have an extremely large 2D input: say 4096 x 4096 double values, placed in global memory. The input is split into 1024 x 1024 blocks of 4 x 4. Correspondingly, I launch 1024 x 1024 work items, so each one processes exactly one block.


The original code reads/writes global memory directly, which is slow. Now I want to first copy a data chunk to an on-chip cache and then let the work items read from the cache. Unfortunately, the total number of work items is very large, so it is only feasible to build such a cache for a subset of the work items. In this scenario, knowing at most how many work items are in flight is important. Previously I assumed (now we know it is wrong!) that all work items execute in a strict pipeline order, i.e., 0, 1, 2, ..., N-1. If at most M (say 500) work items are in flight, I can build a cache of 500 blocks (500 x 16 x sizeof(double) = 62.5 KB). Then, when work items 0 ... 499 begin to execute, each first copies its own block from global memory to the corresponding position in the cache. Work items 500, 501, ... put their data at positions 500 % M, 501 % M, ... in the cache (or use a shared variable to calculate the index). Under the above assumption, when work item 500 executes, work item 0 should already be finished, so no data conflict should happen. But now we know this assumption is completely wrong! (My experiments also verified that this mechanism does not work.) The shared variable also cannot easily be used.


So it looks like such a cache mechanism is very difficult to use within my current code framework (or working model), right?


(In addition, the "Loops analysis" tab of the HTML report shows a thread capacity for each non-unrolled loop, for example 6, 29, 179, 549, and so on. I guess this number indicates how many work items can be in flight (i.e., pushed into the pipeline) in that loop, right?)


Thanks again!


Valued Contributor II

You don't really have to use 1024x1024 work-items; you can use, for example, only 1024, have them fill the cache, add a barrier, do the computation, write back to global memory, then adjust the memory address and use the same threads to redo the process. Essentially, you will have a loop that is run by all work-items with a barrier in the middle. However, I would not call this an efficient design. An efficient design would be to have, for example, a work-group size of 32x32 or 64x64 but a lot of work-groups. Then have each work-group load one 32x32 or 64x64 block into on-chip memory, do its computation, and write back. The memory addresses in this case are adjusted using the group id (get_group_id). Here, depending on the number of barriers and the total length of the pipeline, the compiler will replicate the on-chip cache by a specific degree to accommodate a specific number of concurrently running work-groups and allow overlapping of computation and memory accesses at the work-group level. You can very easily use SIMD in such a kernel to saturate the memory bandwidth and get the maximum possible performance. The lud_internal kernel from the benchmark I pointed you to earlier is probably a very good example of this.
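
The blocked pattern described above might look roughly like this (my own sketch, not compiled here; BSIZE, the placeholder computation and the matrix layout are assumptions):

```c
#define BSIZE 32

/* Each BSIZE x BSIZE work-group stages one tile in local memory,
   synchronizes, computes, and writes back; addresses use the group id. */
__attribute__((reqd_work_group_size(BSIZE, BSIZE, 1)))
__kernel void blocked(__global const double * restrict in,
                      __global double * restrict out, int width)
{
    __local double tile[BSIZE][BSIZE];
    int lx = get_local_id(0), ly = get_local_id(1);
    int gx = get_group_id(0) * BSIZE + lx;
    int gy = get_group_id(1) * BSIZE + ly;

    tile[ly][lx] = in[gy * width + gx];   /* stage the tile */
    barrier(CLK_LOCAL_MEM_FENCE);         /* whole tile loaded before use */

    out[gy * width + gx] = tile[ly][lx] * 2.0;  /* placeholder compute */
}
```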


Regarding thread capacity, it essentially shows the pipeline depth of that specific block, i.e., how many work-items can be in flight in it simultaneously.


Nice suggestions!


I've read the lud_internal kernel and it gave me a lot of thoughts and observations. Meanwhile, I implemented two designs of an on-chip cache for my kernel; they work well but differ somewhat from your method. I'll show them below and explain my reasoning and concerns in detail. (To make things clearer, I marked some questions as "question 1", "question 2", ...)


Please note: I used the "lud_internal" kernel for my reference (only v2 and v4 use multiple work groups, and according to, v4 is the best version).


1 Difference between "lud_internal" and my kernel

If I understand it correctly, each work group contains BSIZE * BSIZE work-items that process a BSIZE * BSIZE data block. Each work-item is responsible for calculating the value at exactly one position in this data block, and doing so needs a data row and a data column, so the data block is (and has to be) shared by all work-items. That's why you need a barrier before the for loop: it guarantees that the whole data block has been cached into the two local buffers (peri_row and peri_col) before the calculation.


In my case, the concept of a "block" and its usage may differ from the "block" in your code or in your last reply ("load one 32x32 or 64x64 block"). It has to be a 4x4 block (of doubles), which is the basic (de)compression unit and hence indivisible. Each work-item processes (i.e., compresses/decompresses) one or more blocks, not a single value. Since each work-item accesses only its own block, it is completely independent of the other work-items (in your case, each data row/column is used by more than one work-item). As a result, no barriers are needed.

Recall the zfp kernel code I sent you in an earlier post, in which each work-item processes multiple blocks. That code, as we know, has a lot of issues (especially concurrency issues); even with your many useful suggestions in that post, I still could not make it work completely correctly. Later on, I rewrote some code and let each work-item process just one block, and surprisingly the new version works correctly (though I still have not found the root cause of those issues). Now you know why I have to launch a large number of work-items for a large 2D matrix input.


2 My on-chip cache implementations

Option 1:

As the code below shows (BLOCK_ITEMS = 16, BLOCK_SIZE = 16 * sizeof(double)), I just added two caches, for input and output. They are private memory (and should be inferred as registers, but I'm not sure ??). At the beginning of the kernel, I fetch the data block from global memory (indexed by the work-item id) into input_cache. During processing, the frequent writes go to output_cache. At the end of the kernel, output_cache's data is copied to global memory at the corresponding index.

double input_cache[BLOCK_ITEMS];
char output_cache[BLOCK_SIZE];

This is the easiest way to implement an on-chip cache. Previously I worried whether it would work, because the caches might occupy too much memory with too many work-items, but it turned out that the code compiles successfully and works well. So I guess the compiler allocates these cache resources based on the estimated maximum number of in-flight work-items (my report shows a maximum thread capacity of 500+), correct? (question 1)

This option allows multiple execution configurations that can be set up on the host side: one work group, or multiple 1D work groups with flexible group sizes (16, 32, 64, ...). (Note that global_work_size is always set to a fixed value: the total number of blocks.) There is no obvious performance difference among these configurations. If I remember correctly, an Intel FPGA is viewed as a single CU (compute unit), so one group versus several groups actually makes no big difference.


Option 2:

As the code below shows, I built two __local caches that can be shared by all work-items in a group (I tried to follow the structure of your code). Here the group contains 64 work-items, so input_cache also has to contain 64 data blocks. The for loop fills the cache for each work-item. Note that even though input_cache is shared, no data is actually shared across blocks!


With this code, I must set the same work-group size (64) on the host side, otherwise the results are incorrect (global_work_size is still the total number of blocks).

#define BLOCK_ITEMS 16
#define MAX_SEG (2048*2048)

#define GROUP_SIZE (64) // 64 blocks
#define GROUP_ITEMS (GROUP_SIZE * BLOCK_ITEMS)
#define GROUP_BYTES (GROUP_ITEMS * 8)

__attribute__((max_work_group_size(MAX_SEG)))
__kernel void process(buf_addr_space const double * restrict xy_buffer,
                      buf_addr_space char * restrict xy_bs_out)
{
    size_t gid = get_global_id(0);
    size_t group_id = get_group_id(0);
    size_t local_id = get_local_id(0);

    __local double input_cache[GROUP_SIZE * BLOCK_ITEMS];
    __local char output_cache[GROUP_BYTES];

    size_t g_offset = group_id * GROUP_ITEMS + local_id * BLOCK_ITEMS;
    size_t l_offset = local_id * BLOCK_ITEMS;

    #pragma unroll
    for(int i = 0; i < BLOCK_ITEMS; i++)
        input_cache[l_offset + i] = xy_buffer[g_offset + i];

    ...
}

Results Comparison:

Both option 1 and option 2 work correctly, but option 2's performance is worse than option 1's: with an 8192 x 8192 input, option 2 is 3-4 s slower. I guess the reason is that __local memory is not inferred as registers but as BRAMs. Do you think so? (question 2)


Another related question: if multiple work-items access consecutive addresses, will these accesses be coalesced? (question 3)

If so, that means all accesses to the 64-block input_cache in option 2 can be coalesced. By contrast, each work-item in option 1 has its own private one-block input_cache, and these private caches are not contiguous in the address space, so coalescing may not happen there. (Actually, I feel that memory coalescing needs consecutive addresses on BOTH sides, global memory and on-chip cache, not just one. Please correct me if I'm wrong here.)


In addition, what's the difference between a 1D work group and a 2D work group (as in your code)? Logically there seems to be no difference, except that a 2D work group provides 2D coordinates. For data indexing, a 2D work group can correspond to a contiguous data region (say the first row of an NxN matrix) or a non-contiguous one (say the upper-left N/2 x N/2 block); the latter does not provide better access performance than the former. If you could show an example that is only suitable for a 1D, 2D, or 3D group, that would be great.

Regarding SIMD, the compiler always complains about my kernel with "branching is thread ID dependent ... cannot vectorize" (as you know, my kernel is quite big), so I cannot use this optimization. Given that SIMD is applied to the whole kernel, is there any local SIMD optimization that can be applied inside a kernel (besides loop unrolling)?


See next reply (A message showed my text is too large) :(


Continue here:


3 An interesting discovery

Please see the following cache-filling code from option 1 (it is only used in option 1, for an obvious reason: it copies only a single data block).

#ifdef VECTOR_TYPE
    ((double16*)input_cache)[0] =
        ((buf_addr_space double16*)(xy_buffer + g_offset))[0];
#else
    #pragma unroll
    for(int i = 0; i < BLOCK_ITEMS; i++)
        input_cache[i] = xy_buffer[g_offset + i];
#endif

There are two ways to fetch data from global memory into the cache: use an OpenCL vector type, or use a for loop with loop unrolling as a performance guarantee. According to my tests, both methods perform very similarly, but they generate very different resource utilizations:

With "for loop" with loop unrolling: Logic utilization (68%), ALUTs (46%), Dedicated logic registers (27%), Memory blocks (24%), DSP blocks (5%);

With vector type: these numbers are 59%, 39%, 24%, 24%, 5%, respectively.


Note: I have two kernels, and each has two such code snippets (for input_cache and output_cache), so there are four such snippets in total. The output_cache snippet is:

#ifdef VECTOR_TYPE
    ((buf_addr_space double16*)((buf_addr_space double*)xy_bs_out + g_offset))[0] =
        ((double16*)output_cache)[0];
#else
    #pragma unroll
    for(int i = 0; i < BLOCK_ITEMS; i++)
        ((buf_addr_space double*)xy_bs_out)[g_offset + i] = ((double*)output_cache)[i];
#endif

This was beyond my expectation: I thought using the vector type might bring higher performance, but it turned out not to; instead, it significantly reduces resource utilization.


4 Convert a "for loop" to a vector operation

Please see the following code from a function whose arguments I have omitted; iblock (int64_t) and fblock (double) are among those arguments.

double s = dequantize_double(1, emax);

#pragma unroll
for(int i = 0; i < BLOCK_ITEMS; i++)
    fblock[i] = (int64)(s * iblock[i]);

Since BLOCK_ITEMS is 16, I was wondering whether I could use double16 to replace this for loop. Ideally, such a vector operation would be written as:

((double16*)fblock)[0] = s * ((long16*)iblock)[0];

(OpenCL allows a scalar to multiply a vector.)

However, the compiler reported an error saying it cannot convert long16 to double16 (no implicit conversion happens). So I rewrote it as:

((double16*)fblock)[0] = s * convert_double16_sat(((long16*)iblock)[0]);

But unfortunately, the function "convert_double16_sat" is not recognized by the compiler. I also tried other types and found that "convert_float8_sat" and "convert_float16_sat" are not recognized either, while "convert_ulong16" is.

I have already added the line "#pragma OPENCL EXTENSION cl_khr_fp64 : enable" to my .cl file. Meanwhile, note that the above linked document is for OpenCL 1.0 (Intel claims that they have implemented OpenCL 1.0 completely in their FPGA SDK). I was wondering if I did anything wrong? If not, how can I use vector operations without the convert_destType<_sat> functions?


5 How to get the max/min value efficiently?

I have the following code, which searches for the maximum among BLOCK_ITEMS values. Though I can use loop unrolling to accelerate it, is there any more efficient way (or circuit) to do this? Finding the max/min value is a very common operation.

double max = 0, f;

#pragma unroll
for(int i = 0; i < BLOCK_ITEMS; i++) {
    f = fabs(p[i]);
    if(max < f)
        max = f;
}

Thank you!


Valued Contributor II

1. Okay, now I understand better how your code works. The new question that arises is, is there any overlapping between the blocks that are processed by each work-item? i.e., are there any redundant/repeated memory accesses between different work-items which can be predictably cached and reused? If not, then having an on-chip cache will likely not make any difference.



>They are private memory (should be inferred as registers, but not sure ??).

Depends on size and memory access pattern of the buffer.

question 1: The size of your buffers is quite small in this case and independent of work-group size (unlike the LUD code), hence it is not going to be a problem. The implemented size of on-chip buffers depends on the size of each instance and the total number of reads and writes from and to that buffer, which determines the replication factor. The “thread capacity” in this case makes no difference since no two work-items will perform the same operation at the same time (unless you use SIMD). You can check Section of my thesis for some more info. Further buffer replication is also performed by the compiler for work-group pipelining in the same compute unit.


>So one or more than one groups actually make no big difference.

It can make a big difference in cases where you have a lot of barriers and big chunks of the pipeline would be left unused unless you have a lot of work-groups running concurrently in the same compute unit. It also makes a huge difference when you have separated computation from memory accesses. (Section of my thesis, paragraph 3)


question 2: I would need to see the full code of both cases to judge properly; however, your code is likely memory-bound and if there is no actual data sharing between work-items and having a cache does not reduce global memory traffic, the larger cache of the second case will just further serialize the computation and memory accesses, requiring even more work-groups to properly utilize the pipeline which could instead reduce performance if there are not enough work-groups in your code. Implementation using Block RAM or register will not have any performance difference unless the Block RAM-based implementation results in stallable memory accesses; check the HTML report for details.


question 3: Coalescing only happens when you use unrolling or SIMD over consecutive accesses. No coalescing is done in other cases. You can check the memory access port sizes in the HTML report to see if any coalescing is happening. (Using vector types also results in large memory ports, but that is not “coalescing” per se.)


>Actually I feel that memory coalescing needs consecutive addresses from BOTH sides: global memory and on-chip cache, not just one side. Please correct me if I'm wrong here.

Coalescing is not required for on-chip memory since such buffers will be replicated if the accesses are not consecutive; however, if the accesses are consecutive also on the on-chip memory side and replication is avoided, the area usage will be obviously much lower.


>In addition, what's the difference between using 1D work group and using 2D work group (like your code)?

Loops have higher area overhead in NDRange kernels compared to replacing the same loop with an extra NDRange dimension. Moreover, traversing a loop is done sequentially, while when it is replaced with an extra dimension, this limitation does not exist and the run-time scheduler has a lot more work-items at hand to re-order and minimize pipeline bubbles.
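This point can be illustrated with a schematic OpenCL sketch (not compiled; the kernel names scale_1d/scale_2d are hypothetical): an inner loop over a row can be replaced by a second NDRange dimension, so the scheduler sees width*height independent work-items instead of height work-items each iterating width times.

```c
/* Variant A: 1D NDRange with an inner loop (sequential traversal,
   plus the area overhead of the loop hardware) */
__kernel void scale_1d(__global float *buf, int width, float s) {
    int y = get_global_id(0);
    for (int x = 0; x < width; x++)
        buf[y * width + x] *= s;
}

/* Variant B: the loop becomes a second dimension; launched with a
   2D global range of (width, height), the run-time scheduler is free
   to re-order work-items and minimize pipeline bubbles */
__kernel void scale_2d(__global float *buf, int width, float s) {
    int x = get_global_id(0);
    int y = get_global_id(1);
    buf[y * width + x] *= s;
}
```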


> Given that SIMD is applied to the whole kernel, is there any local SIMD optimization inside a kernel (besides loop unrolling)?

Well, apart from SIMD, it is more or less just vectorization in form of loop unrolling or vector types.


3. That is interesting, I wouldn’t have expected much of a difference in this case. Have you checked the area report to see where the area usage difference is coming from? It could also be from Intel’s private cache (no mention of it anymore in the new HTML report). Also, have you checked to make sure the accesses are coalesced properly in the unrolling case? (Check port sizes in the System Viewer tab of the report)


4. Actually, I wanted to ask why you are using double types in the first place? I would assume a lossless data compression/decompression algorithm should use integer data. Maybe you should just convert everything to long. If you check the documentation, they have a section where it is clearly described what parts of the OpenCL standard are implemented. You might also be able to use a custom “union” with both double16 and long16 to work around this issue.
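One way to read the union suggestion (a schematic OpenCL sketch, not compiled, and a variation that pairs each vector type with a scalar array rather than putting double16 and long16 in the same union): the wide type is used for a single vector load/store, while the long-to-double conversion is done elementwise through the scalar view.

```c
typedef union {
    long16 v;      /* one wide load */
    long   s[16];  /* elementwise access */
} lvec16;

typedef union {
    double16 v;    /* one wide store */
    double   s[16];
} dvec16;

/* ... inside the kernel, with iblock/fblock as in the question ... */
lvec16 in;
dvec16 out;
in.v = ((long16 *)iblock)[0];        /* single vector load */
#pragma unroll
for (int i = 0; i < 16; i++)
    out.s[i] = s * (double)in.s[i];  /* scalar conversion per element */
((double16 *)fblock)[0] = out.v;     /* single vector store */
```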


5. What you have done already should create a balanced chain of MUXes and give a good balance between area and performance; however, comparison of large data types is generally expensive on FPGAs since implementing MUXes requires a lot of area and tends to hurt operating frequency quite a bit, too.


Thanks for the explanations!


1 No. The block processed by each work-item does not overlap with any other block; that is, there is no data sharing between work-items.

Basically, each block is processed through several steps by the work-item. No repeated accesses or reuse happen to them.

Though we know that, most of the time, a program benefits most from a cache when its memory accesses show a lot of data reuse, in some rare cases a program can also benefit from the cache's short latency.


At first glance, it seems there should be no difference between having and not having a cache if no data reuse happens. However, in my case, while a block is being processed, the bitstream needs to be read/written very frequently. Please see the following code example:

Each "statement x" does some calculation or local-variable assignment, so it executes quickly. However, some memory access statements are scattered among them. If these write statements go to global memory, they become the bottleneck of the execution flow. If we use a cache, the write latency is reduced and the whole sequence of statements can finish earlier.

Please correct me if my understanding is not correct or misses something. Thanks.

...
statement 1
statement 2
statement 3
write_bitstream_1bit
statement 4
statement 5
write_bitstream_3bit
statement 6
statement 7
statement 8
write_bitstream_5bit
statement ...


2 I'll go over this part (question 1 and 2) after I read more about your thesis. I think I understand partial contents here.


You are right about "having a cache does not reduce global memory traffic" without data sharing. Speaking of this, a cache has two functions: 1) reducing memory traffic — the more data is shared, the more traffic is reduced; and 2) reducing access latency. So without data sharing we may lose benefit 1), but can still gain from 2).

So I don't understand this statement: "the larger cache of the second case will just further serialize the computation and memory accesses". Since no interference happens among work-items, why and how does this "serialization" happen?


If coalescing only happens when using loop unrolling or SIMD, I think this limits its use. For example, suppose I have N work-items and each one needs to access one item in a global array (say buffer[gid], where gid is get_global_id()), and assume the kernel cannot use SIMD. Without coalescing, there will be N separate (individual) memory transactions. If the compiler could identify such cases and coalesce across work-items, there would be fewer transactions, right?


By saying "Loops have higher area overhead in NDRange" and "traversing a loop is done sequentially", do you mean a regular loop inside the kernel that can be rewritten as an extra dimension? Could you show me a typical example?


3 After checking the area report, I was surprised by where the area usage difference comes from.

First, I show you the complete compress kernel as follows:


#define BLOCK_ITEMS 16
#define MAX_SEG (2048*2048)
#define BLOCK_SIZE (BLOCK_ITEMS * 8)

#define attr_max_wg __attribute__((max_work_group_size(MAX_SEG)))
#define attr_setup attr_max_wg

#define VECTOR_TYPE

attr_setup
__kernel void compress(buf_addr_space const double * restrict xy_buffer,
                       buf_addr_space char * restrict xy_bs_out,
                       int seg_blocks, int xy_i)
{
    size_t gid = get_global_id(0);

    double input_cache[BLOCK_ITEMS];
    char output_cache[BLOCK_SIZE];

    size_t g_offset = gid * BLOCK_ITEMS;

#ifdef VECTOR_TYPE
    ((double16*)input_cache)[0] = ((buf_addr_space double16*)(xy_buffer + g_offset))[0];
#else
    #pragma unroll
    for(int i = 0; i < BLOCK_ITEMS; i++)
        input_cache[i] = xy_buffer[g_offset + i];
#endif

    zfp_stream zfp;
    bitstream stream;

    size_t block_size = BLOCK_ITEMS * sizeof(double);

    stream_open_device(&stream, output_cache);
    zfp_stream_open_device(&zfp, &stream);
    zfp_stream_set_accuracy(&zfp, 1e-3);

    int64 iblock[BLOCK_ITEMS];
    uint64 ublock[BLOCK_ITEMS];

    zfp_encode_block_double_2(&zfp, input_cache, iblock, ublock);
    stream_flush(&stream);

#ifdef VECTOR_TYPE
    ((buf_addr_space double16*)((buf_addr_space double*)xy_bs_out + g_offset))[0] = ((double16*)output_cache)[0];
#else
    #pragma unroll
    for(int i = 0; i < BLOCK_ITEMS; i++)
        ((buf_addr_space double *)xy_bs_out)[g_offset + i] = ((double *)output_cache)[i];
#endif
}

By commenting/uncommenting the "#define VECTOR_TYPE", we can choose between the two cache-filling styles. Surprisingly, the report shows that the area consumption difference mainly comes from line 20, rather than lines 24 - 30 as I had thought.


If I use VECTOR_TYPE, line 20 consumes: Select (x7): 1539 (0%) ALUTs, 768 (0%) FFs, 0 (0%) RAMs, 0 (0%) DSPs.


If I don't use VECTOR_TYPE, line 20 consumes: Select (x66): 58044 (8%) ALUTs, 17408 (1%) FFs, 0 (0%) RAMs, 0 (0%) DSPs.


I don't know what "Select" means, but I know xN means the component is replicated N times. Since line 20 is just a variable definition, I really have no idea what the compiler did here... I hope you can provide some more valuable analysis.


In addition, I have never heard of "Intel's private cache". What is that?


I checked the System Viewer tab of the loop unrolling version.

line 29 shows the following information:



Width: 1024 bits, Type: Streaming, Stall-free: No, Start Cycle: 1, Latency: 1, Reference: See Best Practices Guide: Load-Store Units for more information


line 52 shows the following information:


Width: 1024 bits, Type: Burst-coalesced, Stall-free: No, Start Cycle: 2, Latency: 2, Reference: See Best Practices Guide: Load-store Units for more information


I did not find the keyword "port sizes" in the System Viewer tab. I'm still using aoc 17.1.1; maybe your version shows "port sizes"?


4 This is decided by zfp itself. As you know, I'm implementing zfp, which is a lossy floating-point (de)compression algorithm. Early versions only supported the float and double types; recent versions also added support for integers. Currently I'm only working on the double type.

I'm not sure the resulting correctness AND precision can stay at the same (or a close) level as the original version if I "convert everything to long".

By saying "If you check the documentation", do you mean the zfp documentation or the OpenCL one?


I'll think about your workaround.


5 A different implementation comes to mind that could reduce the computation latency.

Our goal is to choose the max of 16 values. We first split them into 8 pairs and choose the larger value from each pair. Then we get 4 pairs and continue, then 2 pairs, and so on. This process forms a multi-level tree, and at each level all pairs can be compared in parallel. So theoretically we need log2(16) = 4 cycles of latency (assuming each comparison takes one cycle). In comparison, the original code I listed in my last reply needs 16 cycles of latency (it looks so; I'm not very sure). I have not verified this idea, so please correct me if you find problems with it. Thanks!

double max = 0, f1, f2, f_level1[8], f_level2[4], f_level3[2];

#pragma unroll
for(int i = 0; i < 8; i++) {
    f1 = fabs(p[2 * i]);
    f2 = fabs(p[2 * i + 1]);
    f_level1[i] = f1 > f2 ? f1 : f2;
}

#pragma unroll
for(int i = 0; i < 4; i++) {
    f1 = f_level1[2 * i];
    f2 = f_level1[2 * i + 1];
    f_level2[i] = f1 > f2 ? f1 : f2;
}

#pragma unroll
for(int i = 0; i < 2; i++) {
    f1 = f_level2[2 * i];
    f2 = f_level2[2 * i + 1];
    f_level3[i] = f1 > f2 ? f1 : f2;
}

max = f_level3[0] > f_level3[1] ? f_level3[0] : f_level3[1];

Thank you!

Valued Contributor II

1. If the operation of work-items is completely independent, you do not need a shared cache. A set of private variables per work-item will be enough to hold intermediate values of each work-item until the final value is written back to global memory.


2. I am not sure what you mean by “reduce the access latency” here. If there is no data reuse, all data will need to be loaded onto the FPGA from global memory once, and then loaded again from the on-chip buffer, which will increase the latency. In fact, this is exactly why the compiler will remove local buffers that are only written and read once.


>So I don't understand this statement: "the larger cache of the second case will just further serialize the computation and memory accesses,". Since no interference happen among work-items, so why and how does the "serialization" happen?

Consider an NDRange kernel as a set of consecutive pipelines, each corresponding to a region between two consecutive barriers (with an implicit one at the start and another at the end of the kernel). Since each work-group can only reside in the region between two consecutive barriers at any given moment, only one of these pipelines will be occupied at any given time if only one work-group is running in the compute unit. Hence, the compiler allows different work-groups to occupy different pipelines of the same compute unit to increase compute unit efficiency. Loading data into a local buffer requires a barrier after the load and hence creates such a scenario. The larger the buffer, the more time it takes to fill, and hence more work-groups will be required to keep the other pipelines of the compute unit busy and achieve good compute unit efficiency.


>If coalescing only happens when using loop unrolling or SIMD, I think this limits its use. For example, if I have N work-items and each one needs to access one item in an global array (say buffer[gid], gid is get_global_id()). Assume the kernel cannot use SIMD. Without coalescing, there will be N memory separate (or say individual) transactions. If the compiler can identify such cases and allow coalescing across work-items, there will be less transactions, right?

In such cases you can easily reduce the size of your work-group by X, and create a fully unrolled loop of X in the NDRange kernel to allow access coalescing. For example, check this kernel:

There is no run-time coalescing happening here. Moreover, there is no guarantee of work-item ordering and hence, the memory access pattern that you are describing might not even result in consecutive accesses during actual run-time. Coalescing happens only at compile-time and when SIMD or unrolling is used.


>do you mean it is a regular loop inside the kernel but can be rewritten by another an extra dimension? Could you show me a typical example?

Yes, check this commit:


3. Can you package the “report” folder and attach it here?

>In addition, I never heard about "Intel's private cache". What's that?

Check “Best Practices Guide, Section 2.8.7. Load-Store Units, Load-Store Unit Modifiers, Cached”. The cache was called “private cache” in the area report before the new HTML report was introduced.


>I did not find the keyword "port sizes" in the System Viewer tab.

The “Width” in the report is the memory port size.


4. Sorry, I thought this was a lossless algorithm.

Documentation is Intel’s documentation. Check “Appendix A” of the “Programming Guide”.

A union will likely work in your case.


5. You probably do not need to worry about the operation “latency”. The OpenCL compiler optimizes for “throughput” by creating a very deep pipeline (i.e. sacrificing latency) and trying to keep it as busy as possible; if your application is latency-bound, you should not be using OpenCL at all. Indeed, a balanced tree can theoretically lower the latency but probably the only thing you need to care about in this case is the area usage. I would assume the compiler itself should be smart enough to generate a balanced tree if possible.


For 3, sure, I would like to send it to you via a private channel. I got your email from your papers; please confirm it is the correct one, or let me know if you want to use another email address or some other communication method. We will write a paper based on my current work, though there is still a lot to do, so it may not be appropriate to post my latest code here. I appreciate your understanding! (BTW, personally, I'll be happy to list your name in the "Acknowledgements" if the paper gets accepted somewhere, with good luck :) Of course this needs your consent.)