
About Altera OpenCL Compilation

Altera_Forum
Honored Contributor II

Hello everyone, 

 

I am just wondering if anyone knows the difference between compiling kernels on Linux vs. Windows (in terms of run-time performance and total compilation time). I know I can use an .aocx file compiled on Windows to run on Linux, but there sometimes seems to be a performance difference. I am trying to figure out which OS is better for compilation. 

 

Also, I am wondering if kernels that utilize task parallelism (1 work-item per work-group) usually take longer and use more memory to compile than kernels utilizing data parallelism (many work-items per work-group). 
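
Just to make the distinction concrete, here is the kind of kernel style I mean in each case (a made-up sketch, not my actual code): 

// Task-parallel style: a single work-item does all the work in a loop.
__kernel void scale_task(__global const float *restrict in,
                         __global float *restrict out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = 2.0f * in[i];
}

// Data-parallel (NDRange) style: many work-items, one element each.
__kernel void scale_ndrange(__global const float *restrict in,
                            __global float *restrict out)
{
    int gid = get_global_id(0);
    out[gid] = 2.0f * in[gid];
}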

 

In addition, I am wondering what the latency of accessing local memory is in terms of clock cycles, or whether the latency depends heavily on the size of the local memory and how it is accessed. 

 

Thanks! 

 

Ryan
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

Thank you! I'll try. What if I have 2D work-groups instead of for loops, where each work-item copies one item from global memory to local memory? Would the compiler automatically merge the memory accesses? Is using "num_simd_work_items" the only way to optimize the kernel? 

--- Quote End ---  

 

 

"num_simd_work_items" is an effective way for optimizing kernels. It is essentially similar to unrolling loops: the amount of hardware resources are replicated to increase throughput.  

 

There are two types of merging (i.e., coalescing) of memory accesses.  

 

1) Compile-time coalescing, performed by the compiler: this is when the compiler detects load/store instructions (local or global) in the kernel that access consecutive addresses and merges them. This may increase fmax, because it simplifies the design (fewer load/store instructions), and increase throughput, because fewer memory requests are sent. 

 

2) Dynamic coalescing, performed on the FPGA: this is when the same "global" load/store instruction sends consecutive memory requests; these requests are merged by the hardware before they are sent to memory, increasing throughput. 

 

When you unroll loops or use "num_simd_work_items", you can take advantage of both #1 and #2. If you do not, then only #2 applies, and only for global accesses.
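
To illustrate both cases with a made-up fragment (the buffer names and trip count are arbitrary): 

__kernel void row_sum(__global const float *restrict A,
                      __global float *restrict out)
{
    int base = get_global_id(0) * 256;
    float sum = 0.0f;
    #pragma unroll 4
    for (int i = 0; i < 256; i++) {
        // Unrolled: the 4 loads per unrolled group hit consecutive
        // addresses and can be merged at compile time (#1).
        // Without the pragma: the single load still issues consecutive
        // requests across iterations, which the hardware can coalesce (#2).
        sum += A[base + i];
    }
    out[get_global_id(0)] = sum;
}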
Altera_Forum
Honored Contributor II

Thank you!  

 

If I have a kernel with one-dimensional work-groups that needs to cache a 2D data block from global to local memory (using a for loop), should I use the for-loop index as the row index and the work-item ID as the column index, or the other way around? The data block is stored in row-major form. I intend to use unrolling and SIMD to increase throughput, but I am not sure whether it is more effective to merge memory accesses indexed by the loop index or by the work-item ID. 

 

BTW: Regarding compilation, I am just wondering if it's OK for me to compile AOCL kernels on Windows but execute them on Linux? Are there any disadvantages to executing kernels compiled on a different OS? 

 

Also, given enough DRAM, is it possible to compile multiple kernels on the same (Linux) workstation at the same time, with each compilation started in a separate "screen" session? My workstation has a 6-core/12-thread processor, but the compiler only uses more than three cores during the timing-analysis stage, so I am trying to see if there is a way to save some time. 

 

 


Altera_Forum
Honored Contributor II

This is an interesting and deep question. The right answer may depend on many factors, but mainly the size of your kernel, the size of your buffers, and your SIMD parameter. I will try to give some insights. 

 

For global loads, where N and M are constants: 

 

Case-G1: 

for(i = 0; i < N; i++) { 

val = A[i * M + get_global_id(0)]; 

If you unroll this loop 4 times, you will get 4 super-efficient loads (e.g. A[4 * M + gid]), regardless of the SIMD parameter. They will be super-efficient because in each load, consecutive work-items access consecutive addresses, which FPGAs excel at. However, if you unroll the loop a large number of times, you may start slowing down your kernel because of the large number of loads. 

 

Case-G2: 

for(i = 0; i < M; i++) { 

val = A[get_global_id(0) * M + i]; 

If you unroll this loop, the consecutive accesses will be coalesced, and you will get wide, somewhat efficient loads. These loads will not be as efficient as the loads in Case-G1; however, you will have fewer of them because of the coalescing, which may be advantageous. On the other hand, if you use a large SIMD parameter, you will start increasing the number of loads again, because each SIMD lane will have its own load. 

 

For local stores, where N is a constant: 

 

Case-L1: 

for(i = 0; i < N; i++) { 

B[get_local_id(0)] = val; 

When you unroll the loop and use the SIMD parameter, each store will be as wide as the SIMD parameter. The larger the SIMD parameter is, the wider and more efficient the stores will be. 

 

Case-L2: 

for (int i = 0; i < N; i++) {
    B[get_local_id(0) * N + i] = val;
}

When you unroll the loop and use the SIMD parameter, the store width will be determined by the compiler; however, it can potentially be very wide (because of the unrolling), and wider than in Case-L1.
Altera_Forum
Honored Contributor II

Assuming your kernel allows you to pick a large SIMD parameter (no work-item-dependent branching, etc.) and you can unroll the entire loop (which removes the loop completely and is very resource-efficient), then you can try G1 and L1 for small N, and G2 and L2 for large N...
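
For reference, here is a hypothetical sketch of the small-N combination (Case-G1 load, Case-L1-style store, full unrolling, SIMD); the sizes (N = 8, M = 1024, work-group size 64) and the final use of the cached block are placeholders: 

__attribute__((reqd_work_group_size(64, 1, 1)))
__attribute__((num_simd_work_items(4)))
__kernel void cache_block(__global const float *restrict A,
                          __global float *restrict out)
{
    __local float B[8 * 64];   // 8 rows of the block, one column per work-item
    int lid = get_local_id(0);
    int gid = get_global_id(0);
    #pragma unroll             // full unroll: the loop disappears from the hardware
    for (int i = 0; i < 8; i++) {
        // Case-G1 load: consecutive work-items read consecutive addresses.
        // Case-L1-style store: SIMD lanes write consecutive local addresses.
        B[i * 64 + lid] = A[i * 1024 + gid];
    }
    barrier(CLK_LOCAL_MEM_FENCE);  // make the block visible to the whole work-group
    out[gid] = B[lid];             // placeholder consumption of the cached data
}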

Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

Thank you!  

 

BTW: Regarding compilation, I am just wondering if it's OK for me to compile AOCL kernels on Windows but execute them on Linux? Are there any disadvantages to executing kernels compiled on a different OS? 

 

Also, given enough DRAM, is it possible to compile multiple kernels on the same (Linux) workstation at the same time, with each compilation started in a separate "screen" session? My workstation has a 6-core/12-thread processor, but the compiler only uses more than three cores during the timing-analysis stage, so I am trying to see if there is a way to save some time. 

--- Quote End ---  

 

 

Q1: Yes, you should be able to generate .aocx files on Windows and execute them on Linux with no problem. 

 

Q2: Yes, you should be able to run multiple compilations in parallel; however, Quartus may require large amounts of RAM for big designs (possibly exceeding 10 GB), so the more compilations you run, the higher the risk of running out of memory... I doubt that you can run more than 2.
Altera_Forum
Honored Contributor II

Thank you so much for the detailed answer! 

 

I have one more question regarding local memory: if I have a kernel where all of the work-groups need to read a common block of data (stored in global memory) that was updated by a previously launched kernel, is it possible to load this block of data only once, in the first work-group, into some kind of local memory space, and then share it across all other work-groups, so that the kernel doesn't have to load it multiple times? I know that local memory cannot be shared across work-groups, but I am wondering if there is a mechanism in OpenCL that allows this kind of memory optimization. Basically, I am wondering if there is a kind of memory that can be shared across the work-groups of a single kernel and is faster than global memory. Thanks!
Altera_Forum
Honored Contributor II

To follow up on my last question: is it possible to use a channel to pass data from one work-group of a kernel to the next work-group of the same kernel, or is this just a bad way of programming?
