Valued Contributor III
1,239 Views

Kernel performance with profiling

Hello again, guys.

I'm struggling to understand the profiler results for two versions of a kernel: one with an unroll factor of 128 and another with 32.

The unroll factor 32 version outperforms the 128 version by 5 seconds for an input matrix of 20000 x 1000.

 

Stats are (32 vs. 128):

Activity: 96% | 25%
Memory (global) BW: 15182 MB/s | 11885 MB/s
Kernel Clock Freq: 244 MHz | 185 MHz
Stall %: 14.49% | 15.1%

 

 

I don't know what is happening, because the stall percentage barely increases for the 128 version, while the faster version also has the better memory bandwidth...
15 Replies
Valued Contributor III

I have explained the math behind saturating memory bandwidth here (http://www.alteraforum.com/forum/showthread.php?t=57099&p=232613#post232613). An unroll factor of 32 is likely more than enough to fully saturate the memory bandwidth; anything more than that will result in a lot of memory contention and lowered performance. You are losing performance for two reasons: lowered operating frequency, and lowered memory bandwidth due to a higher amount of contention. Trying to decipher the numbers reported by the profiler generally results in more confusion; I have encountered many cases in which the numbers simply do not make sense and contradict each other or the timing results. My recommendation is not to rely too much on those numbers.

Valued Contributor III

 


 

 

Thanks, mate! HRZ, you're my salvation ;)

My doubt was about the stall parameter; it would have made sense for it to be much higher because of the memory contention...
Valued Contributor III

 


 

 

Yes, that is exactly what I would expect to see here as well. However, I have seen so many cases where the stall percentage simply does not make sense that I eventually stopped relying on the profiler. The only numbers that generally seem to make sense are the reported memory bandwidth and the burst size.
Valued Contributor III

 


 

 

Case solved then, thanks again HRZ B)
Valued Contributor III

 


 

 

Sorry for reviving this topic, but how can I calculate the bandwidth value?

I can't arrive at the right value from the profiler info for the unrolled loop.

Also, I have an LSU latency of 144 for that read from global memory.

http://i64.tinypic.com/o6wr2b.png
Valued Contributor III

"Intel FPGA SDK for OpenCL Best Practices Guide, Section 4.3.1" outlines how the profiler calculates the bandwidth values. These values largely depend on the frequency of the accesses (called Occupancy in the profiler which depends on how frequently the access is triggered in the loop), their size (which depends on unroll size and alignment), access contention (which depends on number of accesses/memory ports) and kernel operating frequency. Predicting/calculating these values by hand will not be very easy.

Valued Contributor III

 


 

 

Hmm, OK, I'll discard calculating the BW by hand.

Do you know how the LSU is connected to the memory? Is it connected directly to the memory controller? It gives me different widths for different unroll factors...

Also, the report says "latency LSU"; is that in clock cycles? What does this latency mean?

Thanks for always helping me, bud :)
Valued Contributor III

The compiler combines unrolled consecutive accesses into larger coalesced accesses to improve memory throughput. I am not exactly sure where the LSU sits in the design; most likely between the kernel and the memory controller.

Regarding latency, my understanding is that the latency reported for LSUs is the number of registers the compiler inserts into the pipeline to absorb stalls from memory accesses. If an access stalls for fewer clocks than that, only bubbles will be inserted into the pipeline; if the stall lasts longer, the whole pipeline will be stalled. Please note that none of this is really documented anywhere, and I could well be wrong.
Valued Contributor III

 


 

 

That's why I cannot infer too much about the LSU; I checked the memory design and found nothing about this resource. Is the burst size of the LSU the number of requests it can coalesce, in bytes?
Valued Contributor III

 


 

If you are talking about the number reported by the profiler, I think it is in 32-bit words, not bytes.
Valued Contributor III

 


 

 

The profiler report says that the max burst size is 16... but 16 what? Bits, bytes, number of requests?

 

 

 

From the profiler documentation:

"Burst Size: The average burst size of the memory operation. If the memory system does not support burst mode (for example, on-chip RAM), no burst information will be available."

For global memory, my report shows: Average Burst Size = 7.6 (Max Burst = 16)

 

Valued Contributor III

Since the maximum read/write size per kernel clock cycle per DDR3/4 memory bank is 512-bits (standard value hardcoded in the BSP), for a total of 1024 bits for two banks, I have so far assumed that a burst size of 16 equals a read/write size of 1024 bits, hence the unit being 32-bit words. I could be wrong, though.

Valued Contributor III

 


 

 

That seems logical :) Last question (for now, I guess): I'm using the DE5-Net board (http://www.terasic.com.tw/cgi-bin/page/archive.pl?language=english&categoryno=158&no=526&partno=2).

However, when I go to the BSP board spec, it says that I only have 2 GB for each bank (I'm assuming each DIMM socket is a bank). My question is: is my DDR3 memory clock frequency 933 MHz?
Valued Contributor III

The standard DE5-Net board has two 2GB DDR3 banks, each running at 1600 MT/s (800 MHz clock, double data rate). There is another variant of the board with two 4GB DDR3 banks, but I am not sure how fast its memory is (probably the same speed as the 2GB one). The first variant must be the board you have.

Beginner

I am asking this question because I think it might be related. In the profiler report I get an average read burst size of 1 and an average write burst size of 1, with an optimal possible of 16. I am not sure how to utilize that burst size; any suggestion on how to achieve it would be really helpful.

Just to give more information, I am reading 16 data items at a time and storing them in a shift register.

 
