Highlighted
Honored Contributor I
956 Views

Local variables in Kernel

Hi! 

 

I was getting bad results in my kernel, so I decided to printf the local variables marked in bold and blue in the code.

 

At every iteration of the second loop, the value from the previous iteration is kept instead of a new local variable being created. What's going wrong?

 

P.S. In red I have a memory dependency. How do I resolve this?

 

int total_gin;
for(int w=0; w < row; w++)
{
    int fcont = 0;
    int row_mat = w*col;
    for(i = 0; i < (col >> 3); i++)
    {
        int aux = i * 8;
        int lcontsum = 0;
        int ccol_co;
        int copl=0;
        int aux_tt;
        int lcont;
        #pragma unroll 8
        for(int j=0; j < 8; j++)
        {
            int aux_g1 = aux + j;
            lcont = loc_col & (g1 != 0);
            lcontsum += lcont;
            if(lcont)
            {
                ccol_co = j+1;
                copl++;
                aux_tt=1;
            }
        }
        for(int bb=0; bb < 8; bb++)
        {
            total_gin += aux_tt;
            int ax_col = (aux + (ccol_co-1));
            if(ccol_co != 0)
            {
                in_cols] = w;
            }
        }
        fcont += lcontsum;
    }
}
9 Replies
Honored Contributor I

 

--- Quote Start ---  

At every iteration of the second loop, the value from the previous iteration is kept instead of a new local variable being created. What's going wrong?

--- Quote End ---  

 

Most C compilers, including the one behind Altera's emulator, do not actually re-create scoped variables on each iteration; they keep reusing the same storage, so the value from the previous iteration is still there. When you compile for hardware execution, however, the variable scope is taken into account. Either way, you MUST initialize all your scoped variables. In your code, the first assignment to those two variables (ccol_co and aux_tt) is conditional, so you can get incorrect output whenever the condition is false and the variable is never assigned but is still used in the statements that follow. Depending on how your algorithm works this might never happen, but you should still make sure that the missing initialization can never cause trouble.
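To make the initialization point concrete, here is a minimal plain-C sketch, not the poster's kernel. The function name last_set_per_chunk and the flags/out arrays are made up for illustration; only ccol_co and aux mirror names from the posted code. The idea is simply that a variable whose only other assignment is conditional gets a defined value at the top of every iteration.

```c
/* Hypothetical helper (not from the thread): for each 8-wide chunk of
 * `flags`, store the 1-based position of the last non-zero entry in
 * out[chunk], or 0 if the chunk has no non-zero entry. */
void last_set_per_chunk(const int *flags, int num_chunks, int *out)
{
    for (int i = 0; i < num_chunks; i++) {
        int aux = i * 8;
        int ccol_co = 0;   /* re-initialized on every iteration */

        for (int j = 0; j < 8; j++) {
            if (flags[aux + j] != 0) {
                ccol_co = j + 1;   /* conditional assignment, as in the posted loop */
            }
        }

        /* Safe to read: ccol_co is 0 when the condition was never true,
         * instead of whatever the previous iteration left behind. */
        out[i] = ccol_co;
    }
}
```

With the initializer in place, a chunk that never satisfies the condition reports 0 in both the emulator and the hardware build, which is what the later `ccol_co != 0` check in the posted code appears to rely on.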

 

 

--- Quote Start ---  

P.S. In red I have a memory dependency. How do I resolve this?

--- Quote End ---  

 

 

I would guess the "total_gin" variable is implemented using Block RAMs due to its size, and since the access latency of Block RAMs is NOT one clock cycle, you will get load/store dependencies. To get single-cycle accesses, you should either use a smaller buffer that can be implemented using registers or, if your algorithm allows, convert that buffer to a shift register. If neither can be done, switching to NDRange could help, since the initiation interval (II) is adjusted at runtime by the scheduler and hence could allow better performance than the fixed II in the equivalent single work-item kernel.
Honored Contributor I

 

 

 

Brilliant mate, I didn't know that the emulator would not redefine the variable inside the loop.

 

For total_gin I have this info:

Stall-free: Yes | Loads from: total_gin | Start-Cycle: 1 | Latency: 3

 

 

How can I implement a shift register in OpenCL using the total_gin info?
Honored Contributor I

 

--- Quote Start ---  

For total_gin I have this info:

Stall-free: Yes | Loads from: total_gin | Start-Cycle: 1 | Latency: 3

--- Quote End ---  

 

 

This info doesn't show what type of storage resource is used for implementing the buffer. That info can be found in the "source view" tab of the area report. 

 

 

 

--- Quote Start ---  

How can I implement a shift register in OpenCL using the total_gin info?

--- Quote End ---  

 

 

Check "Intel FPGA SDK for OpenCL Programming Guide, Section 5.10.1 - Inferring a Shift Register". Note that unless the accesses to the buffer are done in a shifting manner, you will not be able to implement that buffer as a shift register.
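As a rough illustration of what "accesses done in a shifting manner" can look like, here is a plain-C sketch of the commonly used shift-register accumulation pattern. It is not taken from the poster's kernel, whose access pattern to total_gin is not fully visible in the thread. The function name, the data array and the depth II_CYCLES are assumptions for the example; in an actual kernel the two short inner loops would normally be fully unrolled (e.g. with #pragma unroll) so the array can map to registers.

```c
#define II_CYCLES 4   /* assumed depth; would be tuned to the reported latency */

/* Hypothetical example: sums n integers through a small shift register
 * instead of a single accumulator that is read and written every iteration. */
int accumulate_with_shift_register(const int *data, int n)
{
    int shift_reg[II_CYCLES + 1];

    /* Start with all partial sums at zero. */
    for (int i = 0; i < II_CYCLES + 1; i++) {
        shift_reg[i] = 0;
    }

    for (int i = 0; i < n; i++) {
        /* Read the oldest partial sum (head), write the new one to the tail. */
        shift_reg[II_CYCLES] = shift_reg[0] + data[i];

        /* Shift every element one position toward the head. */
        for (int j = 0; j < II_CYCLES; j++) {
            shift_reg[j] = shift_reg[j + 1];
        }
    }

    /* Combine the II_CYCLES independent partial sums into the final result. */
    int total = 0;
    for (int j = 0; j < II_CYCLES; j++) {
        total += shift_reg[j];
    }
    return total;
}
```

The point of the pattern is that each iteration depends on a value written II_CYCLES iterations earlier rather than on the immediately preceding one. Whether it applies here depends on total_gin really being updated in this read-oldest, write-newest fashion.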
Honored Contributor I

Thanks HRZ, 

 

Instead of opening a new thread, I have another question.

 

I have two kernels that share the same command queue. For kernel1 I enqueue some buffers, let's say:

 

buf1 = enqueueWrite(); 

buf2 = enqueueWrite(); 

buf3 = enqueueWrite(); 

runkernel1(kernel1,buf1...buf3); 

 

then it says on the console:  

Reprogramming device [0] with handle 1 

 

For kernel2 I pass buf1, which I enqueued for kernel1, as an argument and enqueue new buffers:

buf4 = write(); 

... 

buf7 = write(); 

runkernel(kernel2, buf1,buf4..buf7); 

 

then it says on the console:  

Reprogramming device [0] with handle 33 

 

After multiple calls to kernel2 (inside a loop on the host), I get this error:

 

Assertion failed: mem, file acl_mem.c, line 466 

 

What is the possible cause?
Honored Contributor I

You might have a memory leak somewhere and you are running out of host or device memory. Pay careful attention to all OpenCL-related buffers, events, etc. and make sure everything that is not reused, is freed.

Honored Contributor I

 

--- Quote Start ---  

You might have a memory leak somewhere and you are running out of host or device memory. Pay careful attention to all OpenCL-related buffers, events, etc. and make sure everything that is not reused, is freed. 

--- Quote End ---  

 

 

That's what is strange about my code. For example:

Kernel 1 code  

 

// Create and write buffer
buffer_input_matrix = clCreateBuffer(context, CL_MEM_READ_ONLY, datasize, NULL, &status);

/* Write data from the input arrays to the buffers */
status = clEnqueueWriteBuffer(cmd_queue, buffer_input_matrix, CL_TRUE, 0, datasize, input_matrix_dis, 0, NULL, &int_str_input_matrix_enqueue);

runKernel(kernel1,buffer_input_matrix..);

 

After that I do not write buffer_input_matrix again and reuse it for kernel2. How can I free the memory on the device? By releasing the memory object? Imagine I only want to copy the data to the device for one run, without keeping it stored in device memory afterwards.

 

Edit 1: I got this error when creating the buffer: -5 (CL_OUT_OF_RESOURCES)
Honored Contributor I

Buffers created using clCreateBuffer can be released using "clReleaseMemObject".

Honored Contributor I

 

--- Quote Start ---  

Buffers created using clCreateBuffer can be released using "clReleaseMemObject". 

--- Quote End ---  

 

 

Now I create the buffers only once and update them inside the loop by writing new values to them. Before, I was creating new buffers with the same name inside the loop and releasing them after the kernel executions, but that still gave me the error.

 

Thanks again HRZ :cool: 

 

What's the difference between EnqueueMap and EnqueueWrite? Do both keep the data stored in device memory?
Honored Contributor I

 

--- Quote Start ---  

What's the difference between EnqueueMap and EnqueueWrite? Do both keep the data stored in device memory?

--- Quote End ---  

 

 

https://stackoverflow.com/questions/22057692/whats-the-difference-between-clenqueuemapbuffer-and-cle... 

 

For non-SoC FPGA devices, both functions will copy the buffer from host to device. For SoC FPGAs with shared memory, however, using EnqueueMap will allow you to avoid double buffer allocation.