Re: Utilize M10k Memory block

Altera_Forum · ‎03-25-2018

Hi,

I have compiled my kernel and I see that it only uses 16% of the M10K memory. Is there a way I can use more of it?

Also, when I view my report, it says "loop sacrificed fmax to achieve II to 1." But the optimisation report looked fine; it says "Pipelined well. Successive iterations are launched every cycle."

Is there a way I can avoid that? I want to maximize performance (increase fmax).

Is 121 MHz good enough for Cyclone V? I do not have much data to benchmark against.

Altera_Forum · ‎03-25-2018

--- Quote Start ---

I have compiled my kernel and I see that it only uses 16% of the M10K memory. Is there a way I can use more of it?

--- Quote End ---

The compiler will use as many resources as required; trying to use more could actually reduce performance, since it complicates routing and reduces operating frequency.

M10k blocks are generally used for implementing large local memory buffers and FIFOs. The more local memory you use, the higher the M10k utilization will become.
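
As a rough illustration of how local memory size maps to M10K usage: each M10K block holds 10,240 bits, so a lower bound on the block count for a buffer is its size in bits divided by 10,240, rounded up. The helper name `m10k_lower_bound` is made up for this sketch, and the count reported by the compiler is usually higher because of port-width configuration, replication and banking.

```c
#include <stddef.h>

/* Capacity of one M10K block in bits (10 Kb). */
#define M10K_BITS 10240u

/* Hypothetical helper: lower bound on the number of M10K blocks needed to
 * hold `elems` elements of `elem_bits` bits each. It ignores port widths,
 * replication and banking, so the real usage can be higher. */
static size_t m10k_lower_bound(size_t elems, size_t elem_bits)
{
    size_t total_bits = elems * elem_bits;
    return (total_bits + M10K_BITS - 1) / M10K_BITS; /* round up */
}
```

For example, `m10k_lower_bound(4096, 32)` gives 13 blocks for a local buffer of 4096 floats.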

--- Quote Start ---

Also, when I view my report, it says "loop sacrificed fmax to achieve II to 1." But the optimisation report looked fine; it says "Pipelined well. Successive iterations are launched every cycle."

Is there a way I can avoid that? I want to maximize performance (increase fmax).

--- Quote End ---

You probably have a loop-carried dependency (feedback) somewhere in your code, forcing the compiler to create a large critical path to achieve an II of one, at the cost of lowered operating frequency.
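
Without seeing the kernel it is not possible to say where the dependency is, but a typical case is an accumulation. The sketch below is plain C, not the poster's code: `sum_direct` carries `sum` from one iteration to the next, while `sum_shift_register` shows one common way to relax the feedback by keeping several partial sums. `DEPTH` and both function names are assumptions for illustration; in an OpenCL kernel the inner loops would normally be fully unrolled with `#pragma unroll`.

```c
#include <stddef.h>

#define DEPTH 8  /* assumed number of partial sums; tune per design */

/* Direct accumulation: every iteration needs the `sum` produced by the
 * previous one, so for an II of 1 the update must finish in one cycle. */
static float sum_direct(const float *a, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Shift-register accumulation: each iteration reads a partial sum that was
 * written DEPTH iterations earlier, which loosens the feedback path. */
static float sum_shift_register(const float *a, size_t n)
{
    float shift[DEPTH + 1];
    for (size_t k = 0; k <= DEPTH; k++)
        shift[k] = 0.0f;

    for (size_t i = 0; i < n; i++) {
        shift[DEPTH] = shift[0] + a[i];     /* use the oldest partial sum */
        for (size_t k = 0; k < DEPTH; k++)  /* shift everything down by one */
            shift[k] = shift[k + 1];
    }

    float sum = 0.0f;                       /* final reduction of partials */
    for (size_t k = 0; k < DEPTH; k++)
        sum += shift[k];
    return sum;
}
```

Both functions add up the same elements; only the order of the floating-point additions differs, so results can differ slightly in the last bits for general inputs.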

--- Quote Start ---

Is 121 MHz good enough for Cyclone V? I do not have much data to benchmark against.

--- Quote End ---

For Cyclone V, 120 MHz is not too low, but it is also far from high. It is generally not hard to achieve 170-180 MHz on this device. You can try compiling Altera's reference OpenCL examples to see what operating frequency they achieve.

Altera_Forum · ‎03-25-2018

Thanks @HRZ, I thought using more M10K memory blocks could reduce the ALUT and FF usage.

Altera_Forum · ‎03-25-2018

I am afraid that is not usually the case.

Altera_Forum · ‎03-26-2018

@HRZ Sorry, but where can I view the FPGA SDRAM usage? There is 64 MB of SDRAM on the DE1-SoC.

Altera_Forum · ‎03-26-2018

--- Quote Start ---

@HRZ Sorry, but where can I view the FPGA SDRAM usage? There is 64 MB of SDRAM on the DE1-SoC.

--- Quote End ---

The BSP of that board does not support the SDRAM memory and hence, it cannot be used with OpenCL.

Altera_Forum · ‎03-26-2018

Alright, I guess I will try the shared-memory method to decrease the required data bandwidth. Thanks, HRZ.

Altera_Forum · ‎03-27-2018

Hi HRZ, the programming guide says: "You cannot use the library function malloc or the operator new to allocate physically shared memory." So, can I realloc the buffer? The thing is, I want to resize the output buffer.

data_buffer = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, buffer_size, NULL, &status); // initialise
// clEnqueueTask: some processing
x = clEnqueueMapBuffer(queue, data_buffer, CL_TRUE, CL_MAP_WRITE | CL_MAP_READ, 0, size * sizeof(float), 0, NULL, NULL, &status); // read output
if (something) x = realloc(x, newsize * sizeof(float)); // realloc the output buffer to the new size
else // go back to the old size

Altera_Forum · ‎03-27-2018

Allocate the buffer in clCreateBuffer() from the start, at the maximum size needed across all possible cases.

Host-program buffers allocated with malloc() can be passed to realloc() without any problem.
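
A minimal sketch of that idea in plain C, with no OpenCL calls: the storage is allocated once at its maximum size and a "resize" only changes how much of it is considered valid. The `out_view` type and `out_view_resize` function are hypothetical names for this example; in real host code `data` would be the pointer returned by clEnqueueMapBuffer() and `capacity` would match the size passed to clCreateBuffer().

```c
#include <stddef.h>

/* Hypothetical view over a buffer whose storage is allocated once at its
 * maximum size and is never passed to realloc(). */
typedef struct {
    float  *data;      /* storage owned elsewhere (e.g. a mapped pointer) */
    size_t  capacity;  /* number of floats allocated up front */
    size_t  used;      /* number of floats currently meaningful */
} out_view;

/* "Resize" by changing the logical length only. Returns 0 on success and
 * -1 if the request exceeds the capacity chosen at allocation time. */
static int out_view_resize(out_view *v, size_t new_used)
{
    if (new_used > v->capacity)
        return -1;
    v->used = new_used;
    return 0;
}
```

The host then reads or writes only the first `used` elements, and going back to the old size is just another call with the previous length.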

Altera_Forum · ‎03-27-2018

No idea why, but my kernel freezes indefinitely during execution.

The input and output are declared as shared buffers between the host and the FPGA.

input -> kernel1 -> channel -> kernel2 -> output.

data_input = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, buffer_size, NULL, &status);
data_output = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, buffer_size, NULL, &status);
b = (float *) clEnqueueMapBuffer(queue_kernel1, data_input, CL_TRUE, CL_MAP_READ, 0, size * sizeof(float), 0, NULL, NULL, &status);
// put data in b
clEnqueueTask (kernel 1)
clEnqueueTask (kernel 2)
// take data out
out = (float *) clEnqueueMapBuffer(queue_kernel2, data_output, CL_TRUE, CL_MAP_READ, 0, size * sizeof(float), 0, NULL, NULL, &status);


__kernel1(input){
//load input store in buffer 
//send to channel
write_channel_intel(something_ch, buffer);
}
__kernel2(output){
//load data from channel
//some add/mul
output = out;
}