Utilize M10k Memory block

Altera_Forum
Honored Contributor II
2,620 Views

Hi,

I have compiled my kernel and see that it uses only 16% of the M10K memory. Is there a way I can use more of it?

Also, when I view my report, it says "loop sacrificed fmax to achieve II to 1." But the optimisation report looked fine; it says "Pipelined well. Successive iterations are launched every cycle."

Is there a way I can avoid that? I want to maximize performance (increase fmax).

Is 121 MHz good enough for Cyclone V? I do not have much data to benchmark against.
Altera_Forum

 

--- Quote Start ---

I have compiled my kernel and see that it uses only 16% of the M10K memory. Is there a way I can use more of it?

--- Quote End ---

The compiler will use as many resources as required; trying to use more could actually reduce performance, since it complicates routing and reduces the operating frequency.

M10K blocks are generally used for implementing large local memory buffers and FIFOs. The more local memory you use, the higher the M10K utilization will become.
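
As a rough illustration of how local memory size maps to M10K usage, here is a minimal sketch in plain C. It assumes the nominal 10,240 bits per M10K block and ignores everything the compiler adds on top (replication for extra ports, banking, width configuration), so it gives a lower bound only. The helper name m10k_lower_bound is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Nominal capacity of one M10K block in bits (10 Kb). */
#define M10K_BITS 10240u

/* Hypothetical helper: lower bound on the M10K blocks needed to hold
 * num_elems elements of elem_bits bits each. The count reported by the
 * compiler can be higher (replication, banking, width configuration). */
static size_t m10k_lower_bound(size_t num_elems, size_t elem_bits)
{
    assert(elem_bits > 0);
    size_t total_bits = num_elems * elem_bits;
    return (total_bits + M10K_BITS - 1) / M10K_BITS;  /* round up */
}
```

For example, a local buffer of 1024 floats holds 32,768 bits, so it needs at least 4 blocks.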

 

 

--- Quote Start ---

Also, when I view my report, it says "loop sacrificed fmax to achieve II to 1." But the optimisation report looked fine; it says "Pipelined well. Successive iterations are launched every cycle."

Is there a way I can avoid that? I want to maximize performance (increase fmax).

--- Quote End ---

 

You probably have a loop-carried dependency (feedback) somewhere in your code, forcing the compiler to create a large critical path to achieve an II of one, at the cost of lowered operating frequency. 
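
To illustrate what such a feedback looks like, here is a minimal sketch in plain C; it is not the original poster's kernel. The first function has a loop-carried dependency on acc: every iteration needs the result of the previous floating-point add. The second uses the shift-register pattern that is often suggested for single work-item kernels, where each add depends on a value produced several iterations earlier. SHIFT_DEPTH is an assumed value; a suitable depth depends on the adder latency on the target device.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed depth of the shift register (hypothetical, device dependent). */
#define SHIFT_DEPTH 8

/* Loop-carried dependency: acc from iteration i feeds iteration i + 1. */
static float sum_naive(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

/* Relaxed dependency: each add reads a partial sum written
 * SHIFT_DEPTH iterations ago, so the feedback path is longer. */
static float sum_shift_register(const float *x, size_t n)
{
    float shift_reg[SHIFT_DEPTH + 1];
    assert(x != NULL || n == 0);

    for (size_t j = 0; j <= SHIFT_DEPTH; ++j)
        shift_reg[j] = 0.0f;

    for (size_t i = 0; i < n; ++i) {
        /* Add to the oldest partial sum and store it at the tail. */
        shift_reg[SHIFT_DEPTH] = shift_reg[0] + x[i];
        /* Shift everything down by one position. */
        for (size_t j = 0; j < SHIFT_DEPTH; ++j)
            shift_reg[j] = shift_reg[j + 1];
    }

    /* Combine the partial sums once, outside the main loop. */
    float acc = 0.0f;
    for (size_t j = 0; j < SHIFT_DEPTH; ++j)
        acc += shift_reg[j];
    return acc;
}
```

Both functions return the same total for inputs where float rounding does not matter; with general floating-point data the results can differ slightly because the additions happen in a different order.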

 

 

--- Quote Start ---

Is 121 MHz good enough for Cyclone V? I do not have much data to benchmark against.

--- Quote End ---

 

For Cyclone V, 120 MHz is not too low, but it is also far from high. It is generally not hard to achieve 170-180 MHz on this device. You can try compiling Altera's reference OpenCL examples to see what operating frequency they achieve.
Altera_Forum

Thanks @HRZ, I thought using more M10K memory blocks could reduce the ALUT and FF usage.

Altera_Forum

I am afraid that is not usually the case.

Altera_Forum

@HRZ I'm sorry, but where can I view the FPGA SDRAM usage? There is 64 MB of SDRAM on the DE1-SoC.

Altera_Forum

 

--- Quote Start ---

@HRZ I'm sorry, but where can I view the FPGA SDRAM usage? There is 64 MB of SDRAM on the DE1-SoC.

--- Quote End ---

The BSP of that board does not support the SDRAM, so it cannot be used with OpenCL.
Altera_Forum

Alright, I guess I will try the shared memory method to decrease the required data bandwidth. Thanks, HRZ.

Altera_Forum

Hi HRZ, the programming guide says: "You cannot use the library function malloc or the operator new to allocate physically shared memory." So, can I realloc the buffer? The thing is, I want to resize the output buffer.

data_buffer = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, buffer_size, NULL, &status);
// initiate
// clEnqueueTask
// some process
x = clEnqueueMapBuffer(queue, data_buffer, CL_TRUE, CL_MAP_WRITE | CL_MAP_READ, 0, size*sizeof(float), 0, NULL, NULL, &status);
// read output
if (something)
    x = realloc(x, newsize*sizeof(float)); // realloc new size to the output buffer
else
    // take back old size
Altera_Forum

Allocate the buffer in clCreateBuffer() from the start, at the maximum size of all possible variants.

Host program buffers that were allocated with malloc() can be passed to realloc() without problems.
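
The "allocate once at the maximum size" idea can be sketched in plain C as follows. This is only an illustration under stated assumptions: calloc stands in for clCreateBuffer plus clEnqueueMapBuffer so that the example runs without an OpenCL runtime, and the out_buffer type and its functions are hypothetical names. With a real shared buffer, data would be the mapped pointer and would never be passed to realloc or free.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical wrapper: a buffer created once at its maximum size, with a
 * separate count of how many elements are currently in use. */
typedef struct {
    float *data;      /* stand-in for the pointer from clEnqueueMapBuffer */
    size_t capacity;  /* elements allocated up front; never changes */
    size_t used;      /* elements currently valid */
} out_buffer;

/* Allocate for the worst case. calloc stands in for the OpenCL calls. */
static int out_buffer_init(out_buffer *b, size_t max_elems)
{
    assert(b != NULL);
    b->data = calloc(max_elems, sizeof(float));
    b->capacity = (b->data != NULL) ? max_elems : 0;
    b->used = 0;
    return (b->data != NULL) ? 0 : -1;
}

/* "Resize" by changing the used count only; nothing is reallocated. */
static int out_buffer_resize(out_buffer *b, size_t new_elems)
{
    if (new_elems > b->capacity)
        return -1;  /* larger than the worst case chosen at creation */
    b->used = new_elems;
    return 0;
}

/* Release the stand-in allocation and reset the bookkeeping. */
static void out_buffer_free(out_buffer *b)
{
    free(b->data);
    b->data = NULL;
    b->capacity = 0;
    b->used = 0;
}
```

The host then reads or writes only the first used elements, and the pointer stays the same for the lifetime of the buffer.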
Altera_Forum

No idea why, but my kernel freezes indefinitely during execution.

The input and output are declared as shared buffers between the host and the FPGA.

input -> kernel 1 -> channel -> kernel 2 -> output.

data_input = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, buffer_size, NULL, &status);
data_output = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, buffer_size, NULL, &status);
b = (float *) clEnqueueMapBuffer(queue_kernel1, data_input, CL_TRUE, CL_MAP_READ, 0, size*sizeof(float), 0, NULL, NULL, &status);
// put data in b
clEnqueueTask (kernel 1)
clEnqueueTask (kernel 2)
// take data out
out = (float *) clEnqueueMapBuffer(queue_kernel2, data_output, CL_TRUE, CL_MAP_READ, 0, size*sizeof(float), 0, NULL, NULL, &status);

__kernel1(input){
    // load input, store in buffer
    // send to channel
    write_channel_intel(something_ch, buffer);
}
__kernel2(output){
    // load data from channel
    // some add/mul
    output = out;
}