Hi,
I have compiled my kernel and I see that it only uses 16% of the M10K memory. Is there a way I can use more of it? Also, when I view my report, it says "loop sacrificed fmax to achieve II to 1." But the optimisation report looked fine; it says "Pipelined well. Successive iterations are launched every cycle." Is there a way I can avoid that? I want to maximize performance (increase fmax). Is 121 MHz good enough for Cyclone V? I do not have much data to benchmark against.
--- Quote Start ---
I have compiled my kernel and I see that it only uses 16% of the M10K memory. Is there a way I can use more of it?
--- Quote End ---
The compiler will use as many resources as required; trying to use more could actually reduce performance, since it complicates routing and reduces the operating frequency. M10K blocks are generally used to implement large local memory buffers and FIFOs. The more local memory you use, the higher the M10K utilization becomes.
--- Quote Start ---
Also, when I view my report, it says "loop sacrificed fmax to achieve II to 1." But the optimisation report looked fine; it says "Pipelined well. Successive iterations are launched every cycle." Is there a way I can avoid that? I want to maximize performance (increase fmax).
--- Quote End ---
You probably have a loop-carried dependency (feedback) somewhere in your code, forcing the compiler to create a long critical path to achieve an II of one, at the cost of a lower operating frequency.
--- Quote Start ---
Is 121 MHz good enough for Cyclone V? I do not have much data to benchmark against.
--- Quote End ---
For Cyclone V, 120 MHz is not too low, but it is also far from high. It is generally not hard to achieve 170-180 MHz on this device. You can try compiling Altera's reference OpenCL examples to see what operating frequency they achieve.
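To make the loop-carried dependency point concrete, here is a minimal sketch in plain C (not the original poster's kernel, which is not shown in the thread). The first function has the feedback pattern: every iteration needs the sum produced by the previous one. The second uses the commonly described shift-register workaround, where each iteration only depends on a value written DEPTH iterations earlier. The names sum_single, sum_shift_register and the value of DEPTH are assumptions chosen for illustration, and whether this rewrite actually raises fmax for a given kernel has to be checked in the compiler report.

```c
#include <stddef.h>

#define DEPTH 8 /* assumed shift-register depth; tune it against the report */

/* Single accumulator: iteration i cannot start its add until
 * iteration i-1 has produced sum (loop-carried dependency). */
float sum_single(const float *data, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += data[i];
    }
    return sum;
}

/* Partial sums in a shift register: iteration i reads a value that was
 * written DEPTH iterations earlier, which relaxes the feedback path. */
float sum_shift_register(const float *data, size_t n)
{
    float partial[DEPTH + 1];
    for (int k = 0; k <= DEPTH; ++k) {
        partial[k] = 0.0f;
    }

    for (size_t i = 0; i < n; ++i) {
        /* Add to the oldest partial sum and place the result at the tail. */
        partial[DEPTH] = partial[0] + data[i];
        /* Shift every entry one position toward the head. */
        for (int k = 0; k < DEPTH; ++k) {
            partial[k] = partial[k + 1];
        }
    }

    /* Combine the DEPTH partial sums once, outside the hot loop. */
    float sum = 0.0f;
    for (int k = 0; k < DEPTH; ++k) {
        sum += partial[k];
    }
    return sum;
}
```

Both functions add the same values, but in a different order, so with floating-point data the results can differ slightly in the last bits.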
Thanks @HRZ, I thought using more M10K memory blocks could reduce the ALUT and FF usage.
I am afraid that is not usually the case.
@HRZ, sorry, but where can I view the FPGA SDRAM usage? There is 64 MB of SDRAM on the DE1-SoC.
--- Quote Start ---
@HRZ, sorry, but where can I view the FPGA SDRAM usage? There is 64 MB of SDRAM on the DE1-SoC.
--- Quote End ---
The BSP of that board does not support the SDRAM, and hence it cannot be used with OpenCL.
Alright, I guess I will try the shared memory method to decrease the required data bandwidth. Thanks, HRZ.
Hi HRZ, the programming guide says: "You cannot use the library function malloc or the operator new to allocate physically shared memory." So, can I realloc the buffer? The thing is, I want to resize the output buffer.
data_buffer = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, buffer_size, NULL, &status); // initialize
// clEnqueueTask: some processing
x = clEnqueueMapBuffer(queue, data_buffer, CL_TRUE, CL_MAP_WRITE | CL_MAP_READ, 0, size * sizeof(float), 0, NULL, NULL, &status); // read output
if (something) x = realloc(x, newsize * sizeof(float)); // realloc the output buffer to the new size
else // take back the old size
Allocate the buffer with clCreateBuffer() at the maximum size across all possible variants from the start.
Host-program buffers allocated with malloc() can be passed to realloc() without any problem.
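The "allocate once at the maximum size" advice can be sketched in plain C as follows. This is only an illustration under stated assumptions: fixed_buffer and its helper functions are hypothetical names, and malloc() stands in for the storage that clCreateBuffer() and clEnqueueMapBuffer() would provide, so that the sketch runs without an OpenCL runtime. The point is that "resizing" only changes how many elements are treated as valid; the underlying storage is never reallocated.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical wrapper: storage is sized once for the worst case. */
typedef struct {
    float *data;     /* stand-in for the pointer from clEnqueueMapBuffer */
    size_t capacity; /* elements allocated up front, never changed */
    size_t used;     /* elements currently considered valid */
} fixed_buffer;

/* Allocate the maximum size once. Returns 0 on success, -1 on failure. */
int fixed_buffer_init(fixed_buffer *b, size_t max_elems)
{
    b->data = malloc(max_elems * sizeof(float));
    if (b->data == NULL) {
        b->capacity = 0;
        b->used = 0;
        return -1;
    }
    b->capacity = max_elems;
    b->used = 0;
    return 0;
}

/* "Resize" by changing the logical size only; the pointer stays valid.
 * Returns -1 and leaves the buffer untouched if the request exceeds capacity. */
int fixed_buffer_resize(fixed_buffer *b, size_t new_elems)
{
    if (new_elems > b->capacity) {
        return -1;
    }
    b->used = new_elems;
    return 0;
}

/* Release the storage allocated by fixed_buffer_init. */
void fixed_buffer_free(fixed_buffer *b)
{
    free(b->data);
    b->data = NULL;
    b->capacity = 0;
    b->used = 0;
}
```

In the real host program the same bookkeeping would apply to the mapped pointer: keep the cl_mem object at its maximum size and pass the current logical size to the kernel as an argument, rather than calling realloc() on the mapped pointer.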
No idea why, but my kernel freezes indefinitely during execution.
The input and output are declared as shared buffers between the host and the FPGA: input -> kernel 1 -> channel -> kernel 2 -> output.
data_input = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, buffer_size, NULL, &status);
data_output = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, buffer_size, NULL, &status);
b = (float *) clEnqueueMapBuffer(queue_kernel1,data_input,CL_TRUE,CL_MAP_READ,0,size*sizeof(float),0,NULL,NULL,&status);
// put data in b
clEnqueueTask (kernel 1)
clEnqueueTask (kernel 2)
// take data out
out = (float *)clEnqueueMapBuffer(queue_kernel2,data_output,CL_TRUE,CL_MAP_READ,0,size*sizeof(float),0,NULL,NULL,&status);
__kernel1(input){
    // load input, store in buffer
    // send to channel
    write_channel_intel(something_ch, buffer);
}
__kernel2(output){
    // load data from channel
    // some add/mul
    output = out;
}