Hi,
I have compiled my kernel and it only uses 16% of the M10K memory. Is there a way I can use more of it? Also, when I view my report, it says "loop sacrificed fmax to achieve II to 1." But the optimization report looks fine; it says "Pipelined well. Successive iterations are launched every cycle." Is there a way I can avoid that? I want to maximize its performance (increase fmax). Is 121 MHz good enough for Cyclone V? I don't have much data to benchmark against.
9 replies
--- Quote Start ---
I have compiled my kernel and it only uses 16% of the M10K memory. Is there a way I can use more of it?
--- Quote End ---
The compiler will use as many resources as required; trying to use more could actually reduce performance, since it complicates routing and lowers the operating frequency. M10K blocks are generally used to implement large local memory buffers and FIFOs, so the more local memory your kernel uses, the higher the M10K utilization will be.
--- Quote Start ---
Also, when I view my report, it says "loop sacrificed fmax to achieve II to 1." But the optimization report looks fine; it says "Pipelined well. Successive iterations are launched every cycle." Is there a way I can avoid that? I want to maximize its performance (increase fmax).
--- Quote End ---
You probably have a loop-carried dependency (feedback) somewhere in your code, which forces the compiler to create a long critical path to achieve an II of one, at the cost of a lower operating frequency.
--- Quote Start ---
Is 121 MHz good enough for Cyclone V? I don't have much data to benchmark against.
--- Quote End ---
For Cyclone V, 120 MHz is not too low, but it is also far from high. It is generally not hard to achieve 170-180 MHz on this device. You can try compiling Altera's reference OpenCL examples to see what operating frequency they achieve.
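To illustrate the kind of loop-carried dependency meant here, below is a minimal plain-C sketch (not taken from the thread, and not necessarily what is in your kernel). A floating-point accumulation like `sum += x[i]` makes every iteration depend on the previous addition; since a floating-point add takes several cycles on the FPGA, the compiler has to either raise the II or squeeze that feedback path, which costs fmax. A common workaround is to rotate the running total through several partial sums held in a shift register, so each addition depends on the result from `NUM_PARTIALS` iterations earlier. `NUM_PARTIALS` and both function names are hypothetical, and in an actual OpenCL kernel the inner shift loop would be fully unrolled (e.g. with `#pragma unroll`).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical shift-register depth; on a real device it would be chosen
 * to cover the latency of the floating-point adder. */
#define NUM_PARTIALS 8

/* Naive reduction: each addition needs the result of the previous one,
 * i.e. a loop-carried dependency with distance 1. */
float sum_naive(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* Shift-register reduction: the new partial sum is built from the value
 * computed NUM_PARTIALS iterations earlier (shift_reg[0]), so the
 * dependency distance becomes NUM_PARTIALS instead of 1. */
float sum_shift_register(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float shift_reg[NUM_PARTIALS + 1] = {0.0f};

    for (size_t i = 0; i < n; i++) {
        shift_reg[NUM_PARTIALS] = shift_reg[0] + x[i];
        /* In an OpenCL kernel this shift loop would be fully unrolled. */
        for (size_t j = 0; j < NUM_PARTIALS; j++)
            shift_reg[j] = shift_reg[j + 1];
    }

    /* Combine the NUM_PARTIALS independent partial sums once at the end. */
    float sum = 0.0f;
    for (size_t j = 0; j < NUM_PARTIALS; j++)
        sum += shift_reg[j];
    return sum;
}
```

Because the additions are reordered, the two versions can give slightly different floating-point results for general inputs.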
Thanks @HRZ. I thought using more M10K memory blocks could reduce the ALUT and FF usage.
I am afraid that is not usually the case.
@HRZ Sorry, but where can I view the FPGA SDRAM usage? There's 64 MB of SDRAM on the DE1-SoC.
--- Quote Start ---
@HRZ Sorry, but where can I view the FPGA SDRAM usage? There's 64 MB of SDRAM on the DE1-SoC.
--- Quote End ---
The BSP for that board does not support the SDRAM, so it cannot be used with OpenCL.
Alright, I guess I'll try the shared memory method to decrease the required data bandwidth. Thanks, HRZ.
Hi HRZ, the programming guide says: "You cannot use the library function malloc or the operator new to allocate physically shared memory." So, can I realloc the buffer? I want to resize the output buffer.
data_buffer = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, buffer_size, NULL, &status); // initiate
// clEnqueueTask(...); // some processing
x = (float *)clEnqueueMapBuffer(queue, data_buffer, CL_TRUE, CL_MAP_WRITE | CL_MAP_READ, 0, size*sizeof(float), 0, NULL, NULL, &status); // read output
if (something) x = realloc(x, newsize*sizeof(float)); // realloc the output buffer to the new size
else // take back the old size
Allocate the buffer with clCreateBuffer() at the start, using the maximum size of all possible variants.
Buffers in the host program that were allocated with malloc() can be realloc()'d without any problem.
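A minimal sketch of both suggestions, assuming plain C with no OpenCL calls: the `mapped` array in the usage stands in for the pointer returned by clEnqueueMapBuffer() on a buffer created once at the maximum size (`MAX_ELEMS`, a hypothetical limit), with the number of valid elements tracked separately. Whatever needs resizing is copied into a host buffer from malloc(), which can then be realloc()'d freely; the mapped pointer itself is never passed to realloc(). `copy_results` and `resize_results` are hypothetical helper names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical maximum number of output elements over all variants;
 * the OpenCL buffer would be created once with MAX_ELEMS * sizeof(float). */
#define MAX_ELEMS 1024

/* Copies the first n valid elements from the mapped (shared) region into a
 * heap buffer owned by the host program. Returns NULL on allocation failure. */
float *copy_results(const float *mapped, size_t n)
{
    assert(n > 0 && n <= MAX_ELEMS);
    float *host = malloc(n * sizeof *host);
    if (host != NULL)
        memcpy(host, mapped, n * sizeof *host);
    return host;
}

/* Resizes a malloc()'d host buffer. On failure the old buffer is left
 * untouched and -1 is returned, so no memory is leaked. */
int resize_results(float **buf, size_t new_n)
{
    assert(new_n > 0);
    float *tmp = realloc(*buf, new_n * sizeof *tmp);
    if (tmp == NULL)
        return -1;
    *buf = tmp;
    return 0;
}
```

With the maximum-size buffer, one option is to pass the current element count to the kernel as an argument so it only processes the valid part.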
No idea why, but my kernel freezes indefinitely during execution.
The input and output are declared as shared buffers between the host and the FPGA: input -> kernel1 -> channel -> kernel2 -> output.
data_input = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, buffer_size, NULL, &status);
data_output = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, buffer_size, NULL, &status);
b = (float *) clEnqueueMapBuffer(queue_kernel1,data_input,CL_TRUE,CL_MAP_READ,0,size*sizeof(float),0,NULL,NULL,&status);
// put data in b
clEnqueueTask (kernel 1)
clEnqueueTask (kernel 2)
// take data out
out = (float *)clEnqueueMapBuffer(queue_kernel2,data_output,CL_TRUE,CL_MAP_READ,0,size*sizeof(float),0,NULL,NULL,&status);
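__kernel void kernel1(__global const float *input) {
    // load input, store in buffer
    // send to channel
    write_channel_intel(something_ch, buffer);
}
__kernel void kernel2(__global float *output) {
    // load data from channel (read_channel_intel)
    // some add/mul
    *output = out; // store result
}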
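The thread does not show what caused the hang, but a few things in the host code above are worth checking. The input is mapped with CL_MAP_READ even though the host writes into it (CL_MAP_WRITE is the matching flag), and the region should be unmapped with clEnqueueUnmapMemObject() before kernel1 is launched. Another common cause with channels is that the two kernels do not actually run at the same time, for example when both tasks are enqueued on the same in-order command queue, or that kernel1 writes a different number of items than kernel2 reads. The toy model below is a plain-C simulation (not OpenCL, and not how the SDK actually schedules kernels) of a blocking channel with a hypothetical depth; `simulate_channel` and `launch_mode` are hypothetical names, and the function reports whether producer and consumer both finish under a serialized or a concurrent launch.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { SERIALIZED, CONCURRENT } launch_mode;

/* Toy model of kernel1 -> channel -> kernel2.
 * n_writes: items kernel1 writes; n_reads: items kernel2 reads;
 * depth: channel capacity (hypothetical). In SERIALIZED mode kernel2 may
 * only start once kernel1 has finished, as if both tasks shared one
 * in-order queue. Returns true if both kernels finish, false on deadlock. */
bool simulate_channel(int n_writes, int n_reads, int depth, launch_mode mode)
{
    assert(n_writes >= 0 && n_reads >= 0 && depth > 0);
    int written = 0, read = 0, occupancy = 0;

    for (;;) {
        bool progress = false;

        /* kernel1: a blocking write succeeds only if the channel is not full */
        if (written < n_writes && occupancy < depth) {
            written++;
            occupancy++;
            progress = true;
        }

        /* kernel2: a blocking read succeeds only if data is available,
         * and in SERIALIZED mode only after kernel1 has completed */
        bool consumer_running = (mode == CONCURRENT) || (written == n_writes);
        if (consumer_running && read < n_reads && occupancy > 0) {
            read++;
            occupancy--;
            progress = true;
        }

        if (written == n_writes && read == n_reads)
            return true;   /* both kernels completed */
        if (!progress)
            return false;  /* neither kernel can advance: deadlock */
    }
}
```

In this model the serialized launch only completes when everything kernel1 writes fits in the channel, which is one way small test inputs can pass while larger ones hang.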
