- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I've developed a design in which there are 4 NDRange kernels all running concurrently on the FPGA and communicating with each other via the Altera OpencL Channels extension. There are many channels in the program , however it is inherently cyclical as the data is passed in a circle around the 4 kernels. I know this design isn't recommended by Altera (as it can't optimize the channel depths), however I'm trying to look at all possible architectures to get the best throughput. But I'm having a problem while running it.
I developed and tested the kernel using the emulator successfully. Everything was functional and produced promising correct results during emulation. However, after completing the full compilation and running the kernels, they seem to enter a deadlock state. I'm assuming this only happens after full compilation due to the kernels operating concurrently. Does anyone have any suggestions for debugging kernels that won't come to completion? The solution I have right now uses global memory and an independent queue that continuously transfers over the data from global memory, but it seems inefficient.Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
You can use printf calls in your kernels to find out where they are currently at. Yes, printf actually does work on the FPGA, it just takes up a good chunk of LEs FFs and RAMs. I've used it for debugging before. Unfortunately the emulator is useless for these kind of problems.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Thanks mscharrer! I thought printf was only supported in emulation. That should make debugging a little easier.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I was surprised that it works as well, and only noticed it by accidentally leaving it in after debugging in the emulator.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
--- Quote Start --- I was surprised that it works as well, and only noticed it by accidentally leaving it in after debugging in the emulator. --- Quote End --- For anyone following this thread, I was able to find the solution to the deadlock. By using printf statements into a csv format, I found that the deadlock was occuring due to the channel reads and writes being executed in an unexpected order. In my case I was using an NDRange kernel iterating over the channels with a loop as follows: --- Quote Start --- __kernel __attribute__((reqd_work_group_size(32,1,1))) ExampleKernel() { int id = get_global_id(0); char4 data[NUM_ELEMENTS]; // The data is processed here. ... for(int i=0; i<NUM_ELEMENTS; i++) { switch(id) { case 0: write_channel_altera(outputChannel[0],data);break;
case 1: write_channel_altera(outputchannel[1],data);break; case 2: write_channel_altera(outputChannel[2],data);break;
case 3: write_channel_altera(outputchannel[3],data);break; case 4: write_channel_altera(outputChannel[4],data);break;
case 5: write_channel_altera(outputchannel[5],data);break; case 6: write_channel_altera(outputChannel[6],data);break;
case 7: write_channel_altera(outputchannel[7],data);break; case 8: write_channel_altera(outputChannel[8],data);break;
case 9: write_channel_altera(outputchannel[9],data);break; case 10: write_channel_altera(outputChannel[10],data);break;
case 11: write_channel_altera(outputchannel[11],data);break; case 12: write_channel_altera(outputChannel[12],data);break;
case 13: write_channel_altera(outputchannel[13],data);break; case 14: write_channel_altera(outputChannel[14],data);break;
case 15: write_channel_altera(outputchannel[15],data);break; case 16: write_channel_altera(outputChannel[16],data);break;
case 17: write_channel_altera(outputchannel[17],data);break; case 18: write_channel_altera(outputChannel[18],data);break;
case 19: write_channel_altera(outputchannel[19],data);break; case 20: write_channel_altera(outputChannel[20],data);break;
case 21: write_channel_altera(outputchannel[21],data);break; case 22: write_channel_altera(outputChannel[22],data);break;
case 23: write_channel_altera(outputchannel[23],data);break; case 24: write_channel_altera(outputChannel[24],data);break;
case 25: write_channel_altera(outputchannel[25],data);break; case 26: write_channel_altera(outputChannel[26],data);break;
case 27: write_channel_altera(outputchannel[27],data);break; case 28: write_channel_altera(outputChannel[28],data);break;
case 29: write_channel_altera(outputchannel[29],data);break; case 30: write_channel_altera(outputChannel[30],data);break;
case 31: write_channel_altera(outputchannel[31],data);break; } } --- Quote End --- The above is an example of a (32,1,1) NDRange kernel. The data arrived such that the current loop iteration was completely finished before it continued. Here is how the data arrived: id loop iteration data 0 data[0] 1 data[0] 2 data[0] ... ... 31 data[0] 0 data[1] 1 data[1] 2 data[1] ... ... ... ... 0 data[NUM_ELEMENTS-1] 1 data[NUM_ELEMENTS-1] 2 data[NUM_ELEMENTS-1] ... ... 31 data[NUM_ELEMENTS-1]
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello, now I met the same problem with you! Have you solved your problem?How?Thanks!!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
--- Quote Start --- Hello, now I met the same problem with you! Have you solved your problem?How?Thanks!! --- Quote End --- If you see my previous post, I could solve the issue by re-ordering how the data is taken out of the pipes/channels. Unfortunately due to how my kernels were required to execute I abandoned this design and moved towards one that used local memory to transfer data.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
--- Quote Start --- If you see my previous post, I could solve the issue by re-ordering how the data is taken out of the pipes/channels. Unfortunately due to how my kernels were required to execute I abandoned this design and moved towards one that used local memory to transfer data.[/QUOT,E] In your before post, I see "For anyone following this thread, I was able to find the solution to the deadlock. By using printf statements into a csv format, I found that the deadlock was occuring due to the channel reads and writes being executed in an unexpected order. ", but when I add printf to my kernel, when kernel execute in FPPFA, most of the information cann't printf out, even though at the begining of the kernel! Have you met the same situation? If not, do you know why?Thanks!!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
--- Quote Start --- --- Quote Start --- If you see my previous post, I could solve the issue by re-ordering how the data is taken out of the pipes/channels. Unfortunately due to how my kernels were required to execute I abandoned this design and moved towards one that used local memory to transfer data.[/QUOT,E] In your before post, I see "For anyone following this thread, I was able to find the solution to the deadlock. By using printf statements into a csv format, I found that the deadlock was occuring due to the channel reads and writes being executed in an unexpected order. ", but when I add printf to my kernel, when kernel execute in FPPFA, most of the information cann't printf out, even though at the begining of the kernel! Have you met the same situation? If not, do you know why?Thanks!! --- Quote End --- I would guess that it is due to a similar issue that I faced. I apologize for the late reply and please forgive me if you've figured out a solution to this already. If a specific work-item executes the channel reads/writes in a different order you are likely in a deadlock scenario.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Try using non-blocking reads and/or writes to the channel to avoid deadlock. In one of my designs I use a non-blocking read inside a do-while loop which effectively polls the channel until something is received:
bool got_val;
do{
read_val = read_channel_nb_altera(accum, &got_val);
}while(!got_val);
note, replace 'altera' with 'intel' if you are using sdk version > 16.0 Take a look at the fpga opencl sdk programming guide (https://www.altera.com/content/dam/altera-www/global/en_us/pdfs/literature/hb/opencl-sdk/aocl_programming_guide.pdf#page=33) (take a look at pages 33-35 for channel reads/writes)

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page