Intel® Quartus® Prime Software
Intel® Quartus® Prime Design Software, Design Entry, Synthesis, Simulation, Verification, Timing Analysis, System Design (Platform Designer, formerly Qsys)
16559 Discussions

FPGA(Opencl) code work well in emulator mode, but can't finish run on the FPGA

Altera_Forum
Honored Contributor II
2,742 Views

Hello: 

When I run my FPGA(OpenCL) code in emulator mode, the result is right; but after I create hardware configuration file and run on the FPGA, the code is blocked and can't finish.My code used channel.What is the problem?How to debug the code in this situation?Thank you very much!
0 Kudos
23 Replies
Altera_Forum
Honored Contributor II
941 Views

1) Is there any error output? 

2) What hardware are you using? 

3) Did you use a reference design?
0 Kudos
Altera_Forum
Honored Contributor II
941 Views

This is a very normal and recurring situation; Altera's emulator has a lot of limitations, first and foremost the fact that it doesn't emulate concurrency/parallelism. When channels are used, it is very likely that if you are not careful enough, your code will block on the FPGA even though it works fine on the emulator.  

 

Debugging the code in this situation is not easy. You can try these steps to see if you can find the problem: 

 

1- Add some counters to your code in the emulator and make sure you are writing the same number of values to every channel, that you are reading from it. 

2- Pay attention to the "order" of your channels, try to think of situations that if the channels are reordered, your code may block. It is very likely that this is your problem since the compiler does NOT guarantee channel ordering. Carefully read the "Programming Guide > 1.6.4.5.7 Enforcing the Order of Channel Calls" and try using "mem_fence(CLK_CHANNEL_MEM_FENCE)" to force the order of your channels and see if it fixes the issue. 

3- Increasing the depth of channels might help. 

4- You can add printf to your OpenCL kernel and run it on the FPGA; even though it will heavily slow down the kernel, it might help you find the channel that is blocking the execution. Try to use a light printf (avoid printing values from the kernel, just print a fixed text to see where it is blocking) to avoid new dependencies.
0 Kudos
Altera_Forum
Honored Contributor II
941 Views

Thanks for your reply! 

I want to printf some informations, but the informations are not display.Do you know why?Does it means my code did not reach there before blocked?
0 Kudos
Altera_Forum
Honored Contributor II
941 Views

Thanks for your reply! 

I want to printf some informations, but the informations are not display.Do you know why?Does it means my code did not reach there before blocked?
0 Kudos
Altera_Forum
Honored Contributor II
941 Views

Thank you for reply! 

1)There is no erroroutput. 

2)Xeon + Arria10 

3)I refer to Altera's channel exampls! 

Do you know any method to debug the code?
0 Kudos
Altera_Forum
Honored Contributor II
941 Views

 

--- Quote Start ---  

1) Is there any error output? 

2) What hardware are you using? 

3) Did you use a reference design? 

--- Quote End ---  

 

 

Thank you for reply! 

1)There is no erroroutput. 

2)Xeon + Arria10 

3)I refer to Altera's channel exampls! 

Do you know any method to debug the code?
0 Kudos
Altera_Forum
Honored Contributor II
941 Views

 

--- Quote Start ---  

Thanks for your reply! 

I want to printf some informations, but the informations are not display.Do you know why?Does it means my code did not reach there before blocked? 

--- Quote End ---  

 

 

 

Probably yes. Try putting a printf at the beginning of your kernel code to see if you get any output. I have never personally used printf on the FPGA itself.
0 Kudos
Altera_Forum
Honored Contributor II
941 Views

 

--- Quote Start ---  

1) Is there any error output? 

2) What hardware are you using? 

3) Did you use a reference design? 

--- Quote End ---  

 

 

 

--- Quote Start ---  

This is a very normal and recurring situation; Altera's emulator has a lot of limitations, first and foremost the fact that it doesn't emulate concurrency/parallelism. When channels are used, it is very likely that if you are not careful enough, your code will block on the FPGA even though it works fine on the emulator.  

 

Debugging the code in this situation is not easy. You can try these steps to see if you can find the problem: 

 

1- Add some counters to your code in the emulator and make sure you are writing the same number of values to every channel, that you are reading from it. 

2- Pay attention to the "order" of your channels, try to think of situations that if the channels are reordered, your code may block. It is very likely that this is your problem since the compiler does NOT guarantee channel ordering. Carefully read the "Programming Guide > 1.6.4.5.7 Enforcing the Order of Channel Calls" and try using "mem_fence(CLK_CHANNEL_MEM_FENCE)" to force the order of your channels and see if it fixes the issue. 

3- Increasing the depth of channels might help. 

4- You can add printf to your OpenCL kernel and run it on the FPGA; even though it will heavily slow down the kernel, it might help you find the channel that is blocking the execution. Try to use a light printf (avoid printing values from the kernel, just print a fixed text to see where it is blocking) to avoid new dependencies. 

--- Quote End ---  

 

 

Thanks for your reply! 

I want to printf some informations, but the informations are not display.Do you know why?Does it means my code did not reach there before blocked? 

 

My code as belows: 

# pragma OPENCL_EXTENSION cl_altera_channels : enable 

 

// Channel declarations 

channel float DATA_IN __attribute__((depth(8))); 

channel float DATA_OUT __attribute__((depth(8))); 

channel float CONV1_WEIGHTS __attribute__((depth(8))); 

channel float CONV1_BIAS __attribute__((depth(8))); 

 

//num_channel = num_pre_feature_maps!!!!! 

 

__kernel 

void data_in(int num_pre_feature_maps, int num_feature_maps, __global const float *restrict input) 

printf("data in\n"); 

int global_idx = get_global_id(0); 

int global_idy = get_global_id(1); 

int global_offset = global_idy * get_local_size(0) + global_idx; 

printf("data in global_idx=%d global_idy=%d\n", global_idx, global_idy); 

 

//Read data 

float data = input[global_offset]; 

for(int i = 0; i < num_feature_maps; ++i){ 

for(int j = 0; j < num_pre_feature_maps; ++j){ 

printf("channel read datanum_feature_map=%d global_idx=%d global_idy=%d\n", i, global_idx, global_idy); 

write_channel_altera(DATA_IN, data); 

 

 

 

 

__kernel 

void weights_bias_in(__global const float *restrict weights, __global const float *restrict bias) 

printf("here\n"); 

int global_idx = get_global_id(0); 

int global_idy = get_global_id(1); 

//Read Weights 

write_channel_altera(CONV1_WEIGHTS, weights[global_idy * KERNEL_SIZE * KERNEL_SIZE + global_idx]); 

 

 

//Read Bias 

if(global_idx == 0){ 

printf("gidx=%d gidy=%d\n", global_idx, global_idy); 

write_channel_altera(CONV1_BIAS, bias[global_idy]); 

 

 

 

__kernel 

void conv(int map_size, int num_pre_feature_maps, int num_feature_maps, int relu_on) 

printf("conv start\n"); 

float res_buf[28 * INPUT_SIZE]; 

float weights_buf[KERNEL_SIZE * KERNEL_SIZE]; 

float bias_buf[MAX_NUM_FEATURE_MAPS]; 

float rows[4 * INPUT_SIZE + 5]; 

 

//load bias 

for(int i = 0; i < num_feature_maps; ++i){ 

printf("i=%d\n", i); 

bias_buf = read_channel_altera(conv1_bias); 

//printf("bias data=%f\n", bias_buf); 

//printf("End of load bias\n"); 

 

 

//int input_size = map_size + KERNEL_SIZE - 1; 

for(int num_feature_map = 0; num_feature_map < num_feature_maps; ++num_feature_map){ 

for(int i = 0; i < KERNEL_SIZE * KERNEL_SIZE; ++i){ 

weights_buf = read_channel_altera(conv1_weights); 

printf("i=%d w=%f\n", i, weights_buf); 

 

for(int i = 0; i < 1024; ++i){ 

for(int j = (4 * INPUT_SIZE + 4); j >0; --j){ 

res_buf[j] = res_buf[j - 1]; 

res_buf[0] = read_channel_altera(DATA_IN); 

//test 

for(int i = 0; i < map_size; ++i){ 

for(int j = 0; j < map_size; ++j){ 

double res = res_buf[i * INPUT_SIZE + j] + weights_buf[num_feature_map] + bias_buf[num_feature_map]; 

if(relu_on){ 

res = (int)res > 0.0f ? (int)res : 0.0f; 

printf("write channel i=%d j=%d res=%f\n", i, j, res); 

write_channel_altera(DATA_OUT, res); 

 

 

 

__kernel 

void data_out(int num_feature_maps, __global float *restrict output) 

printf("data out start\n"); 

int local_idx = get_local_id(0); 

int local_idy = get_local_id(1); 

int global_idx = get_global_id(0); 

int global_idy = get_global_id(1); 

printf("data out global_idx=%d global_idy=%d\n", global_idx, global_idy); 

 

for(int i = 0; i < num_feature_maps; ++i){ 

printf("data out num_feature_map=%d global_idx=%d global_idy=%d\n", i, global_idx, global_idy); 

float data = read_channel_altera(DATA_OUT); 

//float data = 1; 

int global_offset = global_idy * get_local_size(0) + global_idx; 

output[global_offset] = data; 

 

output += get_global_size(1) * get_global_size(1); 

 

 

Can you find any problem?? 

 

Thank you!
0 Kudos
Altera_Forum
Honored Contributor II
941 Views

 

--- Quote Start ---  

1) Is there any error output? 

2) What hardware are you using? 

3) Did you use a reference design? 

--- Quote End ---  

 

 

 

--- Quote Start ---  

Probably yes. Try putting a printf at the beginning of your kernel code to see if you get any output. I have never personally used printf on the FPGA itself. 

--- Quote End ---  

 

 

I printf some information at the beginning of my kernel, but they don't printf out.
0 Kudos
Altera_Forum
Honored Contributor II
941 Views

 

--- Quote Start ---  

1) Is there any error output? 

2) What hardware are you using? 

3) Did you use a reference design? 

--- Quote End ---  

 

 

 

--- Quote Start ---  

Thanks for your reply! 

I want to printf some informations, but the informations are not display.Do you know why?Does it means my code did not reach there before blocked? 

 

My code as belows: 

 

# pragma OPENCL_EXTENSION cl_altera_channels : enable 

 

// Channel declarations 

channel float DATA_IN __attribute__((depth(8))); 

channel float DATA_OUT __attribute__((depth(8))); 

channel float CONV1_WEIGHTS __attribute__((depth(8))); 

channel float CONV1_BIAS __attribute__((depth(8))); 

 

//num_channel = num_pre_feature_maps!!!!! 

 

__kernel 

void data_in(int num_pre_feature_maps, int num_feature_maps, __global const float *restrict input) 

printf("data in\n"); 

int global_idx = get_global_id(0); 

int global_idy = get_global_id(1); 

int global_offset = global_idy * get_local_size(0) + global_idx; 

printf("data in global_idx=%d global_idy=%d\n", global_idx, global_idy); 

 

//Read data 

float data = input[global_offset]; 

for(int i = 0; i < num_feature_maps; ++i){ 

for(int j = 0; j < num_pre_feature_maps; ++j){ 

printf("channel read datanum_feature_map=%d global_idx=%d global_idy=%d\n", i, global_idx, global_idy); 

write_channel_altera(DATA_IN, data); 

 

 

 

 

__kernel 

void weights_bias_in(__global const float *restrict weights, __global const float *restrict bias) 

printf("here\n"); 

int global_idx = get_global_id(0); 

int global_idy = get_global_id(1); 

//Read Weights 

write_channel_altera(CONV1_WEIGHTS, weights[global_idy * KERNEL_SIZE * KERNEL_SIZE + global_idx]); 

 

 

//Read Bias 

if(global_idx == 0){ 

printf("gidx=%d gidy=%d\n", global_idx, global_idy); 

write_channel_altera(CONV1_BIAS, bias[global_idy]); 

 

 

 

__kernel 

void conv(int map_size, int num_pre_feature_maps, int num_feature_maps, int relu_on) 

printf("conv start\n"); 

float res_buf[28 * INPUT_SIZE]; 

float weights_buf[KERNEL_SIZE * KERNEL_SIZE]; 

float bias_buf[MAX_NUM_FEATURE_MAPS]; 

float rows[4 * INPUT_SIZE + 5]; 

 

//load bias 

for(int i = 0; i < num_feature_maps; ++i){ 

printf("i=%d\n", i); 

bias_buf = read_channel_altera(conv1_bias); 

//printf("bias data=%f\n", bias_buf); 

//printf("End of load bias\n"); 

 

 

//int input_size = map_size + KERNEL_SIZE - 1; 

for(int num_feature_map = 0; num_feature_map < num_feature_maps; ++num_feature_map){ 

for(int i = 0; i < KERNEL_SIZE * KERNEL_SIZE; ++i){ 

weights_buf = read_channel_altera(conv1_weights); 

printf("i=%d w=%f\n", i, weights_buf); 

 

for(int i = 0; i < 1024; ++i){ 

for(int j = (4 * INPUT_SIZE + 4); j >0; --j){ 

res_buf[j] = res_buf[j - 1]; 

res_buf[0] = read_channel_altera(DATA_IN); 

//test 

for(int i = 0; i < map_size; ++i){ 

for(int j = 0; j < map_size; ++j){ 

double res = res_buf[i * INPUT_SIZE + j] + weights_buf[num_feature_map] + bias_buf[num_feature_map]; 

if(relu_on){ 

res = (int)res > 0.0f ? (int)res : 0.0f; 

printf("write channel i=%d j=%d res=%f\n", i, j, res); 

write_channel_altera(DATA_OUT, res); 

 

 

 

__kernel 

void data_out(int num_feature_maps, __global float *restrict output) 

printf("data out start\n"); 

int local_idx = get_local_id(0); 

int local_idy = get_local_id(1); 

int global_idx = get_global_id(0); 

int global_idy = get_global_id(1); 

printf("data out global_idx=%d global_idy=%d\n", global_idx, global_idy); 

 

for(int i = 0; i < num_feature_maps; ++i){ 

printf("data out num_feature_map=%d global_idx=%d global_idy=%d\n", i, global_idx, global_idy); 

float data = read_channel_altera(DATA_OUT); 

//float data = 1; 

int global_offset = global_idy * get_local_size(0) + global_idx; 

output[global_offset] = data; 

 

output += get_global_size(1) * get_global_size(1); 

 

 

Can you find any problem?? 

 

Thank you! 

--- Quote End ---  

 

 

 

It only can printf out information in the "weights_bias_in" kernel.
0 Kudos
Altera_Forum
Honored Contributor II
941 Views

There is no way the "data_in" kernel would deadlock without printing anything. If you don't get any output from that kernel at runtime, it is possible that there is something wrong with your host code. I have no idea how Altera implements printf, though; there might be implementation-specific details that block the printf calls. 

 

Needless to say, you definitely have an ordering issue in reading and writing CONV1_BIAS and CONV1_WEIGHTS; you are first writing CONV1_WEIGHTS and then CONV1_BIAS, while reading them in the opposite direction. This is very likely one major (or the only) source of your problem.
0 Kudos
Altera_Forum
Honored Contributor II
941 Views

 

--- Quote Start ---  

There is no way the "data_in" kernel would deadlock without printing anything. If you don't get any output from that kernel at runtime, it is possible that there is something wrong with your host code. I have no idea how Altera implements printf, though; there might be implementation-specific details that block the printf calls. 

 

Needless to say, you definitely have an ordering issue in reading and writing CONV1_BIAS and CONV1_WEIGHTS; you are first writing CONV1_WEIGHTS and then CONV1_BIAS, while reading them in the opposite direction. This is very likely one major (or the only) source of your problem. 

--- Quote End ---  

 

 

Thanks for your reply! 

First, about printf, "Intel FPFA for opencl programming guide" says: 

"During kernel execution, printf data is stored in a global printf buffer that the Intel FPGA SDK for OpenCL Offline 

Compiler allocates automatically. The size of this buffer is 64 kB; the total size of data arguments to a printf call should 

not exceed this size. When kernel execution completes, the contents of the printf buffer are printed to standard output. 

Buffer overflows are handled seamlessly; printf instructions can be executed an unlimited number of times. However, if 

the printf buffer overflows, kernel pipeline execution stalls until the host reads the buffer and prints the buffer contents. 

Because printf functions store their data into a global memory buffer, the performance of your kernel will drop if it 

includes such functions". 

So, I think my printf information stored in the buffer? Becaue in emulator mode,my code result is right, so i think the host code is right? The host code have four queus, every kernel correspond to on queue. 

 

Second, I have changed the sequence of reading and writing CONV1_BIAS and CONV1_WEIGHTS, and also increaing the depth of channel to 20.Complie is running,I expect I will get right result.  

 

Thank you very much!!
0 Kudos
Altera_Forum
Honored Contributor II
941 Views

 

--- Quote Start ---  

So, I think my printf information stored in the buffer? Becaue in emulator mode,my code result is right, so i think the host code is right? The host code have four queus, every kernel correspond to on queue. 

--- Quote End ---  

 

That is correct. I was under the impression that printf calls are "streamed" over PCI-E. If the calls are stored in a buffer and printed out "after" kernel execution is finished, then printf is not only completely useless for debugging deadlocks, using it can make deadlocks more likely. 

 

 

--- Quote Start ---  

Second, I have changed the sequence of reading and writing CONV1_BIAS and CONV1_WEIGHTS, and also increaing the depth of channel to 20.Complie is running,I expect I will get right result.  

--- Quote End ---  

 

If your kernel still didn't work, I recommend removing the printfs (since it might cause deadlocks of its own) and also using "mem_fence(CLK_CHANNEL_MEM_FENCE)" to enforce channel ordering in kernels that have multiple channel reads/writes.
0 Kudos
Altera_Forum
Honored Contributor II
941 Views

 

--- Quote Start ---  

If your kernel still didn't work, I recommend removing the printfs (since it might cause deadlocks of its own) and also using "mem_fence(CLK_CHANNEL_MEM_FENCE)" to enforce channel ordering in kernels that have multiple channel reads/writes. 

--- Quote End ---  

 

 

I have add "mem_fence(CLK_CHANNEL_MEM_FENCE)" to my code. I forget to say just now. 

 

But I considered that, in my code of last version, the printf information of kernel "weights_bias_in" can all printf out, so the kernel excute finished. That is to say, the other three kernels all start excute, read or write channels for many times before blocked. Especially for "conv", have been read “CONV1_WEIGHTS” for "num_feature_maps" times. 

So, is not the problem of ordering of reading and writing of channels? Is the problem of FIFO(channel) confilict? Have you ever encounter the problem? Does Chaneel has the problem?I am not quite sure, so I also increaing the deapth of channel!
0 Kudos
Altera_Forum
Honored Contributor II
941 Views

 

--- Quote Start ---  

If your kernel still didn't work, I recommend removing the printfs (since it might cause deadlocks of its own) and also using "mem_fence(CLK_CHANNEL_MEM_FENCE)" to enforce channel ordering in kernels that have multiple channel reads/writes. 

--- Quote End ---  

 

 

Complie finished! But nothing printf out and kernel blocked! Now I remove most of printf to try again! I consider the host code is wrong?? What error will cause this result???
0 Kudos
Altera_Forum
Honored Contributor II
941 Views

 

--- Quote Start ---  

But I considered that, in my code of last version, the printf information of kernel "weights_bias_in" can all printf out, so the kernel excute finished. That is to say, the other three kernels all start excute, read or write channels for many times before blocked. Especially for "conv", have been read “CONV1_WEIGHTS” for "num_feature_maps" times. 

So, is not the problem of ordering of reading and writing of channels? Is the problem of FIFO(channel) confilict? Have you ever encounter the problem? Does Chaneel has the problem?I am not quite sure, so I also increaing the deapth of channel! 

--- Quote End ---  

 

 

Your observation is correct but there are a lot of unknown factors involved in the process. I recommend removing all the printf calls except the ones that are the beginning of your kernels. 

If your host code is structured like this, it is correct (this is a pseudo code with shortened function names): 

 

ClEnqueue(queue1, kernel1); 

ClEnqueue(queue2, kernel2); 

ClEnqueue(queue3, kernel3); 

ClEnqueue(queue4, kernel4); 

ClFinish(queue4);
0 Kudos
Altera_Forum
Honored Contributor II
941 Views

 

--- Quote Start ---  

Your observation is correct but there are a lot of unknown factors involved in the process. I recommend removing all the printf calls except the ones that are the beginning of your kernels. 

If your host code is structured like this, it is correct (this is a pseudo code with shortened function names): 

 

ClEnqueue(queue1, kernel1); 

ClEnqueue(queue2, kernel2); 

ClEnqueue(queue3, kernel3); 

ClEnqueue(queue4, kernel4); 

ClFinish(queue4); 

--- Quote End ---  

 

 

Hi, my host code is like this. I consulted an Intel FAE, he said my be caused by imbalance of write and read of channels! But I think, imlalance will not cause channel blocked, because I use buffered channel! What do you think?Tank you!
0 Kudos
Altera_Forum
Honored Contributor II
941 Views

If you have channel ordering issues, imbalance CAN result in deadlock even with buffered channels. Did you try removing all your printf calls and using mem_fence for all kernels that have more than one channel read/write?

0 Kudos
Altera_Forum
Honored Contributor II
941 Views

 

--- Quote Start ---  

If you have channel ordering issues, imbalance CAN result in deadlock even with buffered channels. Did you try removing all your printf calls and using mem_fence for all kernels that have more than one channel read/write? 

--- Quote End ---  

 

 

Yes, I have try, but still don't work. I have channel ording issues, so I think its caused by imbalance of access of channel. So, if this is the question, How do I change my code?
0 Kudos
Altera_Forum
Honored Contributor II
835 Views

 

--- Quote Start ---  

Yes, I have try, but still don't work. I have channel ording issues, so I think its caused by imbalance of access of channel. So, if this is the question, How do I change my code? 

--- Quote End ---  

 

Easiest solution is to put all of your code in one kernel, remove all the channels, and use local memory buffers instead.
0 Kudos
Reply