Intel® Quartus® Prime Software

Wrong results when running design on hardware

Altera_Forum
Honored Contributor II

Hello, 

 

My design is made of a chain of single work-item kernels transferring data using channels. 

It runs fine in emulation, and the FPGA binary is built correctly (95% of estimated resource usage). 

 

Here is my problem: 

Both emulation and hardware run to completion (no deadlock), but only the emulation produces correct results. 

 

The machines used for development and deployment are different, and it is not possible to use the same machine for both steps. 

The only part that is recompiled on the deployment machine is the host binary, so I suspect that could be the issue, but I am not sure where to start looking for the cause. 

 

Also, the host part processes the output from the FPGA only after the latter has finished. Could the host compilation be affecting the results?  

Did anyone experience a similar issue? 

 

Any hints will be appreciated. 

 

Leonardo
Altera_Forum
Honored Contributor II

Your problem is very likely not caused by host compilation, but rather by a race condition or some other issue that does not show up in the emulator (and there are a lot of them). 

 

I have so far encountered two such cases: 

 

- Two kernels running in parallel, where one updates an off-chip memory location and then sends a "completion" flag to the other kernel, and the other kernel then reads that memory location. This will work correctly in the emulator, since the emulator does NOT emulate concurrency, but will produce incorrect results on the actual FPGA. 

- Accidentally reducing the scope of a variable more than necessary. For example, in the following code the output will be 10 on the emulator, but 0 on the FPGA: 

 

int sum = 0;
for (int i = 0; i < 1; i++) {
    int sum = 10;
}
printf("%d", sum);
Altera_Forum
Honored Contributor II

Hi HRZ, 

 

Yes, my design falls into the first case you mentioned.  

I thought that adding a fence on global memory would ensure consistency, i.e. something like this was initially added in my design: 

 

Kernel updating off-chip memory: 

void Krnl_Store( ... ) {
    ...
    // writing to global-memory location
    write_mem_fence(CLK_GLOBAL_MEM_FENCE);
    write_channel_altera(chan_Store2GG_ack, 1);
    ...
}

 

 

Kernel reading off-chip memory: 

void Krnl_GG( ... ) {
    ...
    ack = read_channel_altera(chan_Store2GG_ack);
    // reading from global-memory location
    ...
}

 

It seems adding a fence in the writing kernel doesn't work,  

so, what would be a possible solution for this case? 

 

Thank you, 

 

Leonardo
Altera_Forum
Honored Contributor II

I only tried doing this once, with multiple different barrier configurations, but none of them worked in the end. Since debugging on the FPGA is too time-consuming and troublesome, and the emulator fails to reproduce this behavior, I gave up on that design and merged the two kernels into one to make sure I would get correct results. Even though CLK_GLOBAL_MEM_FENCE is supposed to prevent such race conditions, in practice it doesn't seem to work as intended. Note that the OpenCL specification does NOT guarantee global memory consistency except at the end of kernel execution, and hence Altera doesn't have to provide the means to avoid such problems. I would suggest seeking an alternative kernel design. You could also open a ticket directly with Altera and ask them why the barrier is not working as it should in this case.
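
Just to illustrate, here is a minimal sketch of what such a merge could look like for your snippets above; the argument names and loop bodies are made up, since I don't know your actual code: 

__kernel void Krnl_Store_GG(__global float *restrict buf,
                            __global float *restrict result,
                            const int n)
{
    // Former Krnl_Store part: write the off-chip buffer
    // (placeholder computation, replace with your real one).
    for (int i = 0; i < n; i++) {
        buf[i] = (float)i * 2.0f;
    }

    // Former Krnl_GG part: read the same locations back.
    // Since both accesses now live in one single work-item kernel,
    // the compiler enforces the load-after-store ordering itself,
    // so no channel "ack" flag or fence is needed.
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        acc += buf[i];
    }
    *result = acc;
}

The obvious trade-off is that the two stages no longer overlap in time, but at least the results will be correct. 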

Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

I only tried doing this once, with multiple different barrier configurations, but none of them worked in the end. Since debugging on the FPGA is too time-consuming and troublesome, and the emulator fails to reproduce this behavior, I gave up on that design and merged the two kernels into one to make sure I would get correct results. Even though CLK_GLOBAL_MEM_FENCE is supposed to prevent such race conditions, in practice it doesn't seem to work as intended. Note that the OpenCL specification does NOT guarantee global memory consistency except at the end of kernel execution, and hence Altera doesn't have to provide the means to avoid such problems. I would suggest seeking an alternative kernel design. You could also open a ticket directly with Altera and ask them why the barrier is not working as it should in this case. 

--- Quote End ---  

 

 

Merging kernels was the solution. 

Thanks!
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

Your problem is very likely not caused by host compilation, but rather by a race condition or some other issue that does not show up in the emulator (and there are a lot of them). 

 

I have so far encountered two such cases: 

 

- Two kernels running in parallel, where one updates an off-chip memory location and then sends a "completion" flag to the other kernel, and the other kernel then reads that memory location. This will work correctly in the emulator, since the emulator does NOT emulate concurrency, but will produce incorrect results on the actual FPGA. 

- Accidentally reducing the scope of a variable more than necessary. For example, in the following code the output will be 10 on the emulator, but 0 on the FPGA: 

 

int sum = 0;
for (int i = 0; i < 1; i++) {
    int sum = 10;
}
printf("%d", sum); 

--- Quote End ---  

 

 

Hi 

So we can't solve the data-race problem on the FPGA unless we run the kernels serially? That's a bummer... 

And the second case seems like a very common pattern; surely not every for loop will go wrong?
Altera_Forum
Honored Contributor II

The first one is a standard load-after-store dependency. It will cause a race condition and potentially incorrect results on any hardware, not just FPGAs. Using atomic loads/stores might bypass this problem, but that would be extremely slow, to the point that it might be faster to just serialize your loads and stores. If both operations are in the same kernel, the compiler will make sure to use the lowest possible initiation interval that still satisfies the dependency. In the worst case, it will serialize them. 

 

The second case is incorrect code. I am redefining the same variable inside the loop, which is legal C (variable shadowing), although many compilers can warn about it. For some reason, OpenCL compilers and also Altera's emulator seem to compile that code as if there were only one "sum" variable, while AOC [correctly] compiles it so that there are two, with the inner one only being valid inside the loop. This pattern should not exist in any correct code; I have made this "mistake" a few times, though, and couldn't catch it using the emulator. Note that the cases where I encountered this issue were a lot more complex, and the example I have used here might not actually show the problem (or it might outright fail to compile).
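
For reference, the intended (non-shadowing) version of that snippet simply assigns to the existing variable, in which case the output is 10 everywhere: 

int sum = 0;
for (int i = 0; i < 1; i++) {
    sum = 10;   // assign to the outer 'sum' instead of declaring a new one
}
printf("%d", sum);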
Altera_Forum
Honored Contributor II

Hi HRZ 

Thanks for the reply. I overlooked that you redefined the variable. 

 

For the first one, as far as I know CUDA or OpenCL on a GPU won't run different kernels concurrently unless you explicitly request it with streams, but an Altera FPGA will just launch them concurrently, right? 

-------- 

I just read the Best Practices Guide, and it mentions that in AOCL you use different command queues to enable concurrent execution and the blocking attribute to avoid data races in the pipeline. 
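
If I understand it correctly, that would be something like the following on the host side (all names are made up and error checking is omitted, so this is just my rough understanding): 

#include <CL/cl.h>

/* One command queue per kernel so the two kernels can run concurrently on the FPGA. */
void launch_concurrently(cl_context ctx, cl_device_id dev,
                         cl_kernel krnl_a, cl_kernel krnl_b)
{
    cl_command_queue q_a = clCreateCommandQueue(ctx, dev, 0, NULL);
    cl_command_queue q_b = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Launch each single work-item kernel as a task on its own queue. */
    clEnqueueTask(q_a, krnl_a, 0, NULL, NULL);
    clEnqueueTask(q_b, krnl_b, 0, NULL, NULL);

    /* Wait for both kernels to finish before reading results back. */
    clFinish(q_a);
    clFinish(q_b);

    clReleaseCommandQueue(q_a);
    clReleaseCommandQueue(q_b);
}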

 

Another thing I'm wondering: does launching NDRange kernels on the FPGA really execute the work-items in parallel, or just in a pipelined fashion? 

One of Altera's online lessons said we should launch kernels with pipeline parallelism, so I assumed NDRange means full parallelism, but if that's the case, why is the hardware usage already determined when compiling the .aocx? Is it because you specify it with an attribute? 

Sorry for the second question; it should be documented somewhere, but I may be using the wrong keywords and can't find it. Thanks
Altera_Forum
Honored Contributor II

I am not sure about CUDA, but with OpenCL on GPUs you can still have multiple queues and try to run multiple kernels in parallel, and they could actually run in parallel on the hardware as long as there are shader blocks left unused by the first kernel. You can also always have this kind of race between work-items from the same kernel that are running in different work-groups. 

 

Regarding NDRange kernels, without SIMD, all work-items from all work-groups will be pipelined on the actual hardware and no two threads will ever be issued in the same clock (hence you don't need to recompile the kernel if you change local or global size). However, if you use SIMD, as many threads as your SIMD width can potentially be issued in the same clock. With num_compute_units, you can have multiple work-groups issued concurrently in different compute units.
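
For example, an NDRange kernel using both attributes could look something like this; the numbers are only placeholders and not a recommendation: 

// num_compute_units(2): the whole pipeline is replicated, so two work-groups
// can execute concurrently. num_simd_work_items(8): up to 8 work-items can be
// issued per clock within each compute unit; it requires a fixed
// reqd_work_group_size that is divisible by the SIMD width.
__attribute__((num_compute_units(2)))
__attribute__((num_simd_work_items(8)))
__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void scale(__global const float *restrict in,
                    __global float *restrict out)
{
    size_t gid = get_global_id(0);
    out[gid] = 2.0f * in[gid];
}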
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

I am not sure about CUDA, but with OpenCL on GPUs you can still have multiple queues and try to run multiple kernels in parallel, and they could actually run in parallel on the hardware as long as there are shader blocks left unused by the first kernel. You can also always have this kind of race between work-items from the same kernel that are running in different work-groups. 

 

Regarding NDRange kernels, without SIMD, all work-items from all work-groups will be pipelined on the actual hardware and no two threads will ever be issued in the same clock (hence you don't need to recompile the kernel if you change local or global size). However, if you use SIMD, as many threads as your SIMD width can potentially be issued in the same clock. With num_compute_units, you can have multiple work-groups issued concurrently in different compute units. 

--- Quote End ---  

 

 

I tried num_compute_units on an image-processing kernel; I guess because I have to do indexing inside the kernel (to work on different image regions), it became even slower. I then made two copies of the kernel under different names, executed them from different queues, and passed a different region of the image to each kernel; that took half of the original time, as I expected.  

So now I'm trying to figure out whether there's an easier way to do this. It doesn't seem wise to apply this multiple-copies strategy manually when I need a large number of copies, and the num_compute_units attribute doesn't help much because of the index-computation overhead (unless I make a copy per processed pixel, which would just take too many resources). And SIMD can only be applied to kernels whose computation can be vectorized. 

 

----------------------- 

I found that simply using get_global_id(0) causes latency on the order of 10 ms. No way to avoid this except launching the kernels manually, I guess.
Altera_Forum
Honored Contributor II

Are you talking about num_compute_units for NDRange kernels or single work-item kernels? num_compute_units for NDRange kernels works in a fully automatic manner and does not require any user intervention other than adding the attribute to the kernel header. The compiler will automatically replicate the pipeline in this case, allowing multiple work-groups to be scheduled in parallel. This obviously comes at the cost of higher area usage and higher memory bandwidth utilization. If memory bandwidth is saturated, using num_compute_units will actually reduce performance due to extra memory contention.

Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

Are you talking about num_compute_units for NDRange kernels or single work-item kernels? num_compute_units for NDRange kernels works in a fully automatic manner and does not require any user intervention other than adding the attribute to the kernel header. The compiler will automatically replicate the pipeline in this case, allowing multiple work-groups to be scheduled in parallel. This obviously comes at the cost of higher area usage and higher memory bandwidth utilization. If memory bandwidth is saturated, using num_compute_units will actually reduce performance due to extra memory contention. 

--- Quote End ---  

 

 

Thx for the reply. 

I meant moving from single work-item to NDRange. The original kernel processes all pixels in one for loop; I then tried splitting the loop in half and launching two copies in parallel. By using NDRange I have to call get_global_id, and I later found out that this adds latency compared to not using it at all. I guess it's not a big deal when the kernels are complex and the 10 ms doesn't cause a bottleneck, but mimicking GPU programming on an FPGA just won't pay off...