Re: Re:Non-Ordered Pipes for Random-Number-Generation

Christoph9 · ‎10-16-2022

Hello,

I currently am trying to fit a design onto the Arria10 FPGA that requires random-numbers at multiple locations in the code, sometimes in sub-sub-subroutines.

First I tried to hand the state of the random-number generator. I use the engine used by DPCT. Here my autorun kernel showing how I use it:

class random_generator_kernel_id;

struct rnd_generator {
  void operator()() const {
    auto rand_state = dpct::rng::device::rng_generator<
        oneapi::mkl::rng::device::philox4x32x10<4>>(1984, { 0, 0 * 8 });
    while (1) 
    {
        sycl::ext::intel::pipe<rnd_out_pipe_id, float, 8>::write(
            rand_state
                .generate<oneapi::mkl::rng::device::uniform<float>,
                            1>());
    }
  }
};

fpga_tools::Autorun<random_generator_kernel_id> ar_rnd_gen{ds, rnd_generator{}};

However, this resulted in huge area-utilization during the low-level synthesis (the HLS gave me low estimates in the report, which should have fit easily).

I tried many things and came up with an idea: Use a second autorun kernel (as in the autorun-tutorial in the OneAPI samples repository) to generate the random numbers and put them in a pipe (as shown above).

My other kernel (ND-Range), can then just read from the pipe at multiple locations. In the OneAPI programming guide for FPGAs is noted that no order-guarantee can be given when using pipes with ND-range kernels, but this should be fine as I just use it for random numbers.

This compiles, but throws warnings like this for pipe-reads in loops:

raytracing.dp.cpp:170: Compiler Warning: Pipe ordering required barrier insertion in for.body.i.i of pipe_ZTS18RaytracingKernelID, but kernel may hang as a result
camera.h:33: Compiler Warning: Pipe ordering required barrier insertion in do.body.i.i.i.i of pipe_ZTS18RaytracingKernelID, but kernel may hang as a result
hitable_list.h:44: Compiler Warning: Pipe ordering required barrier insertion in for.body.i.i.i.i of pipe_ZTS18RaytracingKernelID, but kernel may hang as a result
raytracing.dp.cpp:122: Compiler Warning: Pipe ordering required barrier insertion in _ZNK12hitable_list3hitERK3rayffR10hit_record.exit.i.i.i of pipe_ZTS18RaytracingKernelID, but kernel may hang as a result
material.h:103: Compiler Warning: Pipe ordering required barrier insertion in do.body.i.i.i.i.i.i of pipe_ZTS18RaytracingKernelID, but kernel may hang as a result
material.h:103: Compiler Warning: Pipe ordering required barrier insertion in do.body.i.i105.i.i.i.i of pipe_ZTS18RaytracingKernelID, but kernel may hang as a result
raytracing.dp.cpp:170: Compiler Warning: Pipe ordering required barrier insertion in for.cond.i.i.i.i.preheader.UnifiedLatchBlock.switch of pipe_ZTS18RaytracingKernelID, but kernel may hang as a result

And as the compiler said, the kernel indeed gets stuck and hangs at execution.

So now my question: How can I prevent the compiler from establishing some kind of pipe ordering in loops, as I do not need this for the random-numbers?
And if this is not possible, is there another way for efficient random-number generation on an FPGA in SYCL without needing to pass the whole random-state to subroutines causing exploding area-utilization?

Thanks in advance for any suggestions,
Christoph

KennyTan_Altera · ‎10-20-2022

Hi Christoph,

Sorry for the late reply on this, we are actually investigating this. This issue is quite complex and it will take times for us to do the investigation. Will get back to you as soon as possible.

Thanks

Christoph9 · ‎10-21-2022

Hey Kenny,

thanks for your reply, I'm looking forward to your investigation!

Best regards,
Christoph

BoonBengT_Altera · ‎10-20-2022

Hi @Christoph9,

Thank you for posting in Intel community forum, hope all is well and apologies for the delayed in response.

If I understand the situation correctly, what are to be achieve data passing between kernel concurrently in kernels execution.

For that I would say yes, there is a way for implementing a non-blocking for the reads and writes in pipes.

This will enable prevent stalling to the kernel until the fifo buffer is free.

More details explanation can be found in our optimization guide below under section 4.3.1 pipes extension:

- https://www.intel.com/content/dam/develop/external/us/en/documents/oneapi-dpcpp-fpga-optimization-guide.pdf#page=183&zoom=100,0,333

Readily available pipes tutorial are also available for the convenient to try on as below:

https://github.com/oneapi-src/oneAPI-samples/tree/master/DirectProgramming/DPC%2B%2BFPGA/Tutorials/Features/pipes

Hope that clarify.

Best Wishes

BB

Christoph9 · ‎10-21-2022

Hey BB,

thanks for your reply, no problems with the late response!

The problem is that I already use non-blocking pipes. The warnings result from reading from these pipes in e.g. a while loop:

#define RANDVEC3               \
    vec3(sycl::ext::intel::pipe<rnd_out_pipe_id, float, 8>::read(), \
         sycl::ext::intel::pipe<rnd_out_pipe_id, float, 8>::read(), \
         sycl::ext::intel::pipe<rnd_out_pipe_id, float, 8>::read())

vec3
random_in_unit_sphere()
{
    vec3 p;
    do
    {
        p = 2.0f * RANDVEC3 - vec3(1.0f, 1.0f, 1.0f);
    } while (p.squared_length() >= 1.0f);
    return p;
}

This results in the warning shown in the first post. The compiler seems to try to maintain the ordering of read's to the pipes by using barriers which let the kernel get stuck at runtime.

However I use these pipes in an ND-range kernel for random numbers, so the order in which the read's take place are not of any concern. I just used the pipes for random-numbers as passing the random-state itself through the various functions of the kernel and therefore its pipeline let the area-utilization explode during low-level synthesis with Quartus (the HLS expected a much lower utilization, one that should easily fit on the FPGA).

Therefore it would be great if I could diable the forcing of the read-order by the compiler or if I found another way for creating random-numbers efficiently deep inside my kernel. I did not found any articles on efficient random-number generation with DPC++ on FPGAs, so I have no ideas for finding/creating such a implementation.

Kind regards,
Christoph

BoonBengT_Altera · ‎10-30-2022

Hi @Christoph9,

Thank you for the patients, im trying to simulate the warning you have, unfortunately no luck in that.

If convenient perhaps you can share the .cpp files you have from your end for better understanding.

And as looking deeper into the warning message like you said compiler are using the barriers to synchronize memory accesses across threads.

My guess is that are more toward the loop pipelining part, perhaps you can try upon compilation to disabled the loops pipelining options. (More details on the link here --> https://www.intel.com/content/dam/develop/external/us/en/documents/oneapi-dpcpp-fpga-optimization-guide.pdf#page=171&zoom=100,0,572)

Hope that clarify.

Best Wishes

BB

BoonBengT_Altera · ‎11-03-2022

Hey @Christoph9,

Greetings, just checking in to see if there is any further doubts in regards to this matter.

Hope we have clarify your doubts.

Best Wishes

BB

BoonBengT_Altera · ‎11-06-2022

Hi @Christoph9,

Good day, just following up on the previous clarification.

By any chances did you managed to look into the it?

Best Wishes

BB

Christoph9 · ‎11-08-2022

Hey BB,

thanks for your help and sorry for the late response, I did not came to work on this for the last week.

I made you the smallest little example of my setup that creates this error, it does not really something useful, just some snippets of the DPCT-translated and slightly modified raytracing benchmark of the Altis GPU Becnhmark.

You need the following files in some directory:

#include <CL/sycl.hpp>

#include <dpct/dpct.hpp>
#include <dpct/rng_utils.hpp>
#include <oneapi/dpl/random>
#include <oneapi/mkl.hpp>
#include <oneapi/mkl/rng/device.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>

#include "vec3.h"
#include "sphere.h"
#include "camera.h"
#include "random_gen.h"

namespace fpga_tools {

namespace detail {
// Autorun implementation
template <bool run_forever, typename KernelID>
struct Autorun_impl {
  // Constructor with a kernel name
  template <typename DeviceSelector, typename KernelFunctor>
  Autorun_impl(DeviceSelector device_selector, KernelFunctor kernel) {
    // static asserts to ensure KernelFunctor is callable
    static_assert(std::is_invocable_r_v<void, KernelFunctor>,
                  "KernelFunctor must be callable with no arguments");

    // create the device queue
    sycl::queue q{device_selector};

    // submit the user's kernel
    if constexpr (run_forever) {
      if constexpr (std::is_same_v<KernelID, void>) {
        // AutorunForever, kernel name not given
        q.single_task([=] {
          while (1) {
            kernel();
          }
        });
      } else {
        // AutorunForever, kernel name given
        q.single_task<KernelID>([=] {
          while (1) {
            kernel();
          }
        });
      }
    } else {
      // run the kernel as-is, if the user wanted it to run forever they
      // will write their own explicit while-loop
      if constexpr (std::is_same_v<KernelID, void>) {
        // Autorun, kernel name not given
        q.single_task(kernel);
      } else {
        // Autorun, kernel name given
        q.single_task<KernelID>(kernel);
      }
    }
  }
};
}  // namespace detail

// Autorun
template <typename KernelID = void>
using Autorun = detail::Autorun_impl<false, KernelID>;

// AutorunForever
template <typename KernelID = void>
using AutorunForever = detail::Autorun_impl<true, KernelID>;
}  // namespace fpga_tools

#if defined(FPGA_EMULATOR)
sycl::ext::intel::fpga_emulator_selector ds;
#else
sycl::ext::intel::fpga_selector ds;
#endif

class random_generator_kernel_id;

struct rnd_generator {
  void operator()() const {
    auto rand_state = dpct::rng::device::rng_generator<
        oneapi::mkl::rng::device::philox4x32x10<4>>(1984, { 0, 0 * 8 });
    while (1) 
    {
        sycl::ext::intel::pipe<rnd_out_pipe_id, float, 8>::write(
            rand_state
                .generate<oneapi::mkl::rng::device::uniform<float>,
                            1>());
    }
  }
};

fpga_tools::Autorun<random_generator_kernel_id> ar_rnd_gen{ds, rnd_generator{}};

int 
main(int argc, char *argv[])
{
    sycl::queue q{ds};

    const sycl::range<3> blocks(1, 4, 4);
    const sycl::range<3> threads(1, 16, 16);

    const size_t       fb_size = 3840 * 2160 * sizeof(vec3);
    sycl::buffer<vec3> h_fb { sycl::range(fb_size) };

    sycl::event render_event = q.submit([&](sycl::handler &cgh) {
        sycl::accessor a_fb { h_fb, cgh, sycl::write_only, sycl::no_init };

        cgh.parallel_for<class sample_kernel_id>(
            sycl::nd_range<3>(blocks * threads, threads),
            [=](sycl::nd_item<3> item_ct1) {
                int i = item_ct1.get_local_id(2)
                        + item_ct1.get_group(2) * item_ct1.get_local_range(2);
                int j = item_ct1.get_local_id(1)
                        + item_ct1.get_group(1) * item_ct1.get_local_range(1);
                if ((i >= 3840) || (j >= 2160))
                    return;
                int pixel_index = j * 3840 + i;

                vec3   col(0, 0, 0);
                vec3  lookfrom(13.0f, 2.0f, 3.0f);
                vec3  lookat(0.0f, 0.0f, 0.0f);
                float dist_to_focus = 10.0f;
                (lookfrom - lookat).length();
                float aperture = 0.1f;
                camera cam(lookfrom,
                            lookat,
                            vec3(0.0f, 1.0f, 0.0f),
                            30.0f,
                            float(3840) / float(2160),
                            aperture,
                            dist_to_focus);
                for (int s = 0; s < 10; s++)
                {
                    float u = float(i + sycl::ext::intel::pipe<rnd_out_pipe_id, float, 8>::read()) / float(3840);
                    float v = float(j + sycl::ext::intel::pipe<rnd_out_pipe_id, float, 8>::read()) / float(2160);
                    ray   r = cam.get_ray(u, v);
                }

                col /= float(10);
                col[0]          = sycl::sqrt(col[0]);
                col[1]          = sycl::sqrt(col[1]);
                col[2]          = sycl::sqrt(col[2]);
                a_fb[pixel_index] = col;
            });
    });
    render_event.wait();
    const float elapsed = render_event.get_profiling_info<
                                sycl::info::event_profiling::command_end>()
                            - render_event.get_profiling_info<
                                sycl::info::event_profiling::command_start>();

    std::cout << "Test elapsed: " << elapsed << std::endl;

    return 0;
}

(test.cpp)

////////////////////////////////////////////////////////////////////////////////////////////////////
// file:	altis\src\cuda\level2\raytracing\camera.h
//
// summary:	Declares the camera class
// 
// origin: Ray tracing(https://github.com/ssangx/raytracing.cuda)
////////////////////////////////////////////////////////////////////////////////////////////////////

#ifndef CAMERAH
#define CAMERAH

#include <CL/sycl.hpp>
#include <dpct/dpct.hpp>
#include <oneapi/mkl.hpp>
#include <oneapi/mkl/rng/device.hpp>
#include <dpct/rng_utils.hpp>

#include "ray.h"
#include "random_gen.h"

////////////////////////////////////////////////////////////////////////////////////////////////////
/// <summary>	Random in unit disk. </summary>
///
/// <remarks>	Ed, 5/20/2020. </remarks>
///
/// <param name="local_rand_state">	[in,out] If non-null, state of the local random. </param>
///
/// <returns>	A vec3. </returns>
////////////////////////////////////////////////////////////////////////////////////////////////////

/*
DPCT1032:1298: A different random number generator is used. You may need to
adjust the code.
*/
vec3 random_in_unit_disk() {
    vec3 p;
    do {
        /*
        DPCT1084:1299: The function call has multiple migration results in
        different template instantiations that could not be unified. You may
        need to adjust the code.
        */
        p = 2.0f * vec3(sycl::ext::intel::pipe<rnd_out_pipe_id, float, 8>::read(),
                        sycl::ext::intel::pipe<rnd_out_pipe_id, float, 8>::read(),
                        0) -
            vec3(1, 1, 0);
    } while (dot(p,p) >= 1.0f);
    return p;
}

////////////////////////////////////////////////////////////////////////////////////////////////////
/// <summary>	A camera. </summary>
///
/// <remarks>	Ed, 5/20/2020. </remarks>
////////////////////////////////////////////////////////////////////////////////////////////////////

class camera {
public:
    camera(vec3 lookfrom, vec3 lookat, vec3 vup, float vfov, float aspect, float aperture, float focus_dist) { // vfov is top to bottom in degrees
        lens_radius = aperture / 2.0f;
        float theta = vfov*((float)M_PI)/180.0f;
        float half_height = sycl::tan(theta / 2.0f);
        float half_width = aspect * half_height;
        origin = lookfrom;
        /*
        DPCT1084:1300: The function call has multiple migration results in
        different template instantiations that could not be unified. You may
        need to adjust the code.
        */
        w = unit_vector(lookfrom - lookat);
        /*
        DPCT1084:1301: The function call has multiple migration results in
        different template instantiations that could not be unified. You may
        need to adjust the code.
        */
        u = unit_vector(cross(vup, w));
        /*
        DPCT1084:1302: The function call has multiple migration results in
        different template instantiations that could not be unified. You may
        need to adjust the code.
        */
        v = cross(w, u);
        /*
        DPCT1084:1303: The function call has multiple migration results in
        different template instantiations that could not be unified. You may
        need to adjust the code.
        */
        lower_left_corner = origin - half_width * focus_dist * u -
                            half_height * focus_dist * v - focus_dist * w;
        /*
        DPCT1084:1304: The function call has multiple migration results in
        different template instantiations that could not be unified. You may
        need to adjust the code.
        */
        horizontal = 2.0f * half_width * focus_dist * u;
        /*
        DPCT1084:1305: The function call has multiple migration results in
        different template instantiations that could not be unified. You may
        need to adjust the code.
        */
        vertical = 2.0f * half_height * focus_dist * v;
    }

    ////////////////////////////////////////////////////////////////////////////////////////////////////
    /// <summary>	Gets a ray. </summary>
    ///
    /// <remarks>	Ed, 5/20/2020. </remarks>
    ///
    /// <param name="s">			   	A float to process. </param>
    /// <param name="t">			   	A float to process. </param>
    /// <param name="local_rand_state">	[in,out] If non-null, state of the local random. </param>
    ///
    /// <returns>	The ray. </returns>
    ////////////////////////////////////////////////////////////////////////////////////////////////////

    /*
    DPCT1032:1306: A different random number generator is used. You may need to
    adjust the code.
    */
    ray
    get_ray(float s, float t) {
        /*
        DPCT1084:1307: The function call has multiple migration results in
        different template instantiations that could not be unified. You may
        need to adjust the code.
        */
        vec3 rd = lens_radius * random_in_unit_disk();
        /*
        DPCT1084:1308: The function call has multiple migration results in
        different template instantiations that could not be unified. You may
        need to adjust the code.
        */
        vec3 offset = u * rd.x() + v * rd.y();
        /*
        DPCT1084:1309: The function call has multiple migration results in
        different template instantiations that could not be unified. You may
        need to adjust the code.
        */
        return ray(origin + offset, lower_left_corner + s * horizontal +
                                        t * vertical - origin - offset);
    }

    /// <summary>	The origin. </summary>
    vec3 origin;
    /// <summary>	The lower left corner. </summary>
    vec3 lower_left_corner;
    /// <summary>	The horizontal. </summary>
    vec3 horizontal;
    /// <summary>	The vertical. </summary>
    vec3 vertical;

    ////////////////////////////////////////////////////////////////////////////////////////////////////
    /// <summary>	Gets the w. </summary>
    ///
    /// <value>	The w. </value>
    ////////////////////////////////////////////////////////////////////////////////////////////////////

    vec3 u, v, w;
    /// <summary>	The lens radius. </summary>
    float lens_radius;
};

#endif

(camera.h)

#pragma once

class rnd_out_pipe_id;

(random_gen.h)

#ifndef RAYH
#define RAYH
#include <CL/sycl.hpp>
#include <dpct/dpct.hpp>
#include "vec3.h"

class ray
{
    public:
        ray() {}

        ray(const vec3& a, const vec3& b) { A = a; B = b; }

        vec3 origin() const       { return A; }

        vec3 direction() const    { return B; }

        /*
        DPCT1084:1293: The function call has multiple migration results in
        different template instantiations that could not be unified. You may
        need to adjust the code.
        */
        vec3 point_at_parameter(float t) const { return A + t * B; }

        /// <summary>	A vec3 to process. </summary>
        vec3 A;
        /// <summary>	A vec3 to process. </summary>
        vec3 B;
};

#endif

(ray.h)

#ifndef SPHEREH
#define SPHEREH

#include <CL/sycl.hpp>
#include <dpct/dpct.hpp>

#include "ray.h"

class sphere {
    public:
        sphere() {}
        sphere(vec3 cen, float r) : center(cen), radius(r)  {};

        SYCL_EXTERNAL bool hit(const ray& r, float tmin, float tmax) const;
        
        vec3 center;
        float radius;
};

bool sphere::hit(const ray& r, float t_min, float t_max) const {
    vec3 oc = r.origin() - center;
    float a = dot(r.direction(), r.direction());
    float b = dot(oc, r.direction());
    float c = dot(oc, oc) - radius*radius;
    float discriminant = b*b - a*c;
    if (discriminant > 0) {
        float temp = (-b - sycl::sqrt(discriminant)) / a;
        if (temp < t_max && temp > t_min) {
            return true;
        }
        temp = (-b + sycl::sqrt(discriminant)) / a;
        if (temp < t_max && temp > t_min) {
            return true;
        }
    }
    return false;
}


#endif

(sphere.h)

#ifndef VEC3H
#define VEC3H

#include <CL/sycl.hpp>
#include <dpct/dpct.hpp>
#include <math.h>
#include <stdlib.h>
#include <iostream>

class vec3  {
public:
    vec3() {}
    vec3(float e0, float e1, float e2) { e[0] = e0; e[1] = e1; e[2] = e2; }

    inline float x() const { return e[0]; }
    inline float y() const { return e[1]; }
    inline float z() const { return e[2]; }

    inline float r() const { return e[0]; }
    inline float g() const { return e[1]; }
    inline float b() const { return e[2]; }

    inline const vec3& operator+() const { return *this; }
    inline vec3 operator-() const { return vec3(-e[0], -e[1], -e[2]); }
    inline float operator[](int i) const { return e[i]; }
    inline float& operator[](int i) { return e[i]; };
    inline vec3& operator+=(const vec3 &v2);
    inline vec3& operator-=(const vec3 &v2);
    inline vec3& operator*=(const vec3 &v2);
    inline vec3& operator/=(const vec3 &v2);
    inline vec3& operator*=(const float t);
    inline vec3& operator/=(const float t);

    inline float length() const {
            return sycl::sqrt(e[0] * e[0] + e[1] * e[1] + e[2] * e[2]);
    }

    inline float squared_length() const { return e[0]*e[0] + e[1]*e[1] + e[2]*e[2]; }
    inline void make_unit_vector();

    float e[3];
};

inline std::istream& operator>>(std::istream &is, vec3 &t) {
    is >> t.e[0] >> t.e[1] >> t.e[2];
    return is;
}

inline std::ostream& operator<<(std::ostream &os, const vec3 &t) {
    os << t.e[0] << " " << t.e[1] << " " << t.e[2];
    return os;
}

inline void vec3::make_unit_vector() {
    float k = 1.0 / sycl::sqrt(e[0] * e[0] + e[1] * e[1] + e[2] * e[2]);
    e[0] *= k; e[1] *= k; e[2] *= k;
}

inline vec3 operator+(const vec3 &v1, const vec3 &v2) {
    return vec3(v1.e[0] + v2.e[0], v1.e[1] + v2.e[1], v1.e[2] + v2.e[2]);
}

inline vec3 operator-(const vec3 &v1, const vec3 &v2) {
    return vec3(v1.e[0] - v2.e[0], v1.e[1] - v2.e[1], v1.e[2] - v2.e[2]);
}

inline vec3 operator*(const vec3 &v1, const vec3 &v2) {
    return vec3(v1.e[0] * v2.e[0], v1.e[1] * v2.e[1], v1.e[2] * v2.e[2]);
}

inline vec3 operator/(const vec3 &v1, const vec3 &v2) {
    return vec3(v1.e[0] / v2.e[0], v1.e[1] / v2.e[1], v1.e[2] / v2.e[2]);
}

inline vec3 operator*(float t, const vec3 &v) {
    return vec3(t*v.e[0], t*v.e[1], t*v.e[2]);
}

inline vec3 operator/(vec3 v, float t) {
    return vec3(v.e[0]/t, v.e[1]/t, v.e[2]/t);
}

inline vec3 operator*(const vec3 &v, float t) {
    return vec3(t*v.e[0], t*v.e[1], t*v.e[2]);
}

inline float dot(const vec3 &v1, const vec3 &v2) {
    return v1.e[0] *v2.e[0] + v1.e[1] *v2.e[1]  + v1.e[2] *v2.e[2];
}

inline vec3 cross(const vec3 &v1, const vec3 &v2) {
    return vec3( (v1.e[1]*v2.e[2] - v1.e[2]*v2.e[1]),
                (-(v1.e[0]*v2.e[2] - v1.e[2]*v2.e[0])),
                (v1.e[0]*v2.e[1] - v1.e[1]*v2.e[0]));
}

inline vec3& vec3::operator+=(const vec3 &v){
    e[0]  += v.e[0];
    e[1]  += v.e[1];
    e[2]  += v.e[2];
    return *this;
}

inline vec3& vec3::operator*=(const vec3 &v){
    e[0]  *= v.e[0];
    e[1]  *= v.e[1];
    e[2]  *= v.e[2];
    return *this;
}

inline vec3& vec3::operator/=(const vec3 &v){
    e[0]  /= v.e[0];
    e[1]  /= v.e[1];
    e[2]  /= v.e[2];
    return *this;
}

inline vec3& vec3::operator-=(const vec3& v) {
    e[0]  -= v.e[0];
    e[1]  -= v.e[1];
    e[2]  -= v.e[2];
    return *this;
}

inline vec3& vec3::operator*=(const float t) {
    e[0]  *= t;
    e[1]  *= t;
    e[2]  *= t;
    return *this;
}

inline vec3& vec3::operator/=(const float t) {
    float k = 1.0/t;

    e[0]  *= k;
    e[1]  *= k;
    e[2]  *= k;
    return *this;
}

inline vec3 unit_vector(vec3 v) {
    /*
    DPCT1084:1292: The function call has multiple migration results in different
    template instantiations that could not be unified. You may need to adjust
    the code.
    */
    return v / v.length();
}

#endif

(vec3.h)

Compiling the project with

dpcpp -fsycl -fintelfpga -o test.cpp.o -c test.cpp

And linking it with

dpcpp -fsycl -fintelfpga -Xshardware -Xstarget="/opt/intel/oneapi/intel_a10gx_pac:pac_a10" test.cpp.o -o test.fpga

Results in the errors I stated above:

test.cpp:150: Compiler Warning: Pipe ordering required barrier insertion in for.body.i of pipe_ZTSZZ4mainENKUlRN2cl4sycl7handlerEE_clES2_E16sample_kernel_id, but kernel may hang as a result
camera.h:47: Compiler Warning: Pipe ordering required barrier insertion in do.body.i.i.i of pipe_ZTSZZ4mainENKUlRN2cl4sycl7handlerEE_clES2_E16sample_kernel_id, but kernel may hang as a result
test.cpp:150: Compiler Warning: Pipe ordering required barrier insertion in _ZN6camera7get_rayEff.exit.i of pipe_ZTSZZ4mainENKUlRN2cl4sycl7handlerEE_clES2_E16sample_kernel_id, but kernel may hang as a result

A smaller example did not lead to the errors, but I hope it is still not to time-consuming to look into it.

Thanks for your help in advance,
Christoph

Christoph9 · ‎11-08-2022

Hey BB,

sorry for the late answer, I did not came to work on this the last week!

In the attachment a "minimal" example where this errors occurs. It does nothing productive and consists of codesnippets from the DPCT-translated and slightly modified raytracing-benchmark from the Altis GPU Benchmark Suite.

One file I could not attach, so test.cpp is here:

#include <CL/sycl.hpp>

#include <dpct/dpct.hpp>
#include <dpct/rng_utils.hpp>
#include <oneapi/dpl/random>
#include <oneapi/mkl.hpp>
#include <oneapi/mkl/rng/device.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>

#include "vec3.h"
#include "sphere.h"
#include "camera.h"
#include "random_gen.h"

namespace fpga_tools {

namespace detail {
// Autorun implementation
template <bool run_forever, typename KernelID>
struct Autorun_impl {
  // Constructor with a kernel name
  template <typename DeviceSelector, typename KernelFunctor>
  Autorun_impl(DeviceSelector device_selector, KernelFunctor kernel) {
    // static asserts to ensure KernelFunctor is callable
    static_assert(std::is_invocable_r_v<void, KernelFunctor>,
                  "KernelFunctor must be callable with no arguments");

    // create the device queue
    sycl::queue q{device_selector};

    // submit the user's kernel
    if constexpr (run_forever) {
      if constexpr (std::is_same_v<KernelID, void>) {
        // AutorunForever, kernel name not given
        q.single_task([=] {
          while (1) {
            kernel();
          }
        });
      } else {
        // AutorunForever, kernel name given
        q.single_task<KernelID>([=] {
          while (1) {
            kernel();
          }
        });
      }
    } else {
      // run the kernel as-is, if the user wanted it to run forever they
      // will write their own explicit while-loop
      if constexpr (std::is_same_v<KernelID, void>) {
        // Autorun, kernel name not given
        q.single_task(kernel);
      } else {
        // Autorun, kernel name given
        q.single_task<KernelID>(kernel);
      }
    }
  }
};
}  // namespace detail

// Autorun
template <typename KernelID = void>
using Autorun = detail::Autorun_impl<false, KernelID>;

// AutorunForever
template <typename KernelID = void>
using AutorunForever = detail::Autorun_impl<true, KernelID>;
}  // namespace fpga_tools

#if defined(FPGA_EMULATOR)
sycl::ext::intel::fpga_emulator_selector ds;
#else
sycl::ext::intel::fpga_selector ds;
#endif

class random_generator_kernel_id;

struct rnd_generator {
  void operator()() const {
    auto rand_state = dpct::rng::device::rng_generator<
        oneapi::mkl::rng::device::philox4x32x10<4>>(1984, { 0, 0 * 8 });
    while (1) 
    {
        sycl::ext::intel::pipe<rnd_out_pipe_id, float, 8>::write(
            rand_state
                .generate<oneapi::mkl::rng::device::uniform<float>,
                            1>());
    }
  }
};

fpga_tools::Autorun<random_generator_kernel_id> ar_rnd_gen{ds, rnd_generator{}};

int 
main(int argc, char *argv[])
{
    sycl::queue q{ds};

    const sycl::range<3> blocks(1, 4, 4);
    const sycl::range<3> threads(1, 16, 16);

    const size_t       fb_size = 3840 * 2160 * sizeof(vec3);
    sycl::buffer<vec3> h_fb { sycl::range(fb_size) };

    sycl::event render_event = q.submit([&](sycl::handler &cgh) {
        sycl::accessor a_fb { h_fb, cgh, sycl::write_only, sycl::no_init };

        cgh.parallel_for<class sample_kernel_id>(
            sycl::nd_range<3>(blocks * threads, threads),
            [=](sycl::nd_item<3> item_ct1) {
                int i = item_ct1.get_local_id(2)
                        + item_ct1.get_group(2) * item_ct1.get_local_range(2);
                int j = item_ct1.get_local_id(1)
                        + item_ct1.get_group(1) * item_ct1.get_local_range(1);
                if ((i >= 3840) || (j >= 2160))
                    return;
                int pixel_index = j * 3840 + i;

                vec3   col(0, 0, 0);
                vec3  lookfrom(13.0f, 2.0f, 3.0f);
                vec3  lookat(0.0f, 0.0f, 0.0f);
                float dist_to_focus = 10.0f;
                (lookfrom - lookat).length();
                float aperture = 0.1f;
                camera cam(lookfrom,
                            lookat,
                            vec3(0.0f, 1.0f, 0.0f),
                            30.0f,
                            float(3840) / float(2160),
                            aperture,
                            dist_to_focus);
                for (int s = 0; s < 10; s++)
                {
                    float u = float(i + sycl::ext::intel::pipe<rnd_out_pipe_id, float, 8>::read()) / float(3840);
                    float v = float(j + sycl::ext::intel::pipe<rnd_out_pipe_id, float, 8>::read()) / float(2160);
                    ray   r = cam.get_ray(u, v);
                }

                col /= float(10);
                col[0]          = sycl::sqrt(col[0]);
                col[1]          = sycl::sqrt(col[1]);
                col[2]          = sycl::sqrt(col[2]);
                a_fb[pixel_index] = col;
            });
    });
    render_event.wait();
    const float elapsed = render_event.get_profiling_info<
                                sycl::info::event_profiling::command_end>()
                            - render_event.get_profiling_info<
                                sycl::info::event_profiling::command_start>();

    std::cout << "Test elapsed: " << elapsed << std::endl;

    return 0;
}

Compiling them via

dpcpp -fsycl -fintelfpga -o test.cpp.o -c test.cpp

And linking it results in my above stated errors:

dpcpp -fsycl -fintelfpga -Xshardware -Xstarget="/opt/intel/oneapi/intel_a10gx_pac:pac_a10" test.cpp.o -o test.fpga
test.cpp:134: Compiler Warning: Pipe ordering required barrier insertion in for.body.i of pipe_ZTSZZ4mainENKUlRN2cl4sycl7handlerEE_clES2_E16sample_kernel_id, but kernel may hang as a result
camera.h:47: Compiler Warning: Pipe ordering required barrier insertion in do.body.i.i.i of pipe_ZTSZZ4mainENKUlRN2cl4sycl7handlerEE_clES2_E16sample_kernel_id, but kernel may hang as a result
test.cpp:134: Compiler Warning: Pipe ordering required barrier insertion in _ZN6camera7get_rayEff.exit.i of pipe_ZTSZZ4mainENKUlRN2cl4sycl7handlerEE_clES2_E16sample_kernel_id, but kernel may hang as a result

A smaller example sadly did not produce the erros, I hope this is not too much code too look into fast. As you see I did not use the explicit loop-pipeling for ND-Range kernels, therefore the disable-looppipelining attribute should not do anything here.

Thanks in advance and best regards,
Christoph

KennyTan_Altera · ‎11-15-2022

Hi,

Since this issue seems to be a lot complex, we will need about two weeks or more for investigation. Kindly be patient and we will get back to you as soon as we can.

Thanks

BoonBengT_Altera · ‎11-27-2022

Hi @Christoph9,

Apologies for the delayed in response, trying to build the example your provided from my end with some minor hiccups in my end. Getting that solved and will get back with you on the result.

Just to confirm, there are 2 sets of example that you provided, are they the same?

And for the minimal/smaller example which error did not came out, which sets of example are they(first or second)?

Hope to hear from you soon.

Best Wishes

BB

Christoph9 · ‎11-28-2022

Hello BB,

yes the two examples are the same (although in the 2nd I put some files as attachement). I accidentally posted two because the first one didn't visible to me for 1 day or so. I could'nt remove the first one either, maybe you can?

I did not attach any example where the error did'nt occur. If I remember correctly the error did not occur when the loop in the kernel looked like this:

for (int s = 0; s < 10; s++)
                {
                    float u = float(i + sycl::ext::intel::pipe<rnd_out_pipe_id, float, 8>::read()) / float(3840);
                    float v = float(j + sycl::ext::intel::pipe<rnd_out_pipe_id, float, 8>::read()) / float(2160);
                }

As soon as I add the cam.getRay function, the reads in the loop for u and v start to cause these warnings.

Best regards,
Christoph

Christoph9 · ‎11-28-2022

Here another one-file example where the warning is appearing. Compile it with

dpcpp -fsycl -fintelfpga -o test.cpp.o -c test.cpp -qactypes

Linking leads again to the error mentioned above:

u153009@s001-n068:~/thesis/optimized_nd/build$ dpcpp -fsycl -fintelfpga -Xshardware -Xstarget="/opt/intel/oneapi/intel_a10gx_pac:pac_a10" test.cpp.o -o test.fpga
test.cpp:61: Compiler Warning: Pipe ordering required barrier insertion in for.body.i.i.i of pipe_ZTS20likelihood_kernel_id, but kernel may hang as a result
Compiler Warning: Enforcing pipe ordering in kernel _ZTS20likelihood_kernel_id limits number of concurrent workgroups to 1

Again I am confused why the compiler tries to enforce any ordering to the pipes here as these too are ND-Range kernels?

As stated in the FPGA optimizations guide there is no guarantee to any ordering, so for what kind of purpose can I use pipes in ND-Range kernels? My two ideas where the ordering to pipe-reads/writes does not matter

Propagate Random numbers efficiently and
Sum results of different workgroups together

are not working at all due to the weird barrier insertion by the compiler.

Is there no way to prevent this barrier insertion or is this just a bug and should not happen at all -> This would be my guess as stated above - in the optimization guide there stands that there is no ordering guarantee at all in ND-range kernels?

Best regards and thanks in advance for your help,
Christoph

BoonBengT_Altera · ‎12-11-2022

Hi @Christoph9,

Just to update on the case below, we are validating the behaviour.

And at the same time has contacted our internal engineering team to colloborate with them to understand the issues further.

Will get back to you as soon as possible once we have an updates.

Best Wishes

BB