Intel® High Level Design
Support for Intel® High Level Synthesis Compiler, DSP Builder, OneAPI for Intel® FPGAs, Intel® FPGA SDK for OpenCL™

DPC++ optimizations for FPGA and using report.html

CFR
New Contributor II

Environment: Ubuntu 18.04.4, OneAPI

My broader question is: how does one control or guide the optimization, and how can you observe or verify that the compiler "gets it right"? (I think this would be a great addition to the DPC++ FPGA optimization document.)

More specifically, given the following "toy" kernel code:

    q.submit([&](sycl::handler& cgh) {
      auto xaccessor = xbuffer.get_access<sycl::access::mode::read_write, sycl::access::target::global_buffer>(cgh);
      cgh.parallel_for<class k0a>(
        sycl::range<1> {Nproc},
        [=] (sycl::item<1> item) {
          unsigned long gidx = item.get_linear_id();
          LCG48 lcg48(xaccessor[gidx]);
          for (int i=0; i<Niter; i++) lcg48.step();
          xaccessor[gidx] = lcg48.get();
        }
      );
    });

How does one get DPC++ to unroll the lcg48.step() loop and pipeline the work items?  I've tried various attributes ("unroll", "max_concurrency", "max_interleaving", etc.), but I must be missing something, because I don't see any optimization happening (at least according to the compiler output and/or report.html).  Based on the documentation, DPC++ should also be able to statically coalesce the memory accesses, but I don't see that either.
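As a side note, here's a quick host-side sanity check (plain C++, no SYCL; this is my own hand derivation, not anything the compiler reported) of what a fully unrolled lcg48.step() loop could reduce to: since the LCG is affine, Niter steps collapse into a single multiply-add mod 2^48, which is roughly the kind of flattening I'd hope to see in report.html after unrolling.

```cpp
#include <cassert>
#include <cstdint>

// Host-side sanity check: Niter applications of the rand48-style step
// state = (A*state + B) mod 2^48 collapse into one affine step
// state' = (A^n)*state + B*(A^(n-1)+...+A+1) mod 2^48.
constexpr uint64_t MASK48 = (1ULL << 48) - 1;
constexpr uint64_t A = 0x5DEECE66DULL;
constexpr uint64_t B = 0xBULL;

// One step at a time, masking as we go (mirrors LCG48::step()).
uint64_t iter_steps(uint64_t state, int n) {
  for (int i = 0; i < n; i++) state = (A * state + B) & MASK48;
  return state;
}

// All n steps fused into a single multiply-add.  Building the constants
// in 64-bit arithmetic is fine: reduction mod 2^64 followed by the final
// 48-bit mask agrees with working mod 2^48 throughout.
uint64_t fused_steps(uint64_t state, int n) {
  uint64_t An = 1, C = 0;
  for (int i = 0; i < n; i++) { C = A * C + B; An *= A; }
  return (An * state + C) & MASK48;
}
```

Both paths agree for any seed, so an unrolled step loop really is just a chain of multiply-adds the hardware can pipeline.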

Just for completeness here's a full program to experiment with....

#include <CL/sycl.hpp>
#include <cstdio>
#include <unistd.h>
#include <algorithm>

namespace sycl = cl::sycl;

const int Nproc=20;
const int Niter=5;


class LCG48 {
  public :
    LCG48() { state = 0x330E; }
    LCG48(unsigned int x) { state = (long(x)<<16) + 0x330E; }
    void seed(unsigned int x) { state = (long(x)<<16) + 0x330E; }
    long int step() {
      state = ((0x5DEECE66D * (state) + 0xB) % (1L<<48));
      return (0x7FFFFFFF & (state >> 17));
    }
    long int get() {
      return (0x7FFFFFFF & (state >> 17));
    }
  private:
    unsigned long state;
};


int 
main(int argc, char *argv[])
{
  unsigned long t1, t2;
  int xdata[Nproc];

  for (int i=0; i<Nproc; i++) xdata[i] = i;
  LCG48 lcg48;
  for (int i=0; i<Nproc; i++) {
    lcg48.seed(i);
    for (int j=0; j<Niter; j++) lcg48.step();
    if (i<8) printf("%08X ", (unsigned int)lcg48.get());
  }
  printf("\n");

  /** 
   ** Choose a device 
   **/
  //sycl::device dev = sycl::default_selector().select_device();
  //sycl::device dev = sycl::host_selector().select_device();
  //sycl::device dev = sycl::gpu_selector().select_device();
  //sycl::device dev = sycl::cpu_selector().select_device();
  sycl::device dev = sycl::accelerator_selector().select_device();
  //intel::fpga_emulator_selector dev
  //intel::fpga_selector dev;

  std::cout << "Device: " 
        << "name: " << dev.get_info<sycl::info::device::name>() << std::endl
        << "vendor: " << dev.get_info<sycl::info::device::vendor>() << std::endl;
  sycl::queue q(dev);

  /**
   ** Parallel For
   **/
  for (int i=0; i<Nproc; i++) xdata[i] = i+1;
  {
    sycl::buffer<int, 1> xbuffer((int *)xdata, sycl::range<1> {Nproc});
    q.submit([&](sycl::handler& cgh) {
      auto xaccessor = xbuffer.get_access<sycl::access::mode::read_write, sycl::access::target::global_buffer>(cgh);
      cgh.parallel_for<class k0a>(
        sycl::range<1> {Nproc},
        [=] (sycl::item<1> item) {
          unsigned long gidx = item.get_linear_id();
          LCG48 lcg48(xaccessor[gidx]);
          for (int i=0; i<Niter; i++) lcg48.step();
          xaccessor[gidx] = lcg48.get();
        }
      );
    });
  }
  for (int i=0; i<std::min(Nproc,8); i++) printf("%08X ", xdata[i]); printf("\n");

  /**
   **  Single task
   **/
  for (int i=0; i<Nproc; i++) xdata[i] = i+0;
  {
    sycl::buffer<int, 1> xbuffer((int *)xdata, sycl::range<1> {Nproc});
    q.submit([&](sycl::handler& cgh) {
      auto xaccessor = xbuffer.get_access<sycl::access::mode::read_write, sycl::access::target::global_buffer>(cgh);
      cgh.single_task<class k1a>(
        [=] () {
          for (int gidx=0; gidx<Nproc; gidx++) {
            LCG48 lcg48(xaccessor[gidx]);
            for (int i=0; i<Niter; i++) lcg48.step();
            xaccessor[gidx] = lcg48.get();
          }
        }
      );
    });
  }
  for (int i=0; i<std::min(Nproc,8); i++) printf("%08X ", xdata[i]); printf("\n");
}

... which I compile with:

dpcpp -O3 -g -mavx2 -fintelfpga -fsycl-link -Xshardware -Xsboard=intel_a10gx_pac:pac_a10 fpga1b.cpp -lOpenCL -lsycl

 

GouthamK_Intel
Moderator

Hi,

Thanks for reaching out to us!

Since your issue is related to FPGA, we are moving this query to the FPGA forum for a faster response.


Regards,

Goutham


CFR
New Contributor II

Just to elaborate on what I realize is an otherwise broad question...

If I were optimizing my example, I'd look to unroll the lcg48.step() loop and then parallelize the work items enough to use all the memory bandwidth.  Each work-item "column" could then be pipelined.  For example, with a 256-bit-wide memory I'd have <=8 pipelines (256/32), and each pipeline would process ceil(20/8) work items.  (OK, my choice of example parameters isn't all that great, but I kept things small so I could see things.  You get the idea ;^).)  So, back to the original question: I can use "#pragma unroll" to unroll the inner loop (and observe that in the report.html).  What I don't see is how to control how many parallel pipelines are created (see note).  I don't think it's happening automatically; or at least I don't see anything in the report.html to suggest that it is.
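To make the decomposition concrete, here's a plain-C++ sketch (no SYCL; the pragma placement is my guess at how I'd express it in the dpcpp FPGA flow, not something I've verified in report.html) of the lane structure I described: NLANES "pipelines", each handling ceil(NPROC/NLANES) work items.

```cpp
#include <cassert>
#include <cstdint>

// NLANES parallel "pipelines", each handling ceil(NPROC/NLANES) items.
// In the dpcpp FPGA flow, the lane loop is the one I'd mark with
// "#pragma unroll" so each lane becomes its own hardware pipeline.
constexpr int NPROC = 20, NITER = 5, NLANES = 8;  // 8 = 256-bit word / 32-bit item
constexpr uint64_t MASK48L = (1ULL << 48) - 1;

// Seed from x, step NITER times, return the 31-bit output (mirrors LCG48).
static uint32_t lcg_run(uint32_t x) {
  uint64_t s = ((uint64_t)x << 16) + 0x330E;
  for (int i = 0; i < NITER; i++) s = (0x5DEECE66DULL * s + 0xB) & MASK48L;
  return (uint32_t)(0x7FFFFFFF & (s >> 17));
}

void banked(uint32_t *x) {
  // Outer loop walks "rows" of NLANES items; inner loop walks lanes.
  for (int base = 0; base < NPROC; base += NLANES) {
    // #pragma unroll   // <- unroll the lane loop in the FPGA flow
    for (int lane = 0; lane < NLANES; lane++) {
      int gidx = base + lane;
      if (gidx < NPROC) x[gidx] = lcg_run(x[gidx]);  // guard the ragged tail
    }
  }
}
```

Restructured this way, the results are identical to the straightforward per-item loop; the only change is the loop nest shape.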

 

Hope that narrows things a little bit.

Note: if you unroll both loops in the k1a kernel, you do see what looks like parallelism in the graph view of report.html.

CFR
New Contributor II

Just to follow up in case someone else might be trying similar experiments...

I took the oneAPI/FPGA tutorial at FPL 2020 (highly recommended).  One of the things stressed in the tutorial was that the compiler is much better at handling "single_task" kernels than "parallel_for" kernels, and using "single_task" was strongly recommended.  This didn't come across to me in the documentation.
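For anyone following along, here's a sketch of what that advice looks like applied to my example.  This is plain host C++ shaped like the body of the k1a kernel (in the real program it would sit inside cgh.single_task<class k1a>([=] { ... });, and the exact pragma/attribute spellings may vary across oneAPI versions, so treat the placement as my reading of the tutorial rather than verified syntax):

```cpp
#include <cstdint>

// Plain-C++ stand-in for the k1a single_task body, with the fixed-trip-count
// step loop unrolled.  The state update is inlined here so the sketch is
// self-contained; it matches LCG48's seed/step/get behavior.
constexpr int NPROC2 = 20, NITER2 = 5;
constexpr uint64_t MASK48B = (1ULL << 48) - 1;

void k1a_body(int *x) {
  for (int gidx = 0; gidx < NPROC2; gidx++) {   // outer loop: pipelined items
    uint64_t s = ((uint64_t)(uint32_t)x[gidx] << 16) + 0x330E;  // seed
    #pragma unroll   // fully unroll the NITER2 multiply-add chain
    for (int i = 0; i < NITER2; i++)
      s = (0x5DEECE66DULL * s + 0xB) & MASK48B;                 // step
    x[gidx] = (int)(0x7FFFFFFF & (s >> 17));                    // get
  }
}
```

The idea is that the single_task outer loop gets pipelined (one work item entering per cycle, ideally), while the unrolled inner loop becomes a spatial chain of multiply-adds.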

AnilErinch_A_Intel

Hi,

Thanks for sharing your learning with the community.

Thanks and Regards

Anil

