Cannot compile with Stratix10 in DevCloud and Attributes-Local Memory Performance

RN1 · ‎03-31-2022

Hi,

==== First part) Stratix10 ====

We tried different ways without luck, therefore, we attach the same steps using the interactive session in a fpga_compile machine. We tried directly with your example (via oneapi-cli). We hope you can assist us.

u148129@s001-n057:~/fpga_compile$ mkdir build
u148129@s001-n057:~/fpga_compile$ cd build/
u148129@s001-n057:~/fpga_compile/build$ cmake .. -DFPGA_BOARD=intel_s10sx_pac:pac_s10
-- The CXX compiler identification is Clang 14.0.0
-- Check for working CXX compiler: /glob/development-tools/versions/oneapi/2022.1.2/oneapi/compiler/2022.0.2/linux/bin/dpcpp
-- Check for working CXX compiler: /glob/development-tools/versions/oneapi/2022.1.2/oneapi/compiler/2022.0.2/linux/bin/dpcpp -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring the design to run on FPGA board intel_s10sx_pac:pac_s10
-- Configuring done
-- Generating done
-- Build files have been written to: /home/u148129/fpga_compile/build
u148129@s001-n057:~/fpga_compile/build$ make report
Scanning dependencies of target fpga_compile_report.a
[ 50%] Building CXX object src/CMakeFiles/fpga_compile_report.a.dir/fpga_compile.cpp.o
[100%] Linking CXX executable ../fpga_compile_report.a
Can't opendir /glob/development-tools/versions/oneapi/2022.1.2/oneapi/intelfpgadpcpp/2022.1.0/board/de10_agilex: Permission denied at /glob/development-tools/versions/oneapi/2022.1.2/oneapi/compiler/2022.0.2/linux/lib/oclfpga/share/lib/perl/acl/Common.pm line 717.
llvm-foreach:
dpcpp: error: fpga compiler command failed with exit code 13 (use -v to see invocation)
make[3]: *** [src/CMakeFiles/fpga_compile_report.a.dir/build.make:84: fpga_compile_report.a] Error 13
make[2]: *** [CMakeFiles/Makefile2:181: src/CMakeFiles/fpga_compile_report.a.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:134: src/CMakeFiles/report.dir/rule] Error 2
make: *** [Makefile:131: report] Error 2

Some context, the node and devices listed:

# requesting the machine
u148129@login-2:~/logs$ qsub -I -l nodes=1:fpga_compile:ppn=2 -d .
qsub: waiting for job 1874662.v-qsvr-1.aidevcloud to start
qsub: job 1874662.v-qsvr-1.aidevcloud ready


########################################################################
#      Date:           Thu 31 Mar 2022 04:33:10 AM PDT
#    Job ID:           1874662.v-qsvr-1.aidevcloud
#      User:           u148129
# Resources:           neednodes=1:fpga_compile:ppn=2,nodes=1:fpga_compile:ppn=2,walltime=06:00:00
########################################################################


aocl list-devices
u148129@s001-n057:~/fpga_compile/build$ aocl list-devices
/glob/development-tools/versions/oneapi/2022.1.2/oneapi/intelfpgadpcpp/2022.1.0/board/intel_a10gx_pac/linux64/libexec/diagnose: error while loading shared libraries: libopae-c.so.1: cannot open shared object file: No such file or directory
--------------------------------------------------------------------
Warning:
No devices attached for package:
/glob/development-tools/versions/oneapi/2022.1.2/oneapi/intelfpgadpcpp/2022.1.0/board/intel_a10gx_pac
--------------------------------------------------------------------

- We cannot compile for Stratix10, our target, in a fpga_compile node. If we try to use a "stratix10" is even worse (fpga runtime machine, more errors, but that is something somehow expected).

- There are errors when compiling and also when querying (aocl). Also, it emits something related with "intel_a10gx_pac", another board that we didn't select. If you go to such directories, there are also files for the Stratix10 board (s10sx) but somehow it does not work.

What are the problems here? How can we solve them? We are just running the basic example.

==== Second part) Performance ====

If we execute with the default board a simple matrix multiplication, Arria10 (although we are not interested in such), it compiles and emits the report. However, the performance is not good compared with the report of pure OpenCL - FPGA. We would like to see what should be changed to increase, for example, the bandwidth (currently, in many places, only 32 bits, not 2048 like in OpenCL FPGA):

#include <CL/sycl.hpp>
#include <iostream>
#include <limits>

//#include <CL/sycl/INTEL/fpga_extensions.hpp>
#include <sycl/ext/intel/fpga_extensions.hpp>

// dpc_common.hpp can be found in the dev-utilities include folder.
// e.g., $ONEAPI_ROOT/dev-utilities/<version>/include/dpc_common.hpp
#include "dpc_common.hpp"

using namespace std;
using namespace sycl;

class a_init;
class b_init;
class c_calc;

/**
 * Each element of the product matrix c[i][j] is computed from a unique row and
 * column of the factor matrices, a[i][k] and b[k][j]
 */

// Matrix size constants.
//#define m_size 512  // Must be a multiple of 8.
#define N 4096
#define BL 4
#define SIMD 2

/**
 * Perform matrix multiplication on host to verify results from device.
 */
int VerifyResult(float (*c_back)[N]);

double GetExecutionTime(const event &e) {
  double start_k = e.get_profiling_info<info::event_profiling::command_start>();
  double end_k = e.get_profiling_info<info::event_profiling::command_end>();
  double kernel_time = (end_k - start_k) * 1e-9; // ns to s
  return kernel_time;
}

int main() {
  // Host memory buffer that device will write data back before destruction.
  float(*c_back)[N] = new float[N][N];

  // Intialize c_back
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) c_back[i][j] = 0.0f;

  // Initialize the device queue with the default selector. The device queue is
  // used to enqueue kernels. It encapsulates all states needed for execution.
  try {
    //sycl::INTEL::fpga_selector _device;
    ext::intel::fpga_selector _device;
    queue q(_device, dpc_common::exception_handler,cl::sycl::property::queue::enable_profiling());

    cout << "Device: " << q.get_device().get_info<info::device::name>() << "\n";

    // Create 2D buffers for matrices, buffer c is bound with host memory c_back

    // These are all Global Memory
    //buffer<float, 2> a(range(N, N));
    buffer<float, 1> aa(range(N*N));
    //buffer<float, 2> b(range(N, N));
    buffer<float, 1> bb(range(N*N));
    buffer c(reinterpret_cast<float *>(c_back), range(N, N));

    cout << "Problem size: c(" << N << "," << N << ") = a(" << N << "," << N
         << ") * b(" << N << "," << N << ")\n";

    // Submit command group to queue to multiply matrices: c = a * b
    auto e_c = q.submit([&](handler &h) {
      // Read from a and b, write to c
      auto A = aa.get_access<access::mode::read>(h);
      auto B = bb.get_access<access::mode::read>(h);
      auto C = c.get_access<access::mode::write>(h);

      //local_accessor<float, 2> A_local(range<2>{BL, BL}, h);
      //local_accessor<float, 2> B_local(range<2>{BL, BL}, h);
      /*accessor<float, 2, access::mode::read_write, access::target::local>
       [[intel::numbanks(198), intel::bankwidth(2048)]] A_local(range<2>{BL, BL}, h);
      accessor<float, 2, access::mode::read_write, access::target::local>
       [[intel::numbanks(12), intel::bankwidth(2048)]] B_local(range<2>{BL, BL}, h);*/

      accessor<float, 2, access::mode::read_write, access::target::local> A_local(range<2>{BL, BL}, h);
      accessor<float, 2, access::mode::read_write, access::target::local> B_local(range<2>{BL, BL}, h);

      range<2> num_groups(N, N);
      range<2> num_items(BL, BL);

      h.parallel_for<c_calc>(nd_range<2>(num_groups, num_items), [=](nd_item<2> item)
        [[
        intel::kernel_args_restrict,
        intel::max_work_group_size(1, BL, BL),
        sycl::reqd_work_group_size(1,BL,BL),
        intel::num_simd_work_items(SIMD)
        ]]
      {
          /*[[intel::doublepump,
          intel::fpga_memory("MLAB"),
          intel::numbanks(1),
          intel::max_replicates(3)]]
          float A_local[BL][BL];*/

          /*auto ptr = group_local_memory_for_overwrite<int[64]>(item.get_group());

          auto ptrA = group_local_memory_for_overwrite<float[BL][BL]>(item.get_group());
          auto& A_local = *ptrA;*/

          /*[[intel::doublepump,
          intel::fpga_memory("MLAB"),
          intel::numbanks(16),
          intel::max_replicates(3)]]
          float B_local[BL][BL];*/

          /*auto ptrB = group_local_memory_for_overwrite<float[BL][BL]>(item.get_group());
          auto& B_local = *ptrB;*/

          int block_x = item.get_group().get_id(0);
          int block_y = item.get_group().get_id(1);

          int local_x = item.get_local_id(0);
          int local_y = item.get_local_id(1);

          int a_start = N * BL * block_x;
          int a_end   = a_start + N - 1;
          int b_start = BL * block_y;

          float sum = 0.0f;

          for (int a = a_start, b = b_start; a <= a_end; a += BL, b += (BL * N)){

              A_local[local_x][local_y] = A[a + N * local_x + local_y];
              B_local[local_y][local_x] = B[b + N * local_x + local_y];

              item.barrier(access::fence_space::local_space);

              //#pragma unroll (BL/SIMD)
              #pragma unroll
              for (int k = 0; k < BL; ++k){
                  float aaa = A_local[local_x][k];
                  float bbb = B_local[local_y][k];
                  sum += aaa * bbb;
                  //sum += A_local[local_x][k] * B_local[local_y][k];
              }

              item.barrier(access::fence_space::local_space);

          }

          C[item.get_global_id(0)][item.get_global_id(1)] = sum;

      });


    });
    //double a_time = GetExecutionTime(e_a);
    //double b_time = GetExecutionTime(e_b);
    double c_time = GetExecutionTime(e_c);
    double input_size_kb = (2*N)*sizeof(float)/(1024);
    //std::cout << "Kernel throughput initializing a: "
    //          << (input_size_kb/a_time) << " KB/s \n"; // this one is larger and not accurate. Some device initialization is included here.
    //std::cout << "Kernel throughput initializing b: "
    //          << (input_size_kb/b_time) << " KB/s \n";
    std::cout << "Kernel throughput calculating c: "
              << (input_size_kb/c_time) << " KB/s \n";

  } catch (sycl::exception const &e) {
    cout << "An exception is caught while multiplying matrices.\n";
    terminate();
  }

  int result;
  cout << "Result of matrix multiplication using DPC++: ";
  result = VerifyResult(c_back);
  delete[] c_back;

  return result;
}

bool ValueSame(float a, float b) {
  return fabs(a - b) < numeric_limits<float>::epsilon();
}

int VerifyResult(float (*c_back)[N]) {
  // Check that the results are correct by comparing with host computing.
  int i, j, k;

  // 2D arrays on host side.
  float(*a_host)[N] = new float[N][N];
  float(*b_host)[N] = new float[N][N];
  float(*c_host)[N] = new float[N][N];

  // Each element of matrix a is 1.
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) a_host[i][j] = 1.0f;

  // Each column of b_host is the sequence 1,2,...,N
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) b_host[i][j] = i + 1.0f;

  // c_host is initialized to zero.
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) c_host[i][j] = 0.0f;

  for (i = 0; i < N; i++) {
    for (k = 0; k < N; k++) {
      // Each element of the product is just the sum 1+2+...+n
      for (j = 0; j < N; j++) {
        c_host[i][j] += a_host[i][k] * b_host[k][j];
      }
    }
  }

  bool mismatch_found = false;

  // Compare host side results with the result buffer from device side: print
  // mismatched data 5 times only.
  int print_count = 0;

  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
      if (!ValueSame(c_back[i][j], c_host[i][j])) {
        cout << "Fail - The result is incorrect for element: [" << i << ", "
             << j << "], expected: " << c_host[i][j]
             << ", but found: " << c_back[i][j] << "\n";
        mismatch_found = true;
        print_count++;
        if (print_count == 50) break;
      }
    }

    if (print_count == 50) break;
  }

  delete[] a_host;
  delete[] b_host;
  delete[] c_host;

  if (!mismatch_found) {
    cout << "Success - The results are correct!\n";
    return 0;
  } else {
    cout << "Fail - The results mismatch!\n";
    return -1;
  }
}

Question 1) Is local memory properly used for the FPGA? (using such accessors)

Question 2) Defining the attributes is something really important for performance (as we saw in OpenCL), and we want to be sure her are using them correctly. If we use intel::reqd_work_group_size emits unknown attribute, while sycl::reqd_work_grou_size says nothing. However, we don't know if it worked properly.

block_matrix_mul_dpcpp.cpp:111:9: warning: unknown attribute 'reqd_work_group_size' ignored [-Wunknown-attributes]
        intel::reqd_work_group_size(1,BL,BL),
        ^~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning generated.
Can't opendir /glob/development-tools/versions/oneapi/2022.1.2/oneapi/intelfpgadpcpp/2022.1.0/board/de10_agilex: Permission denied at /glob/development-tools/versions/oneapi/2022.1.2/oneapi/compiler/2022.0.2/linux/lib/oclfpga/share/lib/perl/acl/Common.pm line 717.
llvm-foreach:
dpcpp: error: fpga compiler command failed with exit code 13 (use -v to see invocation)

Thanks for your time.

BoonBengT_Altera · ‎04-04-2022

Hi @RN1,

Thank you for posting in Intel community forum and hope all is well.

For the first part of the example, my guess is that the node using (i.e. s001-n057) does not have the required hardware. You can check the nodes spec via pbsnodes to get more information on the node.

Would recommend to use the nodes with S10 oneapi, however current there are some error going on the nodes and it has been escalated to the required team.

On the second part, there are optimization guide to programming with oneAPI as well as existing match library as below which would be a good recommended way to start with which we are looking into your mention code:

- https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top/software-development-process/performance-tuning-cycle/optimize.html

- https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top/api-based-programming/intel-oneapi-math-kernel-library-onemkl.html

Hope that clarify.

Best Wishes

BB

RN1 · ‎04-06-2022

Hi @BoonBengT_Altera

Thanks. Regarding the node, we achieved to compile in the s10 oneapi as you suggested, but the performance is really low compared with OpenCL.

Thanks, but we have checked previously those links, and they say nothing regarding local memory optimizations. You have the code in the previous post, and you have here the reports that we have extracted, maybe you know what to do to increase the performance since it is still quite slower compared with the OpenCL version.

I attach the captures and reports as a file.

It is compiled like this:

dpcpp -fintelfpga -Xshardware -fsycl-link=early -Xsfp-relaxed=true -Xsno-interleaving=default -Xsno-interleaving=DDR -Xsno-accessor-aliasing block_matrix_mul_dpcpp.cpp

Thanks for your time.

BoonBengT_Altera · ‎04-13-2022

Hi @RN1,

Thanks for getting back that the compilation on the correct node in Devcloud.

As for the performance, we are looking into it as we speak and will get back to you.

Clarification if I may, above code are the dpcpp code, where are the openCL example which you are referring to? Just so that we are comparing the right things from our end.

Hope to hear from you soon.

Best Wishes

BB

BoonBengT_Altera · ‎04-20-2022

Hi @RN1,

Good day, just following up on the previous clarification.

By any chances did you managed to look into the it?

Best Wishes

BB

RN1 · ‎04-21-2022

Hello,

The OpenCL code of the kernel was sent along the report files.

We are waiting to receive your response regarding performance and local memory in oneAPI, how to optimize that simple code to achieve a similar performance than OpenCL. We didn't find any Intel/oneAPI code that shows how to properly exploit the local memory.

I attach here but you can find in the files we attached.

#include "../host/inc/matrixMult.h"

#ifndef SIMD_WORK_ITEMS
#define SIMD_WORK_ITEMS 16 // default value
#endif


__kernel 
__attribute((reqd_work_group_size(BLOCK_SIZE,BLOCK_SIZE,1)))
__attribute((num_simd_work_items(SIMD_WORK_ITEMS)))
void matrixMult( // Input and output matrices
                 __global float *restrict C,
                 __global float *A,
                 __global float *B, 
                 // Widths of matrices.
                 int A_width, int B_width)
{
    // Local storage for a block of input matrices A and B
    __local float A_local[BLOCK_SIZE][BLOCK_SIZE];
    __local float B_local[BLOCK_SIZE][BLOCK_SIZE];

    // Block index
    int block_x = get_group_id(0);
    int block_y = get_group_id(1);

    // Local ID index (offset within a block)
    int local_x = get_local_id(0);
    int local_y = get_local_id(1);

    // Compute loop bounds
    int a_start = A_width * BLOCK_SIZE * block_y;
    int a_end   = a_start + A_width - 1;
    int b_start = BLOCK_SIZE * block_x;

    float running_sum = 0.0f;

    // Compute the matrix multiplication result for this output element. Each
    // loop iteration processes one block of the matrix.
    for (int a = a_start, b = b_start; a <= a_end; a += BLOCK_SIZE, b += (BLOCK_SIZE * B_width))
    {
        // Load the matrices to local memory. Note that the (x, y) indices
        // are swapped for A_local and B_local. This affects the reads from
        // A_local and B_local below and result in more efficient hardware.
        //
        // This is actually an optimization that the compiler can perform,
        // but is shown here for illustration purposes.
        A_local[local_y][local_x] = A[a + A_width * local_y + local_x];
        B_local[local_x][local_y] = B[b + B_width * local_y + local_x];
	
        // Wait for the entire block to be loaded.
        barrier(CLK_LOCAL_MEM_FENCE);

        // Do the dot product accumulation within this block. Fully unroll the loop.
        // As a result of the swap of indices above, memory accesses to
        // A_local and B_local are very efficient because each loop iteration
        // accesses consecutive elements. This can be seen by unrolling the
        // loop and analyzing the regions that are loaded:
        //  A_local[local_y][0..BLOCK_SIZE-1] and
        //  B_local[local_x][0..BLOCK_SIZE-1]
        #pragma unroll
        for (int k = 0; k < BLOCK_SIZE; ++k)
        {
            running_sum += A_local[local_y][k] * B_local[local_x][k];
        }

        // Wait for the block to be fully consumed before loading the next
        // block.
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Store result in matrix C
    C[get_global_id(1) * get_global_size(0) + get_global_id(0)] = running_sum;
}


#ifndef MATRIXMULT_H
#define MATRIXMULT_H

// Block size. Affects the kernel, so if this value changes, the kernel
// needs to be recompiled.
#ifndef BLOCK_SIZE
#define BLOCK_SIZE 64 // default value
//#define BLOCK_SIZE 128 // default value
#endif

#endif

Kind regards

BoonBengT_Altera · ‎06-13-2022

Hi @RN1,

Thank you for the patients, on the mention of memory allocation would recommend to refer to the section below:

https://www.intel.com/content/www/us/en/develop/documentation/dev-guide-ipp-for-oneapi/top/programming-considerations/managing-memory-allocations.html

That would be the recommended guide to perform memory management.

Also would suggest to make sure of caching to improvement the performance.

https://www.intel.com/content/www/us/en/develop/documentation/dev-guide-ipp-for-oneapi/top/programming-considerations/cache-optimizations.html

Hope that clarify.

Best Wishes

BB

BoonBengT_Altera · ‎06-19-2022

Hi @RN1,

Good day, just checking in to see if there is any further doubts in regards to this matter.

Hope we have clarify your doubts.

Best Wishes

BB

BoonBengT_Altera · ‎06-24-2022

Hi @RN1,

Greetings, as we do not receive any further clarification on what is provided, we would assume challenge are overcome. For new queries, please feel free to open a new thread and we will be right with you. Pleasure having you here.

Best Wishes

BB