Intel® High Level Design

Different runtime behavior of shared memory and device memory?


The following snippet of code demonstrates a difference between using shared memory and using device memory that I don't fully understand. A simple kernel uses the memory to build and iterate through a linked list, and the host then explicitly copies the memory back. When I use device memory, everything works as expected. When I change to shared memory, the output is 0,0,0,0 instead of the expected 1,2,3,4. My best guess is that the runtime controls when shared memory is implicitly copied between the device and the host. If this is the cause, is there a detailed description of how this works, and is there any way to get the code to work as expected (e.g. some additional synchronization call to force the copy)?

I'm building and running this in the devcloud with an Arria 10 device.

#if defined(FPGA_EMULATOR)
INTEL::fpga_emulator_selector device_selector;
#else
INTEL::fpga_selector device_selector;
#endif
queue q(device_selector, dpc_common::exception_handler);

size_t num_items = 4;
size_t num_bytes = num_items * sizeof(Node);
// When I used malloc_device() below, then everything works as expected
Node *linked_list = malloc_shared<Node>(num_items, q);

q.memset(linked_list, 0, num_bytes).wait();

auto linked_e = q.submit([&](handler &h) {
   h.single_task<LinkedKernel>([=]() {
      linked_list[0].data = 0;
      linked_list[1].data = 1;
      linked_list[2].data = 2;
      linked_list[3].data = 3;

      linked_list[3].next = &(linked_list[2]);
      linked_list[2].next = &(linked_list[1]);
      linked_list[1].next = &(linked_list[0]);
      linked_list[0].next = nullptr;

      Node *head = &(linked_list[3]);

      for (Node *next = head; next != nullptr; next = next->next)
         next->data += 1;
   });
});

// Wait for the kernel to finish before copying the results back
linked_e.wait();

Node *host_list = (Node *) malloc(num_bytes);
memset(host_list, 0, num_bytes);

q.memcpy(host_list, linked_list, num_bytes).wait();

for (size_t i = 0; i < num_items; ++i)
   std::cout << host_list[i].data << std::endl;

// Expected Output:
// 1
// 2
// 3
// 4
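
To check the list logic on its own, independent of any USM behavior, here is a minimal host-only sketch in plain C++ (no SYCL). The `Node` layout is an assumption, since the post does not show its definition, and `build_and_increment` and `run_reference` are hypothetical helper names introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed node layout; the original post does not show the definition.
struct Node {
    int data;
    Node *next;
};

// Mirrors the kernel body on the host: links the nodes in reverse order
// inside one contiguous buffer, then walks the list starting from the
// last element and increments every payload by one.
void build_and_increment(Node *list, std::size_t n) {
    if (n == 0) return;
    for (std::size_t i = 0; i < n; ++i) {
        list[i].data = static_cast<int>(i);
        list[i].next = (i == 0) ? nullptr : &list[i - 1];
    }
    for (Node *cur = &list[n - 1]; cur != nullptr; cur = cur->next)
        cur->data += 1;
}

// Hypothetical convenience wrapper: runs the logic on a zero-initialized
// buffer and returns the payloads in array order.
std::vector<int> run_reference(std::size_t n) {
    std::vector<Node> nodes(n, Node{0, nullptr});
    build_and_increment(nodes.data(), n);
    std::vector<int> out;
    for (const Node &node : nodes) out.push_back(node.data);
    return out;
}
```

Calling `run_reference(4)` yields 1, 2, 3, 4 in array order, matching the expected output. This only confirms that the traversal itself is sound. It says nothing about when the runtime migrates shared allocations between host and device.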


1 Reply


Please refer to the Examples of Stall-free and Stallable Memory Systems in the Intel High Level Synthesis Compiler Pro Edition: Best Practices Guide: