Different runtime behavior of shared memory and device memory?

AustinKnutsonTMobile · 12-02-2020

The following snippet demonstrates a difference between shared memory and device memory that I don't fully understand. A simple kernel uses the memory to build and iterate through a linked list, and the host then explicitly copies the memory back. When I use device memory (malloc_device), everything works as expected. When I switch to shared memory (malloc_shared), the output is 0,0,0,0 instead of the expected 1,2,3,4. My best guess is that the runtime controls when shared memory is implicitly copied between the device and the host. If that is the cause, is there a detailed description of how this works, and is there a way to get the code to work as expected (e.g. an additional synchronization call to force the copy)?

I'm building and running this in the DevCloud with an Arria 10 device.

#if defined(FPGA_EMULATOR)
INTEL::fpga_emulator_selector device_selector;
#else
INTEL::fpga_selector device_selector;
#endif
queue q(device_selector, dpc_common::exception_handler);

size_t num_items = 4;
size_t num_bytes = num_items * sizeof(Node);
// When I use malloc_device() here instead, everything works as expected
Node *linked_list = malloc_shared<Node>(num_items, q);

q.memset(linked_list, 0, num_bytes).wait();

auto linked_e = q.submit([&](handler &h) {
   h.single_task<LinkedKernel>([=]() {
      linked_list[0].data = 0;
      linked_list[1].data = 1;
      linked_list[2].data = 2;
      linked_list[3].data = 3;

      linked_list[3].next = &(linked_list[2]);
      linked_list[2].next = &(linked_list[1]);
      linked_list[1].next = &(linked_list[0]);
      linked_list[0].next = nullptr;

      Node *head = &(linked_list[3]);

      for (Node *next = head; next != nullptr; next = next->next)
         next->data += 1;
   });
});

linked_e.wait();

Node *host_list = (Node *) malloc(num_bytes);
memset(host_list, 0, num_bytes);

q.memcpy(host_list, linked_list, num_bytes).wait();

for (size_t i = 0; i < num_items; ++i)
   std::cout << host_list[i].data << std::endl;

// Expected Output:
// 1
// 2
// 3
// 4
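
For reference, here is a host-only sketch of the same linked-list logic, with no queue and no USM allocation, which produces the expected 1, 2, 3, 4. The Node layout is an assumption, since its definition is not shown in the snippet above, and run_on_host is a hypothetical helper name used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical Node layout; the snippet above does not show its definition.
struct Node {
   int data;
   Node *next;
};

// Host-only replica of the kernel body: fill the nodes, link them in
// reverse order, then walk from the head and increment each value.
std::vector<int> run_on_host(std::size_t num_items) {
   assert(num_items > 0);
   std::vector<Node> list(num_items);

   for (std::size_t i = 0; i < num_items; ++i) {
      list[i].data = static_cast<int>(i);
      // Node i points at node i - 1; node 0 terminates the list.
      list[i].next = (i == 0) ? nullptr : &list[i - 1];
   }

   Node *head = &list[num_items - 1];
   for (Node *next = head; next != nullptr; next = next->next)
      next->data += 1;

   // Return the values in index order, as the host loop above prints them.
   std::vector<int> result;
   for (const Node &n : list)
      result.push_back(n.data);
   return result;
}
```

Calling run_on_host(4) gives a baseline to compare against the malloc_device and malloc_shared runs, since it exercises only the list logic and none of the runtime's memory handling.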

EBERLAZARE_I_Intel · 01-24-2021

Hi,

Please refer to the Examples of Stall-free and Stallable Memory Systems in the Intel High Level Synthesis Compiler Pro Edition: Best Practices Guide:

https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/hls/ug-hls-best-practices.pdf#page=35