Intel® High Level Design
Support for Intel® High Level Synthesis Compiler, DSP Builder, OneAPI for Intel® FPGAs, Intel® FPGA SDK for OpenCL™

Different runtime behavior of shared memory and device memory?

AustinKnutsonTMobile

The following code snippet demonstrates a difference between shared memory and device memory that I don't fully understand. A simple kernel uses the memory to build and iterate through a linked list, and the host then explicitly copies the memory back. With device memory (malloc_device), everything works as expected. With shared memory (malloc_shared), the output is 0,0,0,0 instead of the expected 1,2,3,4. My best guess is that the runtime controls when shared memory is implicitly copied between the device and the host. If that is the cause, is there a detailed description of how this works, and is there a way to make the code work as expected (e.g. an additional synchronization call that forces the copy)?

I'm building and running this on the DevCloud with an Arria 10 device.

#if defined(FPGA_EMULATOR)
INTEL::fpga_emulator_selector device_selector;
#else
INTEL::fpga_selector device_selector;
#endif
queue q(device_selector, dpc_common::exception_handler);

size_t num_items = 4;
size_t num_bytes = num_items * sizeof(Node);
// With malloc_device() here instead, everything works as expected
Node *linked_list = malloc_shared<Node>(num_items, q);

q.memset(linked_list, 0, num_bytes).wait();

auto linked_e = q.submit([&](handler &h) {
   h.single_task<LinkedKernel>([=]() {
      linked_list[0].data = 0;
      linked_list[1].data = 1;
      linked_list[2].data = 2;
      linked_list[3].data = 3;

      linked_list[3].next = &(linked_list[2]);
      linked_list[2].next = &(linked_list[1]);
      linked_list[1].next = &(linked_list[0]);
      linked_list[0].next = nullptr;

      Node *head = &(linked_list[3]);

      for (Node *next = head; next != nullptr; next = next->next)
         next->data += 1;
   });
});

linked_e.wait();

Node *host_list = (Node *) malloc(num_bytes);
memset(host_list, 0, num_bytes);

q.memcpy(host_list, linked_list, num_bytes).wait();

for (size_t i = 0; i < num_items; ++i)
   std::cout << host_list[i].data << std::endl;

// Expected Output:
// 1
// 2
// 3
// 4
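For comparison, the kernel body can be run as plain host C++ to confirm what the expected output should be, independent of any USM behavior. This is only a sketch: the `Node` layout is an assumption (the post does not show its definition), and `build_and_increment` is a hypothetical helper name, not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed layout: the original post does not show the Node definition.
struct Node {
  int data;
  Node *next;
};

// Host-only version of the kernel body: links the nodes 3 -> 2 -> 1 -> 0
// inside one contiguous array, then adds 1 to every node reachable from head.
std::vector<int> build_and_increment() {
  constexpr std::size_t num_items = 4;
  std::vector<Node> linked_list(num_items, Node{0, nullptr});

  for (std::size_t i = 0; i < num_items; ++i) {
    linked_list[i].data = static_cast<int>(i);
    linked_list[i].next = (i == 0) ? nullptr : &linked_list[i - 1];
  }

  Node *head = &linked_list[num_items - 1];
  std::size_t visited = 0;
  for (Node *node = head; node != nullptr; node = node->next) {
    node->data += 1;
    ++visited;
  }
  // Every node should have been reached exactly once.
  assert(visited == num_items);

  // Collect the data fields in array order, as the host-side print loop does.
  std::vector<int> result;
  for (const Node &n : linked_list) result.push_back(n.data);
  return result;
}
```

Printing each element of the returned vector in order gives 1, 2, 3, 4, which matches the output reported for the malloc_device() variant.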

 

EBERLAZARE_I_Intel

Hi,

Please refer to the Examples of Stall-free and Stallable Memory Systems in the Intel High Level Synthesis Compiler Pro Edition: Best Practices Guide:

https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/hb/hls/ug-hls-best-practice...

 
