Hi all,
I read the paper "Best-Effort FPGA Programming: A Few Steps Can Go a Long Way". The authors use HLS on Xilinx devices as their example. Besides the usual optimizations, I found the double-buffering scheme interesting (it actually rotates three buffers so that load, compute, and store can overlap):
void aes(...) { ... }
void load(...) { ... }
void store(...) { ... }
void compute(...) { ... }
void kernel(char *data, int size) {
  char buf_data[3][PE_NUM][PE_BATCH];
#pragma HLS array_partition variable=buf_data complete dim=1
#pragma HLS array_partition variable=buf_data cyclic factor=PE_NUM dim=2
  // Rotate the three buffers so that the load of batch i, the compute of
  // batch i-1, and the store of batch i-2 overlap in every iteration
  // (global-memory offsets and the prologue/epilogue are simplified here).
  for (int i = 0; i < size / BATCH_SIZE; i++) {
    switch (i % 3) {
      case 0:
        load(buf_data[0], data + i * BATCH_SIZE);
        compute(buf_data[2]);
        store(data + i * BATCH_SIZE, buf_data[1]);
        break;
      case 1:
        load(buf_data[1], data + i * BATCH_SIZE);
        compute(buf_data[0]);
        store(data + i * BATCH_SIZE, buf_data[2]);
        break;
      case 2:
        load(buf_data[2], data + i * BATCH_SIZE);
        compute(buf_data[1]);
        store(data + i * BATCH_SIZE, buf_data[0]);
        break;
    }
  }
}
Can the Intel compiler successfully infer this pipeline? I tried it with compiler version 16, but the throughput improvement seems very limited.
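For reference, this is roughly the single-work-item OpenCL version I have in mind for the Intel side (just a sketch: aes_kernel, the PE_NUM/PE_BATCH sizes, and the aes() body are placeholders I made up, and the prologue/epilogue guards are kept minimal):

#define PE_NUM     4                    // made-up sizes for this sketch
#define PE_BATCH   64
#define BATCH_SIZE (PE_NUM * PE_BATCH)

// Stand-in for the real AES rounds
void aes(char block[PE_BATCH]) {
  #pragma unroll
  for (int j = 0; j < PE_BATCH; j++)
    block[j] ^= 0x5A;
}

__kernel void aes_kernel(__global char *restrict data, int size) {
  // Three on-chip buffers rotated between load, compute, and store
  char buf_data[3][PE_NUM][PE_BATCH];
  int n = size / BATCH_SIZE;

  // Two extra iterations drain the last two batches
  for (int i = 0; i < n + 2; i++) {
    int ld = i % 3;        // buffer being filled with batch i
    int cp = (i + 2) % 3;  // buffer holding batch i-1, computed now
    int st = (i + 1) % 3;  // buffer holding the result of batch i-2

    if (i < n)                           // load batch i
      for (int p = 0; p < PE_NUM; p++)
        for (int j = 0; j < PE_BATCH; j++)
          buf_data[ld][p][j] = data[i * BATCH_SIZE + p * PE_BATCH + j];

    if (i >= 1 && i <= n) {              // compute batch i-1 on PE_NUM engines
      #pragma unroll
      for (int p = 0; p < PE_NUM; p++)
        aes(buf_data[cp][p]);
    }

    if (i >= 2)                          // store batch i-2
      for (int p = 0; p < PE_NUM; p++)
        for (int j = 0; j < PE_BATCH; j++)
          data[(i - 2) * BATCH_SIZE + p * PE_BATCH + j] = buf_data[st][p][j];
  }
}

The intent is the same as in the paper: the batch-i load, the batch-(i-1) compute, and the batch-(i-2) store touch different buffers, so in principle they could all overlap within one pipelined loop iteration when the kernel is run as a single work item.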
- Tags:
- Pragma