
Poor heterogeneous memory performance in Quartus 14.1

Altera_Forum
Honored Contributor II

EDIT: This happens in Quartus 15.1 and 16.0 as well. 

 

Hi All 

 

I have an issue with low performance when using multiple memory systems on the DE5Net board. With two memory systems, the non-default/non-primary memory system gets only 1/10 of the performance expected of a single memory system. Details below.

 

I'm using three custom Board Support Packages (BSPs) on the DE5Net: 

1. Uses the 2 DDRs (very similar to the vendor-provided board) 

2. Uses the 4 QDRs 

3. Uses both the 2 DDRs and the 4 QDRs available on the Terasic DE5Net board; the DDR subsystem is the "default"/primary memory system

 

If I compile vector add on either the custom 2-DDR board (1) or the custom 4-QDR board (2) with a single work group and no vectorization, I get 250 M IOPS on a kernel running at 250 MHz (1 IOP/cycle). So it stands to reason that the DDR and QDR memories are each being used correctly, and neither memory is bandwidth-limited on its own. 

 

However, if I compile on (3), I get interesting performance results. If I tell aoc to use the DDR system for the input and output vectors, I get 250 M IOPS. If I tell aoc to use the QDR system, I get 25 M IOPS. 

 

Oddly enough, if I switch which system is primary (i.e., QDR is primary), the results switch: telling aoc to use the DDR system for the input and output vectors gives 25 M IOPS, and the QDR system gets 250 M IOPS. In other words, whichever memory system is non-primary is the slow one. 
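For reference, the kernel is essentially the standard vector add, with the buffers steered to a memory system through the buffer_location attribute on the kernel arguments. A rough sketch (simplified, not my exact kernel; the "QDR" string is a placeholder and has to match a memory name defined in the BSP's board_spec.xml): 

// Illustrative only: a plain vector add with each global pointer pinned to the
// QDR memory system via buffer_location. Swap "QDR" for "DDR" (or whatever the
// board_spec.xml calls the memories) to target the other system.
__kernel void vector_add(__global __attribute__((buffer_location("QDR"))) const float * restrict x,
                         __global __attribute__((buffer_location("QDR"))) const float * restrict y,
                         __global __attribute__((buffer_location("QDR"))) float * restrict z)
{
    int i = get_global_id(0);
    z[i] = x[i] + y[i];
}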

 

As far as I can tell, all of the clocks have met timing, all clock crossing buffers are sufficiently deep, and the kernel clocks are running at ~250 MHz.  

 

So, I'm stumped. Does anyone have any ideas about what is causing this issue?  

 

Thanks
Altera_Forum
Honored Contributor II

Still having this issue. Does anyone have a clever idea?

Altera_Forum
Honored Contributor II

Have you tried using the CL_MEM_HETEROGENEOUS_ALTERA flag? This hint tells the host runtime to allocate the buffer in the heterogeneous (non-default) memory. What you may be seeing is the host copying the data out of default memory prior to kernel execution, with that overhead getting lumped into the kernel execution time.
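To rule that out, you can time the kernel with OpenCL event profiling rather than host timestamps. A minimal sketch, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE and kernel_event is the event returned by clEnqueueNDRangeKernel: 

cl_ulong start = 0, end = 0;
clWaitForEvents(1, &kernel_event);  /* make sure the kernel has finished */
clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
printf("Kernel execution time: %.3f ms\n", (end - start) * 1e-6);

This counts only the device-side execution time, so any host-side copy before the launch won't be included in the number.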

Altera_Forum
Honored Contributor II

Better: I'm now getting about 1/8 of the expected performance, but still not what it should be. 

 

I applied the flag to all three clCreateBuffer calls; all three buffers are allocated in the non-default OpenCL memory. 

 

I have a Signal Tap design that I will try when I get into my office.
Altera_Forum
Honored Contributor II

Signal Tap will help. The profiler may also tell you whether there's any change to the actual kernel execution time.

Altera_Forum
Honored Contributor II

You need to do two things to avoid having the buffer allocated in the "default" global memory. If you don't do both, you pay the penalty of the runtime reading the data from DDR and writing it to QDR prior to kernel execution (depending on when you take your timestamps, this can look like lower kernel performance; check the profiler to be sure). In short, you need to: 

 

1. Use the cl_mem flag CL_MEM_HETEROGENEOUS_ALTERA when you call clCreateBuffer 

2. Use clSetKernelArg to bind the buffer to the kernel argument that has the "buffer_location" attribute BEFORE doing any accesses to that buffer. 

 

For example, you need to do: 

 

mem = clCreateBuffer(context, flags | CL_MEM_HETEROGENEOUS_ALTERA, memSize, NULL, &errNum);

/* Bind the buffer to the argument carrying the buffer_location attribute
   BEFORE any access to it. */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &mem);

/* host_ptr is the host-side source buffer for the write. */
clEnqueueWriteBuffer(queue, mem, CL_FALSE, 0, N, host_ptr, 0, NULL, &write_event);

clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_work_size, NULL, 0, NULL, &kernel_event);

 

That should launch your kernel without any unnecessary copies.