
Does get_global_id(0) cause much latency?

Altera_Forum
Honored Contributor II

Hi 

I'm trying to compare the performance of an FPGA and a GPU on an image-processing algorithm. I've found there's a 10 ms difference between using get_global_id(0) and not using it. The kernel in question performs point-wise multiplication; it takes 0.5 ms on the GPU and even less on the FPGA when I launch multiple kernel copies under different names and queues.

I experimented several times and am sure that the line "int id = get_global_id(0)" (without even using this id) was causing all the difference.

It's possible to launch parallel kernels manually when it's only a few copies like mine, but when a large number of copies is needed, I don't see that being efficient. SIMD doesn't help, as it also needs to get the id.
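For reference, this is roughly how I launch the copies manually right now (kernel names and error handling here are just placeholders):

    cl_command_queue q0 = clCreateCommandQueue(context, device, 0, &err);
    cl_command_queue q1 = clCreateCommandQueue(context, device, 0, &err);

    /* Same kernel body compiled twice under different names. */
    cl_kernel k0 = clCreateKernel(program, "pointWiseMul_copy0", &err);
    cl_kernel k1 = clCreateKernel(program, "pointWiseMul_copy1", &err);

    /* ... clSetKernelArg(...) for each copy, each pointing at its half of the data ... */

    clEnqueueTask(q0, k0, 0, NULL, NULL); /* one single work-item launch per copy */
    clEnqueueTask(q1, k1, 0, NULL, NULL);
    clFinish(q0);
    clFinish(q1);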

Is there any way around this? Thanks.

 

---- 

The examples from Altera, like the FFT design, use get_global_id(0), and the processing time is about 1.5 ms...

I did use the reqd_work_group_size attribute in my kernel.

I'm confused...
10 Replies
Altera_Forum
Honored Contributor II

Are you launching this as a single work-item kernel or as an NDRange kernel? Since you're using get_global_id (or trying to), I presume NDRange. Maybe the design would work better as a single work-item kernel. Some code (from the host and the kernel) might help to explain.
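For reference, here's the difference in miniature (a simplified sketch, not your actual kernel):

    /* Single work-item style: one "thread"; the compiler pipelines the loop. */
    __kernel void scale_swi(__global float* restrict out,
                            __global const float* restrict in,
                            int n, float s)
    {
        for (int i = 0; i < n; i++)
            out[i] = s * in[i];
    }

    /* NDRange style: one work-item per element, no loop; the host supplies
       the global size. */
    __kernel void scale_ndr(__global float* restrict out,
                            __global const float* restrict in,
                            float s)
    {
        int i = get_global_id(0);
        out[i] = s * in[i];
    }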

Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

Are you launching this as a single work-item kernel or as an NDRange kernel? Since you're using get_global_id (or trying to), I presume NDRange. Maybe the design would work better as a single work-item kernel. Some code (from the host and the kernel) might help to explain.

--- Quote End ---  

 

 

NDRange. As soon as I add the get_global_id, it becomes much slower...

It did get the correct global_id when I actually used it.

__attribute__((num_compute_units(2)))
__attribute__((reqd_work_group_size(2, 1, 1)))
__kernel void pointWiseMul(__global float2* restrict d_afCorr,
                           __global float2* restrict d_afPadScn,
                           __global float2* restrict d_afPadTpl,
                           int dataN, float fScale)
{
    int begin = get_global_id(0); // comment out this line and the speed changes dramatically
    for (int iIndx = 0; iIndx < dataN; iIndx++) {
        float2 cDat = d_afPadScn[iIndx];
        float2 cKer = d_afPadTpl[iIndx];
        // take the conjugate of the kernel
        cKer.y = -cKer.y;
        float2 cMul = { cDat.x * cKer.x - cDat.y * cKer.y,
                        cDat.y * cKer.x + cDat.x * cKer.y };
        cMul.x = fScale * cMul.x;
        cMul.y = fScale * cMul.y;
        d_afCorr[iIndx] = cMul;
    }
}
Altera_Forum
Honored Contributor II

I'm confused. "begin" as the work-item number is not used anywhere in the code. Is that a typo? Should it be: 

 

int iIndx = get_global_id(0); 

 

replacing the for loop?
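That is, something like this (a sketch; dataN would then come from the global size set on the host rather than a kernel argument):

    __kernel void pointWiseMul(__global float2* restrict d_afCorr,
                               __global float2* restrict d_afPadScn,
                               __global float2* restrict d_afPadTpl,
                               float fScale)
    {
        int iIndx = get_global_id(0); /* one work-item per element */
        float2 cDat = d_afPadScn[iIndx];
        float2 cKer = d_afPadTpl[iIndx];
        cKer.y = -cKer.y; /* conjugate of the kernel */
        float2 cMul = { cDat.x * cKer.x - cDat.y * cKer.y,
                        cDat.y * cKer.x + cDat.x * cKer.y };
        d_afCorr[iIndx] = (float2)(fScale * cMul.x, fScale * cMul.y);
    }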
Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

I'm confused. "begin" as the work-item number is not used anywhere in the code. Is that a typo? Should it be: 

 

int iIndx = get_global_id(0); 

 

replacing the for loop? 

--- Quote End ---  

 

 

Yes, originally it was like that. I was trying to figure out what was causing all the trouble and changed it back to a single for loop; that's why it looks like a typo.

 

 

I experimented more (each modification takes an hour to compile...) and got rid of get_global_id(0), and it still has that latency...

Let me run more tests and figure out what the hell is going on...
Altera_Forum
Honored Contributor II

How are you launching the kernel on the host? What does your clEnqueueNDRangeKernel command look like?
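For reference, a typical launch would look something like this (sizes here are placeholders; the local size must match your reqd_work_group_size):

    size_t global_size = dataN; /* total number of work-items */
    size_t local_size  = 2;     /* must match reqd_work_group_size(2, 1, 1) */

    cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                        1,    /* work_dim */
                                        NULL, /* global offset */
                                        &global_size, &local_size,
                                        0, NULL, NULL);
    clFinish(queue); /* block until the kernel actually finishes */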

Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

NDRange. As soon as I add the get_global_id, it becomes much slower...

It did get the correct global_id when I actually use it. 

__attribute__((num_compute_units(2)))
__attribute__((reqd_work_group_size(2, 1, 1)))
__kernel void pointWiseMul(__global float2* restrict d_afCorr,
                           __global float2* restrict d_afPadScn,
                           __global float2* restrict d_afPadTpl,
                           int dataN, float fScale)
{
    int begin = get_global_id(0); // comment out this line and the speed changes dramatically
    for (int iIndx = 0; iIndx < dataN; iIndx++) {
        float2 cDat = d_afPadScn[iIndx];
        float2 cKer = d_afPadTpl[iIndx];
        // take the conjugate of the kernel
        cKer.y = -cKer.y;
        float2 cMul = { cDat.x * cKer.x - cDat.y * cKer.y,
                        cDat.y * cKer.x + cDat.x * cKer.y };
        cMul.x = fScale * cMul.x;
        cMul.y = fScale * cMul.y;
        d_afCorr[iIndx] = cMul;
    }
}

--- Quote End ---  

 

 

In this specific code, whether you comment out the get_global_id(0) call or not, the generated circuit will be EXACTLY the same, since the "begin" variable is never used and the compiler optimizes it out. I tested and confirmed this in my own environment. If you see a performance difference between commenting that line out and leaving it in, the cause is probably in the host code or your timing function, not the kernel code. Variations of up to 10 ms are normal on GPUs and even FPGAs; you should not draw conclusions from run times under 10 ms. In fact, the kernel launch overhead alone is a few milliseconds.

 

However, if you remove the loop and use the global ID in place of the loop iterator, everything changes. That converts the kernel from single work-item to NDRange, and the two are implemented in completely different ways. The pipeline latency for the single work-item version of your code is reported as 192 cycles in the compilation report, while for the NDRange version it is 32 cycles. In the former case the pipeline is deeper, so the latency of external memory accesses can be hidden if there are enough inputs. In the latter case the pipeline is much shorter, and the runtime scheduler determines the best thread scheduling to hide the latency of memory accesses.

 

If you want to perform meaningful timing comparisons, use test cases that run for at least a few seconds.
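As a minimal sketch of what I mean (the iteration count is arbitrary; the clFinish calls before and after the loop keep queuing overhead out of the measurement; the timer uses the POSIX clock_gettime):

    #include <stdio.h>
    #include <time.h>

    static double now_ms(void) /* wall-clock time */
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
    }

    /* ... inside the host code ... */
    clFinish(queue); /* drain any pending work first */
    double t0 = now_ms();
    for (int i = 0; i < 1000; i++)
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, &local_size, 0, NULL, NULL);
    clFinish(queue); /* wait for all 1000 launches to complete */
    printf("average per launch: %.3f ms\n", (now_ms() - t0) / 1000.0);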
Altera_Forum
Honored Contributor II

Thanks for the replies 

 

I did more testing and found that if I didn't use reqd_work_group_size(2, 1, 1), including "begin = get_global_id(0)" (still not using it) made it run slower than the single pipeline;

but when I had the reqd_work_group_size attribute, including or commenting out "begin = get_global_id(0)" made no difference; both were slower than the single pipeline.

The difference is roughly 13 ms vs. 1.5 ms per frame.

I guess it's like you said, because of the pipeline length. Could you tell me where you saw the 32- and 192-cycle numbers? I didn't see them in report.html.

Thanks
Altera_Forum
Honored Contributor II

You can see the latency values by hovering your mouse over the blocks in the "System Viewer" tab of the report.

Altera_Forum
Honored Contributor II

 

--- Quote Start ---  

You can see the latency values by hovering your mouse over the blocks in the "System Viewer" tab of the report. 

--- Quote End ---  

 

Thanks a lot! 

 

I saw that the load in the kernel compiled for NDRange (containing get_global_id) is in burst-coalesced mode, while the single-pipeline one is in prefetching mode, and the latency of burst-coalesced mode is huge (159) compared to prefetching (2).

Is there any way to achieve what I was trying to do in the first place, creating multiple kernel copies using num_compute_units and get_global_id, without incurring the huge latency of the burst-coalesced mode?

Aside from launching them separately under different queues, of course :)
Altera_Forum
Honored Contributor II

There is really nothing wrong with the high latency of the burst-coalesced mode. In fact, it is very effective at hiding the high latency of external memory accesses and preventing stalls from propagating all the way through the pipeline. What you need to do is use large inputs. If you have 10,000 inputs going through a pipeline with a latency of 159 cycles, the memory-latency-hiding effect of the deeper pipeline will far outweigh its higher warm-up latency. Remember that even though the latency of the prefetching access is 2 cycles, the latency of an external memory access is always over 100 cycles, which means the prefetching access will stall most of the time while data is read from or written to external memory.

 

There is no way to control the type of memory port the compiler infers; it is chosen automatically based on what the compiler thinks is best for the kernel.

 

For NDRange kernels you can easily achieve replication using num_compute_units, as I mentioned before; you need many work-groups running in parallel to use it efficiently, though. For single work-item kernels you can decouple your memory accesses from the compute, put them in separate kernels, and define the compute kernel as "autorun". That lets you replicate it easily with num_compute_units (the same attribute used for NDRange kernels, but applied in a completely different manner) and customize the replicas using a static ID supplied by the compiler. You still need to create separate parallel queues for your memory read/write kernels in this case, but you do not need any queues for the compute kernel(s). Check Altera's documentation for more info on autorun kernels and how to create and replicate them.
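A rough outline of that autorun pattern (a sketch only; the channel pragma and intrinsics are cl_intel_channels / read_channel_intel / write_channel_intel in newer SDKs, cl_altera_channels / read_channel_altera / write_channel_altera in older ones):

    #pragma OPENCL EXTENSION cl_intel_channels : enable

    channel float2 c_in[2];
    channel float2 c_out[2];

    /* Host-launched reader for compute unit 0; a twin kernel feeding
       c_in[1] would run on its own queue. */
    __kernel void mem_read0(__global const float2* restrict src, int n)
    {
        for (int i = 0; i < n; i++)
            write_channel_intel(c_in[0], src[i]);
    }

    /* Replicated compute kernel: autorun, so no host queue is needed.
       The static compute ID selects this replica's channel pair. */
    __attribute__((autorun))
    __attribute__((max_global_work_dim(0)))
    __attribute__((num_compute_units(2)))
    __kernel void compute(void)
    {
        int id = get_compute_id(0);
        while (1) {
            float2 v = read_channel_intel(c_in[id]);
            /* ... point-wise math on v here ... */
            write_channel_intel(c_out[id], v);
        }
    }

    /* Host-launched writer for compute unit 0. */
    __kernel void mem_write0(__global float2* restrict dst, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = read_channel_intel(c_out[0]);
    }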