<P>Sorry for the delayed reply. There is still room for optimization. We're hoping to have at least one publication come out on this topic within a few weeks where we go into strategies for matrix-matrix and matrix-vector optimizations in more detail. Until then:</P>
<P>As you know, optimizing OpenCL is a combination of efficient operations and efficient data transfer. To reduce the number of operations, you could look for opportunities to unroll loops. There should also be a way to write back results as a vector type instead of one at a time.</P>
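<P>As a rough illustration of the unrolling idea, here is a minimal host-side sketch in plain C rather than OpenCL C. The function names <code>dot_unrolled4</code> and <code>gemv_unrolled</code> are hypothetical and are not taken from any Intel sample. The four independent accumulators play the role of the four lanes of a float4, and a scalar loop picks up whatever is left when the width is not a multiple of four.</P>

```c
#include <stddef.h>

/* Dot product of one matrix row with the vector, unrolled by four.
 * The four independent accumulators mirror the lanes of a float4.
 * Hypothetical helper, written for illustration only. */
float dot_unrolled4(const float *row, const float *v, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    /* Main loop: four multiply-adds per iteration. */
    for (; i + 4 <= n; i += 4) {
        s0 += row[i + 0] * v[i + 0];
        s1 += row[i + 1] * v[i + 1];
        s2 += row[i + 2] * v[i + 2];
        s3 += row[i + 3] * v[i + 3];
    }

    /* Combine the lanes, then handle the leftover elements one by one. */
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i)
        sum += row[i] * v[i];
    return sum;
}

/* w = M * v for a row-major matrix with `height` rows of `width` floats. */
void gemv_unrolled(const float *m, const float *v, float *w,
                   size_t width, size_t height)
{
    for (size_t y = 0; y < height; ++y)
        w[y] = dot_unrolled4(m + y * width, v, width);
}
```

<P>Whether this helps on a given device depends on what the compiler already does with the loop, so it is worth measuring before and after.</P>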
<P>Our tools can give excellent feedback to help with optimization. You may have already seen the introductory videos here: <A href="https://software.intel.com/en-us/intel-opencl">https://software.intel.com/en-us/intel-opencl</A></P>
<P>I'm hoping to have a more detailed response soon. </P>
Tue, 08 Nov 2016 03:33:04 GMT Jeffrey_M_Intel
<P>Hi all,</P>
<P>I am trying to optimize a matrix-vector multiplication kernel for an Intel CPU-GPU system. I know that gemv/BLAS-2 is memory bound, but I want to obtain the best performance possible. Here's the code for the kernel:</P>
<PRE class="brush:cpp;">__kernel void gemv(const __global float4* M,
                   const __global float4* V,
                   uint width, uint height,
                   __global float* W,
                   __local float* partialDotProduct)
{
    // Each work-group handles as many matrix rows as necessary
    for (uint y = get_group_id(0); y < height; y += get_num_groups(0)) {
        // Row pointer
        const __global float4* row = M + (y * width/4);
        // Each work-item accumulates as many products as necessary
        // into local variable "sum"
        float4 sum = (float4) (0.0f);
        for (uint x = get_local_id(0); x < width/4; x += get_local_size(0))
            sum = fma(row[x], V[x], sum);
        // Each partial dot product is stored in SLM
        partialDotProduct[get_local_id(0)] = dot(sum, (float4) 1.0f);
        // Perform parallel reduction to add each work-item's
        // partial dot product together
        for (uint stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
            // Synchronize to make sure each work-item is done updating SLM
            barrier(CLK_LOCAL_MEM_FENCE);
            // Only the first work-items in the work-group add elements together
            if (get_local_id(0) < stride) {
                // Add two elements from the "partialDotProduct" array
                // and store the result in partialDotProduct[index]
                partialDotProduct[get_local_id(0)] += partialDotProduct[get_local_id(0) + stride];
            }
        }
        // Write the result of the reduction to global memory
        if (get_local_id(0) == 0)
            W[y] = partialDotProduct[0];
    }
}
</PRE>
<P>On measuring performance (using the profiling queue), I found that the CPU is faster than the GPU by 10-20% for all data sizes (ranging from 512x512 to 8192x8192 matrices).</P>
<P>Is there room for any more optimization here? Or am I correct in assuming that the performance here is bounded by the memory accesses?</P>
<P>I am using the latest OpenCL runtimes on Intel 6300U/ HD 520 running Windows.</P>
Tue, 08 Nov 2016 03:33:04 GMT Jeffrey_M_Intel
<P>If you haven't already, you should use pinned host arrays and USE_HOST_PTR-style direct memory access, which can lower memory read/write latency to roughly 30% of what it was. I tested with an Intel HD 400: with 1M data, reading and computing took 10-15 ms. After I switched to map-and-compute on a host buffer aligned to 4096 bytes, it took 3-5 ms.</P>
<P>With a single CPU core, the same work takes 55 ms, which is about 10 times slower.</P>
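<P>For anyone who wants to try the aligned host buffer approach, here is a minimal sketch of the allocation side in plain C. The helper names <code>round_up</code> and <code>alloc_zero_copy_floats</code> are hypothetical. The 4096-byte alignment comes from the description above; rounding the size up to a multiple of 4096 is my assumption, chosen because C11 <code>aligned_alloc</code> expects the size to be a multiple of the alignment. The OpenCL calls appear only in comments, with <code>context</code> and <code>queue</code> standing in for objects your host code already has.</P>

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ZERO_COPY_ALIGNMENT 4096u

/* Round `bytes` up to a multiple of `alignment` (must be a power of two). */
static size_t round_up(size_t bytes, size_t alignment)
{
    return (bytes + alignment - 1) & ~(alignment - 1);
}

/* Allocate room for `count` floats on a 4096-byte boundary.
 * The padded size in bytes is stored in *bytes_out when it is not NULL.
 * Returns NULL on failure; release the memory with free().
 * Hypothetical helper, written for illustration only. */
float *alloc_zero_copy_floats(size_t count, size_t *bytes_out)
{
    size_t bytes = round_up(count * sizeof(float), ZERO_COPY_ALIGNMENT);
    if (bytes == 0)
        bytes = ZERO_COPY_ALIGNMENT;

    float *host = aligned_alloc(ZERO_COPY_ALIGNMENT, bytes);
    if (host != NULL && bytes_out != NULL)
        *bytes_out = bytes;
    return host;
}

/* Sketch of how the pointer would then be used (not compiled here):
 *
 *   cl_mem buf = clCreateBuffer(context,
 *                               CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
 *                               bytes, host, &err);
 *   ...
 *   float *mapped = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
 *                                      0, bytes, 0, NULL, NULL, &err);
 *   ...
 *   clEnqueueUnmapMemObject(queue, buf, mapped, 0, NULL, NULL);
 */
```

<P>Whether the runtime actually avoids a copy depends on the device and driver, so treat the timings above as one data point and measure on your own system.</P>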
<P> </P>
Memory access dominates here because the data reuse ratio is too low, and an 8192x8192 matrix also does not fit in cache. Tue, 08 Nov 2016 09:06:00 GMT Huseyin_Tugrul_B_
<P>Do you get the same result with this as without vectorization?</P>
<P>I have a vector of size 3150 and matrix 3150 * 3150. I am trying to use float2, but the result I get from the multiplication is different from the scalar one. </P>
<P>Also, do you recommend using float4 over float2?</P>
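<P>One possible source of a mismatch, which may or may not be what is happening here, is that a vector version adds the products in a different order than a scalar loop, and floating-point addition is not associative. The plain C sketch below uses hypothetical helpers (<code>dot_scalar</code>, <code>dot_two_lane</code>, <code>nearly_equal</code>) to mimic a float2 accumulator on the host and to compare results with a relative tolerance instead of exact equality.</P>

```c
#include <stddef.h>

/* Scalar reference: accumulate the products strictly left to right. */
float dot_scalar(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* Mimics a float2 accumulator: even indices go to lane 0, odd indices
 * to lane 1, and the lanes are added together at the end.
 * Assumes n is even, as it would be for a float2 view of the data. */
float dot_two_lane(const float *a, const float *b, size_t n)
{
    float lane0 = 0.0f, lane1 = 0.0f;
    for (size_t i = 0; i + 1 < n; i += 2) {
        lane0 += a[i] * b[i];
        lane1 += a[i + 1] * b[i + 1];
    }
    return lane0 + lane1;
}

/* Returns 1 when x and y agree to within rel_tol relative error. */
int nearly_equal(float x, float y, float rel_tol)
{
    float diff = x > y ? x - y : y - x;
    float ax = x < 0.0f ? -x : x;
    float ay = y < 0.0f ? -y : y;
    float scale = ax > ay ? ax : ay;
    return diff <= rel_tol * scale;
}
```

<P>With the deliberately extreme input <code>{1e8f, 1.0f, -1e8f, 1.0f}</code> multiplied by a vector of ones, the two functions return different values even though both are valid float sums. For realistic data the gap is usually far smaller, so small differences between the scalar and vector kernels are expected, while large ones point to an indexing problem.</P>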
Thanks, Tue, 17 Jan 2017 03:48:16 GMT Namrata_C_Intel
<P>We recently published some best known methods for matrix multiply on Gen graphics here, which may help with matrix-vector multiply.</P>
<P><A href="https://software.intel.com/en-us/articles/sgemm-ocl-opt">Optimizing Matrix Multiply for Intel® Processor Graphics Architecture Gen9</A></P>
This includes float4 and even 16 bit (half) floats. Tue, 17 Jan 2017 04:01:16 GMT Jeffrey_M_Intel1
<P>Thank you very much for sharing this Jeffrey. </P>
<P><SPAN style="font-size: 1em;">One question though. If my matrix dimension is not a power of 2 (3150 in my case), is it still recommended to use these techniques?</SPAN></P>
<P><SPAN style="font-size: 1em;">If not, what can I do to get the best results?</SPAN></P>
<P><SPAN style="font-size: 1em;">I have tried different kernels, but the gemv with vectorization gives the best latency. The only problem is that I get different results from the scalar version and the vector version.</SPAN></P>
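<P>For a float4 kernel like the one earlier in this thread, a width of 3150 is not a multiple of 4: <code>width/4</code> truncates to 787, so each row would cover only 3148 elements, and <code>y * width/4</code> no longer lands on the start of every row. With float2 the width divides evenly (3150 / 2 = 1575), so this particular issue would not apply there. One common workaround is to pad each row with zeros up to the next multiple of the vector width. The sketch below is plain C with hypothetical helper names (<code>padded_width</code>, <code>pad_rows</code>), and it assumes the vector is padded with zeros in the same way.</P>

```c
#include <stddef.h>

/* Smallest multiple of `lanes` that is >= `width`. */
size_t padded_width(size_t width, size_t lanes)
{
    return (width + lanes - 1) / lanes * lanes;
}

/* Copy a row-major `height` x `width` matrix into `dst`, whose rows are
 * padded_width(width, lanes) floats long. The extra elements at the end
 * of each row are set to zero so they add nothing to the dot product.
 * `dst` must hold height * padded_width(width, lanes) floats.
 * Hypothetical helper, written for illustration only. */
void pad_rows(const float *src, float *dst,
              size_t width, size_t height, size_t lanes)
{
    size_t pw = padded_width(width, lanes);
    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width; ++x)
            dst[y * pw + x] = src[y * width + x];
        for (size_t x = width; x < pw; ++x)
            dst[y * pw + x] = 0.0f;
    }
}
```

<P>With the padded buffers, the kernel would be launched with the padded width (3152 for float4) while <code>height</code> stays at 3150. The cost is one extra copy on the host, so it pays off mainly when the same matrix is reused across many multiplications.</P>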
Thanks, Tue, 17 Jan 2017 08:58:51 GMT Namrata_C_Intel