
## Optimizing a Matrix-Vector multiplication kernel

Hi all,

I am trying to optimize a matrix-vector multiplication kernel for an Intel CPU-GPU system. I know that gemv/BLAS-2 is memory-bound, but I want to obtain the best performance possible. Here's the code for the kernel:

```
__kernel void gemv(const __global float4* M,
                   const __global float4* V,
                   uint width, uint height,
                   __global float* W,
                   __local float* partialDotProduct)
{
    // Each work-group handles as many matrix rows as necessary
    for (uint y = get_group_id(0); y < height; y += get_num_groups(0)) {

        // Row pointer (assumes width is a multiple of 4)
        const __global float4* row = M + (y * width / 4);

        // Each work-item accumulates as many products as necessary
        // into local variable "sum"
        float4 sum = (float4) (0.0f);

        for (uint x = get_local_id(0); x < width / 4; x += get_local_size(0))
            sum = fma(row[x], V[x], sum);

        // Each partial dot product is stored in SLM
        partialDotProduct[get_local_id(0)] = dot(sum, (float4) 1.0f);

        // Perform parallel reduction to add each work-item's
        // partial dot product together
        // (assumes the work-group size is a power of 2)
        for (uint stride = get_local_size(0) / 2; stride > 0; stride /= 2) {

            // Synchronize to make sure each work-item is done updating SLM
            barrier(CLK_LOCAL_MEM_FENCE);

            // Only the first work-items in the work-group add elements together
            if (get_local_id(0) < stride) {

                // Add two elements from the "partialDotProduct" array
                // and store the result in partialDotProduct[get_local_id(0)]
                partialDotProduct[get_local_id(0)] += partialDotProduct[get_local_id(0) + stride];
            }
        }

        // Write the result of the reduction to global memory
        if (get_local_id(0) == 0)
            W[y] = partialDotProduct[0];

        // Make sure all reads of SLM are done before the next row overwrites it
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}
```

When I measured performance (using a profiling queue), I found that the CPU is 10-20% faster than the GPU for all data sizes, from 512x512 to 8192x8192 matrices.

Is there room for any more optimization here? Or am I correct in assuming that the performance here is bounded by the memory accesses?

I am using the latest OpenCL runtimes on an Intel 6300U / HD 520 running Windows.

Thanks
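One way to check whether the kernel really is limited by memory traffic is to turn a measured kernel time into an effective bandwidth and compare it with the peak memory bandwidth of the platform. The sketch below is plain host-side C and is not part of the original post. The function names and the minimum-traffic model (every matrix, vector, and output element crosses the memory bus exactly once) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Minimum bytes a single-precision GEMV has to move: the matrix once,
 * the input vector once, and the output vector once.
 * Assumption: the input vector is perfectly reused from cache. */
double gemv_min_bytes(size_t width, size_t height)
{
    return 4.0 * ((double)width * (double)height
                  + (double)width + (double)height);
}

/* Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a measured kernel time. */
double gemv_effective_gbs(size_t width, size_t height, double seconds)
{
    assert(seconds > 0.0);
    return gemv_min_bytes(width, height) / seconds / 1e9;
}

/* Arithmetic intensity: one multiply and one add per matrix element. */
double gemv_flops_per_byte(size_t width, size_t height)
{
    return 2.0 * (double)width * (double)height
           / gemv_min_bytes(width, height);
}
```

For an 8192x8192 matrix this model gives about 268.5 MB of traffic and just under 0.5 flop per byte, so a hypothetical kernel time of 10 ms would correspond to roughly 26.9 GB/s. If the figure computed from the profiling-queue timestamps is already close to what the memory system can deliver, further kernel tuning has little headroom.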

5 Replies
Employee

Sorry for the delayed reply.  There is still room for optimization.   We're hoping to have at least one publication come out on this topic within a few weeks where we go into strategies for matrix-matrix and matrix-vector optimizations in more detail.  Until then:

As you know, optimizing OpenCL is a combination of efficient operations and efficient data transfer. To reduce the number of operations, you could look for opportunities to unroll loops. There should also be a way to write results back as a vector type instead of one at a time.

Our tools can give excellent feedback to help with optimization.  You may have already seen the introductory videos here: https://software.intel.com/en-us/intel-opencl

I'm hoping to have a more detailed response soon.

Novice

If you haven't done so already, you should use pinned host arrays and `CL_MEM_USE_HOST_PTR`-style direct memory access, which can lower memory read/write latency to roughly 30% of what it was. I tested this on an Intel HD 400: with 1M data, read-and-compute took 10-15 ms. After I converted it to map-and-compute on a 4096-byte-aligned host buffer, it took 3-5 ms.

On a single CPU core, the same work takes 55 ms, which is roughly 10 times slower.

That is because the data reuse ratio is too low, and an 8192x8192 matrix also does not fit in cache.
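The aligned host buffer described above could be allocated along these lines. This is a minimal host-side C sketch, not the poster's code: `ZERO_COPY_ALIGNMENT`, `padded_size`, `alloc_aligned_floats`, and `is_zero_copy_aligned` are hypothetical names, and the only figure taken from the post is the 4096 alignment. It relies on C11 `aligned_alloc`, which MSVC does not provide; `_aligned_malloc` is the usual substitute on Windows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ZERO_COPY_ALIGNMENT 4096u  /* alignment mentioned in the post */

/* Round a byte count up to the next multiple of the alignment, because
 * C11 aligned_alloc expects the size to be a multiple of the alignment. */
size_t padded_size(size_t bytes)
{
    return (bytes + ZERO_COPY_ALIGNMENT - 1)
           / ZERO_COPY_ALIGNMENT * ZERO_COPY_ALIGNMENT;
}

/* Returns nonzero if the pointer sits on a 4096-byte boundary. */
int is_zero_copy_aligned(const void *p)
{
    return ((uintptr_t)p % ZERO_COPY_ALIGNMENT) == 0;
}

/* Allocate an aligned host array of at least `count` floats.
 * The padded byte size is stored in *bytes_out when it is not NULL.
 * Returns NULL on failure; release the memory with free(). */
float *alloc_aligned_floats(size_t count, size_t *bytes_out)
{
    assert(count > 0);
    size_t bytes = padded_size(count * sizeof(float));
    float *p = aligned_alloc(ZERO_COPY_ALIGNMENT, bytes);
    if (p != NULL && bytes_out != NULL)
        *bytes_out = bytes;
    return p;
}
```

The returned pointer and padded size would then be handed to `clCreateBuffer` with `CL_MEM_USE_HOST_PTR`, and the host would access the data through `clEnqueueMapBuffer` and `clEnqueueUnmapMemObject` rather than explicit read/write calls. Whether the runtime actually avoids a copy depends on the device and driver.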

Employee

Do you get the same result with this as without vectorization?

I have a vector of size 3150 and a 3150 x 3150 matrix. I am trying to use float2, but the result I get from the multiplication differs from the scalar one.

Also, do you recommend using float4 over float2?

Thanks,

Namrata Choudhury
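Small differences between a scalar and a vectorized result are expected, because the vector version adds the products in a different order and floating-point addition is not associative. The host-side C sketch below emulates a float2-style accumulation on the CPU and compares it with a scalar reference using a relative tolerance. The function names and the tolerance handling are assumptions for illustration, not code from this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference: one accumulator, elements added in index order. */
float dot_scalar(const float *row, const float *v, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += row[i] * v[i];
    return sum;
}

/* Emulates float2 accumulation: even and odd elements go into separate
 * lanes that are combined only at the end, so the additions happen in a
 * different order than in dot_scalar. Requires an even n. */
float dot_two_lanes(const float *row, const float *v, size_t n)
{
    assert(n % 2 == 0);
    float lane0 = 0.0f, lane1 = 0.0f;
    for (size_t i = 0; i < n; i += 2) {
        lane0 += row[i] * v[i];
        lane1 += row[i + 1] * v[i + 1];
    }
    return lane0 + lane1;
}

/* Relative comparison; rel_tol is a tolerance chosen by the caller.
 * The scale is clamped to 1 so values near zero compare absolutely. */
int nearly_equal(float a, float b, float rel_tol)
{
    float diff = a > b ? a - b : b - a;
    float abs_a = a < 0.0f ? -a : a;
    float abs_b = b < 0.0f ? -b : b;
    float scale = abs_a > abs_b ? abs_a : abs_b;
    if (scale < 1.0f)
        scale = 1.0f;
    return diff <= rel_tol * scale;
}
```

If the GPU and scalar results differ by much more than rounding error, an indexing problem is the more likely cause. For example, 3150 is divisible by 2 but not by 4, so a float4 version that uses `width/4` would truncate to 787 vectors per row and skip the last two columns.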
Employee

We recently published some best known methods for matrix multiply on Gen graphics here, which may help with matrix-vector multiply.

Optimizing Matrix Multiply for Intel® Processor Graphics Architecture Gen9

This includes float4 and even 16 bit (half) floats.

Employee

Thank you very much for sharing this Jeffrey.

One question, though: if my matrix dimension is not a power of 2 (3150 in my case), are these techniques still recommended?
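For the kernel at the top of this thread, the power-of-2 requirement comes from the reduction loop, so it applies to the work-group size rather than to the matrix dimension. The matrix width only has to be a multiple of the vector width. One common workaround, sketched here in host-side C, is to zero-pad each row (and the input vector) up to the next multiple, which leaves every dot product unchanged. The helper names `round_up` and `pad_rows` are hypothetical, and this is not taken from the linked article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Round n up to the next multiple of m (m > 0). */
size_t round_up(size_t n, size_t m)
{
    assert(m > 0);
    return (n + m - 1) / m * m;
}

/* Copy a height x width row-major matrix into a new buffer whose rows
 * are padded_width floats long, with the extra columns set to zero.
 * Returns NULL on allocation failure; release the memory with free(). */
float *pad_rows(const float *src, size_t width, size_t height,
                size_t padded_width)
{
    assert(padded_width >= width);
    /* calloc zero-fills, so the padding columns start out as 0.0f. */
    float *dst = calloc(height * padded_width, sizeof(float));
    if (dst == NULL)
        return NULL;
    for (size_t y = 0; y < height; ++y)
        memcpy(dst + y * padded_width, src + y * width,
               width * sizeof(float));
    return dst;
}
```

With a width of 3150 and float4 loads, `round_up(3150, 4)` gives 3152, so each row gains two zero columns and the kernel would be launched with `width` set to 3152. The cost is an extra host-side copy, which may or may not be acceptable depending on how often the matrix changes.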