Supradeep_A_

Beginner

10-30-2016
06:48 AM

154 Views

Optimizing a Matrix-Vector multiplication kernel

Hi all,

I am trying to optimize a matrix-vector multiplication kernel for an Intel CPU-GPU system. I know that gemv/BLAS-2 is memory bound, but I want to obtain the best performance possible. Here's the code for the kernel:

    __kernel void gemv(const __global float4* M,
                       const __global float4* V,
                       uint width, uint height,
                       __global float* W,
                       __local float* partialDotProduct)
    {
        // Each work-group handles as many matrix rows as necessary
        for (uint y = get_group_id(0); y < height; y += get_num_groups(0)) {

            // Row pointer
            const __global float4* row = M + (y * width/4);

            // Each work-item accumulates as many products as necessary
            // into local variable "sum"
            float4 sum = (float4)(0.0f);
            for (uint x = get_local_id(0); x < width/4; x += get_local_size(0))
                sum = fma(row[x], V[x], sum);

            // Each partial dot product is stored in SLM
            partialDotProduct[get_local_id(0)] = dot(sum, (float4)(1.0f));

            // Perform parallel reduction to add each work-item's
            // partial dot product together
            for (uint stride = get_local_size(0) / 2; stride > 0; stride /= 2) {

                // Synchronize to make sure each work-item is done updating SLM
                barrier(CLK_LOCAL_MEM_FENCE);

                // Only the first work-items in the work-group add elements together
                if (get_local_id(0) < stride) {

                    // Add two elements from the "partialDotProduct" array
                    // and store the result in partialDotProduct[index]
                    partialDotProduct[get_local_id(0)] += partialDotProduct[get_local_id(0) + stride];
                }
            }

            // Write the result of the reduction to global memory
            if (get_local_id(0) == 0)
                W[y] = partialDotProduct[0];
        }
    }
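To make the reduction step of the kernel easier to follow, here is a minimal host-side sketch in plain C that emulates the halving loop over the shared array. The names `tree_reduce` and `is_power_of_two` are made up for this illustration and are not part of the posted kernel. One thing the sketch makes visible: the halving loop only visits every element when the work-group size is a power of two. With a size of 6, for example, the strides are 3 and then 1, so the value accumulated at index 2 is never added.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: true when n is a power of two. */
static int is_power_of_two(size_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Emulates the kernel's tree reduction on the host.
 * partial[] stands in for partialDotProduct and local_size for
 * get_local_size(0). The array is modified in place. */
float tree_reduce(float *partial, size_t local_size)
{
    assert(is_power_of_two(local_size));
    for (size_t stride = local_size / 2; stride > 0; stride /= 2) {
        /* On the device, barrier(CLK_LOCAL_MEM_FENCE) sits here. */
        for (size_t id = 0; id < stride; ++id)
            partial[id] += partial[id + stride];
    }
    return partial[0];
}
```

Calling `tree_reduce` on eight partial sums performs three passes (strides 4, 2, 1) and leaves the total in element 0, which is what work-item 0 writes to `W` in the kernel.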

When I measured performance (using a profiling-enabled command queue), I found that the CPU is 10-20% faster than the GPU for all data sizes (matrices ranging from 512x512 to 8192x8192).

Is there room for any more optimization here? Or am I correct in assuming that the performance here is bounded by the memory accesses?

I am using the latest OpenCL runtimes on an Intel 6300U / HD 520 system running Windows.

Thanks

5 Replies

Jeffrey_M_Intel1

Employee

11-07-2016
07:33 PM

Sorry for the delayed reply. There is still room for optimization. We're hoping to have at least one publication come out on this topic within a few weeks where we go into strategies for matrix-matrix and matrix-vector optimizations in more detail. Until then:

As you know, optimizing OpenCL is a combination of efficient operations and efficient data transfer. To reduce the number of operations, you could look for opportunities to unroll loops. There should also be a way to write back results as a vector type instead of one at a time.
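One possible reading of the unrolling and vector write-back suggestion, sketched in plain C on the host rather than as a tuned kernel: process four rows per pass, so each element of the vector is loaded once and used four times, and four results become available together. The function name `gemv_rows4` and the requirement that `height` be a multiple of 4 are assumptions made for this illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side sketch: W = M * V for a row-major height x width matrix,
 * unrolled over four rows. Assumes height is a multiple of 4. */
void gemv_rows4(const float *M, const float *V, float *W,
                size_t width, size_t height)
{
    assert(height % 4 == 0);
    for (size_t y = 0; y < height; y += 4) {
        const float *r0 = M + (y + 0) * width;
        const float *r1 = M + (y + 1) * width;
        const float *r2 = M + (y + 2) * width;
        const float *r3 = M + (y + 3) * width;
        float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;

        for (size_t x = 0; x < width; ++x) {
            const float v = V[x]; /* loaded once, reused by four rows */
            s0 += r0[x] * v;
            s1 += r1[x] * v;
            s2 += r2[x] * v;
            s3 += r3[x] * v;
        }

        /* Four adjacent results; in an OpenCL C kernel these could be
         * written with a single float4 store (for example vstore4). */
        W[y + 0] = s0;
        W[y + 1] = s1;
        W[y + 2] = s2;
        W[y + 3] = s3;
    }
}
```

Whether this layout actually helps on a given GPU depends on how rows are assigned to work-items and on the resulting memory access pattern, so it should be measured rather than assumed.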

Our tools can give excellent feedback to help with optimization. You may have already seen the introductory videos here: https://software.intel.com/en-us/intel-opencl

I'm hoping to have a more detailed response soon.

Huseyin_Tugrul_B_

Novice

11-08-2016
01:06 AM

If you haven't done so already, you should use pinned host arrays with CL_MEM_USE_HOST_PTR so the device accesses host memory directly; this can lower memory read/write latency to about 30% of what it was. I tested on an Intel HD 400: reading and computing 1M data took 10-15 ms, and after I switched to map-and-compute on a 4096-byte-aligned host buffer it took 3-5 ms.
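As a rough illustration of the host-side allocation step only, the following C sketch allocates a zeroed buffer aligned to 4096 bytes, with its size rounded up to a multiple of the alignment. The helper names are invented for this example. The OpenCL calls themselves are left out because they need a runtime: the returned pointer would be handed to `clCreateBuffer` with the `CL_MEM_USE_HOST_PTR` flag and then accessed through `clEnqueueMapBuffer` and `clEnqueueUnmapMemObject` instead of explicit read/write copies.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_ALIGN 4096u

/* Hypothetical helper: rounds n up to the next multiple of PAGE_ALIGN. */
static size_t round_up_to_page(size_t n)
{
    return (n + PAGE_ALIGN - 1) / PAGE_ALIGN * PAGE_ALIGN;
}

/* Allocates zeroed host memory for `count` floats, aligned to 4096 bytes.
 * Returns NULL on failure. The caller releases it with free(). */
float *alloc_page_aligned_floats(size_t count)
{
    assert(count > 0);
    size_t bytes = round_up_to_page(count * sizeof(float));
    /* C11 aligned_alloc requires the size to be a multiple of the alignment. */
    float *p = aligned_alloc(PAGE_ALIGN, bytes);
    if (p != NULL)
        memset(p, 0, bytes);
    return p;
}
```

Note that `aligned_alloc` is C11; the Microsoft C runtime does not provide it, so on Windows `_aligned_malloc` and `_aligned_free` would be the usual substitutes.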

On a single CPU core it takes 55 ms, which is roughly 10 times slower.

That is because the data reuse ratio is too low, and an 8192x8192 matrix also does not fit in cache.
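The point about low data reuse can be put into numbers with a small back-of-the-envelope calculation, sketched here in C. It counts one multiply and one add per matrix element and only the matrix traffic (the vector and the result are ignored, and `float` is assumed to be 4 bytes). The function names are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes occupied by an n x n single-precision matrix. */
size_t matrix_bytes(size_t n)
{
    return n * n * sizeof(float);
}

/* Floating-point operations per byte of matrix data for y = M * x:
 * each matrix element is read once and used for one multiply and one add. */
double gemv_flops_per_byte(size_t n)
{
    assert(n > 0);
    double flops = 2.0 * (double)n * (double)n;
    double bytes = (double)matrix_bytes(n);
    return flops / bytes;
}
```

For n = 8192 the matrix alone occupies 268,435,456 bytes (256 MiB) and the ratio is 0.5 flops per byte, independent of n. Every matrix element is touched exactly once, so the kernel's run time is governed mainly by how fast memory can deliver the matrix.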

Namrata_C_Intel

Employee

01-16-2017
07:48 PM

Do you get the same result with this as without vectorization?

I have a vector of size 3150 and a 3150 x 3150 matrix. I am trying to use float2, but the result I get from the multiplication is different from the scalar one.
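One possible source of small differences between a scalar and a vectorized version, offered as a sketch and not as a diagnosis of this particular case: floating-point addition is not associative, so accumulating into several lanes and combining them at the end can round differently from a single running sum. The C code below mimics a four-lane accumulation on the host and compares the two results with a relative tolerance instead of exact equality. All function names are invented for this example, and the lane version assumes `n` is a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>

/* Plain left-to-right dot product. */
float dot_scalar(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}

/* Four independent accumulators, combined at the end, similar in spirit
 * to accumulating a float4 and then summing its components. */
float dot_lanes4(const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (size_t i = 0; i < n; i += 4)
        for (size_t k = 0; k < 4; ++k)
            lane[k] += a[i + k] * b[i + k];
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}

/* Relative comparison: |x - y| <= rel_tol * max(|x|, |y|). */
int nearly_equal(float x, float y, float rel_tol)
{
    float diff = x - y;
    if (diff < 0.0f) diff = -diff;
    float ax = x < 0.0f ? -x : x;
    float ay = y < 0.0f ? -y : y;
    float scale = ax > ay ? ax : ay;
    return diff <= rel_tol * scale;
}
```

If the two versions agree within a tolerance like this, the difference is most likely rounding. If they differ by much more, an indexing or size problem is the more plausible explanation.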

Also, do you recommend using float4 over float2?

Thanks,

Jeffrey_M_Intel1

Employee

01-16-2017
08:01 PM

We recently published some best known methods for matrix multiply on Gen graphics, which may help with matrix-vector multiply:

Optimizing Matrix Multiply for Intel® Processor Graphics Architecture Gen9

This includes float4 and even 16 bit (half) floats.

Namrata_C_Intel

Employee

01-17-2017
12:58 AM

Thank you very much for sharing this Jeffrey.

One question, though. If my matrix size is not a power of 2 (3150 in my case), is it still recommended to use these techniques?

If not, what can I do to get the best results?
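A common workaround for sizes that are not a multiple of the vector width, sketched here in C under the assumption that a float4 kernel like the one in the first post is used: pad each row with zeros up to the next multiple of 4. With a width of 3150, the integer division `width/4` yields 787, so the last two elements of every row would be skipped, and `y * width/4` no longer lands on the true start of odd-numbered rows (3150 is divisible by 2 but not by 4). Zero padding leaves the dot products unchanged. The function names below are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Rounds width up to the next multiple of 4. */
size_t padded_width(size_t width)
{
    return (width + 3) / 4 * 4;
}

/* Copies a row-major height x width matrix into a new buffer whose row
 * stride is padded_width(width), with zeros in the padding.
 * Returns NULL on allocation failure. The caller frees the result. */
float *pad_matrix_rows(const float *src, size_t width, size_t height)
{
    assert(width > 0 && height > 0);
    size_t pw = padded_width(width);
    float *dst = calloc(pw * height, sizeof(float));
    if (dst == NULL)
        return NULL;
    for (size_t y = 0; y < height; ++y)
        memcpy(dst + y * pw, src + y * width, width * sizeof(float));
    return dst;
}
```

The input vector needs the same treatment (it can be padded by calling `pad_matrix_rows` with a height of 1), and the kernel would then be given the padded width, 3152 in this case, in place of the original one.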

I have tried different kernels, but the gemv with vectorization gives the best latency. The only problem is that I get different results from the scalar and vector versions.

Thanks,
