
Optimizing a Matrix-Vector multiplication kernel


Hi all,

I am trying to optimize a matrix-vector multiplication kernel for an Intel CPU-GPU system. I know that gemv/BLAS-2 is memory bound, but I want to obtain the best performance possible. Here's the code for the kernel:

__kernel void gemv(const __global float4* M,
                   const __global float4* V,
                   uint width, uint height,
                   __global float* W,
                   __local float* partialDotProduct)
{
    // Each work-group handles as many matrix rows as necessary
    for (uint y = get_group_id(0); y < height; y += get_num_groups(0)) {

        // Row pointer (the row is traversed in float4 steps)
        const __global float4* row = M + (y * (width / 4));

        // Each work-item accumulates as many products as necessary
        // into local variable "sum"
        float4 sum = (float4) (0.0f);

        for (uint x = get_local_id(0); x < width / 4; x += get_local_size(0))
            sum = fma(row[x], V[x], sum);

        // Each partial dot product is stored in SLM
        partialDotProduct[get_local_id(0)] = dot(sum, (float4) (1.0f));

        // Perform parallel reduction to add each work-item's
        // partial dot product together
        for (uint stride = get_local_size(0) / 2; stride > 0; stride /= 2) {

            // Synchronize to make sure each work-item is done updating SLM
            barrier(CLK_LOCAL_MEM_FENCE);

            // Only the first work-items in the work-group add elements together
            if (get_local_id(0) < stride) {
                // Add two elements from the "partialDotProduct" array
                // and store the result in partialDotProduct[index]
                partialDotProduct[get_local_id(0)] += partialDotProduct[get_local_id(0) + stride];
            }
        }

        // Write the result of the reduction to global memory
        if (get_local_id(0) == 0)
            W[y] = partialDotProduct[0];

        // Synchronize before the next row overwrites SLM
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}

On measuring performance (using the profiling queue), I found that the CPU is faster than the GPU by 10-20% for all data sizes (matrices ranging from 512x512 to 8192x8192).
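For reference, the timing is collected along these lines (a condensed host-side sketch; context, queue, and buffer creation are omitted, and the handle names are illustrative). The queue is created with CL_QUEUE_PROFILING_ENABLE, and the __local argument is sized on the host:

#include <CL/cl.h>
#include <stdio.h>

void run_and_time_gemv(cl_command_queue queue, cl_kernel kernel,
                       cl_mem bufM, cl_mem bufV, cl_mem bufW,
                       cl_uint width, cl_uint height,
                       size_t localSize, size_t numGroups)
{
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufM);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufV);
    clSetKernelArg(kernel, 2, sizeof(cl_uint), &width);
    clSetKernelArg(kernel, 3, sizeof(cl_uint), &height);
    clSetKernelArg(kernel, 4, sizeof(cl_mem), &bufW);
    // __local buffer: one float per work-item, no host pointer
    clSetKernelArg(kernel, 5, localSize * sizeof(cl_float), NULL);

    size_t global = numGroups * localSize;
    cl_event evt;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &localSize,
                           0, NULL, &evt);
    clWaitForEvents(1, &evt);

    // Device-side timestamps from the profiling queue
    cl_ulong start, end;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    printf("gemv kernel time: %.3f ms\n", (end - start) * 1e-6);
    clReleaseEvent(evt);
}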

Is there room for any more optimization here? Or am I correct in assuming that the performance here is bounded by the memory accesses?

I am using the latest OpenCL runtimes on an Intel Core i5-6300U with HD Graphics 520, running Windows.


5 Replies

Sorry for the delayed reply. There is still room for optimization. We're hoping to have at least one publication on this topic out within a few weeks, where we go into strategies for matrix-matrix and matrix-vector optimization in more detail. Until then:

As you know, optimizing OpenCL is a combination of efficient operations and efficient data transfer. To reduce the number of operations you could consider opportunities for unrolling. There should also be a way to write back results as a vector type instead of one at a time; see the sketch below.
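To make that concrete, here is a minimal sketch (my illustration, not Intel's published code) that combines both ideas: each work-group processes four rows per iteration, so every load of V[x] is reused four times, and the four row results are written back with a single vstore4 instead of one float at a time. It assumes width is a multiple of 4 and height a multiple of 4:

__kernel void gemv4(const __global float4* M,
                    const __global float4* V,
                    uint width, uint height,
                    __global float* W,
                    __local float4* partial)
{
    const uint w4 = width / 4;

    // Each work-group handles four matrix rows per iteration
    for (uint y = get_group_id(0) * 4; y < height; y += get_num_groups(0) * 4) {

        const __global float4* row0 = M + ((y + 0) * w4);
        const __global float4* row1 = M + ((y + 1) * w4);
        const __global float4* row2 = M + ((y + 2) * w4);
        const __global float4* row3 = M + ((y + 3) * w4);

        // One float4 accumulator: component r holds the partial
        // dot product of row y+r, so each V[x] load is reused 4x
        float4 acc = (float4)(0.0f);
        for (uint x = get_local_id(0); x < w4; x += get_local_size(0)) {
            float4 v = V[x];
            acc.x += dot(row0[x], v);
            acc.y += dot(row1[x], v);
            acc.z += dot(row2[x], v);
            acc.w += dot(row3[x], v);
        }
        partial[get_local_id(0)] = acc;

        // Same tree reduction as before, but on float4 elements
        for (uint stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
            barrier(CLK_LOCAL_MEM_FENCE);
            if (get_local_id(0) < stride)
                partial[get_local_id(0)] += partial[get_local_id(0) + stride];
        }

        // Write four results back with one vector store
        if (get_local_id(0) == 0)
            vstore4(partial[0], 0, W + y);

        barrier(CLK_LOCAL_MEM_FENCE);
    }
}

The float4 accumulator is what lets both the reduction and the final store stay vectorized.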

Our tools can give excellent feedback to help with optimization. You may have already seen the introductory videos.

I'm hoping to have a more detailed response soon. 




If you haven't already, you should use pinned host arrays and CL_MEM_USE_HOST_PTR-style direct memory access to cut memory read/write latency to roughly 30% of the original. I tested with an Intel HD Graphics 400: reading and computing 1M elements of data took 10-15 ms; after I converted it to map-and-compute using a 4096-byte-aligned host buffer, it took 3-5 ms.

For comparison, with a single CPU core it takes 55 ms, which is about 10 times slower.
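A condensed sketch of that zero-copy setup (variable names are illustrative; _aligned_malloc is the Windows aligned allocator):

#include <CL/cl.h>
#include <malloc.h>

#define N (1024 * 1024)

// On Intel processor graphics, a buffer created with
// CL_MEM_USE_HOST_PTR over a page-aligned allocation can be
// mapped without an extra copy.
cl_mem make_zero_copy_buffer(cl_context ctx)
{
    // 4096-byte alignment lets the runtime share the allocation
    // with the device instead of staging a copy
    float* host = (float*)_aligned_malloc(N * sizeof(float), 4096);
    cl_int err;
    return clCreateBuffer(ctx,
                          CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                          N * sizeof(float), host, &err);
}

// After the kernel runs, read results with map/unmap instead of
// clEnqueueReadBuffer; no copy is needed for a zero-copy buffer
void read_results(cl_command_queue queue, cl_mem buf)
{
    cl_int err;
    float* p = (float*)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                                          0, N * sizeof(float),
                                          0, NULL, NULL, &err);
    /* ... use p ... */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
}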


Because the data reuse ratio is too low, and an 8192x8192 matrix also falls out of cache.
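To put rough numbers on that: gemv performs about 2·N² flops but touches each of the N² matrix elements exactly once, so with float data that is 2·N² flops over roughly 4·N² bytes of matrix traffic, i.e. about 0.5 flops/byte. And at N = 8192 the matrix alone is 8192 × 8192 × 4 B = 256 MB, far larger than any CPU or GPU cache, so the kernel runs at memory bandwidth no matter how the compute is organized.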


Do you get the same result with this as without vectorization?

I have a vector of size 3150 and a 3150x3150 matrix. I am trying to use float2, but the result I get from the multiplication differs from the scalar one.

Also, do you recommend using float4 over float2?



We recently published some best known methods for matrix multiply on Gen graphics here, which may help with matrix-vector multiply.

Optimizing Matrix Multiply for Intel® Processor Graphics Architecture Gen9

This includes float4 and even 16-bit (half) floats.


Thank you very much for sharing this, Jeffrey.

One question, though: if my matrix size is not a power of 2 (3150 in my case), are these techniques still recommended?

If not, what can I do to get the best results?

I have tried different kernels, and the vectorized gemv gives the best latency. The only problem is that I get different results from the scalar and vector versions.