Optimizing Matrix Multiply for Intel® Processor Graphics Architecture Gen9

Scout · 06-30-2022

The article "Optimizing Matrix Multiply for Intel® Processor Graphics Architecture Gen9" was published by Jeffrey McAllister on 12/23/2016.

Link: https://www.intel.com/content/www/us/en/developer/articles/technical/sgemm-ocl-opt.html

The link to the sample code at the end of this article no longer works. Is it possible to locate the sample code for this article?

Thanks a lot!

HemanthCH_Intel · 07-01-2022

Hi,

Thanks for posting in Intel Communities.

We are working on your issue internally and will get back to you soon.

Thanks & Regards,

Hemanth

Jinchuan_Tang · 09-18-2022

Hi,

I'm not sure whether this will help, but here is a repository:

GitHub - ek9852/intel-gemm: General matrix-matrix multiplication in OpenCL from Intel

I'm not sure whether it fully applies to Gen9, since the repository has been around for a long time.

In the meantime, if you want to learn how to optimize matrix multiplication in general, I can point you to some useful materials.

Best wishes,

Jinchuan

Scout · 11-07-2022

Hello Jinchuan,

Thanks for your reply. The reason I'm looking for this old example is that it uses Intel's "subgroup shuffle" feature to share data between the work-items of a subgroup. This feature greatly improves matrix multiply performance. However, I have searched online for several days and cannot find a ready-to-use example of a "subgroup shuffle" application, especially for matrix multiply. If you could give an example, that would be very helpful. Thanks a lot!

Jinchuan_Tang · 11-07-2022

Hi Scout,

Try reaching the authors via LinkedIn:

Lingyi Kong is a Software Engineer at Intel’s IT Flex Services Group. He is an expert in GPU programming and optimization, and also has Graphics driver/runtime development experience on Intel® Iris and Intel® Iris Pro Graphics.

Robert Ioffe is a Technical Consulting Engineer at Intel’s Software and Solutions Group. He is an expert in OpenCL programming and OpenCL workload optimization on Intel Iris and Intel Iris Pro Graphics with deep knowledge of Intel Graphics Hardware. He was heavily involved in Khronos standards work, focusing on prototyping the latest features and making sure they can run well on Intel architecture. Most recently he has been working on prototyping Nested Parallelism (enqueue_kernel functions) feature of OpenCL 2.0 and wrote a number of samples that demonstrate Nested Parallelism functionality, including GPU-Quicksort for OpenCL 2.0. He also recorded and released two Optimizing Simple OpenCL Kernels videos and a third video on Nested Parallelism.

https://www.codeproject.com/Articles/994769/SGEMM-for-Intel-Processor-Graphics

Best wishes,

Jinchuan

HemanthCH_Intel · 11-10-2022

Hi,

The old page now redirects to a newer page, because the content of the old article is no longer relevant. Please refer to the new link:

https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top/kernels.html

Thanks & Regards,

Hemanth.

VidyalathaB_Intel · 11-21-2022

Hi,

As the issue is resolved, we are closing this thread. This thread will no longer be monitored, so please post a new question if you need any additional assistance from Intel.

Regards,

Vidya.