<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Undefined behavior in matrix multiplication kernel in OpenCL* for CPU</title>
    <link>https://community.intel.com/t5/OpenCL-for-CPU/Undefined-behavior-in-matrix-multiplication-kernel/m-p/1396717#M7063</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;I have a generalized matrix multiplication kernel and am getting incorrect results. This seems to stem from undefined behavior, as the result is correct when I do one of the following:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;add a barrier inside the `step` loop&lt;/LI&gt;
&lt;LI&gt;remove the last pair of parentheses in the index calculation of the `c_glb` access in the last loop nest&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;I have tried the following so far to find the error but with no luck:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;traced every read and write access from every work-item: they access the correct array elements and there should be no data races&lt;/LI&gt;
&lt;LI&gt;I found that the incorrect results are produced in the last loop nest in line `c_glb[...] += a_glb[...] * b_lcl[...];`. When tracing the memory accesses of a work-item I can see that the first accumulation produces the correct result in `c_glb`, but the second iteration overwrites the previously written value to `c_glb` instead of accumulating the results. I have checked that no other work-item tries to overwrite that array element.&lt;/LI&gt;
&lt;LI&gt;execute with oclgrind: no warnings shown&lt;/LI&gt;
&lt;LI&gt;translate to CUDA and execute on GPU: no undefined behavior and no warnings/errors when executing with cuda-memcheck&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Do you by any chance see anything that could be causing undefined behavior in this kernel or do you have any recommendations for other tools/methods I could use to further narrow down where this problem comes from?&lt;/P&gt;
&lt;P&gt;Many thanks in advance!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;__kernel void matmul( __global float const * const __restrict__ a_glb, __global float const * const __restrict__ b_glb, __global float * const __restrict__ _, __global float * const __restrict__ c_glb) {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; __local float b_lcl[(4 * 1 * 1) * (2 * 2 * 2) * (1 * 1 * 2) * (2 * 1 * 1)];&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; const size_t wi_3 = get_local_id(0) / (2 * 4 * 1) % (2);&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; const size_t wi_5 = get_local_id(0) / (2) % (4);&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; const size_t wi_6 = get_local_id(0) % (2);&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; // init result memory&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (size_t lcl_6 = 0; lcl_6 &amp;lt; 2; ++lcl_6) {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (size_t prv_6 = 0; prv_6 &amp;lt; 2; ++prv_6) {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; c_glb[wi_3 * 32 + wi_5 * 8 + wi_6 * 4 + lcl_6 * 2 + prv_6] = 0.0f;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; }}&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; // cache b_glb&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; #pragma unroll&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (size_t step = 0; step &amp;lt; (((4 * 1 * 1 ) * (2 * 2 * 2 ) * (1 * 1 * 2 ) * (2 * 1 * 1 )) / 1) / (1 * 1 * 2 * 1 * 4 * 2 * 1); ++step) {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; b_lcl[step * (1 * 1 * 2 * 1 * 4 * 2 * 1) + get_local_id(0)] = &lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; b_glb[((0 * 1 * 4 * 1 * 1 + 0 * 4 * 1 * 1 ) + ((step * (1 * 1 * 2 * 1 * 4 * 2 * 1) + get_local_id(0)) / ((2 * 2 * 2 ) * (1 * 1 * 2 ) * ((2 * 1 * 1 ) / 1)) % ((4 * 1 * 1 )))) * (8) * (2) * ((2 / 1)) + ((0 * 1 * 2 * 2 * 2 + 0 * 2 * 2 * 2 ) + ((step * (1 * 1 * 2 * 1 * 4 * 2 * 1) + get_local_id(0)) / ((1 * 1 * 2 ) * ((2 * 1 * 1 ) / 1)) % ((2 * 2 * 2 )))) * (2) * ((2 / 1)) + ((0 * 1 * 1 * 1 * 2 + 0 * 1 * 1 * 2 ) + ((step * (1 * 1 * 2 * 1 * 4 * 2 * 1) + get_local_id(0)) / ((2 * 1 * 1 ) / 1) % ((1 * 1 * 2 )))) * ((2 / 1)) + (((0 * 1 * 2 * 1 * 1 + 0 * 2 * 1 * 1 ) / 1) + ((step * (1 * 1 * 2 * 1 * 4 * 2 * 1) + get_local_id(0)) % (((2 * 1 * 1 ) / 1)))) ];&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; barrier(CLK_LOCAL_MEM_FENCE);&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; // process tile&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (size_t lcl_6 = 0; lcl_6 &amp;lt; 2; ++lcl_6) {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (size_t prv_7 = 0; prv_7 &amp;lt; 2; ++prv_7) {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (size_t prv_6 = 0; prv_6 &amp;lt; 2; ++prv_6) {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; c_glb[(wi_3) * (1) * (4) * (8) + (wi_5) * (8) + (wi_6 * 2 * 2 + lcl_6 * 2 + prv_6)] +=&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; a_glb[prv_7] * &lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; b_glb[(wi_5) * (2 * 2 * 2) * (1 * 1 * 2) * (2 * 1 * 1) + (wi_6 * 2 * 2 + lcl_6 * 2 + prv_6) * (1 * 1 * 2) * (2 * 1 * 1) + (prv_7) * (2 * 1 * 1) + (wi_3) ];&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; }}}&lt;BR /&gt;}&lt;/P&gt;</description>
    <pubDate>Thu, 30 Jun 2022 10:09:29 GMT</pubDate>
    <dc:creator>Richard_S_7</dc:creator>
    <dc:date>2022-06-30T10:09:29Z</dc:date>
    <item>
      <title>Undefined behavior in matrix multiplication kernel</title>
      <link>https://community.intel.com/t5/OpenCL-for-CPU/Undefined-behavior-in-matrix-multiplication-kernel/m-p/1396717#M7063</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;I have a generalized matrix multiplication kernel and am getting incorrect results. This seems to stem from undefined behavior, as the result is correct when I do one of the following:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;add a barrier inside the `step` loop&lt;/LI&gt;
&lt;LI&gt;remove the last pair of parentheses in the index calculation of the `c_glb` access in the last loop nest&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;I have tried the following so far to find the error but with no luck:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;traced every read and write access from every work-item: they access the correct array elements and there should be no data races&lt;/LI&gt;
&lt;LI&gt;I found that the incorrect results are produced in the last loop nest in line `c_glb[...] += a_glb[...] * b_lcl[...];`. When tracing the memory accesses of a work-item I can see that the first accumulation produces the correct result in `c_glb`, but the second iteration overwrites the previously written value to `c_glb` instead of accumulating the results. I have checked that no other work-item tries to overwrite that array element.&lt;/LI&gt;
&lt;LI&gt;execute with oclgrind: no warnings shown&lt;/LI&gt;
&lt;LI&gt;translate to CUDA and execute on GPU: no undefined behavior and no warnings/errors when executing with cuda-memcheck&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Do you by any chance see anything that could be causing undefined behavior in this kernel or do you have any recommendations for other tools/methods I could use to further narrow down where this problem comes from?&lt;/P&gt;
&lt;P&gt;Many thanks in advance!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;__kernel void matmul( __global float const * const __restrict__ a_glb, __global float const * const __restrict__ b_glb, __global float * const __restrict__ _, __global float * const __restrict__ c_glb) {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; __local float b_lcl[(4 * 1 * 1) * (2 * 2 * 2) * (1 * 1 * 2) * (2 * 1 * 1)];&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; const size_t wi_3 = get_local_id(0) / (2 * 4 * 1) % (2);&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; const size_t wi_5 = get_local_id(0) / (2) % (4);&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; const size_t wi_6 = get_local_id(0) % (2);&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; // init result memory&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (size_t lcl_6 = 0; lcl_6 &amp;lt; 2; ++lcl_6) {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (size_t prv_6 = 0; prv_6 &amp;lt; 2; ++prv_6) {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; c_glb[wi_3 * 32 + wi_5 * 8 + wi_6 * 4 + lcl_6 * 2 + prv_6] = 0.0f;&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; }}&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; // cache b_glb&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; #pragma unroll&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (size_t step = 0; step &amp;lt; (((4 * 1 * 1 ) * (2 * 2 * 2 ) * (1 * 1 * 2 ) * (2 * 1 * 1 )) / 1) / (1 * 1 * 2 * 1 * 4 * 2 * 1); ++step) {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; b_lcl[step * (1 * 1 * 2 * 1 * 4 * 2 * 1) + get_local_id(0)] = &lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; b_glb[((0 * 1 * 4 * 1 * 1 + 0 * 4 * 1 * 1 ) + ((step * (1 * 1 * 2 * 1 * 4 * 2 * 1) + get_local_id(0)) / ((2 * 2 * 2 ) * (1 * 1 * 2 ) * ((2 * 1 * 1 ) / 1)) % ((4 * 1 * 1 )))) * (8) * (2) * ((2 / 1)) + ((0 * 1 * 2 * 2 * 2 + 0 * 2 * 2 * 2 ) + ((step * (1 * 1 * 2 * 1 * 4 * 2 * 1) + get_local_id(0)) / ((1 * 1 * 2 ) * ((2 * 1 * 1 ) / 1)) % ((2 * 2 * 2 )))) * (2) * ((2 / 1)) + ((0 * 1 * 1 * 1 * 2 + 0 * 1 * 1 * 2 ) + ((step * (1 * 1 * 2 * 1 * 4 * 2 * 1) + get_local_id(0)) / ((2 * 1 * 1 ) / 1) % ((1 * 1 * 2 )))) * ((2 / 1)) + (((0 * 1 * 2 * 1 * 1 + 0 * 2 * 1 * 1 ) / 1) + ((step * (1 * 1 * 2 * 1 * 4 * 2 * 1) + get_local_id(0)) % (((2 * 1 * 1 ) / 1)))) ];&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; barrier(CLK_LOCAL_MEM_FENCE);&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; // process tile&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (size_t lcl_6 = 0; lcl_6 &amp;lt; 2; ++lcl_6) {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (size_t prv_7 = 0; prv_7 &amp;lt; 2; ++prv_7) {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; for (size_t prv_6 = 0; prv_6 &amp;lt; 2; ++prv_6) {&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; c_glb[(wi_3) * (1) * (4) * (8) + (wi_5) * (8) + (wi_6 * 2 * 2 + lcl_6 * 2 + prv_6)] +=&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; a_glb[prv_7] * &lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; b_glb[(wi_5) * (2 * 2 * 2) * (1 * 1 * 2) * (2 * 1 * 1) + (wi_6 * 2 * 2 + lcl_6 * 2 + prv_6) * (1 * 1 * 2) * (2 * 1 * 1) + (prv_7) * (2 * 1 * 1) + (wi_3) ];&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp; }}}&lt;BR /&gt;}&lt;/P&gt;</description>
      <pubDate>Thu, 30 Jun 2022 10:09:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/OpenCL-for-CPU/Undefined-behavior-in-matrix-multiplication-kernel/m-p/1396717#M7063</guid>
      <dc:creator>Richard_S_7</dc:creator>
      <dc:date>2022-06-30T10:09:29Z</dc:date>
    </item>
  </channel>
</rss>

