Can you post your code using

Vyacheslav_A_ · ‎02-04-2014

A question from newbie.

I am trying to use a reduction algorithm from NVidia SDK. It works correctly on Nvidia Discrete GPU, Intel HD Graphics 4400, but don't work on Intel CPU (Haswell i5).

Reduction method source:

inline void WarpReductionToFirstElement(
    __local float *partialDotProduct)
{
#define WARP_SIZE 32

      // Thread local ID within a warp
      uint id = get_local_id(0) & (WARP_SIZE - 1); 

      // Each warp reduces 64 (default) consecutive elements
      float warpResult = 0.0f;
      if (get_local_id(0) < get_local_size(0)/2 )
      {
          volatile __local float* p = partialDotProduct + 2 * get_local_id(0) - id;
          p[0] += p[32];
          p[0] += p[16];
          p[0] += p[8];
          p[0] += p[4];
          p[0] += p[2];
          p[0] += p[1];
          warpResult = p[0];
      }

      // Synchronize to make sure each warp is done reading
      // partialDotProduct before it is overwritten in the next step
      barrier(CLK_LOCAL_MEM_FENCE);

      // The first thread of each warp stores the result of the reduction
      // at the beginning of partialDotProduct
      if (id == 0)
         partialDotProduct[get_local_id(0) / WARP_SIZE] = warpResult;

      // Synchronize to make sure each warp is done writing to
      // partialDotProduct before it is read in the next step
      barrier(CLK_LOCAL_MEM_FENCE);

      // Number of remaining elements after the first reduction
      uint size = get_local_size(0) / (2 * WARP_SIZE);

      // get_local_size(0) is less or equal to 512 on NVIDIA GPUs, so
      // only a single warp is needed for the following last reduction
      // step
      if (get_local_id(0) < size / 2)
      {
         volatile __local float* p = partialDotProduct + get_local_id(0);

         if (size >= 8)
            p[0] += p[4];
         if (size >= 4)
            p[0] += p[2];
         if (size >= 2)
            p[0] += p[1];
      }

}

Kernel source:

__kernel void TestReductionKernel(
    __global float * gdata,
    __local float * ldata)
{
    ldata[get_local_id(0)] = gdata[get_local_id(0)];

    barrier(CLK_LOCAL_MEM_FENCE);

    WarpReductionToFirstElement(ldata);

    barrier(CLK_LOCAL_MEM_FENCE);

    if(get_local_id(0) == 0)
    {
        gdata[0] = ldata[0];
    }

    barrier(CLK_LOCAL_MEM_FENCE);
}

Unit test (C# + OpenCL.NET):

        [TestMethod]
        public void IntelCPUWarpReductionTest()
        {
            var deviceChooser = new IntelCPUDeviceChooser();

            TestWarpReduction(deviceChooser);
        }

        private static void TestWarpReduction(
            IDeviceChooser deviceChooser)
        {
            if (deviceChooser == null)
            {
                throw new ArgumentNullException("deviceChooser");
            }

            var seed = DateTime.Now.Millisecond;
            var randomizer = new DefaultRandomizer(ref seed);

            for (var size = 2; size < 64; size++)
            {
                ConsoleAmbientContext.Console.WriteLine("size = {0}", size);

                using (var clProvider = new CLProvider(deviceChooser, true))
                {
                    var m = clProvider.CreateFloatMem(
                        size,
                        MemFlags.CopyHostPtr | MemFlags.ReadWrite);

                    for (var cc = 0; cc < m.Array.Length; cc++)
                    {
                        m.Array[cc] = randomizer.Next(256)/256f;
                    }

                    m.Write(BlockModeEnum.Blocking);

                    var cpuSum = m.Array.Sum();

                    var k = @"
__kernel void TestReductionKernel(
    __global float * gdata,
    __local float * ldata)
{
    ldata[get_local_id(0)] = gdata[get_local_id(0)];

    barrier(CLK_LOCAL_MEM_FENCE);

    WarpReductionToFirstElement(ldata);

    barrier(CLK_LOCAL_MEM_FENCE);


    if(get_local_id(0) == 0)
    {
        gdata[0] = ldata[0];
    }

    barrier(CLK_LOCAL_MEM_FENCE);
}
";

                    var kernel = clProvider.CreateKernel(k, "TestReductionKernel");

                    kernel
                        .SetKernelArgMem(0, m)
                        .SetKernelArgLocalMem(1, size)
                        .EnqueueNDRangeKernel(
                            new[] {size},
                            new int[] {size});

                    m.Read(BlockModeEnum.Blocking);

                    var gpuSum = m.Array[0];

                    Assert.AreEqual(cpuSum, gpuSum);
                }
            }
        }

So, unit test reach full success on Nvidia GPU and Intel HD Graphics but fails on Intel CPU with size = 4.

Log:

size = 2
Choosed vendor: INTEL(R) CORPORATION
size = 3
Choosed vendor: INTEL(R) CORPORATION
size = 4
Choosed vendor: INTEL(R) CORPORATION

Ошибка в Assert.AreEqual. Ожидается: <2,1875>. Фактически: <1,695313>.
в MyNN.Tests.MLP2.Reduction.WarpReductionFixture.TestWarpReduction(IDeviceChooser deviceChooser) в WarpReductionFixture.cs: line 113
в MyNN.Tests.MLP2.Reduction.WarpReductionFixture.IntelCPUWarpReductionTest() в WarpReductionFixture.cs: line 38

The difference in behavior between Intel CPU and Intel GPU on SAME machine looks strange for me.

What is the reason to this problem? How can I fix the problem?

ps. if it needs I can attach a VS 2013 sample project.

Raghupathi_M_Intel · ‎02-05-2014

Can you post your code using the <code> tags? Also what is the graphics driver version? If you can attach the whole project that'll be even better.

Thanks,
Raghu

Vyacheslav_A_ · ‎02-05-2014

Hi Raghu.

Driver version is 3.0.1.10878. Today I've updated video driver and Intel OpenCL SDK, but nothing changes. VS

2012 Solution and screenshot with tests results has attached. File WarpReduction.cl contains NVidia Reduction algorithm.

File WarpReductionFixture.cs contains unit tests and test kernel.

My laptop is Lenovo u430p with Intel Haswell i5 + Nvidia GeForce 730m.

Thanks!

Reduction on Intel CPU