Strange barrier(CLK_LOCAL_MEM_FENCE) deadlock in a kernel's local-array reduction


The code runs fine when the program is created with clCreateProgramWithSource, but deadlocks when it is created with clCreateProgramWithBinary. The work-group size is (8,8). I can reproduce the deadlock on Intel Graphics: Windows 8.1 with HD 4600/530, and Windows 10 with HD 630.

local int SumAlpha[16*16/4];
int index = 1*get_local_id(0) + 16/2*get_local_id(1);
SumAlpha[index] = sum_alpha;

// do the reduction in local memory
for (int i = (16*16/8); i > 0; i >>= 1) {
    if (index < i)
        SumAlpha[index] += SumAlpha[index + i];
    barrier(CLK_LOCAL_MEM_FENCE); // with the loop unrolled, the deadlock shows up at i == 8
}
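
As a sanity check on the indexing, here is a minimal plain-C sketch (host code, not device code) that emulates the reduction above one phase at a time. It assumes the (8,8) work-group from this post and that the barrier is meant to run once per loop iteration. The names flat_index and emulate_reduction are hypothetical and only used for illustration. Each pass of the outer loop stands for the work done between two barriers, and the assert checks that index + i never leaves the 64-element array.

```c
#include <assert.h>

/* Work-group shape from the post: (8,8) -> 64 work-items, 64 == 16*16/4. */
enum { LOCAL_X = 8, LOCAL_Y = 8, N = LOCAL_X * LOCAL_Y };

/* Same formula as the kernel: 1*get_local_id(0) + 16/2*get_local_id(1). */
static int flat_index(int lid0, int lid1)
{
    return 1 * lid0 + (16 / 2) * lid1;
}

/* Emulates the tree reduction on the host. sum_alpha[k] plays the role of
 * the private sum_alpha value of the work-item whose flat index is k.
 * Returns the value that ends up in SumAlpha[0]. */
static int emulate_reduction(const int sum_alpha[N])
{
    int SumAlpha[N];

    /* Phase 0: every work-item stores its private value. */
    for (int y = 0; y < LOCAL_Y; ++y)
        for (int x = 0; x < LOCAL_X; ++x)
            SumAlpha[flat_index(x, y)] = sum_alpha[flat_index(x, y)];
    /* On the device, a barrier would be needed here before the first read. */

    for (int i = 16 * 16 / 8; i > 0; i >>= 1) {
        /* One phase: all work-items run the guarded add for this i. */
        for (int y = 0; y < LOCAL_Y; ++y) {
            for (int x = 0; x < LOCAL_X; ++x) {
                int index = flat_index(x, y);
                if (index < i) {
                    assert(index + i < N); /* the read stays in bounds */
                    SumAlpha[index] += SumAlpha[index + i];
                }
            }
        }
        /* End of phase: this is where the per-iteration barrier sits.
         * It is outside the if, so all 64 work-items reach it. */
    }
    return SumAlpha[0];
}
```

Feeding it the values 0..63 should give 2016, the sum of all 64 inputs. If that holds, the index arithmetic itself is sound and every work-item reaches the barrier the same number of times, which points the question toward the binary build path rather than the reduction logic.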

The command that creates the SPIR binary is:

"..\vendors\OpenCL\Intel\bin\ioc32.exe -cmd=build -input="xxx" -device=gpu -spir32="FileName%" -bo="-cl-std=CL1.2 -I %ProjetcDir%\cl"

I also tried '-cl-opt-disable' and got the same deadlock. Replacing the accumulation with the equivalent code below deadlocks as well.

a = SumAlpha[index];
b = SumAlpha[index + i];
SumAlpha[index] = a + b;

I can work around the problem by using the dot instruction instead of the reduction, but I would like to understand what is going on. Did I miss something? Is there a way to dump the Gen assembly for both build paths so that I can compare them? Many thanks.

