On Windows 8.1 64-bit with an Intel HD 4600 (both the latest release and beta drivers), the following snippet from an OpenCL kernel produces an incorrect result:
if (k_delta == 38443432 && jj == 4620)
    printf((__constant char *)"cl_barrett32_87_gs: jj=%x kdelta=%x mulhi=%x\n", (uint)jj, (uint)k_delta, (uint)mul_hi((uint)jj, (uint)k_delta));
The output is:
cl_barrett32_87_gs: jj=120c kdelta=24a99a8 mulhi=0
It is my understanding that mul_hi should not return zero here: 4620 * 38443432 = 177608655840 is larger than 2^32, so the upper 32 bits of the product are nonzero.
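For anyone who wants to check the expected value on the host, here is a minimal sketch in plain C (not OpenCL). The helper name ref_mul_hi is hypothetical; it only models what mul_hi is specified to return for 32-bit unsigned operands, namely the high 32 bits of the full 64-bit product.

```c
#include <stdint.h>

/* Hypothetical host-side reference (not part of the kernel):
   returns the high 32 bits of the full 64-bit product a * b. */
static uint32_t ref_mul_hi(uint32_t a, uint32_t b)
{
    uint64_t product = (uint64_t)a * (uint64_t)b;  /* widen before multiplying */
    return (uint32_t)(product >> 32);              /* keep only the upper half */
}
```

With the values from the printf output, ref_mul_hi(4620u, 38443432u) evaluates to 0x29.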
I also have a (likely) related multiplication bug:
facdist = (ulong) (2 * NUM_CLASSES) * (ulong) exponent;
This fails with the upper 32 bits of the result being zero. NUM_CLASSES is a #define for 4620, and exponent is a value in the 50 million range.
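The expected result of this multiplication can also be sketched on the host in plain C. The exact exponent is not given here, so the value 50000000 used below is an assumption chosen only to be "in the 50 million area", and ref_facdist is an illustrative name, not a function from the kernel.

```c
#include <stdint.h>

#define NUM_CLASSES 4620

/* Hypothetical host-side reference for the facdist expression:
   both operands are widened to 64 bits before the multiply. */
static uint64_t ref_facdist(uint32_t exponent)
{
    return (uint64_t)(2 * NUM_CLASSES) * (uint64_t)exponent;
}
```

For the assumed exponent of 50000000, ref_facdist returns 462000000000, whose upper 32 bits are 0x6b. Any exponent of that size gives a product well above 2^32, so a zero upper half is not a plausible result.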
OK, now it gets weird. If I add one line that seemingly does nothing, the code snippet works (mul_hi returns 0x29). That line is:
jj = jj % k_delta;
Update: This line actually does something. In the original code snippet, a smart optimizing compiler can determine that jj is the constant 4620. Adding the line above forces the compiler to place the jj value in a register or memory.
I failed to create a tiny reproducible case, so I removed a ton of extraneous code from my program and zipped it up for you. The zip includes all the source, MSVC makefiles, and a prebuilt executable.
The buggy code is in src/gpusieve.cl, in the function CalcModularInverses, above the "if (prime == 13)" printfs. It reproduces both the mul_hi bug and the ulong multiplication bug. It also includes the correct result when the constant 4620 is assigned to a variable.
Let me know if you need more.