Repeatability of results from vslsConvExecX1D

Eric_Backus · ‎01-13-2010

Hi,

We're using MKL 10.2.2 to speed up parts of a signal processing application that we have. One issue that we've run across is repeatability of results. We find that the results out of vslsConvExecX1D are not always the same, even when the input data is identical.

Given identical input data, the output of vslsConvExecX1D is always close to the same,so I'd guess thatthe differences are due to different rounding/truncation error accumulation. I think at least part of the reason for these differences is differing memory alignment of the input data. But this doesn't seem to be sufficient to explain all differences - sometimes the results from different calls to vslsConvExecX1D are different even if the input vectors are identical and start at the exact same memory location.

The fact that there are any differences at all is disconcerting. Is there anything we can do to work around this issue?

One thing we tried is using vslsConvSetInternalPrecision(VSL_CONV_PRECISION_DOUBLE). This does make the results completely repeatable, but it the execution speed is more than an order of magnitude slower. Significantly slower than our original simple C code that does convolution completely repeatably.

At the end of this message I've attached a test program that demonstrates the problem, in case it helps. Sorry it's not shorter. Ugh, pasting seems to haveswallowed the indentation. On my machine, output from this program looks like this:

Run 0
Results at offset 6 differ from offset 0
Results at offset 8 differ from offset 0
Results at offset 10 differ from offset 0
Results at offset 12 differ from offset 0
Results at offset 14 differ from offset 0
Results at offset 16 differ from offset 0
Run 1
Results at offset 1 differ from offset 0
Results at offset 3 differ from offset 0
Results at offset 5 differ from offset 0
Results at offset 7 differ from offset 0
Results at offset 9 differ from offset 0
Results at offset 11 differ from offset 0
Results at offset 13 differ from offset 0
Results at offset 15 differ from offset 0

Note that the first and second runs produce different results, showing that results can differ even when two calls to vslsConvExecX1Dare given identical data at identical memory addresses.

--
Eric Backus
eric_backus@agilent.com

#include // For std::printf, std::fprintf
#include // For std::rand, std::srand
#include "mkl.h"

static void
conv_mkl(float* in, unsigned long in_len,
float* coef, unsigned long coef_len,
float* out)
{
const int mode = VSL_CONV_MODE_AUTO;
const int xshape = static_cast(coef_len);
const int yshape = static_cast(in_len);
const int zshape = static_cast(in_len - coef_len + 1);
const int xstride = 1;
const int ystride = 1;
const int zstride = 1;
const float* x = coef;
const int start = static_cast(coef_len - 1);
const int decimation = 1;

int status;
VSLConvTaskPtr task;
status = vslsConvNewTaskX1D(&task, mode, xshape, yshape, zshape, x, xstride);
if (status != VSL_STATUS_OK)
(void) std::fprintf(stderr, "vslsConvNewTaskX1D returned %d\n", status);
status = vslConvSetStart(task, &start);
if (status != VSL_STATUS_OK)
(void) std::fprintf(stderr, "vslConvSetStart returned %d\n", status);
status = vslConvSetDecimation(task, &decimation);
if (status != VSL_STATUS_OK)
(void) std::fprintf(stderr, "vslConvSetDecimation returned %d\n", status);

status = vslsConvExecX1D(task, in, ystride, out, zstride);
if (status != VSL_STATUS_OK)
(void) std::fprintf(stderr, "vslConvExecX1D returned %d\n", status);

// Deleting the task alleviates the problems we have seen. But in
// our real application, we use (and need) multiple tasks. In the
// real application, we delete tasks whenever we know we're done
// with them, but we still see the repeatability issues.
//vslConvDeleteTask(&task);
}

static void
buffer_init(float* out, unsigned long len)
{
for (unsigned long i = 0; i < len; i++)
*out++ = std::rand() * 0.001f;
}

static void
buffer_copy(float* in, float* out, unsigned long len)
{
for (unsigned long i = 0; i < len; i++)
*out++ = *in++;
}

// Returns true for exact equality
static bool
buffer_equal(float* in1, float* in2, unsigned long len)
{
for (unsigned long i = 0; i < len; i++)
if (*in1++ != *in2++)
return false;
return true;
}

int
main(int argc, char **argv)
{
std::srand(12345); // Same sequence every time, for now

// Main input data buffer
const unsigned long in_len = 4096;
float in[in_len];
buffer_init(in, in_len);

// Small "coef" buffer
const unsigned long coef_len = 29;
float coef[coef_len];
buffer_init(coef, coef_len);

// Temp coef buffer, which we use at different offsets to check
// how the convolution code reacts to different data alignments.
const unsigned long align_len = 17;
float coef_tmp[coef_len + align_len];

// Output buffers
const unsigned long out_len = in_len + coef_len; // Longer than necessary
float ref[out_len];
float out[out_len];
buffer_init(out, out_len);

for (int k = 0; k < 2; k++)
{
(void) std::printf("Run %d\n", k);

for (unsigned long i = 0; i < align_len; i++)
{
// Copy coefs to next alignment
buffer_copy(coef, coef_tmp + i, coef_len);

conv_mkl(in, in_len, coef_tmp + i, coef_len, out);

if (i == 0)
buffer_copy(out, ref, out_len);
else
if (!buffer_equal(out, ref, out_len))
(void) std::printf("Results at offset %ld differ from offset 0\n", i);
}
}

return EXIT_SUCCESS;
}

Dmitry_B_Intel · ‎01-13-2010

Hi Eric,

Probable reason for observed differences is threaded execution.
MKL User's Guide in section "Aligning Data for Numerical Stability" states the following:

With a given Intel MKL version, the outputs will be bit-for-bit identical provided all the following conditions are met:

the outputs are obtained on the same platform
the inputs are bit-for-bit identical
the input arrays are aligned identically at 16-byte boundaries
Intel MKL is run in the sequential mode

Though the conditions are formulated as related to LAPACK and BLAS, they aregeneral.

Dima

Gennady_F_Intel · ‎01-13-2010

Eric,
1)looking at your code, as seems to me the comparing of floating data
for (unsigned long i = 0; i < len; i++)
if (*in1++ != *in2++)
is not completely correct. Will be better to compare with some eps
2) btw, the code works fine on my local system: mkl10.2 update3, winXP32,
CPU == Intel Core2 Duo CPU T7300 @ 2.00GHz,
2 threads.
--Gennady

Eric_Backus · ‎01-26-2010

Hi Gennady,

Thank you for your response, and thank you to Dmitry for his response as well.

Yes, I'm well aware of the pitfalls of directly comparing floating-point values. However, in this case, it is the correct thing to do, because the whole point of this program is to verify whether results are bit-for-bit identical when the alignment of the input array is changed.

When you say the "code works fine", do you mean that the program did not print any strings like "Results at offset N differ from offset 0"? What compiler did you use? Which specific MKL libraries did you link with?

In my case, I'm using MS Visual Studio 2008. I see the problem when I link with:

mkl_intel_c_dll.lib, mkl_intel_thread_dll.lib, mkl_core_dll.lib, libiomp5md.lib

I still see the problem when I replace mkl_intel_thread_dll.lib with mkl_sequential_dll.lib.

I'm using MKL 10.2.2.025 on WinXP SP3 (32-bit). CPU == Intel Core2 Duo CPU T9400 @ 2.53GHz.

At runtime, if mkl_vml_p4m2.dll is NOT available (so presumably mkl_vml_def.dll is used) then the problem goes away. From this I conclude that the issues with data alignment probably come from the use of SSE or something similar.

Thanks for any insight you can provide!

--

Eric Backus

eric_backus@agilent.com