I'm finding a very puzzling problem with our code, but it is quite difficult to explain.
Our code is parallelized with MPI. Until now we always compiled it with -O3 and never found any issues. When I compile it with -O3 -xAVX, I start to see buggy behaviour, which shows up as negative numbers in a density matrix. These values appear in the domain layers at the parallelization boundaries: the domain covers a certain height in Z, and when run in parallel each rank gets a chunk of this domain (divided only in Z). The negative values show up in the edge layers of each rank, where communication happens to send/receive the neighbour values.
When compiled without -xAVX we never see any issue, but when compiled with -xAVX I always see the problem arising after just a few iterations. The values are not always the same, so there seems to be a race condition or some uninitialized values somewhere that only kick in when compiling with -xAVX.
To make things a bit weirder, our code is composed of three parts: main program, core library and module library. I can compile everything with -xAVX except for two things:
- one of the functions in the core library
- the linking stage for the core library
As long as I do it like that, the buggy behaviour doesn't show up. The funny thing is that the function above doesn't have any MPI code, and actually it is only run once at the beginning of the code, to store some data that is used later on throughout the execution of the code.
I have compiled the whole code with gcc and -mavx and in that case I don't get any faulty behaviour, but I don't know what gcc or icc do internally when these flags are set.
Any ideas/suggestions on how to go about something like this? It certainly looks like a parallelization problem, but the fact that it only shows up when -xAVX is used is weird to me.
After a few tests I found that the code, when compiled without -xAVX (with -O3 -ipo -g), gives identical (bit-for-bit) results over a number of iterations when run with a different number of processes (tried with 1 and 30).
When compiled with -xAVX (-O3 -ipo -g -xAVX), the differences appear after a few iterations even when run with only 1 process, and from the very first iteration when run with a different number of processes (tried with 1 and 30).
If my understanding of AVX is correct, this should not happen, right? I mean, AVX will just vectorize some loops as long as they don't have any dependences, but the result is deterministic, in the sense that, provided no random numbers are produced in the code, the result should be identical from run to run. This is what happens when compiled without -xAVX, but for some reason results start to differ between runs when the code has been compiled with -xAVX.
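One way vectorization can change results even in a loop without dependences is by reordering a floating-point reduction: a vectorized sum keeps one partial sum per SIMD lane and combines them at the end, and floating-point addition is not associative. The following is a minimal sketch in C, not taken from the code discussed here; the function names and the two-lane layout are made up for illustration. On its own this would explain differences between builds, not differences between runs of the same build.

```c
#include <stddef.h>

/* Sum in source order, as a scalar loop would. */
double sum_sequential(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Sum with two independent partial sums (hypothetical stand-in for a
   2-lane SIMD reduction): even indices go to lane 0, odd to lane 1. */
double sum_two_lanes(const double *a, size_t n)
{
    double lane[2] = {0.0, 0.0};
    for (size_t i = 0; i < n; ++i)
        lane[i % 2] += a[i];
    return lane[0] + lane[1];
}
```

For the input {1e20, 1.0, -1e20, 1.0}, the sequential sum gives 1.0 (the first 1.0 is absorbed by 1e20) while the two-lane sum gives 2.0, assuming IEEE 754 double arithmetic and no value-changing optimizations such as -ffast-math.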
Is your code parallelized with OpenMP as well as MPI?
>>The funny thing is that the function above doesn't have any MPI code, and actually it is only run once at the beginning of the code, to store some data that is used later on throughout the execution of the code.
This may indicate that: a) the produced data is incomplete, b) the produced data is incorrect, c) the code corrupts something beyond the bounds of the desired output, or d) the produced data is correct but the returned data is incorrect or incomplete.
Results should be (are expected to be) identical run-to-run within a given build, but not necessarily between builds.
I suggest adding some sanity check code (e.g. saving/checkpointing the data before and after this function generates its output).
You may need to do this with C helper functions.
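A sanity check along these lines could be as simple as a checksum over the raw bytes of the stored data, taken right after the function runs and again later, then compared between the AVX and non-AVX builds. This is only a sketch: `checksum_doubles` is a hypothetical helper name, the data is assumed to be an array of doubles, and FNV-1a is just one convenient choice of hash.

```c
#include <stddef.h>
#include <stdint.h>

/* FNV-1a 64-bit hash over the raw bytes of a double array.
   Equal checksums mean the buffers are almost certainly bit-identical;
   a difference in a single byte is guaranteed to change the checksum. */
uint64_t checksum_doubles(const double *buf, size_t n)
{
    const unsigned char *p = (const unsigned char *)buf;
    uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */
    for (size_t i = 0; i < n * sizeof(double); ++i) {
        h ^= p[i];
        h *= 1099511628211ULL;              /* FNV prime */
    }
    return h;
}
```

Printing the checksum right after the function returns and again just before the data is first used would show whether the data was produced differently or was overwritten afterwards.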
Thanks for replying.
In the end I figured out where the problem was coming from. It turns out that our code (parallelized only with MPI) has some data (one Z layer per process) that is duplicated in different MPI processes. After the first part of the code, where the values needed there are properly propagated via MPI, there is a part in which some other data is calculated point by point. Without AVX, the values needed to calculate this data are identical in the different processes in this duplicated layer, so the calculated values are identical and the program behaves fine.
With AVX, some of the values required to calculate this new data (some Voigt profiles) seem to be slightly different in each process, so the resulting data in the duplicated layer (which was identical without AVX) is no longer identical between processes, and the small differences grow over several iterations until the code blows up.
The solution was simply to not assume that the values in the duplicated layer are really going to be identical, and just copy them from one of the processes to the other. Problem solved.
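The idea can be sketched as follows. In the real code the two copies of the layer live in different MPI processes and the transfer would be an MPI message; here both copies are plain arrays in one process so that the example stays self-contained. The name `sync_duplicated_layer` and the mismatch count are illustrative additions, not part of the code discussed here.

```c
#include <stddef.h>
#include <string.h>

/* Count elements whose bit patterns differ between the owner's layer and
   the duplicated copy, then overwrite the copy with the owner's values. */
size_t sync_duplicated_layer(const double *owner, double *copy, size_t n)
{
    size_t mismatches = 0;
    for (size_t i = 0; i < n; ++i)
        if (memcmp(&owner[i], &copy[i], sizeof(double)) != 0)
            ++mismatches;
    memcpy(copy, owner, n * sizeof(double));
    return mismatches;
}
```

A nonzero return value flags exactly the situation described above: the duplicated layer was assumed identical but had drifted. After the call both copies are bit-identical again.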
>>The solution was simply to not assume that the values in the duplicated layer are really going to be identical, and just copy them from one of the processes to the other. Problem solved.
So you are choosing one of the supposedly duplicated values to use as the valid result? What would cause the results to not be duplicated? Note that if some of the ranks run on different CPU models, then potentially a different code path was taken, or a microcode change was made for the same instruction. In particular, if some CPUs have FMA and others do not, or some have AVX2 and others do not, then you might not generate the same bit pattern in the results (equivalent within precision, yes).