Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Data corruption when threading VSL

bellman
Beginner
I've just parallelized a Fortran routine that simulates individuals' behavior, and I've had some problems when generating random numbers with the Vector Statistical Library (VSL). The structure of the program is the following:

program example
...
!$omp parallel do num_threads(proc) default(none) private(private variables) shared(shared variables)
do i=1,n
call firstroutine(...)
enddo
!$omp end parallel do
...
end program example

subroutine firstroutine
...
call secondroutine(...)
...
end subroutine

subroutine secondroutine
...
VSL calls
...
end subroutine

I use the Intel Fortran Compiler for the compilation with a makefile that looks as follows:

f90comp = ifort
libdir = /home
mklpath = /opt/intel/mkl/10.0.5.025/lib/32/
mklinclude = /opt/intel/mkl/10.0.5.025/include/
exec: Example.o Firstroutine.o Secondroutine.o
$(f90comp) -O3 -fpscomp logicals -openmp -o aaa -L$(mklpath) -I$(mklinclude) Example.o -lmkl_ia32 -lguide -lpthread
Example.o: $(libdir)Example.f90
$(f90comp) -O3 -fpscomp logicals -openmp -c $(libdir)Example.f90
Firstroutine.o: $(libdir)Firstroutine.f90
$(f90comp) -O3 -fpscomp logicals -openmp -c $(libdir)Firstroutine.f90
Secondroutine.o: $(libdir)Secondroutine.f90
$(f90comp) -O3 -fpscomp logicals -openmp -c -L$(mklpath) -I$(mklinclude) $(libdir)Secondroutine.f90 -lmkl_ia32 -lguide -lpthread

Compilation works fine, and when I run the program the generated random numbers look fine most of the time. However, from time to time (say once every 200-500 iterations), it generates crazy numbers for a couple of iterations and then runs normally again. I have not found any pattern as to when this corruption happens.

Any idea why this is happening?
Andrey_N_Intel
Employee

Can you please provide more details on this issue: which basic random number generator and distribution generator you use, how you initialize the random stream, which of the parallelization techniques available in VSL (skip-ahead, leap-frog, or multi-stream BRNGs) you use, if any, how you generate random numbers, etc.? Providing a short test case that demonstrates the data corruption would be ideal and would help to isolate the origin of the issue. Thanks, Andrey
bellman
Beginner

Thanks for your reply. Here it goes:

seed=777
brng=VSL_BRNG_MT2203
errcode=vslnewstream( stream, brng, seed )

method=VSL_METHOD_SUNIFORM_STD
errcode=vsrnguniform( method, stream, n, r, 0., 1. )

method=VSL_METHOD_DGAUSSIAN_ICDF
errcode=vdrnggaussian( method, stream, n, r, 0., 1. )



I don't think I'm using any of the parallelization techniques available in VSL. Note that I want to generate exactly the same sequence of random numbers in each thread. That's why I initialize the stream with the same seed in each thread.

Do you need more info? Let me know.

Thanks again for your help
Andrey_N_Intel
Employee

Thanks for the additional info.

Is this piece of VSL code located in subroutine "secondroutine"?
What is the data type of array r? You call two VSL routines: vsrnguniform, which is the single-precision version of the Uniform generator, and vdrnggaussian, which is the double-precision version of the Gaussian generator. In both cases the same array r appears to be used to store the output of the generators. Can you please check this and let me know?

Also, your code seems to produce the same sequences of random numbers in parallel. Is this the expected behavior of the application, or should the parallel pieces of the code process different sequences of random numbers?

Anyway, if you could prepare and send a short test case which reproduces the data corruption it would be easier to detect an issue if any.

Thanks, Andrey
bellman
Beginner

Yes, the code is located in the secondroutine.

Regarding r, this was a mistake I made when I "generalized" my code. In fact, there is one variable, rs (single precision), for the uniform call and another, rd (double precision), for the Gaussian call.

As I edited in my previous reply (I'm sorry for that), what I need is to obtain exactly the same sequence of random numbers in each iteration.
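
For reference, a minimal sketch of the arrangement described so far: every call creates its own stream from the same seed and fills its own rs and rd buffers, so no VSL state is shared between threads. The program name, the subroutine name draw, and the sizes ndraw and niter are hypothetical. It also assumes the MKL Fortran interface file is called mkl_vsl.f90 (the name may differ in the MKL release used here) and uses the method names from the snippet posted earlier. This shows one thread-safe layout and is not a diagnosis of the corruption.

```fortran
! Sketch only: each call owns its stream and its output buffers.
! Assumption: MKL's Fortran interface file is named mkl_vsl.f90.
include 'mkl_vsl.f90'

program per_thread_streams
  use mkl_vsl_type
  use mkl_vsl
  implicit none
  integer, parameter :: ndraw = 1000   ! hypothetical number of draws
  integer, parameter :: niter = 8      ! hypothetical number of iterations
  integer :: i

  !$omp parallel do private(i)
  do i = 1, niter
     call draw(ndraw)
  end do
  !$omp end parallel do

contains

  subroutine draw(n)
    integer, intent(in) :: n
    type(vsl_stream_state) :: stream   ! local variable, private to the caller
    real(kind=4) :: rs(n)              ! single precision, for vsrnguniform
    real(kind=8) :: rd(n)              ! double precision, for vdrnggaussian
    integer :: errcode

    ! Same BRNG and seed on every call, so every call draws the same sequence.
    errcode = vslnewstream(stream, VSL_BRNG_MT2203, 777)
    if (errcode /= VSL_STATUS_OK) print *, 'vslnewstream failed, status ', errcode

    ! Single-precision bounds for the s-routine, double-precision for the d-routine.
    errcode = vsrnguniform(VSL_METHOD_SUNIFORM_STD, stream, n, rs, 0.0, 1.0)
    errcode = vdrnggaussian(VSL_METHOD_DGAUSSIAN_ICDF, stream, n, rd, 0.0d0, 1.0d0)

    ! Release the stream state before returning.
    errcode = vsldeletestream(stream)
  end subroutine draw
end program per_thread_streams
```

In this layout the stream and both arrays are local to draw, so they never need to appear in the shared or private clauses of the parallel loop.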

Unfortunately, my original code is copyrighted and too long to be posted here. I'm working on a Mickey Mouse example, but so far I can't reproduce the error. Since the data corruption only happens from time to time, it is difficult to know whether it simply hasn't happened in the runs I've made or whether the example is missing something needed to reproduce it. I'll keep working on that and post the example once I reproduce the error.

Finally, I have a question that may help to solve the issue: the main program and the two subroutines are located in different f90 source files. I compile them using the makefile I described in the original post. Should I compile all source files with the -L$(mklpath) -I$(mklinclude) and -lmkl_ia32 -lguide -lpthread options?

Many thanks!
Andrey_N_Intel
Employee
When you compile the source files of your application into object files, you only need to set the path to the MKL header files, -I$(mklinclude), and only for routines that use MKL functions. At the final stage, when building the executable from the object files, you also need to set the path to the MKL libraries, -L$(mklpath), and list the MKL libraries your application should be linked against.

Since MKL has a layered structure, I suggest slightly modifying your makefile to change the MKL libraries your application is linked against.
Please replace -lmkl_ia32 -lguide -lpthread with -lmkl_intel -lmkl_intel_thread -lmkl_core -liomp5 -lpthread when building your executable from the object files. Detailed information about the layered model of MKL and the different linking schemes, with examples, is available in the User's Guide (in particular, Section 5),
http://cache-www.intel.com/cd/00/00/34/74/347460_347460.pdf.
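
A sketch of how the makefile from the original post might look with these changes applied. The variables fflags and objs are hypothetical names added for brevity, and the link line assumes that all three object files belong in the executable. Paths and compiler options are copied from the original post.

```makefile
f90comp    = ifort
libdir     = /home
mklpath    = /opt/intel/mkl/10.0.5.025/lib/32/
mklinclude = /opt/intel/mkl/10.0.5.025/include/
fflags     = -O3 -fpscomp logicals -openmp
objs       = Example.o Firstroutine.o Secondroutine.o

# Link step: the library path and the layered MKL libraries appear only here.
exec: $(objs)
	$(f90comp) $(fflags) -o aaa $(objs) -L$(mklpath) -lmkl_intel -lmkl_intel_thread -lmkl_core -liomp5 -lpthread

# Compile steps: no -L or -l options; -I only where MKL routines are called.
Example.o: $(libdir)Example.f90
	$(f90comp) $(fflags) -c $(libdir)Example.f90
Firstroutine.o: $(libdir)Firstroutine.f90
	$(f90comp) $(fflags) -c $(libdir)Firstroutine.f90
Secondroutine.o: $(libdir)Secondroutine.f90
	$(f90comp) $(fflags) -I$(mklinclude) -c $(libdir)Secondroutine.f90
```

Recipe lines must start with a tab character, as in any makefile.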

Please, let me know how it works for you.
Also, please let me know if the data corruption is still observed. If the error is reproducible, feel free to report it to Premier Support.

Thanks,
Andrey
bellman
Beginner

If I do this, I get the following error:

./aaa: error while loading shared libraries: libmkl_intel.so: cannot open shared object file: No such file or directory

However, the file does exist in the specified directory!!!
Andrey_N_Intel
Employee

Before running your application, please set the environment variable LD_LIBRARY_PATH to the MKL subdirectory that contains the required libraries.

Thanks,
Andrey
bellman
Beginner

OK! Now it runs. I had never done that before. Could this be the reason why it did not work?

To check whether the data corruption is still there, I have to run the program and wait... Since each iteration takes some time, I will probably have to run the program for a few days before I either see evidence of corruption or can rule it out definitively. I'll let you know whether it appears again.

Thanks a lot!!!
Andrey_N_Intel
Employee

Glad to know that your application works. Yes, before coming to any conclusions it makes sense to test/run the program. I'm waiting for your test results. Thanks, Andrey
bellman
Beginner

Bad news... the data is still corrupted :-(
Gennady_F_Intel
Moderator

At first glance everything looks OK with your code, so this may really be an issue with MKL.
I would recommend that you submit the issue against MKL to Premier Support (https://premier.intel.com/),
and our experts will work with you there.
--Gennady
