Hi,
I found a problem with MKL_DCSRSYMV. When I ran the following code
[cpp]
PROGRAM TEST
  INTEGER, PARAMETER :: N = 1000000
  DOUBLE PRECISION X(N)
  DO WHILE (.TRUE.)
    CALL MKL_DCSRSYMV('U', N, SPREAD(1.D0, 1, N), (/1 : N + 1/),&
                      (/1 : N/), SPREAD(1.D0, 1, N), X)
  END DO
END PROGRAM
[/cpp]
compiled with "ifort test.f90 -otest -mkl=parallel" (ifort pro 11.1.038 with the included MKL), memory consumption, as seen in 'top', kept rising until it drained all physical memory and I killed the process. I tried it on a four-socket Opteron 8350 and a dual-socket Xeon 5530. Memory usage blew up on both machines. Any cure for this?
Hi,
One comment on the coding. The way you call the routine forces the compiler to create 3 temporary arrays (it has to make a copy of these arguments before passing them), an obvious performance degradation. How to eliminate them? Don't pass non-contiguous arrays to routines that don't accept arrays by descriptor.
A.
You see, this is just a test program, so performance is not essential here. The real problem is that within a few hundred iterations the program consumes more than 10 GB of memory, which is a nightmare when I use the routine in a Krylov subspace solver, because I only have 12 GB of memory on my machine.
I know. That was just a comment.
I guess that you're on Linux. I have no time to test it there, but on my Win x64 machine it uses at most 90 MB (I waited until 2,500 iterations had passed) and I don't see any leak.
Did you try replacing the SPREAD calls?
A.
Actually I discovered the problem when calling MKL_DCSRSYMV from C. I also tried local and allocatable arrays. The same massive leaks.
Hi Styc,
If you use the sequential library, what is the result?
The command line is like
ifort test.f -o test -lmkl_intel_lp64 -Wl,--start-group -lmkl_sequential -lmkl_core -Wl,--end-group -lpthread
or
ifort test.f -o test -lmkl_intel_lp64 -Wl,--start-group -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread
Or would you like to upgrade the compiler to the latest version, which uses the latest MKL, 10.2.1?
I just tried Intel Compiler 11.1.046 on a 4-core Xeon machine. The test program seems to run fine. No memory leak.
<http://software.intel.com/en-us/articles/which-version-of-ipp--mkl--tbb-is-installed-with-intel-compiler-professional-edition/>
Best Regards,
Ying
Well, I'm pretty reluctant to do the upgrade now because 1) it's tricky and 2) I don't have the time.
I tried several possible solutions I could think of and found four ways to make the test program work normally:
1) linking with -mkl=sequential
2) OMP_NUM_THREADS=1/2/3/4/5 (see, you need more than four threads to see it break)
3) MKL_DISABLE_FAST_MM=1
4) setting N <= 5242880 / OMP_NUM_THREADS
It seems that 1), 2) and 4) actually address the same problem: limiting the amount of memory MKL_DCSRSYMV requires, i.e. sizeof(double) * N * OMP_NUM_THREADS, to no more than 40 MB. Apparently MKL_DCSRSYMV will repeatedly allocate workspaces larger than 40 MB but won't bother to deallocate them unless explicitly instructed to by something like 3).
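To make the arithmetic behind 2) and 4) concrete, here is a small C sketch of the estimate above. The function names and the OBSERVED_LIMIT_BYTES macro are made up for illustration; the 5242880-double (40 MiB) limit is only the empirical figure from this test, not a documented MKL constant, and the code assumes an 8-byte double.

```c
#include <stddef.h>
#include <stdint.h>

/* Limit observed in this thread: 5242880 doubles = 40 MiB.
   Empirical only; this is NOT a documented MKL constant. */
#define OBSERVED_LIMIT_BYTES ((uint64_t)5242880 * sizeof(double))

/* Workspace estimate from the post: sizeof(double) * N * OMP_NUM_THREADS. */
static uint64_t workspace_bytes(uint64_t n, unsigned threads)
{
    return (uint64_t)sizeof(double) * n * threads;
}

/* Returns 1 when the estimate exceeds the observed limit, i.e. when the
   configuration would be expected to show the memory growth, else 0. */
static int exceeds_observed_limit(uint64_t n, unsigned threads)
{
    return workspace_bytes(n, threads) > OBSERVED_LIMIT_BYTES;
}
```

With N = 1000000 the estimate is 40,000,000 bytes for five threads, which is under the limit, and 48,000,000 bytes for six threads, which is over it. That matches the thread counts listed in 2).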
Hi Styc,
Good news, I'm able to reproduce the problem with MKL 10.2 and MKL 10.2.1. The problem happens only when the size of the allocated arrays is huge. (If the problem size is small, for example N = 1000, there is no such problem, right?)
The root cause is a defect in the MKL memory manager. I have escalated it to the MKL engineering team to fix.
At present, the best way to avoid this problem is to set MKL_DISABLE_FAST_MM=1 as you described.
What is your general problem size, N = 1000000?
Best Regards,
Ying
Yes, typically around one million.
Hi Styc,
Thanks for letting me know. The reference number is DPD200084696. We will notify you when the fixed version is released (maybe around Oct.).
Thanks
Ying
