Re: Memory leak in MKL_DCSRSYMV?

styc · ‎08-15-2009

Hi,

I found a problem with MKL_DCSRSYMV. When I ran the following code

[cpp]        PROGRAM TEST

        INTEGER, PARAMETER :: N = 1000000
        DOUBLE PRECISION X(N)

        DO WHILE (.TRUE.)
            CALL MKL_DCSRSYMV('U', N, SPREAD(1.D0, 1, N), (/1 : N + 1/),&
                    (/1 : N/), SPREAD(1.D0, 1, N), X)
        END DO

        END PROGRAM
[/cpp]

compiled with "ifort test.f90 -otest -mkl=parallel" (ifort pro 11.1.038 with the included MKL), memory consumption, as seen in 'top', kept rising until it drained all physical memory and I killed the process. I tried it on a four-socket Opteron 8350 and a dual-socket Xeon 5530. Memory usage blew up on both machines. Any cure for this?

ArturGuzik · ‎08-16-2009

Hi,

One comment on the coding. The way you call the routine forces the compiler to create temporary arrays for the array-valued arguments (it has to evaluate each array expression into a temporary before passing it), an obvious performance degradation. To eliminate them, pass named contiguous arrays rather than array expressions or non-contiguous sections to routines that don't accept arrays by descriptor.

A.

styc · ‎08-16-2009

You see, this is just a test program, and for a test program performance is nonessential. The real problem is that within a few hundred iterations the program consumes more than 10 GB of memory, a nightmare when I use the routine in a Krylov subspace solver, because I only have 12 GB of memory on my machine.

ArturGuzik · ‎08-16-2009

I know. That was just a comment.

I guess that you're on Linux. I have no time to test it there, but on my Win x64 machine it uses at most 90 MB (I waited until 2,500 iterations had passed) and I don't see any leak.

Did you try replacing the SPREAD calls?

A.

styc · ‎08-16-2009

Actually I discovered the problem when calling MKL_DCSRSYMV from C. I also tried local and allocatable arrays. The same massive leaks.

Ying_H_Intel · ‎08-17-2009

Hi Styc,

What is the result if you use the sequential library? The command line would be something like

ifort test.f -o test -lmkl_intel_lp64 -Wl,--start-group -lmkl_sequential -lmkl_core -Wl,--end-group -lpthread

or, for the threaded library,

ifort test.f -o test -lmkl_intel_lp64 -Wl,--start-group -lmkl_intel_thread -lmkl_core -Wl,--end-group -liomp5 -lpthread

Alternatively, would you like to upgrade to the latest compiler version, which includes the latest MKL 10.2.1?
I just tried Intel Compiler 11.1.046 on a 4-core Xeon machine. The test program seems to run fine, with no memory leak.

<http://software.intel.com/en-us/articles/which-version-of-ipp--mkl--tbb-is-installed-with-intel-compiler-professional-edition/>

Best Regards,
Ying

styc · ‎08-17-2009

Well, I'm pretty reluctant to do the upgrade now because 1) it's tricky and 2) I don't have the time.

I tried several possible solutions I could think of and found four ways to make the test program work normally:

1) linking with -mkl=sequential
2) OMP_NUM_THREADS=1/2/3/4/5 (see, you need more than five threads to see it break)
3) MKL_DISABLE_FAST_MM=1
4) setting N <= 5242880 / OMP_NUM_THREADS

It seems that 1), 2) and 4) actually address the same thing: limiting the amount of memory MKL_DCSRSYMV requires, i.e. sizeof(double) * N * OMP_NUM_THREADS, to no more than 40 MB. Apparently MKL_DCSRSYMV will repeatedly allocate workspaces larger than 40 MB but won't bother to deallocate them unless explicitly instructed to by something like 3).

Ying_H_Intel · ‎08-18-2009

Hi Styc,

Good news: I am able to reproduce the problem with MKL 10.2 and MKL 10.2.1. The problem happens only when the allocated arrays are huge (if the problem size is small, for example N=1000, there is no such problem, right?).
The root cause is a defect in the MKL memory manager. I have escalated it to the MKL engineering team for a fix.

At present, the best way to avoid this problem is to set MKL_DISABLE_FAST_MM=1, as you described.
What is your typical problem size, N = 1000000?

Best Regards,
Ying

styc · ‎08-20-2009

Yes, typically around one million.

Ying_H_Intel · ‎08-25-2009

Hi Styc,

Thanks for letting me know. The reference number is DPD200084696; we will notify you when the fixed version is released (possibly around October).

Thanks
Ying