Memory continuously increases when I run CCSM with Intel MPI. Why?

I run CCSM with Intel MPI. Memory usage is small at startup, about 1 GB.
But it increases continuously: about 1 hour later it reaches 24 GB, the system hangs, and I must restart it.
I used:
l_mpi_pu_4.0.0.027
l_cproc_p_11.1.069
l_cprof_p_11.1.069
l_mkl_p_10.2.4.032.tar
I also use a QLogic InfiniBand switch and HCA adapters.
mpirun -nolocal -machinefile mpd.hosts -genv I_MPI_FABRICS tmi \
-np $NTASKS[1] $EXEROOT/all/$COMPONENTS[1] : \
-np $NTASKS[2] $EXEROOT/all/$COMPONENTS[2] : \
-np $NTASKS[3] $EXEROOT/all/$COMPONENTS[3] : \
-np $NTASKS[4] $EXEROOT/all/$COMPONENTS[4] : \
-np $NTASKS[5] $EXEROOT/all/$COMPONENTS[5]
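
To quantify the growth described above and see which component it comes from, one option is to sample resident memory on a compute node while the job runs. The following is a minimal sketch, assuming a Linux node where `ps` supports the `rss` and `comm` output fields. The function names `rss_total_kb` and `log_rss` are illustrative and are not part of CCSM or Intel MPI.

```sh
#!/bin/sh
# Print the summed resident set size (RSS, in kB) of all processes on this
# node whose command name equals $1. Prints 0 if no process matches.
rss_total_kb() {
    ps -e -o rss= -o comm= |
        awk -v name="$1" '$2 == name { sum += $1 } END { print sum + 0 }'
}

# Print a timestamped RSS total for command name $1,
# sampling every $2 seconds, $3 times in total.
log_rss() {
    name=$1
    interval=$2
    count=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "$(date +%H:%M:%S) $name $(rss_total_kb "$name") kB"
        i=$((i + 1))
        # Sleep between samples, but not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, `log_rss cam 60 60` would print one line per minute for an hour, where `cam` stands for the executable name of one CCSM component. Running it once per component shows whether a single executable or all of them grow.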

My full run script:

#!/bin/csh -f
#===============================================================================
# This is a CCSM batch job script for latecomer
#===============================================================================
## BATCH INFO
#$ -S /bin/csh -cwd
#$ -N ccsm_latecomer
#$ -pe climate 208
#$ -q climateq
#-----------------------------------------------------------------------
# Determine necessary environment variables
#-----------------------------------------------------------------------
cd /lustrefs/soa02/home/ccsm/model/ccsm3/roytest/test1
setenv MACH latecomer
source env_conf || echo "problem sourcing env_conf" && exit -1
source env_run || echo "problem sourcing env_run" && exit -1
source env_mach.latecomer || echo "problem sourcing env_mach.latecomer" && exit -1
## Warning: SCRATCH not defined in system environment. Set SCRATCH to be /lustrefs/soa02/home/ccsm/model/ccsm3/
#-----------------------------------------------------------------------
# Resolved task/thread counts
# This is provided as user information only
# These are csh comments, DO NOT UNCOMMENT
#-----------------------------------------------------------------------
### COMPONENTS = ( cpl csim clm pop cam )
### NTASKS_CPL=16 NTHRDS_CPL=1
### NTASKS_ICE=16 NTHRDS_ICE=1
### NTASKS_LND=16 NTHRDS_LND=1
### NTASKS_OCN=96 NTHRDS_OCN=1
### NTASKS_ATM=64 NTHRDS_ATM=1
#-----------------------------------------------------------------------
# Determine time-stamp/file-ID string
#-----------------------------------------------------------------------
setenv LID "`date +%y%m%d-%H%M%S`"
# -------------------------------------------------------------------------
# Run machine dependent module commands
# -------------------------------------------------------------------------
if (-f modules.$MACH) then
echo sourcing modules.$MACH
source modules.$MACH || exit 1
endif
# -------------------------------------------------------------------------
# Build the models
# -------------------------------------------------------------------------
./${CASE}.${MACH}.build || exit 1
# -------------------------------------------------------------------------
# Create processor count input files
# -------------------------------------------------------------------------
cp mpd.hosts $EXEROOT/all
cd $EXEROOT/all
@ PROC = 0 # counts total number of tasks
echo "0" >! mpirun.pgfile1;
foreach n (1 2 3 4 5)
set comp = $COMPONENTS[$n]
set model = $MODELS[$n]
set nthrd = $NTHRDS[$n]
set ntask = $NTASKS[$n]
@ M = 0
while ( $M < $ntask )
# @ M++
# @ PROC++
if (($n == 1) && ($M == 0)) then
echo "skipping first model"
else
echo "1 $EXEROOT/all/$comp" >>! mpirun.pgfile1;
endif
@ M++
@ PROC++
end
ln -s $EXEROOT/$model/$comp $EXEROOT/all/. # link binaries into all dir
end
# -------------------------------------------------------------------------
# Run the model
# -------------------------------------------------------------------------
env | egrep '(MP_|LOADL|XLS|FPE|DSM|OMP|MPC)' # document env vars
cd $EXEROOT/all
echo "`date` -- CSM EXECUTION BEGINS HERE"
#mpdboot -n 50 -r rsh -f /lustrefs/soa02/home/ccsm/model/ccsm3/roytest/test1/mpd.hosts
#mpiexec -nolocal -genv I_MPI_FABRICS tmi -genv I_MPI_DEBUG 5 \
mpirun -nolocal -machinefile mpd.hosts -genv I_MPI_FABRICS tmi -genv TMI_DEBUG 1 \
#mpirun -machinefile mpd.hosts -genv -I_MPI_TMI_PROVIDER psm \
#mpirun -machinefile mpd.hosts \
-np $NTASKS[1] $EXEROOT/all/$COMPONENTS[1] : \
-np $NTASKS[2] $EXEROOT/all/$COMPONENTS[2] : \
-np $NTASKS[3] $EXEROOT/all/$COMPONENTS[3] : \
-np $NTASKS[4] $EXEROOT/all/$COMPONENTS[4] : \
-np $NTASKS[5] $EXEROOT/all/$COMPONENTS[5]
wait
#mpdallexit
echo "`date` -- CSM EXECUTION HAS FINISHED"
# -------------------------------------------------------------------------
# Save model output stdout and stderr
# -------------------------------------------------------------------------
cd $EXEROOT/cpl
set CplLogFile = `ls -1t cpl.log* | head -1`
grep 'end of main program' $CplLogFile || echo "Model did not complete - see $CplLogFile" && exit -1
cd $EXEROOT
gzip */*.$LID
if ($LOGDIR != "") then
if (! -d $LOGDIR/bld) mkdir -p $LOGDIR/bld || echo " problem in creating $LOGDIR/bld" && exit -1
cp -p */*buildexe*$LID.* $LOGDIR/bld || echo "Error in copy of logs " && exit -1
cp -p */*log*$LID.* $LOGDIR || echo "Error in copy of logs " && exit -1
endif
# -------------------------------------------------------------------------
# Perform short term archiving of output
# -------------------------------------------------------------------------
if ($DOUT_S == 'TRUE') then
echo "Archiving ccsm output to $DOUT_S_ROOT"
echo "In $CASEROOT directory using the short term archiving script ccsm_s_archive.csh"
cd $CASEROOT; $UTILROOT/Tools/ccsm_s_archive.csh
endif
# -------------------------------------------------------------------------
# Submit longer term archiver if appropriate
# -------------------------------------------------------------------------
if ($DOUT_L_MS == 'TRUE' && $DOUT_S == 'TRUE') then
echo "Long term archiving ccsm output using the script $CASE.$MACH.l_archive"
qsub $CASE.$MACH.l_archive
endif
# -------------------------------------------------------------------------
# Resubmit another run script
# -------------------------------------------------------------------------
set echo
cd $CASEROOT
source env_run
if ($RESUBMIT > 0) then
echo RESUBMIT is $RESUBMIT
@ RESUBMIT = $RESUBMIT - 1
echo RESUBMIT is $RESUBMIT
sed '1,/^ *setenv *CONTINUE_RUN .*/s//setenv CONTINUE_RUN TRUE/' \
env_run > env_run.tmp; mv env_run.tmp env_run
sed "s/^ *setenv *RESUBMIT .*/setenv RESUBMIT $RESUBMIT/;" \
env_run > env_run.tmp; mv env_run.tmp env_run
qsub $CASE.$MACH.run
endif
endif

Hi Zhang,

Just to narrow down the problem, could you please try another provider?

I'll contact the author of the tmi provider and let him know about a potential memory leak.

Regards!
Dmitry

Thank you!

Do you mean the problem is in the provider? The provider is tmi, so tmi caused this error?
If I use rdma or shm:tmi, I get an error at startup.

Zhang,

I don't know the real cause of that error. It might be that the application itself consumes the memory; who knows.

What error do you get? Could you provide details? Your command line and the output with I_MPI_DEBUG set to 9 could help us understand the reason for these failures.

Regards!
Dmitry
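
The debug output requested above could be collected by adding `-genv I_MPI_DEBUG 9` to the launch line of the csh run script and redirecting the output to a file. This is only a sketch: the log file name `mpi_debug.log` is an arbitrary choice, and the rest of the command is copied from the original post.

```csh
# csh syntax, as in the run script; ">&" sends stdout and stderr to the file
mpirun -nolocal -machinefile mpd.hosts \
    -genv I_MPI_FABRICS tmi -genv I_MPI_DEBUG 9 \
    -np $NTASKS[1] $EXEROOT/all/$COMPONENTS[1] : \
    -np $NTASKS[2] $EXEROOT/all/$COMPONENTS[2] : \
    -np $NTASKS[3] $EXEROOT/all/$COMPONENTS[3] : \
    -np $NTASKS[4] $EXEROOT/all/$COMPONENTS[4] : \
    -np $NTASKS[5] $EXEROOT/all/$COMPONENTS[5] >& mpi_debug.log
```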

Hi Zhang,

I've got an answer from the developer of the TMI module: "A memory leak was recently discovered in the tmi module with non-contiguous messages. It was fixed."
Unfortunately, I don't know when the updated library will be available. If you need the new library, you need to create a tracker at http://premier.intel.com.

BTW: It would be better to use "-env I_MPI_FABRICS shm:tmi"; shared memory will be used in that case as well.

Regards!
Dmitry
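
Applied to the launch line from the original post, the suggestion above might look like the following sketch. It uses `-genv` as in the original command, on the assumption that the setting should apply to all five executables of the colon-separated launch, whereas `-env` applies only to the argument set it precedes.

```csh
# csh syntax, as in the run script; only the fabric setting differs
# from the original command (tmi -> shm:tmi)
mpirun -nolocal -machinefile mpd.hosts -genv I_MPI_FABRICS shm:tmi \
    -np $NTASKS[1] $EXEROOT/all/$COMPONENTS[1] : \
    -np $NTASKS[2] $EXEROOT/all/$COMPONENTS[2] : \
    -np $NTASKS[3] $EXEROOT/all/$COMPONENTS[3] : \
    -np $NTASKS[4] $EXEROOT/all/$COMPONENTS[4] : \
    -np $NTASKS[5] $EXEROOT/all/$COMPONENTS[5]
```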

Thank you!

I want to know where I can get the updated library, even if it is a beta version.

Can you get the updated library from the developer of the TMI module?

Zhang,

Unfortunately, we cannot provide it on the ISN forum. You need to submit a tracker via premier.intel.com, and we will be able to attach the new library to that tracker.
The library still needs to be built. The issue has just been fixed and the new library is not ready yet.

Regards!
Dmitry

I can't log in to premier.intel.com. Why? How can I log in?


Welcome to Intel Premier Support.

We were unable to authenticate your access to the Intel Premier Support web site. Please check that your login ID and password were entered correctly and that the URL used was "https://premier.intel.com".

If you have forgotten your login or password, the fastest method to gain access to the system is to use the automated login and password links Forgot your password or Forgot your Login ID on the login page.

If you continue to have problems, please contact Intel Customer Support via email at quadsupport@mailbox.intel.com.


Zhang,

If you bought an Intel product, you can register it at http://registrationcenter.intel.com
and you'll get a login ID and password. After that you can submit a request at Premier.

Have you registered your product?

Regards!
Dmitry


Yes, I can log in now. Thank you!