WRF 3.5: poor performance when run on Xeon Phi

IT_I_ · ‎08-09-2013

Hello,

Following the article http://software.intel.com/en-us/articles/how-to-get-wrf-running-on-the-intelr-xeon-phitm-coprocessor, I compiled both netcdf 3.6.2 and WRF 3.5 and then ran WRF. Both compilations passed without problems, but running WRF on the CONUS_12km case terminates with signal 9 after about 3 minutes. Another WRF tutorial case, Katrina (180 threads with affinity 60C,3T), runs slower on the Phi than on the host CPU alone, a Xeon E5-2620 @ 2GHz with 12 threads; the difference is 16 seconds. The runtime of one of our own real cases is not acceptable at all: it takes hours on the Phi compared to 1h20m with 12 CPU threads. Does anyone have an idea what is wrong, or any tips on what to check?

Our configuration is :

- Scientific Linux 6.3 (2.6.32-279.el6.x86_64)

- MPSS gold update 3-2.1.6720-15

- Intel Cluster Studio XE 2013.5.192

If any more information is needed please let me know.

Regards

Sumedh_N_Intel · ‎08-09-2013

Hi,

Could you please try the following changes for the CONUS12 case:

1) Add the following in the namelist.input:

use_baseparam_fr_nml = .t.,

2) Copy over all the contents of CONUS12_rundir to the run directory and execute from the run directory.

Also, could you give us more information about your coprocessor: number of cores, memory, frequency, etc.?

-Sumedh

IT_I_ · ‎08-12-2013

Hi Sumedh,

Thanks for the reply. I tried adding the mentioned parameter to the dynamics section of namelist.input, but the run still terminates with signal 9 (see the attached namelist.input). Copying the files to the WRF run directory didn't solve the problem either.
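To rule out a misplaced setting, one can check that `use_baseparam_fr_nml` really sits inside the `&dynamics` group of namelist.input. The helper below is only a sketch: the function name `has_baseparam_in_dynamics` is hypothetical, and it assumes the usual Fortran namelist layout in which a group starts with `&name` and ends with a line beginning with `/`.

```shell
# Hypothetical helper: succeed if use_baseparam_fr_nml is set inside &dynamics.
# Assumes each namelist group starts with "&name" and ends with a "/" line.
has_baseparam_in_dynamics() {
    awk '
        /^[[:space:]]*&dynamics/ { in_dyn = 1; next }   # entering &dynamics
        /^[[:space:]]*\//        { in_dyn = 0 }         # end of any group
        in_dyn && /use_baseparam_fr_nml[[:space:]]*=/ { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$1"
}
```

Usage: `has_baseparam_in_dynamics namelist.input && echo "parameter found in &dynamics"`.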

We have an Intel Phi with 60 cores, 1GHz, 8GB RAM (GDDR5, Elpida); see the attached micinfo.txt file for the full micinfo output.

Best regards,

Leszek

IT_I_ · ‎08-12-2013

Today, while I was running another series of CONUS_12km tests, mic0 hung. After I reset and rebooted it, WRF finished the calculations successfully. I repeated the run 3 more times and it really seems to have magically fixed itself. I don't know what happened, because nothing had looked broken before the reboot. What concerns me now is the performance: CONUS_12km takes 5m31s to finish on the Phi (60C,3T, 180 OMP threads). I can't compare it to the native CPU runtime right now because I get a segfault there.

But for the Katrina tutorial case and for our own case, the Phi performs worse than a native host CPU run. Is 5m31s for CONUS_12km on the Phi satisfactory?
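For reference, the "60C,3T, 180 OMP threads" layout mentioned above can be derived rather than hard-coded. This is a sketch under assumptions: `KMP_PLACE_THREADS` and `KMP_AFFINITY` are Intel OpenMP runtime variables, but the `balanced` affinity value is an illustrative choice, not something stated in this thread, and the variable names `CORES` and `THREADS_PER_CORE` are made up for the example.

```shell
# Illustrative values taken from the post: 60 cores, 3 threads per core.
CORES=60
THREADS_PER_CORE=3

# Total OpenMP threads = cores x threads per core.
export OMP_NUM_THREADS=$((CORES * THREADS_PER_CORE))

# Intel OpenMP runtime placement string, e.g. "60C,3T".
export KMP_PLACE_THREADS="${CORES}C,${THREADS_PER_CORE}T"

# Assumption: balanced affinity; the thread does not say which type was used.
export KMP_AFFINITY=balanced

echo "threads=$OMP_NUM_THREADS placement=$KMP_PLACE_THREADS"
```

Deriving the thread count this way keeps `OMP_NUM_THREADS` consistent with the placement string when experimenting with other layouts such as 2 or 4 threads per core.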

Regards

Indraneil_G_Intel · ‎08-12-2013

Hi,

On the CPU, how are you running the conus12 workload?

--Indraneil

Indraneil_G_Intel · ‎08-12-2013

Can you please try ulimit -s unlimited?
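A short sketch of what this suggestion looks like in a run script. `ulimit -s` governs the stack of the main thread; the stacks of OpenMP worker threads are sized separately through the standard `OMP_STACKSIZE` variable. The `64M` value is an illustrative assumption, not a tuned recommendation from this thread.

```shell
# Raise the main-thread stack limit; report instead of failing if the
# hard limit does not allow it.
if ! ulimit -s unlimited 2>/dev/null; then
    echo "could not raise stack limit; current: $(ulimit -s)" >&2
fi

# Worker-thread stacks are controlled by the OpenMP runtime, not by ulimit.
# Assumption: 64M is only an example value.
export OMP_STACKSIZE=64M

echo "stack limit: $(ulimit -s), OMP_STACKSIZE: $OMP_STACKSIZE"
```

Both settings have to be in effect in the shell that launches wrf.exe, so they belong in the run script itself rather than in an interactive session.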

IT_I_ · ‎08-12-2013

Hi Indraneil,

I have another WRF 3.5 instance compiled against netcdf-3.6.2 for x86_64. The CONUS_12km files are the same as for the MIC. The one difference is the wrf.sh script:

#!/bin/bash

source /opt/intel/impi/4.1.1/intel64/bin/mpivars.sh
export LD_LIBRARY_PATH=/opt/intel/composer_xe_2013.5.192/compiler/lib/intel64/:/home/it/netcdf-3.6.2-intel64/lib
ulimit -s unlimited
export WRF_NUM_TILES_X=3
export WRF_NUM_TILES_Y=60
export OMP_NUM_THREADS=12

time ./wrf.exe

With only 1 OMP thread the run finishes correctly, but with more threads I get a segfault in the relax_bdy_scalar() subroutine in module_bc_em.F. This problem isn't as important as the Phi performance, though. We don't know whether buying more MIC cards is worthwhile, because without a performance gain in WRF they are rather useless toys for us. Maybe there is some way to optimize the runtime; that's why it's important to use the same tutorial cases for comparison.

The Katrina tutorial case takes ~40s to calculate with 11 OMP threads (2 x Xeon E5-2620 @ 2GHz). The same case on the Phi (180 threads) takes ~1m3s. Why is it more than 50% slower? Could you please check it at your site and post the runtime together with the run script, as for CONUS_12km? Thanks in advance.
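The size of the gap can be worked out from the two quoted runtimes. A minimal sketch, where `slowdown_pct` is a hypothetical helper name and both arguments are wall-clock times in seconds:

```shell
# Hypothetical helper: percent by which runtime A exceeds runtime B (seconds).
slowdown_pct() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a - b) * 100 / b }'
}

# 1m3s = 63 s on the Phi versus ~40 s on the host.
slowdown_pct 63 40   # prints 57.5
```

With the approximate figures from the post, the Phi run takes about 57.5% longer than the host run.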

Regards

marek_kaletka · ‎08-30-2013

If you are using an NFS mount over the emulated Ethernet as the working directory for your benchmarks, a significant amount of time is spent reading the wrfrst files and writing the wrfout files. It's quite slow (around 16MB/s in my case).
In my CONUS12 runs, the calculation time, excluding I/O, is around 2 times better than on 2 x E5-2650 (32 Sandy Bridge cores).
Please take a look at your rsl.out.0000 file.
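One way to separate compute time from I/O time is to sum the timing lines in rsl.out.0000. The sketch below assumes the file contains lines of the form `Timing for main: ... <seconds> elapsed seconds` for time steps and `Timing for Writing ... <seconds> elapsed seconds` for output; the exact wording may differ between WRF builds, and `summarize_rsl_timing` is a hypothetical name.

```shell
# Hypothetical helper: split "Timing for" lines into time-step compute and
# everything else (mostly I/O). Assumes the seconds value is the field
# immediately before the word "elapsed".
summarize_rsl_timing() {
    awk '
        /Timing for/ {
            secs = 0
            for (i = 2; i <= NF; i++) if ($i == "elapsed") secs = $(i - 1)
            if ($0 ~ /Timing for main/) compute += secs; else other += secs
        }
        END { printf "compute=%.2f other=%.2f\n", compute, other }
    ' "$1"
}
```

Usage: `summarize_rsl_timing rsl.out.0000`. If the `other` total dominates, the working directory (for example an NFS mount) is a more likely bottleneck than the coprocessor itself.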

Indraneil_G_Intel · ‎09-18-2013

Hello,

Why are you using just 12 threads?
export WRF_NUM_TILES_X=3
export WRF_NUM_TILES_Y=60
export OMP_NUM_THREADS=12

If you are running on SNB, you don't need to set WRF_NUM_TILES_X and WRF_NUM_TILES_Y; you only need to set WRF_NUM_TILES=32 (4 tiles per thread).

If you are running on MIC, the optimum performance is at 180 threads.
Please follow http://software.intel.com/en-us/articles/how-to-get-wrf-running-on-the-intelr-xeon-phitm-coprocessor for instructions.
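The two cases described above could be kept apart in one run script, so that host settings never leak into a coprocessor run or the other way round. This is a sketch only: `configure_wrf_tiles` is a hypothetical helper, and pairing WRF_NUM_TILES_X=3 / WRF_NUM_TILES_Y=60 with the 180-thread MIC run is an assumption based on the values quoted earlier in the thread.

```shell
# Hypothetical helper: set the tiling variables for one target.
configure_wrf_tiles() {
    case "$1" in
        snb)
            # Host (Sandy Bridge): only the total tile count is set.
            unset WRF_NUM_TILES_X WRF_NUM_TILES_Y
            export WRF_NUM_TILES=32
            ;;
        mic)
            # Coprocessor: assumed 3 x 60 tiling with 180 threads.
            unset WRF_NUM_TILES
            export WRF_NUM_TILES_X=3 WRF_NUM_TILES_Y=60
            export OMP_NUM_THREADS=180
            ;;
        *)
            echo "usage: configure_wrf_tiles snb|mic" >&2
            return 1
            ;;
    esac
}
```

Usage: call `configure_wrf_tiles mic` or `configure_wrf_tiles snb` before `time ./wrf.exe`. The host thread count is left to the caller, since the thread does not settle on one value.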