WRF 3.5 poor performance run on PHI

IT_I_
Beginner

Hello,

Following the article http://software.intel.com/en-us/articles/how-to-get-wrf-running-on-the-intelr-xeon-phitm-coprocessor I compiled both netcdf 3.6.2 and WRF 3.5 and then ran the latter. Both compilations passed without problems, but running WRF against the CONUS_12km case terminates due to signal 9 after ca. 3 minutes. Running another WRF tutorial case, Katrina (180 threads with affinity 60C, 3T), gives a worse runtime than running the same case on the host CPU only, Xeon CPU E5-2620@2GHz (12 threads): the difference is 16 seconds. The runtime of our own real case is not acceptable at all: it takes hours to calculate, compared to 1h20m with 12 CPU threads. Does anyone have any idea what's wrong, or any tip on what to check?

Our configuration is:

- Scientific Linux 6.3 (2.6.32-279.el6.x86_64)

- MPSS gold update 3-2.1.6720-15

- Intel Cluster Studio XE 2013.5.192

If any more information is needed please let me know.

Regards

8 Replies
Sumedh_N_Intel
Employee

Hi, 

Could you please try the following changes for the CONUS12 case: 

1) Add the following in the namelist.input

use_baseparam_fr_nml                = .t.,

2) Copy over all the contents of CONUS12_rundir to the run directory and execute from the run directory. 

Also, could you give us more information about your coprocessor: number of cores, memory, frequency, etc.?

-Sumedh

IT_I_
Beginner

Hi Sumedh,

Thanks for the reply. I tried adding the mentioned parameter to the dynamics section of namelist.input, but the run still terminates due to signal 9 (see namelist.input attached). Copying the files to the WRF run directory didn't solve the problem either.

We have an Intel Xeon Phi with 60 cores, 1GHz, 8GB RAM (GDDR5-Elpida); see the attached micinfo.txt file for the full micinfo output.

Best regards,

Leszek

IT_I_
Beginner

Today, while I was running another series of CONUS_12km tests, mic0 hung. After resetting and rebooting it, WRF finished the calculations successfully. I repeated the run 3 more times and the problem really seems to have magically fixed itself. I don't know what happened, because nothing looked broken before the reboot. What concerns me now is the performance: CONUS_12km takes 5m31s to finish on the Phi (60C,3T, 180 OMP threads). I can't compare it to the native CPU runtime right now because I get a segfault there.

But for the Katrina tutorial case and for our own case, performance on the Phi is worse than a native host CPU run. Is 5m31s for CONUS_12km on the Phi a satisfactory result?

Regards

Indraneil_G_Intel

Hi,

On the CPU, how are you running the conus12 workload?

--Indraneil

Indraneil_G_Intel

Can you please try ulimit -s unlimited?
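
For reference, a minimal sketch of how the stack limit could be checked and raised at the top of a run script before wrf.exe is launched. The option is typed with a plain ASCII hyphen. The function name raise_stack_limit and the OMP_STACKSIZE value of 512M are illustrative assumptions, not something taken from this thread; OMP_STACKSIZE is the standard OpenMP variable for the stack of worker threads, which ulimit -s does not control.

```shell
#!/bin/bash
# Hypothetical helper: raise the stack limit for this shell and its children.
raise_stack_limit() {
    echo "stack limit before: $(ulimit -s)"
    # Raising the limit can fail if the hard limit is lower; report it
    # instead of aborting the script.
    if ! ulimit -s unlimited 2>/dev/null; then
        echo "could not raise stack limit (hard limit: $(ulimit -H -s))" >&2
    fi
    echo "stack limit after: $(ulimit -s)"
    # Assumed value: per-thread stack size for OpenMP worker threads.
    # An already exported OMP_STACKSIZE is left untouched.
    export OMP_STACKSIZE=${OMP_STACKSIZE:-512M}
}
```

Call raise_stack_limit directly in the run script (not inside a subshell), before the line that starts wrf.exe, so that the new limit is inherited by the executable.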

IT_I_
Beginner

Hi Indraneil,

I have another WRF 3.5 instance compiled against netcdf-3.6.2 for x86_64. The CONUS_12km files are the same as for the MIC. The one difference is the wrf.sh script:

#!/bin/bash

source /opt/intel/impi/4.1.1/intel64/bin/mpivars.sh
export LD_LIBRARY_PATH=/opt/intel/composer_xe_2013.5.192/compiler/lib/intel64/:/home/it/netcdf-3.6.2-intel64/lib
ulimit -s unlimited
export WRF_NUM_TILES_X=3
export WRF_NUM_TILES_Y=60
export OMP_NUM_THREADS=12

time ./wrf.exe

When I run with only 1 OMP thread it finishes correctly, but with more threads I get a segfault in the relax_bdy_scalar() subroutine in module_bc_em.F. But this problem isn't as important as the Phi performance. We don't know whether buying more MIC cards is worthwhile, because with no performance gain in WRF they're rather useless toys for us. Maybe there is some way to optimize the runtime, which is why it's important to use the same tutorial cases for comparison.

The Katrina tutorial case takes ~40s to calculate with 11 OMP threads (2 x Xeon E5-2620@2GHz). The same case on the Phi (180 threads) takes ~1m3s. Why is it more than 50% slower? Could you please check it at your site and post the runtime together with the run script, as for CONUS_12km? Thanks in advance.
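
As a quick check of the Katrina numbers above, taking ~40s on the host and ~1m3s (63s) on the Phi, the slowdown can be worked out with integer shell arithmetic. The function name slowdown_pct is a hypothetical helper, not part of WRF.

```shell
#!/bin/bash
# Hypothetical helper: extra runtime of "slow" relative to "fast", in percent.
# Works in tenths of a percent so integer arithmetic keeps one decimal place.
# Assumes slow >= fast and both are whole seconds.
slowdown_pct() {
    fast=$1   # reference runtime in seconds (host CPU)
    slow=$2   # runtime to compare in seconds (Phi)
    tenths=$(( (slow - fast) * 1000 / fast ))
    echo "$(( tenths / 10 )).$(( tenths % 10 ))"
}

slowdown_pct 40 63   # prints 57.5
```

So with these rounded timings the Phi run takes about 57.5% longer than the host run.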

Regards

marek_kaletka
Beginner

If you are using an NFS mount over the emulated Ethernet as the working directory for your benchmarks, a significant amount of time is spent reading the wrfrst files and writing the wrfout files. It's quite slow (around 16MB/s in my case).
In my CONUS12 runs, calculation time, excluding I/O time, is around 2 times better than with 2 x E5-2650 (32 Sandy Bridge cores).
Please take a look at your rsl.out.0000 file.
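
To separate calculation time from I/O time, the timing lines in rsl.out.0000 can be summed. This is only a sketch: it assumes lines of the form "Timing for main: ... N elapsed seconds" and "Timing for Writing ... N elapsed seconds", so check the wording in your own file first. The function name summarize_rsl_timing is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: sum per-step compute time and output-write time
# from a WRF rsl.out file. Assumes each timing line ends with
# "<seconds> elapsed seconds", so the value is the third field from the end.
summarize_rsl_timing() {
    awk '
        /Timing for main/    { compute += $(NF-2); steps++ }
        /Timing for Writing/ { written += $(NF-2) }
        END {
            printf "steps=%d compute_s=%.2f write_s=%.2f\n", steps, compute, written
        }
    ' "$1"
}
```

Running summarize_rsl_timing rsl.out.0000 on both the host and the Phi run gives a rough split between time steps and file writes, which makes it easier to see whether the NFS mount or the computation dominates.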

Indraneil_G_Intel

Hello,

Why are you using just 12 threads?
export WRF_NUM_TILES_X=3
export WRF_NUM_TILES_Y=60
export OMP_NUM_THREADS=12

If you are running on SNB, you don't need to set WRF_NUM_TILES_X and WRF_NUM_TILES_Y; you only need to set WRF_NUM_TILES=32 (4 tiles per thread).

If you are running on MIC, the optimum performance is at 180 threads.
Please follow http://software.intel.com/en-us/articles/how-to-get-wrf-running-on-the-intelr-xeon-phitm-coprocessor for instructions.
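
A minimal sketch of the thread and tile settings discussed in this thread, side by side for the two platforms. The values (60C,3T with 180 threads and 3 x 60 tiles on the Phi, WRF_NUM_TILES=32 on the host) come from the posts above. The helper names, the use of KMP_PLACE_THREADS to express the "60C,3T" placement, and the mapping of tiles in X to threads per core are assumptions, so the linked article remains the reference for the actual settings.

```shell
#!/bin/bash
# Hypothetical helper: settings for a native run on the coprocessor.
set_phi_env() {
    cores=$1              # e.g. 60
    threads_per_core=$2   # e.g. 3
    # Assumed variable for the "60C,3T" placement mentioned in the thread.
    export KMP_PLACE_THREADS="${cores}C,${threads_per_core}T"
    export OMP_NUM_THREADS=$(( cores * threads_per_core ))
    # Assumed mapping: one tile per thread, laid out as 3 x 60.
    export WRF_NUM_TILES_X=$threads_per_core
    export WRF_NUM_TILES_Y=$cores
    unset WRF_NUM_TILES
}

# Hypothetical helper: settings for a host (SNB) run, total tile count only.
set_host_env() {
    export OMP_NUM_THREADS=$1
    export WRF_NUM_TILES=$2
    unset WRF_NUM_TILES_X WRF_NUM_TILES_Y KMP_PLACE_THREADS
}

set_phi_env 60 3
echo "Phi: $OMP_NUM_THREADS threads, $(( WRF_NUM_TILES_X * WRF_NUM_TILES_Y )) tiles"
```

With 60 cores and 3 threads per core this prints "Phi: 180 threads, 180 tiles". For comparison, the host script posted earlier keeps the 3 x 60 tile layout with only 12 threads, which works out to 15 tiles per thread.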
