Does aps-report have a limit?

Matt_Thompson · 03-16-2018

All,

I'm trying to figure out how to use APS to help benchmark/characterize a code I help maintain. However, once the run gets large enough, APS breaks down. More precisely, aps-report breaks down.

First, the code I'm working on is GEOS, a climate model, which we can run at various resolutions from C24 (~400 km resolution, mainly for debugging/regression) to C720 (~12 km) and beyond. These runs use anywhere from 96 processors up to 1500+ for C720. For the lower resolutions, C24 through C180, aps-report seems perfectly happy. C180 runs on 216 processors, the others on 96.

But, when we get to C360 (864 processors) and C720 (1536), aps-report starts failing. APS is definitely making a stat-XXX._bin file for each process, but when aps-report tries to process them, it takes a long time and eventually just fails.

To wit, the script I wrote to make a report essentially runs:

aps-report --all $APS_OUTPUT_DIR > aps_report.txt 2> /dev/null

where the /dev/null bit is to keep that stderr progress percentage print from writing a bajillion lines in the output file.
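Since discarding stderr also hides any error message aps-report prints before it dies, one alternative is to send stderr to its own log file and only look at the tail of it on failure. The following is a minimal sketch, not anything APS provides: the helper name `run_with_stderr_log` and the log file name `aps_report.err` are made up for illustration.

```shell
# Hypothetical helper (not part of APS): run a command, write its stdout
# to one file and its stderr to another, and show the last few stderr
# lines only if the command exits with a non-zero status.
run_with_stderr_log() {
    out_file=$1
    err_file=$2
    shift 2
    "$@" > "$out_file" 2> "$err_file"
    cmd_status=$?
    if [ "$cmd_status" -ne 0 ]; then
        echo "command failed with exit status $cmd_status; last stderr lines:" >&2
        tail -n 5 "$err_file" >&2
    fi
    return "$cmd_status"
}

# Usage with the command from this post (assumes aps-report is on PATH
# and APS_OUTPUT_DIR is set):
# run_with_stderr_log aps_report.txt aps_report.err aps-report --all "$APS_OUTPUT_DIR"
```

The progress lines still stay out of the report, but whatever was printed just before the crash is kept in the log.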

Now with 96 processes in the run, aps-report takes about 8 seconds. With 216, it takes about 19. With 864, aps-report runs for 200 seconds and then just crashes out. The C720 job obviously fails as well.

So, does anyone know if I'm hitting some limit? Am I filling up a TMPDIR silently? I can't seem to find any debugging flags for aps or aps-report that would report more information.
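To test the TMPDIR theory without any help from aps itself, free space can be sampled before and after the report step, and the number of files in the result directory can be counted. This is only a rough sketch using standard tools (`df`, `find`, `awk`); the helper names `avail_kib` and `count_files` are made up, and the fallback to `/tmp` is an assumption about where temporary files would land.

```shell
# Print the available space, in KiB, on the filesystem that holds a directory.
# df -P forces the portable one-line-per-filesystem output format.
avail_kib() {
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

# Count regular files under a directory, e.g. the per-process stat files.
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Usage around the report step (the /tmp fallback is an assumption):
# echo "TMPDIR free before: $(avail_kib "${TMPDIR:-/tmp}") KiB"
# echo "files in result dir: $(count_files "$APS_OUTPUT_DIR")"
# aps-report --all "$APS_OUTPUT_DIR" > aps_report.txt 2> aps_report.err
# echo "TMPDIR free after:  $(avail_kib "${TMPDIR:-/tmp}") KiB"
```

A large drop in free space between the two samples would point at temporary files; no change would suggest the limit is somewhere else.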

I also tried:

(4774) $ /usr/bin/time -p aps --report=aps_result_20180316/ 
ERROR: Cannot parse directory: aps_result_20180316//hwmetrics
aps Error: Failed to generate the report.
Command exited with non-zero status 2
real 182.68
user 141.68
sys 5.23

which is a clue perhaps? But I'm not calling aps any differently in one case compared to the other.

Any ideas?
Matt

Dmitry_P_Intel1 · 03-16-2018

Hello Matt,

Could you please send a mail to Dmitry.Prohorov@intel.com for further communication on the issue?

Thanks & Regards, Dmitry

Dixon__David · 10-17-2019

Was this issue resolved? I am hitting a similar issue, i.e. a failure to parse the hwmetrics directory. It worked the first time I tried using aps, and now it won't.