Re: Problem with ifort compiler on BEEGFS (catastrophic error: Unable to read ...)

LESOCC · ‎07-19-2022

Compiler:

ifort (IFORT) 2021.6.0 20220226
Copyright (C) 1985-2022 Intel Corporation. All rights reserved.

Code:

Four subroutines distributed as follows:

sip_12.f contains subroutines 1 and 2
sip_34.f contains subroutines 3 and 4
sip_1234.f contains all subroutines (1-4)

using the include files comex0.h and comex1.h and the module file alloc_module.mod

Problem on Parallel-File-System: BEEGFS

run ./compile.sh to compile all three files

sip_12.f --> no problem
sip_34.f --> no problem
sip_1234.f --> error:

sip_1234.f(1121): catastrophic error: Unable to read 'sip_1234.f'
compilation aborted for sip_1234.f (code 1)

Note that the same script runs without any problems on
a non-parallel file system, such as the one used for /tmp

Question: Why is the compiler unable to compile the file 'sip_1234.f' if the
same works on a non-parallel file system?

Ron_Green · ‎07-20-2022

Parallel file systems are tricky. There's a key Unix file function, flock(), used to lock a file for reading or writing by a process or thread. For a local file that is a simple operation: grab a mutex and lock the file. For a parallel file system you have to coordinate with all cluster nodes or remote servers using that PFS. That's a lot of coordination, which leads to poor performance for the PFS. So many times your admins will set up the PFS to use a cache for file metadata. This helps improve performance, but file locking may not be supported due to its inherent slowness. The cache can and does catch up, but if you do a flock() call it may fail since the file system really cannot perform that operation.

I've been in your situation. The local enterprise uses PFS and it has a lot of storage compared to your home dir quota or sometimes even NFS project space. Or perhaps you want to keep the code co-located with the data you have on PFS. In any event, the takeaway is that PFS is designed for fast parallel reads and writes. You may notice that global file ops like file creation, deletion, or even ls are slower on PFS due to the synchronization needed for the file metadata. You can check with your admin whether flock() is enabled and caching disabled. I would guess caching is enabled for performance, and hence flock() is not guaranteed to work.
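Before asking the admin, a quick probe can show whether flock() works at all in a given directory. The following is only a minimal sketch: the function name flock_supported is made up for illustration, and a passing result does not prove that the compiler's read error is lock-related (nor that ifort calls flock() itself), only that a basic exclusive lock can be taken there.

```python
import errno
import fcntl
import os
import tempfile


def flock_supported(directory):
    """Return True if an exclusive, non-blocking flock() succeeds on a
    scratch file created inside `directory`, False if the call fails.

    Hypothetical helper for illustration; not part of any compiler tooling.
    """
    fd, path = tempfile.mkstemp(dir=directory, prefix="flock_probe_")
    try:
        try:
            # LOCK_NB makes the call fail immediately instead of blocking.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError as exc:
            name = errno.errorcode.get(exc.errno, str(exc.errno))
            print(f"flock() failed in {directory}: {name}")
            return False
        fcntl.flock(fd, fcntl.LOCK_UN)
        return True
    finally:
        # Always remove the probe file, whether or not the lock worked.
        os.close(fd)
        os.unlink(path)
```

Calling flock_supported() once on the BeeGFS source directory and once on /tmp, and comparing the two results, gives something concrete to bring to the admin.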

Takeaway: if you can, keep your code on NFS or project space and not PFS. Use PFS for its intended target, which is very large datasets used during runs. PFS sucks at small file access and any operations changing metadata.
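If the sources have to stay on the PFS, one possible workaround is to stage them into a local directory for the build, since the original post reports that the same script works on /tmp. This is a rough sketch under that assumption; build_in_local_scratch is a hypothetical helper, and copying the build products back to the PFS is left out.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_in_local_scratch(src_dir, command, scratch_root=None):
    """Copy `src_dir` into a fresh directory under `scratch_root` (default:
    the system temp dir), run `command` there, and return the build path.

    Hypothetical helper; `command` is whatever normally builds the code.
    """
    scratch = Path(tempfile.mkdtemp(prefix="build_", dir=scratch_root))
    work = scratch / "src"
    # copytree keeps permission bits, so an executable script stays executable.
    shutil.copytree(src_dir, work)
    # check=True raises CalledProcessError if the build command fails.
    subprocess.run(command, cwd=work, check=True)
    return work
```

For the case in this thread, the call would look something like build_in_local_scratch(source_dir_on_beegfs, ["./compile.sh"]), where source_dir_on_beegfs stands for wherever the sip_*.f files live.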