Unable to read binary file: forrtl: severe (67): input statement requires too much data

dhilonpatel · ‎09-08-2009

Hello,

I am a beginner with clusters. We have a 24-node cluster of Intel Xeon x86 processors running Linux RHEL 5.2, which uses InfiniBand for applications and an Ethernet port for management. It is installed with mvapich-1.1_intel and the full Intel compiler package.

I have CPMD, a molecular dynamics package, installed on the 24-node cluster. When a job is restarted for its second step, it reads the binary file RESTART.1. When I submit my job from /home/username/cwd (the current working directory), it reads the binary file successfully and restarts the job for the next step. But when I submit my job from /home/username/data/subdirectory, the first step, which does not use the binary file, finishes successfully, and the second step fails while reading the binary file on restart with this error: forrtl: severe (67): input statement requires too much data, unit 1, file /student/username/cpmd_amit_test/linear-BG/20/opt/./RESTART.1
I would like to mention that data is a separate link for each user which represents /student/username, mounted on all nodes from common storage.
I don't understand why my job fails when it uses the restart file generated during the run in /home/username/data/../cwd, but not when I generate the same restart file and run my job from /home/username. One more thing: the same binary file generated in /home/username/cwd and in /home/username/data/../cwd shows differences when I compare the two with the diff command. What could be the reason for this?

I compiled my application with the following compilers and libraries for parallel processing.

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SRC = .
DEST = .
BIN = .
FFLAGS = -c -openmp -w90 -w95 -O2 -unroll -ip -cm -xT -convert big_endian
LFLAGS = -L/opt/intel/mkl/10.1.0.015/lib/em64t -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lguide
CFLAGS = -c -openmp -O2 -Wall
CPP = /lib/cpp -P -C -traditional
CPPFLAGS = -D__Linux -D__PGI -DFFT_DEFAULT -DPOINTER8 -DINTEL_MKL \
-DPARALLEL -DMYRINET -DLINUX_IFC
NOOPT_FLAG =
CC = /opt/intel/impi/3.2.0.011/bin64/mpicc -cc=icc
FC = /opt/intel/impi/3.2.0.011/bin64/mpiifort -fc=ifort
LD = /opt/intel/impi/3.2.0.011/bin64/mpiifort -fc=ifort -openmp
AR = ar
-------------------------------------------------------------------------------------------------------------------------------------------------------------------

I would appreciate any help in solving this issue, and I am happy to provide any further information that is needed.
Thanks in advance.
Dhilon

Dmitry_K_Intel2 · ‎09-09-2009

Hi Dhilon,
Thanks for the question.
First of all, I'd like a bit more information about your MPI library. I know nothing about mvapich-1.1_intel; where is it from? If you use the Intel MPI Library, please tell me the version, which can be found in the mpisupport.txt file in your installation directory.
Second: don't use mpicc -cc=icc; just use mpiicc, the wrapper for the Intel C compiler. The same applies to mpiifort (you don't need -fc=ifort).
As I see, you are building a hybrid application. That's OK, but you need to set I_MPI_PIN_DOMAIN to omp and set KMP_AFFINITY.

Error 67 is happening on a READ which follows a WRITE. It is complaining that the total size of the variables you have asked to read exceeds the size of the record. It could be that the file is corrupt or that it is not in the form expected, so this comes down to what the application expects of the file.
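
To make the record-size point concrete, here is a minimal Python sketch that lists the record lengths in a file. It rests on assumptions that are not confirmed for CPMD: that RESTART.1 is a Fortran sequential unformatted file, that each record is framed by a 4-byte length marker before and after the data, and that the markers are big-endian because of -convert big_endian. The function name record_lengths is made up for illustration.

```python
import struct


def record_lengths(path, byte_order=">", limit=10):
    """Return the payload sizes (in bytes) of the first `limit` records.

    Assumes a sequential unformatted layout: a 4-byte length marker,
    the record data, then the same 4-byte marker again.
    byte_order is ">" for big-endian markers, "<" for little-endian.
    """
    lengths = []
    with open(path, "rb") as f:
        while len(lengths) < limit:
            head = f.read(4)
            if not head:
                break  # clean end of file
            if len(head) < 4:
                raise ValueError("truncated leading record marker")
            (n,) = struct.unpack(byte_order + "i", head)
            if n < 0:
                raise ValueError("negative record length %d" % n)
            f.seek(n, 1)  # skip over the record data
            tail = f.read(4)
            if len(tail) < 4:
                raise ValueError("record of %d bytes runs past end of file" % n)
            (m,) = struct.unpack(byte_order + "i", tail)
            if m != n:
                raise ValueError("marker mismatch: %d before, %d after" % (n, m))
            lengths.append(n)
    return lengths
```

Running it on the restart file from the working directory and on the one from the failing directory would show whether a record in the failing file is shorter than in the good one, which is the situation error 67 describes. A ValueError would suggest that the file is truncated or not in the assumed layout.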

Run "od -t x4" on the correct and incorrect data files to see what the first few 32-bit chunks are, and compare them. CPMD might have some requirements on these files, for example that they be located on a common device.
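
If od is not convenient, the same comparison can be sketched in Python. This is only an illustration: the helper names first_words and first_difference are hypothetical, and the default little-endian byte order is chosen on the assumption that it matches how od prints 32-bit words on an x86 host.

```python
import struct


def first_words(path, count=8, byte_order="<"):
    """Return the first `count` 32-bit words of a file as hex strings."""
    with open(path, "rb") as f:
        data = f.read(4 * count)
    usable = len(data) - len(data) % 4  # ignore a trailing partial word
    fmt = "%s%dI" % (byte_order, usable // 4)
    return ["%08x" % w for w in struct.unpack(fmt, data[:usable])]


def first_difference(path_a, path_b, count=8):
    """Return (index, word_a, word_b) for the first differing word, or None."""
    words_a = first_words(path_a, count)
    words_b = first_words(path_b, count)
    for i, (wa, wb) in enumerate(zip(words_a, words_b)):
        if wa != wb:
            return i, wa, wb
    return None
```

Calling first_difference on the good and the bad RESTART.1 reports where the two files first diverge within the first few words. If the very first word differs, the files probably disagree on the length of the first record or on byte order.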

If you were able to start your application, that probably means nothing is wrong with the MPI library you used. Please read the CPMD documentation and visit cpmd.org; you might find useful information there.

Best wishes,
Dmitry

TimP · ‎09-09-2009

I would agree with Dmitry's advice to use the Intel MPI exclusively, if it is installed on your cluster. You can't mix mvapich1 with Intel MPI; the headers aren't compatible, for example there are different hex codes for the common MPI data types. Even mvapich2, which shares common ancestry with Intel MPI, isn't compatible.