Dear all,
I am referring to the guide on explicit offload for Quantum ESPRESSO (QE) to Xeon Phi KNC (7120P) here:
https://software.intel.com/en-us/articles/explicit-offload-for-quantum-espresso
I followed the steps above but failed to run pw.x (QE v5.3.0) on 2 Xeon Phi 7120P using the mpirun.sh script. The error reads:
allocating buffers 2048 2048 1024
on device 0
threshold 20000000000.0000
allocating buffers 2048 2048 1024
on device 0
threshold 20000000000.0000
offload error: cannot create buffer on device 0 (error code 14)
offload error: cannot create buffer on device 0 (error code 14)
This is how I run the script:
[qeuser@node09 ~]$ ~/mpirun/mpirun.sh -p 1 -w ~/libxphi/xphilibwrapper.sh -x ~/QE530-KNC-OL/espresso-5.3.0/bin/pw.x -i ~/rolly/AUSURF112/ausurf.in
I have already copied (scp) all the lib and bin files to each Xeon Phi 7120P, and I have also compiled the libxphi library. The libxphi directory contains:
[qeuser@node09 libxphi]$ ls
build-library.sh libmkl_proxy.so LICENSE README.md xphilibmod.mod xphilib.o xphilib_proxy.o
clean.sh libxphi.so mkl_proxy.c xphilib.f90 xphilibmod.modmic xphilib_proxy.f90 xphilibwrapper.sh
I suppose this is okay.
However, I found that I can run a single instance on mic0, although it is very slow. This is how I did it:
[qeuser@node09 ~]$ export LD_LIBRARY_PATH=/home/qeuser/libxphi/:$LD_LIBRARY_PATH
[qeuser@node09 ~]$ LD_PRELOAD="/home/qeuser/libxphi/libxphi.so" /home/qeuser/QE530-KNC-OL/espresso-5.3.0/bin/pw.x < /home/qeuser/rolly/AUSURF112/ausurf.in
Error messages were also produced:
allocating buffers 2048 2048 1024
on device 0
threshold 20000000000.0000
ERROR: ld.so: object '/home/qeuser/libxphi/libxphi.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
ERROR: ld.so: object '/home/qeuser/libxphi/libxphi.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
buffer allocation 4.02019500732422 s
The AUSURF112 benchmark completed, but it took a very long time. The output reads:
PWSCF : 9h56m CPU 0h36m WALL
This run was terminated on: 2:25: 1 11Mar2017
=------------------------------------------------------------------------------=
JOB DONE.
=------------------------------------------------------------------------------=
2
offload error: cannot unload library from the device 0 (error code 14)
On the host I can see one copy of pw.x running, and on mic0 I can see offload_main and coi_daemon running as micuser. But it is very slow.
So, is "offload error: cannot create buffer on device 0 (error code 14)" related to the mpirun.sh script? Why was libxphi.so not preloaded even though it is present, and why can the library not be unloaded from the device after completion?
With two Xeon E5-2683v3 CPUs (56 threads), it took only 9 minutes and 15 seconds, but with this single instance offloaded to the Xeon Phi, it took about 10 hours of CPU time (9h56m CPU, 0h36m wall)?
I am running CentOS 7.1 + Intel MPSS 3.8.1 + Intel PSXE 2017 update 1. I have already made a symbolic link to psxevars.sh in /etc/profile.d, and I can use mpirun to run pw.x on the host, but not to offload to mic0 and mic1.
Are these compatibility issues because libxphi and mpirun.sh were written 2 years ago? How can these be fixed?
Thank you,
Rolly
Hi all,
Thanks to Dr. Dahnken, the developer of the libxphi library, who found the solution to my question.
https://github.com/cdahnken/libxphi
https://github.com/hfp/mpirun/blob/master/mpirun.py
In a configuration with CentOS 7.1 + MPSS 3.8.1 + Intel PSXE 2017 update 1, one line has to be removed from mpirun.py:
runstring = "mpirun -bootstrap ssh"
if (None == args.mri):
    # Deleted line:  + " -genv OFFLOAD_INIT=on_start"
    # (delete it rather than commenting it out in place: a comment inside
    # the backslash continuation below would be a syntax error)
    runstring = runstring \
        + " -genv I_MPI_PIN_DOMAIN=auto" \
        + " -genv MIC_USE_2MB_BUFFERS=2m" \
        + " -genv MIC_ENV_PREFIX=" + micenv \
        + " -genv " + micenv + "_KMP_AFFINITY=" + args.micaffinity + ",granularity=fine" \
        + " -genv " + micenv + "_OMP_SCHEDULE=" + args.schedule \
        + " -genv " + micenv + "_OMP_NUM_THREADS=" + str(max((mcores + min(0, args.reserved)) * 4, args.mthreads))
After removing the line "+ " -genv OFFLOAD_INIT=on_start" \", the run offloads to both 7120P cards.
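If the same script has to serve setups where OFFLOAD_INIT=on_start works as well as setups where it does not, the global options could be assembled from a list instead of a hand-edited string. This is only a sketch: the function name build_genv and its offload_init parameter are hypothetical and not part of mpirun.py; the option strings themselves are the ones shown above.

```python
def build_genv(micenv="MIC", micaffinity="balanced", schedule="static",
               mic_threads=32, offload_init=False):
    """Return the global (-genv) part of the mpirun command line.

    Hypothetical helper; not part of the original mpirun.py.
    """
    options = [
        "I_MPI_PIN_DOMAIN=auto",
        "MIC_USE_2MB_BUFFERS=2m",
        "MIC_ENV_PREFIX=" + micenv,
        micenv + "_KMP_AFFINITY=" + micaffinity + ",granularity=fine",
        micenv + "_OMP_SCHEDULE=" + schedule,
        micenv + "_OMP_NUM_THREADS=" + str(mic_threads),
    ]
    if offload_init:
        # The setting that had to be dropped in the configuration described above.
        options.insert(1, "OFFLOAD_INIT=on_start")
    return "mpirun -bootstrap ssh" + "".join(" -genv " + o for o in options)


# Default: OFFLOAD_INIT is left out, as in the working configuration.
print(build_genv())
```

Keeping the option behind a flag means the script does not need to be edited again if a later MPSS or compiler release behaves differently.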
With these parameters, 2x 7120P complete the QE AUSURF112 benchmark in 17min56sec:
$ QE_MIC_BLOCKSIZE_M=2048
$ QE_MIC_BLOCKSIZE_N=2048
$ QE_MIC_BLOCKSIZE_K=512
By increasing K to 1024, 2x 7120P complete the QE AUSURF112 benchmark in 17min17sec.
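To put the timings quoted in this thread side by side, they can be converted to seconds and compared. The numbers below are taken from the posts (9min15sec on the two host CPUs, 17min56sec and 17min17sec on 2x 7120P); the helper name to_seconds is just for illustration.

```python
def to_seconds(hours=0, minutes=0, seconds=0):
    """Convert a wall-clock duration to seconds."""
    return hours * 3600 + minutes * 60 + seconds


host = to_seconds(minutes=9, seconds=15)        # 2x Xeon E5-2683v3, host only
phi_k512 = to_seconds(minutes=17, seconds=56)   # 2x 7120P, QE_MIC_BLOCKSIZE_K=512
phi_k1024 = to_seconds(minutes=17, seconds=17)  # 2x 7120P, QE_MIC_BLOCKSIZE_K=1024

for label, t in (("K=512", phi_k512), ("K=1024", phi_k1024)):
    print("%s: %d s, %.2fx the host-only time" % (label, t, t / float(host)))

# Relative saving from raising K from 512 to 1024.
gain = 100.0 * (phi_k512 - phi_k1024) / phi_k512
print("K=512 -> K=1024 saves %.1f%%" % gain)
```

On these figures the offloaded runs still take a little under twice as long as the host-only run, and the larger K block size saves only a few percent.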
I found that 3 warnings are given by the code:
(1) OMP: Info #256: KMP_PLACE_THREADS variable deprecated, please use KMP_HW_SUBSET instead.
(2) OMP: Warning #250: KMP_HW_SUBSET "o" offset designator deprecated, please use @ prefix for offset value.
(3) offload error: cannot unload library from the device 0 (error code 14)
I am looking into these and will report my findings.
Thanks for your attention.
Rolly
Re 1: KMP_PLACE_THREADS has been replaced with KMP_HW_SUBSET, which better captures what the environment variable actually does. (The link below also points this out!)
Re 2: KMP_HW_SUBSET is documented in the compiler manual at https://software.intel.com/en-us/node/684224#DA65065D-8C30-4E3A-BA61-4AAE42DD505C (and elsewhere).
Hi James,
Thanks for the info.
I took a look at the link. I believe KMP_HW_SUBSET reads something like this: KMP_HW_SUBSET=socketsS[@offset],coresC[@offset],threadsT
The mpirun.py script generates something like this:
mpirun -bootstrap ssh \
  -genv I_MPI_PIN_DOMAIN=auto -genv MIC_USE_2MB_BUFFERS=2m -genv MIC_ENV_PREFIX=MIC \
  -genv MIC_KMP_AFFINITY=balanced,granularity=fine -genv MIC_OMP_SCHEDULE=static -genv MIC_OMP_NUM_THREADS=32 \
  -host node09 -np 1 -env KMP_AFFINITY=compact,1,granularity=fine -env OMP_SCHEDULE=static -env OMP_NUM_THREADS=4 \
    -env OFFLOAD_DEVICES=0 -env MIC_KMP_PLACE_THREADS=8c,4t,0o -env MIC_OMP_SCHEDULE=static \
    /home/qeuser/libxphi/xphilibwrapper.sh /home/qeuser/QE530-KNC-OL/espresso-5.3.0/bin/pw.x < /home/qeuser/rolly/AUSURF112/ausurf.in : \
  -host node09 -np 1 -env KMP_AFFINITY=compact,1,granularity=fine -env OMP_SCHEDULE=static -env OMP_NUM_THREADS=4 \
    -env OFFLOAD_DEVICES=0 -env MIC_KMP_PLACE_THREADS=8c,4t,8o -env MIC_OMP_SCHEDULE=static \
    /home/qeuser/libxphi/xphilibwrapper.sh /home/qeuser/QE530-KNC-OL/espresso-5.3.0/bin/pw.x < /home/qeuser/rolly/AUSURF112/ausurf.in : \
  ... \
  -host node09 -np 1 -env KMP_AFFINITY=compact,1,granularity=fine -env OMP_SCHEDULE=static -env OMP_NUM_THREADS=4 \
    -env OFFLOAD_DEVICES=1 -env MIC_KMP_PLACE_THREADS=8c,4t,48o -env MIC_OMP_SCHEDULE=static \
    /home/qeuser/libxphi/xphilibwrapper.sh /home/qeuser/QE530-KNC-OL/espresso-5.3.0/bin/pw.x < /home/qeuser/rolly/AUSURF112/ausurf.in
So, is it correct to change the environment variable MIC_KMP_PLACE_THREADS to MIC_KMP_HW_SUBSET? Should it look like this?
  -host node09 -np 1 -env KMP_AFFINITY=compact,1,granularity=fine -env OMP_SCHEDULE=static -env OMP_NUM_THREADS=4 \
    -env OFFLOAD_DEVICES=0 -env MIC_KMP_HW_SUBSET=0s,8c@0,4t -env MIC_OMP_SCHEDULE=static \
    /home/qeuser/libxphi/xphilibwrapper.sh /home/qeuser/QE530-KNC-OL/espresso-5.3.0/bin/pw.x < /home/qeuser/rolly/AUSURF112/ausurf.in : \
  -host node09 -np 1 -env KMP_AFFINITY=compact,1,granularity=fine -env OMP_SCHEDULE=static -env OMP_NUM_THREADS=4 \
    -env OFFLOAD_DEVICES=0 -env MIC_KMP_HW_SUBSET=0s,8c@8,4t -env MIC_OMP_SCHEDULE=static \
    /home/qeuser/libxphi/xphilibwrapper.sh /home/qeuser/QE530-KNC-OL/espresso-5.3.0/bin/pw.x < /home/qeuser/rolly/AUSURF112/ausurf.in : \
  ... \
  -host node09 -np 1 -env KMP_AFFINITY=compact,1,granularity=fine -env OMP_SCHEDULE=static -env OMP_NUM_THREADS=4 \
    -env OFFLOAD_DEVICES=1 -env MIC_KMP_HW_SUBSET=0s@1,8c@48,4t -env MIC_OMP_SCHEDULE=static \
    /home/qeuser/libxphi/xphilibwrapper.sh /home/qeuser/QE530-KNC-OL/espresso-5.3.0/bin/pw.x < /home/qeuser/rolly/AUSURF112/ausurf.in :
MIC_KMP_PLACE_THREADS also assigns different thread numbers to different cores via the "o" offset designator, but how can I do the same with MIC_KMP_HW_SUBSET? By the way, is MIC_KMP_HW_SUBSET the correct environment variable?
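Going only by the text of warning #250 (use an @ prefix for the offset value), the old core offset would move onto the core count. Below is a minimal sketch of that rewrite, assuming the old value always has the form <n>c,<n>t or <n>c,<n>t,<n>o as in the generated command above. The function name is hypothetical, and the result has not been checked against the OpenMP runtime. Whether a socket field is needed at all is left open here.

```python
import re


def place_threads_to_hw_subset(value):
    """Rewrite e.g. '8c,4t,8o' as '8c@8,4t'. Hypothetical helper."""
    match = re.match(r"^(\d+)c,(\d+)t(?:,(\d+)o)?$", value.strip(), re.IGNORECASE)
    if match is None:
        raise ValueError("unsupported KMP_PLACE_THREADS value: %r" % value)
    cores, threads, offset = match.groups()
    # The trailing "<n>o" offset becomes an "@<n>" suffix on the core count.
    core_part = cores + "c" + ("@" + offset if offset is not None else "")
    return core_part + "," + threads + "t"


for old in ("8c,4t,0o", "8c,4t,8o", "8c,4t,48o"):
    print(old, "->", place_threads_to_hw_subset(old))
```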
Thanks,
Rolly
Hi all,
I was able to remove the first two warnings by modifying mpirun.py. Can anyone check whether these changes are correct? Thank you!
#! /usr/bin/env python
###############################################################################
## Copyright (c) 2014-2015, Intel Corporation                                ##
## All rights reserved.                                                      ##
##                                                                           ##
## Redistribution and use in source and binary forms, with or without        ##
## modification, are permitted provided that the following conditions        ##
## are met:                                                                  ##
## 1. Redistributions of source code must retain the above copyright         ##
##    notice, this list of conditions and the following disclaimer.          ##
## 2. Redistributions in binary form must reproduce the above copyright      ##
##    notice, this list of conditions and the following disclaimer in the    ##
##    documentation and/or other materials provided with the distribution.   ##
## 3. Neither the name of the copyright holder nor the names of its          ##
##    contributors may be used to endorse or promote products derived        ##
##    from this software without specific prior written permission.          ##
##                                                                           ##
## THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS       ##
## "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT         ##
## LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR     ##
## A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT      ##
## HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,    ##
## SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED  ##
## TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR    ##
## PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF    ##
## LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING      ##
## NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS        ##
## SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.              ##
###############################################################################
## Hans Pabst, Christopher Dahnken, Intel Corporation                        ##
###############################################################################
import platform
import sys
import os

try:
    import argparse
except ImportError, e:
    print 'Failed to import the argparse module: ', e
    print 'Please follow the troubleshooting instructions at:'
    print 'https://github.com/hfp/mpirun#troubleshooting'


def micshift(i, rcores, ncores, nthreads):
    return str(max(ncores + min(0, rcores), nthreads)) + "c@" + str(i * ncores) + "," + str(nthreads) + "t"


nodename = platform.node().split(".")[0]
micenv = "MIC"

parser = argparse.ArgumentParser()
parser.add_argument("-n", "--nodelist", help="list of comma separated node names", default=["localhost", nodename]["" != nodename])
parser.add_argument("-p", "--cpuprocs", help="number of processes per socket (host)", type=int, default=1)
parser.add_argument("-q", "--micprocs", help="number of processes per mic (native)", type=int, default=1)
parser.add_argument("-s", "--nsockets", help="number of sockets per node", type=int, default=1)
parser.add_argument("-d", "--ndevices", help="number of devices per node", type=int, default=0)
parser.add_argument("-e", "--cpucores", help="number of CPU cores per socket", type=int, default=1)
parser.add_argument("-t", "--nthreads", help="number of CPU threads per core", type=int, default=2)
parser.add_argument("-m", "--miccores", help="number of MIC cores per device", type=int, default=57)
parser.add_argument("-r", "--reserved", help="number of MIC cores reserved", type=int, default=sys.maxint)
parser.add_argument("-u", "--mthreads", help="number of MIC threads per core", type=int, default=4)
parser.add_argument("-a", "--cpuaffinity", help="affinity (CPU) e.g., compact", default="compact")
parser.add_argument("-b", "--micaffinity", help="affinity (MIC) e.g., balanced", default="balanced")
parser.add_argument("-c", "--schedule", help="schedule, e.g. dynamic", default="static")
parser.add_argument("-w", "--wrapper", help="wrapper")
parser.add_argument("-g", "--debugger", help="debugger")
parser.add_argument("-i", "--inputfile", help="inputfile (<)")
parser.add_argument("-0", "--hr0", help="executable (rank-0)")
parser.add_argument("-x", "--hri", help="executable (host)")
parser.add_argument("-y", "--mri", help="executable (mic)")
parser.add_argument("-z", "--micpre", help="prefixed mic name", action="store_true")
parser.add_argument("-v", "--dryrun", help="dryrun", action="store_true")
args, unknown = parser.parse_known_args()

if (None != args.inputfile):
    arguments = " ".join(unknown) + " < " + args.inputfile
else:
    arguments = " ".join(unknown)
if ("" != arguments):
    arguments = " " + arguments

if (None != args.wrapper):
    wrapper = args.wrapper + [" ", " " + str(args.debugger) + " -ex run --args"][None != args.debugger]
else:
    wrapper = ["", str(args.debugger) + " -ex run --args"][None != args.debugger]
if ("" != wrapper):
    wrapper = wrapper + " "

if (1 < args.nsockets):
    args.cpuaffinity = args.cpuaffinity + ",1"
if (1 < args.nthreads):
    args.cpuaffinity = args.cpuaffinity + ",granularity=fine"

if (None == args.mri):
    if (sys.maxint == args.reserved):
        args.reserved = 1
    if (0 < args.ndevices):
        cpusockets = min(args.nsockets, args.ndevices)
    else:
        cpusockets = args.nsockets
    nparts = args.cpuprocs
else:
    if (sys.maxint == args.reserved):
        args.reserved = 0
    cpusockets = args.nsockets
    nparts = args.micprocs

cputhreads = cpusockets * args.cpucores * args.nthreads
mcores = int(0 < nparts) * max((args.miccores - max(args.reserved, 0)), nparts) / max(nparts, 1)
pthreads = int(0 < (args.cpuprocs * cpusockets)) * cputhreads / max(args.cpuprocs * cpusockets, 1)
remainder = cputhreads - pthreads * args.cpuprocs * cpusockets

runstring = "mpirun -bootstrap ssh"
if (None == args.mri):
    runstring = runstring \
        + " -genv I_MPI_PIN_DOMAIN=auto" \
        + " -genv MIC_USE_2MB_BUFFERS=2m" \
        + " -genv MIC_ENV_PREFIX=" + micenv \
        + " -genv " + micenv + "_KMP_AFFINITY=" + args.micaffinity + ",granularity=fine" \
        + " -genv " + micenv + "_OMP_SCHEDULE=" + args.schedule \
        + " -genv " + micenv + "_OMP_NUM_THREADS=" + str(max((mcores + min(0, args.reserved)) * 4, args.mthreads))
else:
    runstring = runstring \
        + " -genv I_MPI_MIC=1"

if (None != args.hr0):
    runstring = runstring \
        + " -host " + args.nodelist.split(",")[0] + " -np 1" \
        + " -env I_MPI_PIN_DOMAIN=auto" \
        + " -env KMP_AFFINITY=" + args.cpuaffinity \
        + " -env OMP_NUM_THREADS=" + str(args.nthreads) \
        + " " + wrapper + args.hr0 + arguments \
        + " :"

for n in args.nodelist.split(","):
    for s in range(0, cpusockets):
        for p in range(0, args.cpuprocs):
            runstring = runstring \
                + " -host " + n + " -np 1" \
                + " -env KMP_AFFINITY=" + args.cpuaffinity \
                + " -env OMP_SCHEDULE=" + args.schedule \
                + " -env OMP_NUM_THREADS=" + str(pthreads - [0, args.nthreads][None != args.hr0 and 0 == s and 0 == p])
            if (None == args.mri):
                runstring = runstring \
                    + " -env OFFLOAD_DEVICES=" + str(s) \
                    + " -env " + micenv + "_KMP_HW_SUBSET=" + str(s) + "s," + micshift(p, args.reserved, mcores, args.mthreads) \
                    + " -env " + micenv + "_OMP_SCHEDULE=" + args.schedule
            else:
                runstring = runstring \
                    + " -env I_MPI_PIN_DOMAIN=auto"
            if (None != args.hri):
                runstring = runstring \
                    + " " + wrapper + args.hri + arguments \
                    + " :"
    if (None != args.mri):
        for d in range(0, args.ndevices):
            for m in range(0, args.micprocs):
                runstring = runstring \
                    + " -host " + ["", n + "-"][int(args.micpre)] + "mic" + str(d) + " -np 1" \
                    + " -env LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH" \
                    + " -env I_MPI_PIN=off" \
                    + " -env KMP_HW_SUBSET=" + str(d) + "s," + micshift(m, args.reserved, mcores, args.mthreads) \
                    + " -env KMP_AFFINITY=" + args.micaffinity + ",granularity=fine" \
                    + " -env OMP_SCHEDULE=" + args.schedule \
                    + " -env OMP_NUM_THREADS=" + str(max(mcores + min(0, args.reserved), args.mthreads) * 4) \
                    + " " + args.mri + arguments \
                    + " :"
    if (None != args.hri and 0 < remainder and 0 < cputhreads and 0 < args.cpuprocs):
        runstring = runstring \
            + " -host " + n + " -np 1" \
            + " -env I_MPI_PIN_DOMAIN=auto" \
            + " -env KMP_AFFINITY=" + args.cpuaffinity \
            + " -env OMP_SCHEDULE=" + args.schedule \
            + " -env OMP_NUM_THREADS=" + str(remainder) \
            + " " + wrapper + args.hri + arguments \
            + " :"

runstring = runstring[0:len(runstring)-2]
print runstring
print
result = 0
if (False == args.dryrun):
    result = os.system(runstring)
sys.exit([0, 1][0 != result])
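The modified micshift helper can be exercised on its own to see which values the script would emit. The sketch below copies the function from the script and calls it with 8 cores per rank and 4 threads per core, the values that appear in the generated command earlier in the thread. The reserved-core argument of 1 is what the script sets by default in the offload case. The rank indices 0, 1 and 6 are chosen only because they reproduce the offsets 0, 8 and 48 seen earlier.

```python
def micshift(i, rcores, ncores, nthreads):
    # Copied from the modified mpirun.py: "<cores>c@<core offset>,<threads>t".
    return str(max(ncores + min(0, rcores), nthreads)) + "c@" + str(i * ncores) + "," + str(nthreads) + "t"


# Rank index -> core offset i * ncores.
for rank in (0, 1, 6):
    print("rank %d: %s" % (rank, micshift(rank, 1, 8, 4)))
```

In the script this string is additionally prefixed with the socket or device index (for example "0s,"), which is part of what needs checking against the KMP_HW_SUBSET documentation.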
As for the third warning, (3) offload error: cannot unload library from the device 0 (error code 14), can anyone help?
Thanks,
Rolly