Kevin,
Cray MPI on the XC systems sets the env var ALPS_APP_PE to the rank number, unique per rank, 0 to N-1 for N ranks. Cray does not use the same env vars as MPICH, Intel MPI, or Open MPI to pass rank information down to applications.
advixe-cl run under MPI needs to open a results dir for each rank. I believe it looks at the MPICH, Intel MPI, and Open MPI env vars to find the rank number to use for the results directories, and I am fairly sure it is not looking for Cray's ALPS_APP_PE. Here is what I'm seeing. If I launch a Cray MPI job like this:
aprun -n 1 advixe-cl --results-dir=/foodir ./a.out
it works. When I run more than 1 rank, I get a file-open error on the results dir.
As background, VTune used to have this problem as well. They modified their collector to look for an MPI job's rank via the env vars of MPICH, Open MPI, and Cray's ALPS_APP_PE. I think advixe-cl needs a similar mod: look for the env var ALPS_APP_PE to flag an MPI job and to fetch the rank to use in the results dir name.
Could you confirm that the collector is not looking for ALPS_APP_PE to indicate an MPI job and to fetch the rank? If it isn't, consider this a feature request to get advixe-cl to work under Cray's ALPS MPI environment.
As a workaround, I aprun a wrapper script that launches multiple collectors with the results dir set to <results dir>.$ALPS_APP_PE. This works around the issue.
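The workaround boils down to something like the sketch below (the function name and the /foodir path are illustrative, not Ron's exact script): each rank derives its own results directory from Cray's ALPS_APP_PE before invoking the collector.

```shell
#!/bin/bash
# Sketch of the workaround wrapper (names and paths are illustrative):
# derive a per-rank results dir from Cray's ALPS_APP_PE so each rank
# writes to its own directory instead of colliding on one.
rank_results_dir() {
  # ALPS sets ALPS_APP_PE to 0..N-1 on each rank; default to 0 for a serial run
  local rank="${ALPS_APP_PE:-0}"
  echo "$1.$rank"
}

# Each rank launched by aprun would then run something like:
#   advixe-cl --collect survey --results-dir="$(rank_results_dir /foodir)" -- ./a.out
```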
thanks
Ron
Thanks Ron!
I'll file a feature request!
Regards,
Kevin
Hi Ron,
I can confirm the behavior you mention. I have filed a feature request.
The development team mentioned that the Advisor -trace-mpi command-line option is a workaround.
Can you try this?
Thanks!
Kevin
Hi Ron,
We wanted to verify that -trace-mpi was sufficient for your requirements.
Kevin
I'll go try it ... give me an hour.
Well, mixed results. The -trace-mpi option does indeed help - we now get results dirs and files with <node>.<rank> names. However, the results files are empty - the .aux files are only about 46 bytes. We run like this:
aprun -n 60 -N 30 -S 15 -j 1 advixe-cl -trace-mpi -collect survey -search-dir=all:r=$PWD -project-dir $SCRATCH/cice/advisor/haswell/intel-p4-60pe -- ./cice
This runs a 60-rank MPI job on 2 Haswell nodes: 30 ranks per node, 15 ranks per socket, first hardware thread only (not using hyperthreads).
Attached is the STDOUT from this run. Note all the "collection stopped" messages. The run actually ran for a minute or two with correct program output and results.
I also re-ran this problem on 1 node with 20 ranks, 10 per socket. Same result: the .aux files are just 46 bytes, and the GUI shows no user code and a runtime of 0.02s.
The code is compiled with Intel Fortran 16.0.3 and 17.0.0 with -g -qopt-report=5.
Any idea why "collection stopped" happens? The Cray compute nodes run a streamlined SLES OS - not quite a microkernel, but if the collectors are trying to coordinate via memory-mapped files or RPC then we may have some issue.
Ron
Let me discuss with our development team.
Can you share the project directory (including the result dirs)?
Kevin
Hi Ron,
VTune and Advisor should work on Cray compute nodes.
Can you confirm this is all being run on the Lustre file system?
Thanks!
Kevin
Yes, we are running the binary and storing results on Lustre.
I'm working on getting you project dir and results dir.
Ron
We've been able to run an MPI analysis in a Cray XC environment. I'll capture the details in a KB article on IDZ later.
BUT -trace-mpi is NOT SUFFICIENT to collect data on a Cray system. As I said, Cray's MPI sets the env var ALPS_APP_PE on each rank, numbered 0 to N-1. It does NOT set PMI_RANK. Without explicitly setting PMI_RANK to ALPS_APP_PE in a wrapper script, the correct directories are not set up by the collection - specifically, the rank.<n>/rank.<n>.advixeexp files are NOT created if PMI_RANK is not set. This proves that -trace-mpi is NOT sufficient.
Observe: in the first experiment I run a simple MPI pi.c with 4 ranks. In this case I do NOT set PMI_RANK in the wrapper. The wrapper, the run command, and the resulting directories are attached in file 'wrapper-no-pmi-rank-set.txt'. Note there are NO rank.<n>/rank.<n>.advixeexp files created.
green/collectorbug> more runit.sh
# --- script needed to launch collector on Cray XC cluster ---
# --- we do NOT set PMI_RANK or PMI_PE, only ALPS_APP_PE is set to rank --
# export PMI_RANK=${ALPS_APP_PE}
# export PMI_PE=${ALPS_APP_PE}
export PMI_NO_FORK=1
advixe-cl --collect survey -trace-mpi --project-dir ./adviproj --search-dir all:r=/lustre/ttscratch1/green/collectorbug -- ./cpi
Here is the job launch
aprun -n 4 -N 4 -j1 -d1 -cc depth advixe-cl --collect survey -trace-mpi --project-dir /lustre/ttscratch1/green/collectorbug/adviproj --search-dir all:r=/lustre/ttscratch1/green/collectorbug -- bash runit.sh
The directories created are missing the "rank" designation; Advisor can't open this collection:
find adviproj -type f -exec ls -l {} \; >& wrapper-no-pmi-rank-set.txt
Now, rerun, but uncomment the line that sets PMI_RANK:
green/collectorbug> more runit.sh
export PMI_RANK=${ALPS_APP_PE}
# export PMI_PE=${ALPS_APP_PE}
export PMI_NO_FORK=1
advixe-cl --collect survey -trace-mpi --project-dir ./adviproj --search-dir all:r=/lustre/ttscratch1/green/collectorbug -- ./cpi
and look at the files created via find:
aprun -n 4 -N 4 -j1 -d1 -cc depth advixe-cl --collect survey -trace-mpi --project-dir /lustre/ttscratch1/green/collectorbug/adviproj --search-dir all:r=/lustre/ttscratch1/green/collectorbug -- bash runit.sh
find adviproj -type f -exec ls -l {} \; >& wrapper-with-pmi-rank-set.txt
Observe that the correct "rank" files are created in this case and Advisor GUI can view the collected data.
I would consider this a bug - the collector should be able to get the rank in a Cray MPI environment if it merely looks for ALPS_APP_PE when PMI_RANK is not set. Seems simple enough, especially since I clued it in by using -trace-mpi. This would help a lot of us trying to use Advisor XE on Cray XC systems.
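The fallback being requested is small; in shell terms it would look something like this sketch (not Advisor's internal code - the function name is made up): probe the well-known MPI rank variables first, then Cray's ALPS_APP_PE.

```shell
#!/bin/bash
# Sketch (not Advisor internals): find the MPI rank by checking the
# common launcher env vars in order, falling back to Cray's ALPS_APP_PE.
detect_mpi_rank() {
  local var
  # PMI_RANK: MPICH/Intel MPI; OMPI_COMM_WORLD_RANK: Open MPI;
  # ALPS_APP_PE: Cray ALPS (the variable this thread asks Advisor to honor)
  for var in PMI_RANK OMPI_COMM_WORLD_RANK ALPS_APP_PE; do
    # ${!var} is bash indirect expansion: the value of the variable named by $var
    if [ -n "${!var}" ]; then
      echo "${!var}"
      return 0
    fi
  done
  return 1  # none set: not an MPI job
}
```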
Ron
We also have problems with Intel Advisor on our Cray XC40.
We now use the following wrapper script to perform the survey collection:
aprun_opt="-n $n -N $N -j2 -d $[ $T * 2 ] -cc numa_node"
aprun $aprun_opt ./advixe-cl_survey.sh ${workdir} ${BIN}
#####
# advixe-cl_survey.sh: $1 = source path, $2 = binary
#####
#!/bin/bash
export PMI_RANK=${ALPS_APP_PE}
export PMI_NO_FORK=1
#export PMI_NO_PREINITIALIZE=1  # not required for survey
advixe-cl -collect survey -trace-mpi --no-auto-finalize -flops-and-masks -project-dir ./advisor -search-dir src:r=$1 $2 ./input.par > output.log.survey
This works for our MPI and OpenMP applications. The only issue is that we are not able to get any FLOPS and bandwidth reports on the Haswell CPUs. If I remember right, this worked on another cluster with Haswell CPUs, but I have to check that again.
But when we want to collect the trip count data, we get the following error message:
Mon Nov 7 14:58:01 2016: [PE_0]:_pmi_alps_sync:alps response not OKAY
Mon Nov 7 14:58:01 2016: [PE_0]:_pmi_init:_pmi_alps_sync failed -1
advixe: Warning: The application returned a non-zero exit value.
We were able to fix this problem with export PMI_NO_PREINITIALIZE=1, so our advixe-cl_tripcounts.sh wrapper is:
######
# advixe-cl_tripcounts.sh: $1 = source path, $2 = binary
######
#!/bin/bash
export PMI_RANK=${ALPS_APP_PE}
export PMI_NO_FORK=1
export PMI_NO_PREINITIALIZE=1
export PMI_MMAP_SYNC_WAIT_TIME=300  # We have to check if this is really required
advixe-cl -collect tripcounts -trace-mpi --no-auto-finalize -project-dir ./advisor -search-dir src:r=$1 $2 ./input.par > output.log.tripcounts
The problem is that we still get the following warning/error:
advixe: Warning: The application returned a non-zero exit value.
I have also checked with the following hello world to see whether it is a problem with the application or with advixe-cl:
PROGRAM HELLO
  INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
!C Fork a team of threads giving them their own copies of variables
!$OMP PARALLEL PRIVATE(NTHREADS, TID)
!C Obtain thread number
  TID = OMP_GET_THREAD_NUM()
  PRINT *, 'Hello World from thread = ', TID
!C Only master thread does this
  IF (TID .EQ. 0) THEN
    NTHREADS = OMP_GET_NUM_THREADS()
    PRINT *, 'Number of threads = ', NTHREADS
  END IF
!C All threads join master thread and disband
!$OMP END PARALLEL
END
Again, the survey works fine, but the tripcounts collection does not. In interactive mode we get the following extended output:
aprun -n 1 advixe-cl --collect tripcounts --no-auto-finalize -project-dir ./advisor -search-dir all:r=./ -- ./hw_advisor_test
Intel(R) Advisor Command Line Tool
Copyright (C) 2009-2016 Intel Corporation. All rights reserved.
advixe: Error: Internal error. Please contact Intel customer support team.
advixe: Error: Analysis terminated abnormally.
advixe: Error: An internal error has occurred. Our apologies for this inconvenience. Please gather a description of the steps leading up to the problem and contact the Intel customer support team.
advixe: Warning: The application returned a non-zero exit value.
Application 5827006 resources: utime ~0s, stime ~1s, Rss ~52160, inblocks ~11650, outblocks ~3178
I used the following environment:
PrgEnv-intel/5.2.82 (ifort version 16.0.3)
Intel Advisor 17.1 (build 477503)
Does anybody have an idea?
We also tested the following version without success:
Update 1 (build 486553)
