Thank you in advance for any help with this.
Description of Problem
After recently upgrading our system, we are unable to run code when using multiple MICs on the same node, or the host CPU plus MICs on the same node, with the ofa-v2-scif0-u DAPL provider. A representative example of the output with I_MPI_DEBUG set to 5 for the host plus one MIC is:
=========== Host + One MIC ==============
[0] MPI startup(): Multi-threaded optimized library
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0-u
[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0-u
[1] MPI startup(): DAPL provider ofa-v2-scif0-u
[0] MPI startup(): DAPL provider ofa-v2-scif0-u
[2] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0-u
[3] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0-u
beacon034-mic0:UCM:1eb1a:1d929b40: 39984 us(39984 us): UCM: CM service: ERR ibv_mr sbuf (Cannot allocate memory)
beacon034-mic0:UCM:1eb1a:1d929b40: 40116 us(132 us): ucm_create_services: ERR Cannot allocate memory
beacon034-mic0:UCM:1eb1b:ac164b40: 37310 us(37310 us): UCM: CM service: ERR ibv_mr sbuf (Cannot allocate memory)
beacon034-mic0:UCM:1eb1b:ac164b40: 37441 us(131 us): ucm_create_services: ERR Cannot allocate memory
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 125722 RUNNING AT beacon034-mic0
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
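When a debug log like this gets long, it can help to count the UCM error lines per host to see which side of the connection is failing. The following is only a rough sketch: the line layout (`host:UCM:pid:tid: N us(M us): message`) is inferred from the sample output in this thread rather than from any DAPL documentation, it assumes one log message per line, and the names `UCM_LINE` and `ucm_errors_by_host` are made up for illustration.

```python
import re
from collections import Counter

# Layout inferred from the log lines quoted in this thread (an assumption,
# not a documented format): host:UCM:pid:tid: <elapsed> us(<delta> us): message
UCM_LINE = re.compile(
    r"^(?P<host>[^:\s]+):UCM:(?P<pid>[0-9a-f]+):(?P<tid>[0-9a-f]+):\s+"
    r"(?P<elapsed_us>\d+) us\((?P<delta_us>\d+) us!*\):\s+(?P<message>.*)$"
)


def ucm_errors_by_host(log_text):
    """Count UCM log lines whose message contains 'ERR', keyed by host name."""
    counts = Counter()
    for line in log_text.splitlines():
        match = UCM_LINE.match(line.strip())
        if match and "ERR" in match.group("message"):
            counts[match.group("host")] += 1
    return counts


# Two of the lines from the output above, plus one non-UCM line that is skipped.
SAMPLE = """\
[0] MPI startup(): DAPL provider ofa-v2-scif0-u
beacon034-mic0:UCM:1eb1a:1d929b40: 39984 us(39984 us): UCM: CM service: ERR ibv_mr sbuf (Cannot allocate memory)
beacon034-mic0:UCM:1eb1a:1d929b40: 40116 us(132 us): ucm_create_services: ERR Cannot allocate memory
"""

if __name__ == "__main__":
    for host, count in ucm_errors_by_host(SAMPLE).items():
        print(f"{host}: {count} UCM error line(s)")
```

In the run quoted here every UCM error comes from beacon034-mic0, which suggests the failure is on the coprocessor side rather than on the host.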
Additional Details
- The same code, a set of hello world examples, runs fine on our old system configuration, which had an older kernel and MPSS stack version
- The code runs fine on just the host side or on just a single MIC
- Other providers, ofa-v2-mlx4_0-1u and ofa-v2-mcm-1, execute as expected
- I'm more than happy to try other kernels or OFED versions as necessary, but would like to keep the MPSS 3.6.1 stack version shown below
- The MPSS modules and OFED were built successfully against the minor kernel release
Host System Information
OS: CentOS 6.6
Kernel: 2.6.32-504.30.3.el6.x86_64
OFED: OFED-3.18-1-rc1
MPSS: 3.6.1
MPI: Intel 5.1.2.150
Compilers: Intel 2016.1.056
Hi William,
I tried I_MPI_DAPL_PROVIDER=ofa-v2-scif0-u and ran the hello program successfully on my system with a MIC card. My system runs RHEL 6.6 with MPSS 3.6.1 and OFED-3.18-1 installed.
What is the locked memory limit on your system? (Run "ulimit -a".)
Thank you for your assistance. I've posted the ulimit output for both the host and the MICs below. It does appear that the max locked memory is set to 64 KB on the cards. We currently don't have PAM enabled on the MICs' OS, and I cannot find a way to set the max locked memory for user sessions on the card side without enabling PAM. Is there a way to do that, so I can test whether this is the root of the problem?
On the host side we have:
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2066043
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 131072
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2066043
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
On the co-processor we have:
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 61236
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 61236
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
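The "max locked memory" row is the one that matters here: registering memory for RDMA pins pages, so a small RLIMIT_MEMLOCK is a common cause of "Cannot allocate memory" errors from memory registration. As a minimal sketch, the same limit that `ulimit -l` reports can be read from inside a process with Python's standard `resource` module. This assumes a Python interpreter is available wherever it is run, which may not be true on the coprocessor, and the helper names `memlock_limit_kib` and `format_limit` are made up for illustration.

```python
import resource


def memlock_limit_kib():
    """Return the (soft, hard) RLIMIT_MEMLOCK in KiB; None means unlimited."""
    def to_kib(value):
        # getrlimit reports bytes; RLIM_INFINITY marks "unlimited".
        return None if value == resource.RLIM_INFINITY else value // 1024

    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return to_kib(soft), to_kib(hard)


def format_limit(kib):
    """Render a limit the way ulimit does: a number or 'unlimited'."""
    return "unlimited" if kib is None else f"{kib} KiB"


if __name__ == "__main__":
    soft, hard = memlock_limit_kib()
    print(f"max locked memory: soft={format_limit(soft)}, hard={format_limit(hard)}")
```

Checking the limit from within the launched process, rather than from an interactive shell, shows what the MPI ranks actually inherit, which can differ depending on how the session was started.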
After enabling PAM on the MICs, we no longer have the locked memory issue. However, we are still experiencing problems with the SCIF providers: trying to use the ofa-v2-scif0 provider fails. If I run dtest between the host and the MICs, we encounter the error below. There is no issue running the same test between two MICs on the same host.
[root@host ~]# dtest -P ofa-v2-scif0-u -t
16253 Running as server - ofa-v2-scif0-u v2
16253 Local Address AF_INET6 - 4c79:ba29:385:: flowinfo(QPN)=0x1, port(LID)=0x3e8
16253 Server is waiting for client connection to send server info
16253 Server waiting for connect request on port b0c0
beacon047:UCM:3f7d:30952a00: 3236489 us(3236489 us!!!): modify_qp_state: ERR type 2 qpn 0x6 gid 0x2b0d68000a80 (1) lid 0x3e9 port 1 state 1 mtu 4 rd 4 rnr 12 sl 0
host:UCM:3f7d:30952a00: 3236513 us(24 us): DAPL ERR modify_qp_state Network is unreachable
host:UCM:3f7d:30952a00: 3236518 us(5 us): ACCEPT_USR: QPS_RTR ERR Network is unreachable -> lid 3e9 qpn 6
16253 Error dat_cr_accept: DAT_INTERNAL_ERROR
16253 Error connect_ep: DAT_INTERNAL_ERROR