Cannot allocate memory using provider ofa-v2-scif0-u

William_Howell
Beginner
982 Views

Thank you in advance for any help with this.

Description of Problem

After a recent system upgrade, we can no longer run code across multiple MICs on the same node, or across the host CPU plus MICs on the same node, when using the ofa-v2-scif0-u DAPL provider. Representative output with I_MPI_DEBUG set to 5 for the host + 1 MIC:

=========== Host + One MIC ==============
[0] MPI startup(): Multi-threaded optimized library
[0] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0-u
[1] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0-u
[1] MPI startup(): DAPL provider ofa-v2-scif0-u
[0] MPI startup(): DAPL provider ofa-v2-scif0-u
[2] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0-u
[3] DAPL startup(): trying to open DAPL provider from I_MPI_DAPL_PROVIDER: ofa-v2-scif0-u
beacon034-mic0:UCM:1eb1a:1d929b40: 39984 us(39984 us): UCM: CM service: ERR ibv_mr sbuf (Cannot allocate memory)
beacon034-mic0:UCM:1eb1a:1d929b40: 40116 us(132 us):  ucm_create_services: ERR Cannot allocate memory
beacon034-mic0:UCM:1eb1b:ac164b40: 37310 us(37310 us): UCM: CM service: ERR ibv_mr sbuf (Cannot allocate memory)
beacon034-mic0:UCM:1eb1b:ac164b40: 37441 us(131 us):  ucm_create_services: ERR Cannot allocate memory

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 125722 RUNNING AT beacon034-mic0
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
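
When the debug output is long, counting the UCM error lines per host can show quickly which side is failing. A minimal sketch, assuming the output has been saved to a file or is piped in; the function name `summarize_ucm_errors` is made up for illustration and is not part of Intel MPI or DAPL:

```shell
# Count DAPL UCM "ERR" lines per host in debug output read from stdin.
# Assumes the "<host>:UCM:<pid>:<tid>: ..." line layout seen in the output above.
summarize_ucm_errors() {
    awk -F: '/:UCM:/ && /ERR/ { count[$1]++ }
             END { for (h in count) print h, count[h] }' | sort
}
```

Fed the output above, this prints `beacon034-mic0 4`, i.e. all four error lines come from the coprocessor side.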

Additional Details

  • The same code, which is just hello world examples, runs fine on our old system configuration, which had an older kernel and MPSS stack version
  • The code runs fine on just the host side or on just a single MIC
  • Other providers, ofa-v2-mlx4_0-1u and ofa-v2-mcm-1, execute as expected
  • I'm more than happy to try other kernels or OFED versions as necessary, but would like to keep the MPSS 3.6.1 stack version shown below
  • The MPSS modules and OFED were built successfully against the minor kernel release

Host System Information

OS: CentOS 6.6

Kernel: 2.6.32-504.30.3.el6.x86_64

OFED: OFED-3.18-1-rc1

MPSS: 3.6.1

MPI: Intel 5.1.2.150

Compilers: Intel 2016.1.056

4 Replies
Loc_N_Intel
Employee

Hi William,

Let me take a look at this issue and get back to you. Thank you.

Loc_N_Intel
Employee

Hi William,

I tried I_MPI_DAPL_PROVIDER=ofa-v2-scif0-u and ran the hello program successfully on my system with a MIC card. My system is running RHEL 6.6 with MPSS 3.6.1 and OFED-3.18-1 installed.

What is the locked memory limit on your system? (Run "ulimit -a" to check.)
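
For context: RDMA memory registration pins pages, and pinned pages count against the max locked memory limit, so a small limit is a common cause of "Cannot allocate memory" from the verbs layer. A rough sketch of a check to run on both the host and the card; the function name `check_memlock` and the 64 MB default threshold are assumptions chosen for illustration, not documented requirements of DAPL or Intel MPI:

```shell
# Classify a max-locked-memory value (in KB, as printed by `ulimit -l`).
# $1: limit to check (defaults to the current shell's `ulimit -l`)
# $2: minimum acceptable value in KB (illustrative default: 65536 = 64 MB)
check_memlock() {
    limit="${1:-$(ulimit -l)}"
    min_kb="${2:-65536}"
    if [ "$limit" = "unlimited" ]; then
        echo "ok: max locked memory is unlimited"
    elif [ "$limit" -lt "$min_kb" ]; then
        echo "low: max locked memory is ${limit} KB"
    else
        echo "ok: max locked memory is ${limit} KB"
    fi
}
```

Calling `check_memlock` with no arguments inspects the current session; `check_memlock 64` prints `low: max locked memory is 64 KB`.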

William_Howell
Beginner

Thank you for your assistance. I've posted the ulimit output for both the host and the MICs below. The max locked memory does appear to be set to 64 KB on the cards. We currently don't have PAM enabled in the MICs' OS, and I cannot find a way to set the max locked memory for user sessions on the card side without enabling it. Is there a way to set it on the cards without PAM, so I can test whether this is the root of the problem?

On the host side we have:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 2066043
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 131072
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 2066043
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

 

On the co-processor we have:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 61236
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 61236
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
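
To compare just the relevant value between the two listings, the locked memory line can be pulled out of the `ulimit -a` output. A small sketch, assuming the bash-style layout shown above; the helper name `memlock_from_ulimit_a` is hypothetical:

```shell
# Print the "max locked memory" value from `ulimit -a` output read from stdin.
# Assumes the value is the last whitespace-separated field on that line.
memlock_from_ulimit_a() {
    awk '/^max locked memory/ { print $NF }'
}
```

Running `ulimit -a | memlock_from_ulimit_a` on each side gives `unlimited` for the host listing and `64` for the coprocessor listing above.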

 

William_Howell
Beginner

After enabling PAM on the MICs, the locked memory issue is gone. However, we are still having problems with the SCIF providers: trying to use the ofa-v2-scif0 provider fails. Running dtest between the host and the MICs produces the error below, while the same test between two MICs on the same host runs without issue.

 

[root@host ~]# dtest -P ofa-v2-scif0-u -t
16253 Running as server - ofa-v2-scif0-u v2 
16253 Local Address AF_INET6 - 4c79:ba29:385:: flowinfo(QPN)=0x1, port(LID)=0x3e8
16253 Server is waiting for client connection to send server info
16253 Server waiting for connect request on port b0c0
beacon047:UCM:3f7d:30952a00: 3236489 us(3236489 us!!!):  modify_qp_state: ERR type 2 qpn 0x6 gid 0x2b0d68000a80 (1) lid 0x3e9 port 1 state 1 mtu 4 rd 4 rnr 12 sl 0
host:UCM:3f7d:30952a00: 3236513 us(24 us):  DAPL ERR modify_qp_state Network is unreachable
host:UCM:3f7d:30952a00: 3236518 us(5 us):  ACCEPT_USR: QPS_RTR ERR Network is unreachable -> lid 3e9 qpn 6
16253 Error dat_cr_accept: DAT_INTERNAL_ERROR 
16253 Error connect_ep: DAT_INTERNAL_ERROR 
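
In the output above the server advertises port(LID)=0x3e8, while the failing QP state change targets lid 0x3e9. When comparing several such runs, it may help to extract the remote addressing fields from the error line. A sketch, assuming the field layout of the `modify_qp_state: ERR` line shown above; `qp_error_fields` is a made-up helper name:

```shell
# Print qpn/gid/lid as key=value pairs for each "modify_qp_state: ERR" line on stdin.
# Assumes each key is followed by its value as the next whitespace-separated field.
qp_error_fields() {
    awk '/modify_qp_state: ERR/ {
        out = ""
        for (i = 1; i < NF; i++) {
            if ($i == "qpn" || $i == "gid" || $i == "lid") {
                out = out (out == "" ? "" : " ") $i "=" $(i + 1)
            }
        }
        print out
    }'
}
```

For the dtest output above this prints `qpn=0x6 gid=0x2b0d68000a80 lid=0x3e9`.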