Xeon Phi crashes on too-large SCIF memory registration

Bryant_L_
New Contributor I

Is there a SCIF mechanism for registering a memory region with all endpoints at once? At the moment, I call scif_register() on the region in a for-loop, once per endpoint. Memory registration is rather expensive, and I would like to avoid paying that cost repeatedly if there is a faster way to register with all endpoints.
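For readers following along, the per-endpoint loop described above might look roughly like the sketch below. It is not the poster's code. To keep it compilable without the MPSS headers, the real scif_register() call is hidden behind a hypothetical callback type (register_fn), endpoints are plain ints, and fake_register is a made-up stand-in used only to exercise the loop. The one thing the sketch adds is a check of each return value, so a failed registration is noticed before moving on to the next endpoint.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical callback type: in real code this would wrap scif_register()
 * for a single endpoint. Returns the registered offset, or -1 on failure. */
typedef off_t (*register_fn)(int epd, void *addr, size_t len);

/* Register one region with every endpoint, stopping at the first failure.
 * Returns how many endpoints were registered; offsets_out[i] is valid
 * for i below that count. */
static size_t register_with_all(const int *epds, size_t n_epds,
                                void *addr, size_t len,
                                register_fn reg, off_t *offsets_out)
{
    assert(epds != NULL && reg != NULL && offsets_out != NULL);
    size_t done = 0;
    for (; done < n_epds; done++) {
        off_t off = reg(epds[done], addr, len);
        if (off < 0)
            break;              /* do not keep going after a failure */
        offsets_out[done] = off;
    }
    return done;
}

/* Stand-in for illustration only: pretends registration fails above 1 GiB
 * and otherwise returns a made-up offset derived from the endpoint. */
static off_t fake_register(int epd, void *addr, size_t len)
{
    (void)addr;
    if (len > ((size_t)1 << 30))
        return -1;
    return (off_t)epd * 0x1000;
}
```

This still pays the registration cost once per endpoint; it only makes partial failure visible to the caller.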

With my current method, if the memory region is sufficiently large (e.g., 6 GB+), the coprocessor crashes during scif_register():

  1. Error occurs: "Connection to mic0 closed by remote host." and the ssh connection drops.
  2. Further ssh connection attempts fail.
  3. `micctrl -s' still reports "online", but `micctrl --reboot mic0' stalls with status "shutdown". Only power-cycling the host platform restores operation.

System Info:

  • Xeon Phi 5110P; MPSS 3.5.1

EDIT 20150706-1517EST: 3.2 GB works. 3.8 GB and above will crash the device.

Frances_R_Intel
Employee

I am not sure exactly what you mean by "to register a memory region with all endpoints". Do you mean you have set up multiple endpoints on the coprocessor and are trying to register the same physical address multiple times, once into the individual registered address space for each endpoint? Perhaps if you explained more about what you were trying to do (what purpose you are setting this up for), someone on the forum might be able to propose a less costly alternative.

As for why the coprocessor goes off into the weeds when you try to register 3.8 GB of memory: I believe you are, as you suspect, running out of available memory. Remember that the 5110P has only 8 GB of memory at most. Some of this will be used as RAM disk (unless you are NFS-mounting the root directory), and some will be used by the kernel. I don't know how big your program is, beyond the fact that you are allocating at least 3.8 GB of memory at some point, but running out of memory seems the most likely culprit. You might want to use micsmc to check how much memory is in use as the program runs.
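As a complement to watching micsmc, a program running on the card could do a rough headroom check before registering a large region. The sketch below is only an illustration under assumptions: it assumes the coprocessor's Linux exposes the usual /proc/meminfo text with a MemFree line, that MemFree is a reasonable proxy for what registration can use, and that the caller has already read that text into a string. The function names and the reserve parameter are made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the kB value for `key` (e.g. "MemFree") found in
 * /proc/meminfo-style text, or -1 if the key is absent or malformed. */
static long long meminfo_kb(const char *text, const char *key)
{
    assert(text != NULL && key != NULL);
    size_t klen = strlen(key);
    const char *line = text;
    while (line != NULL && *line != '\0') {
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long long kb;
            if (sscanf(line + klen + 1, "%lld", &kb) == 1)
                return kb;
            return -1;
        }
        line = strchr(line, '\n');   /* advance to the next line */
        if (line != NULL)
            line++;
    }
    return -1;
}

/* Return 1 if `bytes` fits in MemFree while leaving `reserve_kb` untouched,
 * 0 otherwise (including when MemFree cannot be read). */
static int fits_in_free_memory(const char *text, unsigned long long bytes,
                               long long reserve_kb)
{
    long long free_kb = meminfo_kb(text, "MemFree");
    if (free_kb < 0 || free_kb < reserve_kb)
        return 0;
    unsigned long long need_kb = (bytes + 1023ULL) / 1024ULL; /* round up */
    return (unsigned long long)(free_kb - reserve_kb) >= need_kb;
}
```

A caller might read /proc/meminfo into a buffer, pass it to fits_in_free_memory() along with the region size and a reserve large enough to keep sshd and the shutdown path alive, and skip the registration when the check fails. How large that reserve should be is a judgment call; nothing in this thread establishes a specific figure.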

Remember that the coprocessor has no swap space. If it is out of memory, it has nowhere to swap anything out, so it cannot free up the space it needs to run shutdown. Instead of using the reboot option, try using reset, then boot. Reset doesn't do a shutdown; it is concerned only with getting the coprocessor back to a state where it is listening for a boot request.
