Solved: Scif_send error??

Vaios_B_ · ‎09-23-2014

Hello to you all,

A strange message appeared when stopping the MPSS service (MPSS 3.3.1 on CentOS 7):

systemd: Stopping Intel(R) MPSS control service...
Sep 23 10:21:18 localhost kernel: mic0: Transition from state online to shutdown
Sep 23 10:21:18 localhost kernel: mic1: Transition from state online to shutdown
Sep 23 10:21:29 localhost kernel: host: scif node 1 exiting
Sep 23 10:21:29 localhost kernel: scif_send to node: 1 port: 1089 failed with error -104
Sep 23 10:21:29 localhost kernel: host: scif node 2 exiting
Sep 23 10:21:29 localhost kernel: scif_send to node: 2 port: 1089 failed with error -104
Sep 23 10:21:50 localhost kernel: mic1: Transition from state shutdown to resetting
Sep 23 10:21:50 localhost kernel: mic0: Transition from state shutdown to resetting
Sep 23 10:21:52 localhost kernel: mic1: Resetting (Post Code 3C)
Sep 23 10:21:52 localhost kernel: mic0: Resetting (Post Code 3C)
Sep 23 10:21:53 localhost kernel: mic1: Resetting (Post Code 3d)
Sep 23 10:21:53 localhost kernel: mic0: Resetting (Post Code 3d)
Sep 23 10:21:54 localhost kernel: mic1: Resetting (Post Code 3d)
Sep 23 10:21:54 localhost kernel: mic0: Resetting (Post Code 3d)
Sep 23 10:21:55 localhost kernel: mic1: Resetting (Post Code 3d)
Sep 23 10:21:55 localhost kernel: mic0: Resetting (Post Code 3d)
Sep 23 10:21:56 localhost kernel: mic1: Resetting (Post Code 3d)
Sep 23 10:21:56 localhost kernel: mic0: Resetting (Post Code 3d)
Sep 23 10:21:57 localhost kernel: mic1: Resetting (Post Code 3E)
Sep 23 10:21:57 localhost kernel: mic0: Resetting (Post Code 3E)
Sep 23 10:21:58 localhost kernel: mic1: Resetting (Post Code 3E)
Sep 23 10:21:58 localhost kernel: mic0: Resetting (Post Code 3E)
Sep 23 10:21:59 localhost kernel: mic1: Resetting (Post Code 3E)
Sep 23 10:21:59 localhost kernel: mic0: Resetting (Post Code 3E)
Sep 23 10:22:00 localhost kernel: mic1: Resetting (Post Code 09)
Sep 23 10:22:00 localhost kernel: mic0: Resetting (Post Code 09)
Sep 23 10:22:01 localhost kernel: mic1: Resetting (Post Code 09)
Sep 23 10:22:01 localhost kernel: mic0: Resetting (Post Code 09)
Sep 23 10:22:02 localhost kernel: mic1: Resetting (Post Code 12)
Sep 23 10:22:02 localhost kernel: mic1: Transition from state resetting to ready
Sep 23 10:22:02 localhost kernel: mic0: Resetting (Post Code 12)
Sep 23 10:22:02 localhost kernel: mic0: Transition from state resetting to ready
Sep 23 10:22:08 localhost mpss: Shutting down Intel(R) MPSS: [ OK ]
Sep 23 10:22:08 localhost systemd: Stopped Intel(R) MPSS control service.

I looked for the error code but have not found anything that matches. Have you seen this before?

Thank you so much in advance.

Loc_N_Intel · ‎09-26-2014

Hi Vaios,

I see this scif_send error on my system too. Let me investigate this issue and get back to you. Thank you.

Frances_R_Intel · ‎09-26-2014

Vaios

Do you see this error every time?

Do you have power management turned off? (You can check by looking in the /etc/mpss/mic*.conf files.) If it is turned off, does turning it on have any effect? (micctrl --pm=set) I don't know that this will have any effect but it might be worth a try.
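For anyone who wants to script that check, here is a minimal Python sketch that lists the matching lines from the coprocessor configuration files. The path pattern comes from the post above; the keyword `PowerManagement` is my assumption about how the setting is named in those files, and `find_config_lines` is a hypothetical helper, not part of MPSS.

```python
import glob


def find_config_lines(pattern, keyword):
    """Return (path, line) pairs for non-comment lines whose first word is keyword."""
    matches = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as fh:
            for raw in fh:
                line = raw.strip()
                if line.startswith("#"):
                    continue  # skip comment lines
                # Compare only the first whitespace-separated token.
                if line.split(None, 1)[:1] == [keyword]:
                    matches.append((path, line))
    return matches


if __name__ == "__main__":
    # "PowerManagement" is an assumed keyword; adjust it to match your files.
    for path, line in find_config_lines("/etc/mpss/mic*.conf", "PowerManagement"):
        print(f"{path}: {line}")
```

If nothing is printed, either the files are not readable by the current user or the setting uses a different name than assumed here.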

Vaios_B_ · ‎09-29-2014

Thank you both,

Frances, I have already made that check but.....nada....

The message is repeated every time I stop the service (MPSS).

I will try to run a debugger when stopping the service to get some more detailed information.

Can this, by any chance, be considered a Normal Connection Termination message?

Just to be more specific: the error is displayed only during the transition from the online to the shutdown state. In all other cases we get normal stop/start messages.

Thank you for your efforts

BR

Vaios

Frances_R_Intel · ‎09-29-2014

Loc is looking for the definitive answer, but to your question "Can this, by any chance, be considered a Normal Connection Termination message?", I would say, yes, probably. You get this message if the host tries to send a message over the virtual interface after the operating system on the coprocessor has shut it down. (Errno 104 is connection reset by peer.) This message seems to be ubiquitous and the developers seem to consider it normal. But I don't know why the host is trying to talk to the coprocessor at this point. Is it just checking to make sure the OS on the coprocessor has gone away? Maybe. In any event, you have gotten my curiosity up and I personally will be interested to see what Loc digs up.
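The errno lookup mentioned above can be reproduced on the host with a few lines of Python. This is only a sketch: `decode_kernel_errno` is a hypothetical helper name, and it assumes the script runs on Linux, where the kernel log reports errors as negated errno values.

```python
import errno
import os


def decode_kernel_errno(code):
    """Map a (possibly negative) kernel return code to (symbolic name, message)."""
    number = abs(code)  # kernel logs show errno values negated
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


name, message = decode_kernel_errno(-104)
print(name, "-", message)
```

On a Linux host the name reported for -104 should be ECONNRESET, which matches the "connection reset by peer" reading of the log line.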

Vaios_B_ · ‎09-30-2014

It is true that the online-to-shutdown transition differs from the online-to-reset transition. But wouldn't it be safe to assume that the 104 reset error should appear in that case too?

Thank you

BR

Vaios

Frances_R_Intel · ‎09-30-2014

Actually, no. In a reset, the mpss on the host shuts down the scif without waiting for the coprocessor to take any action, then sends a low level reset message to the coprocessor, effectively destroying the scif on the coprocessor. The scif on the host doesn't come back up until you reboot the coprocessor, so the first remote scif connection it sees is the new coprocessor scif - hence no reset by peer.

Think of it in terms of what would happen if you were to shut down an ethernet interface. If you shut down an ethernet interface on a system, you don't get "reset by peer" messages on that system; it is resetting the ethernet interface on the remote system that causes the message.
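To make the analogy concrete, here is a small sketch using ordinary TCP sockets on the loopback interface. It is not SCIF code and says nothing about how the MPSS driver is implemented; it only shows that the "reset by peer" error appears on the side that did not close the connection. The function name `demo_reset_by_peer` is made up for this example, and the `SO_LINGER` layout assumes Linux.

```python
import socket
import struct
import threading


def demo_reset_by_peer():
    """Return the errno seen by the local side after the remote side resets."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    def remote_side():
        conn, _ = listener.accept()
        # SO_LINGER with on=1, timeout=0 makes close() send a TCP RST
        # instead of the normal FIN handshake.
        conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                        struct.pack("ii", 1, 0))
        conn.close()

    worker = threading.Thread(target=remote_side)
    worker.start()

    client = socket.create_connection(("127.0.0.1", port))
    client.settimeout(5)  # avoid hanging forever if no reset arrives
    worker.join()         # the remote side has now reset the connection
    try:
        client.recv(1)
        return None       # orderly close, no reset observed
    except ConnectionResetError as exc:
        return exc.errno
    finally:
        client.close()
        listener.close()


if __name__ == "__main__":
    print("errno seen by the local side:", demo_reset_by_peer())
```

The side that performs the abrupt close sees no error at all; only the other end gets the reset, which is the point of the Ethernet comparison.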

The thing that confuses me is that with the shutdown option, the host side scif gets a "reset by peer" rather than unreachable or timeout. The implication is that after the coprocessor shuts down, the coprocessor scif comes back up at least to some extent.

Loc_N_Intel · ‎10-01-2014

Hello,

After looking at the source code and talking with the expert, here is the root cause of the issue:

When users stop MPSS (e.g., service mpss stop) or shut down a coprocessor (e.g., micctrl -S mic0), in addition to the driver shutting down the coprocessor, the action triggers the host Power Management to independently send a SCIF message to the coprocessor in order to close the power management service. Because the coprocessor is already down, the SCIF message cannot be sent. The error message reflects the fact that the SCIF message was not sent successfully.

However, since the coprocessor already closed the PM service, this error message is not harmful and can be ignored.
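For anyone filtering these messages out of their logs, the following Python sketch flags a scif_send failure as expected when it carries error -104 and follows a "scif node N exiting" message for the same node. That rule is only my reading of the explanation above, not an official Intel tool, and the function name and regular expressions are illustrative, based solely on the log lines quoted in this thread.

```python
import re

EXIT_RE = re.compile(r"host: scif node (\d+) exiting")
SEND_RE = re.compile(
    r"scif_send to node: (\d+) port: (\d+) failed with error (-?\d+)"
)


def classify_scif_send_failures(lines):
    """Return (node, port, error, expected) tuples for each scif_send failure.

    A failure counts as expected when the node was already reported as
    exiting and the error is -104 (connection reset by peer).
    """
    exited = set()
    results = []
    for line in lines:
        match = EXIT_RE.search(line)
        if match:
            exited.add(int(match.group(1)))
            continue
        match = SEND_RE.search(line)
        if match:
            node, port, err = (int(group) for group in match.groups())
            expected = node in exited and err == -104
            results.append((node, port, err, expected))
    return results


sample = [
    "Sep 23 10:21:29 localhost kernel: host: scif node 1 exiting",
    "Sep 23 10:21:29 localhost kernel: scif_send to node: 1 port: 1089 failed with error -104",
]
for node, port, err, expected in classify_scif_send_failures(sample):
    status = "expected at shutdown" if expected else "needs a closer look"
    print(f"node {node} port {port} error {err}: {status}")
```

A failure that shows up without a preceding exit message for its node, or with a different error code, is left unflagged so it can still be investigated.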

I will generate an internal ticket to handle this case. Thank you for reporting this error message.

Vaios_B_ · ‎10-01-2014

Thank you so much for your actions and detailed answer.

Best Regards

Vaios