How do you read package temperature on NVME device inside of SPDK?

We are evaluating using SPDK as an internal framework to build a data recorder with NVMe devices.

Hard disks and SSDs have long had smartctl support, which reports package temperature, and smartctl now appears to handle NVMe devices as well. However, once SPDK is set up, the kernel driver that smartctl depends on is no longer attached to those devices, so smartctl stops working on them.

I'm finding references to temperature thresholds in the spec, but I am not finding a way to read the current device package temperature.

The SPDK under Linux looks like a nice performance package, but if it blocks getting basic health information on the underlying hardware, then it's a non-starter.

 

0 Kudos

Accepted Solutions
Highlighted
Moderator
45 Views

Hello, jsaar5.

 

Thank you for contacting Intel Community Support.

 

I have checked your ticket regarding SPDK and how to get the package temperature.

 

We can provide you with direct support when you are working with Intel tools, such as the Intel SSD Toolbox or Intel Data Center Tool. Since your question is directly related to SPDK, the best option right now would be to direct it to the support/community forum of the developer.

 

If you are directed back to us for any particular reason, just let me know.

 

I hope to hear from you soon.

 

Best regards,

 

Bruce C.

Intel Customer Support Technician

A Contingent Worker at Intel


0 Kudos
8 Replies
Highlighted
45 Views

I have discovered the "identify" example, which extracts the "health" information for the NVMe device. This is the SPDK/DPDK equivalent of the smartctl accesses. The device information, including package temperature, is accessible with this facility.
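
For reference, the temperature that "identify" reports comes from the NVMe SMART / Health Information log page (log identifier 02h), where the Composite Temperature is a 16-bit little-endian value in Kelvin at byte offsets 1 and 2. A minimal sketch of decoding it, assuming a raw copy of that log page has been saved to a file (the dump file and the function name are hypothetical, and obtaining the dump is not shown):

```shell
# Decode Composite Temperature from a raw SMART / Health Information log page.
# Byte 0 is Critical Warning; bytes 1-2 hold the temperature in Kelvin,
# least significant byte first. "$1" is a hypothetical dump file.
composite_temp() {
    # od prints the two bytes as unsigned decimals, e.g. " 80   1"
    set -- $(od -An -tu1 -j1 -N2 "$1")
    [ $# -eq 2 ] || return 1
    kelvin=$(( $1 + $2 * 256 ))
    # Same Kelvin-to-Celsius offset as the readings in this thread (336 K = 63 C)
    echo "$kelvin Kelvin ($(( kelvin - 273 )) Celsius)"
}
```

For example, bytes 0x50 0x01 at offsets 1 and 2 decode to 0x0150 = 336 Kelvin.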

 

However, the examples as delivered will not allow an exerciser like "perf" to be run concurrently with "identify". The NVMe devices look to be allocated and locked by one process or the other.

 

The docs say that the NVMe interface should be "shareable" - that is, two processes can access a single NVMe module, send commands and get responses in a sensible way. I'm now puzzling over how to modify the "perf" and "identify" examples to get this to happen (or if it's possible at all). The docs on this locking/sharing process are a bit thin.

 

The next place to try is to include a "health" query in-line with the "perf" operation to query the "health" of the device periodically. The results of the "health" query would be put in a shared memory object to be queried asynchronously by a "health monitor" process.
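
One way to sketch the hand-off described above is with a file on a memory-backed filesystem standing in for a true shared memory object. The directory (/dev/shm), the file name, and the function names are all assumptions for illustration; a version built into "perf" itself would be written in C instead.

```shell
# Publish a health snapshot so a separate monitor process can read it at any
# time. The snapshot is written to a temporary file and then renamed into
# place; a rename within one filesystem is atomic, so a reader sees either
# the old snapshot or the new one, never a partial write.
# HEALTH_DIR and the file name "nvme_health" are hypothetical.
HEALTH_DIR="${HEALTH_DIR:-/dev/shm}"

publish_health() {
    # stdin: snapshot text produced by the health query
    tmp=$(mktemp "$HEALTH_DIR/nvme_health.XXXXXX") || return 1
    cat > "$tmp" && chmod 0644 "$tmp" && mv -f "$tmp" "$HEALTH_DIR/nvme_health"
}

read_health() {
    # Called by the monitor process whenever it wants the latest snapshot
    cat "$HEALTH_DIR/nvme_health"
}
```

The writer would call publish_health after each periodic query, and the monitor would call read_health on its own schedule.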

0 Kudos
Highlighted
45 Views

And this morning I found that the "NVMe Multi Process" feature is included in both "identify" and "perf", and they can be used concurrently if the feature is activated. It is buried under the option title "shared memory group ID", and I am not clear on what exactly it is doing [yet]. https://spdk.io/doc/nvme.html

 

$ while sleep 5 ; do date ; ./identify -i 1 |& egrep '^NVMe Control|Current Temperature' | while read a ; do read b ; echo $a $b ; done ; done

Mon Nov 18 09:54:55 PST 2019

NVMe Controller at 0000:53:00.0 [144d:a808] Current Temperature: 336 Kelvin (63 Celsius)

NVMe Controller at 0000:54:00.0 [144d:a808] Current Temperature: 337 Kelvin (64 Celsius)

NVMe Controller at 0000:55:00.0 [144d:a808] Current Temperature: 338 Kelvin (65 Celsius)

NVMe Controller at 0000:56:00.0 [144d:a808] Current Temperature: 334 Kelvin (61 Celsius)

Mon Nov 18 09:55:00 PST 2019

NVMe Controller at 0000:53:00.0 [144d:a808] Current Temperature: 336 Kelvin (63 Celsius)

NVMe Controller at 0000:54:00.0 [144d:a808] Current Temperature: 337 Kelvin (64 Celsius)

NVMe Controller at 0000:55:00.0 [144d:a808] Current Temperature: 338 Kelvin (65 Celsius)

NVMe Controller at 0000:56:00.0 [144d:a808] Current Temperature: 334 Kelvin (61 Celsius)

 

Start up “perf” in another window:

 

$ nohup ./perf -i 1 -q 128 -o 4096 -w randwrite -t 3600 -c 0x000f -D -LL -r 'trtype:PCIe traddr:0000.53.00.0' -r 'trtype:PCIe traddr:0000.54.00.0' -r 'trtype:PCIe traddr:0000.55.00.0' -r 'trtype:PCIe traddr:0000.56.00.0'

 

Now watch the package temperature rise:

 

Mon Nov 18 09:55:05 PST 2019

NVMe Controller at 0000:53:00.0 [144d:a808] Current Temperature: 339 Kelvin (66 Celsius)

NVMe Controller at 0000:54:00.0 [144d:a808] Current Temperature: 340 Kelvin (67 Celsius)

NVMe Controller at 0000:55:00.0 [144d:a808] Current Temperature: 341 Kelvin (68 Celsius)

NVMe Controller at 0000:56:00.0 [144d:a808] Current Temperature: 337 Kelvin (64 Celsius)

Mon Nov 18 09:55:11 PST 2019

NVMe Controller at 0000:53:00.0 [144d:a808] Current Temperature: 342 Kelvin (69 Celsius)

NVMe Controller at 0000:54:00.0 [144d:a808] Current Temperature: 343 Kelvin (70 Celsius)

NVMe Controller at 0000:55:00.0 [144d:a808] Current Temperature: 344 Kelvin (71 Celsius)

NVMe Controller at 0000:56:00.0 [144d:a808] Current Temperature: 340 Kelvin (67 Celsius)

Mon Nov 18 09:55:16 PST 2019

NVMe Controller at 0000:53:00.0 [144d:a808] Current Temperature: 345 Kelvin (72 Celsius)

NVMe Controller at 0000:54:00.0 [144d:a808] Current Temperature: 346 Kelvin (73 Celsius)

NVMe Controller at 0000:55:00.0 [144d:a808] Current Temperature: 346 Kelvin (73 Celsius)

NVMe Controller at 0000:56:00.0 [144d:a808] Current Temperature: 342 Kelvin (69 Celsius)

Mon Nov 18 09:55:21 PST 2019

NVMe Controller at 0000:53:00.0 [144d:a808] Current Temperature: 348 Kelvin (75 Celsius)

NVMe Controller at 0000:54:00.0 [144d:a808] Current Temperature: 349 Kelvin (76 Celsius)

NVMe Controller at 0000:55:00.0 [144d:a808] Current Temperature: 349 Kelvin (76 Celsius)

NVMe Controller at 0000:56:00.0 [144d:a808] Current Temperature: 345 Kelvin (72 Celsius)

Mon Nov 18 09:55:26 PST 2019

NVMe Controller at 0000:53:00.0 [144d:a808] Current Temperature: 350 Kelvin (77 Celsius)

NVMe Controller at 0000:54:00.0 [144d:a808] Current Temperature: 351 Kelvin (78 Celsius)

NVMe Controller at 0000:55:00.0 [144d:a808] Current Temperature: 352 Kelvin (79 Celsius)

NVMe Controller at 0000:56:00.0 [144d:a808] Current Temperature: 347 Kelvin (74 Celsius)

Mon Nov 18 09:55:32 PST 2019

NVMe Controller at 0000:53:00.0 [144d:a808] Current Temperature: 353 Kelvin (80 Celsius)

NVMe Controller at 0000:54:00.0 [144d:a808] Current Temperature: 354 Kelvin (81 Celsius)

NVMe Controller at 0000:55:00.0 [144d:a808] Current Temperature: 354 Kelvin (81 Celsius)

NVMe Controller at 0000:56:00.0 [144d:a808] Current Temperature: 349 Kelvin (76 Celsius)

Mon Nov 18 09:55:37 PST 2019

NVMe Controller at 0000:53:00.0 [144d:a808] Current Temperature: 355 Kelvin (82 Celsius)

NVMe Controller at 0000:54:00.0 [144d:a808] Current Temperature: 356 Kelvin (83 Celsius)

NVMe Controller at 0000:55:00.0 [144d:a808] Current Temperature: 356 Kelvin (83 Celsius)

NVMe Controller at 0000:56:00.0 [144d:a808] Current Temperature: 351 Kelvin (78 Celsius)

Mon Nov 18 09:55:42 PST 2019

NVMe Controller at 0000:53:00.0 [144d:a808] Current Temperature: 356 Kelvin (83 Celsius)

NVMe Controller at 0000:54:00.0 [144d:a808] Current Temperature: 356 Kelvin (83 Celsius)

NVMe Controller at 0000:55:00.0 [144d:a808] Current Temperature: 356 Kelvin (83 Celsius)

NVMe Controller at 0000:56:00.0 [144d:a808] Current Temperature: 352 Kelvin (79 Celsius)

Mon Nov 18 09:55:47 PST 2019

NVMe Controller at 0000:53:00.0 [144d:a808] Current Temperature: 356 Kelvin (83 Celsius)

NVMe Controller at 0000:54:00.0 [144d:a808] Current Temperature: 356 Kelvin (83 Celsius)

NVMe Controller at 0000:55:00.0 [144d:a808] Current Temperature: 356 Kelvin (83 Celsius)

NVMe Controller at 0000:56:00.0 [144d:a808] Current Temperature: 354 Kelvin (81 Celsius)

Mon Nov 18 09:55:53 PST 2019

NVMe Controller at 0000:53:00.0 [144d:a808] Current Temperature: 356 Kelvin (83 Celsius)

NVMe Controller at 0000:54:00.0 [144d:a808] Current Temperature: 356 Kelvin (83 Celsius)

NVMe Controller at 0000:55:00.0 [144d:a808] Current Temperature: 356 Kelvin (83 Celsius)

NVMe Controller at 0000:56:00.0 [144d:a808] Current Temperature: 356 Kelvin (83 Celsius)

Mon Nov 18 09:55:58 PST 2019

NVMe Controller at 0000:53:00.0 [144d:a808] Current Temperature: 356 Kelvin (83 Celsius)

NVMe Controller at 0000:54:00.0 [144d:a808] Current Temperature: 356 Kelvin (83 Celsius)

NVMe Controller at 0000:55:00.0 [144d:a808] Current Temperature: 356 Kelvin (83 Celsius)

NVMe Controller at 0000:56:00.0 [144d:a808] Current Temperature: 356 Kelvin (83 Celsius)

Mon Nov 18 09:56:03 PST 2019

NVMe Controller at 0000:53:00.0 [144d:a808] Current Temperature: 356 Kelvin (83 Celsius)

NVMe Controller at 0000:54:00.0 [144d:a808] Current Temperature: 356 Kelvin (83 Celsius)

NVMe Controller at 0000:55:00.0 [144d:a808] Current Temperature: 356 Kelvin (83 Celsius)

NVMe Controller at 0000:56:00.0 [144d:a808] Current Temperature: 356 Kelvin (83 Celsius)
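
The polling one-liner above relies on the controller line and the temperature line strictly alternating. A slightly more defensive sketch in awk pairs each temperature with the most recent controller line and prints one record per device. The line formats are taken from the output shown here and may differ in other SPDK versions; the function name is made up.

```shell
# Pair each "NVMe Controller at ..." line with the "Current Temperature" line
# that follows it, and print "<pci-address> <kelvin> <celsius>" per controller.
# Reads identify-style text on stdin.
pair_temps() {
    awk '
        /^NVMe Controller at / { addr = $4 }
        /Current Temperature:/ {
            # Line looks like: "Current Temperature: 336 Kelvin (63 Celsius)"
            for (i = 1; i < NF; i++) if ($i == "Temperature:") k = $(i + 1)
            # Emit only when a controller line has been seen for this reading
            if (addr != "") { print addr, k, k - 273; addr = "" }
        }
    '
}
```

With the same options as above, this would be used as: ./identify -i 1 2>&1 | pair_temps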

0 Kudos
Highlighted
45 Views

When "identify" operates on each device, "perf" dumps messages to its log:

 

EAL: Driver cannot detach the device (0000:53:00.0)

EAL: Failed to hotplug remove device on primary

EAL: Driver cannot detach the device (0000:54:00.0)

EAL: Failed to hotplug remove device on primary

EAL: Driver cannot detach the device (0000:55:00.0)

EAL: Failed to hotplug remove device on primary

EAL: Driver cannot detach the device (0000:56:00.0)

EAL: Failed to hotplug remove device on primary

0 Kudos
Highlighted
45 Views

nvme/hello_world can be extended to take advantage of this feature by populating spdk_env_opts.shm_id with the same shared ID N put in "perf -i N ..."

 

Note here I've also extended hello_world to populate the pci_blacklist to avoid a troublesome device, and I used "-i N" on hello_world for this feature just to be confusing ("perf" from above is still running):

 

$ ./hello_world -B 1b:00.0

Starting SPDK v20.01-pre git sha1 dc6d89b / DPDK 19.08.0 initialization...

[ DPDK EAL parameters: hello_world --no-shconf -c 0x1 --pci-blacklist=0000:1b:00.0 --log-level=lib.eal:6 --log-level=lib.cryptodev:5 --log-level=user1:6 --base-virtaddr=0x200000000000 --match-allocations --file-prefix=spdk_pid90532 ]

Initializing NVMe Controllers

Attaching to 0000:53:00.0

Cannot create lock on device /tmp/spdk_pci_lock_0000:53:00.0, probably process 89678 has claimed it

nvme_pcie.c: 814:nvme_pcie_ctrlr_construct: *ERROR*: could not claim device 0000:53:00.0 (Permission denied)

nvme.c: 428:nvme_ctrlr_probe: *ERROR*: Failed to construct NVMe controller for SSD: 0000:53:00.0

EAL: Requested device 0000:53:00.0 cannot be used

Attaching to 0000:54:00.0

Cannot create lock on device /tmp/spdk_pci_lock_0000:54:00.0, probably process 89678 has claimed it

nvme_pcie.c: 814:nvme_pcie_ctrlr_construct: *ERROR*: could not claim device 0000:54:00.0 (Permission denied)

nvme.c: 428:nvme_ctrlr_probe: *ERROR*: Failed to construct NVMe controller for SSD: 0000:54:00.0

EAL: Requested device 0000:54:00.0 cannot be used

Attaching to 0000:55:00.0

Cannot create lock on device /tmp/spdk_pci_lock_0000:55:00.0, probably process 89678 has claimed it

nvme_pcie.c: 814:nvme_pcie_ctrlr_construct: *ERROR*: could not claim device 0000:55:00.0 (Permission denied)

nvme.c: 428:nvme_ctrlr_probe: *ERROR*: Failed to construct NVMe controller for SSD: 0000:55:00.0

EAL: Requested device 0000:55:00.0 cannot be used

Attaching to 0000:56:00.0

Cannot create lock on device /tmp/spdk_pci_lock_0000:56:00.0, probably process 89678 has claimed it

nvme_pcie.c: 814:nvme_pcie_ctrlr_construct: *ERROR*: could not claim device 0000:56:00.0 (Permission denied)

nvme.c: 428:nvme_ctrlr_probe: *ERROR*: Failed to construct NVMe controller for SSD: 0000:56:00.0

EAL: Requested device 0000:56:00.0 cannot be used

no NVMe controllers found

 

 

$ ./hello_world -B 1b:00.0 -i 1

Starting SPDK v20.01-pre git sha1 dc6d89b / DPDK 19.08.0 initialization...

[ DPDK EAL parameters: hello_world -c 0x1 --pci-blacklist=0000:1b:00.0 --log-level=lib.eal:6 --log-level=lib.cryptodev:5 --log-level=user1:6 --base-virtaddr=0x200000000000 --match-allocations --file-prefix=spdk1 --proc-type=auto ]

Initializing NVMe Controllers

Attached to 0000:53:00.0

Using controller Samsung SSD 970 EVO (S4Z7NG0M708615T    ) with 1 namespaces.

 Namespace ID: 1 size: 2000GB

Attached to 0000:54:00.0

Using controller Samsung SSD 970 EVO (S4Z7NG0M708617Y    ) with 1 namespaces.

 Namespace ID: 1 size: 2000GB

Attached to 0000:55:00.0

Using controller Samsung SSD 970 EVO (S4Z7NG0M708610N    ) with 1 namespaces.

 Namespace ID: 1 size: 2000GB

Attached to 0000:56:00.0

Using controller Samsung SSD 970 EVO (S4Z7NG0M708614V    ) with 1 namespaces.

 Namespace ID: 1 size: 2000GB

Initialization complete.

INFO: using host memory buffer for IO

Hello world!

INFO: using host memory buffer for IO

Hello world!

INFO: using host memory buffer for IO

Hello world!

INFO: using host memory buffer for IO

Hello world!

EAL: Failed to hotplug remove device

EAL: Failed to hotplug remove device

EAL: Failed to hotplug remove device

EAL: Failed to hotplug remove device

0 Kudos
Highlighted
45 Views

These "EAL: Failed to hotplug remove device" messages are non-fatal artifacts in SPDK (I am using v20.1.0).

 

https://github.com/spdk/spdk/issues/701

 

Those messages are somewhat expected in multi-process w/ DPDK 18.11.

 

SPDK calls rte_pci_detach() on shutdown in each process and for DPDK <18.11 this resulted in releasing PCI resources for that single process - meaning it had to be called in each process separately. With DPDK 18.11 rte_pci_detach() does an internal IPC and tries to detach the device in all the processes within a shared-memory group, meaning it has to be called just once. SPDK, as a stopgap, fails those "unexpected" detach requests that came through IPC. It results in those error messages being printed, but doesn't cause any issues. SPDK still requires some work on the device management, but since the only visible drawback of current implementation are those messages, this is not a priority for now.

0 Kudos
Highlighted
Moderator
45 Views

Hello, jsaar5.

 

Thank you for sharing your findings.

 

As previously mentioned, the best option is to share this with the community/developers of the tool you are currently using since they or other users may be able to provide more insight.

 

Let us know if we can proceed to close the ticket.

 

Best regards,

 

Bruce C.

Intel Customer Support Technician

A Contingent Worker at Intel

0 Kudos
Highlighted
Moderator
45 Views

Hello, jsaar5.

 

I wanted to follow up on your ticket before closing it, in case there is anything we can do for you.

 

Best regards,

 

Bruce C.

Intel Customer Support Technician

A Contingent Worker at Intel

0 Kudos