Intel® Gaudi® AI Accelerator
Support for the Intel® Gaudi® AI Accelerator

ERROR patching node labels, Invalid value

egivandor82
Novice

Hi,

This is my board name:

cat /sys/devices/virtual/dmi/id/product_name
Standard PC (i440FX + PIIX, 1996)

and I can see the following error in my log:

ERROR patching node labels: Node "xx.xx.xx.xx" is invalid: metadata.labels: Invalid value: "Standard_PC_(i440FX_+_PIIX__1996": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue', or 'my_value', or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')
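
The rejection can be reproduced outside the cluster. Below is a minimal sketch that uses only the regex and the value quoted in the error message. `isValidLabelValue` is a hypothetical helper name, not the function Kubernetes itself calls, and the 63-character cap is the usual Kubernetes label-value limit rather than something stated in this log.

```go
package main

import (
	"fmt"
	"regexp"
)

// labelValueRe is the regex quoted in the error message, anchored so that
// it must match the whole string.
var labelValueRe = regexp.MustCompile(`^(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?$`)

// isValidLabelValue is a hypothetical helper (assumption): it reports
// whether v would pass the label-value check described in the error.
func isValidLabelValue(v string) bool {
	return len(v) <= 63 && labelValueRe.MatchString(v)
}

func main() {
	for _, v := range []string{"Standard_PC_(i440FX_+_PIIX__1996", "my_value", ""} {
		fmt.Printf("%q valid=%v\n", v, isValidLabelValue(v))
	}
}
```

The value from the log fails because it still contains `(` and `+`, which are outside the allowed character set.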

Node Feature Discovery collects this information:

dmiIDAttributeNames := []string{"sys_vendor", "product_name"}

Does anybody have experience with this? Is it a Habana operator issue or an NFD issue, and is it safe to ignore or suppress?

Thank you

James_Edwards
Employee

An "ERROR patching node labels" message indicates that your attempt to modify labels on a Kubernetes node has failed, likely due to issues with the label data you provided, access permissions, or a problem with the node itself.

Could you provide the following information:

1) What log does this error message appear in?

2) Did this occur when you were deploying a Habana operator in your cluster?

egivandor82
Novice

Hi,

Thank you for your response.

1) The message appears in the log of pod/habana-ai-feature-discovery-ds.

2) I'm using

helm repo add gaudi-helm https://vault.habana.ai/artifactory/api/helm/gaudi-helm
helm install habana-ai-operator gaudi-helm/habana-ai-operator --version 1.19.1-26 -n habana-ai-operator

for installation.

The issue is the following:

cat /sys/devices/virtual/dmi/id/product_name
Standard PC (i440FX + PIIX, 1996)

product_name contains characters that are not allowed in a Kubernetes label value (spaces, parentheses, '+' and a comma).

Node Feature Discovery's system/system.go (https://github.com/kubernetes-sigs/node-feature-discovery) finds this value at line 106

dmiIDAttributeNames := []string{"sys_vendor", "product_name"}

and, I think, tries to set it as a node label.

My question is: can I ignore this error? Is it an NFD issue or a Habana operator issue?

Thank you,

James_Edwards
Employee

Each feature discovery pod labels the Kubernetes node it is running on. These labels describe Gaudi device availability and the driver version, and they help deploy other DaemonSets to the appropriate nodes, so a labeling failure could affect deployment. However, a Standard PC isn't going to support Gaudi hardware, so it is doubtful that this will cause any harm in your cluster, even though the name was parsed incorrectly. I think this is OK to suppress, as no Gaudi devices are being detected and Gaudi drivers will not be installed. I will, however, open a ticket with the R&D team to look into the parsing issue.

egivandor82
Novice

Thank you for your help.

When I run the hfd binary from the habana-ai-operator/habanalabs-feature-discovery image, it prints the same value:

habana.ai/product.name=Standard_PC_(i440FX_+_PIIX__1996

This is not a standard PC; it appears to be a QEMU machine. This is the product name reported on a node when you are using IBM Cloud.

In a cluster that mixes nodes with and without Gaudi hardware, this error message will appear on the machines that are part of the cluster but have no Gaudi hardware. That is why it is important to fix this issue.

Thank you!

pallavijaini
Employee

The issue we are seeing here comes from Node Feature Discovery, which internally calls the Kubernetes apimachinery to validate label values.

Node Feature Discovery code - https://github.com/kubernetes-sigs/node-feature-discovery/blob/master/pkg/apis/nfd/validate/validate.go#L116

apimachinery code - https://github.com/kubernetes/apimachinery/blob/v0.32.2/pkg/util/validation/validation.go#L171

Is the product name error above causing any issues with seeing or using the Habana devices?

egivandor82
Novice

In our case, I think it does not cause any error, because the product name of the node that has Gaudi hardware is different. The only effect is a growing number of error lines in the logs of workers that have no Gaudi hardware but do have a product name containing "invalid" characters (parentheses).

Unfortunately, I was not able to reproduce the issue with Node Feature Discovery itself, only with habanalabs-feature-discovery.

pallavijaini
Employee

Yes, the issue is coming from habanalabs-feature-discovery.

We need to add more validation to ensure the machine type matches the format required for NFD labels, '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?', and update the name accordingly.

This will require modifying the existing code, as seen in this reference: https://github.com/HabanaAI/habanalabs-feature-discovery/blob/master/main.go#L151

egivandor82
Novice

Thank you.

Is there any workaround until the fix?

For node feature discovery there is /etc/kubernetes/node-feature-discovery/nfd-worker.conf

For habana-container-runtime there is a configmap.

Is there a setting somewhere to skip labeling product_name?

James_Edwards
Employee

I do not think Development will provide or document a workaround, as they are focused on a fix. I will try to find out which version will contain the fix.

egivandor82
Novice
James_Edwards
Employee

Thank you for your patience.
