Intel® MPI Library
Get help with building, analyzing, optimizing, and scaling high-performance computing (HPC) applications.

hpl_pairwise failure using clck

Frank_Fu_
Beginner

Hi,

 

I have 8 nodes, split into two groups.

I first ran the health_user test on all 8 nodes, and no issues were detected.

I therefore assume the 8 nodes have identical environments.

 

Then I ran hpl_cluster_performance on each group.

The first group reports no issues, while the second group reports "hpl-pairwise-failed".
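
For readers reproducing this kind of per-group run, here is a minimal sketch. It assumes one nodefile per group with one host name per line. The host names (node01 to node08), the file names group1.nodes and group2.nodes, and both helper functions are hypothetical; only the `clck -F <framework> -f <nodefile>` form is taken from this thread. If clck is not installed, the sketch just prints the command it would run.

```shell
#!/usr/bin/env bash
# Sketch: run one Cluster Checker framework per node group.
# Host names and file names are placeholders, not taken from the thread.

# Write one host name per line into a nodefile.
write_nodefile() {
    local file="$1"
    shift
    printf '%s\n' "$@" > "$file"
}

# Run the framework on one group, or print the command if clck is absent.
run_group() {
    local framework="$1" nodefile="$2"
    if command -v clck >/dev/null 2>&1; then
        clck -F "$framework" -f "$nodefile"
    else
        echo "clck -F $framework -f $nodefile"
    fi
}

write_nodefile group1.nodes node01 node02 node03 node04
write_nodefile group2.nodes node05 node06 node07 node08

for nf in group1.nodes group2.nodes; do
    run_group hpl_cluster_performance "$nf"
done
```

Keeping the two groups in separate nodefiles makes it easy to rerun only the failing group later.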

 

I have attached the databases and logs from the health_user run and from the hpl tests on group 1 and group 2.

Could you give some insight into how the configuration differs between the two groups?

How can I resolve the hpl-pairwise-failed issue?

AishwaryaCV_Intel
Moderator

Hi,


Thanks for providing your observations. We are working on your issue and will get back to you soon.


Thanks And Regards,

Aishwarya


AishwaryaCV_Intel
Moderator

Hi,

 

Apologies for the delay in my response.

 

Could you please try running the following command to get more information out of the database file:

clckdb -D hpl_cluster_performance.db --provider hpl_pairwise
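
When there is one database per group, the same query can be run against each and saved to a log for attaching to the thread. The sketch below assumes database files named group1.db and group2.db, which are hypothetical names; the `clckdb -D <db> --provider hpl_pairwise` form is the one given above. The `CLCKDB` variable is an illustrative override for the binary name and is not a Cluster Checker setting.

```shell
#!/usr/bin/env bash
# Sketch: save the hpl_pairwise rows of each group's database to a log file.
# group1.db and group2.db are placeholder names.

# CLCKDB may point at a different clckdb binary; defaults to the one on PATH.
dump_hpl_pairwise() {
    local db="$1" log="$2"
    "${CLCKDB:-clckdb}" -D "$db" --provider hpl_pairwise > "$log" 2>&1
}

for db in group1.db group2.db; do
    if [ -f "$db" ] && command -v "${CLCKDB:-clckdb}" >/dev/null 2>&1; then
        dump_hpl_pairwise "$db" "${db%.db}_hpl_pairwise.log"
    else
        echo "skipping $db (database or clckdb not found)"
    fi
done
```

Comparing the two logs side by side is one way to see which node pairs report errors in the failing group.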

 

 

The issue also appears to be numerical rather than performance-related. For more information, run the extended health framework:

 

$ clck -F health_extended_user

 

 

Are you using any virtual nodes? Could you also provide us with the OS and processor details?
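
One way to gather the OS, processor, and virtualization details on a node is sketched below. It uses only standard Linux tools (`uname`, `lscpu`, `systemd-detect-virt`) and falls back gracefully when one is missing. The function name `collect_node_info` is hypothetical, and the choice of fields is an assumption about what is useful to report, not an Intel requirement.

```shell
#!/usr/bin/env bash
# Sketch: print OS, CPU, and virtualization details for the local node.

collect_node_info() {
    echo "== OS =="
    if [ -r /etc/os-release ]; then
        grep '^PRETTY_NAME=' /etc/os-release
    else
        uname -s
    fi
    echo "kernel: $(uname -r)"

    echo "== Processor =="
    if command -v lscpu >/dev/null 2>&1; then
        # LC_ALL=C keeps the field names in English so the filter matches.
        LC_ALL=C lscpu | grep -E '^(Model name|Socket\(s\)|Core\(s\) per socket|Thread\(s\) per core|Hypervisor vendor):'
    else
        grep -m1 'model name' /proc/cpuinfo 2>/dev/null || echo "unknown"
    fi

    echo "== Virtualization =="
    if command -v systemd-detect-virt >/dev/null 2>&1; then
        # Prints e.g. "vmware" in a VM; prints "none" (non-zero exit) on bare metal.
        systemd-detect-virt || true
    else
        echo "systemd-detect-virt not available"
    fi
}

collect_node_info
```

Running this on one node from each group and comparing the output is a quick check that the two groups really match.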

NOTE: I would also recommend upgrading to the latest oneAPI version. MPI may be failing because it uses a KNL definition.

 

Thanks And Regards,

Aishwarya

 

AishwaryaCV_Intel
Moderator

Hi,  


We haven't heard back from you. Could you please provide the details requested in my previous response?


Thank you and best regards, 

Aishwarya



Frank_Fu_
Beginner

Hi Aishwarya,

 

Sorry for the late reply.

I am running RHEL 8.1 on vSphere 8b.

The processor is an Intel(R) Xeon(R) Gold 6248R CPU.

Here is my current version:

$ clck -v
Intel(R) Cluster Checker 2021 Update 7 (build 20230112)

 

I have attached the log of the clckdb command, which shows the numerical errors.

I have also run the extended health check, and it reports the same `hpl-pairwise-failed` error.

AishwaryaCV_Intel
Moderator

Hi,

 

Could you also please provide the output of the following command:

 

 

$ clck -F health_extended_user -f nodefile

 

 

NOTE: Could you please also try it with the latest Intel oneAPI version (2023.1.0)?

 

Thanks And Regards,

Aishwarya

 

AishwaryaCV_Intel
Moderator

Hi,  


We haven't heard back from you. Could you please provide the details requested in my previous response?


Thank you and best regards, 

Aishwarya


AishwaryaCV_Intel
Moderator

Hi,

 

We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.


Thanks And Regards,

Aishwarya

