Intel® MPI Library

missing-oneapi-libraries:DPL/MKL when running platformspec test by using CLCK(update7)

Frank_Fu
Beginner

Hi,

 

I have installed the latest Intel toolkits (base, HPC, AI) on the Rocky Linux 8.7 clusters in our lab.
So far, I have passed three tests of clck without issues using clck (update 7).
The 3rd test, platformspec, currently fails with the error: missing-oneapi-libraries:DPL, MKL.

 

  1. I verified the installation by re-running the installation scripts l_BaseKit_p_2023.0.0.25537.sh and l_HPCKit_p_2023.0.0.25400.sh, which report that DPC++ and MKL are already installed:

[x] Intel® DPC++ Compatibility Tool 2023.0.0 | 145 MB <already installed>

[x] Intel® oneAPI DPC++ Library 2022.0.0 | 4.4 MB <already installed>

[x] Intel® oneAPI Math Kernel Library 2023.0.0 | 7.1 GB <already installed>

[x] Intel® oneAPI DPC++/C++ Compiler & Intel® C++ Compiler Classic 2023.0.0 | 5.8 GB <already installed>  

But somehow, clck cannot detect them when running the 3rd test.

I also checked the /etc/intel-hpc-platform-release file on the frontend node and all compute nodes. They are in sync, with the following content:

$ cat /etc/intel-hpc-platform-release

# Created by intel-hpc-platform-2.0-core

INTEL_HPC_PLATFORM_VERSION=core-2.0:core-intel-runtime-2.0:hpc-cluster-2.0:compat-hpc-cluster-2.0:high-performance-fabric-2.0:compat-hpc-2018.0:core-intel-runtime-2018.0:core-2018.0:core-intel-runtime-2018.0:high-performance-fabric-2018.0:hpc-cluster-2018.0

If I remember correctly, this file is related to the 3rd test.

Since we use Intel Cascade Lake CPUs, you mentioned before that we need to use the platform spec layers with the 2018 suffix in the above parameter.
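Because the INTEL_HPC_PLATFORM_VERSION value is one long colon-separated string, it is easy to miss a repeated or misspelled layer. A minimal sketch that splits the value into one layer per line and reports repeats follows; `list_layers` and `duplicate_layers` are hypothetical helper names, and the value is copied from the file content quoted above.

```shell
# List the layers in an INTEL_HPC_PLATFORM_VERSION value and report repeats.
# list_layers and duplicate_layers are hypothetical helper names.

list_layers() {
    # Split the colon-separated value into one layer per line.
    printf '%s\n' "$1" | tr ':' '\n'
}

duplicate_layers() {
    # Print every layer that occurs more than once.
    list_layers "$1" | sort | uniq -d
}

# Value copied from the /etc/intel-hpc-platform-release content quoted above.
release='core-2.0:core-intel-runtime-2.0:hpc-cluster-2.0:compat-hpc-cluster-2.0:high-performance-fabric-2.0:compat-hpc-2018.0:core-intel-runtime-2018.0:core-2018.0:core-intel-runtime-2018.0:high-performance-fabric-2018.0:hpc-cluster-2018.0'

echo "Layers:"
list_layers "$release"
echo "Repeated layers:"
duplicate_layers "$release"
```

With the value above, the only repeated entry is core-intel-runtime-2018.0. Whether the repeat matters to clck is not established anywhere in this thread.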

 

 

2. A small issue (or bug?) in the 4th test (Sim_mod): although the HPL results are listed in the output log, clck_execution_warnings.log still reports “hpl-cluster-failed”.

 

I have also attached the databases and logs of the platformspec and sim_mod tests for your reference.

 

AishwaryaCV_Intel
Moderator

Hi,


Thank you for posting in Intel Communities.


Could you please clarify which tests you are referring to as the 3rd test and the 4th test? Also, please provide the exact path of the tests you are running if they are part of the Intel MPI Library (HPC Toolkit). If not, please provide the URL of the tests you are running.

Could you also provide the complete steps so that we can reproduce the issue on our end?


Thanks And Regards,

Aishwarya


Frank_Fu_
Beginner

Hi Aishwarya,

 

1. Here are my steps for the 3rd test.

```

# install all base toolkits

sh l_BaseKit_p_2023.0.0.25537.sh

# install all HPC toolkits

sh l_HPCKit_p_2023.0.0.25400.sh

# install all AI toolkits

sh l_AIKit_p_2023.1.0.31760.sh

 

# dnf install the following intel-packages (Added today)

dnf -y install intel-oneapi-clck intel-oneapi-tbb-2021.5.1.x86_64 intel-oneapi-openmp-2022.0.2.x86_64 intel-oneapi-mpi-2021.5.1.x86_64 intel-oneapi-mkl-2022.0.2.x86_64 intel-oneapi-mkl-devel-2022.0.2.x86_64 intel-oneapi-compiler-fortran-2022.0.2.x86_64 intel-oneapi-compiler-dpcpp-cpp-classic-fortran-shared-runtime-2022.0.2.x86_64 intel-oneapi-compiler-dpcpp-cpp-and-cpp-classic-2022.0.2.x86_64 intel-oneapi-compiler-dpcpp-cpp-2022.0.2.x86_64 intel-oneapi-clck intel-hpckit-runtime-2022.1.2 intel-oneapi-python

 

# sync the `/etc/intel-hpc-platform-release` file on frontend and all 4 compute nodes with the following contents

# Created by intel-hpc-platform-2.0-core

INTEL_HPC_PLATFORM_VERSION=core-2.0:core-intel-runtime-2.0:hpc-cluster-2.0:compat-hpc-cluster-2.0:high-performance-fabric-2.0:compat-hpc-2018.0:core-intel-runtime-2018.0:core-2018.0:core-intel-runtime-2018.0:high-performance-fabric-2018.0:hpc-cluster-2018.0

 

# Then I start my test

source /opt/intel/oneapi/setvars.sh

 

$ clck -f nodelist -F intel_hpc_platform_compat-hpc-cluster-2.0 --db platformspec.db -o platformspec.log

Intel(R) Cluster Checker 2021 Update 7 (build 20230112)

 

Running Collect

 

............................................................................................................................................

Running Analyze

 

SUMMARY

  Command-line:   clck -f nodelist -F intel_hpc_platform_compat-hpc-cluster-2.0 --db platformspec.db -o platformspec.log

  Tests Run:      intel_hpc_platform_compat-hpc-cluster-2.0

  Overall Result: 1 issue found - FUNCTIONALITY (1)

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

VALIDATION FAILED

  Intel HPC Platform Specification compat-hpc-cluster-2.0

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

4 nodes tested:         rocky-compute-[13-16]

0 nodes with no issues:

4 nodes with issues:    rocky-compute-[13-16]

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

FUNCTIONALITY

The following functionality issues were detected:

  1. The library, Intel(R) oneAPI Math Kernel Library, required by Intel HPC Platform Specification layer core-intel-runtime-2.0 is missing on the system.

       4 nodes: rocky-compute-[13-16]

 

HARDWARE UNIFORMITY

No issues detected.

 

PERFORMANCE

No issues detected.

 

SOFTWARE UNIFORMITY

No issues detected.

 

See the following files for more information: platformspec.log, clck_execution_warnings.log

```

I noticed that the long dnf install command (added today) resolves the missing DPC++ libraries issue, but the missing MKL issue still persists.

I have attached the updated DB and log to this post.
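One sanity check that can be run on each compute node is whether an MKL runtime library is actually present on disk. The sketch below is only that: how clck itself detects MKL is not stated in this thread. `has_mkl_runtime` is a hypothetical helper, and the directory is an assumed default oneAPI layout (`$MKLROOT` is normally set by setvars.sh).

```shell
# Check whether an MKL runtime library is present in a given directory.
# has_mkl_runtime is a hypothetical helper; libmkl_rt.so* is the oneMKL
# single dynamic library.

has_mkl_runtime() {
    dir=$1
    for f in "$dir"/libmkl_rt.so*; do
        # The glob stays literal when nothing matches, so test for existence.
        if [ -e "$f" ]; then
            return 0
        fi
    done
    return 1
}

# Assumed default install location; adjust to the actual layout.
mkl_dir=${MKLROOT:-/opt/intel/oneapi/mkl/latest}/lib/intel64

if has_mkl_runtime "$mkl_dir"; then
    echo "MKL runtime found in $mkl_dir"
else
    echo "MKL runtime NOT found in $mkl_dir"
fi
```

Since the steps above install MKL both from the 2023.0.0 BaseKit installer and from the 2022.0.2 dnf packages, it may be worth running the check against each versioned directory rather than only the default one.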

 

 

2. I tested again with health_extended_user today and found an hpl_pairwise failure.

After examining the DB content, I found that 2 of the 12 "HPL-pairwise" tests failed the residual check.

E.g., on rocky-compute-14 and rocky-compute-15:

 

$ clckdb --db health_extended_user.db --provider hpl_pairwise | grep FAILED

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   6.52451543e+06 ...... FAILED

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   3.20671603e+07 ...... FAILED
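To put these numbers in context, HPL marks a run as FAILED when the scaled residual exceeds a configured threshold. The sketch below extracts the residual from each result line and compares it with a threshold; `check_residuals` is a hypothetical helper, and the default of 16.0 is the value in the stock HPL.dat, which is an assumption about how clck configures its runs.

```shell
# Flag HPL result lines whose scaled residual exceeds a threshold.
# check_residuals is a hypothetical helper; 16.0 is the stock HPL.dat
# threshold and is an assumption about how this run was configured.

check_residuals() {
    threshold=${1:-16.0}
    awk -v t="$threshold" '
        index($0, "||Ax-b||_oo") {
            # The residual is the first field after the last "=" sign.
            n = split($0, parts, "=")
            split(parts[n], fields, " ")
            value = fields[1] + 0
            verdict = (value > t) ? "exceeds" : "within"
            printf "%.3e %s threshold %s\n", value, verdict, t
        }'
}

# The two FAILED lines quoted above.
printf '%s\n' \
    '||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   6.52451543e+06 ...... FAILED' \
    '||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   3.20671603e+07 ...... FAILED' |
    check_residuals
```

Residuals in the millions are far beyond such a threshold, which suggests the computed solution is numerically wrong rather than marginally out of tolerance.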

 

Besides, I also noticed the following in the log:

[0] MPI startup(): File "/opt/intel/oneapi/mpi/2021.8.0/etc/tuning_knl_shm-ofi_mlx_100.dat" not found

[0] MPI startup(): Load tuning file: "/opt/intel/oneapi/mpi/2021.8.0/etc/tuning_knl_shm-ofi.dat"

 

Although I am using Cascade Lake CPUs, Intel MPI still loads a KNL-related tuning file?

Any idea how to resolve it?

 

AishwaryaCV_Intel
Moderator

Hi,


We are working on this internally and will get back to you.


Thanks And Regards,

Aishwarya


AishwaryaCV_Intel
Moderator

Hi,

 

Thanks for your patience, and apologies for the delayed response.

 

When starting out with the Cluster Checker tool, we recommend beginning with the short framework names, because the tests are structured as a tree from generic to specialized. The specific framework you are referring to, “intel_hpc_platform_compat-hpc-cluster-2.0”, does not apply to your cluster, so please test with generic names such as "health_base".

 

Could you please follow the Getting Started Guide: https://www.intel.com/content/www/us/en/docs/cluster-checker/user-guide/2021-7-2/getting-started.html#ID1

 

You can try the following command line:

$ clck -c <path/to/local/copy/of/clck.xml> -F health_base -f ./nodefile

 

>>>A small issue(or bug?) in the 4th test (Sim_mod): although the HPL results have been listed in the output log, it still says that “hpl-cluster-failed” in the clck_execution_warnings.log

The HPL benchmark test might have failed because the framework expected a different version of libstdc++.so.

 

You can also try the separate performance frameworks listed below, which appear to work well:

  • dgemm_cpu_performance
  • hpl_cluster_performance
  • imb_pingpong_fabric_performance (and other imb tests)
  • mpi_* frameworks
  • osu_* frameworks
  • stream_memory_bandwidth_performance
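To run several of these frameworks one at a time and keep their outputs apart, the commands could be generated along these lines. This is a sketch: `print_clck_commands` is a hypothetical helper that only prints the commands, the `-f`, `-F`, `--db` and `-o` options are the ones already used in this thread, and only the frameworks named exactly in the list are included (the wildcard entries are left out).

```shell
# Print one clck command per framework so each can be run and inspected separately.
# print_clck_commands is a hypothetical helper; "nodelist" is the node file
# name used earlier in this thread.

print_clck_commands() {
    nodefile=$1
    shift
    for fw in "$@"; do
        # Each framework gets its own database and log, named after it.
        printf 'clck -f %s -F %s --db %s.db -o %s.log\n' "$nodefile" "$fw" "$fw" "$fw"
    done
}

print_clck_commands nodelist \
    dgemm_cpu_performance \
    hpl_cluster_performance \
    imb_pingpong_fabric_performance \
    stream_memory_bandwidth_performance
```

Keeping a separate database and log per framework makes it easier to tell which run produced a given entry in clck_execution_warnings.log.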

To get the right tuning parameters in Intel MPI, you can set I_MPI_PLATFORM; see:

https://www.intel.com/content/www/us/en/docs/mpi-library/developer-reference-linux/2021-8/other-environment-variables.html
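One way to see whether a change to I_MPI_PLATFORM has any effect is to compare which tuning file Intel MPI reports loading before and after. The sketch below pulls that path out of the startup log; `loaded_tuning_file` is a hypothetical helper that relies on the `Load tuning file: "<path>"` line format quoted earlier in this thread. The value `clx` for Cascade Lake and the debug level are assumptions to verify against the reference linked above.

```shell
# Extract the tuning file that Intel MPI reports loading from its startup log.
# loaded_tuning_file is a hypothetical helper based on the log format
# quoted earlier in this thread.

loaded_tuning_file() {
    sed -n 's/.*Load tuning file: "\(.*\)".*/\1/p'
}

# Sample lines copied from the log quoted earlier in this thread.
sample='[0] MPI startup(): File "/opt/intel/oneapi/mpi/2021.8.0/etc/tuning_knl_shm-ofi_mlx_100.dat" not found
[0] MPI startup(): Load tuning file: "/opt/intel/oneapi/mpi/2021.8.0/etc/tuning_knl_shm-ofi.dat"'

printf '%s\n' "$sample" | loaded_tuning_file

# On the cluster, something along these lines could be compared before and after
# (assumptions: "clx" is the Cascade Lake value, level 10 prints the startup
# lines, and ./your_mpi_program is a placeholder for any MPI binary):
#   export I_MPI_PLATFORM=clx
#   I_MPI_DEBUG=10 mpirun -n 2 ./your_mpi_program 2>&1 | loaded_tuning_file
```

For the sample lines, the helper prints the tuning_knl_shm-ofi.dat path, matching the file the log says was loaded.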

 

Please use the Getting Started Guide and try running the test frameworks. Let us know if you face any further issues.

 

Thanks And Regards,

Aishwarya


AishwaryaCV_Intel
Moderator

Hi, 

 

We haven't heard back from you. Could you please let us know whether your issue is resolved? If so, please accept this as a solution.

 

Thank you and best regards,

Aishwarya


AishwaryaCV_Intel
Moderator

Hi,


We assume that your issue is resolved. If you need any additional information, please post a new question as this thread will no longer be monitored by Intel.


Thanks and Regards,

Aishwarya


Frank_Fu_
Beginner

Hi Aishwarya,

 

Sorry for the late reply; I was testing clck on RHEL VMs with an older, previously working version of CLCK (Update 6).

But the HPL failure shows up again.

 

I ran the hpl_cluster_performance test as you suggested, but I still see the hpl_pairwise failure.

-----------------------

**WARNING**:    1 test failed to run. Information may be incomplete. See clck_execution_warnings.log for more information.

-----------------------

 

$ cat clck_execution_warnings.log

Intel(R) Cluster Checker 2021 Update 6 (build 20220318)

Command-line: clck -f nodelist -F hpl_cluster_performance --db hpl_cluster_performance.db -o hpl_cluster_performance.log

RUNTIME ERRORS

Intel(R) Cluster Checker encountered the following errors during execution:

  1. hpl-pairwise-failed

       Message: The High Performance Linpack benchmark run on a given pair of

                nodes failed.

       4 nodes: compute-[13-16]

       Test:    hpl_cluster_performance

 

 

After examining the DB content, I found that the following "HPL-pairwise" tests failed the residual check.

 

$ clckdb --db hpl_cluster_performance.db --provider hpl_pairwise | grep FAILED

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   4.03125162e+06 ...... FAILED

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.54934359e+07 ...... FAILED

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   1.20432549e+07 ...... FAILED

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   7.42001552e+07 ...... FAILED

 

Could you help me resolve this?

How do you know it is related to a different libstdc++.so, given that HPL reports a failure on the residual check?
