
Slower memory bandwidth on identical nodes reported by STREAM benchmark

Londhe__Ashutosh
Beginner

I am running the STREAM benchmark on two identical nodes, but one node reports almost 5x lower performance than the other.

The node configuration is as follows:

Processor

2 x Intel(R) Xeon(R) CPU E5-2698 v4

2 x 20 cores, 2.20 GHz, L1d cache: 32K, L1i cache: 32K, L2 cache: 256K, L3 cache: 51200K

Memory

128 GB, 2400 MHz, 4 memory channels (32 GB each)

 

Please help me identify the issue.

I have checked the BIOS settings and the installed drivers; they are identical on both nodes.

4 Replies
McCalpinJohn
Honored Contributor III

The specific performance numbers would help narrow down the possible mechanisms.

What compiler and compilation options were used?  What is the OS?
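
For collecting the numbers from both nodes in a comparable form, something like the following could be used. This is only a sketch: `stream_rates` is a hypothetical helper name, and it assumes the standard STREAM summary lines (`Copy:`, `Scale:`, `Add:`, `Triad:`), where the second column is the best rate in MB/s.

```shell
# stream_rates: hypothetical helper (not part of STREAM).
# Reads STREAM output on stdin and prints "<kernel> <best rate MB/s>"
# for each of the four kernel summary lines it finds.
stream_rates() {
    awk '$1 ~ /^(Copy|Scale|Add|Triad):$/ {
        sub(":", "", $1)   # drop the trailing colon from the kernel name
        print $1, $2       # column 2 is "Best Rate MB/s"
    }'
}
```

For example, `./stream | stream_rates > nodeA.txt` on one node and the same with `nodeB.txt` on the other, followed by `paste nodeA.txt nodeB.txt`, puts the rates side by side.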

I would start by comparing the two systems with a set of smaller tests:

  • Single-thread performance bound to each socket on each node
    • export OMP_NUM_THREADS=1; numactl --membind=0 --cpunodebind=0 ./stream
    • export OMP_NUM_THREADS=1; numactl --membind=1 --cpunodebind=1 ./stream
  • Multi-thread performance bound to each socket on each node, using 2..20 cores.
    • If HyperThreading is enabled, set OMP_PROC_BIND=spread
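
The tests listed above can be scripted as a sweep. The following is a minimal sketch, assuming two NUMA nodes numbered 0 and 1 and a binary at `./stream`; the names `print_sweep`, `STREAM_BIN`, `MAX_THREADS`, and `HT` are illustrative and not from the thread. It only prints the command lines, so they can be reviewed before anything is run.

```shell
# Assumed defaults; override from the environment as needed.
STREAM_BIN=${STREAM_BIN:-./stream}   # path to the STREAM binary (assumption)
MAX_THREADS=${MAX_THREADS:-20}       # cores per socket on this system
HT=${HT:-0}                          # set to 1 if HyperThreading is enabled

# print_sweep: hypothetical helper that prints one command per
# (socket, thread count) pair, with memory and CPUs bound to the same socket.
print_sweep() {
    bind=""
    if [ "$HT" = "1" ]; then
        bind="OMP_PROC_BIND=spread "
    fi
    for node in 0 1; do
        t=1
        while [ "$t" -le "$MAX_THREADS" ]; do
            printf '%sOMP_NUM_THREADS=%d numactl --membind=%d --cpunodebind=%d %s\n' \
                "$bind" "$t" "$node" "$node" "$STREAM_BIN"
            t=$((t + 1))
        done
    done
}
```

Running `print_sweep | sh` on each node and comparing the results per socket and per thread count should show whether the slowdown is tied to one socket or affects the whole node.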

 

Londhe__Ashutosh
Beginner

Hello John,

Here are the details you asked for:

compilation: 

gcc -fopenmp -O3 -DSTREAM_ARRAY_SIZE=60000000 stream.c -o Stream_60M.exe

gcc version : 6.2.0

OS: Linux
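
As a quick sanity check on the array size: the STREAM source recommends that each array be at least 4 times the size of the available cache. The arithmetic for this configuration can be sketched as below, assuming the default 8-byte `double` element type (the compile line does not override it) and counting only the L3 caches of both sockets. The variable names are illustrative.

```shell
# Values taken from the thread; the 8-byte element size is an assumption
# based on STREAM's default element type (double).
ARRAY_ELEMS=60000000
BYTES_PER_ELEM=8
L3_KIB_PER_SOCKET=51200
SOCKETS=2

array_bytes=$((ARRAY_ELEMS * BYTES_PER_ELEM))
min_bytes=$((4 * SOCKETS * L3_KIB_PER_SOCKET * 1024))

echo "bytes per array:        $array_bytes"
echo "4x combined L3 (bytes): $min_bytes"
if [ "$array_bytes" -ge "$min_bytes" ]; then
    echo "array size satisfies the 4x cache guideline"
else
    echo "array size is below the 4x cache guideline"
fi
```

With these numbers each array is 480,000,000 bytes against a threshold of 419,430,400 bytes, so the array size is unlikely to explain a difference between the two nodes.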

I will try the experiments you suggested and let you know.

Thanks for the feedback.

 

Londhe__Ashutosh
Beginner

Hello John,

The issue is resolved. It was caused by a faulty PSU (power supply unit) that was limiting the node's performance.

Thanks for your help.

McCalpinJohn
Honored Contributor III

Glad you found the problem! 

This is an area that often causes problems in our supercomputing environment. In many cases we would rather have a node fail than have it run slowly.
