Software Tuning, Performance Optimization & Platform Monitoring

Slower memory bandwidth on identical nodes reported by STREAM benchmark

Londhe__Ashutosh
Beginner

I am running the STREAM benchmark on two identical nodes, but one node reports almost 5X lower memory bandwidth than the other.

The node configuration is as follows:

Processor

2 X Intel(R) Xeon(R) CPU E5-2698 v4

2 X 20 Cores, 2.20GHz, L1d cache: 32 K, L1i cache: 32 K, L2 cache: 256 K, L3 cache: 51200 K

Memory

128 GB, 2400 MHz, 4 memory channels (32GB each)

 

Please help me identify the issue.

I have checked the BIOS settings and the installed drivers; they are identical on both nodes.

McCalpinJohn
Honored Contributor III

The specific performance numbers might help narrow down the possible mechanisms.

What compiler and compilation options were used?  What is the OS?

I would start by comparing the two systems with a set of smaller tests:

  • Single-thread performance bound to each socket on each node
    • export OMP_NUM_THREADS=1; numactl --membind=0 --cpunodebind=0 ./stream
    • export OMP_NUM_THREADS=1; numactl --membind=1 --cpunodebind=1 ./stream
  • Multi-thread performance bound to each socket on each node, using 2..20 cores.
    • If HyperThreading is enabled, set OMP_PROC_BIND=spread
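
The tests above could be scripted roughly as follows. This is only a sketch under some assumptions: the binary is `./stream`, the sockets are NUMA nodes 0 and 1, and the thread counts in `THREADS` are an arbitrary sample of the 1..20 range. `STREAM`, `THREADS`, and `RUN` are illustrative variable names, not part of STREAM or numactl. `OMP_PROC_BIND=spread` is applied unconditionally here for simplicity, even though it was only suggested for the HyperThreading case. By default the script prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch of a per-socket STREAM sweep.
# Dry run by default: commands are printed, not executed. Set RUN=1 to run them.
# STREAM, THREADS and RUN are hypothetical names chosen for this example.
STREAM=${STREAM:-./stream}
THREADS=${THREADS:-"1 2 4 8 12 16 20"}

stream_sweep() {
    for socket in 0 1; do
        for n in $THREADS; do
            # Bind both memory and threads to the same socket.
            cmd="OMP_NUM_THREADS=$n OMP_PROC_BIND=spread numactl --membind=$socket --cpunodebind=$socket $STREAM"
            if [ "${RUN:-0}" = "1" ]; then
                echo "== socket $socket, $n thread(s) =="
                sh -c "$cmd"
            else
                echo "$cmd"
            fi
        done
    done
}

stream_sweep
```

Running the same script on both nodes and comparing the results line by line should show whether the slowdown is tied to one socket, to one thread count, or to the whole node.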

 

Londhe__Ashutosh
Beginner

Hello John,

Here are the details you asked for:

compilation: 

gcc -fopenmp -O3 -DSTREAM_ARRAY_SIZE=60000000 stream.c -o Stream_60M.exe

gcc version : 6.2.0

OS: Linux
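
As a sanity check on the build above, here is a sketch of the array-size arithmetic. It assumes the STREAM run rule that each array should be at least 4x the combined last-level cache, and the stream.c default of 8-byte (double) elements. The variable names are illustrative only.

```shell
#!/bin/sh
# Sketch: compare the STREAM array size against 4x the total L3 cache.
# Values are taken from the node description and the compile line;
# 8 bytes per element assumes the default STREAM_TYPE of double.
ARRAY_ELEMS=60000000        # -DSTREAM_ARRAY_SIZE
BYTES_PER_ELEM=8
L3_KIB_PER_SOCKET=51200
SOCKETS=2

array_bytes=$((ARRAY_ELEMS * BYTES_PER_ELEM))
min_bytes=$((4 * SOCKETS * L3_KIB_PER_SOCKET * 1024))

echo "bytes per array: $array_bytes"
echo "4x total L3:     $min_bytes"
if [ "$array_bytes" -ge "$min_bytes" ]; then
    echo "array size OK"
else
    echo "array too small"
fi
```

With these numbers each array is 480,000,000 bytes against a threshold of 419,430,400 bytes, so the array size itself should not explain the gap between the two nodes.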

I will try the experiments you suggested and let you know.

Thanks for the feedback.

 

Londhe__Ashutosh
Beginner

Hello John,

The issue is resolved. It was caused by a faulty PSU, which was limiting the node's performance.

Thanks for your help.

McCalpinJohn
Honored Contributor III

Glad you found the problem! 

This is an area that often causes problems in our supercomputing environment -- in many cases we would rather have a node fail than have it run slowly.
