Software Tuning, Performance Optimization & Platform Monitoring
Discussion around monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform monitoring

Measuring traffic across SKX mesh

aozcan
New Contributor I

Hi,

 

I am trying to work out the topology of my SKX server. It is a 40-core server with hyperthreading disabled. I am also attaching my lscpu output in case it helps with the questions below.

 

I already know the CHA locations thanks to the work of @McCalpinJohn: I read the CAPID6 registers on my system. The system has two sockets, and each socket has 8 of its 28 tiles disabled, which leaves 20 CHAs per socket, numbered 0 to 19. The images below show my two sockets and their corresponding CAPID6 register values. I believe 20 enabled tiles (CHAs) per socket is accurate, since there are 20 cores per socket and there is a one-to-one relationship between cores and CHAs.

aozcan_2-1623747421082.png

 

aozcan_3-1623747436632.png

Is my interpretation of the CAPID6 register, and the CHA numbering I derived from it, correct?
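
The interpretation described above can be sketched in a few lines. This is only a sketch under stated assumptions: that bits 27:0 of CAPID6 each flag one tile position, that a set bit means the tile is enabled, and that CHA numbers are assigned to enabled tiles in ascending bit order. The function name `enabled_tile_positions` is hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Assumption: bits 27:0 of CAPID6 each flag one tile position, and a set bit
// means the tile (core + CHA) at that position is enabled.
constexpr int kMaxTiles = 28;

// Returns the set bit positions in ascending order. Under the assumption that
// CHAs are numbered in ascending bit order, entry k is the tile position
// that holds CHA k. Bits above bit 27 are ignored.
std::vector<int> enabled_tile_positions(uint32_t capid6) {
    std::vector<int> positions;
    for (int bit = 0; bit < kMaxTiles; ++bit) {
        if ((capid6 >> bit) & 1u) {
            positions.push_back(bit);
        }
    }
    return positions;
}
```

For a socket like the ones described here, the returned vector would be expected to hold 20 entries; any other size would suggest the mask is being read or interpreted differently.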

Assuming my CHA numbering is correct, I focused on deriving the CHA-to-core mapping, since that would give me the complete topology. For this purpose, I configured the uncore registers to measure traffic in each direction across each mesh stop. I based my work on @McCalpinJohn's and made explicit use of his codebase: https://github.com/jdmccalpin/periodic-performance-counters

Then I ran a sanity check. I read all the traffic counters in each direction just before fetching data from memory. Then I bound my thread to core 0, read 2 GB of data from RAM (after flushing it from the cache hierarchy), and read the traffic counters again. Assuming that core 0 is probably co-located with CHA0, and given that CHA0 is above the 2 IMCs, I expected to measure far more UP traffic than DOWN traffic in CHA0, but this was not the case.

The image below is taken from @McCalpinJohn's "Mapping Core and L3 Slice Numbering to Die Location in Intel Xeon Scalable Processors" technical paper. Is this image really correct? I thought that, according to John McCalpin's findings, the meaning of left and right swaps in alternating columns.

aozcan_4-1623750713829.png
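
The before/after measurement described above comes down to subtracting two counter snapshots. A minimal sketch of that subtraction follows, assuming the CHA counters are 48 bits wide (an assumption, not something stated in this post); the helper name `counter_delta` is hypothetical.

```cpp
#include <cstdint>

// Assumption: CHA PMON counters are 48 bits wide.
constexpr unsigned kCounterBits = 48;
constexpr uint64_t kCounterMask = (uint64_t{1} << kCounterBits) - 1;

// Difference after - before, taken modulo 2^48 so that a single counter
// wrap between the two reads still yields the right count.
constexpr uint64_t counter_delta(uint64_t before, uint64_t after) {
    return (after - before) & kCounterMask;
}
```

Comparing `counter_delta` for the UP and DOWN counters of one CHA gives the direction check described above without depending on where the counters started.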

 

This led me to think that I was not doing my measurements correctly. I have attached my code, and I have a few doubts about it. I will also attach its output from my SKX server. Here are my questions:

 

1) Did I configure the registers correctly to read traffic direction across the mesh? Are my filter inputs right? I used the "Uncore Performance Monitoring Reference Manual" for the configuration, but I am not sure I got it right, because the manual was rather confusing to me. Below is HORZ_RING_BL_IN_USE, for example. What do EVEN and ODD mean here? Which "two" rings does it refer to?

aozcan_5-1623751086016.png

2) As I mentioned, I am doing my measurements on core 0. Is this a bad choice for a sanity check, given that core 0 is the OS's first choice for its own tasks? That could add a lot of noise to the measurements.

3) Is my way of creating the data in RAM, flushing it, and reading it correct? (I just fetch a vector and add 5 to each of its items so that the compiler does not optimize the loop away.)

4) I used the first logical core (0) and the last logical core (39 in my case) to configure the registers of each socket. Is this correct?
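
Regarding the flush step in question 3, one way to make sure every cache line of a buffer is flushed is to walk it by byte address rather than by element index. This is a minimal sketch, assuming a 64-byte cache line and an x86 target with SSE2; the helper name `flush_range` is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // _mm_clflush, _mm_mfence (SSE2)

constexpr std::uintptr_t kCacheLineBytes = 64;  // assumed cache line size

// Flushes every cache line that overlaps [ptr, ptr + bytes) and returns the
// number of lines flushed.
std::size_t flush_range(const void* ptr, std::size_t bytes) {
    if (bytes == 0) {
        return 0;
    }
    const auto start = reinterpret_cast<std::uintptr_t>(ptr);
    // Round the start address down to a cache line boundary.
    const std::uintptr_t first = start & ~(kCacheLineBytes - 1);
    const std::uintptr_t end = start + bytes;
    std::size_t lines = 0;
    for (std::uintptr_t addr = first; addr < end; addr += kCacheLineBytes) {
        _mm_clflush(reinterpret_cast<const void*>(addr));
        ++lines;
    }
    _mm_mfence();  // order the flushes before later loads and stores
    return lines;
}
```

Stepping an `int` index by 64 would advance 256 bytes per iteration, so a byte-based loop like this one avoids skipping lines. For a `std::vector<int> data`, the call would be `flush_range(data.data(), data.size() * sizeof(int))`.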

 

EDIT: I realized that my attachments did not upload. Attaching them and adding my code here:

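#include <cerrno>       // EINVAL
#include <cstdint>      // uint64_t
#include <cstdio>       // sprintf, fprintf
#include <cstdlib>      // exit
#include <iostream>
#include <vector>
#include <fcntl.h>
#include <pthread.h>    // pthread_self, pthread_setaffinity_np
#include <sched.h>      // cpu_set_t, CPU_ZERO, CPU_SET
#include <sys/sysinfo.h>
#include <unistd.h>
#include <x86intrin.h>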

static constexpr long CHA_MSR_PMON_CTRL_BASE = 0x0E01L;
static constexpr long CHA_MSR_PMON_CTR_BASE = 0x0E08L;

static constexpr unsigned int LEFT_READ = 0x004003AB; /// horizontal_bl_ring
static constexpr unsigned int RIGHT_READ = 0x00400CAB; /// horizontal_bl_ring
static constexpr unsigned int UP_READ = 0x004003AA; /// vertical_bl_ring
static constexpr unsigned int DOWN_READ = 0x00400CAA; /// vertical_bl_ring
static constexpr unsigned int FILTER0 = 0x00000000; /// FILTER0.NULL
static constexpr unsigned int FILTER1 = 0x0000003B; /// FILTER1.NULL

static constexpr int CACHE_LINE_SIZE = 64;
static constexpr int NUM_SOCKETS = 2;
static constexpr int NUM_CHA_BOXES = 20;
static constexpr int NUM_CHA_COUNTERS = 4;

uint64_t before_cha_counts[NUM_SOCKETS][NUM_CHA_BOXES][NUM_CHA_COUNTERS];
uint64_t after_cha_counts[NUM_SOCKETS][NUM_CHA_BOXES][NUM_CHA_COUNTERS];
long processor_in_socket[NUM_SOCKETS];

using namespace std;

int stick_this_thread_to_core(int core_id) {
        int num_cores = sysconf(_SC_NPROCESSORS_ONLN);
        if (core_id < 0 || core_id >= num_cores)
                return EINVAL;

        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(core_id, &cpuset);

        pthread_t current_thread = pthread_self();
        return pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset);
}

int main()
{
    long logical_core_count = sysconf(_SC_NPROCESSORS_ONLN);
    std::vector<int> msr_fds(logical_core_count);
    char filename[100];

    processor_in_socket[0] = 0;
    processor_in_socket[1] = logical_core_count - 1;

    for(auto i = 0; i < logical_core_count; ++i) {
        sprintf(filename, "/dev/cpu/%d/msr",i);
        int fd = open(filename, O_RDWR);
        if (fd == -1) { /// check the descriptor returned by open(), not the still-unset vector entry.
            std::cout << "could not open." << std::endl;
            exit(-1);
        } else {
            msr_fds[i] = fd;
        }
    }

    uint64_t msr_val = 0;
    uint64_t msr_num = 0;
    ssize_t rc64 = 0;
    std::vector<unsigned int> counters{LEFT_READ, RIGHT_READ, UP_READ, DOWN_READ, FILTER0, FILTER1}; /// last 2 are actually filters.

    for(int socket = 0; socket < NUM_SOCKETS; ++socket) {
        for(int cha = 0; cha < NUM_CHA_BOXES; ++cha) {
            long core = processor_in_socket[socket];

            for(int counter = 0; counter < counters.size(); ++counter) {
                msr_val = counters[counter];
                msr_num = CHA_MSR_PMON_CTRL_BASE + (0x10 * cha) + counter;
                // msr_fds[0] for socket 0, 
                rc64 = pwrite(msr_fds[core], &msr_val, sizeof(msr_val), msr_num);
                if(rc64 != sizeof(msr_val)) {
                    fprintf(stdout, "ERROR writing to MSR device on core %ld, write %zd bytes\n", core, rc64);
                    exit(EXIT_FAILURE);
                } else {
                    cout << "Configuring socket" << socket << "-CHA" << cha << " by writing 0x" << std::hex << msr_val 
                    << " to core " << std::dec << core << ", offset 0x" << std::hex << msr_num << std::dec << std::endl; 
                }
            }
        }
    }

    /// create 2GB of data in RAM that would be accessed by core 0 in the next step.
    std::vector<int> data(536870912);

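    /// Flush the data from the cache in case it is in the cache somehow (might be futile here but just wanted to make sure).
    /// Step by the number of ints per cache line so that every line is flushed, not every fourth one.
    for(size_t i = 0; i < data.size(); i += CACHE_LINE_SIZE / sizeof(int)) {
        _mm_clflush(&data[i]);
    }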

    /// in the first place, I will just stick thread to core0 and read data from RAM on this thread.
    int lproc = 0;
    cout << "Sticking main thread to core " << lproc << endl;
    stick_this_thread_to_core(lproc);

    cout << "---------------- FIRST READINGS ----------------" << endl;
    for(int socket = 0; socket < NUM_SOCKETS; ++socket) {
        long core = processor_in_socket[socket];

        for(int cha = 0; cha < NUM_CHA_BOXES; ++cha) {
            for(int counter = 0; counter < NUM_CHA_COUNTERS; ++counter) {
                msr_num = CHA_MSR_PMON_CTR_BASE + (0x10*cha) + counter;
                rc64 = pread(msr_fds[core], &msr_val, sizeof(msr_val), msr_num);
                if (rc64 != sizeof(msr_val)) {
                    exit(EXIT_FAILURE);
                } else {
                    std::cout << "Read " << msr_val << " from socket" << socket << "-CHA" << cha << 
                    " on core " << core << ", offset 0x" << std::hex << msr_num << std::dec << std::endl;
                    before_cha_counts[socket][cha][counter] = msr_val;
                }
            }
        }
    }

    /// I am basically fetching data from RAM to cache here.    
    for(auto& val : data) {
        val += 5;
    }

    cout << "---------------- SECOND READINGS ----------------" << endl;
    for(int socket = 0; socket < NUM_SOCKETS; ++socket) {
        long core = processor_in_socket[socket];

        for(int cha = 0; cha < NUM_CHA_BOXES; ++cha) {
            for(int counter = 0; counter < NUM_CHA_COUNTERS; ++counter) {
                msr_num = CHA_MSR_PMON_CTR_BASE + (0x10*cha) + counter;
                rc64 = pread(msr_fds[core], &msr_val, sizeof(msr_val), msr_num);
                if (rc64 != sizeof(msr_val)) {
                    exit(EXIT_FAILURE);
                } else {
                    std::cout << "Read " << msr_val << " from socket" << socket << "-CHA" << cha << 
                    " on core " << core << ", offset 0x" << std::hex << msr_num << std::dec << std::endl;
                    after_cha_counts[socket][cha][counter] = msr_val;
                }
            }
        }
    }
    
    cout << "---------------- TRAFFIC ANALYSIS ----------------" << endl;

    for(int socket = 0; socket < NUM_SOCKETS; ++socket) {
        for(int cha = 0; cha < NUM_CHA_BOXES; ++cha) {
            cout << "Socket" << socket << '-' << "CHA" << cha << ": ";
            for(int counter = 0; counter < NUM_CHA_COUNTERS; ++counter) {
                if(counter == 0) {
                    cout << "left->";
                } else if(counter == 1) {
                    cout << "right->";
                } else if(counter == 2) {
                    cout << "up->";
                } else if(counter == 3) {
                    cout << "down->";
                }

                cout << after_cha_counts[socket][cha][counter] - before_cha_counts[socket][cha][counter] << ' ';
            }
            cout << endl;
        }
    }

    return 0;
}

 

 

 

 

Thanks in advance.

 

Best regards

SergioS_Intel
Moderator

Hello aozcan,


To better assist you, please provide the model of the server board and memory that you are using.


Best regards,

Sergio S.

Intel Customer Support Technician

For firmware updates and troubleshooting tips, visit: https://intel.com/support/serverbios


McCalpinJohn
Black Belt

There is an updated version of the technical paper on locations of cores and CHAs in Xeon Scalable Processors:  http://dx.doi.org/10.26153/tsw/13119.  Figure 3 (that you included above) has been slightly updated in the newer version, but in either case it shows the *expected* traffic on the mesh -- before the realization that the meaning of left and right swaps in alternating columns.

The exact PerfEvtSel values that I used for the CHA mesh data traffic counters were:

  • cha_perfevtsel[0] = 0x004003b8;     // HORZ_RING_BL_IN_USE.LEFT

  • cha_perfevtsel[1] = 0x00400cb8;     // HORZ_RING_BL_IN_USE.RIGHT

  • cha_perfevtsel[2] = 0x004003b2;     // VERT_RING_BL_IN_USE.UP

  • cha_perfevtsel[3] = 0x00400cb2;     // VERT_RING_BL_IN_USE.DN

The CHA unit Filter0 register is not used for these "Common Mesh Stop" events.  I also usually set Filter1 to 0x3b, but have not tested whether that makes a difference for these particular events.
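
The register layout implied by the code posted in this thread can be written down as a few address helpers. This is a sketch only: the base addresses and the 0x10 stride per CHA are taken from that code (SKX/CLX), the Filter0 and Filter1 offsets follow from its write loop, and the helper names are hypothetical. Other processor generations may use a different layout.

```cpp
#include <cstdint>

// Per-CHA MSR layout as used by the code in this thread (SKX/CLX).
constexpr uint32_t kChaCtl0    = 0x0E01;  // first of 4 control registers
constexpr uint32_t kChaFilter0 = 0x0E05;  // follows the 4 control registers
constexpr uint32_t kChaFilter1 = 0x0E06;
constexpr uint32_t kChaCtr0    = 0x0E08;  // first of 4 counter registers
constexpr uint32_t kChaStride  = 0x10;    // MSR distance between CHA blocks

constexpr uint32_t cha_ctl_msr(unsigned cha, unsigned idx) {
    return kChaCtl0 + kChaStride * cha + idx;
}
constexpr uint32_t cha_ctr_msr(unsigned cha, unsigned idx) {
    return kChaCtr0 + kChaStride * cha + idx;
}
constexpr uint32_t cha_filter0_msr(unsigned cha) {
    return kChaFilter0 + kChaStride * cha;
}
constexpr uint32_t cha_filter1_msr(unsigned cha) {
    return kChaFilter1 + kChaStride * cha;
}
```

With the /dev/cpu/N/msr interface used in the posted code, the returned value is the file offset passed to pread or pwrite.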

The documentation is definitely confusing -- it clearly involves cut-and-paste from the documentation for the earlier ring-based systems (going back to at least Sandy Bridge EP).  You can ignore the "even" and "odd" sub-types for the moment -- each mesh stop is only able to receive on either even-numbered or odd-numbered cycles for the horizontal links (and same for the vertical links), but to count total traffic it is easiest to count both.
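
Counting both sub-types amounts to setting both umask bits for a direction. The sketch below rebuilds the SKX/CLX values quoted in this thread from an event code and a umask. The field positions (event code in bits 7:0, umask in bits 15:8, enable in bit 22) and the per-bit umask names are assumptions inferred from those values, and `make_perfevtsel` is a hypothetical helper.

```cpp
#include <cstdint>

// Assumed umask bits for HORZ_RING_BL_IN_USE; the vertical event appears to
// use the same pattern, with UP in the low pair and DN in the high pair.
constexpr uint32_t kLeftEven  = 0x01;
constexpr uint32_t kLeftOdd   = 0x02;
constexpr uint32_t kRightEven = 0x04;
constexpr uint32_t kRightOdd  = 0x08;

constexpr uint32_t kEnableBit = 1u << 22;  // assumed counter enable bit

// Packs an event code and umask into a control register value.
constexpr uint32_t make_perfevtsel(uint32_t event, uint32_t umask) {
    return kEnableBit | ((umask & 0xFFu) << 8) | (event & 0xFFu);
}
```

For example, `make_perfevtsel(0xAB, kLeftEven | kLeftOdd)` gives 0x004003AB and `make_perfevtsel(0xAB, kRightEven | kRightOdd)` gives 0x00400CAB, matching the LEFT and RIGHT values used for SKX in this thread.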

Core 0 is co-located with CHA 0 on every system I have tested.  The Core0/CHA0 pair is located in the upper left corner unless CAPID6 bit 0 is clear, so that is usually a good choice for testing to make sure the counters work.

There is no guarantee that the properties I have seen on TACC systems also apply to systems from other vendors.  Intel does not publicly disclose the mechanisms used to re-number the various units on the chip, so it is not clear how much of the observed patterns are due to requirements of the hardware, inheritance from the Intel reference BIOS, or just a lack of interest by the vendors in changing the BIOS options.

aozcan
New Contributor I
Hi, thanks for your detailed answer. I used event codes 0xaa and 0xab for VERT_RING_BL_IN_USE and HORZ_RING_BL_IN_USE respectively, since these are the values documented on page 38 of the Intel® Xeon® Processor Scalable Memory Family Uncore Performance Monitoring Reference Manual (July 2017). However, I see that you used 0xb2 and 0xb8 respectively for these events. Am I reading the documentation wrong, or is there something else I am missing?

Thanks and regards
McCalpinJohn
Black Belt

Sorry -- those are the events for the Ice Lake Xeon.  For SKX/CLX I used the same events that you listed.

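  • 0x004003ab     // HORZ_RING_BL_IN_USE.LEFT
  • 0x00400cab     // HORZ_RING_BL_IN_USE.RIGHT
  • 0x004003aa     // VERT_RING_BL_IN_USE.UP
  • 0x00400caa     // VERT_RING_BL_IN_USE.DN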

 
