Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

NUMA: Read/Write speed degradation after calling libnuma's move_pages

Daniel_O_1
Beginner

I am interested in moving pages from one NUMA node to another. To this end I am using libnuma's move_pages function. So far I have managed to use this call successfully, but I have noticed that it leads to a performance degradation when the moved pages are accessed again.
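For reference, move_pages has the following prototype (declared in <numaif.h>; see move_pages(2), link with -lnuma):

long move_pages(int pid, unsigned long count, void **pages,
                const int *nodes, int *status, int flags);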

I have developed a test program that explores the effects of this call using three threads (two use OpenMP and the third uses pthreads). The idea is to place each OpenMP thread on a different NUMA node:

export OMP_NUM_THREADS=2
export GOMP_CPU_AFFINITY=7,14   # CPUs 7 and 14 are on different NUMA nodes

The first thread initializes the array (in the OMP master section); because of the first-touch policy, this allocates the memory on the first node.

#pragma omp master
{
    int cpu, node;
    int nthreads = omp_get_num_threads();
    cpu = sched_getcpu();
    time_total = get_time();
    time1 = get_time();
    /* first touch: faults the pages in on this thread's NUMA node */
    for (int i = 0; i < total_size; i++) {
        A[i] = 3.1416;
    }
    diff = get_ToD_diff_time(time1);
    printf("end of initialization by cpu %d in %f, page size %ld, no. threads %d\n\t\t",
           cpu, diff, sysconf(_SC_PAGESIZE), nthreads);
}

Later, the thread on the other node accesses this array and modifies it. The array is accessed in a random fashion; every fixed number of iterations the elapsed time is measured and a throughput figure is calculated from it. When both OpenMP threads are pinned to the same NUMA node, this loop runs faster:

#pragma omp parallel
{
    if (omp_get_thread_num() != 0) {
        int cpu = sched_getcpu();
        printf("Access from thread %d / %d\n\t\t", cpu, omp_get_num_threads());
        time1 = get_time();
        #pragma omp parallel for
        for (long i = 0; i < total_size; i++) {
            /* random or sequential access pattern */
            int index = random_step ? rand() % total_size : i;
            A[index] += 1;
            j++;
            if (j == step) {
                /* report a throughput figure every 'step' accesses */
                diff = get_ToD_diff_time(time1);
                rratio = N_ACCESSES / diff;
                time1 = get_time();
                printf("%s Step %f %f\n\t\t", label, diff, rratio);
                j = 0;
            }
        }

        diff = get_ToD_diff_time(time_total);
        printf("%s Run time: %f seconds\n\t\t", label, diff);
    }
}
 

In addition, X seconds after the program starts, the third thread wakes up and invokes move_pages (in the function call_move_pages below). After the pages are moved, the throughput figure drops by an amount proportional to the number of pages moved, and it never recovers:

void *call_move_pages(void *arg) {
    memory_region *mr = (memory_region *)arg;
    void **pages;

    pages = malloc(PAGES_2MOVE * sizeof(void *));
    int *status = malloc(sizeof(int) * PAGES_2MOVE);
    int *nodes  = malloc(sizeof(int) * PAGES_2MOVE);

    sleep(PHAS1_TIME);

    /* pick PAGES_2MOVE random addresses inside the region; move_pages
       expects page-aligned pointers, so round each address down */
    long psize = sysconf(_SC_PAGESIZE);
    for (int i = 0; i < PAGES_2MOVE; i++) {
        double *offset = mr->start + (rand() % mr->size);
        pages[i] = (void *)((uintptr_t)offset & ~((uintptr_t)psize - 1));
        nodes[i] = 1;   /* destination NUMA node */
    }

    int ret = move_pages(getpid(), PAGES_2MOVE, pages, nodes, status, 0);

    printf("move pages res %d\n", ret);
    return NULL;
}

Does anyone know of:

- alternatives for moving pages, other than move_pages, that can be called from user code?
- where to find support for libnuma?
- workarounds for this performance slowdown?

 

 
1 Reply
McCalpinJohn
Honored Contributor III

I would start off by using performance counters to verify that the data is actually located where you think it is located in each of the phases of execution. If the data is big enough to overflow the last-level cache, then the memory controller counters are the most reliable counters (though certainly not the easiest to access). For any data size the QPI Link-Layer counters can be used to measure coherent data traffic on the QPI links (provided that your BIOS allows access to these counters). The easiest counters to use are the core performance counters, where the OFFCORE_RESPONSE event can be used (along with programming an auxiliary MSR) to measure many different types of snoop response and data response. I have not worked with these counters enough to be confident in their accuracy, but it should be pretty easy to set up tests for the small number of combinations needed here.
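A quick software-level cross-check of placement (complementary to, not a substitute for, the counters) is move_pages itself: calling it with a NULL "nodes" argument moves nothing and returns the node where each page currently resides in the "status" array. A minimal sketch, where buf and npages are placeholders for the array in question:

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Report the NUMA node holding each page of [buf, buf + npages*pagesize). */
void report_placement(void *buf, size_t npages) {
    long psize = sysconf(_SC_PAGESIZE);
    void **pages = malloc(npages * sizeof(void *));
    int *status = malloc(npages * sizeof(int));
    for (size_t i = 0; i < npages; i++)
        pages[i] = (char *)buf + i * psize;
    /* nodes == NULL: query only, no migration; pid 0 = calling process.
       Negative status entries are -errno values for unmapped pages. */
    if (move_pages(0, npages, pages, NULL, status, 0) == 0)
        for (size_t i = 0; i < npages; i++)
            printf("page %zu -> node %d\n", i, status[i]);
    free(pages);
    free(status);
}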

If the data is all in the right place, then I would check cache miss rates to make sure that (after the move) the pages don't have systematically conflicting addresses. If there is a significant difference in cache miss rates, then I would check the virtual-to-physical address mapping (using the /proc/self/pagemap interface) before and after the move to look for conflicts.
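A minimal sketch of that pagemap lookup, assuming the usual layout (one 64-bit entry per virtual page, PFN in bits 0-54, "present" flag in bit 63; since Linux 4.0 the PFN reads back as 0 without CAP_SYS_ADMIN):

#include <stdint.h>
#include <unistd.h>
#include <fcntl.h>

/* Translate a virtual address to its physical frame number, or 0 on failure. */
uint64_t virt_to_pfn(const void *addr) {
    long psize = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) return 0;
    uint64_t entry = 0;
    off_t off = (off_t)((uintptr_t)addr / psize) * sizeof(entry);
    if (pread(fd, &entry, sizeof(entry), off) != sizeof(entry)) entry = 0;
    close(fd);
    if (!(entry & (1ULL << 63))) return 0;   /* page not present */
    return entry & ((1ULL << 55) - 1);       /* PFN lives in bits 0-54 */
}

Comparing the PFNs (and hence the cache set bits) before and after the migration will show whether the new physical pages systematically conflict.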

If the goal is a quick fix rather than detailed understanding, then simply copying the data to a local array is likely to be the easiest approach.
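A minimal sketch of that workaround, using libnuma's numa_alloc_onnode() so the replacement buffer lands on a chosen node (the function name and node argument here are illustrative, and numa_available() is assumed to have been checked at startup):

#include <numa.h>
#include <string.h>

/* Replace a remote array with a fresh copy allocated on 'node'.
   Caller releases the copy with numa_free(copy, n * sizeof(double)). */
double *copy_to_node(const double *src, size_t n, int node) {
    double *dst = numa_alloc_onnode(n * sizeof(double), node);
    if (dst != NULL)
        memcpy(dst, src, n * sizeof(double));
    return dst;
}

Letting the accessing thread allocate and first-touch the copy itself achieves the same placement without libnuma.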

 
