Software Tuning, Performance Optimization & Platform Monitoring
Discussion around monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform monitoring

How can I assign threads to OpenMP parallel sections by NUMA socket?


I'm trying to implement a partitioned SpGEMM algorithm on a multi-socket system. The goal is to distribute the multiplication work across all sockets and restrict memory accesses to the local socket, so that every access runs at local-memory speed.


Figure: partitioned SpGEMM algorithm


Specifically, the machine I'm using is a two-socket Intel Skylake system with 24 cores per socket. My plan is to use nested parallel regions with 2 threads in the outer region; when an outer thread reaches its section block, it spawns a team of 24 threads and performs its part of the partitioned SpGEMM.


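  for (int i = 0; i < ITERS; ++i) {
      start = omp_get_wtime();

      // Outer level: one thread per socket, each running one section.
      #pragma omp parallel sections num_threads(2)
      {
          #pragma omp section
          SpGEMM(A_upper, B, C, 24);  // inner team of 24 threads
          #pragma omp section
          SpGEMM(A_lower, B, C, 24);  // inner team of 24 threads
      }

      end = omp_get_wtime();
      ave_msec += (end - start) * 1000;  // accumulate elapsed time in ms
  }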


export OMP_PLACES=sockets
export OMP_PROC_BIND=spread
export OMP_NESTED=True

# run the program
numactl --localalloc ./partitioned_spgemm
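
One way to confirm where the nested teams actually land under these settings is to record the OpenMP place number seen by every inner thread. The following is only a minimal sketch: `record_places` and `print_places` are hypothetical helpers, not part of the SpGEMM code, and the serial fallbacks exist only so the file also builds without `-fopenmp`. On runtimes that still honor it, nesting must additionally be enabled (for example with `OMP_NESTED=True` as above).

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without -fopenmp. */
static int omp_get_place_num(void) { return -1; }
static void omp_set_max_active_levels(int levels) { (void)levels; }
#endif

/* Hypothetical helper (not from the post): for each outer iteration o and
 * inner iteration t, store the OpenMP place number of the executing thread
 * in places[o * inner + t]. A value of -1 means the thread is not bound to
 * any place. The array must hold outer * inner ints. */
void record_places(int outer, int inner, int *places)
{
    assert(outer > 0 && inner > 0 && places != NULL);
    omp_set_max_active_levels(2); /* permit two nested parallel levels */

    #pragma omp parallel for num_threads(outer) schedule(static)
    for (int o = 0; o < outer; ++o) {
        #pragma omp parallel for num_threads(inner) schedule(static)
        for (int t = 0; t < inner; ++t) {
            places[o * inner + t] = omp_get_place_num();
        }
    }
}

/* Print one line per outer iteration listing the places its inner team used. */
void print_places(int outer, int inner, const int *places)
{
    for (int o = 0; o < outer; ++o) {
        printf("outer %d:", o);
        for (int t = 0; t < inner; ++t)
            printf(" %d", places[o * inner + t]);
        printf("\n");
    }
}
```

Calling `record_places(2, 24, places)` on an `int places[48]` array and then `print_places(2, 24, places)` mirrors the 2 x 24 layout described above. If each socket's team stays local, every row should show a single place number, and the two rows should differ.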


Thread affinity is now set up correctly, but the performance is worse than I expected.

A `C_upper = A_upper * B` or a `C_lower = A_lower * B` on a single socket yields 700 MFLOPS (million floating-point operations per second). The original SpGEMM `C = A * B` on two sockets yields 900 MFLOPS (as you may know, this is well below 2 * 700 MFLOPS because of remote NUMA accesses). With proper thread affinity, I expected my partitioned SpGEMM to reach 1400 MFLOPS, but I only get 400 MFLOPS with my current setup.
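
To put the figures above side by side, the short sketch below expresses each measured rate as a fraction of the ideal 2 * 700 MFLOPS. The function names are hypothetical and the inputs are just the numbers quoted in this post.

```c
#include <assert.h>
#include <stdio.h>

/* Fraction of the ideal aggregate rate (per_socket * sockets) that a
 * measured rate reaches. All rates are in MFLOPS. */
double scaling_efficiency(double measured, double per_socket, int sockets)
{
    assert(per_socket > 0.0 && sockets > 0);
    return measured / (per_socket * sockets);
}

/* Print the efficiency of both two-socket runs from the post. */
void print_report(void)
{
    const double per_socket = 700.0;  /* single-socket rate */
    const double original = 900.0;    /* unpartitioned C = A * B */
    const double partitioned = 400.0; /* nested, partitioned version */

    printf("original:    %.1f%% of ideal\n",
           100.0 * scaling_efficiency(original, per_socket, 2));
    printf("partitioned: %.1f%% of ideal\n",
           100.0 * scaling_efficiency(partitioned, per_socket, 2));
}
```

In other words, the unpartitioned run reaches roughly 64% of the ideal rate, while the partitioned run reaches roughly 29%, which is also slower than a single socket working alone.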

I'm using GCC 8.2.0 with OpenMP 4.5 on Red Hat Enterprise Linux Server 7.6. The SpGEMM I'm using is outer-product based and parallelizes its outer loops with OpenMP.

I think it must have something to do with nested threads, but I cannot figure it out myself. How can I solve this?

I also posted the same question on Stack Overflow for better visibility. Let me know if that violates the terms here and I'll delete it.
