Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

OpenMP on i7-4700MQ: Resource Utilization

sv1
Beginner

Greetings,

While experienced with GPGPU, I am new to OpenMP. My CPU has 4 cores / 8 threads.

To verify my affinity settings, I tried this:

#include <stdio.h>
#include <omp.h>   /* with the Intel compiler this also declares the kmp_* affinity API */

static void test(int numThreads){
  omp_set_num_threads(numThreads);
#pragma omp parallel
  {
    int tnum = omp_get_thread_num();

    /* Pin this OpenMP thread to the logical processor with the same number. */
    kmp_affinity_mask_t mask;
    kmp_create_affinity_mask(&mask);
    kmp_set_affinity_mask_proc(tnum, &mask);
    if(0 != kmp_set_affinity(&mask))
      printf("Error::kmp_set_affinity() for thread %d\n", tnum);

    //Just spin so the load on the selected thread(s) is visible
    while(1);
  }
}

and, sure enough, utilization went all the way up on all 8 threads. I also tried some combinations of threads (e.g., even-numbered only, odd-numbered only) and utilization was just as I expected: the selected threads went all the way up.

Next, I tried a pointwise array multiplication:

static void test(int numThreads, uint64_t* a, uint64_t* b, uint64_t* c, uint64_t segSz){
  omp_set_num_threads(numThreads);
#pragma omp parallel
  {
    int tnum = omp_get_thread_num();

    /* Same pinning as above. */
    kmp_affinity_mask_t mask;
    kmp_create_affinity_mask(&mask);
    kmp_set_affinity_mask_proc(tnum, &mask);
    if(0 != kmp_set_affinity(&mask))
      printf("Error::kmp_set_affinity() for thread %d\n", tnum);

    /* Each thread multiplies its own contiguous segment of the arrays. */
    for(uint64_t i = tnum * segSz; i < (tnum + 1) * segSz; i++)
      a[i] = c[i] * b[i];
  }
}

The speedup over the serial version is about 4x, not the 8x I was expecting.

Does the CPU have an ALU per core or per thread? If it has only 4 ALUs (one ALU per core, i.e., 2 threads sharing an ALU), that would explain the 4x speedup due to partial serialization.

Any other explanations?

Thanks in advance.

 

5 Replies
jimdempseyatthecove
Honored Contributor III
Solution

Normally one does not program affinity settings using the kmp_affinity_mask API. Rather, you specify them externally using environment variables (and then manipulate the number of threads for the parallel region). That said...
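
As a minimal sketch of that external approach (assuming an OpenMP 4.5 runtime for the place-query calls), you set the binding in the environment and only verify it from inside the region:

/* Run with, e.g.:  KMP_AFFINITY=granularity=fine,compact ./a.out
   or, portably:    OMP_PLACES=threads OMP_PROC_BIND=close ./a.out  */
#include <stdio.h>
#include <omp.h>

int main(void){
#pragma omp parallel
  {
    /* Each thread reports the place (set of logical processors) it is bound to. */
    printf("thread %d of %d -> place %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads(),
           omp_get_place_num(), omp_get_num_places());
  }
  return 0;
}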

Depending on environment variables, the topology mapping (OpenMP thread pool team member to physical processor + core within physical processor + hardware thread within core) can vary. Thus you have no assurance about the mapping, nor that the system's logical processor numbering lines up with the OpenMP team numbering. This is another reason to avoid explicit mapping in your code. If you do it anyway, I suggest you use CPUID and its extended topology leaf to obtain the mapping and the availability (and number) of hardware threads per core. Note that this may/will need code revisions as the architecture evolves.
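
A hedged sketch of that CPUID query (assuming GCC/ICC's cpuid.h and a CPU that supports leaf 0xB, the extended topology enumeration, as the i7-4700MQ does):

#include <stdio.h>
#include <cpuid.h>

int main(void){
  unsigned eax, ebx, ecx, edx;
  /* Leaf 0xB, sub-leaf 0 describes the SMT level:
     EBX[15:0] = logical processors per core, ECX[15:8] = level type (1 = SMT). */
  __cpuid_count(0x0B, 0, eax, ebx, ecx, edx);
  if(((ecx >> 8) & 0xFF) == 1)
    printf("hardware threads per core: %u\n", ebx & 0xFFFF);
  else
    printf("extended topology leaf not reported\n");
  return 0;
}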

There is no indication in your code that segSz is inversely proportional to the number of threads; let's assume it is.
The loop you show will likely get vectorized using the AVX2 instruction set (Single Instruction Multiple Data - SIMD).
Each core has one SIMD functional unit, meaning the hardware threads (HTs) of a core share it. The unit is pipelined, so the fetches and stores from one HT can overlap with those of the other HT(s) of the same core; the execution itself (e.g., *) is, however, serialized amongst the HT(s) of the same core. (Some different operator types, such as * or /, can execute concurrently with + or -.)
Depending on the environment variable KMP_AFFINITY, adjacent OpenMP thread team member numbers may be placed in the same core (compact) or spread across cores (scatter, which then wraps back around to previously assigned cores). Note there are other environment variable settings.
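
For example, on this 4-core/8-thread CPU the documented semantics of the two types map team members to cores roughly as follows (OS logical-processor numbers are omitted, since their ordering varies):

KMP_AFFINITY=compact  ->  threads 0,1 on core 0;  2,3 on core 1;  4,5 on core 2;  6,7 on core 3
KMP_AFFINITY=scatter  ->  threads 0..3 on cores 0..3;  threads 4..7 wrap back onto cores 0..3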

Therefore, depending on environment variables and the BIOS setting (HT enabled or disabled), 2 threads may run ~2x that of 1 thread, or with other settings may run significantly less than 2x (for the sample loop presented above). The HTs within the same core share that core's L1 and L2 cache, so what you want to do, and how you do it, will affect your choice of compact or scatter and your use of HT.

This is one of the reasons why you may see 1x, 2x, 3x, 4x, then less than 5x, less than 6x, less than 7x, less than 8x.
With other settings you could alternately have seen:
1x, less than 2x, 2x + less than 1x, 2x + less than 2x, ...

The second factor affecting scaling is the number of memory channels. Your CPU has 2 memory channels (other CPUs have more).
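
A rough back-of-the-envelope for the loop above (assuming the i7-4700MQ's dual-channel DDR3L-1600 configuration, about 25.6 GB/s peak):

bytes per element  = 8 (load b) + 8 (load c) + 8 (store a) = 24 bytes
peak element rate ~= 25.6e9 B/s / 24 B ~= 1.07e9 elements/s

(The store may add a read-for-ownership on top of that unless streaming stores are used.) Once a few cores together reach that ceiling, additional threads cannot raise throughput further.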

The third factor (not necessarily in the order listed) is cacheability. This is as much a function of how you distribute the work as of the total amount of cache available.

Jim Dempsey

TimP
Honored Contributor III

A primary purpose of Hyper-Threading is to handle TLB and LLC misses concurrently. In the absence of such stalls, a single thread per core may achieve higher floating-point throughput than two (with OMP_PLACES=cores). This is discussed in the Intel MKL documentation.

You would normally place a #pragma omp for simd on a for loop that is to be vectorized and distributed across threads inside an omp parallel region; this divides the iterations automatically among the threads. With static scheduling, the effect would be similar to what you posted, except that if you didn't achieve auto-vectorization, you are comparing threaded scaling against an unrealistically slow baseline. Unnecessary use of unsigned 64-bit integers may inhibit parallel performance; without either the omp for simd construct or a * __restrict qualifier, you may not get full optimization.
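
A minimal sketch of that shape (the combined parallel for simd construct replaces the hand-rolled segSz bookkeeping; n, the total element count, is an assumed parameter):

#include <stdint.h>
#include <omp.h>

static void test(int numThreads, uint64_t* __restrict a,
                 const uint64_t* __restrict b,
                 const uint64_t* __restrict c, int64_t n){
  omp_set_num_threads(numThreads);
  /* The runtime divides the iterations among threads; simd requests vectorization.
     A signed loop counter and __restrict pointers help the auto-vectorizer. */
#pragma omp parallel for simd
  for(int64_t i = 0; i < n; i++)
    a[i] = c[i] * b[i];
}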

Bernard
Valued Contributor I

Answering the OP's question.

Your CPU has 4 cores and 8 HW threads, which means that from the OS point of view it has 8 logical processors. In reality, each core has two sets of architectural register state; each set is composed of the GP registers and an APIC. The SIMD FP, x87, and integer execution units are shared between those HW threads. When your code is threaded and both threads run similar floating-point machine code, you will not reach 2x throughput, because the execution ports become saturated.

sv1
Beginner

To all responders: Thank you.

Your comments are very helpful. My only problem now is deciding which one is the best reply.

 

Regards

Bernard
Valued Contributor I

@Vectorizer,

You may have a look at this article: http://www.realworldtech.com/haswell-cpu/4/
