George_M_4
Beginner

OpenMP and false sharing

I've only recently begun programming with OpenMP, using the Intel compiler v15.  I've written a simple C code using OpenMP on a single node as a template for one of my projects, but the execution time increases as the number of threads increases, rather than decreasing.  I'm hoping someone here can help me identify my error(s).  I'm using a Haswell CPU (E5-2699 v3): 18 cores, 32 KB L1 data and instruction caches, 64-byte cache lines.  All of my data should stay within the L1 cache, and it is all double-precision floating point.

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <math.h>
#include <omp.h>

#define CACHE_LINE_LENGTH 64	// bytes
#define L1CACHE 512		// bytes
#define MAXTHREADS_PER_CORE 2

int main(int argc, char *argv[]) {
double begin_time = 0.0, end_time = 0.0;
double result __attribute__ ((aligned(64))) = 0.0;
double *dblarray __attribute__ ((aligned(64))) = NULL;
int index1, thread_id, threads, remainder, tmp, usable_array_size;
unsigned long loop_iterations, index2;
double begin_time_per_thread, end_time_per_thread;


	if(argc != 3) {
		printf("Syntax: %s <# of threads> <# of loops>\n", argv[0]);
		exit(-1);
	}
	else {
		errno = 0;
		threads = strtol(argv[1], NULL, 0);
		if(errno == 0) {
			loop_iterations = strtoul(argv[2], NULL, 0);
			if(errno != 0) {
				printf("Syntax: %s <# of threads> <# of loops>\n", argv[0]);
				exit(-1);
			}
		}
		else {
			printf("Syntax: %s <# of threads> <# of loops>\n", argv[0]);
			exit(-1);
		}
	}

	// if the cache size isn't an integral multiple of the cache line length and the number of threads,
	// adjust it so that it *is* an integral multiple.  this will avoid false sharing of cache lines.
	// decrease the part of the cache that will be used until it is an integral of cache line and threads
	for(index1 = 0; index1 < L1CACHE; index1++) {
		tmp = L1CACHE - index1;
		if(((tmp % threads) == 0) && ((tmp % CACHE_LINE_LENGTH) == 0) && ((tmp % sizeof(double)) == 0)) {
			break;
		}
	}
	usable_array_size = tmp;
	dblarray = (double *) malloc(usable_array_size);
	if(dblarray == NULL) {
		puts("ERROR: malloc failed\n");
		exit(-1);
	}

	// calculate the number of DPFP to work on
	usable_array_size /= sizeof(double);

	//stride = CACHE_LINE_LENGTH / sizeof(double);
	//max_array_elements = array_size / sizeof(double);
	omp_set_num_threads(threads);
	#pragma omp parallel default(none) private(dblarray, thread_id, index1, index2, result, begin_time_per_thread, end_time_per_thread) \
		shared(threads, begin_time, end_time) \
		firstprivate(loop_iterations, usable_array_size)
	{

		thread_id = omp_get_thread_num();
		#pragma omp for schedule(static)
		for(index1 = 0; index1 < usable_array_size; index1++) {
		#pragma omp master
		{
			begin_time = omp_get_wtime();
		}

		for(index2 = 0; index2 < loop_iterations; index2++) {
			#pragma omp for schedule(static)
			for(index1 = 0; index1 < usable_array_size; index1++) {
				result = dblarray[index1] * (double)2.71828;
				dblarray[index1] = result * (double)3.14159;
			}
			//#pragma omp barrier
		}

		#pragma omp master
		{
			end_time = omp_get_wtime();
			threads = omp_get_num_threads();
		}
	}
	printf("HTCores: %d, Threads: %d, Elapsed time: %f seconds\n", omp_get_num_procs(), threads, end_time - begin_time);

	free(dblarray);
}

 

5 Replies
George_M_4
Beginner

Sorry, I have two errors in my pasted code:

The variable dblarray is actually in the parallel region's shared variable list, not the private list.

The first loop in the parallel region using the variable index1 is incomplete.  It should look like this:

for(index1 = 0; index1 < usable_array_size; index1++) {
  dblarray[index1] = (double) 1.0;
}

 

TimP
Black Belt

I guess you don't want to enter the "lions' den" by asking on stackoverflow.

I note that, the way you wrote it, the inner loop is not parallelized, although it seems plausible that it might be if you set OMP_NESTED.  I don't understand how you expect to see parallel speedup if you are testing several copies of the same benchmark running simultaneously.

Even if you parallelize the inner loop, I'm not surprised if there is no speedup if your benchmark is sized for L1 of a single core and you are dividing the benchmark among decreasing partitions of L1 on multiple cores.

On a benchmark which seems pointless unless it is large enough to exercise cache misses, and thus in this case becomes memory bound, the point of parallelism would be to use multiple memory controllers (for which my Haswell box is too small).
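To picture the binding TimP describes: an "omp for" divides its iterations only among the threads of the innermost enclosing parallel region, and when no such region is active it simply runs on the single encountering thread.  A minimal sketch of the binding rule (the function names are mine, not from the thread):

```c
#include <stddef.h>

/* The orphaned "omp for" below binds to whatever parallel region is
 * active when scale_array() is called: inside one, the iterations are
 * divided among its threads; from serial code, one thread does them all. */
void scale_array(double *a, size_t n, double factor) {
    #pragma omp for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] *= factor;
}

/* Serial helper to verify the result. */
double checksum(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}
```

Calling scale_array() from inside "#pragma omp parallel" splits the loop across the team; calling it from serial code runs it serially.  Nesting a second "omp for" directly inside an already-active one, as in the pasted code, is not conforming OpenMP.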

jimdempseyatthecove
Black Belt

Did you really intend to be using nested parallel regions?

You have a second-level "omp for" inside an "omp for" without a "parallel" on the second level.

Each thread, after taking its section of the index1 iteration space from the first-level parallel for, performs the full index1 iteration space inside the second (nested) omp for.

The total work is therefore (section of index1 iteration space × number of threads) × (all of index2 iteration space) × (all of index1 iteration space).

And subsequently you have all threads updating all cells of dblarray.

You need to redesign your code.

You have no indication of how index2 fits into your scheme.
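One way to redesign it along these lines is a single parallel region whose worksharing "for" over index1 is repeated index2 times, so each thread touches only its own section on every pass.  A minimal sketch under that assumption (the function name is my own; the constants echo the original post):

```c
#include <stddef.h>

/* One parallel region; a single worksharing "for" splits the index1
 * space among the team on every repetition, and the implicit barrier
 * at the end of each "omp for" keeps the passes in step. */
void run_kernel(double *a, size_t n, unsigned long reps) {
    #pragma omp parallel
    {
        #pragma omp for schedule(static)
        for (size_t i = 0; i < n; i++)   /* first touch by the owning thread */
            a[i] = 1.0;

        for (unsigned long r = 0; r < reps; r++) {
            #pragma omp for schedule(static)
            for (size_t i = 0; i < n; i++)
                a[i] = a[i] * 2.71828 * 3.14159;
        }
    }
}
```

Each "omp for" ends in an implicit barrier, so all threads finish one pass over the array before the next repetition begins.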

Jim Dempsey

George_M_4
Beginner

Thank you for your time.  The first for loop was pasted in with an error; it should look like this:

for(index1 = 0; index1 < usable_array_size; index1++) {
  dblarray[index1] = (double)1.0;
}

The change above eliminates the nesting of parallel for loops.  Rather than creating a benchmark, I'm just trying to learn to use OpenMP.

I thought threading the code to execute the same total amount of work on multiple cores would shorten the overall duration as measured by (end_time - begin_time).  I think my code roughly does this:

1. create array in L1 cache

2. establish parallel region

3. initialize values in array using parallel for loop

4. get the start time

5. use an outer for loop (index2) to iterate through a parallelized inner for loop (index1) to perform DP multiplication on each element of array

6. get the end time

7. print the elapsed time

I think the "#pragma omp for schedule(static)" around the inner for loop should subdivide the inner loop among the threads I've requested with omp_set_num_threads(threads), and decrease the measured execution time.  Instead, the elapsed time in step 5 grows as I increase the number of threads.  I've made sure my array in the L1 data cache is a multiple of the number of threads and of the cache line length.

I don't understand the reason for the increase in execution time when adding threads.  I don't think I have false sharing or loop-carried data dependencies that would affect execution time, or do I?  Shouldn't the threads each perform part of the total workload and therefore decrease the time?
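For contrast with the code above, the textbook false-sharing pattern is several threads repeatedly writing to adjacent elements of a small shared array, such as one running partial sum per thread.  Padding each per-thread slot out to a full cache line removes the contention.  A sketch of that pattern (the struct and function names are mine, and it assumes at most 64 threads):

```c
#include <string.h>

#ifdef _OPENMP
#include <omp.h>
#else
static int omp_get_thread_num(void) { return 0; }  /* serial fallback */
#endif

#define CACHE_LINE 64
#define MAX_THREADS 64   /* assumes at most 64 threads */

/* One accumulator per thread, each padded out to its own cache line.
 * Without the pad, eight slots would share one 64-byte line and every
 * write would invalidate the other threads' copies (false sharing). */
typedef struct {
    double value;
    char pad[CACHE_LINE - sizeof(double)];
} padded_double;

double parallel_sum(long n, int nthreads) {
    padded_double partial[MAX_THREADS];
    memset(partial, 0, sizeof partial);

    #pragma omp parallel num_threads(nthreads)
    {
        int t = omp_get_thread_num();
        #pragma omp for schedule(static)
        for (long i = 0; i < n; i++)
            partial[t].value += (double)i;   /* hot write stays on its own line */
    }

    double total = 0.0;
    for (int t = 0; t < nthreads; t++)
        total += partial[t].value;
    return total;
}
```

Dropping the pad member reproduces the false-sharing variant: the slots then pack eight to a 64-byte line, and every update forces the line to bounce between cores.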

 

jimdempseyatthecove
Black Belt

>>1. create array in L1 cache

The L1 cache is shareable only among the hardware threads residing within a single core. Did you intend to say: create an array whose size fits within the sum of the L1 caches of all the cores to be used? Note that L2 is also shared among the threads within the same core, but on some CPU designs (not most) two cores may share the same L2. You can use CPUID to determine how L2 is shared.

>>3. initialize values in array using parallel for loop

The array section used by each thread (possibly multiple threads per core) will, depending on the size of the section and the sizes of the various cache levels, as a byproduct populate L1 or L2 or L3, or none of them. It is important here that the threads initializing the data are the same threads that later manipulate the same locations.

If dblarray is the only block of memory being worked on, then on most CPUs L1 cache is 32KB or 4096 doubles. Your CPU with 18 cores should have a capacity of 73,728 doubles... less some amount for local variables. Possibly a dblarray size of 72,000 doubles would fit in the L1s (assuming all cores utilized).

Note, if your input parameters chose 4096 for the number of doubles to allocate, then each thread would work on only 227 or 228 elements. That amount of work is too little to amortize the overhead of the barriers inside your timed loop.
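The 227-or-228 figure is just the even split of 4096 iterations across 18 threads; OpenMP only promises approximately equal chunks for schedule(static) with no chunk size, but an even split is the common implementation.  A small helper of my own to check the arithmetic:

```c
/* Share of n iterations that thread t receives under an even static
 * split: every thread gets n/nthreads, and the first n%nthreads
 * threads get one extra iteration. */
long static_share(long n, int nthreads, int t) {
    long base = n / nthreads;
    long extra = n % nthreads;
    return base + (t < extra ? 1 : 0);
}
```

For n = 4096 and 18 threads, 4096 = 18 × 227 + 10, so threads 0-9 get 228 iterations and threads 10-17 get 227.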

Jim Dempsey