Writing to shared variable occasionally, irregularly very slow KNL

Adam_S_5 · ‎10-03-2016

I have a python3 application I'm parallelizing using the multiprocessing package, and I'm encountering a surprising, sporadic slowdown when multiple processes are simultaneously writing to the same shared variable. I'm running a developer's kit Knights Landing (64 core), and the problem appears independent of whether I use the xppsl kernel (running CentOS).

The scheme for sharing results of a vector calculation among n_par workers is to allocate a shared array with n_par * vec_size elements, and to have each worker write its vector of results into its own segment of the array (then the results can quickly be summed across workers to produce the final vector result). In one representative setup, writing to the shared array usually takes negligible time (rounds to 0.00), but occasionally takes about two seconds (!!) to complete.

For this example, the problem size was 40 parallel workers (each pinned to a unique core) sharing a vector of length roughly 300,000 double precision elements. It happens maybe once every ten iterations of the sharing scheme, and this is enough to seriously degrade overall performance.

The severity of the problem tends to increase within increasing shared array size, and disappears for small enough vector or number of parallel workers. Overall it seems more strongly dependent on number of workers than vector length. So far, I have not been able to recreate the effect on an i7-6900K workstation (up to 8 workers).

During the moments when the program seems to stall temporarily, in htop, the CPU usage bars mostly change color from green (normal priority threads) to red (kernel threads).

Does anyone have any ideas about what might be going on here? Is it possibly something to do with the MESIF protocol? Any more specific characterizations I could try to help narrow this down and solve it?

Thank you,

Adam

jimdempseyatthecove · ‎10-04-2016

Is there additional information about your program, no matter how small, that may be pertinent to the issue? This would be something that would cause serialization (IOW critical section).

Can you make s simple reproducer, then attach it here for others to test (I have a KNL system).

Note, if the (attempted) reproducer does not exhibit the problem, then ask yourself "what is different between the real code and the reproducer?".

Can you run VTune, in Paused mode, then let the program run until it stalls, and then click Resume (then stop the program after it continues running). This might yield information as to the cause.

An alternative, you have additional threads available. Setup an additional thread to run as a watchdog thread. A small function that monitors progress. This function is compiled in Debug mode, all others as you compile now that reproduce the problem with the exception that you instruct the compiler to produce (and keep) debug information. You will have to instruct the linker not to strip out the debug information. Start the program with the debugger and set the break point in the watchdog function that is executed when the hang is detected. On break, you can examine the location of each of the other threads to see if anything is of interest.

Jim Dempsey

Adam_S_5 · ‎10-04-2016

Jim,

Thank you for your interest! I have cooked up a little script that can recreate the problem and have attached it (I gave it a ".c" extension because this website did not allow a ".py" extension, but actually it is a python script). See inside the script for command line options, but on my KNL it fairly regularly reproduces the effect when using the "-l" option (that's the letter "el", bad form I know) and the remaining as defaults. Every once in a while, one of the iteration times will be 10x longer than normal. Again, running Python 3.5 here (e.g. "python share_times.py -l").

Edit: Also shows up pretty well using: "python share_times.py -n 64 -d 30000 -p 3"

Some additional notes about what I have seen:

The problem does not go away when writing to the array is protected with a lock: it's not just about simultaneous writes to different parts of the array.
The first iteration is always much slower than average--this is the first time the process writes to the shared array after the OS forks--and how much slower depends sensitively on the number of processes and much less so on the vector size (this is true for i7 machine, as well).
In the full program I'm running, the effect happens more often and more regularly than in the script--often doesn't happen at all in the script without using the lock.
In the full program, I've done more detailed timing that shows that with locking, the slow iterations are indeed due to time spent writing, not due to irregularities in locks or barriers, and the all the processes experience the roughly the same level of slowness in the same iterations.
In the full program, the total memory usage never goes above about 4.5 GB, so it's not exceeding the MCDRAM cache.
Problem persists when running MCDRAM in flat memory mode, anyway (so far have always run KNL in Quadrant mode).
Still not able to recreate the effect on i7-6900K. Seeing some onset of much less severe timing fluctuations when running 16 processes, but this could just be hyperthreading (8 core machine) and the much larger vector size I used. Problem does not seem to appear on KNL with 8 or 16 processes.
During the normal iterations, writing to the shared array takes about the same amount of time as writing to a private array in the same fashion.

So far I have not attempted your other suggestions, but I am very curious to hear if this reproduces on your KNL or anyone elses.

Thanks!

Adam

Adam_S_5 · ‎10-05-2016

SOLVED: switch the order of the indeces of the array in the previously attached script, and everything runs smoothly and much faster.

Previously, I had decided to use the rank as the second index to the array, probably to make the subsequent sum across ranks occur over contiguous values in memory. But, writing is harder than reading, and putting rank as the first index allows each process to write to a single contiguous block in memory. Turns out the sum is still plenty quick.

It's still baffling to me how erratic the previous behavior was...

jimdempseyatthecove · ‎10-05-2016

Adam,

I am not a Python programmer, the description of your solution indicates that pre-solution your code was performing strided loads and stores. IOW the sequence of (two) indices was such that each subsiquent cell reference was not contiguous with the prior cell reference. When this occurs in a multi-threaded program (note, your program is multi-threaded not multi-processes) then adjacent cells are likely managed by different threads. Meaning references (reads, writes, read-modify-writes) will perform cache eviction of the other thread and cache reloading of the executing thread. Cache line eviction will occur depending on where the threads are located (same core, other core) and with respect to cell reference, combined with when the thread(s) reference the data. Meaning you may experience significantly less interference in some circumstance (phasing) and significantly greater interference in others. Somewhat like listening to a dual prop airplane with the engines running at different RPMs.

You should construct the code such that adjacent cells in memory are managed by the threads that have the highest frequency of access. Typically this means adjacent cells managed by same thread (there are some outliers where this might not hold true).

Jim Dempsey

TimP · ‎10-05-2016

As you mentioned, arranging the subscripts so that you don't have multiple threads writing to the same cache line (except possibly at work chunk boundaries) is necessary to avoid false sharing. The effect of false sharing should not be as bad on KNL if the interfering threads are running on the same pair of cores.

Adam_S_5 · ‎10-06-2016

For future reference, there is still an occasional slowdown during the write in some iterations, but it's not anywhere near as bad, so the program is feasible to run. This is noticeably reduced even further by padding the number of columns to cache align each row of the array, so that each cache line is only ever written to by one core. Surprisingly, I still see a similar distribution of slow iterations--they are tiny now but are not totally eliminated. Anyway this was a good lesson for me with the KNL because this sort of phenomenon does not show up with only 8 or 16 cores running!

UPDATE: Actually, under further testing, the row alignment seems to have negligible affect (as to be expected, for large 2d arrays). What finally solved the problem was chunking the allocation of the shared array (e.g. rows 0-3 in one chunk, rows 4-7 in the next, etc.). With these scheme, the same pattern of slow write times appears, but even the slow times are still very fast, less than a factor of 10 slower than the fast iterations. (And luckily the subsequent computation using these values remained almost as fast as when they were held in one place). The total array size is on the order of 100's of megabytes with a row for each of 64 cores.

It seems the program I am running would occasionally evict these variables to a higher level of memory. What caused the confusion for me is that this did not happen every time around, but only every few iterations, despite all the same intermediate computations being performed.