Can 4K aliasing prevent scaling to multiple threads?

coulom__remi · ‎10-04-2018

Hi,

I am struggling to understand why my code does not scale to many threads on my new i9-7920X.

There is almost no synchronization between the threads. They don't use a lot of memory bandwidth and work mainly in L1 and L2. They don't share data. So I thought performance should scale very well.

My code makes heavy use of AVX2, and I know the i9 will reduce the clock rate as the number of threads grows. But according to the frequency tables I saw, I should not get the catastrophic scaling I observe:

1 thread: 15.3 items/second
4 threads: 60.8 items/second
8 threads: 88.6 items/second
12 threads: 56.3 items/second

So I have perfect scaling at 4 threads, bad scaling at 8, and something catastrophic happens at 12 threads.
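To put numbers on this, the throughput figures above can be converted to speedup and parallel efficiency (speedup divided by thread count). A minimal C++ sketch; the `Sample` struct and the function names are made up for illustration and are not part of the original code:

```cpp
#include <cassert>
#include <cstdio>
#include <vector>

// One measurement: thread count and measured throughput.
struct Sample {
    int threads;
    double items_per_second;
};

// Speedup relative to the single-thread baseline.
double speedup(double baseline, double throughput) {
    assert(baseline > 0.0);
    return throughput / baseline;
}

// Parallel efficiency: speedup divided by thread count (1.0 = perfect scaling).
double efficiency(double baseline, const Sample& s) {
    assert(s.threads > 0);
    return speedup(baseline, s.items_per_second) / s.threads;
}

// Throughput figures quoted in the post.
std::vector<Sample> measured_samples() {
    return {{1, 15.3}, {4, 60.8}, {8, 88.6}, {12, 56.3}};
}

// Print speedup and efficiency for each measurement.
void print_scaling_report() {
    const std::vector<Sample> samples = measured_samples();
    const double baseline = samples.front().items_per_second;
    for (const Sample& s : samples) {
        std::printf("%2d threads: speedup %.2f, efficiency %.0f%%\n",
                    s.threads,
                    speedup(baseline, s.items_per_second),
                    100.0 * efficiency(baseline, s));
    }
}
```

With these figures, efficiency works out to roughly 99% at 4 threads, 72% at 8, and 31% at 12.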

The CPU is watercooled, temperature stays at less than 60°C, turbo boost is always active, so it seems there is no thermal problem.

I profiled the code with VTune. With 1 thread, I have only 4.8% memory bound, and 12 threads are 65% memory bound. Most of it is L1 bound. That seems to be the cause for the bad performance, as the rest is similar.

Microarchitecture Exploration gives further details of L1 Bound.

With 1 thread: L1 Bound 0.6%, L2 Bound 0.9%, L3 Bound 1.2%, DRAM Bound 3.0%, Store Bound 0.0%

DTLB Overhead: 0.2%
Loads Blocked by Store Forwarding: 0.0%
Lock Latency: 0.0%
Split Loads: 0.0%
4K Aliasing: 14.9%
FB Full: 0.0%

With 12 threads: L1 Bound 52.5%, L2 Bound 0.9%, L3 Bound 0.4%, DRAM Bound 0.4%, Store Bound 0.0%

DTLB Overhead: 0.0%
Loads Blocked by Store Forwarding: 0.0%
Lock Latency: 0.0%
Split Loads: 0.0%
4K Aliasing: 7.4%
FB Full: 0.0%

The Bandwidth Utilization Histogram gives an Average Bandwidth of 1.361 GB/sec, so that cannot be the bottleneck.

The threads do have a little synchronization, but at a frequency of less than 100 Hz, so that should not be catastrophic. It seems that something bad is happening with memory, but I don't understand what it is. If I understand correctly, L1 and L2 are local to each core, so if my code does most of its work in L1, then there should be little interference between cores.

Does 4K Aliasing produce communication between cores that could explain the bad performance? It is a bit difficult to avoid it with my current code, but if it can be the source of such a huge performance problem, then I would try to reduce it.
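For reference, as Intel's tools describe it, 4K aliasing is flagged when a load follows a store to a different address that sits a multiple of 4096 bytes away, so that the two addresses share the same low 12 bits. A rough C++ sketch of how one might test buffer addresses for this, and skew a buffer stride to avoid it, is below. The helper names are hypothetical, and the 64-byte skew is only an assumption (one cache line); the right padding depends on the access pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Low 12 bits of an address: the byte offset within a 4 KiB region.
constexpr std::uintptr_t kOffsetMask4K = 0xFFF;

// True when two distinct addresses share the same low 12 bits, i.e. they are a
// nonzero multiple of 4096 bytes apart. A store to one followed shortly by a
// load from the other is the pattern reported as 4K aliasing.
bool same_4k_offset(const void* a, const void* b) {
    const auto x = reinterpret_cast<std::uintptr_t>(a);
    const auto y = reinterpret_cast<std::uintptr_t>(b);
    return x != y && ((x ^ y) & kOffsetMask4K) == 0;
}

// Hypothetical mitigation: if the distance between consecutive buffers is a
// multiple of 4096 bytes, add 64 bytes (assumed cache-line size) so that
// corresponding elements no longer share a 4 KiB offset.
std::size_t skewed_stride(std::size_t stride_bytes) {
    assert(stride_bytes > 0);
    return (stride_bytes % 4096 == 0) ? stride_bytes + 64 : stride_bytes;
}
```

For example, an input and an output buffer that both start on a 4096-byte boundary would make `same_4k_offset` return true for corresponding elements; placing them `skewed_stride(n)` bytes apart instead breaks that alignment.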

Can you help me to understand why my code becomes so slow with 12 threads? Any advice for further investigation would be very welcome.

McCalpinJohn · ‎10-04-2018

If HyperThreading is enabled, it is possible that you are getting two threads scheduled on one physical core, which could lead to both 4K aliasing and poor scaling.

The mechanism(s) to ensure that threads are scheduled on separate physical cores depend on the programming model and the compiler that you are using. For OpenMP codes with Intel or GNU compilers, you can use the OMP placement directives or the GNU placement directives. For OpenMP codes with the Intel compilers, you can also use the Intel KMP directives. It is usually a good idea to use the "verbose" options (when available) to ensure that you have configured the environment correctly.

coulom__remi · ‎10-04-2018

Hi John,

Thank you very much for your reply.

It seems that the scheduler already does a good job of scheduling each thread on a separate core. My code is plain C++, on Windows. Just to make sure, I added a SetThreadAffinityMask() call for each thread. It did not improve anything at all.
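For readers who want to try the same thing, here is a minimal sketch of pinning one thread per physical core with SetThreadAffinityMask(). It is not necessarily what was done in the code discussed here. It assumes that logical processors 2k and 2k+1 are the two hyperthreads of physical core k, which is a common enumeration but should be verified (for instance with GetLogicalProcessorInformation). The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// ASSUMPTION: logical processors 2k and 2k+1 are the two hyperthreads of
// physical core k. Returns a mask selecting only the first of the pair.
std::uint64_t affinity_mask_for_core(unsigned core_index) {
    assert(core_index < 32);  // bit 2*core_index must fit in 64 bits
    return std::uint64_t{1} << (2 * core_index);
}

// Pin the calling thread to the first hyperthread of the given physical core.
// Returns true on success. On non-Windows builds this does nothing and
// returns false.
bool pin_current_thread_to_core(unsigned core_index) {
#ifdef _WIN32
    const DWORD_PTR mask =
        static_cast<DWORD_PTR>(affinity_mask_for_core(core_index));
    // SetThreadAffinityMask returns the previous mask, or 0 on failure.
    return SetThreadAffinityMask(GetCurrentThread(), mask) != 0;
#else
    (void)core_index;
    return false;
#endif
}
```

Each worker thread would call `pin_current_thread_to_core(i)` with its own index (0 to 11 on a 12-core CPU) before starting work.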

I attach a screenshot of the Windows task manager with 12 threads, because I noticed something strange: it reports 73% CPU usage, but I suppose it should be closer to 50%. That may be just a display bug, but it may mean something interesting.

So this problem remains a mystery. I'd really love to understand what is happening.

To give a little more information: my calculation is a convolutional neural network (inference only). Each thread performs an independent evaluation. The weights of the neural network are constant and shared between threads. The whole set of weights does not fit in cache, but the required bandwidth is low, and it seems they are comfortably prefetched. I use 16-bit quantization, and my code is made mainly of _mm256_madd_epi16, and _mm256_add_epi32.

I will run more experiments and report if I find anything interesting.

Thanks again for any suggestion.

Rémi

Richard_Nutman · ‎10-05-2018

Worse scaling performance as you increase number of threads suggests the issue is in the synchronisation code.

Is it possible to remove the synchronisation between threads just for test purposes?

TimP · ‎10-05-2018

In my experience, the Windows scheduler alone doesn't do a good job under hyperthreading of keeping threads on separate cores when the number of threads is set to the number of cores. The symptom of performance peaking with fewer threads than cores, or varying unpredictably as the number of threads is increased, is often a result of failing to pin threads to cores.

I'm not familiar with how pinning is handled by the Windows threads calls. It's clear it's not as effective in all cases as the Linux equivalent.

The term 4K aliasing, as John mentioned, is usually applied to the action of L1 cache, where data evict others with the same address modulo 4K. Not long after hyperthreading was first introduced, the disastrous effects of multiple threads sharing L1 were alleviated by schemes for partitioning L1. Still, if threads are allowed to move around such that they sometimes share a core, the effective capacity of L1 may be reduced significantly.

A different and less severe 4K effect is that scaling to multiple threads may be more effective if the threads work on separate 4K data pages, as well as being pinned to separate cores, so as to take better advantage of the TLB.

In the simplest case of 2 cores with hyperthreading, if not using a pinning scheme such as the one in Intel OpenMP, performance frequently peaks at 3 threads, as that ensures that at least one thread runs on each core. However, the L1 capacity may suffer, as each thread can count on only half of the L1 of either core.