I am trying to understand why we are experiencing scalability problems in our application. The application is a C++ based simulation software running on Windows only. Concurrency is implemented using C++11 threading facilities and a “self made” thread pool. The tasks are embarrassingly parallel and have a runtime of 20 seconds or more depending on the simulation case. The variation in runtime is relatively low. The “self made” thread pool is of course not an optimal solution, but since the task count is low and the runtime high, it does not seem to be a hotspot.
Simulation tasks make heavy use of object allocation and deallocation. We therefore switched to the TBB malloc proxy (tbb44_20151115oss), which improved scalability significantly.
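For context, wiring in the proxy on Windows is mostly a build step. A minimal sketch (not our actual code; per the TBB documentation for the 4.x OSS releases, including the proxy header in one source file and linking against tbbmalloc_proxy redirects the global allocation routines):

```cpp
// Include in exactly one translation unit; together with linking
// tbbmalloc_proxy.lib this replaces malloc/free and new/delete
// with the scalable TBB allocator process-wide.
#include "tbb/tbbmalloc_proxy.h"

int main() {
    // This allocation now goes through the TBB scalable allocator,
    // without any other source changes.
    int* p = new int[1024];
    delete[] p;
    return 0;
}
```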
However, scalability is nonetheless poor. When comparing sequential (1 worker thread) with parallel execution (8 worker threads), I can observe the following (footnote 1):
- The runtime of an individual task increases by several seconds when running in parallel.
- CPI is increasing from 0.616 to 0.688.
- Front-end bound is decreasing from 23.7% to 17.2%, bad speculation is decreasing from 6.5% to 5.7%.
- Backend-bound is increasing from 47.7% to 56.9%.
- Retiring is decreasing from 22.1% to 20.2%.
I initially suspected that in parallel execution the L3 cache is used less efficiently, since all 8 workers compete for space in the L3 cache. However, none of the functions with a large difference in runtime is DRAM bound according to VTune.
When digging into the backend-bound category, I see that it is mostly “Core Bound” and “DTLB Overhead” that go up. I don’t understand why those metrics are higher when I increase the number of workers, since (at least in my understanding) execution ports and DTLB buffers are not shared among cores. I would therefore be really happy to learn what could cause such behaviour and how I could try to mitigate it.
I uploaded an excerpt from VTune here (sorry for strange page format):
I tried the native Windows SetThreadAffinityMask to bind threads to individual cores, but I could not notice any difference.
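In case it matters for the discussion, this is roughly how I use the API (a simplified sketch, not my actual worker code; the helper name and the one-core-per-thread mapping are just for illustration):

```cpp
#include <windows.h>

// Pin the calling thread to a single logical processor.
// Returns true on success; SetThreadAffinityMask returns the
// previous affinity mask, or 0 on failure.
bool pin_to_core(unsigned core_index) {
    DWORD_PTR mask = static_cast<DWORD_PTR>(1) << core_index;
    return SetThreadAffinityMask(GetCurrentThread(), mask) != 0;
}
```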
My testing system is equipped with a single 8-core Xeon E5-2630 v3 processor, hence it cannot be a NUMA problem. Turbo Boost is turned off for this analysis.
Any hints and suggestions are welcomed!
Thank you in advance.
(1) All values were obtained while filtering on the actual simulation function, hence ignoring everything happening outside of the parallel region.
After reading this post I think I understand a lot better what is happening. When running in parallel, I can observe many more PAGE_WALKER_LOADS.DTLB_MEMORY events, i.e. page walks that have to go all the way to DRAM. The concurrent workers seem to evict each other’s page-table entries from the shared L3 cache. Similarly, DTLB_STORE_MISSES.WALK_DURATION increases significantly.
Comments and suggestions are still very welcome.
Thank you and best regards.
I'm not familiar with your code, but my suggestions are: 1) Adjust your data structures to avoid page faults: block the data so that it stays in the cache where it is best utilized. 2) Avoid (or reduce) the use of shared variables across threads; use thread-local variables instead. 3) Reduce the critical code area if possible.
>>> Backend-bound is increasing from 47.7% to 56.9%.
I suppose that Backend-Bound increases because of inter-thread synchronization. I was thinking of a situation where multiple threads operate on a shared data array and perform a reduction-like computation.