With memory bandwidth limited

Todd_W_ · ‎02-25-2018

Hi, I have a C# workload built against .NET 4.6.1 using the TPL (task parallel library) on Win 10 Fall Creators Update (.NET 4.7.1). Two IO tasks run in parallel to p/invoke CreateFile and ReadFile to read the first 8 kB from each file in an array from an SSD. Two compute tasks then pick up these 8k chunks and call through a C++/CLI layer into C++ for some SIMD number crunching and stay in ring 3. The IO tasks share a C# lock statement to increment the file array index and the compute tasks use a second, independent lock to increment through the chunks. The compute tasks have some brief delay logic in case compute gets ahead of IO but, as IO is consistently faster, instrumentation shows it fires at most once at the start of processing.

As a baseline, I'm profiling a 6000 file case. First running the two IO tasks, waiting for them to complete, and then the two compute tasks gives 1.6 seconds for the IO (3800 file reads/s) and 1.8 s for compute (3400 file chunks/s). Since the operation is sequential, total time is 3.4 s for an all up throughput of 1800 files/s. Since the test processor is dual core and hyperthreaded (i5-4200U Haswell), one would expect running all four tasks in parallel would complete in close the 1.8 s limiting duration for compute. Unfortunately, this isn't what happens. Instead, what occurs is IO completes in 1.7 s (a drop to 3500 files/s) and compute degrades from 1.8 to 2.6 s, a rather precipitous drop from 3400 to only 2300 files/s. While this is still a decent improvement over the 1800 files/s of sequential operation it leaves two threads running on four logical processors for 900 ms after IO completes but the compute tasks are still running. One might reasonably expect spinning up two more compute tasks at this point would shorten this period to 450 ms, since that doubles processor resources allocated to CPU bound work. But that's not what happens. Instead of pushing overall throughput up to 2800 files/s the compute time remains 2.6 s despite the extra processing power. Curiously, a single compute task with nothing else running also takes about 2.6 s even though inspection of performance counters shows good load balancing between two compute tasks.

Additionally, this is a best case. Sometimes performance drops as low as 1300 files/s. Sometimes this seems attributable to other system load but most of the time there's a drop to 1800 files/s with no other obvious load on the box. From some experiments with setting thread affinity, it appears this drop is attributable to both IO tasks landing on one core and both compute tasks on the other. The more typical case of 2300 files/s seems to be associated with each core running one IO task and one compute task.

I've attempted to have a look in VTune (Parallel Studio XE 2018 Update 1) but it consistently BSODs the box shortly after the target executable starts, so no information is available from it. However, bandwidth here is only about 30MB/s so I'd not expect any trouble with L3 and all operations have sequential stride so 4k aliasing shouldn't be an issue. Profiling in Visual Studio 2017 indicates only expected CPU hotspots shows no contention over either the IO or compute lock. Oddly, VS does indicate some shared handle contention at the C++/CLI to C++ and p/invoke sites but the numbers are inconsistent with to the observed delays and inspection of the release build disassembly shows no critical regions at these points. So I suspect this is just VS indicating the insertion of its contention instrumentation. Also, the codebase contains another quad thread SIMD workload which runs through the same classes but doesn't exhibit this scalability problem (it runs about twice as fast on four logical processors as on two, as expected). The difference is that workload initiates from a single threaded C++/CLI transition and then invokes concurrency::parallel_for at the C++ level.

Any suggestions as to how else to take this apart to try to figure out what's going on? Pushing the compute tasks into C++ or C++/CLI isn't really an option as, in addition to the SIMD, they need to update data structures defined in a dependent C# assembly and also make some computationally lightweight but functionally critical C# calls into managed dependencies.

TimP · ‎02-26-2018

As the question has little to do with Intel software, you may not find anyone familiar with that environment on these forums .

I/O is normally serial on a single CPU, with equivalent of critical region in the background, particularly if you don't have a striped or fully parallel file system . If you do, you would still wish to measure performance with varying number of threads , and with hyperthreads disabled if possible. It would be unusual to see a significant boost from hyperthreads unless your application has database characteristics.

jimdempseyatthecove · ‎02-26-2018

Two suggestions:

1) On the C# side, do not continually create tasks/threads (this is SOP for C# programming). Create once, park and reuse if necessary. C# generally creates a new thread/runs/delete thread for each task. This is excessive overhead for performance oriented programming, and it makes it difficult to affinity pin threads.

2) On the C++ compute task side (2 compute threads), do likewise (create once), however affinity pin the two threads one to each core.

Note, if the running with one compute thread fully utilizes the L3 cache, then adding an additional compute thread will have lesser impact (in some cases negative due to cache evictions by other thread).

Jim Dempsey

jimdempseyatthecove · ‎02-26-2018

I forgot to mention, from the C# side, do not construct a C# loop that repeatedly spawns a worker thread/task, that in turn calls C++, that in turn uses OpenMP to create two or more workers. IOW do not continually use C# newly spawn threads to launch OpenMP. Doing so will continually instantiate new OpenMP pools.

Instead, write the C# worker thread that calls in to the C++/OpenMP such that it is persistent.

Jim Dempsey

Todd_W_ · ‎02-26-2018

Tim, sure, but something is preventing effective use of Intel hardware and VTune's unable to diagnose. I could put this on a Microsoft forum or Stackoverflow but it's my impression folks here are closer to and more knowledgeable about parallel interactions with hardware. While serial single CPU IO is a common practice it's not well supported by profiling data in this case. Without the compute tasks running the IO scaling is

IO thread: 1900 files/s
IO threads: 3800 files/s (linear improvement from second core, yay)
IO threads: somewhat variable but typically 4200-4300 files/s
IO threads: fairly volatile but generally 4600-5000 file/s

This is consistent with your expectation of limited improvement from IO hyperthreads and why I'm trying to get efficient utilization of CPUs 3 and 4 with compute hyperthreads to minimize overall execution time. This is an app for regular laptops and desktops so asking users to go into their BIOS and turn off hyperthreading isn't something we'll do. Particularly as there's a 20% improvement from leaving it enabled. Typing this response made me realize the pure IO improvement of going from two to four threads is about 25%, which isn't much better than that 20% and not too far off the 30%-ish often cited for hyperthreading.

Typing this up prompted me to realize the linear scaling we've observed with hyperthreads in the C++ SIMD workload might due to the two hyperthreads executing SSE2 to 4.1 instructions concurrently on the upper and lower lanes of an AVX2 ALU. (VEX encoding, if it matters.) If something is preventing this with this mixed C#, C++/CLI, and C++ workload it'd explain the lack of throughput increase going from two to four compute tasks as well as some of the thread affinity observations. Might conceivably also explain the lack of scaling from one to two compute tasks but I'll need to add some thread affinity diagnostics to check.

Jim, concurrent execution requires a minimum of one task per logical processor utilized. So taking full advantage of a hyperthreaded dual core requires the four task created here; there's no overallocation, the TPL schedules the tasks onto the .NET thread pool, and a 1.5+ second second lifespan is more than sufficient to amortize overhead. In the case where two compute tasks are added after the IO tasks complete they'll schedule onto existing threads, though not necessarily the same ones as the IO tasks as the thread pool typically has around 30 threads available. There are limitations in how fast the TPL's work stealing can move tasks onto threads on idle logical processors but I haven't seen these at rates below 25 context swaps per logical processor per second.

jimdempseyatthecove · ‎02-27-2018

On a dual core/4 thread laptop you do not need to muck with the BIOS but what you should do is affinity pin the compute threads to different cores and let the I/O threads float (presumably they will run on HT siblings of the compute threads).

It looks like you could run 3 I/O threads (no affinity pinning) + 2 compute threads (pinned to different cores).

The I/O threads generally do not use SSE/AVX and therefore would present little/lesser competition for the VPU (Vector Floating Point) units.The reason you can run 3 I/O threads (presumably on 2 HT siblings 1/core) is that during I/O wait time the I/O thread is suspended. You may also want to experiment with setting the run time priority of the compute threads above that of the I/O threads. IOW to force them to not interfere with the compute threads and thus compete for the availability of the HT siblings of the compute threads.

Jim Dempsey

Todd_W_ · ‎02-27-2018

Hi Jim, thanks for the suggestions. I should probably clarify two things:

The IO tasks are also compute hot. By CPU sampling they spend 85% of their time in CreateFile() and only 5% in ReadFile(). I think it's not well known, but it's most efficient to call ReadFile() synchronously on small reads like the ones here (if you attempt an async call it completes synchronously anyway). So, while the workload is IO, it looks a lot more like what would usually be called a CPU bound task than the more common way of thinking of IO as large reads bound to IOCPs.
The SIMD workload of the compute tasks is integer rather than floating point, primarily a . One thing I was hoping to get out of VTune was a view of how the integer ALU port loading looks on a core for IO only, compute only, and IO+compute on HT siblings. But, since it just BSODs, can't get there from here.

I've tried various thread mixes and pinnings and the 2:2 case seems to be optimum for 2 core/4 thread in the current situation. Haven't looked at the effects of priorities, though.

jimdempseyatthecove · ‎02-28-2018

CreateFile() and ReadFile() have/use sequential and non-Vector code and therefore would not interfere too badly with vector code running on the HT sibling of the same core.

As you indicated SIMD integer would likely not have HT siblings competing of integerVectorALU as it does for some FloatingPointVectorFPU(VPU).

BSOD: The system should not BSOD. As a work around, do you have a desktop similar generation system? If so

a) VTune on desktop as the relative behavior will be similar to notebook.
b) VTune has (can be setup) to run on the desktop but use remote procedure call to run the application on the notebook to collect sampling data.

Also note, in earlier versions of VTune at my site I could run Timer based sampling when Event based sampling would fail. Experiment with simplifying the sampling.

Jim Dempsey

Todd_W_ · ‎03-02-2018

Unfortunately the only VTune analyses which don't BSOD are the basic ones with less information than Visual Studio's profiling. Desktop hardware is no problem but I don't have admin. So, while it'd be interesting to try RPC, can't quite get there from here,

However, I've replicated the lack of integer SIMD compute scaling on another workload which was profiled with Parallel Studio XE 2017. At that time, VTune indicated entirely memory bound performance with the platform running delivering about 28% of its 25.6 GB/s maximum memory bandwidth. The code in this workload hasn't changed and the disassembly still looks tight (no register spilling or such) but it's now profiling at 19% of maximum memory bandwidth. My rough estimate for two IO threads and two compute threads is 18.4% with the compute workload likely being over 15%. The lack of scaling which prompted this thread is therefore plausibly attributed to exhaustion of memory bandwidth. This implies a few optimizations to try but the code changes are somewhat substantial it'll be a bit before I can get to those experiments.

In the meantime, the reduction from 28 to 19% of potential troubles me as I've not an explanation for it. Having expectations for what percentage is typically achieved would be valuable context. But it appears there's little discussion of this parameter, either in general or for the aligned, sequential __m128i and __m256i loads and stores occurring here.

jimdempseyatthecove · ‎03-03-2018

With memory bandwidth limited programs, there are a few things you can do to improve program performance.

One issue that impacts memory performance is the number of pages of virtual memory that the data and code occupy. CPU cores have a limited number of pages that can be mapped concurrently. Unfortunately you cannot run VTune to measure this, but you can look at your program and code to "compact" the dispersement of the data into fewer pages. There is a similar issue with code.

A second issue is if the operations are vectorizable, assure that the data is aligned whenever possible.

Jim Dempsey

TimP · ‎03-04-2018

Data TLB misses, which are a subject of VTune analysis, are the usual performance factor with lack of page locality.

Todd_W_ · ‎03-05-2018

Yep. Though, from a practical perspective, since VTune BSODs it's now a bit difficult to get TLB information into an analysis and I didn't retain the VTune data from 2017. pcm.exe from Intel's Processor Counter Monitor doesn't report TLB information. pcm-memory.exe might but it seems so far I've not hit upon the right combination in its quirky install process and instructions for it to be able to access the memory counters.

I suspect there's a few things going on in this direction. While putting an IO+compute pair on HT siblings is more efficient than the alternative arrangements, the workloads are in different assemblies working with different data. L2 misses are about every 200 instructions and CPI drops from around 2 to 0.8 once the IO tasks complete and the just the compute tasks are running. It's difficult to flatten the C# data structures, so likely there is some memory scatter based on what the GC is deciding to do, and reading 8k from the 6000 files in the basic test loads 48MB. The IO tasks therefore don't have to get much ahead of the compute tasks for 8k chunks to move from L3 into main memory and then need to be brought back, which seems consistent with pcm.exe reporting an L3 miss every 500 instructions or so. The compute tasks do have a working set less than 100k and ought therefore to fit in L2. But they may be getting bumped by the IO tasks. I can likely confirm or deny with more probing but haven't opportunity to run those experiments yet.

Another consideration is that, while threads are affinitized to cores, data currently isn't. L2 and L3 behaviour may be improvable with partitioning which affinitizes data and by limiting how far IO can get ahead of its compute sibling. Basic probing in these directions has consistently indicated regression but I'm not yet ready to say investigation is complete.

jimdempseyatthecove · ‎03-05-2018

If you can make the C# data structures persistent this may help some (I've done this before). Also, investigate if you can change from C# predisposition to use Array Of Strctures to Structure Of Arrays.(at least for the compute intensive data that can yield a beneficial return on your investment in programming time).

Jim Dempsey

Todd_W_ · ‎03-05-2018

That's why the C++ layer exists below C# with the C++/CLI just thin glue. ;-) Can't do anything about CreateFile() but the SIMD bits I control run on a persistent 80k block from _aligned_malloc allocated by each compute task. The output back into C# is 16 bytes and the C# side structures are also persistent. They're considerably more layered than typical AoS and SoA examples but, to use the terms casually, they're more AoS than SoA. While I've been nudging them towards SoA over the years for other performance reasons that work is complicated enough it's really best kept a distinct effort. As general context, what's happening in the 6000 file example considered here is 48MB is extracted from 11.3GB of files and processed to further extract just a few pieces of interesting information, yielding 1MB in a database. A major part of that second extraction is done in a computationally light call into a C# dependency taking the 48MB as its input. This is as minimal as I can make the arrangement without starting to duplicate the dependency's code. I'll have to check the memory footprint of the C# database structures later but it shouldn't be more than a handful of MB. It's all off the compute paths so the cost of snapping pointers to it should be well amortized.

There are some things I can try here around data affinity, limiting spill of 8k blocks into main memory, and allocating the 8k as array elements in larger contiguous blocks. But the 8k related bandwidth is about 25MB/s and pcm.exe's reporting only 1.8GB/s read+write combined. So potential for improvement appears limited.

diagnosing unexpectedly poor scaling among IO and compute threads