Looking for help on a hyperthreading (?) issue

Jim_H_ · ‎07-03-2017

Without getting too detailed here, I am interested in finding out why some DAW software has difficulty spreading load among CPU cores.

Some seem to put extreme CPU hogs in two logical cores in a physical core even when other cores are fairly idle.

I have been searching for a solution for some time and I was given advice that I may find help here.

If this is the place, I'll write more details, if not, I'll keep on looking.

Jim_H_ · ‎07-04-2017

OK, since a moderator approved my post, I'll supply more details. I'll be as succinct as possible, but I can add more if needed.

I think my problem is related to hyperthreading.

I build some software for experimental music using a tool called Reaktor. Some are very CPU intensive. I share these freely with other users on a library forum provided by the vendor.

Reaktor is a single core application, so to achieve more we may have to split our stuff in pieces and chain them together using a Digital Audio Workstation on separate audio tracks. These DAWs put tracks in separate cores.

On three different DAWs I see three different behaviors. One places them seemingly at random, but usually on adjacent logical cores, i.e. on the same physical core. This gives very poor results. I often cannot run 2 such tracks.

Another does a better job but has a bit more overhead and I can run 3 such tracks.

Another allows me to specify which cores it may use, so I select odd-numbered cores only. I have 4 physical/8 logical cores and in this DAW I can easily use 4 tracks.

By assigning CPU affinity to 4 Reaktor processes running standalone, I can also manually create 4 such instruments.
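
For readers who want to try the same kind of pinning, here is a minimal Python sketch of the idea. It assumes sibling logical processors are numbered adjacently (0/1, 2/3, ...), which matches the odd-numbered selection described above but is not guaranteed on every machine. The helper names (one_per_core_cpus, affinity_mask, pin_current_process) are made up for illustration; SetProcessAffinityMask and os.sched_setaffinity are the real Windows and Linux calls.

```python
import os
import sys


def one_per_core_cpus(logical_count, threads_per_core=2, offset=1):
    """Pick one logical processor per physical core.

    ASSUMPTION: siblings are numbered adjacently (0/1, 2/3, ...).
    With 8 logical processors and offset=1 this yields 1, 3, 5, 7.
    """
    return list(range(offset, logical_count, threads_per_core))


def affinity_mask(cpus):
    """Turn a list of logical processor numbers into a bitmask."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


def pin_current_process(mask):
    """Try to restrict this process to the CPUs in mask; True on success."""
    if sys.platform == "win32":
        import ctypes
        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        kernel32.GetCurrentProcess.restype = ctypes.c_void_p
        kernel32.SetProcessAffinityMask.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
        kernel32.SetProcessAffinityMask.restype = ctypes.c_int
        return bool(kernel32.SetProcessAffinityMask(kernel32.GetCurrentProcess(), mask))
    if hasattr(os, "sched_setaffinity"):  # Linux
        cpus = {i for i in range(mask.bit_length()) if (mask >> i) & 1}
        os.sched_setaffinity(0, cpus)
        return True
    return False  # e.g. macOS: no comparable call in the standard library


if __name__ == "__main__":
    cpus = one_per_core_cpus(8)
    print(cpus, hex(affinity_mask(cpus)))
    # One mask per standalone process, each on its own physical core:
    print([hex(affinity_mask([cpu])) for cpu in cpus])
```

Giving each of the four processes its own single-CPU mask mirrors the manual setup; a combined mask would instead let the scheduler move a process among the four selected logical processors.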

If I turn off hyperthreading, I can get 4 such tracks in the problem DAWs, I think - I did not do extensive testing that way as turning HT off created a few other issues for me.

I use an overclocked Core i7-2600K @ 4.4 GHz that I built in 2011. It has been very stable. But I fear maybe I have set up something that is not helping this.

I've re-checked all my overclock BIOS parameters and tried different settings. I get the same results.

I want others to use stuff I do, and they are on Macs and PCs. Most users are musicians who never build Reaktor stuff. There are something like 100 users who are builders like me, and few make large intensive stuff like me. I don't want to tell users to turn off HT to use my stuff.

I'm looking for someone to give me some technical advice.

Any ideas on things to look for that may help?

TimP · ‎07-05-2017

You should be running Windows 7 SP1 or a newer Windows version in order to support scheduling for hyperthreading. Even then, Windows may exhibit more difficulty than other OS in allocating threads evenly across cores. If the application has been built for a version of OpenMP such as Intel libiomp5 which supports affinity, and the number of threads is set to number of cores, it is important to set OMP_PLACES=cores.
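
As a rough sketch of that setting, the snippet below builds an environment with OMP_NUM_THREADS equal to the number of physical cores and OMP_PLACES=cores, then launches a program with it. It only has an effect if the program was actually built against an OpenMP runtime that honors these variables; the function names and the command passed to launch are placeholders, not anything taken from the applications in this thread.

```python
import os
import subprocess


def openmp_env(physical_cores, base_env=None):
    """Copy an environment and add one-thread-per-core OpenMP settings."""
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(physical_cores)  # one thread per physical core
    env["OMP_PLACES"] = "cores"                   # one place per physical core
    return env


def launch(command, physical_cores):
    """Run command (a list of strings, hypothetical here) with that environment."""
    return subprocess.run(command, env=openmp_env(physical_cores), check=False)


if __name__ == "__main__":
    # Show only the added variables, starting from an empty environment.
    print(openmp_env(4, base_env={}))
```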

Other threading libraries, including gnu and Microsoft OpenMP, and Intel cilk(tm) plus, don't offer a facility to schedule threads evenly across cores. If it is not possible to disable HyperThreading, it may be necessary to experiment with setting a number of threads or workers which is greater than number of cores but less than the total number of logical processors. There isn't any satisfactory universal solution to the question of running multiple applications with minimal contention for logical processors without instructing each application to use its own group of cores.

Your 2nd comment wasn't visible at the time I answered, even though the web site time stamps indicate a 3 hour lag. Pinning an application to odd-numbered logical processors seems a reasonable solution if not using Intel OpenMP.

McCalpinJohn · ‎07-05-2017

The GNU OpenMP implementation allows scheduling for HyperThreading, but you have to know which core numbers to use and assign them manually. An exception is on Mac OS X, which (as far as I can tell) does not provide the basic functionality of binding processes to processors.
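
To illustrate the manual assignment, here is a small sketch that formats a hand-picked list of logical processors in two ways: as an explicit OMP_PLACES list, and as a value for libgomp's GOMP_CPU_AFFINITY variable. The CPU numbers 1, 3, 5, 7 are just the example used earlier in this thread and assume adjacent numbering of sibling logical processors; on a real Linux system the numbering should be checked first. The function names are invented for this example.

```python
def explicit_places(cpus):
    """Format CPUs as an explicit OMP_PLACES list, one place per CPU."""
    return ",".join("{%d}" % cpu for cpu in cpus)


def gomp_cpu_affinity(cpus):
    """Format CPUs as a space-separated GOMP_CPU_AFFINITY value."""
    return " ".join(str(cpu) for cpu in cpus)


if __name__ == "__main__":
    chosen = [1, 3, 5, 7]  # ASSUMPTION: one logical processor per physical core
    print("OMP_PLACES=" + explicit_places(chosen))
    print("GOMP_CPU_AFFINITY=" + gomp_cpu_affinity(chosen))
```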

The problem is not limited to HyperThreading. Running RHEL/CentOS 6.4 on our systems with 2 Xeon E5-2680 (v1) processors (8 cores per chip, 16 cores per node, HyperThreading disabled), it does not appear to be possible to get an OpenMP code to use more than about 13 of the 16 cores without explicit thread binding. Fortunately this is fixed in RHEL/CentOS 7 versions -- but we still recommend explicit control of thread placement whenever it is feasible.

TimP · ‎07-05-2017

gnu OpenMP lacks support for affinity on Windows (as, apparently, MacOS). According to the docs, current versions should support OMP_PLACES on linux, although John surely is more up to date on how this relates to the most often used distros.

Windows does not appear to have upgraded its support for scheduling since Intel made the improvements for win7 sp1.

Apparently, the OP isn't using a high level parallel library such as OpenMP, and, as we just said, the gnu OpenMP wouldn't help scheduling on the OS in question.

The boost::thread library has been used in similar contexts, but it seems difficult to find out how, or even whether, it handles affinity on various OSes. From experience, we know that suspending and resuming a boost thread causes it to lose memory locality, and there seems to be some hint of this in the documentation.

Past advice from Microsoft has indicated that a suspended Windows thread is expected to lose memory locality. The Intel OpenMP library seems capable of reattaching memory locality, but it can take a while, hence a suggestion to modify the wait policies. I don't know if you could find hints about this in the open source or llvm OpenMP projects.

Jim_H_ · ‎07-05-2017

I build in Reaktor 5 and 6 - Native Instruments Reaktor is a graphical development tool for audio. It does not have scripting or text-based features and I have no control over what libraries it uses.

I use Windows 10 pro, latest build. The three DAWs I have to experiment with are Ableton Live, Cakewalk Sonar and Cockos Reaper.

The CPU is Sandy Bridge Core i7-2600K @ 4.4 GHz.

The Reaktor project (called 'ensemble') that I want to use has the ability to adjust CPU load so I can experiment with that only. I have targeted it to use 77% CPU load on a single core - that should give enough extra cpu for the OS and DAW (I think).

Live seems to allocate its threads haphazardly, often choosing adjacent logical cores for these loads. So it breaks for me when I load two ensembles.

Sonar allocates them better across cores, but it has some extra overhead doing this, it seems, and I can load three ensembles.

Reaper allows me to choose CPU 1,3,5 and 7 (or other combinations). That allows me to run four ensembles.

If I turn off hyperthreading in BIOS, I can also run four ensembles without making any other adjustments.

When I submit my ensembles to the Reaktor User Library, I always mention my machine capabilities and loads so users can judge for themselves, but I always still see complaints about the load anyway. I don't want to tell them to use this DAW or turn off HT or anything like that.

I've attached the loaded DLLs for these DAWs so maybe that can tell you what libraries the DAWs are using.

Jim_H_ · ‎07-05-2017

Here are some graphics to illustrate. I've worked on this off and on since 2015. The Live and Sonar graphics date back to then.

I've repeated the tests recently. Live is no different, while Sonar has improved a bit but still shows lots of overhead across all the cores. Reaper is new to me; I just bought it to work on this issue.

Live - audio dropouts after 2 ensembles

Sonar: no dropouts with 2, newer versions allow 3.

Reaper: 4 loaded, no dropouts. Windows GUI is not sluggish under this load. This shows my first test with Reaper under an evaluation license. I've since purchased it to continue testing.

Jim_H_ · ‎07-06-2017

My conclusion after doing these tests is that either I have misconfigured something in my system or BIOS, or hyperthreading simply won't work in these extreme cases. In the second case, a physical core does not have enough spare resources, because the extra load wants the same resources that the other logical processor of the pair is already using.
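
A back-of-the-envelope check of that second explanation, as a sketch only: treat one physical core running a single thread as capacity 1.0, and suppose two hyperthreads together can deliver somewhat more than that. The 1.25 figure below is purely an illustrative placeholder, not a measurement of this CPU, and the model ignores everything except raw throughput. The 0.77 load is the per-ensemble target mentioned elsewhere in this thread.

```python
def fits(loads, capacity):
    """True if the summed loads fit within the given capacity."""
    return sum(loads) <= capacity


ENSEMBLE_LOAD = 0.77          # target load per ensemble, as a fraction of one core
ASSUMED_SMT_CAPACITY = 1.25   # ASSUMPTION: combined capacity of two sibling threads

if __name__ == "__main__":
    # Two ensembles sharing one physical core through hyperthreading.
    print("same core:", fits([ENSEMBLE_LOAD, ENSEMBLE_LOAD], ASSUMED_SMT_CAPACITY))
    # One ensemble alone on a physical core.
    print("own core:", fits([ENSEMBLE_LOAD], 1.0))
    # Combined capacity two such loads would need from one physical core.
    print("needed:", round(2 * ENSEMBLE_LOAD, 2))
```

Under this crude model, two such loads need about 1.54 cores' worth of throughput, so they fit on one physical core only if hyperthreading delivered at least that much, whereas one load per physical core always fits.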

Newer CPUs than Sandy Bridge probably have a lot more on-die resources to share so maybe it is less common on them.

I went through all my BIOS overclocking parameters, changed a bunch of stuff, got near 4.999 GHz, then backed off to 4.4 GHz. I built the desktop in 2011 and it has been very stable.

No change in the results of the DAWs after these changes.

I guess it's about time to consider building a new system.... I'm looking forward to the July 11 teleconferences.