best practice for evaluating AVX2 vs SSE4 parallel task power?

Todd_W_ · ‎03-27-2017

Hi, I'm working with some on-demand, latency-sensitive computations which parallelize well and typically take 10-30ms to complete. VTune indicates their AVX implementation generally has higher latency than the SSE one because SpeedStep is less aggressive in boosting core frequencies from idle with AVX. The additional width of AVX does offer a lower millisecond * GHz product than SSE, but not by so much as to be unambiguously lower power when one considers the core voltage increase applied for AVX. It's therefore unclear if it's worth dispatching to the AVX implementation on processors where AVX is available.
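For illustration, the tradeoff above can be put in numbers with the usual CMOS approximation that dynamic energy scales with V^2 * cycles, where cycles is proportional to the millisecond * GHz product. The C++ sketch below is only that: `RunSample` and both function names are made up for this example, and leakage, uncore, and DRAM power are ignored, so the result is a rough ratio rather than a measurement.

```cpp
#include <cassert>

// One measured run of a task. All three fields are assumed to be averages
// over the run, obtained from whatever profiling tool is available.
struct RunSample {
    double elapsed_ms;  // wall time of the task
    double core_ghz;    // average core clock during the task
    double core_volts;  // average core voltage during the task
};

// Relative dynamic energy under the approximation E ~ V^2 * cycles, with
// cycles proportional to elapsed_ms * core_ghz. The unit is arbitrary, so
// only ratios between runs on the same machine are meaningful.
double relative_dynamic_energy(const RunSample& s) {
    assert(s.elapsed_ms > 0 && s.core_ghz > 0 && s.core_volts > 0);
    return s.elapsed_ms * s.core_ghz * s.core_volts * s.core_volts;
}

// Below 1.0: the AVX run used less dynamic energy than the SSE run.
double avx_to_sse_energy_ratio(const RunSample& avx, const RunSample& sse) {
    return relative_dynamic_energy(avx) / relative_dynamic_energy(sse);
}
```

With measured times, clocks, and voltages plugged in, a ratio below 1.0 would favor AVX on dynamic energy alone. Because voltage enters squared, a modest voltage increase can cancel a sizable reduction in the millisecond * GHz product.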

What's the best software method to track core voltages and such while the tasks are running? I don't see that VTune 2017 Update 2 includes power in its platform information, and searches in this direction lead into a maze of obsolete and contradictory information. From what I can tell, getConsumedJoules() in OpenPCM appears to be the currently preferred approach for such high-resolution timing. But, somewhat to my surprise, OpenPCM requires a somewhat involved build and installation process, doesn't release binaries, and lacks a NuGet package with a .lib one can just link to. Its license does allow extracting the few bits of code I need, but this seems an unnecessarily difficult method for such a simple calculation. I'm aware of Intel Energy Profiler but it's not part of my license and, from the description, it's unclear if it supports sampling at rates above 10Hz, whereas 1+kHz is desirable here.
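As a rough idea of how small the calculation is, here is a C++ sketch of the arithmetic, assuming the energy figure comes from the RAPL package energy counter described in the Intel SDM (MSR_RAPL_POWER_UNIT at 0x606 and MSR_PKG_ENERGY_STATUS at 0x611). The function names are hypothetical, and the sketch deliberately leaves out the hard part: reading an MSR needs ring-0 access, so on Windows the raw values would have to come from a driver.

```cpp
#include <cassert>
#include <cstdint>

// Joules per counter tick, decoded from a raw MSR_RAPL_POWER_UNIT value.
// The Energy Status Units field is bits 12:8; one tick is 1 / 2^ESU joules.
double rapl_joules_per_tick(std::uint64_t rapl_power_unit_msr) {
    const unsigned esu = static_cast<unsigned>((rapl_power_unit_msr >> 8) & 0x1F);
    return 1.0 / static_cast<double>(1ULL << esu);
}

// Joules consumed between two raw reads of MSR_PKG_ENERGY_STATUS.
// The counter is 32 bits wide and wraps around; unsigned subtraction gives
// the right delta as long as it wrapped at most once between the reads.
double rapl_joules_between(std::uint64_t before_msr, std::uint64_t after_msr,
                           double joules_per_tick) {
    assert(joules_per_tick > 0.0);
    const std::uint32_t before = static_cast<std::uint32_t>(before_msr);
    const std::uint32_t after = static_cast<std::uint32_t>(after_msr);
    const std::uint32_t ticks = after - before;
    return ticks * joules_per_tick;
}
```

Reading the counter before and after a task and dividing the joules by the elapsed time gives average package power for that task. This covers the whole package, not individual cores, and it does not report voltage.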

Is there a better way? Profiling at coding time is OK but I wouldn't mind making the app smart enough to evaluate both widths and select whichever runs better.
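A minimal C++ sketch of that kind of self-calibration, judging "better" by latency only: time each implementation a few times and switch to AVX only if it wins by a margin. `Width`, `median_ms`, `choose_width`, and the `min_gain` threshold are all hypothetical names chosen for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

enum class Width { Sse, Avx };

// Median wall time of `trials` runs of `task`, in milliseconds.
// The median is used so that one slow outlier does not decide the result.
double median_ms(const std::function<void()>& task, int trials) {
    assert(trials > 0);
    std::vector<double> ms;
    ms.reserve(static_cast<std::size_t>(trials));
    for (int i = 0; i < trials; ++i) {
        const auto start = std::chrono::steady_clock::now();
        task();
        const auto stop = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(stop - start).count());
    }
    const std::size_t mid = ms.size() / 2;
    std::nth_element(ms.begin(), ms.begin() + mid, ms.end());
    return ms[mid];
}

// Keeps SSE unless AVX is faster by at least `min_gain` (0.05 means 5%).
Width choose_width(double sse_ms, double avx_ms, double min_gain) {
    return avx_ms < sse_ms * (1.0 - min_gain) ? Width::Avx : Width::Sse;
}
```

Since the latency difference described above comes from how the clock ramps up from idle, trials would need to be spaced like real requests rather than run back to back; a tight benchmark loop would measure the already-boosted case.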

gaston-hillar · ‎03-31-2017

Todd,

Whenever I need to measure energy consumption, I use the PowerLog utility of Intel Power Gadget. Simple yet very useful. Here is the link: https://software.intel.com/en-us/articles/intel-power-gadget-20

gaston-hillar · ‎03-31-2017

Todd,

I didn't have time a few hours ago to provide you this additional info. The following link provides you the documentation about the Intel Power Gadget API: https://software.intel.com/en-us/blogs/2014/01/07/using-the-intel-power-gadget-30-api-on-windows

Todd_W_ · ‎04-01-2017

Thanks, Gastón! In its app form Power Gadget relies on Sleep() to space samples, so they occur at a stochastic interval typically in the range of 7-26ms with a mode near the 16ms system timer tick. From an initial look I'm hesitant about the underlying API---among other things it's used to report only a single CPU frequency rather than per-core frequencies, which VTune indicates can be a poor model---but will have to do some investigation later when I have more time.
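For comparison, here is a C++ sketch of sampling on a fixed schedule without Sleep(), by spinning on a monotonic clock until each sample is due. `Sample`, `sample_fixed_rate`, and the `read_value` callback are hypothetical; the callback stands in for whatever actually reads the power or frequency figure.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

struct Sample {
    double t_ms;   // time since sampling started, in milliseconds
    double value;  // whatever read_value returned at that time
};

// Takes `count` samples spaced `period` apart. Each due time is computed from
// the start time, so lateness on one sample does not accumulate into drift.
std::vector<Sample> sample_fixed_rate(std::chrono::microseconds period, std::size_t count,
                                      const std::function<double()>& read_value) {
    assert(period.count() > 0);
    using clock = std::chrono::steady_clock;
    std::vector<Sample> out;
    out.reserve(count);
    const auto start = clock::now();
    for (std::size_t i = 0; i < count; ++i) {
        const auto due = start + period * static_cast<std::chrono::microseconds::rep>(i);
        while (clock::now() < due) {
            // Busy-wait: keeps timing tight but occupies a core.
        }
        const auto now = clock::now();
        out.push_back(Sample{std::chrono::duration<double, std::milli>(now - start).count(),
                             read_value()});
    }
    return out;
}
```

The obvious cost is that the spinning thread keeps one core busy, which itself changes package power and possibly the frequency decisions being studied, so this only makes sense where a core can be spared for the sampler.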

jimdempseyatthecove · ‎04-01-2017

Todd,

There was a post on this forum (https://software.intel.com/en-us/forums/intel-isa-extensions/topic/710248 #3) by John McCalpin "Dr. Bandwidth" explaining that on some Intel CPUs, when a core has not executed any AVX/AVX2 instructions for more than (approximately) 1ms, the core shuts down the upper half of the AVX engine. Once that has happened, there is a delay of about 10us to get the upper half going again.

The solution to your latency problem may be as simple as inserting some innocuous AVX/AVX2 instructions in the code that runs between your low latency requirements. I believe this is on a per-core basis. You might be able to make a replacement Sleep function that includes the AVX/AVX2 instructions.

Jim Dempsey

Todd_W_ · ‎04-03-2017

Hi Jim, thanks. I'm aware of the cost of starting the upper 128 bits but 10us should be less than 0.1% overhead here (once a task is initiated each core's for loop runs nearly continuously). That's negligible compared to the range in core frequency choices SpeedStep makes at the start of the tasks. For AVX I've measured anywhere from 400MHz to 1.7GHz, though most commonly between 690MHz and 1.4GHz is chosen. Some load balancing overhead is also imposed, as core frequencies are typically close but not quite matched and every so often one core ends up clocked substantially faster or slower than the others (±50%, say). There are also some problems with tail latencies, as SpeedStep doesn't necessarily raise core frequencies in response to sustained load. On my i5-4200U test system VTune shows core frequencies can change just a little over 50 times per second, so if SpeedStep makes a low initial choice and stays low at the first opportunity to revise after ~19.5ms, one can end up with things like a task which normally takes around 15ms to complete taking 54ms. Or 63 or 76 or...

It happens this particular set of tasks are all 100% L3 bound and widen trivially. So it is OK in this case if SpeedStep chooses an AVX clock only half of an SSE clock as there's negligible penalty. VTune also shows AVX is effective at superqueue offloading, though it may not matter as both SSE and AVX max out at 11GB/s memory bandwidth. That's only about 40% of theoretical max so I suspect the limiting factor is ultimately DRAM timings.

As a software developer I think all of this is out of my hands. What it seems one can do is try to measure the platform provided and select for the best available tradeoffs.

jimdempseyatthecove · ‎04-03-2017

Most system BIOS for Intel CPUs have controls for:

Intel Adaptive Thermal Monitor
Enhanced Intel Speed Step Technology
Turbo Mode
...

If you monitor your CPU temperature and/or improve CPU cooling, you may be able to set the cores to a consistent speed (maybe not at max Turbo Frequency, but in the process increasing the lowest speed). If your workload per hardware thread (2 or 4 in your case) is unbalanced, consider tweaking your parallel regions so as to balance out the loads. If you have disabled HT, you can also experiment by enabling it, then alternating between HT siblings in an effort to spread the workload over a larger area of the CPU die.

Jim Dempsey

Todd_W_ · ‎04-03-2017

Yep. However, the user base in this case isn't technical and doesn't necessarily have admin. So checking or adjusting BIOS settings isn't feasible. It's also safe to assume hardware cooling won't be modified from standardized OEM SKUs, particularly on ultrabooks.

So far as I know planning balance in software isn't possible as there's no way to ask SpeedStep which cores will run at which frequencies (or, even better, tell it what frequency to use). Am I missing something? The workload is amenable to relatively fine granularity so I've reactive balancing in place to mitigate slow core cases. When last I tested reducing to two threads from four was disadvantageous. That's consistent with Intel Power Gadget never reporting more than a 3C increase in core temperature and, within the limits of the tool, both SSE and AVX remaining well below TDP. In fairness I could investigate more deeply but it's probably more solution than the problem really calls for.
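For readers unfamiliar with the idea, reactive balancing of this kind can be sketched in C++ as threads claiming small chunks from a shared counter, so a core that happens to be clocked lower simply ends up taking fewer chunks. This is a generic illustration, not the implementation referred to above; `parallel_for_dynamic` and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Runs body(begin, end) over [0, total) in chunks of at most `chunk` items.
// Chunks are handed out on demand, so faster threads process more of them.
void parallel_for_dynamic(std::size_t total, std::size_t chunk, unsigned threads,
                          const std::function<void(std::size_t, std::size_t)>& body) {
    assert(chunk > 0 && threads > 0);
    std::atomic<std::size_t> next{0};
    auto worker = [&] {
        for (;;) {
            // Claim the next unprocessed chunk; stop when the range is used up.
            const std::size_t begin = next.fetch_add(chunk, std::memory_order_relaxed);
            if (begin >= total) return;
            body(begin, std::min(begin + chunk, total));
        }
    };
    std::vector<std::thread> pool;
    for (unsigned t = 1; t < threads; ++t) pool.emplace_back(worker);
    worker();  // the calling thread takes part as well
    for (auto& th : pool) th.join();
}
```

The chunk size is the tuning knob: smaller chunks track frequency differences more closely but add contention on the shared counter. A production version would normally reuse a persistent thread pool rather than create threads per call, since thread startup is not negligible next to a 10-30ms task.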

jimdempseyatthecove · ‎04-03-2017

>>So far as I know planning balance in software isn't possible as there's no way to ask SpeedStep which cores will run at which frequencies (or, even better, tell it what frequency to use). Am I missing something?

Possibly.

64-ia-32-architectures-software-developer-manual-325462-055US-June-2015.pdf, section 14.8

The thermal management facilities like IA32_THERM_INTERRUPT and IA32_THERM_STATUS are often implemented with a processor core granularity.

So you may be able to use the IA32_THERM_STATUS MSR to get the temperature relative to the TCC activation temperature (number of degrees below TCC), then schedule your per-core workload accordingly (give the more CPU-intensive work to the core with the largest value).
See section 14.7.5.2, Reading the Digital Sensor.
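A small C++ sketch of the decoding step, assuming the raw per-core IA32_THERM_STATUS (MSR 0x19C) values have already been obtained. Per the SDM layout, bit 31 is Reading Valid and bits 22:16 are the Digital Readout in degrees C below TCC activation. Both function names are hypothetical, and the MSR read itself is not shown because it requires ring-0 access on the core in question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Degrees C below TCC activation, or -1 if the Reading Valid bit is clear.
int degrees_below_tcc(std::uint64_t therm_status) {
    if (((therm_status >> 31) & 1u) == 0) return -1;
    return static_cast<int>((therm_status >> 16) & 0x7F);
}

// Index of the core with the largest valid margin below TCC activation,
// or -1 if no core reported a valid reading. Ties go to the lowest index.
int core_with_most_headroom(const std::vector<std::uint64_t>& per_core_status) {
    assert(!per_core_status.empty());
    int best_core = -1;
    int best_margin = -1;
    for (std::size_t i = 0; i < per_core_status.size(); ++i) {
        const int margin = degrees_below_tcc(per_core_status[i]);
        if (margin > best_margin) {
            best_margin = margin;
            best_core = static_cast<int>(i);
        }
    }
    return best_core;
}
```

Acting on the result would additionally mean pinning the heavier work to that core with thread affinity, and the reading says how far a core is from throttling, not what frequency it will be given.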

Jim Dempsey