Intel® ISA Extensions

Pause instruction cost and proper use in spin loops

andysem
New Contributor III

Hi,

 

According to Agner Fog's instruction tables and my own tests, the `pause` instruction is much more expensive in terms of reciprocal throughput on Skylake-X CPUs than on the previous generations (I tested on Sandy Bridge and Broadwell). The difference is an order of magnitude. Agner Fog lists the following reciprocal throughput for the `pause` instruction:

 

Sandy Bridge: 11 clocks

Haswell, Broadwell: 9 clocks

Skylake-X: 141 clocks

 

My own tests show the following numbers:

Sandy Bridge: ~24 clocks

Broadwell: ~12 clocks

Skylake-X: ~389 clocks
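
For illustration, a minimal sketch of this kind of measurement (not my exact test; loop overhead is included, and `__rdtsc` counts reference cycles rather than core cycles, so the numbers are only approximate):

    #include <cstdint>
    #include <cstdio>
    #include <x86intrin.h>   // _mm_pause, __rdtsc (GCC/Clang; use <intrin.h> on MSVC)

    int main() {
        constexpr int kIters = 1000000;

        // Brief warm-up before timing.
        for (int i = 0; i < 1000; ++i)
            _mm_pause();

        const std::uint64_t start = __rdtsc();
        for (int i = 0; i < kIters; ++i)
            _mm_pause();
        const std::uint64_t stop = __rdtsc();

        // Rough ticks per pause, loop overhead included.
        std::printf("~%.1f ticks per pause\n",
                    static_cast<double>(stop - start) / kIters);
    }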

 

The exact numbers are not that important to me, and my measurement most probably includes the overhead of the test loop itself. What is important is that on Skylake-X `pause` is far more expensive than on the other architectures, to the point that I question whether I should revise my use of this instruction in various spin loops. So my questions are as follows:

1. Is it still sensible to use `pause` in tight spin loops in order to improve performance of hyperthreads?

2. Given the cost difference, should programmers account for the cost of `pause` when choosing the number of spin iterations? For instance, if I want a particular spin loop not to exceed ~500 clocks, which is a typical estimate of the cost of a context switch on Linux, should I derive the number of spin iterations from the actual cost of `pause`? (There is a rough sketch of what I mean below, after the questions.)

3. If the answer to question 2 is yes, are applications expected to benchmark `pause` before deciding on the number of spin iterations? What are the best practices for such benchmarking?

4. Is the Skylake-X order of costs considered "normal" and expected to remain the same in future generations? Or is it perhaps a CPU bug that is expected to be fixed? I understand that it is not officially known what will be implemented in future products, but a general position on the issue would also be of interest.
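
To make questions 2 and 3 concrete, the kind of loop I have in mind looks roughly like the sketch below (illustrative only; `flag`, the budget and the per-`pause` cost are placeholders, and the cost would have to come from a benchmark or a hardcoded per-CPU estimate):

    #include <atomic>
    #include <cstdint>
    #include <x86intrin.h>   // _mm_pause (GCC/Clang; use <intrin.h> on MSVC)

    // Spin on `flag` for at most roughly `budget_clocks`, then give up so the
    // caller can fall back to a blocking wait (futex, condition variable, ...).
    bool spin_wait(const std::atomic<bool>& flag,
                   std::uint64_t budget_clocks,      // e.g. ~500, cost of a context switch
                   std::uint64_t pause_cost_clocks)  // measured/assumed cost of one pause, > 0
    {
        const std::uint64_t max_iters = budget_clocks / pause_cost_clocks;
        for (std::uint64_t i = 0; i < max_iters; ++i) {
            if (flag.load(std::memory_order_acquire))
                return true;            // the event arrived while spinning
            _mm_pause();
        }
        return false;                   // budget exhausted, block instead
    }

With a ~10-clock `pause` this gives on the order of 50 iterations; with a ~140-clock `pause` it gives only a handful, which is exactly why I'm asking.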

 

Thanks.

jimdempseyatthecove
Honored Contributor III

It is unfortunate that there isn't a variation of MWAIT that can be executed in Ring 3 (user mode) and used both for inter-HT synchronization and for spin-wait loops within a process. What I would suggest be considered for this purpose is to use:

LOCK; TEST mem
and/or
LOCK; CMP ... with one of the arguments mem

Currently these would be invalid instructions as they are not read/modify/write.

The operation would be:

For TEST, wait until mem == 0 (or until interrupted by the O/S for preemption)
For CMP, wait until equal (or until interrupted by the O/S for preemption)

Note, following the TEST or CMP would be the branch back on fail.

  LOCK; DEC CountOfBusyThreads     // this thread atomically marks itself done
SpinLoop:
    LOCK; TEST CountOfBusyThreads  // (hypothetical) wait for remaining threads to complete
    JNE SpinLoop                   // branch back if woken before the count reached zero

A similar thing applies to CMP, except that you have more flexibility: for example, testing for counting up as well as counting down, or acting on each change in CountOfBusyThreads, which might be useful for performing reductions as each thread completes rather than after all threads complete.

Of course this ability would have to be listed in the CPU feature set.

The benefits of this are:

1) Spinwait loops occupy (near) zero CPU resources
2) Spinwait loop exit (response) has lower latency (no need to wait for completion of PAUSE)

Granted, you wouldn't use this in situations where your spin-wait is designed to suspend the thread. Rather, you would use it where you know there is a high probability that the spin-wait will complete and exit before the spin time expires.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

A variation (extension) of this would be to permit a REP prefix indicating that the maximum number of clock cycles to wait is held in (R/E)CX.

This would add a third benefit: you could suspend the thread on a timeout should a subsequent test indicate that the wait is not done.

Jim Dempsey

andysem
New Contributor III

>>For TEST, wait until mem == 0 (or until interrupted by the O/S for preemption)
>>For CMP, wait until equal (or until interrupted by the O/S for preemption)

I'm not sure this would be flexible enough in my cases. Spinning often requires complex tests on one or more memory locations, which would not be possible with the above instructions. Also, I'm not sure if complex checks like the above are easily implementable in hardware without performance loss. Note that `mwait` only monitors memory writes and does not analyze the written value.

 

There is already a `tpause` instruction, not yet available in released CPUs, which gives some control over the duration of the execution suspension.
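
For illustration, usage would presumably look something like the sketch below (assuming the `_tpause` intrinsic from `immintrin.h` and a CPU/compiler with WAITPKG support, e.g. built with -mwaitpkg; I obviously haven't been able to test this):

    #include <cstdint>
    #include <immintrin.h>   // _tpause (WAITPKG)
    #include <x86intrin.h>   // __rdtsc (GCC/Clang)

    // Pause for at most `max_wait` TSC ticks; the CPU (or the OS-imposed limit)
    // may end the wait earlier.
    inline void bounded_pause(std::uint64_t max_wait) {
        const std::uint64_t deadline = __rdtsc() + max_wait;
        _tpause(1 /* C0.1: lighter sleep state, faster wakeup */, deadline);
    }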

 

However, new instructions are not my main interest. For the foreseeable future I will have to target CPUs that don't have `tpause` or any other yet-unreleased instructions. This means that I'll have to stick with (or avoid using) `pause` and other existing instructions while spinning. So, my questions remain.

 

jimdempseyatthecove
Honored Contributor III

>> complex tests on one or more memory locations..

You could have a flag that says "test multiple locations - one changed". The value written there could be the index of the location to check first.

>>tpause

The issue with this is that you could, say, wait 100 clock ticks only to have the "go" event set one tick in, and you would then wait longer than necessary. The tpause is likely a rehash of the KNC delay instruction.

>>This means that I'll have to stick with (or avoid using) `pause` and other existing instructions while spinning

Try issuing a common instruction that is register-register but that consumes a large number of clock cycles

Skylake

IDIV r32  10 ticks
DIV r64  32 ticks
IDIV r64  57 ticks
RCR/RCL  r,i 8 ticks
LOOP(N)E 11 ticks

Or try x87 FPU instructions that are register-only (FSIN, ...)
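
For example, something along these lines (a rough C++ sketch, my names; the GCC/Clang inline asm only stops the compiler from turning the division by a constant into a multiply or hoisting it out of the loop; whether the divider contends with an HT sibling you would have to measure):

    #include <cstdint>

    // Burn a few tens of cycles per iteration with a dependent 64-bit integer
    // division instead of PAUSE.
    inline void divide_delay(int iterations) {
        std::uint64_t x = ~0ull;
        std::uint64_t divisor = 3;
        for (int i = 0; i < iterations; ++i) {
            asm volatile("" : "+r"(divisor));   // hide the divisor's value
            x = x / divisor + 1;                // dependent DIV r64 each iteration
        }
        asm volatile("" : "+r"(x));             // keep the result (and the loop) alive
    }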

Jim Dempsey

 

andysem
New Contributor III

>>Try issuing a common instruction that is register-register but that consumes a large number of clock cycles

I could, but that would waste resources that may be needed by a hyperthread, which means it won't be helpful for HT. Also, one of the potential advantages of `pause` is to reduce power consumption, which helps keep CPU clocks high.

>>Or try FPU

This will require the kernel to save and restore FPU state on task switches.

jimdempseyatthecove
Honored Contributor III

>>I could but that would waste resources that may be needed by a hyperthread, which means it won't be helpful for HT.

It is not the waste of resources per se, it is the waste of shared resources that affects the performance of an HT sibling. Of primary importance are the memory/cache controller and the VPU (vector processing unit). I do not know whether the built-in functions of the x87 FPU consume resources of the VPU. You already have a test program for PAUSE; I suggest you experiment.

>>advantages of `pause` is to reduce power consumption, which helps keep CPU clocks high

Correct, to some extent. It is the core temperature that throttles the core(s). The above-mentioned instructions have high tick counts because they are little used, and thus may be located in different areas of the die, meaning any heat is dispersed. Again, this is a case where experimentation would make the determination.

>>This will require the kernel to save and restore FPU state on task switches.

The kernel does not know whether you used the FPU; therefore it must save and restore the FPU state on every context switch, unless it requires all threads not to use the FPU.

Jim Dempsey

andysem
New Contributor III

>>The kernel does not know whether you used the FPU; therefore it must save and restore the FPU state on every context switch, unless it requires all threads not to use the FPU.

It's not quite the case. The kernel can track whether a thread has used the FPU at all (for example, by trapping the first FPU instruction after a context switch), so threads that never touch the FPU don't have to pay for saving and restoring its state.

 

jimdempseyatthecove
Honored Contributor III

OK. However, if the save and restore is skipped when it is not necessary, then using the FPU for the "pause" would at most cause minor system overhead during a context switch, in which case your fast barrier is moot anyway.

Jim Dempsey

andysem
New Contributor III

Adding overhead to context switching means general system performance degradation. If two or more threads enter this spinning loop even once per time slice, that means an unavoidable penalty on the next context switch, which might well exceed the cost of the spinning in the first place. And that is not considering any possible power/heat/frequency implications of having to wake up the x87 FPU.

 

After experimenting and thinking things over, I decided to remove spinning in a few places in my code. I tried the benchmarking approach, but it turned out to be impractical: the results were too unstable from one run to another. I also don't want to hardcode particular CPU models as having an expensive `pause`, because I don't know whether that is expected to change in future models. Unfortunately, it looks like `pause` is not well suited for spin loops where the spinning duration might be a concern, which is often the case.

 
