Cross processor code modification

Sanjoy_D_ · ‎04-12-2016

Hi,

I'm trying to figure out the right way to patch an instruction that is concurrently being executed on another core, possibly even on a separate socket. The only place in the Intel manual where something like this is mentioned is Volume 3 (System Programming Guide) / Section 8.1.3, where the manual says the patching thread and the executing thread need to implement a mutex-like protocol. I'd like to know if the protocol mentioned in 8.1.3 is required on newer Intel chips or if newer Intel chips in practice have a stronger guarantee around code patching that will let me implement something cheaper. I don't really care about making the patching thread fast (since we'll patch instructions very rarely), but we're not okay with slowing down the executing thread.

-- Sanjoy

jimdempseyatthecove · ‎04-13-2016

It may be easier to patch a functor than to patch the code itself. A problem with patching code directly is that the patch will not reach instructions already in the instruction pipeline (of either CPU/core). There may be a similar issue with the L1 instruction cache.

Jim Dempsey

Sanjoy_D_ · ‎04-13-2016

By "functor" do you mean a function pointer? Won't that make me always have an indirect call (i.e. load function pointer; call function pointer)? That's likely to be too slow.

> A problem you have with patching code directly is any patching will not patch instructions in the instruction pipeline (of either CPU/core). There may be similar issue with the L1 instruction cache.

So let's relax the constraints. Let's say I'm okay with other (or even the same) cores not seeing the update immediately, but "eventually" (for some negotiable definition of "eventually"). It is fine for cores that already have the old instruction in their pipeline to continue executing it, but cores are only allowed to execute either the pre-patched instruction or the post-patched instruction (but must not crash or execute garbage). Does that make things easier?

andysem · ‎04-14-2016

I think patching code can make a much more severe hit on performance than an indirect jump, even if it's frequently executed. This article may be helpful: http://blog.onlinedisassembler.com/blog/?p=133.

One other issue is atomicity of code modification. x86 instructions can have any size from 1 to 15 bytes, are not required to have any alignment, and may span cache line boundaries. Unless you have a very specific instruction that is suitably aligned and has a suitable size, this basically rules out atomic operations for modifying code in flight.
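To make the alignment condition concrete, here is a minimal C sketch of the check. The helper name `fits_in_aligned_block` is made up for illustration, and the 64-byte cache line used in the usage note is an assumption about the target CPU, not something stated in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if the byte range [addr, addr + len) lies entirely inside one
 * naturally aligned block of `block` bytes. `block` must be a power of two. */
static bool fits_in_aligned_block(uintptr_t addr, size_t len, size_t block)
{
    assert(block != 0 && (block & (block - 1)) == 0);
    if (len == 0 || len > block)
        return false;
    uintptr_t first = addr & ~(uintptr_t)(block - 1);             /* block holding the first byte */
    uintptr_t last  = (addr + len - 1) & ~(uintptr_t)(block - 1); /* block holding the last byte  */
    return first == last;
}
```

For example, a 5-byte `call rel32` starting at address 0x103E fails the check for a 64-byte block because its last byte lands at 0x1042, in the next line. The same helper can be called with a block size of 8 or 16 to test the narrower windows discussed later in the thread.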

jimdempseyatthecove · ‎04-14-2016

The functor can be placed in a register. Therefore, if you fetch the functor from memory into a register (typically 4 clock cycles from L1, longer right after the functor changes), you can execute other useful instructions during the memory/cache latency.

Pseudo code:

r15 = qword ptr [YourFunctor] ; likely in L1, in RAM only after change by other thread
push arg
push arg
push arg
call r15 ; only stalls here if/when latency exceeds time for intervening instructions.
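
In C terms, the functor approach might look like the following minimal sketch. The names `handler_fn`, `handler_v1`, `handler_v2`, `current_handler`, `call_handler` and `patch_handler` are hypothetical, and the sketch assumes a C11 compiler with `<stdatomic.h>`. On x86 an acquire load compiles to an ordinary `mov`, so the hot path stays a load plus an indirect call.

```c
#include <stdatomic.h>

typedef int (*handler_fn)(int);

/* Two interchangeable implementations (placeholders for real code). */
static int handler_v1(int x) { return x + 1; }
static int handler_v2(int x) { return x * 2; }

/* The "functor": one pointer-sized slot shared by all threads. */
static _Atomic(handler_fn) current_handler = handler_v1;

/* Hot path: load the current target, then call through it. */
static int call_handler(int arg)
{
    handler_fn fn = atomic_load_explicit(&current_handler, memory_order_acquire);
    return fn(arg);
}

/* Rare path: publish a new target. Threads that already loaded the old
 * pointer finish their call with the old implementation. */
static void patch_handler(handler_fn new_fn)
{
    atomic_store_explicit(&current_handler, new_fn, memory_order_release);
}
```

With this shape, "patching" is a single pointer store, and every caller sees either the old or the new function, never a mix of the two.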

Jim Dempsey

Sanjoy_D_ · ‎04-14-2016

andysem wrote:

I think patching code can make a much more severe hit on performance than an indirect jump, even if it's frequently executed. This article may be helpful: http://blog.onlinedisassembler.com/blog/?p=133.

The situation is that we "almost never" patch code, so the tradeoff we have is (say) 100 million indirect jumps / calls vs. a one time code patch.

andysem wrote:

One other issue is atomicity of code modification. x86 instructions can have any size from 1 to 15 bytes and are not required to have any alignment and may span across cache line boundaries. Unless you have a very specific instruction that is suitably aligned and have a suitable size, this basically rules out atomic operations for modifying code in flight.

But what if I do have the instructions properly aligned (i.e. they do not cross even a 16-byte boundary, let alone a cache line boundary)? Do I still have to follow the above protocol, or is there a shortcut that I can take on newer Intel architectures?

Sanjoy_D_ · ‎04-14-2016

jimdempseyatthecove wrote:

r15 = qword ptr [YourFunctor] ; likely in L1, in RAM only after change by other thread

push arg
push arg
push arg
call r15 ; only stalls here if/when latency exceeds time for intervening instructions.

I understand that I can make the indirect call very cheap, but it still is not free. It'll take up space in my L1 that could have been used by other things, for instance.

andysem · ‎04-18-2016

Sanjoy D. wrote:

But what if I do have the instructions properly aligned (i.e. they do not cross even a 16-byte boundary, let alone a cache line boundary)? Do I still have to follow the above protocol, or is there a shortcut that I can take on newer Intel architectures?

x86 in 32-bit mode has atomic ops on up to 8 bytes and in 64-bit mode on up to 16 bytes. If the instructions are suitably aligned then you should be able to use atomic instructions to update the code (e.g. using a CAS loop). If the modified instruction fits in 8 aligned bytes then you might even be able to use regular stores to write the modification as these are considered atomic as well. Cache coherency protocols work on cache line granularity, so with the cache line changes being atomic I don't see how other CPUs could see half-updated instructions and you should be safe.
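
A minimal C sketch of the "single aligned 8-byte store" idea follows. It works on an ordinary data word rather than on live code, so it only shows the writer's side: whether instruction fetch on other cores observes the store atomically is the open question of this thread and is not settled by the sketch. The function name `patch_slot` and the byte values in the usage note are illustrative assumptions.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Overwrite `len` bytes at byte offset `off` inside one 8-byte aligned slot
 * using a single 64-bit store. Returns 0 on success, or -1 if the bytes
 * would not fit inside the slot. Assumes a single patching thread. */
static int patch_slot(_Atomic uint64_t *slot, size_t off,
                      const unsigned char *new_bytes, size_t len)
{
    if (off >= 8 || len == 0 || len > 8 - off)
        return -1;

    /* Build the complete new 8-byte image first ... */
    uint64_t old_word = atomic_load_explicit(slot, memory_order_relaxed);
    unsigned char image[8];
    memcpy(image, &old_word, sizeof image);
    memcpy(image + off, new_bytes, len);

    /* ... then publish it with one store, so a reader of the slot sees
     * either the old 8 bytes or the new 8 bytes. */
    uint64_t new_word;
    memcpy(&new_word, image, sizeof new_word);
    atomic_store_explicit(slot, new_word, memory_order_release);
    return 0;
}
```

As a usage example, a slot filled with eight `0x90` (NOP) bytes could have its first two bytes replaced by `0xEB 0x06` (a short jump) in one store, while a 2-byte patch at offset 7 is rejected because it would spill out of the slot. With several concurrent patchers, the load/store pair would need to become a CAS loop, as suggested above.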

jimdempseyatthecove · ‎04-19-2016

>>I understand that I can make the indirect call very cheap; but it still is not free

You will have to look at the total cost. Sketch of protected method:

volatile int Phase = 0; // location visible to both threads
// When Phase is odd, the critical code region is in use

int workerLastPhase = 0; // can be a register in the worker thread

void patch(Payload_t* payload)
{
  int oldPhase; // declared outside the loop so the release below can use it
  for(;;)
  {
     if((oldPhase = Phase) & 1) continue; // in use
     if(CAS(&Phase, oldPhase, oldPhase+1)) break; // claimed: Phase is now odd
  }
  // perform patch here
  Phase = oldPhase + 2; // release at the next even phase
  return;
}

// other thread
// when reaching the protected region
   ...
   for(;;)
   {
     if(CAS(&Phase, workerLastPhase, workerLastPhase + 1)) break;
     // CAS failed: a patch is in progress or has completed since the last entry
     FlushInstructionPipeline();
     workerLastPhase = (Phase + 1) & ~1; // next even phase to try
   }
   {
    ... the patched area here
   }
   Phase = workerLastPhase; // release the code region
   ...
The cost of the CAS (and related code) clearly outweighs the cost of the indirect function call.
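
For anyone who wants to experiment with the phase protocol, here is one way it might be written with C11 atomics. This is a sketch under assumptions: `cas`, `patch_begin`, `patch_end`, `worker_try_enter` and `worker_leave` are hypothetical names, the worker makes one attempt per call instead of spinning, and the pipeline flush is left as a comment because no such routine is defined in this thread.

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int phase = 0;   /* even: region free, odd: region in use */

/* Stand-in for the CAS() helper used in the sketch above. */
static bool cas(atomic_int *p, int expected, int desired)
{
    return atomic_compare_exchange_strong(p, &expected, desired);
}

/* Patcher: spin until the region is free, claim it, and return the even
 * phase value that was claimed. */
static int patch_begin(void)
{
    for (;;) {
        int old = atomic_load(&phase);
        if (old & 1) continue;                 /* region in use */
        if (cas(&phase, old, old + 1)) return old;
    }
}

/* Patcher: release the region at the next even phase. */
static void patch_end(int claimed) { atomic_store(&phase, claimed + 2); }

/* Worker: one attempt to enter. On failure, *last is advanced to the next
 * even phase so that the following attempt can succeed after the patch. */
static bool worker_try_enter(int *last)
{
    if (cas(&phase, *last, *last + 1)) return true;
    /* a real worker would serialize its instruction stream here */
    *last = (atomic_load(&phase) + 1) & ~1;
    return false;
}

/* Worker: release the region, restoring the even phase it entered with. */
static void worker_leave(int last) { atomic_store(&phase, last); }
```

Note that the worker executes at least one locked compare-exchange on every entry to the protected region, even when no patch ever happens, which is the cost being compared against the indirect call.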

Jim Dempsey