Software Tuning, Performance Optimization & Platform Monitoring
Discussion around monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform monitoring

Performance of self-modifying code / (single use) JIT



I have a function, which, when called, writes some code to a pre-allocated (with write+execute permissions) area of memory, then calls it.

The function is called many times. The JITed code is used only once per call, as it is overwritten on the next call.

I'm interested in the overheads/penalties of this, particularly the cost of calling freshly written code and if there are any general guidelines on how to do this optimally. I've been struggling to find much information on the subject unfortunately, hence this post.

In particular, I've noticed that the more writes performed, the slower the code performs. That is, something like:

uint8_t *ptr = code;
*ptr++ = 0xC3; // ret

is significantly faster than (ignoring the fact that there are more instructions)

uint8_t *ptr = code;
*ptr++ = 0xC3; // ret
*ptr++ = 0xC3; // junk instructions
*ptr++ = 0xC3;
*ptr++ = 0xC3;

The latter executes slower and generates more machine clear (SMC) events.

Because of this, I've found that it is often significantly faster to write the code to a temporary (non-executable) buffer and then memcpy it into executable memory than to write directly to executable memory (presumably because memcpy uses vector stores, resulting in fewer, wider writes).

I have observed this effect on Nehalem and Haswell processors, but not on Core2, Silvermont or AMD K10.

My understanding is that this behavior requires the L1D cache line containing the fresh instructions to be written back (to L2?), after which the L1I cache must reload it and the instruction pipeline must be refilled. That overhead should be on the order of hundreds of clock cycles, but with enough code written I sometimes measure more (~1000 cycles, though I don't really have accurate tools and am not sure what to use). Perhaps these newer CPUs perform some sort of pre-emptive flush as the memory is written? If so, is there any way to delay it until a decent portion of the code has been written?

As the code is written out sequentially, is there perhaps some way to get the L1I cache to load in only the fully written parts and not the unwritten parts? I'm also interested in hearing what solutions others may be using.

Thank you for reading!

1 Reply

This topic is covered in Intel's Optimization Reference Manual under the heading "self-modifying code": see coding rules 57 and 58 and Section 3.6.9. Volume 3A of Intel's Software Programming Guide addresses it in Section 8.1.3, "Handling Self- and Cross-Modifying Code."