Does Intel plan (or need, for that matter) to issue an update to the C++ compiler and/or the Profile Guided Optimization process to mitigate Meltdown and Spectre? I see that GCC and LLVM are releasing patches. If the answer is yes, can you:
Estimate when an update would be available?
Specify if a compiler option will be available for enablement/disablement?
Estimate how much performance loss might be expected as a result of said patches?
Describe in detail the difference in code emission? For example, will you be emitting lfence instructions prior to indirect branches, or using some other technique(s)?
Thank you for any advance notice you can provide. This will assist us in planning for release of said changes, should they be necessary.
IMHO the resolution of Meltdown and Spectre ought to be the responsibility of the software engineer designing a secure operating system. Mapping pages with secure data into non-privileged processes (as execute-only memory) is (apparently) a design flaw of the O/S writer and not that of the processor design. The exploits are definitely an oversight in the sense that a caution could have been issued to the O/S writers once the exploit was discovered. I do not feel (wish for) the CPU design to change, but rather that a formal Application Note be produced on how to properly secure sensitive data/code as a measure in the O/S.
Meltdown could be considered a design flaw in the OS, but it was really a team effort between the hardware and software. The Intel hardware does maintain zero-order protection across protection domains, but fails to prevent speculative execution from using protected data to make reliably detectable changes in the cache state. This provides a high-bandwidth covert channel for reading protected data. AMD was right to block this case before execution (since there is no case in which the memory access would be allowed to complete, allowing it to execute speculatively provides no benefits). Ironically, Intel's TSX extensions provide a large increase in the throughput of the Meltdown attack, as well as eliminating the exceptions that could be monitored by the OS. (I.e., the kernel interrupt handler could monitor for page faults caused by a user process attempting to load kernel pages and take action if this occurs too frequently.)
Spectre is a much lower-bandwidth covert channel (10kB/s vs 500kB/s for Meltdown), but it is not as easily blocked as Meltdown. The demonstrations referred to in the Spectre paper placed the attacking thread in the alternate HyperThread of the physical core where the target process was running. If HyperThreading is disabled, the attacking thread has to attempt to bias the branch predictor of the target process by running on the same physical core in alternating time slices. Although the authors of the Spectre paper say that they have shown that this can be used to cause branch mistraining on Skylake processors, they don't show that they were able to generate a working exploit in this mode. It is probably possible, but would likely have significantly more noise and significantly lower bandwidth than the attack based on using the alternate HyperThread on a core with a shared branch predictor. Using a shared branch predictor might have been necessary in the first processors that supported HyperThreading, but it is difficult to imagine that there is any value in keeping it shared in the era of multi-billion transistor processors. Clearing the branch prediction history on context switches (or saving and restoring it as an explicit part of the process context) would eliminate the primary tool used by Spectre. The Spectre paper concludes with a listing of other techniques that could be applied in conjunction with speculative execution to create data leaks, but without specific implementations and demonstrations, it is unclear whether these are serious potential threats or just random speculation....
Thank you for the insights. My concern arises from some of the code emissions going into GCC and LLVM. Some emit additional lfence instructions, others use pause, and of course there is the retpoline. In all these cases, the impact for software that makes heavy use of indirect branches is going to be measurable. We use PGO as we are very sensitive to performance, and we gained significant performance from the speculative branch prediction improvements that were made in Haswell and beyond. We specifically use the Intel C++ compiler because it emits better code for our application than Microsoft Visual Studio. We even put up with the horrific time it takes to do PGO in the Intel environment compared to the Microsoft environment. If the resulting executable were not measurably faster, we would not do so.
If Intel does decide to change code emission as a result, hopefully they will include a compiler switch to allow us to enable/disable the slower code emission as we see fit.
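For anyone following along who has not seen what an lfence-style mitigation looks like at the source level, here is a minimal sketch. It is only an illustration and not what GCC, LLVM, or the Intel compiler actually emit. It shows the bounds-check case (Spectre variant 1) rather than the indirect-branch case that retpoline targets, and the names `speculation_barrier` and `guarded_read` are made up for this post. It assumes an x86-64 target; on anything else the barrier compiles to nothing.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>  // _mm_lfence (SSE2 is baseline on x86-64)
#endif

// Hypothetical helper: issues LFENCE on x86-64, does nothing elsewhere.
inline void speculation_barrier() {
#if defined(__x86_64__) || defined(_M_X64)
    _mm_lfence();
#endif
}

// Hypothetical helper: returns table[index], or fallback if index is out of range.
inline std::uint8_t guarded_read(const std::uint8_t* table, std::size_t size,
                                 std::size_t index, std::uint8_t fallback) {
    if (index >= size) {
        return fallback;
    }
    // LFENCE does not let later instructions begin executing until the
    // earlier ones (including the comparison above) have completed, so the
    // load below should not run ahead of the bounds check.
    speculation_barrier();
    return table[index];
}
```

The cost is the point of the original question: every guarded access now waits for the comparison to resolve, so code that does this in a hot loop gives up much of what speculation would have gained.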
John>>but fails to prevent speculative execution from using protected data to make reliably detectable changes in the cache state.
Isn't the problem that the "protected" data is still mapped into the attacking process (e.g. as execute-only)? IOW the exploit could be thwarted if the page table entry (for the protected data) of the user process (when at user level) is invalidated or points to an innocuous page, and on transition to ring 0/1 (privileged mode) the page table entry is properly set, then reset on exit. This should cause the speculative code to fetch (as index) data from a faulting page, or the innocuous data, for use as index.
Andrew>>My concern arises from the some of the code emissions going into GCC and LLVM
There should be no changes to the compilers for this as the attacker would simply write the code in assembler. IOW the compiler should generate code that takes full advantage of any speculative execution capability of the processors.
The "fix" is NOT "Don't allow speculative execution", rather the "fix" is to code the O/S such that this exploit is not possible.
The Meltdown exploit requires that (1) the user process page tables include the kernel page tables, and (2) speculative memory accesses from user space to kernel pages are allowed to execute (returning data to the core and allowing dependent speculative instructions to execute using the *value* loaded from the kernel address).
This can be fixed by changing either (1) or (2). AMD prohibits (2), by refraining from executing loads speculatively if the code is currently operating in user space and the Page Table Entry for the target address has the kernel attribute set. Contrary to some comments, this is not hard to detect -- it requires comparing one bit of the current processor execution mode and one bit from the Page Table Entry that had to be present to perform the address translation and access checking for the memory reference. In this case, the AMD processor will refrain from executing the instruction until it becomes the oldest instruction in the queue -- i.e., all prior instructions have executed and committed, so the load is no longer speculative. At this point there are two choices: (a) allow the load to execute, then trigger the fault at commit, or (b) notice that the load will fault, so just trigger the fault without executing the load. Option (a) is a bad idea, since executing the load changes the cache state (which can be used to determine kernel addresses and undo the benefits of KASLR), but (depending on the details of the implementation) it might be able to prevent speculative execution that uses the results of the load. Option (b) is a much better approach -- since there is no case in which the load could succeed, there is no benefit in actually executing it.
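The one-bit comparison described above can be sketched in a few lines. This is a toy model of the decision only, not a description of any real AMD pipeline. The U/S flag is bit 2 of an x86 page table entry; the names `LoadAction` and `classify_load` are hypothetical.

```cpp
#include <cstdint>

// U/S flag of an x86 page table entry: 1 = user-mode access allowed.
constexpr std::uint64_t kPteUserBit = 1ull << 2;

enum class LoadAction {
    ExecuteSpeculatively,  // no privilege violation: speculate as usual
    WaitThenFault          // hold until oldest in the queue, then fault
                           // without ever issuing the memory access (option b)
};

// Compares one bit of processor mode with one bit of the PTE.
constexpr LoadAction classify_load(bool core_in_user_mode, std::uint64_t pte) {
    return (core_in_user_mode && (pte & kPteUserBit) == 0)
               ? LoadAction::WaitThenFault
               : LoadAction::ExecuteSpeculatively;
}
```

For example, `classify_load(true, 0x1)` (user mode, present supervisor page) yields `WaitThenFault`, while the same page accessed from kernel mode yields `ExecuteSpeculatively`.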
I think it is preferable to require the O/S programmer to change (1), as this can be issued as a software update, whereas (b), while also working, requires a hardware change. Reports indicate this is not correctable by re-flashing the flashable firmware in the CPU, and thus would require a CPU replacement rather than an O/S update/patch.
We have one central location for all communications regarding this. Please check https://newsroom.intel.com/ frequently for any latest update.
Intel Compiler Support group
>>We have one central location for all communications regarding this.
That is a one-way communication....and the topic gets lost in the flurry of other "news" postings.
From looking at the "indirect branch" exploit and the Intel 64 and IA-32 reference manual, together with other references to speculative code, it appears that one can have a block of memory, page-aligned and one or more pages in size, filled with some instruction like RET or another one-byte instruction. Then have a loop where all cache lines containing all possible RET/JMP's in that table are flushed. IOW the next read of an entry in the table will require a fetch from RAM (slow access). Then the exploit issues a series of instructions such as:
1) produce a result that is known to be zero but CPU/compiler cannot determine in advance.
2) execute a JZ which always succeeds.
3) followed by a JMP "near relative" word/dword ptr ProbeAddress speculatively executed... but never executed.
Where the JMP instruction address immediately precedes the buffer. The speculative execution would load the instruction pipeline from the buffer (RIP of JMP) + DEST offset fetched (note as execute-only). Then the exploit would probe the buffer to see if a specific cache line is in cache or not. A cache hit indicates that the offset into the buffer was the contents of memory at the location specified by the speculative JMP, i.e., it would know the content of the offset represented by the DEST (held in the protected page).
This is an interesting exploit... *** However, this requires that the protected page be mapped into the user process. While the user process cannot directly read the memory, the speculative execute read can (and then speculatively reads the instruction in the user's buffer).
Should the O/S invalidate the protected page .OR. if that virtual memory page is remapped to an innocuous page (e.g. full of 0's), then this exploit would not be possible. This "firewall" would come at the expense of issuing a CPU serialization instruction (that assures page table entry becomes effective), upon return from an O/S call that re-maps the protected page to the VM, performs password/other code, unmaps page, flush page table entry. IMHO it is the responsibility of the O/S writer to thwart this exploit, and not the responsibility of the CPU architecture to fix bad programming practices. Note, the speculative read is a good optimization feature and should not be removed due to bad programming practices made by code in the O/S.
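To make the map/unmap argument concrete, here is a deliberately tiny toy model. It is not O/S code: the structures and function names are invented for illustration, and it leaves out the TLB flush and serialization that a real implementation would need. It models a core that checks permissions only at retirement, so any mapped page can feed a value to speculative instructions, and shows that unmapping on exit from privileged mode leaves nothing to forward.

```cpp
#include <cstdint>

// Toy page table entry; all field names are hypothetical.
struct ToyPte {
    bool present;           // is the page mapped at all?
    bool user_accessible;   // U/S permission
    std::uint8_t contents;  // stands in for the data held in the page
};

struct SpeculativeLoad {
    bool value_forwarded;   // did dependent instructions see a value?
    std::uint8_t value;
};

// Core that defers the permission check to retirement: a mapped page
// yields its data speculatively regardless of user_accessible.
constexpr SpeculativeLoad speculative_user_load(ToyPte pte) {
    return pte.present ? SpeculativeLoad{true, pte.contents}
                       : SpeculativeLoad{false, 0};
}

// O/S maps the protected page on entry to privileged mode...
constexpr ToyPte map_on_kernel_entry(ToyPte pte) {
    pte.present = true;
    return pte;
}

// ...and unmaps it again before returning to user mode.
constexpr ToyPte unmap_on_kernel_exit(ToyPte pte) {
    pte.present = false;
    return pte;
}
```

In this model a user-mode speculative load of a page left in the mapped state forwards the protected byte, while the same load after `unmap_on_kernel_exit` forwards nothing.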
Jim Dempsey wrote:
Note, the speculative read is a good optimization feature and should not be removed due to bad programming practices made by code in the O/S.
The speculative read is a good optimization feature -- unless it can be proven that it can never be successful! This is the benefit of the AMD implementation. The core knows what mode it is operating in (user or kernel), and the Page Table Entry has a bit that says that user-mode access is not allowed. Since the core needs to read the Page Table Entry to get the Physical Address, it is guaranteed that this information is available to the core before the L1 Data Cache tags can be queried. I would say that there is not much excuse for speculatively executing the read in this case.
Aside: At a very low level, the process of querying the L1 Data Cache tags can begin before the Page Table Entry is available. The congruence-class (or "index") of the L1 Data Cache is determined by bits 11:6 of the virtual address. A small portion of the cache access cycle consists of driving some signals to this part of the cache in preparation for the address compare operation. It might not be desirable to delay this part of the access until the TLB returns the Page Table Entry information. BUT, since the cache is physically tagged, it is not possible to begin the actual address compare until after the Page Table Entry has been loaded. At that point the core can determine that the load cannot "succeed", so the L1 Data Cache query can be aborted. Assuming that the cache access is started before the TLB lookup is complete, we know that the L1 Data Cache query can be aborted, since that has to happen every time there is a TLB miss.
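The reason the index lookup can start before translation finishes is simple arithmetic, sketched below under the assumption of 64-byte lines and 4 KiB pages. Bits 11:6 sit entirely inside the 12-bit page offset, so they are identical in the virtual and physical address. The function name `l1d_set_index` is made up for this post.

```cpp
#include <cstdint>

constexpr unsigned kLineOffsetBits = 6;   // 64-byte cache lines: bits 5:0
constexpr unsigned kIndexBits = 6;        // bits 11:6 select one of 64 sets
constexpr unsigned kPageOffsetBits = 12;  // 4 KiB pages: bits 11:0 untranslated

// The index bits must not extend past the page offset, otherwise the
// lookup could not begin before the TLB has translated the address.
static_assert(kLineOffsetBits + kIndexBits <= kPageOffsetBits,
              "index bits must lie inside the page offset");

// Extracts bits 11:6 of an address (virtual or physical).
constexpr std::uint64_t l1d_set_index(std::uint64_t address) {
    return (address >> kLineOffsetBits) & ((1ull << kIndexBits) - 1);
}
```

Any virtual address and the physical address it translates to share their low 12 bits, so both give the same set index; only the tag comparison has to wait for the Page Table Entry.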
The Page Table Entry has other bits that are used for access checks (Table 4-19 of Volume 3 of the SWDM). Comparing these bits with the description of the Page Fault Exception in Section 6.15 of the same volume suggests some additional cases in which the speculative execution of a load should always be prevented:
- If the P (Present) bit is clear, either a page table or the page itself is not present in memory. It would be a really bad idea (TM) to use the physical address bits in the PTE to perform the read speculatively.
- The U/S (User/Supervisor) bit has been discussed above.
  - Blocking speculative execution of user-mode accesses to supervisor pages will also help with the ugly problem that many (most?) architectures have with the "Accessed" bit in the Page Table Entry. This is typically set by speculatively executed instructions, but not properly reset if those instructions are subsequently flushed.
- If the XD (eXecute Disable) function is enabled and this bit is set, instruction fetches are not allowed from this page. Since the architecture guarantees that this cannot "succeed", there is no benefit in executing the fetch speculatively.
- If the R/W bit is clear, writes are not allowed to this page. Speculative writes are severely limited in any case, but this bit could be used by an implementation to prevent even the attempt at speculative execution.
  - This would also help prevent the ugly problem that most architectures have with the D (Dirty) bit of the Page Table Entry, which in many systems will be set by speculative writes (and not reset if the speculative write is flushed).
  - If Read access is allowed for this page (but not Write access), then preventing the speculative write won't stop the user from loading the cache line, but it will prevent the user from speculatively performing certain cache transactions. Even if we don't currently know of an exploit that could make use of this, the principle of least surprise suggests that if the transaction is not going to be allowed to complete, it should not be allowed to execute speculatively either.
- If Protection Keys are supported (CR4.PKE=1), bits 62:59 of the Page Table Entry specify a Protection Key that must be compared to bits in a processor register to determine access rights.
  - This requires a lot more bits to be looked at before deciding if the memory access should be allowed to proceed -- the PKRU register is 32 bits wide (1 "access" bit plus 1 "write" bit for each of the 16 possible values of the Protection Key field).
  - Pulling this many bits into the path to decide whether to speculatively execute a memory access is likely to be a challenge, but if Meltdown and Spectre taught us anything, it is that protection is not reliable if it does not also prevent speculative accesses.
- A page fault exception will also be caused if any of the "reserved" bits in the page table entry are set.
  - Again, even without evidence of an exploit, there is no benefit to speculatively executing a memory access that is architecturally guaranteed to be unable to complete.
- Issues similar to the "Protection Keys" are probably also relevant to SGX, but I have not studied that set of architectural extensions.
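The checks in the list above can be collected into one predicate. This is a simplified sketch, not a complete model of x86 paging: it looks only at the final page table entry and ignores reserved bits, upper-level paging structures, CR0.WP, SMEP/SMAP, and SGX. The bit positions are the architectural ones (P = bit 0, R/W = bit 1, U/S = bit 2, protection key = bits 62:59, XD = bit 63, and in PKRU the access-disable bit for key i is bit 2i and the write-disable bit is bit 2i+1). The type and function names are hypothetical.

```cpp
#include <cstdint>

constexpr std::uint64_t kPresent = 1ull << 0;          // P
constexpr std::uint64_t kWritable = 1ull << 1;         // R/W
constexpr std::uint64_t kUser = 1ull << 2;             // U/S
constexpr std::uint64_t kExecuteDisable = 1ull << 63;  // XD

enum class Access { Read, Write, Fetch };

struct CoreState {
    bool user_mode;      // current privilege level is user
    bool nxe_enabled;    // XD function enabled
    bool pke_enabled;    // CR4.PKE
    std::uint32_t pkru;  // 16 keys x (access-disable, write-disable)
};

// True if the access cannot complete, so speculating on it has no benefit.
constexpr bool guaranteed_to_fault(CoreState core, std::uint64_t pte, Access access) {
    if ((pte & kPresent) == 0) return true;
    if (core.user_mode && (pte & kUser) == 0) return true;
    // Supervisor writes to read-only pages depend on CR0.WP (not modeled).
    if (access == Access::Write && core.user_mode && (pte & kWritable) == 0) return true;
    if (access == Access::Fetch && core.nxe_enabled && (pte & kExecuteDisable) != 0) return true;
    // Protection keys cover data accesses to user pages, not instruction fetches.
    if (core.pke_enabled && access != Access::Fetch && (pte & kUser) != 0) {
        const unsigned key = static_cast<unsigned>((pte >> 59) & 0xF);
        const bool access_disabled = ((core.pkru >> (2 * key)) & 1u) != 0;
        const bool write_disabled = ((core.pkru >> (2 * key + 1)) & 1u) != 0;
        if (access_disabled) return true;
        if (access == Access::Write && core.user_mode && write_disabled) return true;
    }
    return false;
}
```

Even in this reduced form, the protection-key branch shows why that case is the awkward one: it needs a 4-bit field from the entry plus a 2-bit lookup in a 32-bit register, where every other check is a single bit.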
In the XD case, and possibly the U/S case, when executing: JMP "near relative" word/dword ptr ProbeAddress
The execution occurs at the JMP instruction address, which has Execute Enable; the "word/dword ptr ProbeAddress", which contains the relative RIP offset (in this case) and resides in XD memory (which has the protected data), is potentially not considered part of the instruction. Its contents are used to generate the destination address (it is similar to a defer state in older machines). I am asking if this is the problem? IOW this instruction has the peculiar characteristic that the instruction bytes are in one place, and the offset is fetched .NOT. as part of a read instruction. i.e. this is a "neither here nor there" type of situation.
Note, the above need not be in a speculatively executed section of code, but actually executed. The code following the RIP would have to be one byte instructions such as INC AX (e.g. 64K of these). Assuming AX were first zeroed the code word would be -AX.
The JMP instruction causes an instruction fetch from a target (virtual) address. This address has to be translated by the TLB, at which point the XD bit is visible.
I don't know the specifics of any processor implementations for this case, but the Meltdown & Spectre flaws are based on a model in which (statically) prohibited operations are allowed to execute (including making memory references), but are "tagged" with the Page Fault condition because of a permission violation. Subsequent instructions can execute speculatively because the Page Fault exception is not actually raised until the offending instruction attempts to retire. If the Intel implementations treat XD violations the same way they treat U/S violations, then the branch would be allowed to execute, but it would be "tagged" with the page fault condition, which would be raised at retirement. This would prevent the branch from being retired and would prevent the instructions after the branch from being retired, but may allow those instructions to be executed speculatively, leading to potentially observable changes in the memory hierarchy.
The AMD note at https://lkml.org/lkml/2017/12/27/2 says that the AMD processors refrain from speculatively executing memory references when those are prohibited by privilege (user/supervisor) -- the note does not make any statements about speculative execution of any of the other cases that would generate a page fault. I have no personal knowledge of this topic from my time at AMD....
A better model is to notice the violation and refrain from executing the prohibited operation --- in this case an instruction fetch. If the JMP instruction is being issued speculatively (i.e., if any instructions before the JMP instruction have not yet retired), then the instruction fetch should not be executed. If the JMP instruction is not being issued speculatively (i.e., if all instructions before the jump have been retired), then the instruction fetch should also not be executed, but the Page Fault interrupt should be signaled in lieu of executing the instruction.
>>The JMP instruction causes an instruction fetch from a target (virtual) address.
The JMP instruction, in this case, is performing a data fetch from a target address, which is subsequently added to the RIP. Upon completion, the new instruction fetch comes from RIP+(value fetched from the specified ("protected") memory).
My supposition is that there is a lack of definition as to if the memory fetch at the address in the protected memory is part of the instruction sequence .OR. is to be interpreted as if it were a normal memory data read.
JMP "near relative" word/dword ptr ProbeAddress
could be considered equivalent to:
ADD word/dword ptr ProbeAddress, RIP
In the former case (JMP) there is an ambiguity as to if the content at ProbeAddress is "instruction" or "data"
In the latter case (ADD) there is no ambiguity, it is "data"
Should the protection system work such as to tag the instruction for page fault, then the instruction pipeline might fetch into the cache the cache line containing RIP+content at word/dword ptr ProbeAddress. This would require the exploit to use speculative execution and then check which cache line following the JMP is in cache. However, should there be an ambiguity that slips through the protection design, then an exploit may run directly (without speculative execution), and thus be worse (faster exploit).
I haven't set up a test case to see if the ambiguity exists or not.
Note, both exploits (speculative and direct) require the page to be present (albeit with protection bits set in the descriptor). Neither exploit would work if the appropriate page table entries are marked as not present or point elsewhere (and the instruction pipeline is serialized). As to why O/S designers (apparently) do not map these pages on entry to supervisory mode and unmap on exit, I cannot say. Possibly for performance reasons. But please note, a vast majority of O/S calls need not have access to specific protected data. Thus, the map/unmap could be restricted to only those O/S calls needing access to this data.
IOW - no hardware change required. Patch the O/S.
Sorry I misunderstood which version of the JMP instruction you were talking about -- I was assuming opcode E9, which uses an immediate operand for the RIP-relative offset.
Opcode FF is a JMP instruction with a register specifying the (memory) address of a 64-bit displacement value. This requires multiple micro-ops: one for loading the offset, and one for fetching the branch target.
- The first micro-op is just data -- if the address is readable, then it will be read, and the RIP-relative displacement will be computed. Because this is just a read, the R/W and XD bits don't matter, but the read should not be executed (i.e., no memory access should occur) if there is a U/S violation.
- If the XD bit is set, but there is not a U/S violation, this is not a security concern, since the address can be read directly.
- Speculatively executing the fetch in the presence of a U/S violation is exactly the same security mistake that Intel processors allow with ordinary reads. It is not a new problem -- it is the same problem packed into a different instruction.
- The second micro-op is the branch. Once the branch target address is computed, *that* address needs to be fetched, and it should be handled as I described above. For this second micro-op, the memory access should not be allowed to occur if the U/S or XD bits indicate a violation.
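The two-micro-op view above boils down to two different permission checks. The following sketch only restates the rules given in this post and says nothing about how any real decoder splits the instruction; the names are invented for illustration.

```cpp
// Permissions of the page a micro-op touches; field names are hypothetical.
struct PagePerms {
    bool user_accessible;   // U/S allows user-mode access
    bool execute_disabled;  // XD set (with the XD function enabled)
};

// Micro-op 1: data load of the displacement. It is a plain read, so only a
// U/S violation should keep it from executing; R/W and XD do not matter.
constexpr bool displacement_load_may_execute(bool user_mode, PagePerms page) {
    return !(user_mode && !page.user_accessible);
}

// Micro-op 2: instruction fetch at the computed target. Either a U/S or an
// XD violation should keep the memory access from occurring.
constexpr bool target_fetch_may_execute(bool user_mode, PagePerms page) {
    return !(user_mode && !page.user_accessible) && !page.execute_disabled;
}
```

A user-readable page with XD set therefore passes the first check and fails the second, which matches the point that XD alone is not a secrecy concern for the displacement load, since that address can be read directly.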
IPSXE 2018 Update 2, which contains Intel compiler 18.0.2, was released last week. This version of the Intel compiler provides a fix for Spectre variant 2 only. Details are here: https://software.intel.com/en-us/articles/using-intel-compilers-to-mitigate-speculative-execution-si...