Trouble handling host code programming errors via exceptions

Yogi_D_ · ‎05-04-2013

Hi.

I am writing a small OS-agnostic hypervisor as a teaching tool for my students. The hypervisor is loaded at boot by code I embed in a custom MBR on the boot device. It switches to 32-bit protected mode and then to IA32e mode (64-bit mode, paged, with an identity mapping of linear to physical addresses). It then sets up the 64-bit exception handling mechanism, and tests of this mechanism are successful (CPL and DPL are 0, so no stack switching is expected). For example, divide-by-zero and page faults are handled as expected.

Next, an IA32e mode guest is launched. The guest has its own page tables (these are not identity mapped). The guest handles exceptions and interrupts by itself (i.e., it has a different IDT than the host, and the exception bitmap control is set to 0). All of this is working: external interrupts, exceptions, memory accesses, and access to I/O devices all work well in the guest. The guest exits to the host because of various conditions and is resumed correctly.

The issue occurs when I try to catch programming mistakes in the VM exit handler (host code). For example, divide-by-zero, invalid opcode, and page fault exceptions all result in the CPU locking up. The host essentially has the same IDT settings as before the launch of the guest, but clearly something is getting screwed up. Any thoughts as to what I should be looking at in particular to help solve this issue?

For the host, I am setting the TR selector, IDTR base, and TR base in the VMCS to the same values they have before the VM launch is executed. Because the host is running with a CPL of 0 and the handler code's CS has a DPL of 0, I am not expecting a stack switch. Therefore, I am not specifying any stacks in the 64-bit TSS (all entries in the host's TSS are 0 except for the I/O bitmap offset).
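To make the "no stack switch" setup concrete, here is a minimal sketch in C of a 64-bit interrupt gate with its IST field left at 0 and a TSS that is zeroed except for the I/O bitmap offset. The struct and function names (idt_gate64, tss64, make_interrupt_gate, make_empty_tss) are hypothetical and not taken from my hypervisor. Setting the I/O bitmap offset to the size of the TSS is an assumption (a common way to say "no bitmap"), and the packed attribute assumes GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 64-bit IDT gate descriptor (16 bytes). Names are illustrative. */
struct idt_gate64 {
    uint16_t offset_low;   /* handler address bits 15:0 */
    uint16_t selector;     /* code segment selector, looked up in the GDT */
    uint8_t  ist;          /* bits 2:0 = IST index; 0 means no IST stack switch */
    uint8_t  type_attr;    /* P | DPL | 0 | gate type */
    uint16_t offset_mid;   /* handler address bits 31:16 */
    uint32_t offset_high;  /* handler address bits 63:32 */
    uint32_t reserved;
} __attribute__((packed));

/* 64-bit TSS (104 bytes). */
struct tss64 {
    uint32_t reserved0;
    uint64_t rsp[3];       /* RSP0..RSP2, used only on a privilege change */
    uint64_t reserved1;
    uint64_t ist[7];       /* IST1..IST7, used only if a gate names one */
    uint64_t reserved2;
    uint16_t reserved3;
    uint16_t iomap_base;   /* I/O permission bitmap offset */
} __attribute__((packed));

static_assert(sizeof(struct idt_gate64) == 16, "gate must be 16 bytes");
static_assert(sizeof(struct tss64) == 104, "TSS must be 104 bytes");

/* Build a present, DPL 0, 64-bit interrupt gate (type 0xE) with IST = 0. */
struct idt_gate64 make_interrupt_gate(uint64_t handler, uint16_t cs_selector)
{
    struct idt_gate64 g;
    memset(&g, 0, sizeof g);
    g.offset_low  = (uint16_t)(handler & 0xFFFFu);
    g.offset_mid  = (uint16_t)((handler >> 16) & 0xFFFFu);
    g.offset_high = (uint32_t)(handler >> 32);
    g.selector    = cs_selector;
    g.ist         = 0;      /* stay on the current stack */
    g.type_attr   = 0x8E;   /* P=1, DPL=0, type=0xE (interrupt gate) */
    return g;
}

/* All stack pointers zero; assumption: iomap_base = sizeof(TSS), i.e. no bitmap. */
struct tss64 make_empty_tss(void)
{
    struct tss64 t;
    memset(&t, 0, sizeof t);
    t.iomap_base = (uint16_t)sizeof t;
    return t;
}
```

Note that even with IST = 0 and no privilege change, the gate still carries a code segment selector, so delivering the exception makes the CPU read a descriptor from the GDT.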

Thanks,
-Yogi

Yogi_D_ · ‎05-06-2013

I figured out the issue myself this weekend and learned something very interesting in the process.

Because exception handling was working correctly just before the VM was launched, I assumed the IDT was set up correctly and that the host's IDTR or TSS base was screwed up in the VMCS. However, double-checking the IDTR base, IDT entries, TSS base, and TSS entries did not turn up anything wrong. Then, on a hunch, I checked the GDTR base, and that was incorrect! I never thought of checking that value because the host code was working correctly!
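In hindsight, a check along these lines would have caught it right away. This is only a sketch: the struct and function names are hypothetical, and the values are passed in as plain numbers so it can run anywhere. In the hypervisor itself, the "live" values would come from SGDT, SIDT, and the TSS descriptor, and the VMCS values from VMREAD of the host-state fields listed in the enum (encodings as given in the Intel SDM's VMCS field encoding appendix).

```c
#include <assert.h>
#include <stdint.h>

/* VMCS host-state field encodings (natural-width fields). */
enum {
    VMCS_HOST_TR_BASE   = 0x6C0A,
    VMCS_HOST_GDTR_BASE = 0x6C0C,
    VMCS_HOST_IDTR_BASE = 0x6C0E
};

/* Hypothetical container for the three table bases being compared. */
struct host_tables {
    uint64_t gdtr_base;
    uint64_t idtr_base;
    uint64_t tr_base;
};

/* Bit flags reported by host_state_mismatches(). */
enum {
    MISMATCH_GDTR = 1u << 0,
    MISMATCH_IDTR = 1u << 1,
    MISMATCH_TR   = 1u << 2
};

/*
 * Compare the values the CPU is using right now (live) with the values
 * stored in the VMCS host-state area (vmcs). Returns 0 if they all agree,
 * otherwise a bitmask of MISMATCH_* flags.
 */
unsigned host_state_mismatches(const struct host_tables *live,
                               const struct host_tables *vmcs)
{
    unsigned flags = 0;
    if (live->gdtr_base != vmcs->gdtr_base) flags |= MISMATCH_GDTR;
    if (live->idtr_base != vmcs->idtr_base) flags |= MISMATCH_IDTR;
    if (live->tr_base   != vmcs->tr_base)   flags |= MISMATCH_TR;
    return flags;
}
```

Running a check like this just before the VM launch covers every host table base at once instead of only the ones that look suspicious.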

Well, it seems that my host code never needed to load a segment selector after a VM exit, so the CPU never needed to reference the GDT. The incorrect GDTR base value was therefore never used and did not cause an issue. However, because exception handling requires the handler's CS and code offset to be loaded, the CPU did need to reference the GDT, which was not at the specified location. This caused another exception, resulting in a hung CPU.
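To illustrate why the wrong GDTR base only matters once a selector is loaded, here is a simplified model in C of how the address of a segment descriptor is derived from a selector. It is a sketch, not a full model of the CPU: the names (gdtr_model, gdt_descriptor_addr) are made up for this post, LDT selectors are simply rejected, and none of the descriptor's contents are checked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified view of the GDTR: base address and limit (last valid byte offset). */
struct gdtr_model {
    uint64_t base;
    uint16_t limit;
};

/*
 * Compute where the CPU would read the 8-byte descriptor for `selector`.
 * Returns false for a null selector, an LDT selector (TI = 1, not modeled),
 * or a descriptor that does not fit within the GDT limit.
 */
bool gdt_descriptor_addr(const struct gdtr_model *g, uint16_t selector,
                         uint64_t *addr)
{
    uint32_t byte_offset = selector & 0xFFF8u;  /* index * 8, RPL and TI stripped */

    if ((selector & 0xFFFCu) == 0)
        return false;                           /* null selector */
    if (selector & 0x4u)
        return false;                           /* TI = 1: LDT, not modeled */
    if (byte_offset + 7u > g->limit)
        return false;                           /* descriptor beyond the limit */

    *addr = g->base + byte_offset;
    return true;
}
```

With the same selector, say 0x08 from an IDT gate, a GDTR base of 0x1000 gives a descriptor address of 0x1008, while a bad base of 0x9000 gives 0x9008. Whatever bytes happen to be at that second address get interpreted as the code segment descriptor, which is where the follow-on exception comes from.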