Failed to enter protected mode in nested virtualization system

Tao_W_

I am testing VMX by building a Linux kernel module that acts as a hypervisor.

The kernel module is loaded in a Linux VM that runs under VMware Workstation.

To test the hypervisor (the kernel module), I built a guest VM from the xv6 bootasm.S code (http://pages.cs.wisc.edu/~skobov/cs537/P3/xv6/kernel/bootasm.S) and loaded it into the hypervisor.

Once the VM is loaded, it runs well at first, but it fails to long jump to start32.
I dumped the VM's GDT; the entries are as follows:
entry 0: 00000000000
entry 1: 00cf9a00,0000ffff
entry 2: 00cf9200,0000ffff
They look good.
I suspect the ljmp instruction may not be executed by the real-mode VM.
The VM's CR0 is 0x60000011, CR4 is 0x2000, RFLAGS is 0x3006.
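
As a cross-check, the two non-null GDT entries listed above can be decoded field by field. This is a minimal sketch, not part of the hypervisor: the function name decode_descriptor is made up for illustration, and it assumes the dump prints the high 32 bits of each descriptor first and the low 32 bits second.

```python
def decode_descriptor(high, low):
    """Split an 8-byte x86 segment descriptor, given as two 32-bit halves."""
    base = ((low >> 16) & 0xFFFF) | ((high & 0xFF) << 16) | (high & 0xFF000000)
    limit = (low & 0xFFFF) | (high & 0x000F0000)
    access = (high >> 8) & 0xFF      # P, DPL, S, type
    flags = (high >> 20) & 0xF       # G, D/B, L, AVL
    granular = bool(flags & 0x8)
    # With G=1 the limit counts 4 KiB pages, so the last valid offset
    # is (limit << 12) | 0xFFF.
    byte_limit = (limit << 12) | 0xFFF if granular else limit
    return {
        "base": base,
        "byte_limit": byte_limit,
        "present": bool(access & 0x80),
        "dpl": (access >> 5) & 0x3,
        "code": bool(access & 0x08),
        "default_32bit": bool(flags & 0x4),
    }

# Values copied from the GDT dump (assumed order: high dword, low dword).
for name, high, low in [("entry 1", 0x00CF9A00, 0x0000FFFF),
                        ("entry 2", 0x00CF9200, 0x0000FFFF)]:
    print(name, decode_descriptor(high, low))
```

Under that assumption both entries decode to present, ring-0, 32-bit segments with base 0 and a 4 GiB limit, entry 1 as code and entry 2 as data, which agrees with the observation that the GDT contents look good.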
 

This problem happens in a nested virtualization environment.

I tested the same code on a Linux system running on bare metal, and it works well there.

I don't know what is missing in my code.

Thanks,

-Tao

6 Replies
Quoc-Thai_L_Intel

Since you are running under VMware, could this be an issue with the VMware software? You might want to check with them on this issue too. I have also forwarded your question to my peers for input.

-Thai

Quoc-Thai_L_Intel

Hello, I got some feedback from a peer, who asked whether you could try the following.

Replace

ljmp    $(SEG_KCODE<<3), $start32

with

  .byte   0x66, 0xea
  .long   start32
  .word   (SEG_KCODE<<3)
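
For reference, the bytes that the hand-written form emits can be reproduced with a few lines of Python. This is only an illustrative sketch: encode_far_jump32 is a made-up helper name, and 0x7C38 is a placeholder address for start32, not a value taken from the actual build.

```python
import struct

SEG_KCODE = 1  # GDT index of the kernel code segment, as in the xv6 sources

def encode_far_jump32(offset, selector):
    """Bytes of a far jump with a 32-bit offset, as emitted from 16-bit code."""
    # 0x66 is the operand-size prefix and 0xEA is JMP ptr16:32.
    # The 32-bit offset is stored first, followed by the 16-bit selector.
    return bytes([0x66, 0xEA]) + struct.pack("<IH", offset, selector)

# 0x7C38 is a placeholder for the address of start32.
print(encode_far_jump32(0x7C38, SEG_KCODE << 3).hex())
```

The point of the hand-encoded form is that it always carries a 32-bit offset. With .code16 in effect, the assembler may choose the shorter 16-bit-offset encoding for a plain ljmp, so the two forms are not necessarily byte-identical.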

 

Regards,

-Thai

Tao_W_

Thai,

Thank you very much for your reply.

I changed the code as you suggested, but it still failed to jump to start32.

Here is the Makefile,

all: bootblock
OBJDUMP=objdump
OBJCOPY=objcopy

CFLAGS = -fno-pic -static -fno-builtin -fno-strict-aliasing -Wall -MD -ggdb -m32 -Werror -fno-omit-frame-pointer
CFLAGS += $(shell $(CC) -fno-stack-protector -E -x c /dev/null >/dev/null 2>&1 && echo -fno-stack-protector)
ASFLAGS = -m32 -gdwarf-2 -Wa,-divide
# FreeBSD ld wants ``elf_i386_fbsd''
LDFLAGS += -m $(shell $(LD) -V | grep elf_i386 2>/dev/null)

bootblock: bootasm.S bootmain.c
        $(CC) $(CFLAGS) -fno-pic -nostdinc -I. -c bootasm.S
        $(CC) $(CFLAGS) -fno-pic -nostdinc -I. -c bootmain.c
        $(LD) $(LDFLAGS) -N -e start -Ttext 0x7C00 -o bootblock.o bootasm.o bootmain.o
        $(OBJDUMP) -S bootblock.o > bootblock.asm
        $(OBJCOPY) -S -O binary -j .text bootblock.o bootblock.bin

clean:
        rm *.o
        rm *.bin

 

The gcc version is:

t@ubuntu:~/test/vmxx/kermod/linuxvmxx/toyvmm/vm$ cc -v
Using built-in specs.
COLLECT_GCC=cc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/5/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 5.4.0-6ubuntu1~16.04.9' --with-bugurl=file:///usr/share/doc/gcc-5/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-5 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-5-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-5-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-5-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.9)

Thanks,

-Tao

Tao_W_

Here is more information for your reference.

The VMCS for the VM:

 0x0000003F = control_VMX_pin_based
 0xA501E1F2 = control_VMX_cpu_based
 0x00000082 = control_VMX_proc2_based
 0x00000000 = control_exception_bitmap
 0x00000000 = control_pagefault_errorcode_mask
 0xFFFFFFFF = control_pagefault_errorcode_match
 0x00000000 = control_CR3_target_count
 0x00036FFB = control_VM_exit_controls
 0x000011FB = control_VM_entry_controls
 0x00000000 = control_VM_entry_interruption_information
 0x00000000 = control_VM_entry_exception_errorcode
 0x00000000 = control_VM_entry_instruction_length

 0xFFFFFFFFFFFFFFF7 = control_CR0_mask
 0xFFFFFFFFFFFFF871 = control_CR4_mask
 0x0000000060000010 = control_CR0_shadow
 0x0000000000000000 = control_CR4_shadow
 0x0000000000000000 = control_CR3_target0
 0x00000000B2E98000 = control_CR3_target1
 0x0000000000000000 = control_CR3_target2
 0x0000000000000000 = control_CR3_target3

The VMX MSRs in the Linux VM running on VMware ESXi:

 VMX-Capability Model-Specific Registers

     00D8100000000001 = IA32_VMX_BASIC_MSR
     0000003F00000016 = IA32_VMX_PINBASED_CTLS_MSR
     FFF9FFFE0401E172 = IA32_VMX_PROCBASED_CTLS_MSR
     003FFFFF00036DFF = IA32_VMX_EXIT_CTLS_MSR
     0000F3FF000011FF = IA32_VMX_ENTRY_CTLS_MSR
     00000000000401E0 = IA32_VMX_MISC_MSR
     0000000080000021 = IA32_VMX_CR0_FIXED0_MSR
     00000000FFFFFFFF = IA32_VMX_CR0_FIXED1_MSR
     0000000000002000 = IA32_VMX_CR4_FIXED0_MSR
     00000000001727FF = IA32_VMX_CR4_FIXED1_MSR
     000000000000005A = IA32_VMX_VMCS_ENUM_MSR
     000038FE00000000 = IA32_VMX_PROCBASED_CTLS2
     00000F0106114141 = IA32_VMX_EPT_VPID_CAP
     0000003F00000016 = IA32_VMX_TRUE_PINBASED_CTLS
     FFF9FFFE04006172 = IA32_VMX_TRUE_PROCBASED_CTLS
     003FFFFF00036DFB = IA32_VMX_TRUE_EXIT_CTLS
     0000F3FF000011FB = IA32_VMX_TRUE_ENTRY_CTLS


 original_CR0=80050033  PG=1 CD=0 NW=0 AM=1 WP=1 NE=1 ET=1 TS=0 EM=0 MP=1 PE=1
 original_CR4=001406E0  VMXE=0 PGE=1 MCE=1 PAE=1 PSE=0 DE=0 TSD=0 PVI=0 VME=0
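
The flag names printed next to original_CR0 can be reproduced with a small decoder, which is also handy for the other CR0 values quoted in this thread (0x60000011 and 0x31). This is a minimal sketch; cr0_flags is a made-up helper name, and only the architecturally defined CR0 bits are listed.

```python
# Architecturally defined CR0 flag bits.
CR0_BITS = {0: "PE", 1: "MP", 2: "EM", 3: "TS", 4: "ET", 5: "NE",
            16: "WP", 18: "AM", 29: "NW", 30: "CD", 31: "PG"}

def cr0_flags(value):
    """Return the names of the CR0 flags set in value, lowest bit first."""
    return [name for bit, name in sorted(CR0_BITS.items()) if (value >> bit) & 1]

print(cr0_flags(0x80050033))  # original_CR0 from the dump
print(cr0_flags(0x60000011))  # CR0 value the guest tries to load
print(cr0_flags(0x00000031))  # guest CR0 field in the VMCS
```

For 0x80050033 the decoder reports PE, MP, ET, NE, WP, AM and PG, matching the annotation in the dump. The value 0x60000011 decodes to PE, ET, NW and CD, and 0x31 decodes to PE, ET and NE.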

The guest state just before the ljmp:

 VMX Guest State

 CR0=0000000000000031  CR3=0000000000000000  CR4=0000000000002050

 RSP=00000000000017FA  SYSENTER_ESP=0000000000000000
 RIP=000000000000182E  SYSENTER_EIP=0000000000000000
 DR7=0000000000000400  SYSENTER_CS=00000000  RFLAGS=0000000000000006

   ES=0000  [ base=0000000000000000 limit=0000FFFF rights=00000093 ]
   CS=0000  [ base=0000000000000000 limit=0000FFFF rights=0000009B ]
   SS=0000  [ base=0000000000000000 limit=0000FFFF rights=00000093 ]
   DS=0000  [ base=0000000000000000 limit=0000FFFF rights=00000093 ]
   FS=0000  [ base=0000000000000000 limit=0000FFFF rights=00000093 ]
   GS=0000  [ base=0000000000000000 limit=0000FFFF rights=00000093 ]
 LDTR=0000  [ base=0000000000000000 limit=0000FFFF rights=00000082 ]
   TR=0000  [ base=0000000000000000 limit=0000FFFF rights=0000008B ]
      GDTR  [ base=0000000000000000 limit=00000000 ]
      IDTR  [ base=0000000000000000 limit=0000FFFF ]

 EAX=60000011  ECX=00000000  ESI=00000000  ESP=000017FA   extints=0
 EBX=00000000  EDX=00000000  EDI=00000000  EBP=00000000   nmiints=0
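
The interaction between the CR0 guest/host mask, the CR0 read shadow and the real guest CR0 can be modelled in a few lines. This sketch only restates the architectural rule (bits set in the mask are owned by the host: guest reads return the shadow for them, and a MOV to CR0 that changes any of them relative to the shadow causes a VM exit). The function names are made up for illustration, and the values are copied from the dumps in this post.

```python
MASK64 = (1 << 64) - 1

def guest_visible_cr0(actual, mask, shadow):
    """Value the guest reads from CR0: shadow bits where mask=1, real bits elsewhere."""
    return (shadow & mask) | (actual & ~mask & MASK64)

def mov_to_cr0_exits(new_value, mask, shadow):
    """True if a guest MOV to CR0 differs from the read shadow in a host-owned bit."""
    return ((new_value ^ shadow) & mask) != 0

# control_CR0_mask, control_CR0_shadow and guest CR0 from the dumps above.
cr0_mask, cr0_shadow, cr0_actual = 0xFFFFFFFFFFFFFFF7, 0x60000010, 0x31

print(hex(guest_visible_cr0(cr0_actual, cr0_mask, cr0_shadow)))
# EAX=60000011 is the value the guest is about to write with PE set.
print(mov_to_cr0_exits(0x60000011, cr0_mask, cr0_shadow))
```

With these values the guest reads CR0 as 0x60000010, and writing 0x60000011 differs from the shadow in the host-owned PE bit, so the write is expected to cause a VM exit that the hypervisor has to handle before PE becomes visible to the guest.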

 

Thanks,

-Tao

Quoc-Thai_L_Intel

Thanks, Tao! I have forwarded your additional information to my peers.

-Thai

Tao_W_

Hi Thai,

Here are the updated guest information and VMCS for your reference.

VMCS fields:
0x0000003F = control_VMX_pin_based
0xA501E1F2 = control_VMX_cpu_based
0x00000082 = control_VMX_proc2_based
0x00000000 = control_exception_bitmap
0x00000000 = control_pagefault_errorcode_mask
0xFFFFFFFF = control_pagefault_errorcode_match
0x00000000 = control_CR3_target_count
0x00036FFB = control_VM_exit_controls
0x000011FB = control_VM_entry_controls
0x00000000 = control_VM_entry_interruption_information
0x00000000 = control_VM_entry_exception_errorcode
0x00000000 = control_VM_entry_instruction_length

0xFFFFFFFFFFFFFFF7 = control_CR0_mask
0xFFFFFFFFFFFFF871 = control_CR4_mask
0x0000000060000010 = control_CR0_shadow
0x0000000000000000 = control_CR4_shadow
0x0000000000000000 = control_CR3_target0
0x00000000B7934000 = control_CR3_target1
0x0000000000000000 = control_CR3_target2
0x0000000000000000 = control_CR3_target3
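
The 32-bit control fields above can be checked against the VMX capability MSRs posted earlier in this thread. In each capability MSR the low 32 bits give the bits that must be 1 and the high 32 bits give the bits that may be 1. The sketch below applies that rule; check_controls is a made-up helper name, and the choice of the TRUE_* MSRs assumes they are the ones that apply here (IA32_VMX_BASIC has bit 55 set in the dump).

```python
def check_controls(value, capability_msr):
    """Return (required bits that are clear, set bits that are not allowed)."""
    must_be_one = capability_msr & 0xFFFFFFFF   # low half: allowed-0 settings
    may_be_one = capability_msr >> 32           # high half: allowed-1 settings
    missing = must_be_one & ~value & 0xFFFFFFFF
    forbidden = value & ~may_be_one & 0xFFFFFFFF
    return missing, forbidden

# (VMCS control value, capability MSR) pairs copied from the dumps in this thread.
checks = {
    "pin_based": (0x0000003F, 0x0000003F00000016),  # IA32_VMX_TRUE_PINBASED_CTLS
    "cpu_based": (0xA501E1F2, 0xFFF9FFFE04006172),  # IA32_VMX_TRUE_PROCBASED_CTLS
    "proc2":     (0x00000082, 0x000038FE00000000),  # IA32_VMX_PROCBASED_CTLS2
    "vm_exit":   (0x00036FFB, 0x003FFFFF00036DFB),  # IA32_VMX_TRUE_EXIT_CTLS
    "vm_entry":  (0x000011FB, 0x0000F3FF000011FB),  # IA32_VMX_TRUE_ENTRY_CTLS
}
for name, (value, msr) in checks.items():
    missing, forbidden = check_controls(value, msr)
    print(f"{name}: missing={missing:#x} forbidden={forbidden:#x}")
```

All five fields pass this check (no missing and no forbidden bits), so the control settings themselves are consistent with what the nested environment reports. Note that control_VMX_proc2_based = 0x82 sets bit 1 (enable EPT) and bit 7 (unrestricted guest).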


Guest state:
CR0=0000000000000031  CR3=0000000000000000  CR4=0000000000002050

RSP=0000000000007BFA  SYSENTER_ESP=0000000000000000
RIP=0000000000007C2E  SYSENTER_EIP=0000000000000000
DR7=0000000000000400  SYSENTER_CS=00000000  RFLAGS=0000000000000006

   ES=0000  [ base=0000000000000000 limit=0000FFFF rights=00000093 ]
   CS=0000  [ base=0000000000000000 limit=0000FFFF rights=0000009B ]
   SS=0000  [ base=0000000000000000 limit=0000FFFF rights=00000093 ]
   DS=0000  [ base=0000000000000000 limit=0000FFFF rights=00000093 ]
   FS=0000  [ base=0000000000000000 limit=0000FFFF rights=00000093 ]
   GS=0000  [ base=0000000000000000 limit=0000FFFF rights=00000093 ]
LDTR=0000  [ base=0000000000000000 limit=0000FFFF rights=00000082 ]
   TR=0000  [ base=0000000000000000 limit=0000FFFF rights=0000008B ]
      GDTR  [ base=0000000000007C3C limit=00000017 ]
      IDTR  [ base=0000000000000000 limit=0000FFFF ]

EAX=60000011  ECX=00000000  ESI=00000000  ESP=00007BFA   extints=0
EBX=00000000  EDX=00000000  EDI=00000000  EBP=00000000   nmiints=0
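
The rights values in the segment dump use the VMCS access-rights layout, which can be decoded as follows. This is a minimal sketch with a made-up helper name (decode_access_rights); the bit positions follow the VMCS guest segment access-rights format.

```python
def decode_access_rights(ar):
    """Decode a VMCS guest segment access-rights field into named parts."""
    return {
        "type": ar & 0xF,
        "s": (ar >> 4) & 1,            # 1 = code or data, 0 = system segment
        "dpl": (ar >> 5) & 3,
        "present": (ar >> 7) & 1,
        "long_mode": (ar >> 13) & 1,
        "db": (ar >> 14) & 1,          # 1 = 32-bit default operand size
        "granularity": (ar >> 15) & 1,
        "unusable": (ar >> 16) & 1,
    }

print("CS  ", decode_access_rights(0x9B))
print("DS  ", decode_access_rights(0x93))
print("TR  ", decode_access_rights(0x8B))
```

CS rights of 0x9B decode to a present, accessed, readable code segment with D/B = 0 and G = 0, that is, a 16-bit segment with a byte-granular 0xFFFF limit, which is what one would expect before the far jump. After a successful jump through GDT entry 1, the CS selector would be expected to read 0x0008 with D/B and G set (rights of the form 0xC09B).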

The cpuinfo of the Linux system running in VMware is below:
processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz
stepping        : 2
microcode       : 0x3c
cpu MHz         : 2397.291
cache size      : 15360 KB
physical id     : 2
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 2
initial apicid  : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 15
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm tpr_shadow vnmi ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid xsaveopt arat
bugs            :
bogomips        : 4801.89
clflush size    : 64
cache_alignment : 64
address sizes   : 43 bits physical, 48 bits virtual
power management:

 

The guest code is as follows:

#define SEG_KCODE 1  // kernel code
#define SEG_KDATA 2  // kernel data+stack
#define SEG_KCPU  3  // kernel per-cpu data
#define SEG_UCODE 4  // user code
#define SEG_UDATA 5  // user data+stack
#define SEG_TSS   6  // this process's task state

#define CR0_PE          0x00000001      // Protection Enable

#define SEG_NULLASM                                             \
    .word 0, 0;                                             \
    .byte 0, 0, 0, 0

// The 0xC0 means the limit is in 4096-byte units
// and (for executable segments) 32-bit mode.
#define SEG_ASM(type,base,lim)                                  \
        .word (((lim) >> 12) & 0xffff), ((base) & 0xffff);      \
        .byte (((base) >> 16) & 0xff), (0x90 | (type)),         \
        (0xC0 | (((lim) >> 28) & 0xf)), (((base) >> 24) & 0xff)

#define STA_X     0x8       // Executable segment
#define STA_E     0x4       // Expand down (non-executable segments)
#define STA_C     0x4       // Conforming code segment (executable only)
#define STA_W     0x2       // Writeable (non-executable segments)
#define STA_R     0x2       // Readable (executable segments)
#define STA_A     0x1       // Accessed

# Start the first CPU: switch to 32-bit protected mode, jump into C.

        .code16
        .global code16, code16_end
code16:
        xor %ecx, %ecx
        mov %cr3, %eax
        mov %eax, %cr3
    seta20.1:
        inb     $0x64,%al               # Wait for not busy
        testb   $0x2,%al
        jnz     seta20.1

        movb    $0xd1,%al               # 0xd1 -> port 0x64
        outb    %al,$0x64

    seta20.2:
        inb     $0x64,%al               # Wait for not busy
        testb   $0x2,%al
        jnz     seta20.2

        movb    $0xdf,%al               # 0xdf -> port 0x60
        outb    %al,$0x60

        wrmsr

        lgdt    gdtdesc
        movl    %cr0, %eax
        orl     $CR0_PE, %eax
        movl    %eax, %cr0

        rdmsr 
//PAGEBREAK!
# Complete transition to 32-bit protected mode by using long jmp
# to reload %cs and %eip.  The segment descriptors are set up with no
# translation, so that the mapping is still the identity mapping.

       .byte 0x66, 0xea
       .long start32
       .word (SEG_KCODE<<3)


        .code32  # Tell assembler to generate 32-bit code now.
start32:
cid:
        cpuid
        # Bootstrap GDT

        .p2align 2                                # force 4 byte alignment
gdt:
        SEG_NULLASM                              # NULL seg
        SEG_ASM(STA_X|STA_R, 0x0, 0xffffffff)   # code seg
        SEG_ASM(STA_W, 0x0, 0xffffffff)         # data seg

gdtdesc:

        .word   (gdtdesc - gdt - 1)             # sizeof(gdt) - 1

        .long   gdt
code16_end:
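
To confirm that the SEG_ASM macro produces the descriptors seen in the GDT dump, its expansion can be mirrored in Python. This is an illustrative sketch: seg_asm is a made-up name that follows the macro above field by field, and the dword order in the printed lines (high, then low) is an assumption chosen to match the format of the earlier dump.

```python
import struct

STA_X, STA_W, STA_R = 0x8, 0x2, 0x2

def seg_asm(seg_type, base, lim):
    """Mirror the SEG_ASM macro: return the 8 descriptor bytes it emits."""
    return struct.pack(
        "<HHBBBB",
        (lim >> 12) & 0xFFFF,          # limit 15:0, in 4 KiB units
        base & 0xFFFF,                 # base 15:0
        (base >> 16) & 0xFF,           # base 23:16
        0x90 | seg_type,               # present, DPL 0, code/data + type
        0xC0 | ((lim >> 28) & 0xF),    # G=1, D/B=1, limit 19:16
        (base >> 24) & 0xFF,           # base 31:24
    )

# SEG_NULLASM followed by the code and data segments, as in the gdt label.
gdt = bytes(8) + seg_asm(STA_X | STA_R, 0x0, 0xFFFFFFFF) + seg_asm(STA_W, 0x0, 0xFFFFFFFF)
for i in range(0, len(gdt), 8):
    low, high = struct.unpack_from("<II", gdt, i)
    print(f"entry {i // 8}: {high:08x},{low:08x}")
print(f"gdtdesc limit: {len(gdt) - 1:#x}")
```

The sketch yields 00cf9a00,0000ffff and 00cf9200,0000ffff for entries 1 and 2 and a limit of 0x17, which agrees with the GDT dump and with the GDTR limit in the updated guest state.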

 

The Makefile to build it is:

G_CFLAGS = -fno-pic -static -fno-builtin -fno-strict-aliasing -Wall -MD -ggdb -m32 -Werror -fno-omit-frame-pointer
G_CFLAGS += $(shell $(CC) -fno-stack-protector -E -x c /dev/null >/dev/null 2>&1 && echo -fno-stack-protector)

 

        $(CC) $(G_CFLAGS) -fno-pic -nostdinc -I. -c code16.S
        $(LD) $(G_LDFLAGS) -N -e start -Ttext 0x7C00 -o bootblock.o code16.o
        $(OBJCOPY) -S -O binary -j .text bootblock.o bootblock.bin

Thank you very much for your help.

 
