Kernel Panic When Trying To Enable BTS

ETPA_Team · ‎04-07-2013

I am part of a team working on a project that uses the Branch Trace Store (BTS) feature of the x86 architecture to capture execution traces of running applications on the Linux 3.2 kernel. We have run into some issues and were hoping that someone here could help us sort them out.

We have gotten to the point where we can set up the DS save area and enable BTS with BTINT enabled. However, once BTS is enabled, we always hit a kernel panic as the result of a segfault (usually in memory belonging to commonly used C libraries such as libc). We believe the problem is either the result of using mlock on the memory we allocate for BTS (we have found conflicting information on what mlock does and whether it is necessary in this application) or the result of a problem converting virtual memory addresses to linear addresses before writing the address to the IA32_DS_AREA register. Either way, the issue appears to stem from the fact that, with our DS save area setup code, the allocated memory always seems to end up inside a library mapped into a process. Granted, it may also be something completely different that we haven't been able to identify yet.

We are running this code on an HP EliteBook with an Intel Core i7 Q820 CPU.

I will include some relevant code chunks below. If any other information would be helpful, let me know and I'll add it.

Our code that sets up the DS Save Area

[cpp]

int i = 0;
void* virtMemBTSBuffer[8];
unsigned long physMemBTSBuffer[8];
void* virtMemBTSBufferInfo[8];
unsigned long physMemBTSBufferInfo[8];
void* addressBuffer;
void* addressBuffer2;
struct physMemInfo tempMemStruct;
for(i; i <= 7; i++){

//Start Buffer Allocation
virtMemBTSBuffer[i] = calloc(1, 4079);
mlock(virtMemBTSBuffer[i], 4079);
physMemBTSBuffer[i] = virtToLinear(virtMemBTSBuffer[i],&tempMemStruct);

virtMemBTSBufferInfo[i] = calloc(1, 0x58);
mlock(virtMemBTSBuffer[i], 0x58);
physMemBTSBufferInfo[i] = virtToLinear(virtMemBTSBufferInfo[i],&tempMemStruct);
//End Buffer Allocation

//Start Buffer Info Setting
memcpy(virtMemBTSBufferInfo[i], &(physMemBTSBuffer[i]), 8); // set the base
memcpy(virtMemBTSBufferInfo[i] + 0x8, &(physMemBTSBuffer[i]), 8); // set the index
addressBuffer = &(physMemBTSBuffer[i]) + 4057; // 170 x 24byte records + 1
memcpy(virtMemBTSBufferInfo[i] + 0x10, &addressBuffer, 8); // set the max
//addressBuffer = &(physMemBTSBuffer[i]) + 2400;

addressBuffer = (void*)0xffffffffffffffff; // use only if using a circular buffer
memcpy(virtMemBTSBufferInfo[i] + 0x18,&addressBuffer,8); // set the interrupt threshold

//End Buffer Info Setting

//Start Setting IA32_DS_AREA
wrmsr(i,IA32_DS_AREA,physMemBTSBufferInfo[i]);
printf("the value of the register 0x600 on cpu #%d is 0x%lx, should be 0x%lx\n",i,rdmsr(i,0x600),physMemBTSBufferInfo[i]);
//End Setting IA32_DS_AREA

[/cpp]
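
For reference, the four memcpy calls above fill the first four 8-byte fields of the 64-bit DS buffer management area. A minimal sketch of that layout as a C struct follows. The struct name, the helper `ds_area_init_bts` and the `BTS_RECORD_SIZE` macro are hypothetical names made up for illustration; the field offsets are the ones used in the code above (0x0, 0x8, 0x10, 0x18) followed by the PEBS fields from the DS area diagram, and the "+ 1" on the absolute maximum simply follows the comment in the code above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 64-bit BTS record: branch-from, branch-to, flags (8 bytes each). */
#define BTS_RECORD_SIZE 24u

/* First part of the 64-bit DS buffer management area (illustrative name). */
struct ds_area64 {
    uint64_t bts_buffer_base;          /* 0x00 */
    uint64_t bts_index;                /* 0x08 */
    uint64_t bts_absolute_maximum;     /* 0x10 */
    uint64_t bts_interrupt_threshold;  /* 0x18 */
    uint64_t pebs_buffer_base;         /* 0x20 */
    uint64_t pebs_index;               /* 0x28 */
    uint64_t pebs_absolute_maximum;    /* 0x30 */
    uint64_t pebs_interrupt_threshold; /* 0x38 */
};

/* Compile-time checks that the struct matches the offsets used above. */
_Static_assert(offsetof(struct ds_area64, bts_index) == 0x08, "index offset");
_Static_assert(offsetof(struct ds_area64, bts_absolute_maximum) == 0x10, "max offset");
_Static_assert(offsetof(struct ds_area64, bts_interrupt_threshold) == 0x18, "threshold offset");

/*
 * Hypothetical helper: describe a BTS buffer of `nrecords` records that
 * starts at address `base` (the address as the CPU will see it).
 * The threshold is set past the end of the buffer, as in the code above,
 * so the buffer is used as a circular buffer without an interrupt.
 */
void ds_area_init_bts(struct ds_area64 *ds, uint64_t base, uint64_t nrecords)
{
    assert(ds != NULL);
    memset(ds, 0, sizeof *ds);
    ds->bts_buffer_base = base;
    ds->bts_index = base; /* next record goes at the start */
    ds->bts_absolute_maximum = base + nrecords * BTS_RECORD_SIZE + 1;
    ds->bts_interrupt_threshold = UINT64_MAX;
}
```

Writing the fields through a struct instead of memcpy with raw offsets also avoids pointer-arithmetic surprises: adding 4057 to a pointer to `unsigned long` advances it by 4057 elements, not 4057 bytes.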

And our code that converts virtual addresses to linear addresses

[cpp]

unsigned long virtToLinear(void* virtAddr, struct physMemInfo* memInfo){
unsigned long frameInfo=0;
unsigned long pageSize = sysconf(_SC_PAGE_SIZE);
unsigned long pageNumber = (unsigned long) virtAddr / pageSize;
int pid = getpid();
char pageMapFile[50];
FILE* pagemap = NULL;
snprintf(pageMapFile,50,"/proc/%d/pagemap",pid);
pagemap=fopen(pageMapFile,"r");
fseek(pagemap,pageNumber * 8, SEEK_SET);
fread(&frameInfo, 8, 1, pagemap);
memInfo->virtAddr = virtAddr;
memInfo->offset = (unsigned int)((unsigned long) virtAddr % pageSize);
if(0x8000000000000000 & frameInfo){ // page is present
   memInfo->pagePresent=1;
   if(!(0x4000000000000000 & frameInfo)){ // page is in memory
   memInfo->pageSwapped=0;
   memInfo->frameNumber=0x3fffffffffffff & frameInfo;
  }else{ // page is swapped
   memInfo->pageSwapped=1;
   memInfo->swapType=0xf&frameInfo;
   memInfo->swapOffset=(unsigned long)(0x3ffffffffffff0 & frameInfo) >> 4;
  }
   memInfo->pageShift=((unsigned long)frameInfo >> 55) & 0x7f;
}
else{ // page is not present
  memInfo->pagePresent=0;
}
if(memInfo->pagePresent && !memInfo->pageSwapped){ // address is in memory
   return (memInfo->frameNumber << memInfo->pageShift) + memInfo->offset; // physical address
}
return 0;
}

[/cpp]
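
As a cross-check of the bit tests above, here is a minimal sketch that decodes a single pagemap entry without any file I/O. It assumes the layout described in the kernel's pagemap documentation (bit 63 = present, bit 62 = swapped, bits 0-54 = page frame number), and the function and macro names are hypothetical. Note that the mask in the function above covers 54 bits rather than 55, and that the value computed this way is a physical address, which is not the same thing as a linear address.

```c
#include <assert.h>
#include <stdint.h>

#define PM_PRESENT  (1ULL << 63)         /* page is present in RAM */
#define PM_SWAPPED  (1ULL << 62)         /* page is in swap */
#define PM_PFN_MASK ((1ULL << 55) - 1)   /* bits 0-54: page frame number */

/*
 * Hypothetical helper: turn one 64-bit pagemap entry into a physical
 * address for `vaddr`. Returns 0 when the page is not resident.
 * `page_size` must be a power of two (e.g. sysconf(_SC_PAGE_SIZE)).
 */
uint64_t pagemap_entry_to_phys(uint64_t entry, uint64_t vaddr, uint64_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    if (!(entry & PM_PRESENT) || (entry & PM_SWAPPED))
        return 0;

    uint64_t pfn = entry & PM_PFN_MASK;
    uint64_t offset = vaddr & (page_size - 1);
    return pfn * page_size + offset;
}
```

For example, an entry with the present bit set and frame number 0x21c354, combined with a virtual address whose page offset is 0x730, yields 0x21c354730 for 4 KiB pages.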

A sample output leading up to the kernel panic

[plain]
the value of the register 0x600 on cpu #0 is 0x21c20e250 should, should be 0x21c20e250
the value of the register 0x1d9 on cpu #0 is 704 should, should be 0x2c1
the value of the register 0x600 on cpu #1 is 0x21c354730 should, should be 0x21c354730
[111998.783656] init[1]: segfault at 21c354738 ip 00007fd7becc0850 sp 00007fff84bca678 error 4 in libnih.so.1.0.0[7fd7becb8000+16000]
[111998.794935] init[1]: segfault at 21c354738 ip 00007fd7bf0fcc80 sp 00007fff84bca438 error 4 in init[7fd7bf0f5000+26000]
[112002.890421] Kernel panic – not syncing: Attempt to kill init!
[112002.890650] upstart-socket-[714]: segfault at 21c354738 ip 00007f259864f003 sp 0007fff9b66c7a8 error 4 in libc-2.15.so[7f2598562000+1b5000]
[112002.891547] Pid: 1, comm: init Tainted: G WC O 3.2.0-38-generic #61-Ubuntu
[112002.892115] Call Trace:
[112002.892691] [<ffffffff81644952>] panic+0x91/0x1a4
[112002.893268] [<ffffffff8106be95>] forget_original_parent+0x245/0x250
[112002.893844] [<ffffffff8106beb7>] exit_notify+0x17/0x110
[112002.894451] [<ffffffff8106c743>] do_exit+0x44/0x450
[112002.895027] [<ffffffff8106cb44>] do_group_exit+0x44/0xa0
[112002.895602] [<ffffffff8107db1c>] get_signal_to_deliver+0x21c/0x420
[112002.896183] [<ffffffff81014865>] do_signal+0x45/0x130
[112002.896761] [<ffffffff81644ab6>] ? printk+0x51/0x53
[112002.897338] [<ffffffff8105712a>] ? finish_task_switch+0x4a/0xf0
[112002.897918] [<ffffffff8165ae3c>] ? __schedule+0x3cc/0x6f0
[112002.898502] [<ffffffff81014b15>] do_notify_resume+0x65/0x80
[112002.898943] [<ffffffff8165d93c>] retint_signal+0x48/0x8c
[112002.899337] panic occurred, switching back to text console

[/plain]

Thank you!

Bernard · ‎04-07-2013

Hi,

Can you dump the contents of the general-purpose registers at the time of the segfault? Next, can you resolve the addresses @eip == 00007fd7becc0850 and @eip == 00007fd7bf0fcc80 and step backward from them? I suppose that these faulting ip addresses point to a block of code that could have been executed by two cores, so we may have some kind of synchronization issue, or these may simply be two separate segfaults that were caught. To get more insight into what happened, I would advise you to dump the stack pointer and the raw stack data, because the faulting code could be dereferencing uninitialized or inaccessible memory (on Windows such memory is marked PAGE_NOACCESS).

ETPA_Team · ‎04-08-2013

That would be a good next step, but the problem is that while the same general failure occurs each time the code is executed (a segmentation fault resulting in a kernel panic), it doesn't appear to occur at the same point each time. In the three times I've just reproduced the issue (while trying to work out a way of doing what you suggested), the segfault has occurred at a different address each time. It is always at (the value of IA32_DS_AREA) + 0x8, but that address differs from run to run because of the way memory is allocated from the heap, and the kernel panic has listed a different library each time (not just libnih from the sample output).

Also, the segmentation fault hasn't been occurring at the same point in our code each time; it happened slightly after setting the value of the IA32_DEBUGCTL register, on a different core in each of the three runs. That code is the next line after the first block of code included above and looks like this:

[cpp]wrmsr(i,IA32_DEBUGCTL, IA32_DEBUGCTL_TR | IA32_DEBUGCTL_BTS | IA32_DEBUGCTL_BTS_OFF_OS); [/cpp]
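
The definitions of these macros are not shown in the thread. A minimal sketch of what they might look like, assuming the bit positions documented for IA32_DEBUGCTL (TR = bit 6, BTS = bit 7, BTINT = bit 8, BTS_OFF_OS = bit 9, BTS_OFF_USR = bit 10) and the MSR numbers 0x1d9 and 0x600 that appear in the sample output, is given here. The helper `debugctl_bts_user_only` is a hypothetical name used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* MSR numbers, as printed in the sample output (0x600 and 0x1d9). */
#define IA32_DS_AREA  0x600u
#define IA32_DEBUGCTL 0x1d9u

/* Assumed bit positions within IA32_DEBUGCTL. */
#define IA32_DEBUGCTL_TR          (1ULL << 6)  /* enable branch trace messages */
#define IA32_DEBUGCTL_BTS         (1ULL << 7)  /* store them in the BTS buffer */
#define IA32_DEBUGCTL_BTINT       (1ULL << 8)  /* interrupt at the threshold */
#define IA32_DEBUGCTL_BTS_OFF_OS  (1ULL << 9)  /* skip branches at CPL 0 */
#define IA32_DEBUGCTL_BTS_OFF_USR (1ULL << 10) /* skip branches at CPL > 0 */

/*
 * Hypothetical helper: value to write for tracing user-mode branches only,
 * optionally with the buffer-threshold interrupt enabled.
 */
uint64_t debugctl_bts_user_only(int use_interrupt)
{
    uint64_t value = IA32_DEBUGCTL_TR | IA32_DEBUGCTL_BTS | IA32_DEBUGCTL_BTS_OFF_OS;
    if (use_interrupt)
        value |= IA32_DEBUGCTL_BTINT;
    return value;
}
```

Under these assumptions the combination used in the wrmsr call above is 0x2c0, which is 704 in decimal, the value reported for register 0x1d9 in the sample output.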

Not only does the crash seem to occur in a different place each time, but the kernel panic also locks up the computer and prevents any further debugging, so we are unable to use gdb to resolve the address of eip (in our program). However, since the faulting addresses are the same as what we expect to be the location of the BTS top of stack pointer, and the kernel panic resolves the address to a precompiled library each time, we believe that BTS is being enabled but that there is a mismatch between where we think we put the DS_SAVE_AREA and the address we put in the IA32_DS_AREA register. Also note that the ip mentioned in the kernel panic isn't from our application. The ip should never be on the heap unless we are doing something terribly wrong, and it is most likely some kernel-mode driver calling a library function that causes the kernel panic, since a user-mode application should simply crash, not cause a kernel panic.

See section 19.18.5 in this document http://css.csail.mit.edu/6.858/2012/readings/ia32/ia32-3b.pdf for a diagram of the DS area we are attempting to initialize with this code, as well as sections 19.5.1 and 19.7.8 for descriptions of the BTS mechanism and what we are attempting to do.

Thank you,

ETPA Team

Bernard · ‎04-09-2013

Hi ,

Unfortunately, I do not have a lot of experience in Linux kernel debugging. My answer was based on experience gained from Windows debugging. I suppose that gdb lacks the broad functionality of windbg, mainly in kernel-mode debugging.

Regarding the ip value, I can recall from a Windows debugging session that the saved ip was somehow altered (overwritten) after a call instruction returned, which transferred execution to the wrong part of the code. It would be very insightful if you could resolve the memory address of the faulting ip. The crucial information about the bad address can probably be found in the raw stack data. Tomorrow I will look at your issue.

Bernard · ‎04-09-2013

Who is writing to the IA32_DEBUGCTL register? If it is your code, please put a breakpoint on the wrmsr instruction and trace it under a debugger. Does GDB have an option for single-stepping through the code? Windbg has an option to break on every memory access to a particular address; can you enable such an option in GDB? It could be interesting to track accesses to DS_SAVE_AREA (I suppose your team supplied its own routine to access DS_SAVE_AREA); can you do that with GDB? Can you somehow resolve the address of the code that is writing to IA32_DS_AREA at the time of the kernel panic? I think this is crucial for understanding the root cause of the panic. Is the IA32_DS_AREA pointer valid, and does it point to the first byte of the DS management buffer? Is there even a remote possibility that other code (not related to yours) tried to write to IA32_DS_AREA and thus corrupted your pointer? On Windows, one can lock access to some part of memory in kernel mode by using a busy-waiting loop at high IRQL with a nop instruction, thus preventing other kernel code from running. Can you implement this in Linux?

Sorry that I cannot be more helpful; I have no real experience in Linux debugging. My knowledge comes from Windows debugging :)

ETPA_Team · ‎04-17-2013

Sorry for the delay in getting back to you, I'll try to address each of your questions and hopefully we can move forward.

gdb does have a single-stepping feature, and when I use it to step through the application, the application always crashes directly after our code modifies the value of IA32_DEBUGCTL (thereby enabling the BTS feature). The crash happens after executing the wrmsr function and before I have a chance to step to the next line of code. The wrmsr is actually carried out by the msr kernel module: our user-mode application writes to the file /dev/cpu/0/msr at the correct offset. I also tried writing a kernel module in which I called wrmsr directly using inline assembly, and the same problem occurs.
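
The wrmsr and rdmsr helpers themselves are not shown in the thread. A minimal sketch of how such helpers could sit on top of the msr device follows, assuming the usual interface of that driver: an 8-byte read or write at a file offset equal to the MSR number. The function names are hypothetical, and opening the device requires root and the msr module to be loaded.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Open the msr device of one logical CPU; returns -1 on failure. */
int msr_open(int cpu)
{
    char path[64];
    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
    return open(path, O_RDWR);
}

/* Write a 64-bit value at file offset `reg` (the MSR number). */
int msr_write_fd(int fd, uint32_t reg, uint64_t value)
{
    ssize_t n = pwrite(fd, &value, sizeof value, (off_t)reg);
    return n == (ssize_t)sizeof value ? 0 : -1;
}

/* Read a 64-bit value from file offset `reg` (the MSR number). */
int msr_read_fd(int fd, uint32_t reg, uint64_t *value)
{
    assert(value != NULL);
    ssize_t n = pread(fd, value, sizeof *value, (off_t)reg);
    return n == (ssize_t)sizeof *value ? 0 : -1;
}
```

Checking the return value of every write matters here, because a failed write to IA32_DS_AREA followed by a successful write to IA32_DEBUGCTL would enable BTS with whatever pointer was in the register before.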

I believe that the CPU is writing directly to the IA32_DS_AREA as part of the BTS functionality (incrementing the index of the BTS top of stack pointer), not any code that we wrote, which would be consistent with the kernel panic occurring in seemingly random libraries. You are most likely correct that some other code is in fact writing to the IA32_DS_AREA, or more likely (based on the kernel panic's stack trace) some code or the stack of some other application unrelated to ours is being stored in the same location and overwritten by the BTS function of the CPU.

I should be able to implement the busy-wait locking mechanism you described, but if I understand your description correctly, I'm not sure it will work properly on a multicore CPU. It would be possible to disable the other cores with some kernel options, though, so I'll look into that as a next step.

I'm pretty sure the root of this problem is in translating the user-mode application's virtual addresses to the format the CPU wants in its IA32_DS_AREA register, i.e. the CPU is overwriting code somewhere outside of the region we malloc'ed and mlock'ed. As a sanity check, I directly used the virtual addresses returned by malloc, but the same problem occurred, so I'm really scratching my head trying to understand the issue. I may also be able to set aside a region of unmanaged memory to use for the DS_AREA and BTS buffer, but I will have to look into how to achieve this.

I appreciate your help so far; it's good to at least have someone to bounce ideas off of. This is a lot more complex than anything we've tried before.

Bernard · ‎04-18-2013

Hi

I will answer your post tomorrow.

Bernard · ‎04-19-2013

My advice to programmatically block some cores with busy-wait loops is tailored to Windows driver development, and I do not know whether it will work on Linux. The main idea is to spin the other cores in a useless loop at elevated IRQL, thereby blocking the execution of kernel-mode code on the other logical processors. I think the best option would be to block the other cores while your code is issuing the wrmsr instruction; by doing this we could, in theory, keep other kernel code from manipulating the IA32_DEBUGCTL register.

I strongly suspect that other code is accessing IA32_DS_AREA at the same time and corrupting your pointer. The problem lies in how to prevent this code from running. I suppose that we could try to capture the call stack of the offending thread and disassemble the function that is carrying out the wrmsr instruction. In one of my previous posts I advised you to step backward and try disassembling the code at the faulting ip and the saved context; I think that this, albeit tedious, could shed some light on the root cause. Can you put a breakpoint on the wrmsr instruction or on the IA32_DS_AREA memory access and disassemble the functions on the call stack?

I also suppose that the cause could be virtual address translation or even an internal processor machine error. Would it be possible to reproduce the error on another machine? That would help determine whether your code is responsible.

I am glad that I can be helpful in solving your case.

Patrick_F_Intel1 · ‎04-19-2013

Hello ETPA team,

Here is a response from someone in Intel who has programmed BTS.

Those guys are programming BTS from the user mode!!!

No matter which tricks they do with memory locking and address conversion, their system will only last until their thread gets swapped out.

Explanation: BTS operates on linear addresses, which are the same for all processes in the system, so one may lock a user-mode buffer in physical memory, program a pointer to that buffer in DSA, but after that current process gets swapped out, the BTS logic will dump all records to the same linear addresses of another process, which may contain anything. Hence the crash.
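
The point about one address meaning different things in different processes can be illustrated without touching BTS at all. The following is a minimal user-space sketch, with a hypothetical function name, in which a parent and a forked child use the same virtual address and still see different contents, because each process has its own mapping behind that address.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* One variable, so parent and child refer to the same virtual address. */
static volatile uint64_t cell;

/*
 * Returns 1 if a write made by the child through &cell is NOT visible to
 * the parent at the same address, 0 if it is visible, -1 on error.
 */
int same_address_different_memory(void)
{
    cell = 0x1111;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: same address, but the write lands in the child's own copy. */
        cell = 0x2222;
        _exit(cell == 0x2222 ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    /* Parent: the value behind the same address is unchanged. */
    return cell == 0x1111 ? 1 : 0;
}
```

A hardware unit that keeps writing to a fixed linear address after a context switch is in the reverse situation: the address stays the same, but the memory behind it now belongs to whichever process is running.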

Besides, I do not understand why they write to DSA 8 times, and what exactly they write.

Maybe I do not understand all the logic behind their code, but, again, using BTS from the user-land makes no sense to me.

I (Pat) have no experience with programming BTS. Is our BTS guy correct that you are trying to program BTS from user land (ring 3)?

Pat

Bernard · ‎04-19-2013

The ETPA team also wrote a kernel-mode module that uses wrmsr directly, but the system still crashed.

Bernard · ‎05-04-2013

ETPA team,

do you have any updates?