Hi,
My first post here :)
I'm an academic researcher experimenting with PT in the hope of using it inside a tracing JIT.
For now, I have a simple C program which (pretends to be an interpreter and) has a loop. The goal is to collect a trace of one iteration of this loop. The important part looks like this:
```
for (i = 0; i < LARGE; i++) {
    /* "JIT Merge Point" */
    if (i == HOT_THRESHOLD) {
        VDEBUG("Hot loop start");
        trace_on(&tr_ctx);
        sleep(1);
    }
    for (int j = 0; j < 1000; j++) {
        /* Start Marker */
        asm volatile(
            "nop\n\t"
            "nop\n\t"
            "nop\n\t"
            :::);
        if (i % getpid() == 0) {
            sum += i * p % 33;
        } else {
            sum += i * 5 % 67 + p;
        }
        /* Stop Marker */
        asm volatile(
            "nop\n\t"
            "nop\n\t"
            :::);
    }
    if (i == HOT_THRESHOLD) {
        VDEBUG("Hot loop end");
        if (ioctl(tr_ctx.perf_fd, PERF_EVENT_IOC_DISABLE, 0) < 0)
            err(EXIT_FAILURE, "ioctl to stop tracer");
        if (close(tr_ctx.loop_done_fds[1]) == -1) {
            err(EXIT_FAILURE, "close");
        }
        TDEBUG("Turned off tracer");
    }
}
```
- `trace_on()` turns on PT and runs the poll loop in a thread.
- I'm using `perf_event_open()` to communicate with the kernel's PT functionality (a rough sketch of the setup is shown after this list).
- The `nop`s are there so I can easily spot the loop code in a trace.
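For context, the PT setup inside `trace_on()` is roughly along these lines (a simplified sketch, not my exact code, which is in [1]; the AUX buffer mapping, the poll loop and most error handling are omitted, and the helper name is just for illustration):

```
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Open an Intel PT event for the given thread, initially disabled. */
static int open_pt_event(pid_t tid)
{
    /* The dynamic PMU type for Intel PT is published in sysfs. */
    FILE *f = fopen("/sys/bus/event_source/devices/intel_pt/type", "r");
    if (f == NULL)
        return -1;

    int type;
    if (fscanf(f, "%d", &type) != 1) {
        fclose(f);
        return -1;
    }
    fclose(f);

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;           /* Intel PT PMU */
    attr.exclude_kernel = 1;    /* user-space only, as in my program */
    attr.disabled = 1;          /* enable later with an ioctl */

    /* Trace thread `tid` on any CPU. */
    return syscall(SYS_perf_event_open, &attr, tid, -1, -1, 0);
}

/*
 * Tracing is then switched on and off around the hot region with:
 *   ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
 *   ...
 *   ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
 */
```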
If we disassemble the loop, it looks like this:
```
1646:   90                      nop
1647:   90                      nop
1648:   90                      nop
1649:   e8 22 f5 ff ff          callq  b70 <getpid@plt>
164e:   89 c1                   mov    %eax,%ecx
1650:   8b 45 f4                mov    -0xc(%rbp),%eax
... etc
16be:   90                      nop
16bf:   90                      nop
```
If I run the program and disassemble the trace using `ptxed` from the processor-trace repo [0] then we get:
```
[enabled]
[exec mode: 64-bit]
[28, 7f332607f536: error: no memory mapped at this address]
[enabled]
[exec mode: 64-bit]
[987, 7f3325d89f47: error: no memory mapped at this address]
0000562813a12656  mov eax, edx
0000562813a12658  test eax, eax
0000562813a1265a  jnz 0x562813a1268d
0000562813a1268d  mov edx, dword ptr [rbp-0xc]
0000562813a12690  mov eax, edx
0000562813a12692  shl eax, 0x2
0000562813a12695  lea ecx, ptr [rax+rdx*1]
0000562813a12698  mov edx, 0x7a44c6b
0000562813a1269d  mov eax, ecx
0000562813a1269f  imul edx
0000562813a126a1  sar edx, 0x1
0000562813a126a3  mov eax, ecx
0000562813a126a5  sar eax, 0x1f
0000562813a126a8  sub edx, eax
0000562813a126aa  mov eax, edx
0000562813a126ac  imul eax, eax, 0x43
0000562813a126af  sub ecx, eax
0000562813a126b1  mov eax, ecx
0000562813a126b3  mov edx, dword ptr [rbp-0x14]
0000562813a126b6  add eax, edx
0000562813a126b8  cdqe
0000562813a126ba  add qword ptr [rbp-0x8], rax
0000562813a126be  nop
0000562813a126bf  nop
0000562813a126c0  add dword ptr [rbp-0x10], 0x1
0000562813a126c4  cmp dword ptr [rbp-0x10], 0x3e7
0000562813a126cb  jle 0x562813a12646
0000562813a12646  nop
0000562813a12647  nop
0000562813a12648  nop
0000562813a12649  call 0x562813a11b70
...
```
What baffles me about this is that we see `nop, nop` (the end-of-loop marker) before we see `nop, nop, nop` (the start-of-loop marker). This suggests that the head of the loop has been lost. This might even be the 100th iteration of the inner loop we are looking at here. In any case, we should see `nop, nop, nop`, some other code, *then* `nop, nop`.
My initial thought was that perhaps there is some latency between the `ioctl()` to turn on the trace and tracing actually starting, but you can put a 10-second sleep after `trace_on()` and still get the same result.
So, does anyone know why the top of the loop is missing? Has the compiler moved my asm blocks around? I was under the impression that it couldn't, but maybe I'm wrong... FWIW, the program is built with `-O0`.
I also notice that if I take away the inner loop (leaving the computation inside, though), then I don't see any of the outer loop's code at all, which supports my theory that the first few instructions of the trace are missing. I suppose that here the computation is shorter than the number of instructions that get lost.
The full code for my program is here [1].
Thanks!
[0] https://github.com/01org/processor-trace
[1] https://github.com/vext01/hardware_tracer/blob/4dece2203d6585f97003d6ff8c709f58ca5e1e99/vm.c
Hello Edd,
In addition to Andi's point about the decode error, you might also want to check your perf_event configuration. When the AUX buffer is mapped read-only, it runs in overwrite mode and older trace data is overwritten when the buffer wraps around.
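Roughly speaking, the distinction is in the protection flags of the second `mmap()` (just a sketch, not your code; buffer sizes are illustrative and should be powers of two pages):

```
#include <linux/perf_event.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map the perf data area, then the AUX (PT) area of a perf_event fd. */
void *map_pt_aux(int perf_fd, size_t data_pages, size_t aux_pages)
{
    long ps = sysconf(_SC_PAGESIZE);

    /* Header page plus the regular data buffer. */
    struct perf_event_mmap_page *hdr =
        mmap(NULL, (1 + data_pages) * ps, PROT_READ | PROT_WRITE,
             MAP_SHARED, perf_fd, 0);
    if (hdr == MAP_FAILED)
        return NULL;

    hdr->aux_offset = (1 + data_pages) * ps;
    hdr->aux_size   = aux_pages * ps;

    /*
     * PROT_READ | PROT_WRITE: the kernel stops writing when the AUX
     * buffer is full (you consume data and update aux_tail).
     * PROT_READ only: overwrite ("snapshot") mode; older trace is
     * overwritten when the buffer wraps.
     */
    return mmap(NULL, hdr->aux_size, PROT_READ | PROT_WRITE,
                MAP_SHARED, perf_fd, hdr->aux_offset);
}
```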
regards,
markus.
Hi Andi, Markus,
Thanks for your swift responses. Looks like Andi hit the nail on the head! I'm going to reply in quite a verbose fashion in the hope that this helps others (and my future self) who land here with the same problem :)
As I understand it, the trace I showed before was running through code for which `ptxed` had no on-disk binary. As it happens, this code is a return from `libc`, probably `sem_wait()` unblocking the thread after the tracer has initialised and signalled the other thread to carry on. Without knowledge of that code, `ptxed` is unable to know where we are in the binary until the next sync point (presumably a trace packet containing an uncompressed absolute address?).
So what we need to do is tell `ptxed` to load the libraries at the same offsets as in the relocated memory image (bearing in mind the offsets will change between runs due to ASLR). I wrote a little Python script [0] to wrap `ptxed` and pass the correct `--raw` arguments based upon a map file extracted from `/proc/<pid>/maps`.
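For illustration, what the wrapper does is roughly this (a hedged C sketch of the Python script's logic; the parsing is simplified and the function name is made up, the real script is at [0]):

```
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Print a ptxed --raw argument for each executable, file-backed mapping. */
void print_raw_args(pid_t pid)
{
    char path[64], line[512];
    snprintf(path, sizeof(path), "/proc/%d/maps", (int) pid);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return;

    while (fgets(line, sizeof(line), f)) {
        unsigned long start, end;
        char perms[5], file[256];

        /* e.g. "7f3325cd1000-7f3325e66000 r-xp 00000000 08:01 1234 /lib/..." */
        if (sscanf(line, "%lx-%lx %4s %*s %*s %*s %255s",
                   &start, &end, perms, file) == 4 &&
            strchr(perms, 'x') != NULL && file[0] == '/')
            printf("--raw %s:0x%lx\n", file, start);
    }
    fclose(f);
}
```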
So whereas before I was running something like:
```
ptxed -v --cpu auto --pt trace.data --raw vm:0x562813a11000
```
Now we have:
```
ptxed -v --cpu auto --pt trace.data \
    --raw /home/vext01/research/hardware_tracer/vm:0x562813a11000 \
    --raw /lib/x86_64-linux-gnu/libc-2.24.so:0x7f3325cd1000 \
    --raw /lib/x86_64-linux-gnu/libpthread-2.24.so:0x7f3326070000 \
    --raw /lib/x86_64-linux-gnu/ld-2.24.so:0x7f332628d000
```
And that gives us a trace like this:
```
[enabled]
[exec mode: 64-bit]
...
00007f3325d891a8  pop rbp
00007f3325d891a9  ret
0000562813a1263d  mov dword ptr [rbp-0x10], 0x0
0000562813a12644  jmp 0x562813a126c4
0000562813a126c4  cmp dword ptr [rbp-0x10], 0x3e7
0000562813a126cb  jle 0x562813a12646
0000562813a12646  nop
0000562813a12647  nop
0000562813a12648  nop
0000562813a12649  call 0x562813a11b70
0000562813a11b70  jmp qword ptr [rip+0x2024ba]
00007f3325d89f40  mov eax, 0x27
00007f3325d89f45  syscall
[disabled]
[resumed]
00007f3325d89f47  ret
0000562813a1264e  mov ecx, eax
0000562813a12650  mov eax, dword ptr [rbp-0xc]
0000562813a12653  cdq
0000562813a12654  idiv ecx
0000562813a12656  mov eax, edx
0000562813a12658  test eax, eax
0000562813a1265a  jnz 0x562813a1268d
0000562813a1268d  mov edx, dword ptr [rbp-0xc]
0000562813a12690  mov eax, edx
0000562813a12692  shl eax, 0x2
0000562813a12695  lea ecx, ptr [rax+rdx*1]
0000562813a12698  mov edx, 0x7a44c6b
0000562813a1269d  mov eax, ecx
0000562813a1269f  imul edx
0000562813a126a1  sar edx, 0x1
0000562813a126a3  mov eax, ecx
0000562813a126a5  sar eax, 0x1f
0000562813a126a8  sub edx, eax
0000562813a126aa  mov eax, edx
0000562813a126ac  imul eax, eax, 0x43
0000562813a126af  sub ecx, eax
0000562813a126b1  mov eax, ecx
0000562813a126b3  mov edx, dword ptr [rbp-0x14]
0000562813a126b6  add eax, edx
0000562813a126b8  cdqe
0000562813a126ba  add qword ptr [rbp-0x8], rax
0000562813a126be  nop
0000562813a126bf  nop
0000562813a126c0  add dword ptr [rbp-0x10], 0x1
0000562813a126c4  cmp dword ptr [rbp-0x10], 0x3e7
0000562813a126cb  jle 0x562813a12646
0000562813a12646  nop
0000562813a12647  nop
0000562813a12648  nop
0000562813a12649  call 0x562813a11b70
0000562813a11b70  jmp qword ptr [rip+0x2024ba]
00007f3325d89f40  mov eax, 0x27
...
[disabled]
```
Which looks plausible to me:
- We can see the return from libc: 00007f3325d891a9 -> 0000562813a1263d.
- Then the loop start NOP markers.
- The part where the trace is disabled is probably the call to `getpid()`, as I have kernel tracing disabled.
- Then the loop end NOPs.
Notice how the next iteration of the loop is in the trace (we see the next iteration's `nop, nop, nop`). In fact, we have many, many iterations in the trace (chopped from the listing for brevity).
I suppose this is because there is latency in stopping the tracer via `ioctl()`? Next I'll be looking into ways of stopping the trace sooner.
Hopefully I've understood correctly! Again, many thanks :)
[0] https://github.com/vext01/hardware_tracer/blob/f8013d05bd1920c7fc505f001fb9ccc9bf733208/run_ptxed.py
Ah, the reason we see lots of iterations is the inner loop. Removing it makes the trace much shorter :)