Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Processor trace -- Missing the head of a trace

Edd_Barrett
Beginner
1,063 Views

Hi,

My first post here :)

I'm an academic researcher experimenting with PT with the hope of using it inside a tracing JIT.

For now, I have a simple C program which (pretends to be an interpreter and) has a loop. The goal is to try and collect one iteration of this loop. The important part looks like this:

    for (i = 0; i < LARGE; i++) {
        /* "JIT Merge Point" */
        if (i == HOT_THRESHOLD) {
            VDEBUG("Hot loop start");
            trace_on(&tr_ctx);
            sleep(1);
        }

        for (int j = 0; j < 1000; j++) {
            /* Start Marker*/
            asm volatile(
                "nop\n\t"
                "nop\n\t"
                "nop\n\t"
                :::);

            if (i % getpid() == 0) {
                sum += i * p % 33;
            } else {
                sum += i * 5 % 67 + p;
            }

            /* Stop Marker*/
            asm volatile(
                "nop\n\t"
                "nop\n\t"
                :::);
        }

        if (i == HOT_THRESHOLD) {
            VDEBUG("Hot loop end");
            if (ioctl(tr_ctx.perf_fd, PERF_EVENT_IOC_DISABLE, 0) < 0)
                err(EXIT_FAILURE, "ioctl to stop tracer");

            if (close(tr_ctx.loop_done_fds[1]) == -1) {
                err(EXIT_FAILURE, "close");
            }
            TDEBUG("Turned off tracer");
        }
    }
  • `trace_on()` turns on PT and runs the poll loop in a thread.
  • I'm using `perf_event_open()` to communicate to the kernel PT functionality..
  • the `nop`s are so I can easily see the loop code in a trace

If we disassemble the loop, it looks like this:

    1646:       90                      nop
    1647:       90                      nop
    1648:       90                      nop
    1649:       e8 22 f5 ff ff          callq  b70 <getpid@plt>
    164e:       89 c1                   mov    %eax,%ecx
    1650:       8b 45 f4                mov    -0xc(%rbp),%eax
    ... etc
    16be:       90                      nop
    16bf:       90                      nop

If I run the program and disassemble the trace using `ptxed` from the processor-trace repo [0] then we get:

[enabled]
[exec mode: 64-bit]
[28, 7f332607f536: error: no memory mapped at this address]
[enabled]
[exec mode: 64-bit]
[987, 7f3325d89f47: error: no memory mapped at this address]
0000562813a12656  mov eax, edx
0000562813a12658  test eax, eax
0000562813a1265a  jnz 0x562813a1268d
0000562813a1268d  mov edx, dword ptr [rbp-0xc]
0000562813a12690  mov eax, edx
0000562813a12692  shl eax, 0x2
0000562813a12695  lea ecx, ptr [rax+rdx*1]
0000562813a12698  mov edx, 0x7a44c6b
0000562813a1269d  mov eax, ecx
0000562813a1269f  imul edx
0000562813a126a1  sar edx, 0x1
0000562813a126a3  mov eax, ecx
0000562813a126a5  sar eax, 0x1f
0000562813a126a8  sub edx, eax
0000562813a126aa  mov eax, edx
0000562813a126ac  imul eax, eax, 0x43
0000562813a126af  sub ecx, eax
0000562813a126b1  mov eax, ecx
0000562813a126b3  mov edx, dword ptr [rbp-0x14]
0000562813a126b6  add eax, edx
0000562813a126b8  cdqe
0000562813a126ba  add qword ptr [rbp-0x8], rax
0000562813a126be  nop
0000562813a126bf  nop
0000562813a126c0  add dword ptr [rbp-0x10], 0x1
0000562813a126c4  cmp dword ptr [rbp-0x10], 0x3e7
0000562813a126cb  jle 0x562813a12646
0000562813a12646  nop
0000562813a12647  nop
0000562813a12648  nop
0000562813a12649  call 0x562813a11b70
...

 

What baffles me about this, is that we see `nop, nop` (the end loop marker) before we see `nop, nop, nop` (the start loop marker). This suggests that the head of the loop has been lost. This might even be the 100th iteration of the inner loop we are looking at here. In any case, we should see `nop, nop, nop`, some other code, *then* `nop, nop`.

My initial thought was that perhaps there is a latency between the `ioctl()` to turn on the trace and tracing actually starting, but you can put a 10 second sleep after `trace_on()` and still get the same result.

So, does anyone know why the top of the loop is missing? Has the compiler moved my asm blocks around? I was under the impression that it couldn't, but maybe I'm wrong... FWIW, the program is built `-O0`.

I also notice that if I take away the inner loop (leaving the computation inside though), then I don't see any of the code from the outer loop at all, which supports my theory that the first few instructions of the trace are missing. Here I suppose the computations are smaller than the number of instructions that get lost.

The full code for my program is here [1].

Thanks!

[0] https://github.com/01org/processor-trace

[1] https://github.com/vext01/hardware_tracer/blob/4dece2203d6585f97003d6ff8c709f58ca5e1e99/vm.c

0 Kudos
4 Replies
Andreas_K_Intel
Employee
1,063 Views
You have a decode error at the beginning. The decoder will jump to the next synchronization packet, which leads to the missing instructions. You have to make sure the decoder sees all the code executed.
0 Kudos
Markus_M_Intel
Employee
1,063 Views

Hello Edd,

In addition to Andi's point about the decode error you might also want to check your perf_event configuration.  When the aux buffer is mapped read-only it will overwrite older trace when it wraps around.

regards,

markus.

0 Kudos
Edd_Barrett
Beginner
1,063 Views

Hi Andi, Markus,

Thanks for your swift response. Looks like Andi hit the nail on the head! I'm going to reply in a quite verbose fashion in the hope that this helps others (and my future self) who have landed here with the same problem :)

As I understand it, the trace I showed before was running through code for which `ptxed` didn't have an on-disk binary for. As it happens, this code is a return from `libc`, probably `sem_wait()` unlocking the thread after the tracer has initialised and signalled the other thread to carry on. Without giving `ptxed` knowledge of that code, it is unable to know where we are in the binary until the next sync point (presumably a trace packet containing an uncompressed absolute address?)

So what we need to to is tell `ptxed` to load the libraries at the same offsets as in the relocated memory image (bearing in mind the offset will change between runs due to ASLR). I wrote a little Python script [0] to wrap `ptxed` and pass the correct `--raw` arguments based upon a map file extracted from `/proc/<pid>/maps`.

So whereas before I was running something like:

ptxed -v --cpu auto --pt trace.data --raw vm:0x562813a11000

Now we have:

ptxed -v --cpu auto --pt trace.data \
    --raw /home/vext01/research/hardware_tracer/vm:0x562813a11000 \
    --raw /lib/x86_64-linux-gnu/libc-2.24.so:0x7f3325cd1000 \
    --raw /lib/x86_64-linux-gnu/libpthread-2.24.so:0x7f3326070000 \
    --raw /lib/x86_64-linux-gnu/ld-2.24.so:0x7f332628d000

And that gives us a trace like this:

[enabled]
[exec mode: 64-bit]
...
00007f3325d891a8  pop rbp
00007f3325d891a9  ret 
0000562813a1263d  mov dword ptr [rbp-0x10], 0x0
0000562813a12644  jmp 0x562813a126c4
0000562813a126c4  cmp dword ptr [rbp-0x10], 0x3e7
0000562813a126cb  jle 0x562813a12646
0000562813a12646  nop 
0000562813a12647  nop 
0000562813a12648  nop 
0000562813a12649  call 0x562813a11b70
0000562813a11b70  jmp qword ptr [rip+0x2024ba]
00007f3325d89f40  mov eax, 0x27
00007f3325d89f45  syscall 
[disabled]
[resumed]
00007f3325d89f47  ret 
0000562813a1264e  mov ecx, eax
0000562813a12650  mov eax, dword ptr [rbp-0xc]
0000562813a12653  cdq 
0000562813a12654  idiv ecx
0000562813a12656  mov eax, edx
0000562813a12658  test eax, eax
0000562813a1265a  jnz 0x562813a1268d
0000562813a1268d  mov edx, dword ptr [rbp-0xc]
0000562813a12690  mov eax, edx
0000562813a12692  shl eax, 0x2
0000562813a12695  lea ecx, ptr [rax+rdx*1]
0000562813a12698  mov edx, 0x7a44c6b
0000562813a1269d  mov eax, ecx
0000562813a1269f  imul edx
0000562813a126a1  sar edx, 0x1
0000562813a126a3  mov eax, ecx
0000562813a126a5  sar eax, 0x1f
0000562813a126a8  sub edx, eax
0000562813a126aa  mov eax, edx
0000562813a126ac  imul eax, eax, 0x43
0000562813a126af  sub ecx, eax
0000562813a126b1  mov eax, ecx
0000562813a126b3  mov edx, dword ptr [rbp-0x14]
0000562813a126b6  add eax, edx
0000562813a126b8  cdqe 
0000562813a126ba  add qword ptr [rbp-0x8], rax
0000562813a126be  nop 
0000562813a126bf  nop 
0000562813a126c0  add dword ptr [rbp-0x10], 0x1
0000562813a126c4  cmp dword ptr [rbp-0x10], 0x3e7
0000562813a126cb  jle 0x562813a12646
0000562813a12646  nop 
0000562813a12647  nop 
0000562813a12648  nop 
0000562813a12649  call 0x562813a11b70
0000562813a11b70  jmp qword ptr [rip+0x2024ba]
00007f3325d89f40  mov eax, 0x27
...
[disabled]

Which looks plausible to me:

  • We can see the return from libc: 00007f3325d891a9 ->0000562813a1263d
  • Then the loop start NOP markers.
  • The part where the trace is disabled is probably the call to `getpid()` as I have kernel tracing disabled.
  • Then the loop end NOPs

Notice how the next iteration of the loop is in the trace (we see the next loop's `nop, nop, nop`). In fact we have many, many iterations in the trace (chopped from the listing for brevity).

I suppose this is because there is latency in stopping the tracer via `ioctl()`? Next I'll be looking into ways of stopping the trace sooner.

Hopefully I've understood correctly! Again, many thanks :)

[0] https://github.com/vext01/hardware_tracer/blob/f8013d05bd1920c7fc505f001fb9ccc9bf733208/run_ptxed.py

0 Kudos
Edd_Barrett
Beginner
1,063 Views

Ah, the reason we see lots of iterations is because of the inner loop. Killing this makes the trace much shorter :)

0 Kudos
Reply