Thursday, March 17, 2011

x86 register save performance

Most people who deal with machine code already know that out-of-order CISC chips like most x86 variants are extremely complicated, and optimizing for them has always been a bit of a black art. In my work on DynamoRIO, I'm trying to optimize instrumentation calls to straight C functions. If it's a leaf function, we can try to inline it. For now, I'm just switching stacks and saving the used registers inline. There are two ways you can do this. Most C compilers, when they have to spill locals to the stack, subtract the size of the frame from the stack pointer on entry and address the locals relative to the SP. For register saves, I could do the same. Alternatively, I could use the built-in push and pop opcodes. To decide, I wrote a microbenchmark, and push/pop won. Let's look at the assembly:

/* Compile with -DPUSH to get pushes/pops
* and without for movs. */
.globl main
main:
push %rbp
mov %rsp, %rbp
mov $500000000, %rax
.Lloop:
#ifdef PUSH
push %rax
push %rbx
/* etc... */
push %r15
/* Mid */
pop %r15
/* etc... */
pop %rbx
pop %rax
#else
lea -120(%rsp), %rsp /* Update rsp. */
mov %rax, 0x70(%rsp)
mov %rbx, 0x68(%rsp)
/* etc... */
mov %r15, 0x00(%rsp)
/* Mid */
mov 0x00(%rsp), %r15
/* etc... */
mov 0x68(%rsp), %rbx
mov 0x70(%rsp), %rax
lea 120(%rsp), %rsp /* Update rsp. */
#endif
dec %rax
test %rax, %rax
jnz .Lloop
leave
ret

So, we have a loop here that saves and restores all the general-purpose registers in the two ways described (for my purposes I only need to save some; this is just a micro-benchmark). Here are the results for running both on a Core i7 I have access to:

Command:
gcc microbench_push_mov.S -DPUSH -o microbench_push_mov && time ./microbench_push_mov

User time:
trial 1 0m3.400s
trial 2 0m3.404s
trial 3 0m3.404s

Command:
gcc microbench_push_mov.S -o microbench_push_mov && time ./microbench_push_mov

User time:
trial 1 0m3.444s
trial 2 0m3.444s
trial 3 0m3.444s

So, the push/pop version is about 1% faster, which at first glance is counterintuitive: one would think that push and pop have to do more work, since each one also updates the rsp register. Looking at the encoding, though, the code size of the movs is much larger, because an offset has to be encoded in every instruction: "push %rax" is just 50, while "mov %rax, 0x70(%rsp)" is 48 89 44 24 70. I suppose doing the indirect addressing every cycle is also work, but it doesn't need to be managed as a write dependency the way the stack pointer updates do. On top of that, modern Intel chips (since the Pentium M) have a "stack engine" that resolves the implicit rsp adjustments in the front end, so a run of pushes doesn't actually serialize on the stack pointer.
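To make the size difference concrete, here is how the two sequences encode (bytes per my reading of the ModRM/SIB rules; objdump -d on the binary should show the same):

push %rax                50
push %rbx                53
push %r15                41 57
mov %rax, 0x70(%rsp)     48 89 44 24 70
mov %rbx, 0x68(%rsp)     48 89 5c 24 68
mov %r15, 0x00(%rsp)     4c 89 3c 24

A push is one byte (two with a REX prefix for r8-r15), while every mov needs a REX.W prefix, a ModRM byte, a SIB byte for the rsp base, and usually a displacement byte, so the mov loop body comes out at well over three times the size.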

With that in mind, it now makes better sense why most compilers generate pushes to save callee-saved registers on entry, but use stack-offset addressing for locals: the encoding for pushes and pops is much smaller, and the execution is marginally faster. Locals, though, are accessed in an unpredictable order, so push and pop aren't really an option for them; a sketch of the mixed pattern follows.
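As a sketch of that mixed pattern (foo, bar, baz, and fill are hypothetical; the exact output depends on compiler and flags, but gcc -O2 emits code along these lines):

/* int foo(int x) { return bar(x) + x; } */
foo:
push %rbx          /* callee-saved register: a one-byte push */
mov %edi, %ebx     /* keep x live across the call */
call bar
add %ebx, %eax     /* bar(x) + x */
pop %rbx
ret

/* int baz(void) { int buf[4]; fill(buf); return buf[0]; } */
baz:
sub $24, %rsp      /* one frame adjustment covers all the locals */
mov %rsp, %rdi     /* pass &buf[0] */
call fill
mov (%rsp), %eax   /* buf[0], addressed relative to rsp */
add $24, %rsp
ret

The callee-saved register gets a push, while buf, whose slots are read back in no particular stack order, gets SP-relative addressing.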

2 comments:

  1. 1. I am against deriving the performance of actual programs from micro-benchmarks.

    2. I am against people forming general conclusions from simplistic benchmarks.

    3. Your ideas about what "most compilers are doing" seem to be inadequate.

    4. Consider the following (somewhat counter-intuitive) results obtained on my notebook's Intel CPU:

    /* Compile with -DPUSH to get pushes/pops
    * and without for movs. */
    .globl main
    main:
    push %ebp
    mov %esp, %ebp
    mov $500000000, %eax
    .Lloop:
    #ifdef PUSH
    push %eax
    push %ebx
    push %edi
    pop %edi
    pop %ebx
    pop %eax
    #else
    sub $120, %esp
    mov %ebp, 0x08(%esp)
    mov %esp, %ebp
    mov %eax, 0x70(%esp)
    mov %ebx, 0x68(%esp)
    mov %edi, 0x00(%esp)
    mov 0x70(%esp), %eax
    mov 0x68(%esp), %ebx
    mov 0x00(%esp), %edi
    mov 0x08(%esp), %ebp
    add $120, %esp
    #endif
    dec %eax
    test %eax, %eax
    jnz .Lloop
    leave
    ret


    Truncated output from "perf stat":

    perf stat -- ./microbench_push
    Performance counter stats for './microbench_push':
    1893.395538 task-clock-msecs
    3135617722 cycles
    4501740707 instructions (1.436 IPC)
    1.902329985 seconds time elapsed

    perf stat -- ./microbench_mov
    Performance counter stats for './microbench_mov':
    1832.027358 task-clock-msecs
    3031630961 cycles
    7005248292 instructions (2.311 IPC)
    1.844742021 seconds time elapsed

  2. 32-bit x86 is probably much different from 64-bit x86. Older x86 processors used to do things like trying to promote stack slots to internal registers to deal with the fact that there were so few registers. It's possible modern 64-bit CPUs running in 32-bit mode do the same. Either way, our hardware is different.

    My assumption going into this was that the movs would be faster, but I was proven wrong on my particular hardware.

    As for what most compilers do, I was surprised to see a mix of both styles of prologue in /usr/bin/python. LLVM always pushes and pops in the prologue, which is what I've disassembled most. So it's kind of a wash.
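    For the curious, a quick way to check a given binary is to disassemble it and look at the first few instructions after each symbol, e.g. (hypothetical invocation; any binary will do):

    objdump -d /usr/bin/python | grep -A4 '>:$' | less

    That shows enough of each prologue to tell whether it pushes or adjusts rsp.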
