Thursday, March 17, 2011

x86 register save performance

Most people who deal with machine code already know that out-of-order CISC chips like most x86 variants are extremely complicated, and optimizing for them has always been a bit of a black art. In my work on DynamoRIO, I'm trying to optimize instrumentation calls out to straight C functions. If the target is a leaf function, we can try to inline it; for now, I'm just switching stacks and saving the used registers inline.

There are two ways to save those registers. Most C compilers, when they have to spill locals to the stack, subtract the size of the frame from the stack pointer on entry and address the locals relative to the SP, so for register saves I could do the same. Alternatively, I could use the built-in push and pop opcodes. To decide between the two, I wrote a microbenchmark, and the push and pop opcodes won.
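To make the setting concrete, here's a rough sketch of the kind of inlined call sequence I mean. Everything here is hypothetical: SAVED_RSP, DSTACK, and instrumentation_fn are made-up names, and details like stack alignment are omitted.

mov %rsp, SAVED_RSP(%rip) /* Stash the application stack pointer. */
mov DSTACK(%rip), %rsp    /* Switch to a scratch stack. */
push %rax                 /* Save the registers the call clobbers. */
push %rcx
/* etc... */
call instrumentation_fn
/* etc... */
pop %rcx                  /* Restore in reverse order. */
pop %rax
mov SAVED_RSP(%rip), %rsp /* Switch back to the application stack. */

The question is just how those saves and restores should be spelled. Let's look at the assembly for the microbenchmark: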

/* Compile with -DPUSH to get pushes/pops
* and without for movs. */
.globl main
main:
push %rbp
mov %rsp, %rbp
mov $500000000, %rax /* Loop counter; the body saves and restores it. */
.Lloop:
#ifdef PUSH
push %rax
push %rbx
/* etc... */
push %r15
/* Mid */
pop %r15
/* etc... */
pop %rbx
pop %rax
#else
lea -120(%rsp), %rsp /* Open the frame: 15 regs * 8 bytes. */
mov %rax, 0x70(%rsp)
mov %rbx, 0x68(%rsp)
/* etc... */
mov %r15, 0x00(%rsp)
/* Mid */
mov 0x00(%rsp), %r15
/* etc... */
mov 0x68(%rsp), %rbx
mov 0x70(%rsp), %rax
lea 120(%rsp), %rsp /* Close the frame. */
#endif
dec %rax
test %rax, %rax
jnz .Lloop /* Loop until the counter hits zero. */
leave
ret

So, we have a loop that saves and restores all of the general-purpose registers in the two ways described. (For my purposes I only need to save some of them, but this is just a micro-benchmark.) Here are the results of running both versions on a Core i7 I have access to:

Command:
gcc microbench_push_mov.S -DPUSH -o microbench_push_mov && time ./microbench_push_mov

User time:
trial 1 0m3.400s
trial 2 0m3.404s
trial 3 0m3.404s

Command:
gcc microbench_push_mov.S -o microbench_push_mov && time ./microbench_push_mov

User time:
trial 1 0m3.444s
trial 2 0m3.444s
trial 3 0m3.444s

So, the push/pop version is about 1% faster (roughly 3.40s vs. 3.44s), which at first glance is unintuitive: push and pop look like they should do more work, since each one also has to update rsp. Looking at the encoding, though, the code size of the movs is much larger, because an offset has to be encoded into every instruction: "push %rax" is just 50, while "mov %rax, 0x70(%rsp)" is 48 89 44 24 70. The rsp-relative addressing in each mov is also work, though unlike the implicit stack pointer updates it doesn't have to be managed as a write dependency on rsp. My understanding is that recent Intel chips resolve those rsp updates with a dedicated stack engine in the front end, which would explain why the serial dependency doesn't hurt the push/pop version.
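If you want to double-check the encodings, it's easy to assemble the two instructions and disassemble the result. This is just one way to do it; the file name is arbitrary:

$ cat > encodings.s <<EOF
push %rax
mov %rax, 0x70(%rsp)
EOF
$ gcc -c encodings.s && objdump -d encodings.o
...
   0:   50                      push   %rax
   1:   48 89 44 24 70          mov    %rax,0x70(%rsp)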

With that in mind, it now makes better sense why most compilers generate pushes for saving callee-saved registers on entry, but use stack-offset addressing for locals: the encoding for pushes and pops is much smaller, and the execution is marginally faster. Locals, on the other hand, are accessed in an unpredictable order, so push and pop aren't really an option for them.
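You can see the first behavior in a toy function. helper here is just a stand-in for some external call, and the assembly is roughly what gcc -O2 produces on x86-64; the exact output will vary by compiler and flags:

int helper(int);

int f(int x) {
    return x + helper(x); /* x must survive the call */
}

f:
    push %rbx       /* one-byte save of a callee-saved register */
    mov  %edi, %ebx /* keep x in %rbx across the call */
    call helper
    add  %ebx, %eax /* x + helper(x) */
    pop  %rbx
    ret

A local that actually lives on the stack, by contrast, gets rsp-relative movs like the ones in the benchmark, since there's no useful order in which to push and pop it.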