/* Compile with -DPUSH to get pushes/pops
* and without for movs. */
.globl main
main:
push %rbp
mov %rsp, %rbp
mov $500000000, %rax
.Lloop:
#ifdef PUSH
push %rax
push %rbx
/* etc... */
push %r15
/* Mid */
pop %r15
/* etc... */
pop %rbx
pop %rax
#else
lea -120(%rsp), %rsp /* Update rsp. */
mov %rax, 0x70(%rsp)
mov %rbx, 0x68(%rsp)
/* etc... */
mov %r15, 0x00(%rsp)
/* Mid */
mov 0x00(%rsp), %r15
/* etc... */
mov 0x68(%rsp), %rbx
mov 0x70(%rsp), %rax
lea 120(%rsp), %rsp /* Update rsp. */
#endif
dec %rax
test %rax, %rax
jnz .Lloop
leave
ret
So, we have a loop here that saves all registers (I only need to save some for my purposes, this is just a micro-benchmark) in the two ways described. Here are the results for running both on a Core i7 I have access to:
Command:
gcc microbench_push_mov.S -DPUSH -o microbench_push_mov && time ./microbench_push_mov
User time:
trial 1 0m3.400s
trial 2 0m3.404s
trial 3 0m3.404s
Command:
gcc microbench_push_mov.S -o microbench_push_mov && time ./microbench_push_mov
User time:
trial 1 0m3.444s
trial 2 0m3.444s
trial 3 0m3.444s
So, the push/pop version is 1% faster, which depending on how you look at it is unintuitive. At first, one would think that push and pop have to do more work by updating the rsp register for each instruction. Looking at the encoding, I can say that the code size of the moves is much larger, because an offset needs to be encoded for every instruction. "push %rax" is just 50, while "mov %rax, 0x70(%rsp)" is 48 89 44 24 70. I suppose doing the indirect addressing every cycle is also work, but it doesn't need to be managed as a write dependency like the stack pointer updates.
With that in mind, it now makes better sense why most compilers generate pushes for saving caller saved registers on entry, but use stack offset addressing for locals. The encoding for pushes and pops is much smaller, and the execution is marginally faster. Locals are accessed unpredictably, so push and pop aren't really an option.