/* Compile with -DPUSH to get pushes/pops
* and without for movs. */
.globl main
main:
push %rbp
mov %rsp, %rbp
mov $500000000, %rax
.Lloop:
#ifdef PUSH
push %rax
push %rbx
/* etc... */
push %r15
/* Mid */
pop %r15
/* etc... */
pop %rbx
pop %rax
#else
lea -120(%rsp), %rsp /* Update rsp. */
mov %rax, 0x70(%rsp)
mov %rbx, 0x68(%rsp)
/* etc... */
mov %r15, 0x00(%rsp)
/* Mid */
mov 0x00(%rsp), %r15
/* etc... */
mov 0x68(%rsp), %rbx
mov 0x70(%rsp), %rax
lea 120(%rsp), %rsp /* Update rsp. */
#endif
dec %rax
test %rax, %rax
jnz .Lloop
leave
ret
So, we have a loop here that saves all registers (I only need to save some for my purposes, this is just a micro-benchmark) in the two ways described. Here are the results for running both on a Core i7 I have access to:
Command:
gcc microbench_push_mov.S -DPUSH -o microbench_push_mov && time ./microbench_push_mov
User time:
trial 1 0m3.400s
trial 2 0m3.404s
trial 3 0m3.404s
Command:
gcc microbench_push_mov.S -o microbench_push_mov && time ./microbench_push_mov
User time:
trial 1 0m3.444s
trial 2 0m3.444s
trial 3 0m3.444s
So, the push/pop version is 1% faster, which depending on how you look at it is unintuitive. At first, one would think that push and pop have to do more work by updating the rsp register for each instruction. Looking at the encoding, I can say that the code size of the moves is much larger, because an offset needs to be encoded for every instruction. "push %rax" is just 50, while "mov %rax, 0x70(%rsp)" is 48 89 44 24 70. I suppose doing the indirect addressing every cycle is also work, but it doesn't need to be managed as a write dependency like the stack pointer updates.
With that in mind, it now makes better sense why most compilers generate pushes for saving caller saved registers on entry, but use stack offset addressing for locals. The encoding for pushes and pops is much smaller, and the execution is marginally faster. Locals are accessed unpredictably, so push and pop aren't really an option.
1. I am against deriving performance of actual programs from micro-benchmarks.
ReplyDelete2. I am against people who are forming general conclusions from simplistic benchmarks.
3. Your ideas about what "most compilers are doing" seem to be inadequate.
4. Consider the following (somewhat counter-intuitive) results obtained on my notebook Intel CPU:
/* Compile with -DPUSH to get pushes/pops
* and without for movs. */
.globl main
main:
push %ebp
mov %esp, %ebp
mov $500000000, %eax
.Lloop:
#ifdef PUSH
push %eax
push %ebx
push %edi
pop %edi
pop %ebx
pop %eax
#else
sub $120, %esp
mov %ebp, 0x08(%esp)
mov %esp, %ebp
mov %eax, 0x70(%esp)
mov %ebx, 0x68(%esp)
mov %edi, 0x00(%esp)
mov 0x70(%esp), %eax
mov 0x68(%esp), %ebx
mov 0x00(%esp), %edi
mov 0x08(%esp), %ebp
add $120, %esp
#endif
dec %eax
test %eax, %eax
jnz .Lloop
leave
ret
Truncated output from "perf stat":
perf stat -- ./microbench_push
Performance counter stats for './microbench_push':
1893.395538 task-clock-msecs
3135617722 cycles
4501740707 instructions (1.436 IPC)
1.902329985 seconds time elapsed
perf stat -- ./microbench_mov
Performance counter stats for './microbench_mov':
1832.027358 task-clock-msecs
3031630961 cycles
7005248292 instructions (2.311 IPC)
1.844742021 seconds time elapsed
32-bit x86 is probably much different from 64-bit x86. Older x86 processors used to do things like trying to promote stack slots to internal registers to deal with the fact that there were so few registers. It's possible modern 64-bit CPUs running in 32-bit mode do the same. Either way, our hardware is different.
ReplyDeleteMy assumption going into this was that the mov's would be faster, but I was proven wrong on my particular hardware.
As for what most compilers do, I was surprised to see a mix of both styles of prologue in /usr/bin/python. LLVM always pushes and pops in the prologue, which is what I've disassembled most. So it's kind of a wash.