One big annoyance I have with x86 in my work on DynamoRIO is the special FLAGS register. You can't read or write to it like a normal register, and 2/3 of the bits are reserved or always 0 or 1. The only easy way to get flags in or out is to use pushf and popf, but in DynamoRIO we can't use those without switching off the application stack, since it could be doing things like using the x86_64 Linux red zone. So, we try to avoid clobbering FLAGS in inserted code whenever possible. For example:
You wanted to subtract a constant? Instead of "sub $-1, %rax" try "lea -1(%rax), %rax". Works for add, too.
With the more complex base, displacement, index, scale, and offset form of the x86 memory operand, you can get even more fancy, accomplishing small shifts and multiplies and register to register adds using the LEA instruction. You can even get a 3-operand add like so:
lea 0(%rax, %rbx, 1), %rcx
Which adds %rax and %rbx and stores the result in %rcx. Another old trick is multiplication by 5, which works like so:
lea 0(%rax, %rax, 4), %rax
This is powerful enough to perform register to register adds and simple arithmetic with constants, but we can't yet subtract a register from another. Unfortunately, SUB and NEG both touch FLAGS. However, the NOT instruction does not. This means, assuming %rax is dead, we can rewrite "sub %rax, %rbx" like so:
not %rax
lea 1(%rax, %rbx, 1), %rbx
Remembering how two's complement works, negation is equivalent to a complement and add one. Addition is commutative, so we can fold the increment into the displacement field of the LEA.
Finally, you can also use JRCXZ (JECXZ in 32-bit land) to perform a jump if RCX is zero, without looking at flags. Putting it all together, you can implement equality checks without changing flags like so (assuming one of %rdi or %rsi and %rcx are dead or have been saved):
Flags code:
cmp %rdi, %rsi
je .Llabel
No flags code:
not %rsi
lea 1(%rdi, %rsi, 1), %rcx
jrcxz .Llabel
I haven't checked, but I imagine the compare is faster. However, in DynamoRIO, we have to save the flags for the cmp, which requires three instructions to save and to restore. With that factored in, I have a feeling the no flags version may be preferable. I will have to implement the no-flags transformation and see if it performs better.
Unfortunately, the toy client I am developing with has code like this:
lea -1(%rdx), %eax
test %rax, %rdi
jz .Llabel
Which performs an alignment check (assuming rdx is a power of two). I'm not sure how to implement an arithmetic AND (which is what TEST does) without touching flags.
I don't suppose that:
ReplyDeletelahf
seto %al
...
add %al, 0x7F
sahf
would be actually save you any time?
That's what we're already doing to save/restore the flags, except I then save %rax with the flags to memory somewhere, so it comes out to a lot. It also clobbers %rax, which means I first have to save %rax to memory.
ReplyDeleteDo you know the alignment value at insertion time or not until run-time? I was toying with using xlat but I don't think it would be practical. Something like this:
ReplyDeleteunsigned char lookup_table[][256] =
/* 1-byte alignment */ { 1,1,1,1,1,... },
/* 2-byte alignment */ { 1,0,1,0,1,... },
/* 4-byte alignment */ { 1,0,0,0,1,... },
...
};
; %rax = value_to_test
; %rbx = lookup_table[log2(alignment)]
xlatb
movzbl %al, %rcx
jrcxz not_aligned
I was going to say you could just do "movzbl 0(%rbx, %rax, 1), %rcx" but you'd have to find a way to zero out the top bits of rax, which would require masking or shifts, bringing back thr problem.
ReplyDeleteIn my case, I sort do and don't know the alignment. It's an argument passed into function which I'm trying to inline and optimize. In order to "know" it, I have to constant fold it, and then use a pattern for "test $0x07, %rax" to use the lookup table.
I had another idea as I was falling asleep last night. Shift the value left using lea and then check to see if the bottom byte is 0. Pick one of the following blocks as appropriate:
ReplyDeletelea 0(,%rax,8),%rcx
lea 0(,%rcx,2),%rcx
movzbl %cl, %rcx
jrcxz _16_byte_aligned
lea 0(,%rax,8),%rcx
lea 0(,%rcx,4),%rcx
movzbl %cl, %rcx
jrcxz _8_byte_aligned
lea 0(,%rax,8),%rcx
lea 0(,%rcx,8),%rcx
movzbl %cl, %rcx
jrcxz _4_byte_aligned
Checking for 2-byte alignment would require another lea. 32, 64, 128, and 256 are easy if you need them.