I work on DrMemory, a dynamic instrumentation tool for finding memory bugs in native apps similar to memcheck, more widely known as just Valgrind. We have to be able to generate stack traces in a wide variety of interesting situations, and we want to do it quickly and without depending on slow, non-portable, and hard to integrate libraries.
If you remember your early computer systems classes, you'll recall those exam questions about unwinding the stack by chasing the base pointer. The reality is that for years we've been stuck with x86_32, which has a measly 8 general purpose registers, and compilers have been itching to grab that tender little %ebp register. So, they took it, and now the world is much more complicated because walking the stack is a whole lot harder.
To unwind the stack, we need to answer the question: given the stack pointer (SP) and program counter (PC), what is the offset from the SP to the next return address? With that, we can repeat the question until we hit main (or NULL or whatever).
Maintaining a huge table mapping any arbitrary module offset to a stack offset would waste a lot of space, so most systems use some form of state machine encoding that can compute the value on the fly, should you need to walk the stack. (Win64 SEH uses an interesting mechanism involving carefully constructed prologues which I'd like to learn more about.) If stack walking is rare, such as for an "exceptional" exception or when in the debugger, then the state machine (DWARF CFI) works great. But if you need to do it all the time and you need to do it on multiple platforms, it will be slow, you will need platform-specific code, or you will need to depend on some other library. Pulling arbitrary code into a tool that runs in the same as the address space as the application can often break isolation between the tool and the app, so we don't like to do it very often.
However, we're in dynamic binary instrumentation land, so we get to see every instruction as it's executed. The next most obvious solution for unwinding is what we call the "shadow stack" solution. Every time you see a call or ret instruction, you just push or pop the PC on to or off of the current thread's shadow stack. If you need to record lots of stack traces, then this may be your best bet, especially since you can imagine logging these updates and then recording a trace is as easy as logging a timestamp. I believe that this is what ThreadSanitizer does, because they need to produce a stack trace not just for the current memory access, but for the other memory access that races with the current thread.
For a tool like DrMemory, we only need to store stack traces for every heap routine invocation, which is not nearly quite as often. We also don't want to unnecessarily slow down an app that performs many calls in a loop without calling any heap routines or creating any memory bug reports. You also have to worry about a lot of corner cases, such as an app which clobbers the return addr to go somewhere else, or longjmp, or exception handling.
The solution we use is I think quite elegant. First, if there are frame pointers, we use them, because it's fast (like iterating a linked list that's in cache :). While the compiler may get a 10% perf boost from an extra register, we get a bigger boost by not doing a "scan", which is what I am about to describe.
If we don't have frame pointers, we start with the stack pointer, and scan each pointer-aligned cell for a value that looks like a pointer into executable memory. This part is obvious, and you can get Windbg to do the same thing and symbolize it to boot (anyone know how to do this in gdb?).
The problem is that most code rarely initializes its entire stack frame. In any given frame, there is a lot of spill space for temporaries in unexecuted code, or arrays that are not filled. This means that stale return addresses from prior frames are still on the stack. The clever bit is realizing that we're already shadowing every memory write to track definedness information to find uninitialized reads, so we actually *know* which bytes are initialized and which are not. That means we can tell what is a probable return address and what is stale.
Still, what if the app stores a function pointer to the stack? That will look like a code pointer and show up as an extra frame. The next bit of cleverness is where we take the candidate return address and look backwards in the code for likely encodings of a call instruction. This is pretty straightforward, as there are only a handful of encodings for direct and indirect calls, and we can just enumerate the opcodes along with their offsets. If it's a function pointer, this test should fail, unless we're really unlucky.
And with that, we have our (mostly) reliable, (reasonably) performant, and (mostly) portable stack crawler! :) To my knowledge, Memcheck doesn't attempt any of this craziness, and encourages you to recompile with FPs if it gets a weird looking stack.
Edit: Comments on G+ are preferred, since I'll actually respond.