I just had an interesting discussion with Qin Zhao about how the various instrumentation frameworks out there achieve early injection. I haven't actually cross-checked this all at all, so it may not be 100% accurate, but I thought it was interesting enough to write down.
Early injection is the ability of a binary instrumentation framework to interpret an application from the very first instruction. On Linux, applications do not start life in userspace at main. They start life at _start, and that usually hops into the dynamic loader to bring in the shared libraries such as libc. Usually this code is very similar for every application, and you don't need to instrument it to find uses of uninitialized memory in your application. However, its possible you have an app that doesn't use the dynamic loader, or there really are bugs there you want to see. A related problem is how to load your own dynamic libraries without conflicting with the loader being used by the application. Typically the DBI framework will load two copies of libc: one for the application, and one for the framework and its libraries. DynamoRIO, Valgrind, and Pin all have different ways of loading and injecting with different tradeoffs.
DynamoRIO is the easiest, because we don't support early injection on Linux. (Windows is a different story which I can't speak to.) We simply set LD_LIBRARY_PATH and LD_PRELOAD in a shell script that wraps the application invocation. This way, the loader will pull in one of our shared libraries, and call _init, at which point we take control. Instead of returning back to the loader, we start interpreting from where we would have returned. For DynamoRIO's dynamic libraries, the most important of which is the actual tool that you want to run, ie your race detector or memory debugger, we have our own private loader which keeps its own libraries on the side and does all the memory mapping and relocations itself. This is a pain, since it practically reimplements all of the system loader, but we do have the ability to place restrictions on the kinds of libraries we load to make things easier.
Valgrind is the next easiest, since it's basically the inverse of the above strategy. The command "valgrind" is a proper binary, so the system loader kicks in and loads all of Valgrind's libraries for it. It can use dlopen to bring in the tools it wants. To load libraries for the application, Valgrind has a private loader similar to our own, except that it loads the application's libraries instead of Valgrind's.
Finally, Pin has the most interesting strategy. Pin does not have it's own loader. Pin is also a proper binary. They start by loading all the libraries they ever want, including the tool you want to run, using the system loader (dlopen, etc). Then they fork a child process. The *parent*, not the child, sets up a call to exec to match the invocation of the application. Somehow, probably using ptrace or some other API, the child is able to pause the parent application before the first instruction and attach. The child then copies its memory map into the parent. If there are conflicts, it errors out. The child then redirects execution to Pin's entry point, which eventually starts interpreting the application from _start.
Pin's approach allows early injection without the maintenance burden of a private loader. However, it can cause conflicts with the application binary and requires heuristics to set preferred addresses which won't conflict. It also slows down startup, due to the extra forking and memory mapping and copying. Personally, I think this is pretty cool, even if it has some transparency issues.