Thursday, November 8, 2012

Funny performance characteristics of DBI tools

These days I'm working on dynamic binary instrumentation (DBI) tools built with DynamoRIO, in particular Dr. Memory.  One of the things people always ask is, "So what's the slowdown?  Are you faster than Valgrind?"  The answer is incredibly complicated, as performance questions usually are.  The easy answer is, "On SPEC?  10x slowdown, and yes, we are faster than Valgrind."  Go to the drmemory.org front page and you can see our pretty graph of SPEC slowdowns and how we are twice as fast as Valgrind.

OK, great!  Unfortunately, it turns out that most apps aren't at all like SPEC.

My team's goal is to find bugs in Chrome, so we want to run Chrome and its tests, not SPEC.  So what's different about Chrome?  Many things, but the biggest difference in one word is: V8.  V8 is the JavaScript engine that gives Chrome much of its performance edge, and it loves to optimize, deoptimize, and generally modify its generated code.  This creates a problem for DBI systems like DynamoRIO and Valgrind because they actually execute instrumented code out of a code cache, and not from the original application PC.  DBI systems need to maintain code cache consistency.

Valgrind doesn't actually try to solve this problem.  It requires the application to annotate all of its code modifications before re-executing the modified code.  Search for "VALGRIND_DISCARD_TRANSLATIONS" for more information on how this works.
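Concretely, a JIT cooperating with Valgrind wraps its code writes like the sketch below.  `VALGRIND_DISCARD_TRANSLATIONS` is the real client-request macro from `valgrind/valgrind.h`; the `HAVE_VALGRIND` guard and the `jit_flush` wrapper are my own scaffolding so the snippet still builds where the header isn't installed.

```c
#include <stddef.h>

#ifdef HAVE_VALGRIND
#  include <valgrind/valgrind.h>
#else
/* Fallback stub so the example compiles without Valgrind's headers. */
#  define VALGRIND_DISCARD_TRANSLATIONS(addr, len) ((void)(addr), (void)(len))
#endif

/* After the JIT rewrites code in [buf, buf+len), tell Valgrind to throw away
 * any cached translations of that range before re-executing it.  The macro
 * is a no-op when the process isn't running under Valgrind. */
void jit_flush(void *buf, size_t len) {
    VALGRIND_DISCARD_TRANSLATIONS(buf, len);
}
```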

DynamoRIO was originally a security product designed to run on unmodified Windows applications, so this approach was a non-starter.  Instead, DR uses page protections and code sandboxing to detect modification.  Sandboxing means we insert extra code before every instruction to check that the code we're about to execute is unmodified.  With page protections, we mark pages that are readable, writable, and executable as read-only.  When the app writes to its own code, we catch the page fault, flush the stale code from the cache, and mark the page writable again.

In theory, with those two techniques we are able to provide perfect application transparency.  However, they come at a very high performance cost.  I'm currently running a V8 test case that takes 0.1 seconds to execute natively.  The version running under DynamoRIO has been running for 50 minutes while I've been writing this blog post, and based on its output it's actually making progress.  That's roughly a 30,000x slowdown!

Generally speaking, our slowdown isn't this bad.  This particular test case is stress testing the optimizer.  But it demonstrates how hard it is to answer the question of performance.

Still, there's a lot of room for improvement here.  In particular, we are considering integrating parts of the Valgrind approach where we get help from the app to maintain code cache consistency, but I don't want to give up on the dream of a perfectly transparent DBI system yet.  Our rough idea for how to do this is to have the app tell us which areas of virtual memory it will maintain consistency for, and for the rest of the memory, we'll use our normal cache consistency protocol.  This naturally handles two JITs in the same process, one which is cooperating with us, and one which isn't.
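As a thought experiment, the hybrid protocol above might look something like this from the DBI side: the cooperating JIT registers the regions whose consistency it promises to maintain (and annotates its own flushes), and every other address stays under the normal page-protection and sandboxing scheme.  Every name here is invented for illustration; nothing like this API exists in DR today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch: regions a cooperating JIT has claimed responsibility
 * for.  A real implementation would need locking and dynamic storage. */
typedef struct { const char *base; size_t len; } region_t;

static region_t g_self_managed[16];
static size_t g_nregions;

/* Called by the cooperating JIT (e.g. V8) at startup for its code heap. */
void dbi_register_self_managed(const void *base, size_t len) {
    assert(g_nregions < 16);
    g_self_managed[g_nregions].base = (const char *)base;
    g_self_managed[g_nregions].len = len;
    g_nregions++;
}

/* Consulted when code is written: self-managed regions skip the expensive
 * flush-on-fault path and wait for the JIT's explicit flush annotation;
 * everything else falls back to page protections and sandboxing. */
bool dbi_uses_default_consistency(const void *addr) {
    const char *p = (const char *)addr;
    for (size_t i = 0; i < g_nregions; i++) {
        if (p >= g_self_managed[i].base &&
            p < g_self_managed[i].base + g_self_managed[i].len)
            return false;  /* cooperating JIT owns this range */
    }
    return true;           /* non-cooperating code: DR handles it */
}
```

This split is what makes two JITs in one process workable: the cooperating one registers its region, and the other one never does, so it transparently gets the default protocol.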

Hopefully I'll write another blog post when we get this stuff implemented.

2 comments:

  1. Note that the Valgrind option
    --smc-check=none|stack|all|all-non-file
    allows Valgrind to provide application transparency.
    (The default value is "stack";
    "all" or "all-non-file" should allow running Chrome
    without annotations.)

  2. another blog post...? did you write that? - LYK
