Monday, March 12, 2018

UB will delete your null checks

Last week, I committed r326965 to Clang. Some buildbots showed that one of our tests, diagnose_if.c, was crashing reliably after my change. I couldn't reproduce the problem locally on Windows, Linux, or with ASan, all with assertions enabled. At that point, I stopped trying to fix forward and reverted my change and resolved to investigate it later. I was expecting failures, but I expected them to be loud and to take the form of a null dereference. But, as any compiler writer should know, dereferencing null is UB, so you shouldn't assume that if your program doesn't crash that it doesn't dereference null. That's what happened in this case.

I'll show how this happened with the following example program that models what I was doing in Clang:

#include "llvm/ADT/SmallVector.h"
struct Foo {
  llvm::SmallVector Vals;
struct Bar {
  Foo *getLastFoo() { return Foos.empty() ? nullptr : Foos.back(); }
  llvm::SmallVector Foos;
  void *Pad;
  void doUB();
void __attribute__((noinline)) Bar::doUB() {
  if (getLastFoo()->Vals.empty())
    puts("Vals empty");
    puts("Vals non-empty");
int main() {
  Bar b;

In this example, Bar contains a stack of Foos. Sometimes the Foo stack is empty, so sometimes getLastFoo returns null. The caller is supposed to check for that case, but the bug is that doUB does not check for null.

The implementation of SmallVector::back checks that the vector must be non-empty, so you would think that enabling assertions would protect us from this bug, but it does not. Because of the way LLVM's inliner works, empty() and back() are inlined into getLastFoo first. Some optimization (probably GVN) notices that the empty() check in the inlined code for back() is redundant with the empty check in the ternary, so it folds it to true. Now we have lost our assertion, but we still have the empty check.

Next we inline getLastFoo() into doUB(). At this point, SimplifyCFG (IMO a strange pass to do this) realizes that we always dereference the result of getLastFoo(). It simplifies the conditional branch to an unconditional branch that always goes down the "non-empty" path and removes the phi node with nullptr.

OK, so we've gotten rid of the null check, and the assertion is gone. But what about ASan? Why can't it catch this bug? First, it doesn't do any instrumentation to catch null dereferences because they are usually trivial bugs. Adding more instrumentation would be expensive, and we can usually rely on the processor to catch these bugs for us. At least, we can when the compiler isn't actively making it harder to find bugs.

Second, ASan instruments the code after optimization, so by the time ASan shows up, it's harder to catch the bug.

Finally, because of the way llvm::SmallVector works, back() ends up loading valid memory. The vector end pointer happens to point to the byte after the capacity pointer. Because we are storing pointers in this vector, we can successfully load the capacity pointer and return it. Then, because Foo is a relatively small object, the Vals.empty() check only ends up loading from valid memory addresses in Bar. We load the garbage, compare it, and do an arbitrary print. Hooray, no ASan bug. :)

Hypothetically MSan could find the bug as it is described here, but in the general case it's not a reliable tool for finding this type of UB. UBSan will do the job because it inserts UB checks prior to optimization. In particular, the unconditional dereference that SimplifyCFG saw before will become conditional with UBSan instrumentation, and the optimization will not fire.

Anyway, that's my writeup. After debugging this and writing it up, I now feel like there is more of a use case for -fno-delete-null-checks. My use case for it is to augment the Relase+Asserts developer build configuration that I prefer. I want something that builds fast, runs fast enough, and finds most bugs. I'm still skeptical of the Linux kernel use case, though. If there's a null dereference in your program, it has bugs, and we should build good bug detection tools so we can find and fix these efficiently rather than papering them over.


  1. "I'm still skeptical of the Linux kernel use case, though. If there's a null dereference in your program, it has bugs". But if you have null checks and the compiler is not permitted to "optimize" them away, you don't have a bug.

  2. Yay, a blog post! :-)

    I think Blogspot de-templatized your code though (things in angle brackets are gone).