Juggling Knives or Adding Definedness to Combat Undefined Behavior

November 5, 2020
Re-Edit: September 4, 2021

Undefined behavior is notorious for enabling code-breaking optimizations. While that is a discussion of its own, in this article we'll explore the flip side of the coin, a lesser-known artifact of undefined behavior: it also hinders optimizations. How is that so, and how do compiler writers try to circumvent this problem?

Note: This article was heavily inspired by this great paper: Taming Undefined Behavior in LLVM.



What is Undefined Behavior?

In this article I will avoid explaining undefined behavior itself and give only a short example of how it enables optimizations. I feel that other people have already done a great job on this topic. If you want to know more about it, I would recommend basically anything from John Regehr, for example, this 2-part video series: CppCon 2017: Undefined Behavior in 2017.

Also, this video from Chandler Carruth: Garbage In, Garbage Out: Arguing about Undefined Behavior with Nasal Demons. While we're at it, I would also recommend watching less popular opinions on the topic, for example: Handmade Hero Chat 011 - Undefined Behavior (Casey Muratori).

Undefined Behavior Enables Optimizations

At this point, I'm assuming we can all think of an example where undefined behavior enables an optimization. A classic one is signed overflow. In C/C++, if signed overflow happens, you get undefined behavior. The important thing is that a program that triggers undefined behavior for some input is considered incorrect. The compiler can assume that the program it is given is always correct, and so it can take advantage of undefined behavior by assuming that it never happens. This is a valid assumption! It is the programmer's responsibility to ensure that undefined behavior is not triggered for any input.

So, using these assumptions, the compiler can do the following transformation:

if (a + c < b + c) ...  -->  if (a < b) ...

assuming a, b, and c are signed integers, of course.

First of all, let's make sure we understand why the compiler would not be able to do this transformation if signed overflow were defined. Imagine that a == INT_MAX, b == 0 and c == 1. The code before the transformation would be:

if (INT_MAX + 1 < 0 + 1)

If we assume that signed overflow wraps around, then INT_MAX + 1 == INT_MIN. INT_MIN is, you know... a very negative number (assuming we're using 2's complement arithmetic, which in 2020 everybody assumes except C/C++). For the right-hand side, 0 + 1 == 1, not surprisingly. So, INT_MIN < 1 is true. But what would happen after the transformation:

if (INT_MAX < 0)

That would be false, making our transformation invalid. But since the operation INT_MAX + 1 causes signed overflow, UB comes to the rescue! We assume it doesn't happen (and if it does, shame on you, programmer) and we proceed.
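To make this concrete, here is a minimal sketch (the function names are mine). Compiled with wrapping semantics, e.g., gcc or clang with -fwrapv, the two functions genuinely differ; with overflow as UB, the compiler may assume the inputs never overflow and treat them as equivalent:

#include <limits.h>

// The comparison before and after the transformation.
int before(int a, int b, int c) { return a + c < b + c; }
int after(int a, int b, int c)  { return a < b; }

// Under -fwrapv (wrapping semantics):
//   before(INT_MAX, 0, 1) == 1   since INT_MIN < 1
//   after(INT_MAX, 0, 1)  == 0   since INT_MAX >= 0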

Undefined Behavior Disables Optimizations

That was all fun and, at times, certainly a marvelous way to piss off C/C++ programmers when a lot of such "small and innocent" transformations are applied to their code. But unfortunately, you can't always have that much fun with undefined behavior. Suppose that you have this loop:

int b, c;
...
for (int i = 0; i < N; ++i) {
  a[i] = b + c;
}

What we would like to do here is not perform this addition in every iteration. Instead, we would like to hoist it out of the loop into a temporary, and only do a simple copy inside:

int tmp = b + c;
for (int i = 0; i < N; ++i) {
  a[i] = tmp;
}

But surprise, surprise: we can't do that if we don't know the values of b, c, and N. The reason is that, as we learned, if b + c overflows, that's undefined behavior. Now, assume we hoist it and N <= 0.

In the original program, we would never enter the loop and thus never execute the overflowing addition. But in the transformed program, we will execute it whether we enter the loop or not. So, in the off chance that b + c overflows and N <= 0, we have introduced undefined behavior that would otherwise never have happened. Essentially, we made the program more undefined, which the compiler is not allowed to do.
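One way to observe the difference, as a sketch, is gcc's -ftrapv flag, which makes signed overflow trap at run time (build with -O0 so the compiler doesn't fold anything away):

#include <limits.h>

int a[8];

int main(void) {
    int b = INT_MAX, c = 1, N = 0;

    // Original shape: the addition is guarded by the loop condition,
    // so with N == 0 it never executes and nothing traps.
    for (int i = 0; i < N; ++i)
        a[i] = b + c;

    // Hoisted shape: the addition executes unconditionally. Built
    // with `gcc -O0 -ftrapv`, this line aborts even though N == 0.
    int tmp = b + c;
    for (int i = 0; i < N; ++i)
        a[i] = tmp;

    return 0;
}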

Definedness to the Rescue

As mentioned, the compiler can never make the program more undefined, but it can make it more defined. For example, if we want, we can define what happens on signed overflow. We can, say, define the result to be 0.

You can think of it as follows. Imagine that there is a set B which contains all possible behaviors of the program. When we say that a statement exhibits undefined behavior, the C/C++ standard lets the compiler produce a program that will exhibit any one behavior b ∈ B per execution of the program. That is, in one execution the program can exhibit b1, while in another, b2.

But note that this gives the compiler the freedom to constrain the behaviors the program can exhibit. For instance, the compiler is free to say that the program will exhibit the same behavior in every execution (i.e., b1 == b2 above). Not only that, but it can also define what this behavior will be. For instance, for signed overflow, it can choose to define the result as 0.

Later in the article, we are going to see how we can define signed overflow so that it allows us to do the hoisting in the loop example above. But until then, let's explore some implications of adding definedness.

Implications of Adding Definedness

At some point, the input program gets lowered into assembly by the back-end. The back-end's job is to express the semantics of the input program in assembly. This gives rise to an important implication of adding definedness. You see, if we never added definedness, then whenever the back-end saw a signed addition in the input, it could translate it to a regular addition in the assembly. That, of course, assumes the target has an addition instruction, but that is a reasonable assumption.

This addition instruction might do anything on overflow. Actually, this instruction could blow up the machine in the case of signed overflow. But we do not care, because we never defined what happens on signed overflow, so everything is allowed.

However, consider what happens when we do define signed overflow, e.g., defining the result to be 0. Furthermore, assume that the addition instruction of the assembly actually blows up the machine on signed overflow. How should the back-end translate a signed addition in the input program?

It certainly cannot emit a simple addition instruction, because in the case of signed overflow, the machine will blow up and our definition does not allow that! So, it has to emit complicated code like this:

if (a + b overflows)
  res = 0;
else
  res = a + b;  // Use a regular add instruction

And this is just a high-level sketch. For instance, to implement the condition a + b overflows, the back-end can't use an addition instruction, because again, if overflow happens, the machine will blow up.
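Here is one way such a lowering could look, sketched in C (the function name is mine). Unsigned arithmetic wraps by definition in C, so we can detect the overflow without ever executing a signed add:

#include <limits.h>

// Defined-to-zero signed addition, computed without a signed add.
int add_defined_to_zero(int a, int b) {
    unsigned ua = (unsigned)a, ub = (unsigned)b;
    unsigned usum = ua + ub;  // unsigned add: wraps, never blows up

    // Signed overflow would have happened iff a and b have the same
    // sign and the sum's sign differs from it; check the sign bits.
    if (((ua ^ usum) & (ub ^ usum)) >> (sizeof(int) * CHAR_BIT - 1))
        return 0;             // our chosen definition: result is 0

    return (int)usum;         // implementation-defined for negative
                              // results pre-C23; two's complement
                              // in practice
}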

Of course, instructions do not usually blow up machines. But the point is that when you define something, it might have unforeseen implications. A broader point is that undefined behavior is just a design decision. It has drawbacks, but it also has benefits. For instance, defining that a pointer dereference never crashes would force the compiler to insert a guard before every dereference.
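As a sketch of what that would force on the back-end (is_mapped is hypothetical; no such portable check exists in C, and a real one would need OS support):

// Hypothetical lowering if dereference were defined never to crash.
static int is_mapped(const int *p) {
    return p != 0;  // stub: a real check can't be a simple null test
}

int guarded_load(const int *p) {
    return is_mapped(p) ? *p : 0;  // a guard on every dereference
}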

Back to Hoisting

Let's remind ourselves that we are interested in performing this:

int tmp = b + c;
for (int i = 0; i < N; ++i) {
  a[i] = tmp;
}

Now, let's find a definition for signed overflow that will let us do such hoistings. A minimally constraining one is a simpler version of the definition used in LLVM IR. It states that signed overflow exhibits undefined behavior only if its result is involved in a memory operation.

Understanding the full definition, and what exactly counts as a memory operation, would require us to explore LLVM IR more deeply, which is not the point of this article. So, for our purposes, assigning to or reading from a variable is not a memory operation; let's just assume the value is going to live in a register. But storing to or reading from an array is a memory operation. So, the statement int tmp = b + c does not contain a memory operation, but the statement a[i] = tmp does.

Another nuance is that when we say "used in a memory operation", transitivity is implied. For instance, the result of b + c is not used directly in a memory operation; it's only assigned to tmp. But tmp is used in one, so transitively, b + c is used too.
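Annotating our loop with these rules (the register vs. memory placement is an assumption for this sketch):

int tmp = b + c;  // no memory operation: tmp lives in a register
for (int i = 0; i < N; ++i) {
  a[i] = tmp;     // memory operation: uses tmp, and so, transitively,
                  // the result of b + c
}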

The last important thing is that we are considering dynamic executions of the program. Consider the code below:

if (N > 0)
  a[i] = b;

If in a specific execution N <= 0, then the statement a[i] = b is not executed. Thus, in that execution, b is not used in a memory operation. In other words, when we are checking whether some value is used in a memory operation, we should not just look at the source code. For instance, in the loop we're dealing with, tmp appears in a memory operation, but we only consider it used if this memory operation actually executes.

All this follows naturally from the fact that undefined behavior is defined according to executions. In one execution the program might exhibit it, in another not.

With all the required definitions in place, we can finally hoist b + c out of the loop! Its result is going to be used in a memory operation only if we enter the loop. At that point, UB will be exhibited, but it would have been exhibited in the original program too. So, we did not make the program more undefined than it already was.

Adding Definedness Disables Optimizations

Yes, definedness also disables optimizations. Before diving into it, let's give a name to the result of signed overflow: poison value. This is LLVM parlance, used more generically to represent the results of operations that would normally cause UB (like signed overflow). Basically, it denotes an undefined value, but it also comes with a list of behaviors depending on how it is used. For our purposes, it will help us refer to the result of signed overflow more easily.

Let's now raise a question: how do we define the usage of a poison value in a branch condition? For example:

int res = a + b; // Assume it overflows. `res` is poison
if (res)
  ...

We should always keep in mind that when adding definedness, our goal is to: (a) not introduce more UB, and ultimately, (b) aid optimizations. One way to define branching on poison is to make it non-deterministic; that is, you don't know whether you'll take the branch or not. Semantically, that amounts to taking it and not taking it at the same time. In an if-else construct, it means you take both paths. But why would we choose that?

To find out why, let's learn about an interesting optimization called loop unswitching. It takes a loop-invariant branch inside a loop and wraps it around the loop. If our initial code is this:

while (foo) {
  if (bar) { // Assume `bar` is loop-invariant
    <block 1>
  } else {
    <block 2>
  }
}

It transforms it into this:

if (bar) {
  while (foo) {
    <block 1>
  }
} else {
  while (foo) {
    <block 2>
  }
}

This is useful because the loop no longer has a branch inside. Its effect is not direct; it's not that the generated assembly will be much faster by itself, since modern branch predictors simply laugh at a loop-invariant branch condition. Rather, branches disable other transformations, so the effect of this transformation is indirect.

OK, but what if bar is poison? That would worry us only if we never enter the loop, since in the original version we would then never reach the branch on bar, but we do reach it in the transformed version. If we define branching on poison as taking both paths, we can enable this transformation. Let's think it through.

If bar is not poison, we don't care. If it is poison and we do enter the loop, then both versions branch on it, so both can exhibit the same behaviors. Finally, if it is poison but we don't enter the loop, in the original program we wouldn't execute anything. In the transformed program, we would non-deterministically pick one of the two paths (either the if or the else), because of how we defined branching on poison above. But either of these paths contains a loop that is never executed, so we have the same behavior as in the original.

Oh, that's great, let's define branching on poison as taking both paths then! Not so fast... We are compiler developers with a wide range of knowledge, and so now we go ahead and write a Global Value Numbering (GVN) pass. If you don't know GVN, it has similar effects to common sub-expression elimination (or simply, it gets rid of things that are, or do, the same). So, if our original code is:

int foo = a + b;
if (foo == bar) {
  int tar = a + b;
  *p = tar;
}

It transforms it to:

int foo = a + b;
if (foo == bar) {
  *p = bar;
}

This results from simple reasoning: first, we find that tar is the same as foo. Then, because of the if, foo is the same as bar (inside the if). And so we replace tar with bar.
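The chain of equalities, annotated:

int foo = a + b;
if (foo == bar) {  // inside the branch: foo == bar holds
  int tar = a + b; // GVN: same expression, so tar == foo
  *p = tar;        // hence tar == foo == bar: store bar instead
}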

Before we continue, I have to explain what happens when poison is used with the == operator. In general, you can think of poison as either raising UB immediately (e.g., if it is used in a memory operation) or propagating (that's where the name came from; it poisons other values).

The full details don't matter for our discussion and would only confuse us. What matters is that storing a poison value (as we do here with *p) is immediate UB, while using it with the == operator makes the result poison. So, if bar is poison, the branch condition becomes poison too, and we have a branch-on-poison situation.
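In code, the rules we need, sketched in C syntax with LLVM-like semantics in the comments:

void sketch(int a, int b, int bar, int *p) {
  int res = a + b;       // if it overflows: res is poison
  int cond = res == bar; // == with a poison operand: cond is poison
  if (cond)              // branching on poison
    *p = res;            // storing poison: immediate UB
}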

Back on topic: suppose bar is poison but a + b does not overflow. In the original code, we stored tar, which is fine since it's not poison. But in the transformed code, we store bar, which is poison, so we introduced UB. This is invalid only because branching on poison is defined (e.g., non-deterministically, as we did). If it weren't, UB would exist in both versions (since both branch on poison), and the transformation would be fine.

In summary, there are two difficulties when trying to escape from UB through definedness. One is to add it so that it causes neither problems for the back-end nor performance regressions. That requires a broad understanding of modern instruction sets: what they can easily express, and which of those things are fast.

The other difficulty is that a definition which might be optimal for one optimization might cancel out another; in that regard, optimizations can be mutually exclusive. Adding definedness can also introduce cognitive load for compiler developers.

All these trade-offs make it difficult to find the sweet spot, especially since we can't change the source language and we should also preserve backwards compatibility. My hope is that future intermediate representations will learn from LLVM IR and at least have the chance to design these features from the ground up.