Measuring information in millibytes
A thousandth of one byte. Does such a thing even exist?
Yep, it does. I’ve used them before.
How to debug an optimizing compiler
Suppose you work on an optimizing compiler that has a bug, causing the compiled program to misbehave. You check the compiler log and discover that 256 optimizations ran, and it’s your job to figure out which optimization has the bug. What do you do?
You give the problem some thought, and realize that optimizations have a very important property: they’re all optional. The program will run correctly even if you perform only some (or none) of the optimizations.
You decide to try running with optimizations 1-128 enabled, and 129-256 disabled. That run fails. Now you know that one of the first 128 optimizations is to blame. You try again with 1-64 enabled, and 65-128 disabled, and that run passes. The bad optimization must be between 65 and 128. Continuing in this fashion—a binary search—you’re able to locate the faulty optimization in just eight steps.
Why eight steps? Well, to differentiate between 256 equally likely possibilities requires eight bits. Each experiment, since it differentiates between two equally likely outcomes (pass and fail), provides one bit of information. Therefore, we need eight such experiments to give us the eight bits we need to determine the faulty optimization.
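To make the bisection concrete, here's a minimal sketch in Python. It assumes a hypothetical `run_test(enabled)` helper (not a real compiler API) that compiles and runs the program with only the optimizations in `enabled` turned on and returns whether it behaved correctly; the rest is just the halving loop described above.

```python
def find_faulty_optimization(num_opts, run_test):
    """Binary-search for the single faulty optimization.

    `run_test(enabled)` is a hypothetical helper: it builds and runs the
    program with only the optimizations in `enabled` turned on, and
    returns True if the program behaves correctly.
    """
    lo, hi = 1, num_opts              # candidate range, inclusive
    steps = 0
    while lo < hi:
        mid = (lo + hi) // 2
        # Enable only the lower half of the remaining candidates.
        passed = run_test(set(range(lo, mid + 1)))
        if passed:
            lo = mid + 1              # bug must be in the disabled upper half
        else:
            hi = mid                  # bug is in the enabled lower half
        steps += 1
    return lo, steps                  # e.g. (faulty_opt, 8) when num_opts == 256
```

For 256 optimizations the loop always finishes in exactly eight runs: one run per bit of information.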
It’s striking that the product of all our debugging effort will be data totaling just eight bits—a single byte. It’s not often that we deal with such small quantities of information. But we’re just getting started.
In fact, if the program is nondeterministic, and the failures are intermittent, your debugging effort could involve information on the scale of a tiny fraction of one bit!
To understand how, we’ll need to take a moment to examine what it means to measure information in the first place.
What’s in a bit?
A bit, we’re told, is the smallest unit of information. It’s just enough data to distinguish two equally likely possibilities. Heads or tails. Even or odd. Pass or fail.
It takes a larger number of bits to distinguish more possibilities. Two bits can distinguish four possibilities; so if we flip two coins, we could use one bit to describe the result of each coin, allowing us to represent the four possibilities heads-heads, heads-tails, tails-heads, and tails-tails. With three bits, we could flip three coins and represent all eight possible outcomes; four bits can represent sixteen possible outcomes.
In general, if we have B bits, we can distinguish 2^B possibilities, each of which has a probability P = (1/2)^B. Solving this for B gives:

B = log₂(1/P)

For P = 1/256, this gives B = log₂ 256 = 8. As Claude Shannon tells us, this is why it takes eight bits to distinguish 256 possibilities that are equally likely.
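A quick sanity check of the formula, using nothing beyond `math.log2` and the cases we've already seen:

```python
import math

def bits_needed(possibilities):
    # B = log2(1/P) with P = 1/possibilities, which simplifies to log2(possibilities)
    return math.log2(possibilities)

print(bits_needed(4))    # 2.0  (two coins)
print(bits_needed(16))   # 4.0  (four coins)
print(bits_needed(256))  # 8.0  (our faulty-optimization search)
```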
However, a strange thing happens if we pick a value of P that is not a power of two. Suppose we want to specify one of the decimal digits, from 0 to 9. If each one is equally likely, then the information content is:

B = log₂ 10 ≈ 3.3219 bits
Now, obviously a fraction of a bit makes no sense at all, right? Everyone knows it takes four bits to encode a decimal digit. Well, actually, it depends how you store them. Suppose you store the digits in groups of three. Then each group has 1000 possibilities, from 000 to 999, which you can encode in ten bits because 2^10 = 1024 > 1000. Ten bits for three digits is 3⅓ bits per digit. In fact, if you were really desperate, you could encode each group of 31 digits using 103 bits (since 2^103 > 10^31), which works out to about 3.3226 bits per digit, barely more than Shannon's 3.3219. It turns out Shannon's formula gave us a pretty useful answer after all.
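Here's one way to see the 3⅓-bits-per-digit packing in code. This is my own illustration, not a standard encoding: each group of three decimal digits becomes a single value in 0–999, which fits in ten bits.

```python
def pack_digits(digits):
    """Pack decimal digits, three at a time, into 10-bit values."""
    assert len(digits) % 3 == 0
    packed = []
    for i in range(0, len(digits), 3):
        a, b, c = digits[i:i + 3]
        packed.append(a * 100 + b * 10 + c)   # 0..999 fits in 10 bits
    return packed

def unpack_digits(packed):
    digits = []
    for value in packed:
        digits.extend((value // 100, (value // 10) % 10, value % 10))
    return digits

digits = [3, 1, 4, 1, 5, 9]
packed = pack_digits(digits)            # two 10-bit values: [314, 159]
assert unpack_digits(packed) == digits
print(10 * len(packed) / len(digits))   # 3.333... bits per digit
```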
Intermittent failures
Armed with Shannon’s formula to tell us the information in a message, we’re ready to describe what happens when testing for intermittent failures.
Imagine the same situation as before—a compiler with one faulty optimization out of 256—but now the program is nondeterministic, and so the bug introduced by the flawed optimization causes a failure in only one out of every 90 runs. The question is, how much information do we get when we observe one passing test run? Well, it’s certainly not much. We expect most runs to pass, even if the faulty optimization does run, so learning that a given run passed isn’t exactly a newsflash.
Let’s apply Shannon’s formula. To calculate the information content of an event, we must first calculate its probability. As before, we’ll take the probability of including the faulty optimization to be 50%.† If the faulty optimization runs, the test will fail one time in 90; otherwise, it will pass. That makes the overall probability of failure 1/180, meaning the probability of the test passing is 179/180. Plugging this probability into the formula gives:

B = log₂(180/179) ≈ 0.0080373 bits
This is a little more than eight thousandths of one bit. Therefore, the information given by one passing test run is just a little over one millibyte. To debug this compiler, we’re going to be running hundreds and hundreds of experiments, each of which gives us a tiny fraction of one bit, until we can finally put together the single byte we need to isolate the faulty optimization.
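Plugging in the numbers, here's a rough back-of-the-envelope check, assuming (somewhat naively, as the footnote below explains) that each passing run contributes its 0.008 bits independently:

```python
import math

p_pass = 179 / 180                       # probability that a single run passes
bits_per_pass = math.log2(1 / p_pass)    # ≈ 0.0080373 bits per passing run
millibytes = bits_per_pass / 8 * 1000    # ≈ 1.005 millibytes
runs_for_a_byte = 8 / bits_per_pass      # ≈ 995 passing runs to gather 8 bits

print(bits_per_pass, millibytes, runs_for_a_byte)
```

That's roughly a thousand passing runs just to accumulate the one byte we need.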
So, not only do millibytes exist; sometimes they can be extremely valuable!
† Actually, for intermittent failures, the probability that a given test includes a faulty optimization depends on what we’ve learned from previous experiments. The correct way to isolate a failing optimization causing intermittent failures uses Bayesian experimental design, which is a fascinating topic for another day.