QuickCheck and random infinite trees

So I was QuickCheck-ing a bit of code with this amazing library that will automatically produce random test cases for your properties. Often in Haskell, we’ll have recursive, tree-like datastructures, like this simple binary tree

data Tree = Leaf | Branch Tree Tree deriving Show

In order for QuickCheck to generate some random trees, we make this type an instance of Arbitrary:

import Control.Monad
import Control.Applicative
import Test.QuickCheck

instance Arbitrary Tree where
    arbitrary = oneof 
        [ return Leaf
        , Branch <$> arbitrary <*> arbitrary ]

Easy! An arbitrary Tree is either a Leaf or a Branch into two arbitrary trees. I quickly learned that THIS IS NOT THE WAY YOU DO IT. Trying it out with

sample (arbitrary :: Gen Tree)

lets Haskell give us some random trees

Branch (Branch Leaf Leaf) Leaf
Branch Leaf Leaf
Branch Leaf (Branch (Branch (Branch Leaf (Branch (Branch Leaf (Branch (Branch (Branch (Branch Leaf Leaf) Leaf) (Branch Leaf (Branch Leaf Leaf))) (Branch Leaf (Branch Leaf (Branch (Branch (Branch (Branch Leaf (Branch Leaf Leaf)) (Branch (Branch Leaf (Branch Leaf Leaf)) (Branch Leaf (Branch (Branch Leaf (Branch Leaf (Branch (Branch (Branch (Branch (Branch (Branch (Branch (Branch Leaf Leaf) (Branch (Branch (Branch (Branch Leaf Leaf) (Branch Leaf Leaf)) Leaf) Leaf)) (Branch (Branch Leaf (Branch (Branch Leaf Leaf) Leaf)) (Branch (Branch (Branch (Branch Leaf Leaf) Leaf) (Branch (Branch (Branch Leaf (Branch (Branch (Branch (Branch (Branch Leaf (Branch Leaf Leaf)) (Branch Leaf (Branch Leaf (Branch Leaf Leaf)))) (Branch (Branch Leaf Leaf) (Branch (Branch (Branch Leaf (Branch (Branch Leaf (Branch (Branch Leaf Leaf) Leaf)) Leaf)) (Branch Leaf Leaf)) Leaf))) (Branch Leaf (Branch (Branch Leaf Leaf) (Branch (Branch Leaf Leaf)  ...

I have not written out that last tree completely; not even remotely. Let’s have a closer look at what happens

depth :: Tree -> Int
depth Leaf = 0
depth (Branch l r) = 1 + max (depth l) (depth r)

Now we can compute the depth of what we get

sample (depth <$> arbitrary)
> 0,0,862,0,0,12,1,0,1,0,0

A depth of 862? That seems pretty disproportionate. oneof throws a fair coin. On heads, it will just return a Leaf and the process terminates. On tails, it will spawn a branching and repeat the same process independently. Now, in order for the tree to terminate, all active branches have to terminate. Around half of them will die, half of them will branch even further. Do these two processes cancel each other out? Can we expect the trees to terminate at a reasonable depth or is there even a possibility that the program will run indefinately?

Some experimentation

In our example, the dying and spawning of new branches weirdly balance each other, making predictions hard. But our coins don’t have to be fair. We can have QuickCheck favor certain alternatives with some frequency

instance Arbitrary Tree where
    arbitrary = frequency 
        [ (2, return Leaf)
        , (1, Branch <$> arbitrary <*> arbitrary) ]

Now twice as many branches will die than newly spawn. We should expect finite trees now

sample (depth <$> arbitrary)
> 1,0,0,6,0,0,2,0,0,0,0

Fine. A lot of trees terminate right at the root, but no weird runaway recursions happen. On the other hand

instance Arbitrary Tree where
    arbitrary = frequency 
        [ (1, return Leaf)
        , (2, Branch <$> arbitrary <*> arbitrary) ]

will spawn more trees than it will kill, so most trees will send Haskell into an unending recursion. Actually, there is a probability strictly between $0$ and $1$ that the generated tree will be finite. Interestingly, the trees which do end up finite will be very short! What happens in our initial case with the fair coin stil remains unanswered.

Time to set up some maths. Let $p_0$ be the probability of getting a leaf and write $p(d)$ for the probability of the tree having depth $d$. Firstly

$$p(0) = p_0$$

Now a tree has depth precisely $d$ if

  • Option “branch” was chosen
  • both left and right child have depth $d-1$
  • or the left child has depth $d-1$ and the right one has a smaller depth $j$
  • or the right child has depth $d-1$ and the left one has a smaller depth $j$

giving us a recursive formula
$$p(d) = (1-p)\left(p(d – 1)^2 + 2p(d – 1)\sum_{j=0}^{d-2} P(j) \right)$$

Let’s experiment a bit with these

# Random trees

n = 1000000

# d = depth of a random tree

# p0 = probability of producing a leaf/terminating in each step
for p0 in [0.2, 1/3, 0.5, 2/3, 0.8]:
    q0 = 1-p0

    # p[n] = P(d = n)
    p = [0]*n

    # t[j] = P(d < j) = sum(p[i] for i in range(0,j))
    t = [0]*n

    p[0] = p0
    p[1] = q0 * p[0]**2
    t[1] = p[0]

    for i in range(2,n):
        p[i] = q0*(p[i-1]**2 + 2*p[i-1]*t[i-1])
        t[i] = t[i-1] + p[i-1]

    E = sum(d * p[d] for d in range(0,n))

    print("Random tree depth (n=%i, p=%f)" % (n,p0))
    print("P(d < inf) = %f" % t[n-1])
    print("E[d | d < inf] = %f" % E)

For $p=\frac 2 3$, we get

Random tree depth (n=1000000, p=0.666667)
P(d < inf) = 1.000000
E[d | d < inf] = 0.833477

100% of the trees are finite, but their expected depth is only 0.83. On the contrary, for $p=\frac 1 3$, we get

Random tree depth (n=1000000, p=0.333333)
P(d < inf) = 0.500000
E[d | d < inf] = 0.416738

Exactly 50% of the trees will be finite! Of the ones that do, the expected height is only 0.417. You either die young, our you live long enough to see yourself become infinite …

What happens for our original fair coin-toss?

Random tree depth (n=1000000, p=0.500000)
P(d < inf) = 0.999998
E[d | d < inf] = 22.581590

So, most trees seem to terminate, the expected height awkwardly hangs arounds 22. That’s no conclusive result, and the numbers change if we crank up the number of iterations. We need some theory to back up the empiricism.

Maths to the rescue

Our kind of branching tree experiment is an easy special case of a so called Galton-Watson process. In each generation, the nodes are allowed to have a random number of offspring with some probabilities $p_n$. In our case, that’s simply

$$p_0 = p,\,p_2 = 1-p$$

The deciding factor for the process is the expected number $\mu$ of offspring, which in our case is

$$\mu = 0\cdot p_0 + 2\cdot p_2 = 2-2p.$$

We can now apply some theorems:

If $\mu \leq 1$, the process terminates with probability 100%.

The depth $\tau$ of our trees would in the theory of these processes be called extinction time. $\mu \leq 1$ tells us the extinction time is always finite. We want to understand their exact distribution. What’s the probability that $\tau > n$? Proposition 5 says

  • if $\mu < 1$, $P(\tau > n)$ decays exponentially with $n$.
  • if $\mu = 1$, $P(\tau > n)$ decays only linearly with $n$.

We can always compute the expected value from these tail distributions with the formula

$$ \mathbb E[\tau] = \sum_{n=1}^\infty P(\tau \geq n) $$

This means that for $\mu < 1$, the expected tree depth has a finite value, whereas for $\mu = 1$, we get
$$\mathbb E[\tau] \approx \sum_{n=1}^\infty \frac 1 n = \infty.$$

Fair coin tosses leed to trees that all eventually terminate, but we can expect them to take arbitrarily long. As soon as we tweak the frequencies a bit towards termination, we get nice and finite trees.

tl/dr Never let QuickCheck throw fair coins. Tweak the frequencies.

Remark: The other idiomatic solution to the problem is to carry around a size parameter in the generating routine

genTree depth 
    | depth == 0 = return Leaf
    | depth > 0 = 
            [ return Leaf
            , Branch <$> (genTree (depth - 1)) <*> (genTree (depth - 1)) ] 

QuickCheck can even use the size parameters for its internal workings, so in the end we just declare

instance Arbitrary Tree where
    arbitrary = sized genTree

Laaarge numbers, Part I

What is the largest number you’ve come across? A million is a lot when you think of it as money. A billion is a lot when you think of it as operations to do on a computer. $10^{100}$ is a lot when compared to the number of atoms in the universe. $10^{2 000 000}$ is a loot of books. Still, each one of those numbers can be written out by a one and a moderate amount of zeroes. The googolplex $10^{(10^{100})}$ couldn’t be written out in a physical form any more (though in a digital one), and what it would count is unimaginably large. But again, it comes in a nice and compact form that we’ve just written down: $$10^{(10^{100})}$$

We want to get numbers larger than that. And not just a single one of them, but a whole function $f$ that takes a number and computes a large number out of it, growing extremely fast. For example
$$f(n) = 10^{10^{10^n}}$$
is growing quite fast, in fact we already get the googolplex as $f(2)$. We could now play the silly game and include larger and larger power towers, take factorials like
$$f(n) = n^{n^{n!}}$$
and so an. We will be looking for a concept of a fast-growing function that subsumes all of these.

The LOOP language

The LOOP language is a very simple programming language. All of its variables do hold arbitrarily sized integers and are initialized to $0$. The only statements of the language are assignments with some arithmetic (+,-,*,/) like

x = x + 2*(y - z/2)

and loops like

loop n

The loop reads the value of n at the beginning and runs its body exactly as many times (or not at all if the expression isn’t positive). It doesn’t care if we change the value of n inside the loop. So the program

n = 5
loop n
    n= 2*n

doubles n five times, resulting in n = 160.

Let’s convince ourselves that LOOP is surprisingly powerful: For example, we can simulate if switches with loops. Instead of

if n == 5

we write

loop n - 4
    loop 6 - n
        eq = 1
neq = 1
loop eq
    neq = 0
loop neq

We don’t care if the solution is efficient, we just care for a LOOP program that computes what we want.

Brainteaser 1) Can you write a LOOP program that tests if $n$ is prime?
Brainteaser 2) Can you write every LOOP program as one that only uses basic statements var = var + 1, var = var - 1 and loop var?

Let’s now use LOOP to generate large numbers. Surely, we can exponentiate

n = 10
loop 100
   n = n * 10

and take factorials

n = 10
i = n - 1
loop n - 2
   n = n * i
   i = i - 1

and we can get our googolplex

k = 10
loop 100
    k = k * 10
n = 10
loop k
    n = n * 10

We can create incredibly large numbers by nesting loops in worse and worse ways

n = 10
loop 100
   loop n
      n = n*n

Brainteaser 3) Think of how large the output of this program is.

Sure, we have a program that generates an enormous number. What’s so special when we have programs that generate arbitrarily high numbers?

while true
    n = n + 1

But that’s not a LOOP program. And we cannot write any LOOP program that does the same job. This is because

Every LOOP program terminates.

The instruction pointer will inevitably reach the end of the program, and it can only loop back a fixed number of times. There is no endless loop in LOOP. Let that statement sink in! It means the length of a program somehow bounds how long it can run! Bigger numbers require larger programs. Still, already a small program can run extremely long. We want to measure how long.

If we take n to be our dedicated output variable, every loop program will generate a well-defined number. There are only finitely many sourcecodes of length at most k and even fewer of these are syntactically valid LOOP programs. We can therefore define
$$f(k) = \text{ maximum output of a LOOP program of length at most $k$. }$$

Brainteaser 4) Convince yourself that f grows faster than any of the “large” functions we looked at before.
Brainteaser 5) Can there be a LOOP program that, given $k$, computes $f(k)$?
Brainteaser 6) Can you implement a quick parser/interpreter for LOOP?
Brainteaser 7) How would you implement procedures in LOOP? Is recursion possible?