QuickCheck and random infinite trees

So I was QuickCheck-ing a bit of code with this amazing library that will automatically produce random test cases for your properties. Often in Haskell, we’ll have recursive, tree-like datastructures, like this simple binary tree

data Tree = Leaf | Branch Tree Tree deriving Show

In order for QuickCheck to generate some random trees, we make this type an instance of Arbitrary:

import Control.Monad
import Control.Applicative
import Test.QuickCheck

instance Arbitrary Tree where
    arbitrary = oneof 
        [ return Leaf
        , Branch <$> arbitrary <*> arbitrary ]

Easy! An arbitrary Tree is either a Leaf or a Branch into two arbitrary trees. I quickly learned that THIS IS NOT THE WAY YOU DO IT. Trying it out with

sample (arbitrary :: Gen Tree)

lets Haskell give us some random trees

Branch (Branch Leaf Leaf) Leaf
Branch Leaf Leaf
Branch Leaf (Branch (Branch (Branch Leaf (Branch (Branch Leaf (Branch (Branch (Branch (Branch Leaf Leaf) Leaf) (Branch Leaf (Branch Leaf Leaf))) (Branch Leaf (Branch Leaf (Branch (Branch (Branch (Branch Leaf (Branch Leaf Leaf)) (Branch (Branch Leaf (Branch Leaf Leaf)) (Branch Leaf (Branch (Branch Leaf (Branch Leaf (Branch (Branch (Branch (Branch (Branch (Branch (Branch (Branch Leaf Leaf) (Branch (Branch (Branch (Branch Leaf Leaf) (Branch Leaf Leaf)) Leaf) Leaf)) (Branch (Branch Leaf (Branch (Branch Leaf Leaf) Leaf)) (Branch (Branch (Branch (Branch Leaf Leaf) Leaf) (Branch (Branch (Branch Leaf (Branch (Branch (Branch (Branch (Branch Leaf (Branch Leaf Leaf)) (Branch Leaf (Branch Leaf (Branch Leaf Leaf)))) (Branch (Branch Leaf Leaf) (Branch (Branch (Branch Leaf (Branch (Branch Leaf (Branch (Branch Leaf Leaf) Leaf)) Leaf)) (Branch Leaf Leaf)) Leaf))) (Branch Leaf (Branch (Branch Leaf Leaf) (Branch (Branch Leaf Leaf)  ...

I have not written out that last tree completely; not even remotely. Let’s have a closer look at what happens

depth :: Tree -> Int
depth Leaf = 0
depth (Branch l r) = 1 + max (depth l) (depth r)

Now we can compute the depth of what we get

sample (depth <$> arbitrary)
> 0,0,862,0,0,12,1,0,1,0,0

A depth of 862? That seems pretty disproportionate. oneof throws a fair coin. On heads, it will just return a Leaf and the process terminates. On tails, it will spawn a branching and repeat the same process independently. Now, in order for the tree to terminate, all active branches have to terminate. Around half of them will die, half of them will branch even further. Do these two processes cancel each other out? Can we expect the trees to terminate at a reasonable depth or is there even a possibility that the program will run indefinately?

Some experimentation

In our example, the dying and spawning of new branches weirdly balance each other, making predictions hard. But our coins don’t have to be fair. We can have QuickCheck favor certain alternatives with some frequency

instance Arbitrary Tree where
    arbitrary = frequency 
        [ (2, return Leaf)
        , (1, Branch <$> arbitrary <*> arbitrary) ]

Now twice as many branches will die than newly spawn. We should expect finite trees now

sample (depth <$> arbitrary)
> 1,0,0,6,0,0,2,0,0,0,0

Fine. A lot of trees terminate right at the root, but no weird runaway recursions happen. On the other hand

instance Arbitrary Tree where
    arbitrary = frequency 
        [ (1, return Leaf)
        , (2, Branch <$> arbitrary <*> arbitrary) ]

will spawn more trees than it will kill, so most trees will send Haskell into an unending recursion. Actually, there is a probability strictly between $0$ and $1$ that the generated tree will be finite. Interestingly, the trees which do end up finite will be very short! What happens in our initial case with the fair coin stil remains unanswered.

Time to set up some maths. Let $p_0$ be the probability of getting a leaf and write $p(d)$ for the probability of the tree having depth $d$. Firstly

$$p(0) = p_0$$

Now a tree has depth precisely $d$ if

  • Option “branch” was chosen
  • both left and right child have depth $d-1$
  • or the left child has depth $d-1$ and the right one has a smaller depth $j$
  • or the right child has depth $d-1$ and the left one has a smaller depth $j$

giving us a recursive formula
$$p(d) = (1-p)\left(p(d – 1)^2 + 2p(d – 1)\sum_{j=0}^{d-2} P(j) \right)$$

Let’s experiment a bit with these

# Random trees

n = 1000000

# d = depth of a random tree

# p0 = probability of producing a leaf/terminating in each step
for p0 in [0.2, 1/3, 0.5, 2/3, 0.8]:
    q0 = 1-p0

    # p[n] = P(d = n)
    p = [0]*n

    # t[j] = P(d < j) = sum(p[i] for i in range(0,j))
    t = [0]*n

    p[0] = p0
    p[1] = q0 * p[0]**2
    t[1] = p[0]

    for i in range(2,n):
        p[i] = q0*(p[i-1]**2 + 2*p[i-1]*t[i-1])
        t[i] = t[i-1] + p[i-1]

    E = sum(d * p[d] for d in range(0,n))

    print("Random tree depth (n=%i, p=%f)" % (n,p0))
    print("P(d < inf) = %f" % t[n-1])
    print("E[d | d < inf] = %f" % E)

For $p=\frac 2 3$, we get

Random tree depth (n=1000000, p=0.666667)
P(d < inf) = 1.000000
E[d | d < inf] = 0.833477

100% of the trees are finite, but their expected depth is only 0.83. On the contrary, for $p=\frac 1 3$, we get

Random tree depth (n=1000000, p=0.333333)
P(d < inf) = 0.500000
E[d | d < inf] = 0.416738

Exactly 50% of the trees will be finite! Of the ones that do, the expected height is only 0.417. You either die young, our you live long enough to see yourself become infinite …

What happens for our original fair coin-toss?

Random tree depth (n=1000000, p=0.500000)
P(d < inf) = 0.999998
E[d | d < inf] = 22.581590

So, most trees seem to terminate, the expected height awkwardly hangs arounds 22. That’s no conclusive result, and the numbers change if we crank up the number of iterations. We need some theory to back up the empiricism.

Maths to the rescue

Our kind of branching tree experiment is an easy special case of a so called Galton-Watson process. In each generation, the nodes are allowed to have a random number of offspring with some probabilities $p_n$. In our case, that’s simply

$$p_0 = p,\,p_2 = 1-p$$

The deciding factor for the process is the expected number $\mu$ of offspring, which in our case is

$$\mu = 0\cdot p_0 + 2\cdot p_2 = 2-2p.$$

We can now apply some theorems:

If $\mu \leq 1$, the process terminates with probability 100%.

The depth $\tau$ of our trees would in the theory of these processes be called extinction time. $\mu \leq 1$ tells us the extinction time is always finite. We want to understand their exact distribution. What’s the probability that $\tau > n$? Proposition 5 says

  • if $\mu < 1$, $P(\tau > n)$ decays exponentially with $n$.
  • if $\mu = 1$, $P(\tau > n)$ decays only linearly with $n$.

We can always compute the expected value from these tail distributions with the formula

$$ \mathbb E[\tau] = \sum_{n=1}^\infty P(\tau \geq n) $$

This means that for $\mu < 1$, the expected tree depth has a finite value, whereas for $\mu = 1$, we get
$$\mathbb E[\tau] \approx \sum_{n=1}^\infty \frac 1 n = \infty.$$

Fair coin tosses leed to trees that all eventually terminate, but we can expect them to take arbitrarily long. As soon as we tweak the frequencies a bit towards termination, we get nice and finite trees.

tl/dr Never let QuickCheck throw fair coins. Tweak the frequencies.

Remark: The other idiomatic solution to the problem is to carry around a size parameter in the generating routine

genTree depth 
    | depth == 0 = return Leaf
    | depth > 0 = 
            [ return Leaf
            , Branch <$> (genTree (depth - 1)) <*> (genTree (depth - 1)) ] 

QuickCheck can even use the size parameters for its internal workings, so in the end we just declare

instance Arbitrary Tree where
    arbitrary = sized genTree

Leave a Reply

Your email address will not be published. Required fields are marked *