From Tests to Properties

Today's livecode is here.

(Brown CSCI 1710) Logistics

The first assignment goes out today: it's a homework, in Python. If you're not familiar with Python, you should check out our optional Python lab. The homework uses a library called Hypothesis, which we'll see some of in lecture today. You'll also be asked to use ChatGPT to generate code. Next Wednesday's lab will also use ChatGPT.

Please fill out the lab preference form!

Where are we going?

We'll talk about more than just software soon. For now, let's go back to testing. Most of us have learned how to write test cases. Given an input, here's the output to expect. Tests are a kind of pointwise specification; a partial one, and not great for fully describing what you want, but a kind of specification nonetheless. They're cheap, non-trivially useful, and better than nothing.

We talked a bit last time about how tests aren't always good enough: they carry our biases, they can't cover an infinite input space, etc. But even more, they're not always adequate carriers of intent: if I write assert median([1,2,3]) == 2, what exactly is the behavior of the system I'm trying to confirm? Surely I'm not writing the test because I care specifically about [1,2,3] only, and not about [3,4,5] in the same way? Maybe there was some broader aspect, some property of median I cared about when I wrote that test. What do you think it was? What makes an implementation of median correct?

Think, then click!

There might be many things! One particular idea is that, if the input list has odd length, the median needs to be an element of the list.

There isn't always an easy-to-extract property for every unit test. But this idea---of encoding goals instead of specific behaviors---forces us to start thinking critically about what we want from a system and helps us to express it in a way that others (including, perhaps, LLMs) can use. It's only a short hop from there to some of the real applications we talked about last time, like verifying firewalls or modeling the Java type system.
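To make this concrete, here's one way to phrase that median property as plain code (just a sketch with a few hand-picked lists; we'll see a better way to feed it inputs shortly):

from statistics import median

def check_median_property(xs):
    # If the list has odd length, the median should be an element of the list.
    if len(xs) % 2 == 1:
        assert median(xs) in xs

# The property, not the specific lists below, is what carries our intent.
check_median_property([1, 2, 3])
check_median_property([3, 4, 5])
check_median_property([7, 1, 9, 2, 5])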

Sometimes, you can test exhaustively!

Sometimes the input space is small enough that exhaustive testing works well. This blog post, entitled "There are only four billion floats," is an example.

Notice that this, too, is a different kind of testing from what you're used to.
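Here's a toy sketch of the exhaustive idea (this is not the example from the blog post; the round-trip functions are invented purely for illustration). When the input space has only 2^16 values, we can simply check them all:

# Exhaustively check a round-trip property over every 16-bit unsigned integer.
def encode(x: int) -> bytes:
    return x.to_bytes(2, "big")

def decode(b: bytes) -> int:
    return int.from_bytes(b, "big")

for x in range(2**16):
    assert decode(encode(x)) == x  # holds for every possible input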

A New Kind of Testing

Cheapest Paths

Consider the problem of finding cheapest paths in a weighted graph. There are quite a few algorithms you might use: Dijkstra, Bellman-Ford, even a plain breadth-first search for an unweighted graph. You might have implemented one of these for another class!

The problem statement seems simple: take a weighted graph and two vertex names as input, and produce the cheapest path between those two vertices in the graph. But it turns out that this problem hides a lurking issue.

Find the cheapest path from vertex G to vertex E on the graph below.

Think, then click!

The path is G to A to B to E.

Great! We have the answer. Now we can go and add a test case for cheapest with that graph as input and (G, A, B, E) as the output.

Wait -- you found a different path? G to D to B to E?

And another path? G to H to F to E?

If we add a traditional test case corresponding to one of the correct answers, our test suite will falsely raise alarms for correct implementations that happen to find different answers. In short, we'll be over-fitting our tests to one specific implementation: ours. But there's a fix. Maybe instead of writing:

cheapest(GRAPH, G, E) == [(G, A), (A, B), (B, E)]

we write:

cheapest(GRAPH, G, E) == [(G, A), (A, B), (B, E)] or
cheapest(GRAPH, G, E) == [(G, D), (D, B), (B, E)] or
cheapest(GRAPH, G, E) == [(G, H), (H, F), (F, E)]

What's wrong with the "big or" strategy? Can you think of a graph where it'd be unwise to try to do this?

Think, then click!

There are at least two problems. First, we might have missed some possible solutions, which is quite easy to do; the first time Tim was preparing these notes, he missed the third path above! Second, there might be an unmanageable number of equally correct solutions. The most pathological case might be something like a graph with all possible edges present, all of which have weight zero. Then, every path is cheapest.

This problem -- multiple correct answers -- occurs in every part of Computer Science. Once you're looking for it, you can't stop seeing it. Most graph problems exhibit it. Worse, so do most optimization problems. Unique solutions are convenient, but the universe isn't built for our convenience.

What's the solution? If test cases won't work, is there an alternative? (Hint: instead of defining correctness bottom-up, by small test cases, think top-down: can we say what it means for an implementation to be correct, at a high level?)

Think, then click!

In the cheapest-path case, we can notice that the costs of all cheapest paths are the same. This enables us to write:

cost(cheapest(GRAPH, G, E)) == 11

which is now robust against multiple implementations of cheapest.
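For example, here's a minimal sketch of that check in Python. The graph and its weights are invented for illustration (they are not the graph from the figure); edges map to costs, and a path is a list of edges, as in the snippets above:

# An invented example graph in which all three G-to-E paths cost 11.
GRAPH = {
    ('G', 'A'): 4, ('A', 'B'): 3, ('B', 'E'): 4,
    ('G', 'D'): 5, ('D', 'B'): 2,
    ('G', 'H'): 3, ('H', 'F'): 4, ('F', 'E'): 4,
}

def cost(graph, path):
    # Sum the weights of the edges along a path (a list of edges).
    return sum(graph[edge] for edge in path)

# Any of the three correct answers passes the very same check.
for path in ([('G', 'A'), ('A', 'B'), ('B', 'E')],
             [('G', 'D'), ('D', 'B'), ('B', 'E')],
             [('G', 'H'), ('H', 'F'), ('F', 'E')]):
    assert cost(GRAPH, path) == 11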

This might be something you were taught to do when implementing cheapest-path algorithms, or it might be something you did on your own, unconsciously. (You might also have been told to ignore this problem, or not told about it at all...) We're not going to stop there, however.

Notice that we just did something subtle and interesting. Even if there are a billion cheapest paths between two vertices in the input graph, they all have that same, minimal cost. Our testing strategy has just evolved past naming specific values of output to checking broader properties of the output.

Similarly, we can move past specific inputs and randomly generate them. Then, write a function is_valid that takes an arbitrary (input, output) pair and returns true if and only if the output is a valid solution for the input. Just pipe in a bunch of inputs, and the function will try them all. You can apply this strategy to almost any problem, in any programming language. (For your homework this week, you'll be using Python.)
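Sketched as a loop, the strategy looks something like this (random_input, the implementation under test, and is_valid here are placeholders to fill in for whatever problem you're working on):

import random

def random_input():
    # For example: a random non-empty list of small integers.
    return [random.randint(-100, 100) for _ in range(random.randint(1, 20))]

def run_trials(implementation, is_valid, num_trials=1000):
    # Pipe a bunch of random inputs through the implementation under test,
    # and check each (input, output) pair with the validity function.
    for _ in range(num_trials):
        inp = random_input()
        out = implementation(inp)
        assert is_valid(inp, out), f"Property violated on input: {inp}"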

Let's be more careful, though. Is there something else that cheapest needs to guarantee for that input, beyond finding a path with the same cost as our solution?

Think, then click!

We also need to confirm that the path returned by cheapest is indeed a path in the graph!

Here's a sketch of an is_valid function for cheapest path:

Think, then click!
is_valid : input: (graph, vertex, vertex), output: list(vertex) -> bool
  returns true IFF:
    (1) output.cost == trusted_implementation(input).cost
    (2) every vertex in output is in input's graph
    (3) every step in output is an edge in input
    ... and so on ...
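Rendered as (hedged) Python, it might look like the sketch below. Here the graph is assumed to be a dict from edges to weights, a path is a list of edges (matching the earlier snippets), and the trusted solver is passed in as a function; the edge checks come before the cost check so that summing weights can't fail on a bogus edge:

def is_valid(inp, output_path, trusted_implementation):
    graph, source, destination = inp
    # Every step in the output must be an edge of the input graph.
    if any(edge not in graph for edge in output_path):
        return False
    # The steps must chain together, starting at source and ending at destination.
    chain = [source] + [v for (_, v) in output_path]
    if not all(edge[0] == chain[i] for i, edge in enumerate(output_path)):
        return False
    if chain[-1] != destination:
        return False
    # The total cost must match the cost of the trusted solution.
    trusted_path = trusted_implementation(graph, source, destination)
    return (sum(graph[e] for e in output_path)
            == sum(graph[e] for e in trusted_path))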

This style of testing is called Property-Based Testing (PBT). When we're using a trusted implementation---or some other artifact---to either evaluate the output or to help generate useful inputs, it is also a variety of Model-Based Testing (MBT).

Model-Based Testing

There are a lot of techniques under the umbrella of MBT. If time permits, we'll talk more about this later in the semester. For now, know that modeling systems, which is what we'll be doing for much of this class, can be used to help generate good tests.

There are a few questions, though...

Question: Can we really trust a "trusted" implementation?

No, not completely. It's impossible to reach a hundred percent trust; anybody who tells you otherwise is selling something. Even if you spend years creating a correct-by-construction system, there could be a bug in (say) how it is deployed or connected to other systems.

But often, questions of correctness are really about the transfer of confidence: my old, slow implementation has worked for a couple of years now, and it's probably mostly right. I don't trust my new, optimized implementation at all: maybe it uses an obscure data structure, or a language I'm not familiar with, or maybe I don't even have access to the source code at all.

And anyway, often we don't need recourse to any trusted model; we can just phrase the properties directly.

Exercise: What if we don't have a trusted implementation?

You can use this approach whenever you can write a function that checks the correctness of a given output. It doesn't need to use an existing implementation (it's just easier to talk about that way). In the next example we won't use a trusted implementation at all!

Question: Where do the inputs come from?

Great question! Some we will manually create based on our own cleverness and understanding of the problem. Others, we'll generate randomly.

Random inputs are used for many purposes in software engineering: "fuzz testing", for instance, creates vast quantities of random inputs in an attempt to find crashes and other serious errors. We'll use that same idea here, except that our notion of correctness is usually a bit more nuanced.

Concretely:

A diagram of property-based testing. A random input generator, plus some manually-chosen inputs, are sent to the implementation under test. The outputs are then run through the validator function.

It's important to note that some creativity is still involved here: you need to come up with an is_valid function (the "property"), and you'll almost always want to create some hand-crafted inputs (don't trust a random generator to find the subtle corner cases you already know about!). The strength of this approach lies in its resilience against problems with multiple correct answers, and in its ability to mine for bugs while you sleep. Did your random testing find a bug? Fix it, and then add that input to your list of regression tests. Rinse, repeat.

If we were still thinking in terms of traditional test cases, this would make no sense: where would the outputs come from? But we've now created a testing setup where concrete outputs aren't something we need to provide. Instead, we check whether the program under test produces any valid output.

Hypothesis

There are PBT libraries for almost every popular language. For your homework, you'll be using a library for Python called Hypothesis. Hypothesis has many helper functions to make generating random inputs relatively easy. I want to spend the rest of class stepping you through using the library. Let's test a function in Python itself: the median function in the statistics library, which we began this chapter with. What are some important properties of median?

CSCI 1710: LLMs and Testing

We're all going to be using LLMs to generate code in the future. But (like humans) LLMs aren't infallible. Knowing how to test will be more important than ever. That's why your first homework in 1710 starts by asking you to generate code using an LLM of your choice, such as ChatGPT. Then, you'll use property-based testing to assess its correctness.

To be clear, you will not be graded on the correctness of the code you prompt an LLM to generate. Rather, you will be graded on how good your property-based testing is.

Later in the semester, you'll be using PBT again to test more complex software.

Now let's use Hypothesis to test at least one of those properties. We'll start with this template:

from hypothesis import given, settings
from hypothesis.strategies import integers, lists
from statistics import median

# Tell Hypothesis: inputs for the following function are non-empty lists of integers
@given(lists(integers(), min_size=1)) 
# Tell Hypothesis: run up to 500 random inputs
@settings(max_examples=500)
def test_python_median(input_list):    
    pass

# Because of how Python's imports work, this if statement is needed to prevent 
# the test function from running whenever a module imports this one. This is a 
# common feature in Python modules that are meant to be run as scripts. 
if __name__ == "__main__": # ...if this is the main module, then...
    test_python_median()

Let's start by filling in the shape of the property-based test case:

def test_python_median(input_list):    
    output_median = median(input_list) # call the implementation under test
    print(f'{input_list} -> {output_median}') # for debugging our property function
    if len(input_list) % 2 == 1:
        assert output_median in input_list 
    # The above checks a conditional property. But what if the list length is even?
    #   (and can we write a stronger property, regardless?)
    # ...

After some back and forth, we might end up somewhere like this:

def test_python_median(input_list):
    output_median = median(input_list)
    print(f'{input_list} -> {output_median}')
    if len(input_list) % 2 == 1:
        assert output_median in input_list
    
    # Question 1: What's going wrong? The _property_ seems reasonable...
    lower_or_eq =  [val for val in input_list if val <= output_median]
    higher_or_eq = [val for val in input_list if val >= output_median]
    assert len(lower_or_eq) >= len(input_list) // 2    # // ~ floor
    assert len(higher_or_eq) >= len(input_list) // 2   # // ~ floor
    # Question 2: Is this enough? :-)

As the questions in the code hint, there's still a problem with this. What do you think is missing?

Notice how being precise about what correctness means is very important. With ordinary unit tests, we're able to think about behavior pointwise; here, we need to describe our goals more broadly. There are advantages to that work: comprehension, more powerful testing, better coverage, and so on.
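One practical note, tying back to the "regression tests" advice above: Hypothesis lets you pin specific inputs so they are always run alongside the random ones, via its example decorator. A sketch, reusing our median test:

from hypothesis import given, settings, example
from hypothesis.strategies import integers, lists
from statistics import median

@given(lists(integers(), min_size=1))
@settings(max_examples=500)
@example([1, 2, 3])        # a hand-crafted input we always want to check
@example([7, 7, 7, 7])     # e.g., an input that once exposed a bug
def test_python_median(input_list):
    if len(input_list) % 2 == 1:
        assert median(input_list) in input_list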

There's more to do, but hopefully this gives you a reasonably nice starting point.