2023.10,11: Counterexamples To Induction

You'll need this exercise template for these notes, which cover 3 class sessions. The completed version, complete with in-class comments, is here.

We prototyped some confidence tests about the reachable predicate (livecode file).

CSCI 1710: curiosity modeling ideas

Curiosity modeling signups are going out soon. Read over others' ideas! Post your own!

CSCI 1710: endorsed Ed posts

I endorse public posts where there's an answer I think could be broadly useful. For example, these (as of today) are useful for physical keys:

Forge Performance

Some of you encountered bad Forge performance in the n-queens lab. I think it's useful to discuss the problem briefly. Forge works by converting your model into a boolean satisfiability problem. That is, it builds a boolean circuit where inputs making the circuit true satisfy your model. But boolean circuits don't understand quantifiers, and so it needs to compile them out.

The compiler has a lot of clever tricks to make this fast, and we'll talk about some of them around mid-semester. But if it can't apply those tricks, it uses a basic idea: an all is just a big and, and a some is just a big or. And this very simply conversion process increases the size of the circuit exponentially in the depth of quantification.

Here is a perfectly reasonable and correct way to approach part of this week's lab:

pred notAttacking {
  all disj q1, q2 : Queen | {
    some r1, c1, r2, c2: Int | {
      // ...
    } } }

The problem is: there are 8 queens, and 16 integers. It turns out this is a pathological case for the compiler, and it runs for a really long time. In fact, it runs for a long time even if we reduce the scope to 4 queens. The default verbosity option shows the blowup here, in "translation":

:stats ((size-variables 410425) (size-clauses 617523) (size-primary 1028) (time-translation 18770) (time-solving 184) (time-building 40)) :metadata ())
#vars: (size-variables 410425); #primary: (size-primary 1028); #clauses: (size-clauses 617523)        
Transl (ms): (time-translation 18770); Solving (ms): (time-solving 184)

The time-translation figure is the number of milliseconds used to convert the model to a boolean circuit. Ouch!

Instead, we might try a different approach that uses fewer quantifiers. In fact, we can write the constraint without referring to specific queens at all -- just 4 integers.

How?

Does the identity of the queens matter at all, if they are in different squares?

If you encounter bad performance from Forge, this sort of branching blowup is a common cause, and can often be fixed by reducing quantifier nesting, or by narrowing the scope of what's being quantified over.

Induction

When we're talking about whether or not a reachable state violates a desirable property $P$ (recall we sometimes say that if this holds, $P$ is an invariant of the system), it's useful to think geometrically. Here's a picture of the space of all states, with the cluster of "good" states separated from the "bad":

If this space is large, we probably can't use trace-finding to get a real proof: we'd have to either reduce the trace length (in which case, maybe there's a bad state just barely out of reach of that length) or we'd be waiting until the sun expands and engulfs the earth.

Traces are still useful, especially for finding shallow bugs, and the technique is used in industry! But we need more than one tool in our bag of tricks.

Step 1: Initiation or Base Case

Let's break the problem down. What if we just consider reachability for traces of length $0$ ---that is, traces of only one state, an initial state?

This we can check in Forge just by asking for a state s satisfying {initial[s] and wellformed[s] and not P[s]}. There's no exponential blowup with trace length since the transition predicates are never even involved! If we see something like this:

We know that at least the starting states are good. If instead there was a region of the starting states that overlapped the bad states, then we immediately know that the property isn't invariant.

Step 1.5: Noticing and Wondering

We can surely also check whether there are bad states within $1$ transition. We're using the transition predicate (or predicates) now, but only once. Forge can also do this; we ask for a pair of states s0, s1 satisfying {initial[s0] and someTransition[s0, s1] and not P[s1]} (where someTransition is my shorthand for allowing any transition predicate to work; we could write the predicate ourselves).

If Forge doesn't find any way for the second state to violate $P$ , it means we have a picture like this:

It's looking promising! Note that in general, there might be overlap (as shown) between the set of possible initial states and the set of possible second states. (For example, imagine if we allowed a doNothing transition at any time).

If we keep following this process exactly, we'll arrive back at the trace-based approach: a set of 3rd states, a set of 4th states, and so on. That's sometimes useful (and people do it in industry---if you're interested, you can see the original paper by Biere, et al.) but it won't scale to systems with very long traces.

Sometimes great ideas arise from dire limitations. What if we limit ourselves to only ever asking Forge for these two state examples? That would solve the exponential-blowup problem of traces, but how can we ever get something useful, that is, a result that isn't limited to trace length 1?

I claim that we can (often) use these small, efficient queries to show that $P$ holds at any finite length from a starting state. But how?

By no longer caring whether the pre-state of the check is reachable or not.

Step 2: Consecution or Inductive Case

We'll ask Forge whether {P[s0] and someTransition[s0, s1] and not P[s1]} is satisfiable for any pair of states. Just so long as the pre-state satisfies $P$ and the post-state doesn't. We're asking Forge if it can find a transition that looks like this:

If the answer is no, then it is simply impossible (up to the bounds we gave Forge) for any transition predicate to stop property $P$ from holding: if it holds in the pre-state, it must hold in the post-state.

But if that's true, and we know that all initial states satisfy $P$ , then all states reachable in $1$ transition satisfy $P$ (by what we just checked). And if that's true, then all states reachable in $2$ transitions satisfy $P$ also (since all potential pre-states must satisfy $P$ ). And so on: any state that's reachable in a finite number of transitions must satisfy $P$ .

If you've seen "proof by induction" before in another class, we've just applied the same idea here. Except, rather than using it to show that the sum of the numbers from $1$ to $n$ is $\frac{k ( k + 1 )}{2}$ , we've just used it to prove that $P$ is invariant in our system.

In Tic-Tac-Toe, this would be something like "cheating states can't be reached with legal moves". In an operating system, this might be "two processes can never modify a block of memory simultaneously". In hardware, it might be "only one device has access to write to the bus at any point". In a model of binary search, it might be "if the target is in the array, it's located between the low and high indexes".

For most computer-scientists, I think that this feels like a far more relatable and immediately useful example of the induction principle. That's not to dismiss mathematical induction! I quite like it (and it's useful for establishing some useful results related to Forge). But multiple perspectives enrich life.

Exercise: Try it!

In the binary-search model, there's a pair of tests labeled initStep and inductiveStep, under a comment that says "Exercise 1". Fill these in using the logic above, and run them. Do they both pass? Does one fail?

What if Forge finds a counterexample?

What if Forge does find a transition that fails to preserve our property $P$ ? Does it mean that $P$ is not an invariant of the system?

Think, then click!

No! It just means that $P$ isn't inductively invariant. The pre-state that Forge finds might not itself be reachable!

This technique is a great way to quickly show that $P$ is invariant, but if it fails, we need to do more work.

We see this happen when we try to run the above checks for our binary-search model! The inductiveStep test fails, and we get a counterexample.

Fix 1: Adding Reasonable Assumptions

Sometimes there are conditions about the world that we need in order for the system to work at all. For example, we already require the array to be sorted. Are there other requirements?

It's possible that the counterexample we're shown involves a subtle bug in binary search: if the array is large relative to the maximum integer in the system, the value (high+low) can overflow.

We're not trying to fix this problem in the algorithm. This is a real bug, and modeling found it. So let's log the issue and add a global assumption that the array is not so large. We'll add that in the safeArraySize predicate, which we'll then use in the 2 checks. How can we express what we want---that the array is not too large?

Think, then click!

The core challenge here is that we'll never have enough integers to actually count #Int. However, we can ask for the maximum integer---max[Int]. So we could say that arr.lastIndex is less than divide[max[Int], 2]. This might be a little conservative, but it guarantees that the array is never larger than half of the maximum integer.

Exercise: Try expressing this in the safeArraySize predicate, and adding it to the test that failed. Add a comment explaining why this is a requirement.

Fix 2: Enriching the Invariant

Sometimes the property we're hoping to verify is simply not preserved by system transitions in general, but the pre-state in any counterexample isn't actually reachable. This is the tradeoff of the inductive approach! We have to do some work to make progress.

Concretely, we'll need to verify something stronger. At the very least, if $S$ is the pre-state of the counterexample we were given, we need to also add "and this state isn't $S$ " to the property $P$ we're checking. In practice, we usually add something more general that covers whatever quality it is about $S$ that makes it unreachable.

The technique is sometimes called "enriching the invariant".

Exercise: Do you believe the counterexample you're getting is reachable? If not, what is it about the state that looks suspicious?

Exercise: Now add new constraints to the bsearchInvariant predicate that exclude this narrow criterion. Be careful not to exclude too much! An over-constrained property can be easy to verify, but may not actually give many guarantees.

But Can We Trust The Model?

What would it mean for this verification idea if there were simply no initial states, or no way to take a certain transition? That would probably be a bug in the model; how would it impact our proof?

Look again at the two checks we wrote. If initial were unsatisfiable by any state, surely the Step 1 check would also be unsatisfiable (since it just adds more constraints). Similarly, unsatisfiable transition predicates would limit the power of Step 2 to find ways that the system could transition out of safety. This would mean that our confidence in the check was premature: Forge would find no initial bad states, but only because we narrowed its search to nothing!

This problem is called vacuity, and I'll give you another example. Suppose I told you: "All my billionaire friends love Logic for Systems". I have, as far as I know anyway, no billionaire friends. So is the sentence true or false? If you asked Forge, it would say that the sentence was true---after all, there's no billionaire friend of mine who doesn't love Logic for Systems...

This is a problem you might first hear about in other courses like 0220, or in the philosophy department. There's a risk you'll think vacuity is silly, or possibly a topic for debate among people who like drawing their As upside down and their Es backwards, and love writing formulas with lots of Greek letters in. Don't be fooled! Vacuity is a major problem even in industrial settings like Intel, because verification tools are literal-minded. (Still doubtful? Ask Tim to send you links, and check out the EdStem post at the top of these notes.)

At the very least, we'd better test that the left-hand-side of the implication can be satisfied. This isn't a guarantee of trustworthiness, but it's a start. And it's easy to check with Forge.