Too late now


Check the implementation, not the deployment, not the abstract spec

2024-04-24

There are now a few papers covering distributed systems model checking (FlyMC, SandTable). These take one of two approaches:

The benefits seem to be either:

But why can't we have both? Well, I think we can.

The reason the first approach is slow is that it is actually running the system with all of the ancillary components. Meanwhile, the second approach is fast because it isn't the actual implementation, rather its just some abstract model. The second approach also has problems ensuring that the implementation and abstract model remain synchronised and describing the same thing.

So, a solution: model-check the implementation.

The implementation of a distributed system should likely revolve around an actor-based system. Each actor takes a message, processes it and optionally produces a message for another actor. This is a clean interface that can run on a network, but doesn't need to, it can run in memory. For disk and other ancillary components we just need the implementation to support swapping them out for in-memory simpler implementations; the abstact model wasn't modelling them anyway.

The actors can be combined together with a modelled network in-memory, producing a model-checkable implementation. Using a sensible language the in-memory storage components can be maintained using copying of the memory and optimizations made to them for checking.

This new strategy, true implementation-level model checking can be both things we want:

Notably, optimising the implementation in this scenario can doubly improve performance, once for model checking, and once for the deployments. Also, engineers who know how to write code can just continue in the same language, just needing to express their integration tests as model properties to be checked.

If you want to check out an example of this that I work on, see Themelios, a model-checked reimplementation of Kubernetes.

Caveats: