This is a common question among TLA+ beginners. There *can* be bugs in the imple...

throwawayjava · on Aug 31, 2019

> In practice, we don't do it.

There are lots of very large Coq code bases that work just fine with code extraction. Scalability of extraction isn't the problem.

Is it really true that TLA scales to large programs? Why isn't there a Compcert in TLA? Why isn't there a Flyspec in TLA? And how about stuff like CertiGrad, where just the spec is pretty involved?

My impression has always been that Coq/HOL scale and TLA-style systems break down once your spec gets big enough (let alone the impl).

pron · on Aug 31, 2019

> There are lots of very large Coq code bases that work just fine with code extraction. Scalability of extraction isn't the problem.

The largest programs ever verified end-to-end with Coq, Isabelle, or any other proof assistant correspond to about ~10KLOC of C, or less than 1/5 of jQuery (although such a small program could require up to 50K or even 100K lines of proofs). And even those certainly-non-trivial-yet-rather-tiny programs took years of efforts by experts. Moreover, concessions in the programs' design and choice of algorithms had to be made in order to make verification feasible at all.

And I agree that extraction doesn't suffer from a scalability problem. What does is the ability to verify the spec in the first place. If we could write proofs that correspond to programs with 100K, then extracting an executable would not be the isse, but we can't (or, at least, we don't know how).

> Is it really true that TLA scales to large programs?

Yes. Again, there's no magic: we can only verify so many lines of a formal spec (of which code is an example). So it's either a small program or a small high-level spec that corresponds to a large program. No known technique can affordably verify large (or even medium, or even non-small) programs, end-to-end, for arbitrary functional correctness properties.

> Why isn't there a Compcert in TLA?

There are programs fifty times the size of CompCert that used TLA+ (to be honest, most real-world programs are 50x the size of CompCert), but not for end-to-end verification; i.e. there is an informal translation of spec to code. Unlike Coq, which is usually used for research or by researchers, TLA+ is used in industry and by "ordinary" engineers without lengthy training in formal logic, so there aren't many papers written about its use, but here's one[1]. Ironically, another reason you don't see as many papers about verification in TLA+ as in Coq is that its use is much simpler. I once saw a paper about specifying and verifying a program's time-complexity in Coq. No one would ever write such a paper about TLA+, because it's just too easy; a TLA+ beginner could do it in their first week of learning (the main reason there's a paper about using TLA+ for PharOS is because they used TLAPS, the TLA+ proof assistant, rather than the TLC model checker, which made the work academically interesting).

In any event, if someone is interested in using any of those tools "in production," I recommend they give them all a try.

[1]: https://members.loria.fr/SMerz/papers/abz2016-pharos.html

throwawayjava · on Aug 31, 2019

Thanks, that's clarifying. You can do this sort of "lightweight" verification in basically any theorem prover. The difference you observe sounds like a difference in community emphasis rather than capability?

pron · on Aug 31, 2019

There is one big difference in tooling: Coq doesn't have a good model checker. While you can verify several thousand lines of code/spec, writing proofs for those could take months or years, while model-checking is pretty much pushing a button (and then letting the computer or cluster work, possibly for hours or even days), and you probably need to do this a few times, but the difference between having a model checker and not is often the difference between practically reasonable and not. TLA+ has a proof assistant, but people who use TLA+ in industry hardly ever use it because the small added confidence is simply not worth so much more work.

dwohnitmok · on Aug 31, 2019

TLA+ does have things that make it hard for you to modularize and reuse pieces of it in the large.

The first category of problems deal with tooling. TLA+ doesn't have a great story when it comes to automation. There is no package manager. Version control is trickier than it needs to be (it's not obvious what files you might want to commit to git and which you shouldn't when you first start working with TLA+). There isn't a lot of support for automatically running things at the commandline (the community mostly uses the TLA+ toolbox, which is a GUI-based IDE). You can and people do these things, but they feel like distinctly second-class citizens. Really TLA+ feels tied to the TLA+ toolbox, warts and all.

The second problem is language support. TLA+'s main way of breaking down problems into modules and then using one module as an interface for another (its notion of refinement) suffers from problems analogous to the ones faced when dealing with inheritance. In TLA+'s case, it is easy to add additional new states into a state machine defined in another module (this is TLA+'s celebrated idea of stuttering), but is rather more annoying to do the opposite, e.g. show that in your new module a single state is meant to correspond to a two states in another module (e.g. you have a transaction primitive that allows you to assume that two states can be done simultaneously).

This is complicated by the fact that the easiest verification technique in TLA+ is model-checking, which fares the worst when dealing with transitive chains of logical implication. TLA+'s model checker does not have a native notion of logical implication. Instead the way you e.g. prove the statement A => B is by effectively proving A and B but privileging A to be the search space that the model checker operates in. This means chains of the form A => B, B => C, C => D and therefore A => D (and all the intermediate stages) are somewhat annoying. You either prove each step with separate models along the chain and just let a human infer that this means A => D, or you prove the statement A => (B and C and D), without using any of the intermediate implications.

Without robust support for transitive implication, the option of using composition in the traditional composition vs. inheritance debate becomes much less palatable as an option when it comes to inheritance.

To some extent this is a limitation inherent in model checking and goes away with the proof based verifier for TLA+ TLAPS. But there isn't a collection, that I know of, of proved theorems for TLAPS like many other theorem provers.

I suspect again this is due to both the lack of a package manager in the TLA+ ecosystem as well as some not so ergonomic parts of TLAPS. TLAPS feels very magical. Its main way of proving things is what's often referred to as sledgehammer or auto in other theorem provers; write the next step in your proof and hope the computer is smart enough to infer it from the previous step. It feels magical when it works. When the computer can't figure something out, it can be a huge pain to figure out why. It doesn't feel like I'm mechanically building up a proof so much as it feels like I'm playing a guessing game with the computer to see which leaps of logic it'll accept. This gets better with time as you develop an intuition of which leaps work and which don't, but can still be painful. This is mainly because TLAPS itself isn't a prover, but rather delegates to the "magic" methods of other provers.

Finally set theory and first order logic, the theoretical basis of TLA+, sometimes makes showing equivalence between items (which is vitally necessary when dealing with code written by multiple individuals who might use different encodings for the same idea) very difficult. In particular, trying to use the fact that A is equivalent to B, therefore some predicate P has an exactly analogous P' on B is tricky. Another way of stating this is that TLA+ as a language makes it a bit difficult to leverage isomorphisms and homomorphisms to reduce the amount of work you have to do. This is somewhere where alternative logical systems can have a big leg up.

All that being said, among the technologies commonly referred to as "formal verification," I would say that TLA+ is the one I would most commonly recommend to people in business situations. This precisely because of what many people view as a flaw, that TLA+ is a formal specification language and not a way to actually write or verify actual running code.

Because TLA+ specs are divorced from your codebase, they require far less buy-in to see dividends. You don't have to rewrite your codebase in another language. You don't even have to change your coding practices.

Indeed TLA+'s closest competitor on most teams is also precisely what most teams lack, which is high-quality documentation.

This means you can integrate TLA+ into your work with extremely little risk. Try formalizing your mental model in TLA+. See if it reveals interesting conclusions. If it does, congratulations! TLA+ has been useful to you and you can keep it around as an artifact detailing your thoughts in the future. If not, delete your specs and no harm no foul. No code depends on its existence. You don't need some lengthy transition period to migrate systems away from "using" TLA+. You lose no more time than the amount you put into writing your spec.

hwayne · on Sept 1, 2019

> The first category of problems deal with tooling. TLA+ doesn't have a great story when it comes to automation. There is no package manager. Version control is trickier than it needs to be (it's not obvious what files you might want to commit to git and which you shouldn't when you first start working with TLA+). There isn't a lot of support for automatically running things at the commandline (the community mostly uses the TLA+ toolbox, which is a GUI-based IDE). You can and people do these things, but they feel like distinctly second-class citizens. Really TLA+ feels tied to the TLA+ toolbox, warts and all.

I currently see this as the biggest barrier to widespread adoption. Not the only barrier, but the biggest by far. A modernized CLI + exportable traces would make a huge difference.

pron · on Aug 31, 2019

> you to modularize and reuse pieces of it in the large.

There is no such thing as formal specifications "in the large." The largest software specification ever written in any language that could be fully verified is no more than a few thousand lines (although proofs can be very long). Having said that, I think TLA+ module system is fine; reminds me of ML.

> The second problem is language support. TLA+'s main way of breaking down problems into modules and then using one module as an interface for another (its notion of refinement) suffers from problems analogous to the ones faced when dealing with inheritance.

I don't think so, but if you're modularizing your specs you may be doing something wrong.

> but is rather more annoying to do the opposite, e.g. show that in your new module a single state is meant to correspond to a two states in another module

Right, but the problem here is fundamental. Verifying temporal quantifiers is co-NP-hard in the number of states.

> This means chains of the form A => B, B => C, C => D and therefore A => D (and all the intermediate stages) are somewhat annoying.

Yes, but still about 10x less annoying than writing proofs. :)

> Without robust support for transitive implication, the option of using composition in the traditional composition vs. inheritance debate becomes much less palatable as an option when it comes to inheritance.

Again, using composition is a hint that you may be doing it wrong.

> But there isn't a collection, that I know of, of proved theorems for TLAPS like many other theorem provers.

There are some basic ones here: https://github.com/tlaplus/v2-tlapm/tree/master/library

Deductive proofs are currently not a viable verification approach in industry (except in some small niches, and/or to tie together model-checking results), so libraries of that kind are mostly useful for researchers who research formal mathematics or deductive verification; that's not exactly TLA+'s target audience. Other proof assistants are used almost exclusively for research or by researchers so they're optimized for that.

> This is mainly because TLAPS itself isn't a prover, but rather delegates to the "magic" methods of other provers.

That's not why (you can choose specific Isabelle tactics in your BY clause). The reason is that the TLA+ proof language was designed to be declarative. I guess some people prefer a more imperative style.

> Another way of stating this is that TLA+ as a language makes it a bit difficult to leverage isomorphisms and homomorphisms to reduce the amount of work you have to do. This is somewhere where alternative logical systems can have a big leg up.

There are many things in TLA+'s design that make deductive verification harder, and many things in other language's designs that makes deductive verification harder. There is no one best way to do this.

I've written about this (a bit more generally) here: https://old.reddit.com/r/tlaplus/comments/abi3oz/using_tla_t...

I think the difference is that TLA+ was designed not for research but for industry use, and so its preferences take into account what's feasible. There is no point making something that few people would do easier at the expense of making something many would do harder.

> Because TLA+ specs are divorced from your codebase, they require far less buy-in to see dividends. You don't have to rewrite your codebase in another language. You don't even have to change your coding practices.

... And because there is no tool in existence that can verify those systems at the code-level anyway, so it's not like there's any other option. You get ~2000-5000 lines of formal spec. You can use them for code and get a tiny program or for a high-level spec formally "divorced" from your code.

dwohnitmok · on Sept 1, 2019

There is formal specification in the large if you count formal mathematics. Mizar's library of formalized mathematics is easily millions of lines if not tens of millions of lines at this point. There are individual formalized mathematics repositories in other languages (Coq and Agda included I think? It's been a while since I looked into that) that run into the hundreds of thousands of lines of code.

So there are pointers as to how to organize formal specs for extremely large projects.

The reason as far as I can tell for why there aren't large software verification projects as opposed to pure formal math is that the cost benefit analysis is wildly out of whack. It makes essentially no business sense to try to formally verify things past several thousand lines. This doesn't seem like a bright line to me though, e.g. that there is a technological inflection point at several thousand lines that prevents further scaling. Again formal mathematics repositories are a counterexample. It's just that because of the high cost of formal verification, there is a business ceiling at the equivalent of man hours willing to be spent somewhere in the vicinity of hundreds of thousands of lines of unverified code to millions of lines of unverified code and you can rapidly exhaust that budget with full formal verification (which can easily account for orders of magnitude increases in time spent).

Ah thanks for the TLAPS repo. I forgot about the library, but even there I'm not seeing a lot of reuse in it. There's some of it, but not a lot. Maybe just because users of TLAPS are far fewer than users of TLC.

Indeed you can choose specific Isabelle tactics in your BY line. In practice I'm not sure I've ever seen it in any TLAPS proof. I've very very briefly played with that functionality and at first glance it seems much more difficult than using Isabelle directly because of the added layer of indirection. If you have one in the wild, I'd love to take a look to learn from.

I definitely agree there is no one way to do things. I've left out all the things that I think TLA+ is fantastic at. I think it's the closest thing to mathematics that can be practically used in industry. Far more so than the rhetoric you usually hear about FP languages, TLA+ is immediately familiar to anyone with a pure mathematics background, both syntactically and semantically. TLA+, unlike something like Agda, also has a fairly straightforward set of semantics and mathematical grounding in its theory. The benefit of writing each spec from scratch allows them to be independently understandable with no need to understand the idiosyncrasies of arcane libraries or frameworks.

But I do think it has problems with modularity. My state example wasn't to do with the difficulty of verification or the explosion of the state space (which honestly don't bother me, my TLA+ specs are often exponential or worse wrt model checking time in the size of my constants, but I still get a lot of mileage out of them), but rather the difficulty in even expressing that idea. When I'm confronted with wanting one state to stand in for two I often have to resort to using two separate specs, one with fewer states and then the other being the original spec and then showing my new spec fulfills both. I think the modularity problems mostly stemming from tooling and language choices, but a little bit from theory (although Mizar is also set theory based). For example I don't think I've ever used a 3rd party module in a TLA+ spec I've written. Even trying to express the equivalence between two specs that just use different strings for the same states can be tricky. Wouldn't it be cool if people could import Lamport's Paxos spec directly into their own specs of their own systems? Or any other specs someone has written specifying one part of their system? That would be pretty awesome if there was something akin to an NPM repository of all these specs for various algos that you could import into your own spec. The tools for this exist in Mizar. They do not for TLA+.

Turning my attention back to deductive verification rather than the expression part of the equation, more specifically for TLAPS, I'm actually not a fan of the imperative tactics that are found in Coq. But an error message better than what amounts to red highlighting and "no" would be nice in TLAPS. Something where you can mix YOLOing with more hands-on proof building, with tactics, functions, or even just a trace of the constraint search would be great.

I think the idea that you have some limited thousands of line to "spend" is a useful bit of intuition. One knob you're leaving out though is the richness of the properties you're trying to prove. As you decrease the richness of the properties you can prove you can drastically decrease your burden. This starts bleeding into static analyzers and then if you dial the knob even further down into mainstream type systems. But the knob can be fairly low and still be useful.

For example absence of crashing/abnormal exits is something that is feasible to prove on very large codebases. It's hard to do completely end-to-end (even CompCert's front-end isn't proven), mainly because of the difficulty of fully modeling stuff like filesystems, but can be done for modules spanning hundreds of thousands of lines of code. It is of course a weak property. That the program does something is nowhere near the same thing as doing the correct thing. But it is nonetheless a very useful one.

Absence of SQL injection attacks and CSRF attacks are other examples (can be done either by external analysis or by leveraging language level features), although those feel a bit more like cheating since you basically cleverly restrict the amount of code that "matters."

That would be the closest to an alternative to formal specifications I could suggest. I still think formal specs are lower risk because this still requires coding in certain styles amenable to this static check (or languages that enforce it).

pron · on Sept 1, 2019

> There is formal specification in the large if you count formal mathematics.

Sure, but TLA+ is not aimed at research. The biggest problem, by far, plaguing not just proof assistants but formal logic in general, pretty much since its inception, is complexity. Which is ironic because the entire purpose of formal logic -- from Leibniz's original dream to Frege's early success -- is to be simple, clear and easy tool for practitioners. Compared to the dismal failure of formal logic in achieving that (which was acknowledged and lamented by the likes of von Neumann and Turing, but has since been shrugged off even though the problem has gotten considerably worse), I think the fact that people who are not trained logicians use TLA+ to do real work makes it a crowning achievement not only in software verification but in the history of formal logic. To achieve that, TLA+ had to shed away everything that isn't absolutely necessary. Lamport once said after hearing a lecture about some proposed addition to TLA+: "it's taken you 6 months to add something to TLA+ that it had taken me 15 years to remove".

> It makes essentially no business sense to try to formally verify things past several thousand lines.

Well, we don't know how to do that in a way that would be feasible; I consider that a "limit" because the software we use and need is very commonly three orders of magnitude ahead of what deductive verification can handle. And the gap has been constant or even growing for the past 40 years or so.

> Again formal mathematics repositories are a counterexample.

I disagree. Formal mathematics is about taking years to prove a proposition that could be written in one or a couple of lines. I think that the resemblance between deductive mathematical proofs and deductive software verification is like that between painting the Sistine Chapel and painting a large hospital. The similarity is superficial.

> If you have one in the wild, I'd love to take a look to learn from.

I wouldn't call it "in the wild", and it is used to bootstrap something fundamental (induction), but here: https://github.com/tlaplus/v2-tlapm/blob/master/library/Natu...

Also, search for "IsaM" here: https://github.com/tlaplus/v2-tlapm/blob/master/library/Well...

> But I do think it has problems with modularity. My state example wasn't to do with the difficulty of verification or the explosion of the state space ... but rather the difficulty in even expressing that idea.

Can you DM me on Twitter (get there from the about page of my blog: https://pron.github.io) or send me an email (my name at Google's mail service)? I'm curious to see the problem you're referring to.

> Wouldn't it be cool if people could import Lamport's Paxos spec directly into their own specs of their own systems?

Well, I don't yet understand the exact issue you're having, but no, I don't think that would be particularly helpful at all, for the following reason: If people wanted to modify Paxos itself, they can get their hands on Lamport's spec and tinker with it. If they want to use it as a component, then they can write an abstraction of it in a few line. All of TLA+ is about the abstraction/refinement relation (that is, IMO, what makes its theory cooler than Coq et al. for software).

> But an error message better than what amounts to red highlighting and "no" would be nice in TLAPS.

Yep.

> One knob you're leaving out though is the richness of the properties you're trying to prove. As you decrease the richness of the properties you can prove you can drastically decrease your burden. This starts bleeding into static analyzers and then if you dial the knob even further down into mainstream type systems. But the knob can be fairly low and still be useful.

Right, but we know more than to say it's just richness. We know exactly what kind of properties scale (through abstract interpretation, of which both type systems and static analysis are examples): inductive (= compositional) ones. I.e. a property such that if it holds for any state (whether or not reachable by the program), then any step in the program preserves it, or, alternatively, a property P, such that P(x) and P(y), where x and y are terms, implies that P(x ∘ y). Problem is that most functional correctness properties are not inductive/compositional for the language (i.e. that the above two description hold for any term in the language).

> That would be the closest to an alternative to formal specifications I could suggest.

I think that's one additional tool. But a knob that's easier to turn is soundness. The difference between verifying a property to 100% certainty and to 99.999% certainty can be 10x in cost, and as -- unlike mathematics -- an engineer is not interested in the correctness of an abstract algorithm but in that of a physical system that can always fail with some nonzero probability -- absolute confidence is never strictly a requirement. Of course, this is not just my opinion but it also seems to be that of many formal methods researchers, which is why concolic testing is such a hot topic right now.

I think that specification and verification must be separate -- i.e. you specify at any "depth" and then choose a verification strategy based on cost/benefit (so, no to dependent types), and in most cases, it's possible that concolic testing would be enough.

pron · on Sept 1, 2019

P.S.

1. The very large Coq/Agda texts are usually large because of the proofs, not the specification/propositions. There's a 5-10x ratio, if not more.

2. I don't think that TLA+'s theory is intrinsically more "straightforward" than Agda's, even though it's certainly more familiar (I know some people who would insist that type theory is more straightforward than set theory). I think that a lot of work has gone into making TLA+ relatively easy to grasp; I think it's more design than a specific choice of theory, although the choice of theory certainly affects the design. In addition, as I've already mentioned, I think that TLA itself (i.e. not the set theory in the "+") is a particularly beautiful theory for talking about computation and discrete systems, that has some obvious advantages over the lambda calculus; in particular the abstraction/refinement relation that's the core of TLA (the semantic domain of TLA is a Boolean lattice where the less-than relation is behavior refinement).

dunkelheit · on Aug 31, 2019

Thanks for the detailed answer. I can see immediate value of TLA+ in helping you clarify your high-level thinking about the system and finding bugs in it. Maybe the problem is the expectation that "verified by formal methods" is some magical stamp of approval that makes your program bug-free, when in reality it is just one of many tools to make your program better.

pron · on Aug 31, 2019

Exactly! (Except formal methods are not one but many methods and tools that are as different from one another as code review is from testing. Also, I wouldn't call TLA+ a "lightweight" formal method despite being relatively easy to learn and to apply; it's hard to get more heavyweight than TLA+.)

While we’re on the subject, I think it’s worthwhile to explain the name “formal methods”, and especially “formal” (methods just means techniques, in our case techniques for software verification), as I’ve seen people constantly misuse it.

Formal means mechanical, i.e. doable in principle by a computer (sometimes we say doable or checkable by a computer, but the two are equivalent from a computability perspective though not from a complexity perspective). So when you say you have a formal proof that does not mean you have a mathematical proof, but a proof that can be verified by a computer. Then why do we say "formal" instead of "mechanical"? Because the word formal was used for centuries before Alan Turing discovered that it is equivalent to "mechanical." The main idea of symbolic logic is that of the formula, or logical form; a sentence whose truth can be derived by the direct manipulation of its form (syntax) without any use of intuition or understanding. So, to this day, when we talk of things making use of symbolic (or formal) logic, we say formal rather than mechanical.

Now, if formal means mechanical and methods means techniques for software verification, are testing and type checking — both software verification techniques that are carried out by the computer — formal methods? Technically they are! But formal methods is also the name of an academic discipline, and so formal methods usually refers to whatever it is that people who research formal methods study, and type checking have not traditionally been a part of that discipline. What’s interesting that both of them — or, rather, certain forms of them — are now actually part of formal methods research. A few formal methods researchers study dependent type systems, type systems that can express a far richer set of program properties than simple type systems, and the study of concolic testing, a combination of symbolic and abstract interpretation/model-checking and concrete (i.e. “normal”) testing is a very hot topic in formal methods research at the moment. What's common to all topics in formal methods research is some use of formal logic.

noneeeed · on Sept 1, 2019

Nice summary.

I've not been involved with formal methods for 15 years (at what is now Altran UK) so I'm rather rusty.

Wasn't B an attempt at bridging the gap between specification and implementation by a process of gradual refinement, with each "layer" being record and verified? I only touched it while at uni, so I don't remember any of the details of how that was supposed to work. Did that ever go anywhere?

pron · on Sept 1, 2019

I don't know B, and I think you are correct, but it, too, doesn't break the "scalability barrier". It would allow you to generate small programs by a process of refinement, but not very large ones. I think it is aimed mostly at safety-critical realtime embedded systems (and I believe it's used mostly for railways), so it's extremely useful (like other model-checked languages for that domain, like SCADE), but not something that can significantly help mainstream software.

noneeeed · on Sept 2, 2019

Absolutely, none of the current formal methods approach scale to the level of much commercial software. Much of the systems engineering work that we did at Altran was focused on minimising the extent of the system that would need the higher levels of verification. So flow analysis and absence of runtime error proofs were commonplace, but proof of correctness against the spec was something that was much more focused on specific very high integrity components.

I remember thinking that B was a heck of a lot of work for event the smallest projects. I'm kind of surprised that it's still being used.