These are somewhat dated rules, and generally too restrictive for modern safety-critical development. For one thing, unbounded loops are situationally fine. The key distinction is that the program has to make forward progress regardless, which is not always the same thing as escaping the loop. Pool allocators have also become very common, so the once-blanket prohibition on "dynamic allocation" is applied a lot more flexibly now.
Many of the other rules broadly hold up though. You still want to avoid things like pointers because static analyzers are terrible at proving properties through them. This typically means lookup tables (LUTs) end up wrapped in the equivalent of an unsafe block though. Not ideal, and hopefully better tooling will eventually exist for these sorts of things.
> For one thing, unbounded loops are situationally fine. The key distinction is that the program has to make forward progress regardless, which is not always the same thing as escaping the loop.
The article somewhat alludes to this, although they don't use the notion of progress that you use, which is extremely important.
> This rule does not, of course, apply to iterations that are meant to be nonterminating—for example, in a process scheduler. In those special cases, the reverse rule is applied: It should be possible for a checking tool to prove statically that the iteration cannot terminate.
As a further searchable term, people can look up "total functions." Total functions must either terminate (loop ends) or be productive (always make forward progress). That as many functions as possible should be total is a pretty good rule of thumb in any codebase.
>> In those special cases, the reverse rule is applied: It should be possible for a checking tool to prove statically that the iteration cannot terminate.
This part made me pause for a moment, but then it clicked, and I'm remembering the rare and unusual experience of facing a process that escaped a loop that was supposed to go on indefinitely. A software equivalent of a transmission belt slipping off. Most frequently seen in video games, where it's mostly harmless - but imagining a scheduler in an embedded system quitting on me like this fills me with dread to an irrational extent.
> As a further searchable term, people can look up "total functions." Total functions must either terminate (loop ends) or be productive (always make forward progress).
That's a new one to me. I'm not deep into more advanced flavors of functional programming, but I thought total functions are defined as terminating in finite time for all inputs. I suppose it depends on whether you consider an infinite loop a successful termination at infinity or not, but personally I wouldn't, at least not when it's clear the function is plainly running in circles (as opposed to asymptotically approaching a value).
Yep, it's a bit terrifying to contemplate. If you're ever unfortunate enough to do safety critical systems, you'll have to remember that you have to protect against this not only at the software level with formal methods, but also at the hardware level to ensure that e.g. a bit flip in the CPU registers or the instruction sequence as it travels from memory also doesn't violate your safety model.
> I'm not deep into more advanced flavors of functional programming, but I thought total functions are defined as terminating in finite time for all inputs.
I was playing a bit fast and loose with wording since most notions of total functions are built on top of FP languages as you point out and therefore ultimately on the lambda calculus where we don't have a native notion of a loop.
Instead of loops in FP languages we have lists (and folds on the list or an unfold to generate the list) and other data structures that reify the notion of iteration.
Hence what I was really saying is that a total function can also generate an infinite data structure, which is only roughly like saying an infinite loop.
Totality is usually formally defined in terms of data and codata (or equivalently recursion and corecursion). The co- prefix may sound a bit scary, but an informal way of thinking about it is that data is always finite and codata can be infinite. The canonical example is that a list is an example of data and an infinite, lazy stream is an example of codata.
This seems like a lot of ceremony for no good reason if you consider an "infinite" data structure to be simply a finite data structure that stores functions, with the additional computation hidden behind a closure. And indeed you are still correct that an informal and usually sufficiently useful definition of totality for a function is that it terminates in finite time for all inputs (after all, even if you generate an "infinite" data structure, you still terminate in finite time and just hide the additional computation within a finite data structure).
It's still nice to have the data vs codata distinction though.
Practically speaking bringing in the notion of codata gives us a nice little heuristic for checking termination when creating an infinite data structure. Only types marked as codata are allowed to have infinite descent and when returning them from a function we have to always follow a certain syntactic pattern known as guarded recursion. Otherwise if we didn't have the notion of codata, a naive termination checker would have a hard time with checking termination of a function that recurses on itself, but only behind a lambda.
E.g.
// Pseudo-code
data class Stream(head: Int, rest: () => Stream)
// generateIncrementingStream(0) generates the infinite stream 0, 1, 2, ...
def generateIncrementingStream(start: Int): Stream = {
Stream(start, () => generateIncrementingStream(start + 1))
// This appears to a naive termination checker to never terminate
// You could try to simply check if the recursive call is inside a function
// but then you run into other issues if e.g. the entire function is called
// again and it becomes very thorny very quickly
// e.g. Stream(start, {(() => generateIncrementingStream(start + 1))(); () => generateIncrementingStream(start + 1)})
// never terminates even though all calls to generateIncrementingStream are
// contained behind lambdas. You might have to start doing some annoying
// symbolic computation.
}
vs
// Note that the recursive call to Stream in rest is no longer behind a lambda
// Under the hood this might be implemented with lambdas
codata class Stream(head: Int, rest: Stream)
def generateIncrementingStream(start: Int): Stream = {
// Much easier heuristic: all recursive calls must be inside a codata
// constructor, i.e. "guarded" by a constructor call hence "guarded recursion"
Stream(start, generateIncrementingStream(start + 1))
}
The general idea here is that a function that generates codata is productive and hence total if it generates a finite amount of the potentially infinite codata structure before its next recursive call. A nice heuristic that captures most useful cases is guarded recursion (of course because of the halting problem this is only a heuristic).
The usual heuristic is that for functions that generate data, you must have structural recursion, i.e. every recursive call must be on a piece of data "smaller" than the original input. On the other hand if you generate codata, you must have guarded recursion, i.e. every recursive call must be guarded behind a codata constructor.
This resolves some things quite nicely. E.g. if we have a recursive function that accepts a codata `Stream` and wants to return some data, e.g. a finite `List`, then we must have some other argument that gets smaller on every call. E.g.
// Taking in infinite codata and producing finite data
def take(stream: Stream, n: NaturalNum): List[Int] = {
n match {
// This passes totality checking because m is "smaller" than n, which
// here means we stripped away one level of NaturalNum constructors
case Successor(m) => Cons(stream.head, take(stream.rest, m))
case Zero => EmptyList
}
}
def streamIdentity(stream: Stream): Stream = {
// Since we're producing codata, we check for guarded recursion
// Passes totality checker because streamIdentity is guarded behind Stream
Stream(stream.head, streamIdentity(stream.rest))
}
def listIdentity<A>(list: List[A]): List[A] = {
list match {
case EmptyList => EmptyList
// Since we're producing data, we check for structural recursion
// Passes totality checker because the recursive call is on a smaller piece of data
case Cons(x, rest) => Cons(x, listIdentity(rest))
}
}
// Can't blindly produce finite data from an infinite piece of codata!
def nontotalStreamToList(stream: Stream): List[Int] = {
// Looking for structural recursion
// Does not pass totality checker because stream.rest is not considered to be smaller than stream since Stream is codata
Cons(stream.head, nontotalStreamToList(stream.rest))
}
// Can definitely produce infinite codata from finite data though
def padOutListWithInfiniteZeroes(initialSegment: List[Int]): Stream = {
initialSegment match {
case Cons(x, rest) => Stream(x, padOutListWithInfiniteZeroes(rest))
case EmptyList => Stream(0, padOutListWithInfiniteZeroes(initialSegment))
}
}
Bringing in the notion of codata also gives us some theoretical nicety. How do you express totality for a program that must run forever, e.g. a server? You say that the IO type a la Haskell is codata and that's that! Then you know that `main` having the `IO` type means that it can potentially run forever, but it will always generate a finite amount of stuff on every codata "iteration" and therefore is productive and is not just spinning its wheels in a pointless infinite loop that does nothing.
Rust solves the pointer escape problem via static analysis and annotations, so that’s feeling pretty dated too.
Similarly, allocation is fine with today’s large memories. You need to bound the request queue depth and the amount of memory allocated per request though.
I also question the idea that safety critical code is a special class that is distinct from normal code.
Buggy consumer devices lead to all sorts of deaths — after all, a one in a billion scenario is going to hit a few people per day on average — for example, look at the correlated 999 and 911 spam that hit iOS and (separately) android this week.
The problem with dynamic allocation (aside from the problem that it can fail) is that calls to malloc don't complete in a predictable time, so it's pretty much verboten in hard realtime apps. Pool allocators can be (though are not always) written to return in a bounded time, so if the pool is created at startup, yeah, it's not really dynamic allocation.
They're not meaningfully different from the perspective of the programmer. You call a malloc-like function and get some memory back. The total dynamic memory use of the application is simply capped. You can do this even with libc malloc in a standards compliant way (e.g. Broadcom used to ship a malloc implemented as a series of pool allocators over statically initialized arenas), it's just uncommon.
> You can do this even with libc malloc in a standards compliant way
This is what I do in nontrivial embedded projects. It makes things easier and less error-prone to replace malloc()/free()/etc implementations directly with ones that behave as needed. And then you don't have the standard implementations sitting around waiting for a new dev to accidentally use them.
I think they're being too restrictive on recursion. They're excluding it based on not being able to prove it will terminate--but you can do that with a time-to-die counter.
Stack is a dynamic memory of sorts and some languages (Rust) heavily optimize towards keeping things on stack so that your data types could turn out to be quite big.
It is hard to determine the max value for time-to-die counter if your stack footprint is hard to estimate due to nested structs/enums/unions/generics/optimizations.
If you're being that pedantic, C itself is not Turing-complete in the first place: pointers have finite size, which means that C can address only a bounded amount of memory, so it is a finite state machine (just with an extremely large number of states).
I don't think it's about that. Technically, everything is finite. But placing limits on the loops and restricting memory allocation bounds the complexity. It probably limits everything to P-time.
Without dynamic memory allocation? I think you'd need at least O(n) memory to store enough state to go exponential. But it's been some time since I did complexity theory; I could be wrong.
How many problems in safety critical systems need Turing completeness and how many are fine with something weaker, e.g. primitive recursion? My intuition is that most things are fine without full Turing completeness (as far as you can even apply terms about computability to systems with complex IO behavior).
> How many problems in safety critical systems need Turing completeness
Probably all, like the ones outside of critical systems.
But if you concentrate the complex behavior in a single layer, you will probably get a very thin layer, that you can call an OS or something like that, verify it very well, and have almost all the code follow a non-complete idiom.
There is something ironic about there being an obvious typo in the subtitle ("domponents" instead of "components").
Regardless, I think these are generally good guidelines, but I'd not call them rules. Sometimes you really need the C preprocessor. For example, for compile-time assertions in C99.
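A common C99 idiom for this: use the preprocessor to synthesize a typedef whose array size goes negative when the condition fails, turning the check into a compile error. The macro and names here are one illustrative variant, not the only way to write it (C11 later standardized `_Static_assert`):

```c
/* C99 has no _Static_assert, but the preprocessor plus a negative array
   size gives a compile-time check. Names are illustrative. */
#define STATIC_ASSERT(cond, name) \
    typedef char static_assert_##name[(cond) ? 1 : -1]

STATIC_ASSERT(sizeof(int) >= 4, int_is_at_least_32_bits);
/* STATIC_ASSERT(sizeof(char) == 2, would_fail); -- compile error if enabled */

int main(void) { return 0; }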
Each function should have, on average, at least two assertions that cannot be checked statically, and the calling code must try to recover from the failure.
So much for writing code that doesn’t suffer from the halting problem, or that has high branch coverage!
Vendors will implement full state machines in the preprocessor. Generated code will include the same header 2 or 3 times, because each time through, the state machine has changed and different code is exposed.
They are the rules for JPL, where they are sending shit to other planets and have 1 incredibly expensive shot to get the software right on the first try. The software cannot have an unrecoverable error condition and the safety critical portions of the code like that used for landing the spacecraft need to work on the first shot no matter what goes wrong. Even other spacecraft don’t need to have as strict rules but this is a different playing field
Perhaps not everyone knows what AUTOSAR is. https://en.m.wikipedia.org/wiki/AUTOSAR. Think: automotive brake controllers, that type of shit. It is safety critical.
My point isn't that the rules apply universally, it's that something found to be a good practice in a safety critical industry is flagrantly discarded in another, without good reason. Literal state machines in the preprocessor. Metaprogramming in C. But because it is generated code, it is "ok."
The automotive industry is weird like that. They started doing things a way, got stuck in a groove and now if you push back against it they look at you like you killed their dog.
I should really find a different field, for my mental well being.
First and most important, the title claims these are safety-critical rules, but the conclusion says that they are only being used on mission-critical projects. I get the feeling that there might have been some late-change copyediting done here, as the content is far too generalizing to be taken seriously in any specific safety critical context, and hopefully the author would have known that.
* Agree that many safety-critical coding standards are a grab-bag of sometimes dubiously-valuable rules.
* Agree with the implication that stylistic guidelines are almost totally worthless, and may distract reviews from their core task of finding functional errors.
* The document states that manual review of large cases may be infeasible, but I have never worked on a project that didn't mandate it. Drudgery goes hand-in-hand with safety-critical, and mind-numbing close review is an accepted element of this.
* Agree that fewer rules are good, but the author doesn't justify why 10 is the right number. Unfortunately this is a field where correctness trumps easiness or simplicity, and any class of error with unacceptable risk to life/limb must be prevented. Thus a ruleset cannot justify itself simply by being small.
* In fact, there are many wordings in this document that are generally-unacceptable because they are admitting incompleteness in the general case (and therefore unsuitability for any safety-critical purpose), such as "Although such a small set of rules cannot be all-encompassing" or "In return, it should be possible to demonstrate more convincingly that critical software will work as intended"
commentaries on specific rules:
* rule 4: "No function should be longer than what can be printed on a single sheet of paper". In my experience, readability is not always improved by limiting function length. The main goal of safety-critical code is to be correct (and hopefully obviously so). Sometimes splitting up code is harmful to this. (Also, this is actually a stylistic rule of the kind rejected in the 2nd paragraph of the document.)
* rule 5: In the safety-critical domain, defensive checks and "assertions" are usually considered at low-level design. It's generally unacceptable for a coder to "just" insert defensive behaviours into the implementation, because all such behaviours must be analyzed for correctness. I don't disagree with the idea behind this rule, but a coding standard is the wrong place for it.
> Agree that many safety-critical coding standards are a grab-bag of sometimes dubiously-valuable rules.
What frustrates me is that most standards are self-aware of this and explicitly allow you to tailor them, but hardly anyone does that to a meaningful extent.
Do that and you are just creating targets for the next security auditor that you need any sort of certification or approval from.
In the end you'll choose the path of least resistance, which is to slavishly obey every rule on the checklist. It's not that people don't want to tailor the rules. They try at first, and then it gets beaten out of them.
It's doable sometimes when there's a tailoring framework. Here's a publicly available example [1]. Though I admit that it's usually easier to do in the US than in the European Union, so your experience may vary.