Errors occur in software every day. Not just on our laptops and desktop computers, but in embedded devices everywhere. An app on your phone crashes. Error screens (often BSODs) appear on displays at railway stations.
While these errors are annoying, they don’t have a high impact on our lives. Safety-critical systems are the complete opposite: there, an error risks lives and significant damage to the environment. In aviation, the DO-178B standard defines five failure condition categories, ranked by their impact on the safety of the aircraft, its crew and its passengers:
- Catastrophic: failure may cause the plane to crash.
- Hazardous: failure greatly reduces safety, causes potentially fatal injuries to passengers, or puts crew at risk of not operating the aircraft properly.
- Major: failure has an impact on safety, causes potential (non-fatal) injuries to passengers or increases crew workload.
- Minor: failure is noticeable and causes passenger inconvenience (but not injury).
- No effect: failure has no impact on safety, crew or passengers.
Each software system in the plane is then assigned a level according to the criticality of the failures it can contribute to (DO-178B calls these Level A, for catastrophic, down to Level E, for no effect). The higher the criticality, the stricter the development processes required before the software can be deployed, and the highest levels require certification by independent bodies.
Applications with different criticalities – especially at the higher rankings – were traditionally physically isolated: different hardware and different cables for systems of different criticalities. This was for cost and safety reasons. The higher the criticality of a piece of software, the more time and effort it takes to develop and certify. You don’t want to bundle functionality rated minor with functionality rated catastrophic, or you will have to apply the catastrophic-level certification process to the entire application. This quickly gets expensive.
The cost argument then reinforces the safety argument. If you don’t want to certify minor-criticality functionality to the level of catastrophic functionality, it must be isolated from the catastrophic functionality. Otherwise errors and bugs in the minor application could cause the catastrophic one to fail.
All of this makes sense. But it no longer works: physical hardware isolation is no longer practical. Enter mixed-criticality systems.
A mixed-criticality system is one that runs applications of different criticalities on the same physical hardware. The physical isolation of yesterday is lifted into the operating system – a very small, privileged, trusted piece of code that guarantees applications cannot cause each other to fail. By certifying the operating system, the obligation to do rigorous certification on low-criticality applications disappears.
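To make that concrete, here is a minimal sketch of the kind of static time and space partitioning such an operating system might enforce. Every name and number below is made up for illustration – it is loosely inspired by ARINC 653-style partition scheduling, not any real kernel’s API:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of a separation kernel's static partition
 * schedule. Each application runs in its own partition with a fixed
 * memory region and a fixed time slice. */

typedef enum { LEVEL_A, LEVEL_B, LEVEL_C, LEVEL_D, LEVEL_E } criticality_t;

typedef struct {
    const char   *name;
    criticality_t level;    /* DO-178B software level of the partition */
    uintptr_t     mem_base; /* private memory region, enforced by the MMU */
    size_t        mem_size;
    uint32_t      slice_ms; /* guaranteed CPU time per cycle */
} partition_t;

/* Stubs standing in for kernel internals (hypothetical). */
static void mmu_switch_to(uintptr_t base, size_t size) { (void)base; (void)size; }
static void run_partition_for(const char *name, uint32_t ms) { (void)name; (void)ms; }

/* A fixed schedule, decided at build time and certified once. */
static const partition_t schedule[] = {
    { "flight-control",     LEVEL_A, 0x10000000, 0x00100000, 20 },
    { "engine-monitor",     LEVEL_C, 0x10200000, 0x00100000, 10 },
    { "cabin-infotainment", LEVEL_E, 0x10400000, 0x00400000,  5 },
};

/* The kernel cycles through the table forever. A buggy or malicious
 * low-criticality partition can overrun neither its memory region nor
 * its time slice, so it cannot make a high-criticality partition fail. */
void kernel_main_loop(void) {
    for (;;) {
        for (size_t i = 0; i < sizeof schedule / sizeof schedule[0]; i++) {
            mmu_switch_to(schedule[i].mem_base, schedule[i].mem_size);
            run_partition_for(schedule[i].name, schedule[i].slice_ms);
        }
    }
}
```

The point is that the schedule and the memory map are fixed and simple enough to certify once; whatever runs inside a partition can then be as messy as it likes without threatening the others.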
There are several motivations for building mixed-criticality systems. The first is all about weight. Three factors have led to a massive growth in the number of processors in cars and planes:
- Software is flexible: a standard computer can do anything. To change what it does, you change the software, and if the system is connected to a network, this can be done remotely. Custom hardware and circuits, by contrast, have to be sent back to the manufacturer for physical changes.
- Hardware breaks: hardware failures, even in specialised hardware, are always a possibility. As a result, safety-critical systems carry back-up processors just in case.
- Software is useful: we do more with software than ever before. Just compare a smartphone with one of those 90s bricks: list the functionality of each and count how much more the smartphone does.
As a result, safety-critical systems contain a huge number of processors – a modern car has close to 100. Isolation only makes this worse: isolating tasks of different criticality means duplicating hardware and all of the cables that go with it. For a plane, the sheer weight is a huge cost in terms of fuel.
The “internet of things” includes your pacemaker
Another factor driving the demand for mixed-criticality systems is the internet – specifically, the internet of things. Consider the hypothetical development of a pacemaker (hypothetical because I don’t know much about real ones). Early pacemakers were probably hard-wired circuits. They couldn’t be updated without traumatic surgery to take them out and put new ones in; rather than take that risk, old pacemakers are left in place.
Time moves on. The pacemaker is now just a chip with some special hardware to regulate the heart. Why can’t we put a web-server on there? It would allow doctors to download reports of your heart rate over time. A pacemaker connected to the internet could let you view those reports and monitor your own health. It could even contact your doctor for you if something goes wrong.
This is a mixed-criticality system. A web-server is complex software that would cost far too much to certify. And in no way should the web-server be able to interfere with the pacemaker’s functionality: its own errors could kill you, and a hacker who compromised it to control your heart could kill you too. The system requires isolation, with controlled information flowing one way, from the pacemaker to the web-server.
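Here is a minimal sketch of what that one-way flow might look like, assuming a shared buffer that the kernel maps writable only into the pacemaker’s partition and read-only into the web-server’s. All names are illustrative, not any real device’s interface:

```c
#include <stdint.h>
#include <stdatomic.h>

/* Hypothetical one-way channel. The kernel maps this buffer writable
 * on the pacemaker's side only, so even a fully compromised web-server
 * can observe heart data but never influence the pacing logic. */

#define LOG_CAPACITY 256

typedef struct {
    _Atomic uint32_t head;          /* next slot the pacemaker writes */
    uint16_t samples[LOG_CAPACITY]; /* heart rate samples, beats per minute */
} heart_log_t;

/* High-criticality side: append a sample. It never reads anything back,
 * so no data flows from the web-server to the pacemaker. */
void pacemaker_publish(heart_log_t *log, uint16_t bpm) {
    uint32_t h = atomic_load_explicit(&log->head, memory_order_relaxed);
    log->samples[h % LOG_CAPACITY] = bpm;
    atomic_store_explicit(&log->head, h + 1, memory_order_release);
}

/* Low-criticality side: copy out the most recent samples for a report.
 * Its mapping is read-only, so a bug or exploit here cannot write. */
uint32_t webserver_snapshot(heart_log_t *log, uint16_t *out, uint32_t n) {
    uint32_t h = atomic_load_explicit(&log->head, memory_order_acquire);
    uint32_t count = h < n ? h : n;
    for (uint32_t i = 0; i < count; i++)
        out[i] = log->samples[(h - count + i) % LOG_CAPACITY];
    return count;
}
```

Note where the enforcement lives: isolation comes from the kernel’s memory mapping, not from the web-server behaving itself.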
Now think of all the other critical devices that could be part of the internet of things.
Image credit goes to hellofish