Complex systems will have accidents

The railway accident in Valencia on the 9th of September was caused by a number of unlikely circumstances all happening at the same time. The article Learning From Accidents and a Terrorist Attack analyses accidents in civil engineering systems, with the aim of learning from the mistakes and finding ways to apply the same principles to software projects. Here are some conclusions that I drew from the article:

  • In complex systems, the danger lies not in the individual components but in the way they are assembled. In software development, unit tests check the components and integration tests check the system as a whole.
  • Individual systems will fail. The coupling between systems should be minimized so that if one fails it does not affect the others. We must try to prevent incidents from becoming accidents: for example, an error in a date conversion routine should not be able to compromise an entire database.
  • Parts should be designed so that it is impossible to assemble them incorrectly. This has implications for the design of APIs.
  • It’s essential to learn from mistakes. In civil engineering, information about mistakes is shared and published in books and reports. In the software industry, information about mistakes is often hidden from the outside world, and often from other teams within the same company.
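
To make the points about coupling and about parts that cannot be assembled incorrectly more concrete, here is a minimal sketch of my own in Python. It is not taken from the article, and every name in it (Year, Month, Day, make_date, convert_rows) is hypothetical. It assumes a date conversion routine that receives raw (year, month, day) tuples: the wrapper types and keyword-only arguments make it hard to swap day and month by accident, and bad rows are set aside instead of being passed on to a database.

```python
from dataclasses import dataclass
from datetime import date


# Hypothetical wrapper types: a Month cannot be passed where a Day is expected.
@dataclass(frozen=True)
class Year:
    value: int


@dataclass(frozen=True)
class Month:
    value: int


@dataclass(frozen=True)
class Day:
    value: int


def make_date(*, year: Year, month: Month, day: Day) -> date:
    """Build a date from explicitly labelled parts (keyword-only arguments)."""
    for arg, expected in ((year, Year), (month, Month), (day, Day)):
        if not isinstance(arg, expected):
            raise TypeError(
                f"expected {expected.__name__}, got {type(arg).__name__}"
            )
    # datetime.date raises ValueError for out-of-range values such as month 13.
    return date(year.value, month.value, day.value)


def convert_rows(rows):
    """Convert raw (year, month, day) tuples.

    Rows that fail conversion are collected separately, so one bad row
    (an incident) does not corrupt the whole batch (an accident).
    """
    good, rejected = [], []
    for row in rows:
        try:
            y, m, d = row
            good.append(make_date(year=Year(y), month=Month(m), day=Day(d)))
        except (TypeError, ValueError) as exc:
            rejected.append((row, str(exc)))
    return good, rejected


if __name__ == "__main__":
    good, rejected = convert_rows([(2006, 9, 9), (2006, 13, 1)])
    print("stored:", good)
    print("rejected:", rejected)
```

Python only checks the types when the code runs, so this design makes incorrect assembly fail early and loudly instead of making it truly impossible. A statically typed language could reject the same mistake at compile time.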

The software industry has to become much more mature in the way it handles accidents and avoids errors, and should try to apply the lessons that other engineering disciplines have already learned.