Investigating Failures in Complex Systems

Posted on September 10, 2015

by Andrew Bugera, EIT

Some recent highly-publicized failures of complex systems have reminded all of us that failures can still occur even when the design team puts forth their best effort to prevent them. For example, the SpaceX CRS-7 mission’s launch vehicle and payload were destroyed during launch near the end of June 2015 due to the reported failure of a single component. SpaceX reported that their engineering teams had “spent thousands of hours […] to understand that final 0.893 seconds prior to loss of telemetry.” Fortunately, the SpaceX vehicle was not carrying any people on board at the time of the accident.

RIVA, a fully-automated IV compounding system, produces IV syringes and bags for use in hospitals. While RIVA may not be as complex as a SpaceX Falcon 9 rocket, it is critical that the doses it produces are made correctly. Our customers and their patients depend on RIVA to maintain and enhance the safety of pharmacy compounding operations.

Not all failures are as destructive as the CRS-7 launch, though. Confusing user interface behaviour is often considered a failure, since it makes the system more difficult to use. When deciding what type of response is warranted when a failure occurs, it is necessary to determine the root cause of the failure. While some failures in complex systems are due to design or implementation issues (traditionally referred to as “bugs”), these are not the only causes of failures. One of the legendary stories in computer science is that RADM Grace Hopper helped to popularize the term “debugging” after her colleagues discovered an actual moth inside a relay in their Harvard Mark II electromechanical computer in the 1940s. Other failures occur due to mechanical problems, consumable abnormalities, or operator error. While this article primarily discusses the investigation of software failures, many of the same techniques can be applied to other types of system failures as well.

Often when you need to investigate a failure related to your product, you will not be the one who experienced it. The system may be in a remote location, and even if an end user witnessed the failure, they will likely not understand the system as deeply as its designers do. While you can sometimes get valuable information from descriptions provided by end users, logs created by software can be much more detailed and specific, leading to faster diagnosis and resolution of the issue. Logs can take the form of time-series data for values of interest, text-based logs of software decisions and actions, or other information stored in database records.

When investigating a failure, one of the early steps must be to collect all of the available objective data related to the situation. Software logs, database records, or photos of hardware can be invaluable when attempting to understand what has occurred, and they can also help rule out other situations which did not occur. Descriptions of what happened from a user perspective can also be useful, but they are often more subjective and may not have all of the detail of a log or photo.

Once there is a source of data available, establishing an overall timeline will help clarify the context around the failure. For example, I frequently visualize the operation of RIVA in my head while reviewing log files; making notes in a single timeline helps to keep things clear. As you are reviewing what happened on your timeline, keep an eye open for things which seem unusual or unexplained. If a user is trying to use a particular feature, for example, you may want to focus your review on the use of that feature and make sure everything went as expected. If you see that something unexpected did happen, you now have a clue which tells you where to keep digging.
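Building that single timeline often means interleaving logs from several subsystems that each record their own entries in chronological order. As a minimal sketch (the subsystem names, timestamps, and messages here are hypothetical, not actual RIVA log content), already-sorted per-source logs can be merged into one ordered timeline:

```python
from datetime import datetime
import heapq

# Hypothetical log entries from two subsystems; each source is already
# sorted by timestamp, so heapq.merge can interleave them lazily.
motion_log = [
    (datetime(2015, 9, 10, 14, 0, 5), "motion", "Arm moved to filling station"),
    (datetime(2015, 9, 10, 14, 0, 9), "motion", "Gripper opened unexpectedly"),
]
fluid_log = [
    (datetime(2015, 9, 10, 14, 0, 3), "fluid", "Pump started"),
    (datetime(2015, 9, 10, 14, 0, 7), "fluid", "Pressure spike detected"),
]

def build_timeline(*sources):
    """Interleave per-subsystem logs into one chronological timeline."""
    return list(heapq.merge(*sources, key=lambda entry: entry[0]))

timeline = build_timeline(motion_log, fluid_log)
for ts, subsystem, message in timeline:
    print(f"{ts.isoformat()}  [{subsystem}] {message}")
```

Reading the merged timeline top to bottom, an unexpected entry (like the gripper opening above) stands out with its surrounding context intact, which is exactly the clue that tells you where to keep digging.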

Since each situation is different, one cannot assume that a universal process will be able to determine the cause of every failure. I have found the following tools and techniques to be effective when trying to understand failures in a complex system like RIVA:

  • Good logging infrastructure

When investigating a failure in a remote, complex system, it can be difficult to understand what happened unless you have good logging infrastructure in place. When I am designing software with logging functionality, I try to log descriptions of what decisions are being made by the software in addition to the actions occurring. While you may be able to reverse-engineer which logical path the software has taken by reading logs of actions alone, I find it to be much easier when the log explicitly tells you why the software chose to do something.
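To make the idea concrete, here is a small sketch of what logging the “why” alongside the “what” can look like. The function, logger name, and dose-selection rule are hypothetical illustrations, not the actual RIVA implementation:

```python
import logging

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("compounder")  # hypothetical logger name

def select_syringe_size(dose_ml, available_sizes):
    """Pick the smallest syringe that holds the dose, logging the reasoning."""
    candidates = sorted(s for s in available_sizes if s >= dose_ml)
    if not candidates:
        # Log the decision *not* to proceed, and the reason why.
        log.warning("No syringe can hold %.1f mL (available: %s); rejecting order",
                    dose_ml, available_sizes)
        return None
    choice = candidates[0]
    # Log why this size was chosen, not just that it was chosen.
    log.info("Chose %d mL syringe for %.1f mL dose: smallest of %s that fits",
             choice, dose_ml, candidates)
    return choice

select_syringe_size(3.5, [1, 5, 10])
```

With entries like these, a reviewer reading the log months later does not need to reverse-engineer the selection rule from the actions alone; the log states the decision criteria directly.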

  • Custom log viewing tools which can interpret application-specific information

Since the RIVA software is multi-threaded and many different operations may be performed in parallel by the system, log review tools which provide filtering controls are incredibly useful. Being able to see log entries related to a single consumable or software thread makes it easier to focus an investigation on the story of an individual item moving through the system, if needed.
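A filtering tool of this kind can be quite simple at its core. The sketch below assumes a hypothetical structured log format (timestamp, thread name in brackets, optional consumable ID in braces) rather than RIVA’s actual format, and filters entries down to a single thread or item:

```python
import re

# Hypothetical line format: "<timestamp> [<thread>] {<consumable_id>}? <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\S+) \[(?P<thread>[^\]]+)\](?: \{(?P<item>[^}]+)\})? (?P<msg>.*)$"
)

def filter_log(lines, thread=None, item=None):
    """Yield only the entries matching the given thread and/or consumable ID."""
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that don't parse
        if thread and m.group("thread") != thread:
            continue
        if item and m.group("item") != item:
            continue
        yield line

sample = [
    "09:00:01 [loader] {SYR-042} Syringe detected on input tray",
    "09:00:02 [fluidics] Pump primed",
    "09:00:03 [loader] {SYR-043} Syringe detected on input tray",
    "09:00:05 [fluidics] {SYR-042} Filling started",
]
for entry in filter_log(sample, item="SYR-042"):
    print(entry)
```

Filtering on `item="SYR-042"` collapses interleaved multi-threaded output into the story of one syringe moving through the system, which is the view that usually matters during an investigation.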

  • Graphs or charts of time-series data

Some of the most convincing information I have presented to other developers has come in the form of time-series data shown in a graph or chart. Whether showing consumable measurements for items being loaded into RIVA or performance metrics, these types of visualizations are useful for showing patterns which occur over time or step-changes in a value which is expected to be nearly constant. The other nice thing about using this type of representation is that you can also see the effects of changes you make. If you make a change to the system to attempt to resolve an issue, you can collect additional data, add it to your graph, and see whether your change had the impact you were expecting.
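Even before plotting, a step change in a nearly constant value can be located programmatically. This is a minimal sketch using made-up weight readings and a simple before/after running-mean comparison, not any method used in the RIVA software:

```python
# Hypothetical weight readings (grams) for a consumable that should be
# nearly constant; a step change partway through suggests a process shift.
readings = [20.1, 20.0, 20.2, 20.1, 20.0, 23.4, 23.5, 23.3, 23.4, 23.5]

def find_step_change(values, threshold=1.0):
    """Return the split index where the before/after means differ the most,
    or None if no split exceeds the threshold."""
    best_i, best_diff = None, threshold
    for i in range(1, len(values)):
        before = sum(values[:i]) / i
        after = sum(values[i:]) / (len(values) - i)
        diff = abs(after - before)
        if diff > best_diff:
            best_i, best_diff = i, diff
    return best_i

idx = find_step_change(readings)
```

Once a candidate split point is identified, plotting the series with that point marked makes the pattern obvious to other developers, and re-running the same check on data collected after a fix shows whether the change had the expected impact.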

Investigating failures is much easier when clear, trustworthy data is easily available. Each RIVA unit locally logs many kinds of data about its operation, including consumable weight and identification data, environmental monitoring readings, and system operation logs. For customers who allow us to do so, we generate statistical summaries based on this data to determine which issues cause the most impact to key performance indicators such as overall equipment effectiveness.

If you have a complex system and struggle when investigating failures, consider investing some time into logging improvements so that frequently-used data is easily available. The easier it is for you to find the cause of the problem, the faster you can start working on a solution.
