Archive

Posts Tagged ‘safety-critical’

The Expense Of Defense

November 19, 2011

The following “borrowed” snippet from a recent Steve Vinoski QCon talk describes the well-worn technique of defensive programming:

Steve is right, no? He goes on to deftly point out the expenses of defensive programming:

Steve is right on the money again, no?
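
To make the idiom concrete, here is a minimal C++ sketch of the defensive style (my own illustration with hypothetical names, not the snippet from Steve's talk): every pointer, length, and range is re-validated at the boundary, and every caller grows its own error-handling ladder.

```cpp
#include <cstdio>
#include <cstddef>

// A hypothetical sensor-parsing routine written in the defensive style:
// every pointer, length, and range is re-checked at every boundary, and
// every caller must inspect an error code.
enum Status { OK, NULL_ARG, BAD_LENGTH, BAD_RANGE };

Status read_sample(const char* buf, std::size_t len, double* out)
{
    if (buf == nullptr || out == nullptr) return NULL_ARG;   // distrust the caller
    if (len == 0 || len > 64)             return BAD_LENGTH; // paranoid size check

    double value = 0.0;
    if (std::sscanf(buf, "%lf", &value) != 1) return BAD_RANGE;
    if (value < -1000.0 || value > 1000.0)    return BAD_RANGE; // paranoid range check

    *out = value;
    return OK;
}

int main()
{
    double v = 0.0;
    // ...and every call site sprouts its own error-handling ladder.
    switch (read_sample("42.5", 4, &v)) {
        case OK:         std::printf("sample = %.1f\n", v); break;
        case NULL_ARG:   std::printf("null argument\n");    break;
        case BAD_LENGTH: std::printf("bad length\n");       break;
        case BAD_RANGE:  std::printf("out of range\n");     break;
    }
    return 0;
}
```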

Erlang and the (utterly misnamed) Open Telecom Platform (OTP) were designed to obviate the need for the defensive programming “idiom” of error detection and handling. Erlang/OTP’s built-in features force programmers to address the unglamorous error detection/handling aspect of a design up front, instead of at the tail end of the project, where it’s often considered a nuisance and a second-class citizen.

Even in applications where error detection and handling is taken seriously up front (safety-critical and high-availability apps), the time to architect, design, code, and test the equivalent “homegrown” capabilities can equal or exceed the time to develop the app’s “fun” functionality. That is, unless you use Erlang.
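
For a sense of what those “homegrown” capabilities entail, here is a toy C++ sketch (my own illustration, not anything from OTP or from a real product) of the simplest possible restart-on-failure wrapper, the kind of machinery an Erlang/OTP supervisor provides for free:

```cpp
#include <cstdio>
#include <functional>
#include <stdexcept>

// A toy "supervisor": run a worker and, if it throws, restart it up to
// max_restarts times. Erlang/OTP supervisors give you this (plus restart
// strategies, escalation, and real process isolation) out of the box;
// hand-rolling a production-grade equivalent is a project in itself.
void supervise(const std::function<void()>& worker, int max_restarts)
{
    for (int attempt = 0; attempt <= max_restarts; ++attempt) {
        try {
            worker();
            return;                         // worker finished normally
        } catch (const std::exception& e) {
            std::printf("worker died (%s), restart %d of %d\n",
                        e.what(), attempt + 1, max_restarts);
        }
    }
    std::printf("giving up: restart intensity exceeded\n");
}

int main()
{
    int runs = 0;
    supervise([&runs] {
        if (++runs < 3) throw std::runtime_error("transient fault");
        std::printf("worker succeeded on run %d\n", runs);
    }, 5);
    return 0;
}
```

Even this toy version ignores restart strategies, back-off, escalation, and process isolation; adding those is where the schedule goes to die.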

Is It Safe?

November 1, 2011

Assume that we place a large, socio-technical system designed to perform some safety-critical mission into operation. During the performance of its mission, the system’s behavior at any point in time (which emerges from the interactions between its people and machines) can be partitioned into two zones: safe and unsafe.

Since the second law of thermodynamics, like all natural laws, is impersonal and unapologetic, the system’s behavior will eventually drift into the unsafe system behavior zone over time. Combine the effects of the second law with poor stewardship on the part of the system’s owner(s)/designer(s), and the frequency of trips into scary-land will increase.

The system’s behavior veering off into the unsafe zone doesn’t guarantee that an “unacceptable” accident will occur. If mechanisms to detect and correct unsafe behaviors are baked into the system design at the outset, the risk of an accident occurring can be lowered substantially.
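
As a deliberately tiny sketch of what such a baked-in mechanism might look like (a C++ illustration of my own, not a design from any particular system): an independent monitor checks a set of safety invariants every cycle and commands a pre-defined safe state the moment one is violated.

```cpp
#include <cstdio>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy safety monitor: a set of named invariants is checked every cycle;
// if any invariant no longer holds, the system is commanded into a
// pre-defined safe state instead of drifting further into the unsafe zone.
struct Invariant {
    std::string name;
    std::function<bool()> holds;
};

class SafetyMonitor {
public:
    void add(Invariant inv) { invariants_.push_back(std::move(inv)); }

    // Returns true if the system is still in the safe zone this cycle.
    bool check(const std::function<void()>& enter_safe_state) const {
        for (const auto& inv : invariants_) {
            if (!inv.holds()) {
                std::printf("invariant violated: %s -> entering safe state\n",
                            inv.name.c_str());
                enter_safe_state();
                return false;
            }
        }
        return true;
    }

private:
    std::vector<Invariant> invariants_;
};

int main()
{
    double speed = 620.0;                           // hypothetical sensed value
    SafetyMonitor monitor;
    monitor.add({"speed within envelope", [&] { return speed < 600.0; }});
    monitor.check([&] { speed = 0.0; /* commanded safe state */ });
    return 0;
}
```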

Alas, “safety”, like all the other important, non-functional attributes of a system, is often neglected during system design. It’s unglamorous, it costs money, it adds design/development/test time to the sacred schedule, and the people who “do safety” are under-respected. The irony here is that after-the-accident, bolt-on safety detection and control kludges cost more and are less trustworthy than doing it right in the first place.

Out With The Old And In With The New

October 26, 2011

Nancy Leveson has a new book coming out titled “Engineering A Safer World” (the full draft of the book is available here: EASW). In the beginning of the book, Ms. Leveson asserts that the conventional assumptions, theory, and techniques for analyzing accidents and building safe systems (FMEA, Failure Modes and Effects Analysis; FTA, Fault Tree Analysis; PRA, Probabilistic Risk Assessment) are antiquated and obsolete.

The expert, old-guard mindset in the field of safety engineering is still stuck on the 20th-century notion that systems are aggregations of relatively simple, electro-mechanical parts and interfaces; hence the steadfast fixation on FMEA, FTA, and PRA. By contrast, most 21st-century safety-critical systems are now designed as massive, distributed, software-intensive systems.

As a result of this emerging, brave new world, Ms. Leveson starts off her book by challenging the flat-earth assumptions of yesteryear:

Note that Ms. Leveson tears down the former truth of reliability == system_safety. After proposing her set of new assumptions, Ms. Leveson goes on to develop a new model, theory, and set of techniques for accident analysis and hazard prevention.

Since the subject of safety-critical systems interests me greatly, I plan to write more about her novel approach as I continue to progress through the book. I hope you’ll join me on this new learning adventure.

Traceability Woes

June 1, 2011

For safety-critical systems deployed in aerospace and defense applications where people’s lives may be at stake, traceability is often talked about but seldom done in a non-superficial manner. Usually, after talking up a storm about how diligent the team will be in tracing concrete components like source code classes/functions/lines and hardware circuits/boards/assemblies up the spec tree to the highest level of abstract system requirements, the trace structure that ends up being put in place is often “whatever we think we can get by with for certification by internal and external regulatory authorities”.

I don’t think companies and teams willfully and maliciously screw up their traceability efforts. It’s just that the pragmatics of diligently maintaining a scrutable traceability structure from ground zero back up into the abstract requirements cloud gets out of hand rather quickly for any system of appreciable size and complexity. The number of parts, types of parts, interconnections between parts, and types of interconnections grows insanely large in the blink of an eye. Manually creating, and more importantly, maintaining, full bottom-to-top traceability evidence in the face of the inevitable change onslaught that’s sure to arise during development becomes a huge problem that nobody seems to openly acknowledge. Thus, “games” are played by both regulators and developers to dance around the reality and pass audits. D’oh!

To illustrate the difficulty of the traceability challenge, observe the specification tree below. During development, the tree (even when agile and iterative feedback loops are included in the process) grows downward and the number of parts that comprise the system explodes.

In “theory”, as the tree expands, multiple traceability tables are rigorously created in pay-as-you-go fashion while the information, knowledge, and understanding are still fresh in the minds of the developers. When “done” (<- LOL!), an inverse traceability tree with explicit trace tables connecting levels, like the example below, is supposed to be in place so that any stakeholder can be 100% certain that the hundreds of requirements at the top have been satisfied by the thousands of “cleanly” interconnected parts on the bottom. Uh yeah, right.
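
To give a feel for what machine-checkable trace tables might look like, here is a tiny C++ sketch (my own illustration with made-up requirement IDs, not any project's actual tooling) of a single level-to-level trace table plus a check that flags items with no downward trace:

```cpp
#include <cstdio>
#include <map>
#include <set>
#include <string>
#include <vector>

// One trace table per adjacent pair of levels in the spec tree: each
// higher-level item maps to the set of lower-level items that satisfy it.
using TraceTable = std::map<std::string, std::set<std::string>>;

// Report higher-level items with no downward trace -- the holes that
// auditors are supposed to look for.
std::vector<std::string> untraced(const TraceTable& table)
{
    std::vector<std::string> holes;
    for (const auto& entry : table)
        if (entry.second.empty())
            holes.push_back(entry.first);
    return holes;
}

int main()
{
    // Hypothetical system-requirement -> software-requirement table.
    TraceTable sys_to_sw = {
        {"SYS-001", {"SW-010", "SW-011"}},
        {"SYS-002", {"SW-020"}},
        {"SYS-003", {}},                  // nobody got around to this one yet
    };

    for (const auto& hole : untraced(sys_to_sw))
        std::printf("no downward trace for %s\n", hole.c_str());
    return 0;
}
```

Automating even this trivial check starts to matter once the tables hold thousands of rows and change daily.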

Our Stack

November 18, 2010

The model below shows “our stack”. It’s a layered representation of the real-time, distributed software system that two peers and I are building for our next product line. Of course, I’ve intentionally abstracted away a boatload of details so as not to give away the keys to the store.

The code that we’ve written so far (the stuff in the funky green layers) is up to 50K SLOC of C++, and we guesstimate that the system will stabilize at around 250K SLOC. In addition to this computationally intensive back end, our system also includes a Java/PHP/HTML display front end, but I haven’t been involved in that area of development. We’ve been continuously refactoring our “App Support Libs” layer as we’ve found turds in it via testing the multi-threaded, domain-specific app components that use it.

The figure below illustrates the flexibility and scalability of our architecture. From our growing pool of application layer components, a large set of product instantiations can be created and deployed on multiple hardware configurations. In the ideal case, to minimize end-to-end latency, all of the functionality would be allocated to a single processor node, like the top instantiation in the figure. However, at customer sites where processing loads can exceed the computing capacity of a single node, voila, a multi-node configuration can be seamlessly instantiated without any source code changes – just XML file configuration parameter changes. In addition, as the bottom instantiation shows, a fault-tolerant, auto-failover, dual-channel version of the system can be deployed for high-availability, safety-critical applications.
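
A rough sketch of how that configuration-driven instantiation might work is below (my own illustration with made-up component and node names; the real system reads an XML file, which a plain map stands in for here): a deployment descriptor assigns components to nodes, and each node starts only what it has been assigned.

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Hypothetical deployment descriptor: which application components run on
// which processor node. In the real system this comes from the XML
// configuration file; a plain map stands in for the parsed result here.
using Deployment = std::map<std::string, std::vector<std::string>>;

// Start only the components assigned to this node. The same binary runs
// everywhere, so scaling from one node to many is a configuration change,
// not a code change.
void bring_up_node(const std::string& node_id, const Deployment& deployment)
{
    const auto it = deployment.find(node_id);
    if (it == deployment.end()) {
        std::printf("node %s: nothing assigned\n", node_id.c_str());
        return;
    }
    for (const auto& component : it->second)
        std::printf("node %s: starting %s\n", node_id.c_str(), component.c_str());
}

int main()
{
    // Single-node product: every component on node "A".
    Deployment single = {{"A", {"ingest", "correlate", "track", "report"}}};
    // Two-node product: same components, split across nodes "A" and "B".
    Deployment dual   = {{"A", {"ingest", "correlate"}},
                         {"B", {"track", "report"}}};

    bring_up_node("A", single);
    bring_up_node("A", dual);
    bring_up_node("B", dual);
    return 0;
}
```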

Are you working on something as fun as this?
