Root Cause Analysis

Recent events with the large Toyota recall for problems with the software controlling inputs from the brake and accelerator (which are no longer mechanically linked to the systems they control) highlight one of my pet peeves in the approach a company takes toward software development. Whether Toyota is guilty of this or not, I have no idea, but it’s a serious issue.

There is a related topic, which is whether software development is treated as software or as electronics. I’d like to deal with that separately, because it’s equally important (and usually, equally tied to a corporate approach to product development).

Regardless of how the software in a system is designed or architected, it is ultimately coded, reviewed and debugged by coders. When management drives coders to meet goals such as adding features or closing open issues, it is easy for them to lose sight of one of the most essential parts of software development: root cause analysis.

Root cause analysis has a motto: “fixes are cheap, bugs are precious.” When actual reproducible problems are found, they need to be cherished and treated as precious opportunities to find defects in design or in implementation. But for middle managers, this might mean that the programmer will be spending a few extra hours investigating a problem that could be fixed with a simple workaround. Although there is often the justification of “we can come back to that later,” we all know that especially in these crunch economic times, later never comes.

Sometimes a major defect manifests early warning symptoms in a seemingly innocuous bug which is only an annoyance. If a programmer feels empowered to follow the natural instinct to fully understand the problem, he/she can put some additional effort into chasing it down. Of course, this can be done in excess because while it’s nice to pursue zero defects in software, we have to get products out or do whatever pays the bills. But having the right balance in any software development organization is definitely the responsibility of leaders and management.

Managers need to understand the importance of root cause analysis, and need to encourage the front-line coders to pursue leads, because who knows? The next annoying little glitch you find in your test program’s results could be the next recall of 6 million vehicles, and a lone vigilant coder could be the one who went out on a limb to find it.

Commenting code: the 6-month rule

Yesterday I had a discussion about what I think is important in software engineering. My single biggest complaint in reading code (sometimes my own code) is that the author didn’t take the time to write concise, clear comments.

Use what I call the “6-month rule”: you’re writing or modifying a section of code. You’ve been immersed in it for hours or even days, and at the time you know exactly why you’re adding that code or making that change. It all seems obvious (or it should; programming on purpose is the topic for another post). But what if you didn’t look at or think about this project for six months, and in the meantime your memory was erased? Maybe it was a Vulcan mindwipe or maybe you just had millions of lines of other code to work on in the meantime, but you come back to that code and you can’t even remember if you were the one who wrote it.

So now, ask yourself: what concise, relevant comments would help you get back up to speed as quickly and accurately as possible?

There are a few do’s and don’ts that I’ve found helpful:

  • Do connect comments to the code they apply to. Big block comments have their place but often go unread, and therefore run the risk of becoming untrue. One-liners in exactly the right place can be moved along with the code.
  • Do be concise. Say what you need to say to help your six-month-ahead analog know what’s going on and do the right thing. You probably won’t be less busy in six months than you are now so don’t make your future self take more precious seconds than needed to grasp the concept.
  • Do be clear. Concise is good but not at the risk of being obscure. Comments like “// Now do it” might satisfy a requirement to add comments but don’t really tell you anything. I once worked on a product that consisted of about a million lines of assembler, and the rule was that every line had to have a meaningful trailing comment. It could be a real challenge at times but was a good exercise in commenting.
  • Don’t be lazy. If you code properly, your code should keep working for years to come in completely unforeseen environments, and future generations of programmers will look at your code and be impressed with how well you organized things logically and did exactly what you set out to do. But most of us don’t write such flawless code, and adding a comment that says what you intended to do will help future maintainers answer the question “what was this coder thinking?”
  • Don’t get on a soapbox. Keep your comments relevant to what’s in front of you and avoid indulging in rants about the likely ancestry of the person who designed this, or why you don’t like Perl or PHP or whatever. It may be cathartic but adds to the noise level and makes it less likely that others will take the trouble to maintain your comments.
  • Don’t make your comments personal. If you sign all your comments as a personal statement of opinion, future maintainers will be less likely to change the comments to reflect a better understanding of the code. They may add more comments so that it’s necessary to read an entire trail of comments to grasp what’s happening. That may work when the trail length is <= 2 but breaks down somewhere after that. Make your comments functional and treat them as if they are snippets of logical thought revealing design decisions driving this particular code.
  • Don’t write comments that only you will understand. Others should eventually read your comments as well, so be clear (see above) and don’t make the language in your comments harder to understand than need be.
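Here is a small sketch of what these rules can look like in practice. Everything in it is hypothetical: the function, the constant, and the watchdog scenario are invented purely to show comments that sit next to the code they describe and explain intent rather than restate the code.

```python
# Hypothetical example: the function, constant, and scenario are invented
# for illustration only.

# Longer waits made the (hypothetical) watchdog restart the process.
MAX_DELAY_SECONDS = 30.0


def retry_delay_seconds(attempt: int) -> float:
    """Return the wait before retry number `attempt` (1-based)."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")

    # Unhelpful version of this comment: "Now do it."
    # Helpful version: double the wait each time so a struggling server
    # gets room to recover.
    delay = 0.5 * 2 ** (attempt - 1)

    # Clamp instead of raising: callers are expected to keep retrying, so a
    # large attempt count is normal, not an error.
    return min(delay, MAX_DELAY_SECONDS)
```

Each comment answers a “why” that the six-month-ahead reader could not recover from the code alone, and each one can move with the line it belongs to.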

Commenting in a way that helps you and others quickly grasp the intention, implementation details and context of code is as much an art as a part of good tactical software development. Like any art, your commenting skills can only improve with practice.

Deltas in software development

One of the keys to quality in software development is deltas, or changes. This is an aspect of what I call tactical software development, or those aspects of software development that are more focused on “how you should write and test code” rather than “how you should plan and organize projects.”

Pair programming, for example, is often considered a part of Extreme Programming, a popular lightweight methodology, yet it falls under the umbrella of what I’d call tactics.

Pair programming is effective because it puts two pairs of eyes on every line of code that is written, as it is written. One person is always waiting to get his or her hands on the keyboard, and in the meantime can think about how the next piece of code can be written. Pair programming also speeds up the ramp-up time for getting a new person started on a project – the new programmer is paired with someone who is already familiar with the project, so there is very little time needed for the new programmer to try to figure out where to start.

Pair programming has its drawbacks, however. It only works when schedules are normalized so that everyone works the same hours – not always a problem in the traditional office environment, but it means that alternatives like occasionally working from home will either be restricted or will be done outside the pair programming paradigm.

Adding a lightweight process called Technical Walk-Through (or TWT for short) provides many of the benefits of pair programming but with added flexibility. Here’s how Technical Walk-Through works:

  • All changes to a project, whether changes to existing code or adding new code or components, are broken into change units. A change unit should correspond to a single version control commit, and should always represent a functional change. It is unacceptable to check in code that will break the build or that is incomplete, other than in some exceptional circumstances.
  • Before a change unit is checked in, the person who worked on it will prepare a directory where copies of the before and after code along with diff listings are available for review. They will then write a document which explains the changes enough so that another software engineer looking at the diff or source listings will be able to make sense of the changes.
  • Another team member will then go through the document and all the changes. Reading code changes can be tedious but is meant to catch oversights. It also provides cross-training on different areas of the code. The reviewer need not be someone as experienced as the person preparing the changes, but they are responsible for understanding every change.
  • Finally, if the reviewer is satisfied that the changes make sense – the reviewer is not responsible for testing the actual code, only for making sure that the changes make sense and don’t have any obvious errors or omissions – the changes are checked in with a checkin comment which may contain a synopsis from the change document, and perhaps a unique identifier for the TWT.
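The preparation step lends itself to a small script. What follows is only a sketch under my own assumptions, not a prescribed TWT tool: the function name prepare_review, the file naming (.before, .after, .diff) and the CHANGES.txt document name are all hypothetical, and it handles a single changed file using Python's standard difflib, shutil and pathlib modules.

```python
import difflib
import shutil
from pathlib import Path


def prepare_review(before: Path, after: Path, review_dir: Path,
                   explanation: str) -> Path:
    """Fill review_dir with before/after copies, a diff listing, and the
    change document. Returns the path of the diff listing.

    Hypothetical helper: names and layout are illustrative assumptions.
    """
    review_dir.mkdir(parents=True, exist_ok=True)

    # Keep full copies so the reviewer can read each change in context,
    # not just the changed lines.
    shutil.copy2(before, review_dir / f"{before.name}.before")
    shutil.copy2(after, review_dir / f"{after.name}.after")

    # Unified diff of the two versions, labeled so the listing is readable
    # on its own.
    diff = difflib.unified_diff(
        before.read_text().splitlines(keepends=True),
        after.read_text().splitlines(keepends=True),
        fromfile=f"{before.name} (before)",
        tofile=f"{after.name} (after)",
    )
    diff_path = review_dir / f"{after.name}.diff"
    diff_path.write_text("".join(diff))

    # The written explanation travels with the code it describes.
    (review_dir / "CHANGES.txt").write_text(explanation.rstrip() + "\n")
    return diff_path
```

A call such as prepare_review(Path("old/widget.c"), Path("new/widget.c"), Path("twt-001"), "Why the change was made") would leave the reviewer a single directory to read through; the paths and the identifier here are made up.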

If one prolific coder has worked for two weeks adding a new subsystem to the Linux kernel, for example, it could take another experienced programmer an entire day just to read through and understand the changes. Part of the benefit of this process is that it makes the person who’s about to commit changes think about “why did I do this?” and in many cases, problems or loose ends will be caught before even submitting the TWT for review.

It also makes coders more aware of the need to write code in a way that appropriate comments help someone who is reading the code for the first time to quickly grasp what’s going on. But that’s a topic for another day…