Tuesday, September 2, 2008

Measure your improvements

Metrics are an important part of any development group's toolset. If we want to continually improve our ability to develop software (through a lean engineering kaizen approach, or simply as a learning organization), then we need to have a way to figure out:

  • what parts of our process need improvement?
  • when we make a change, did it help or hurt?

This is where process metrics come into play. I'll start with my definition of a metric, which is a numerical measurement of something. If you can count it, it can be a metric. So "number of outstanding bugs" is a metric, but "software quality" is not. The term "quantitative metric" is redundant, and "qualitative metric" is an oxymoron.

There are generally two types of metrics we can capture:

  1. causal metrics: these are metrics that have a direct business impact, for example ROI for a feature, unique monthly visitors, ad click-through rate, etc.
  2. symptomatic metrics: these are metrics that do not directly affect ROI (although we might believe they do) but are downstream indicators for one or more causal metrics. The number of outstanding bugs in a product, the number of bugs caught in a certain phase of development, percentage of code covered by unit tests, etc. are all symptomatic metrics.

My general observation, based on reading articles about the use of metrics for improving your processes, is that a lot of metrics-based improvement projects fail to distinguish between these two types of metrics. Partly, I think this is because while the causal metrics properly align your improvement efforts with your business's interests, they are also harder to define and measure. By contrast, a lot of symptomatic metrics are easy to find and measure, but their relationship to the business may be less clear.

For example, consider the balance between software quality and time to market. You can take a longer time when developing a feature or product to reduce the number of bugs that show up at deployment, or you can ship a feature more quickly, knowing that there may be both known and unknown bugs present. In this case, you can measure both the number of known bugs at deployment time and the overall time-to-production for a feature (time from feature conception to deployment).

Now, if you can decrease time-to-production without increasing the bugginess of your code, that's a win. Similarly, if you can reduce bugginess without lengthening your production time, that's a win. However, both of these things will probably require some effort to implement. Another interesting possibility would be to simply adjust where you sit on this balance. For example, spend more time looking for and fixing bugs in your QA phase, trading a slower time-to-market for fewer bugs. Or do the reverse to get to market more quickly, possibly with more bugs. Both of these adjustments are probably relatively painless to implement, in that no one has to change what they do, just how long they do it for.

So the question is, which one of these things ought we to do? My argument is that bugginess and time-to-production, both being symptomatic metrics, don't give us the answer directly. It all depends on our product environment. For example, when producing software to run medical equipment, a company's reputation for quality might be more important than shipping new features quickly; in a highly innovative internet space, by contrast, time to market might be king in terms of how much market share you can capture.

It's management's job both to help implement win-win changes and to set the "slider" of the quality/time tradeoff at the right spot. The trick, of course, is that it might be hard to measure this directly; there are several symptomatic things we could measure, including:

  • time spent in development
  • time spent in QA
  • time spent in deployment
  • number of outstanding bugs
  • profit for the product at a given point in time

Now really, the profit over some window of time (e.g. for a website, revenue vs. development spend for a given month) is the thing we want to optimize. The interesting idea here is for management to be able to run a series of experiments: if I increase/decrease QA or development time, how does it affect ROI for my product? How does the relative bugginess of a release affect its profitability? For certain "sliders" in the business, it is relatively simple to take a series of measurements to find a current "sweet spot".
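To make the "series of measurements" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Trial` record, the `sweet_spot` helper, and especially the numbers, which are made up rather than taken from any real product. It treats QA time as the lever, computes profit as revenue minus development spend for each measurement window, and picks the setting with the best observed profit.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One measurement window at a given lever setting (hypothetical record)."""
    qa_weeks: int      # the lever management sets
    revenue: float     # revenue observed over the window
    dev_spend: float   # development spend over the same window

    @property
    def profit(self) -> float:
        # The causal metric we actually want to optimize.
        return self.revenue - self.dev_spend


def sweet_spot(trials):
    """Return the trial with the highest observed profit."""
    return max(trials, key=lambda t: t.profit)


# Made-up observations, one per release cycle.
trials = [
    Trial(qa_weeks=1, revenue=100_000, dev_spend=60_000),
    Trial(qa_weeks=2, revenue=120_000, dev_spend=70_000),
    Trial(qa_weeks=3, revenue=125_000, dev_spend=80_000),
]

best = sweet_spot(trials)
print(f"Best observed setting: {best.qa_weeks} weeks of QA, profit {best.profit:,.0f}")
```

A real version would need several windows per setting, since a single month's revenue is noisy and is affected by things other than QA time. The point is only that the lever is something we control and the profit is what we read off.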

An interesting idea here is that sometimes we work through things backwards. For example, we try to estimate "how long will it take to fully regression test a release", or "how long will it take to code up a feature", rather than "how buggy will the release be if we test it for X amount of time", or "how much of this functionality can you develop in X amount of time." In other words, rather than deriving the time-to-market from a set of estimates for all the steps, instead set the time to market by timeboxing those steps, and see what the outcome is. This is a powerful notion of metrics-based management that is hinted at (in the notion of Scrum timeboxed iterations) but which I have not seen explicitly suggested anywhere[1]. (Please post all the references to things that I've missed in the comments section--I'm sure there are plenty).

At the end of the day, however, it is hard to optimize things we can't measure. I think important metrics to gather are:

  • the levers we have available to manipulate our process (e.g. timeboxing)
  • causal metrics that affect our business (ROI, product profit)

We need to be aware which metrics are causal, and which are merely symptomatic, so that we are measuring things that directly affect the business somehow. This approach permits empirical management--adjust something you can control, see how it affects your causal metrics, rinse, repeat.

[1] Scrum timeboxes an entire iteration, but does not timebox an individual feature, so a team may be able to spend all their time on one feature, or spread their effort across many features. The closest thing I've seen here is the notion of the "Small" in INVEST user stories, where stories are limited to a certain amount of complexity. However, the story points in this case are still estimates of the work involved, rather than timeboxes around how much time to spend implementing a feature; the "small" requirement is really to permit more accurate estimation rather than to timebox the amount of effort (although it does secondarily have this effect, I've not seen this stressed in articles about this).


Graham Rohms Friely said...


All good points and well put, but it seems to me that the things you list as causal metrics are to a certain extent well outside the control of the development team insofar as they reflect the accretive success of marketing and branding, which can be helped by a strong app but not controlled by it. So they aren't really good metrics for any development effort. You could say that the longitudinal TCO of an app would be a causal metric for it, but even there external teams like communications, training, and IT infrastructure factor in. So I'm not seeing pure causal metrics for an SDLC.

Jon Moore said...

Hi Grouse,

Thanks for the input. I struggled a bit with the terms "causal" and "symptomatic", apparently with good reason. I was searching for something to distinguish, for example, "outstanding bug count" from "software quality." Here, we probably really care about the latter, but can only easily wrap a number around the former, which is only a secondary/derived indicator.

I'd definitely welcome suggestions for alternate terms here.

It is true that the development team is not in direct control of the product's profitability; that certainly requires holistic management (should we hire more engineers to enhance the product, or just spend more on advertising?). As you indicate, though, how the development team performs is certainly an influencer on overall product profitability; I'm suggesting collecting trending data over time.

For a given feature that brings in a fixed incremental amount of revenue, for example, spending 2 vs. 3 weeks in QA may be the difference between a positive and a negative ROI for that feature. Having information at this level of detail is the holy grail, and the article here is probably just a small step in that direction.

I think this article needs a rewrite, in general, because it "rohms [too] friely" across a couple of different topics. I'd like to get some more feedback about what the valuable bits are here so I can focus on them during the rewrite.