codeartisan: 2010

This is the third post in a series about software development flow, which I'm describing as the conversion of customer requests (both new features and bug reports) into working software. In the first post on this topic, we talked about how a software development organization can be viewed as a request processing engine, and how queuing theory (and Little's Law in particular) can be applied to optimize overall throughput and time-to-market (also called "cycle time"). In the second article, we revisited these same concepts with a more intuitive explanation and started to identify the management tradeoffs that come into play. This article will focus mostly on this final area: what metrics are important for management to understand, and what mechanisms or levers can they apply to try to optimize throughput?

I am being intentionally vague about particular processes here, since these principles can be applied regardless of particular process. I will also not talk about particular functional specialities like UX, Development, QA, or Ops; whether you have a staged waterfall process or a fully cross-functional agile team, the underlying theory still applies. Now, the adjustments we talk about here will have the greatest effect when applied to the greatest organizational scope (for example, including everything from customer request intake all the way to delivering working software), but Little's Law says they can also be applied to individual subsystems (for example, perhaps just Dev/QA/Ops taken together, or even just Ops) as well. Of course, the more you focus on individual parts of the system, the more likely you are to locally optimize, perhaps to the detriment of the system as a whole.

Managing Queuing Delay

As we've seen previously, at least some of the time a customer request is moving through our organization, it isn't being actively worked on; this time is known as queuing delay. There are different potential causes of queuing delay, including:

batching: if we group individual, independent features together into batches (like sprints or releases), then some of the time an individual feature will either be waiting for its turn to get worked on or it will be waiting for the other features in the batch to get finished
multitasking: if people can have more than one task assigned, they can still only work on one thing at a time, so their other tasks will be in a wait state
backlogs: these are explicit queues where features wait their turn for implementation
etc.

The simplest way to observe queuing delay is to measure it directly: what percentage of my in-flight items don't have someone actively working on them? If your process is visualized, perhaps with a kanban board, and you use avatars for your people to show what they are working on, then this is no harder than counting how many in-flight items don't have an avatar on them.

[ Side note: if you have also measured your overall delivery throughput $X$, Little's Law says:

$$\frac{N_Q}{N} = \frac{X R_Q}{X R} = \frac{R_Q}{R}$$

In other words, your queuing delay $R_Q$ is the same percentage of your overall cycle time $R$ as the number of queued items $N_Q$ is of the overall number of in-flight items $N$. So you can actually measure your queuing delay pretty easily this way.]
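
As a rough sketch of that calculation in Python: count the in-flight items and the ones with nobody working on them, then scale the average cycle time by their ratio. The function names and the example numbers are illustrative assumptions, not figures from a real board.

```python
def queued_fraction(in_flight: int, queued: int) -> float:
    """Fraction of in-flight items that nobody is actively working on."""
    if in_flight <= 0:
        raise ValueError("need at least one in-flight item")
    if not 0 <= queued <= in_flight:
        raise ValueError("queued must be between 0 and in_flight")
    return queued / in_flight


def estimated_queuing_delay(cycle_time_days: float, in_flight: int, queued: int) -> float:
    """Estimate queuing delay as the same fraction of average cycle time."""
    return cycle_time_days * queued_fraction(in_flight, queued)


# Hypothetical board: 20 items in flight, 12 without an avatar,
# and a measured average cycle time of 30 days.
fraction = queued_fraction(in_flight=20, queued=12)
delay_days = estimated_queuing_delay(cycle_time_days=30.0, in_flight=20, queued=12)
print(f"{fraction:.0%} of in-flight items are queued")
print(f"roughly {delay_days:.0f} of 30 days is spent waiting")
```

With these made-up numbers, 60% of the items are idle, so about 18 of the 30 days of cycle time would be queuing delay.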

The primary lever, then, for reducing queuing delay is to reduce the number of in-flight items allowed in the system. One simple way to manage this is to adopt a "one-in-one-out" policy that admits new feature requests only when a previous feature has been delivered; this puts a cap on the number of in-flight items N. We can then periodically (perhaps once a week, or once an iteration) reduce N by taking a slot out: in essence, when we finish one feature request, we don't admit a new one, thus reducing the number of overall requests in-flight.
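
A minimal sketch of such a policy, assuming a simple counter is enough to model the board. The class and method names are hypothetical, and refusing to shrink the cap below one item is my own assumption.

```python
class WipLimiter:
    """One-in-one-out admission with a cap that can be ratcheted down."""

    def __init__(self, cap: int) -> None:
        self.cap = cap
        self.in_flight = 0
        self.slots_to_retire = 0

    def try_admit(self) -> bool:
        """Admit a new request only if a slot is free."""
        if self.in_flight < self.cap:
            self.in_flight += 1
            return True
        return False

    def retire_slot(self) -> None:
        """Mark one slot so that it is not refilled after the next delivery."""
        # Assumption: never shrink the cap below one in-flight item.
        if self.cap - self.slots_to_retire > 1:
            self.slots_to_retire += 1

    def finish(self) -> None:
        """Deliver one item; shrink the cap if a slot was marked for removal."""
        if self.in_flight == 0:
            raise RuntimeError("nothing in flight")
        self.in_flight -= 1
        if self.slots_to_retire:
            self.slots_to_retire -= 1
            self.cap -= 1


limiter = WipLimiter(cap=3)
admitted = [limiter.try_admit() for _ in range(4)]  # the fourth request must wait
limiter.retire_slot()  # e.g. the weekly decision to reduce N
limiter.finish()       # frees a slot, which is then taken out of the system
print(admitted, limiter.cap, limiter.in_flight)
```

After the delivery, the cap and the in-flight count are both 2, so the waiting request stays out until something else finishes.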

Undoubtedly there will come a time when a high priority request shows up, and the opportunity cost of waiting for something else to finish before admitting it would be too high. One possibility here is to flag this request as an emergency, perhaps by attaching a red flag to it on a kanban board to note its priority, and temporarily allow that new request in, with the notion that we will not admit a new feature request once the emergency feature finishes.

Managing Failure Demand

Recall that failure demand consists of requests we have to deal with because we didn't deliver something quite right the first time--think production outages or user bug reports. Failure demand can be quite expensive: according to one estimate, fixing a bug in production can be more than 15 times as expensive as correcting it during development. In other words, having work show up as failure demand is probably the most expensive possible way to get that work done. Cost aside, however, any amount of failure demand that shows up detracts from our ability to service value demand--the new features our customers want.

From a monitoring and metrics perspective, we simply compute the percentage of all in-flight requests that are failure demand. Now the question is how to manage that percentage downward so that we aren't paying as much of a failure demand tax on new development.
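
For illustration, here is one way that percentage could be computed, assuming each in-flight request has been tagged as either value demand or failure demand. The tags and the sample data are made up.

```python
from collections import Counter

# Hypothetical snapshot of in-flight requests, each tagged by demand type.
in_flight = ["value", "failure", "value", "value",
             "failure", "value", "value", "value"]

counts = Counter(in_flight)
percent_failure = 100 * counts["failure"] / len(in_flight)
print(f"{percent_failure:.0f}% of in-flight requests are failure demand")
```

Taking this snapshot at a regular interval gives a trend line, which is what shows whether the percentage is actually being managed downward.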

To get rid of existing failure demand, we need to address the root causes of these issues. Ideally, we would want to do a root cause analysis and fix for every incident (in the long term, this is the cheapest way to deal with the problems), but for organizations already experiencing high failure demand, this might temporarily drag new feature development to a halt. An alternative is to have a single "root cause fix" token that circulates through the organization: if it is not being used, then the next incident to arrive gets the token assigned. We do a root cause analysis and fix for that issue, then when we've finished that, the token frees up and we look for the next issue to fix. This approach caps the labor invested in root cause analysis and fixes, and because the most frequent issues are the most likely to arrive while the token is free, it will, probabilistically, end up fixing the most common issues first. Over time, this will gradually wear away at the causes of existing failure demand. It's worth noting that you may not have to go to the ultimate root cause to have a positive effect--just fixing the issue in a way that makes it less likely to occur again will ultimately reduce failure demand.
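
The token idea can be sketched in a few lines. This is only an illustration: the class, its method names, and the incident IDs are hypothetical.

```python
from typing import Optional


class RootCauseToken:
    """A single token: at most one incident gets a root cause fix at a time."""

    def __init__(self) -> None:
        self.holder: Optional[str] = None

    def on_incident(self, incident_id: str) -> bool:
        """Return True if this incident claims the token (and a root cause fix)."""
        if self.holder is None:
            self.holder = incident_id
            return True
        return False  # token is busy: handle this incident the usual way

    def on_fix_complete(self) -> None:
        """Release the token so the next incident to arrive can claim it."""
        self.holder = None


token = RootCauseToken()
decisions = [token.on_incident(i) for i in ("INC-1", "INC-2", "INC-3")]
token.on_fix_complete()
decisions.append(token.on_incident("INC-4"))
print(decisions)
```

Only the first incident and the first one to arrive after the fix completes get the deeper treatment; everything in between is handled normally, which is what keeps the investment capped.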

However, we haven't addressed the upstream sources of failure demand yet; if we chip away at existing failure demand but continue to pile more on via new feature development, we'll ultimately lose ground. The primary cause of new failure demand is trying to hit an aggressive deadline with a fixed scope--something has to give here, and what usually gives is quality. There may well be reasons that this is the right tradeoff to make; perhaps there are marketing campaigns scheduled to start or contractual obligations that must be met (we'll save a discussion of how those dates got planned for another time). At any rate, management needs to understand the tradeoffs that are being made, and needs to be given the readouts to responsibly govern the process. "Percent failure demand" turns out to be a pretty simple and informative metric.

Managing Cycle Time

Draining queuing delay and tackling failure demand are pretty much no-brainers: they are easy to track, and there are easy-to-understand ways to reduce both. However, once we've gotten all the gains we can out of those two prongs of attack, all that is left is trying to further reduce cycle time (and hence raise throughput) via process change. This is much harder--there are no silver bullets here. Although there are any number of folks who will claim to "know" the process changes that are needed here, ranging from Agile consultants, to other managers, to the folks working on the software itself, the reality is that these ideas aren't really guaranteed solutions. They are, however, a really good source of process experiments to run.

Measuring cycle time is important, because thanks to queuing theory and Little's Law, it directly corresponds to throughput in a system with a fixed set of work in-flight. Furthermore, it is very easy to measure average cycle time; the data can be collected by hand and run through a spreadsheet with little real effort. This makes it an ideal metric for evaluating a process change experiment:

if cycle time decreases, keep the process change as an improvement
if cycle time increases, revert back to the old process and try something different
if cycle time is not affected, you might as well keep the change but still look for improvement somewhere else
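
The decision rule above could be written down roughly as follows. The 5% tolerance for "not affected" is an assumption of mine rather than a recommended threshold, and the cycle times are invented.

```python
from statistics import mean


def evaluate_experiment(before_days, after_days, tolerance=0.05):
    """Compare average cycle time before and during a process experiment.

    `tolerance` is an assumed threshold: relative changes smaller than
    this are treated as "not affected".
    """
    baseline = mean(before_days)
    change = (mean(after_days) - baseline) / baseline
    if change < -tolerance:
        return "keep: cycle time decreased"
    if change > tolerance:
        return "revert: cycle time increased"
    return "keep, but look elsewhere: no appreciable effect"


# Hypothetical cycle times (in days) for items delivered in each period.
before = [12, 15, 9, 14, 10]
after = [9, 11, 8, 10, 7]
print(evaluate_experiment(before, after))
```

With only a handful of delivered items per period, averages will be noisy, so a comparison like this is a prompt for discussion rather than a verdict.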

Keeping "no-effect" process changes in place sets the stage for a culture of continual process improvement; it encourages experimentation if nothing else (and the cycle time measurements have indicated it hasn't hurt). Now, regardless of the experiment, it's important to set a timebox around the experiment so that we can evaluate it: "let's try it this way for a month and see what happens". New processes take time to sink in, so it's important not to run experiments that are too short--we want to give the new process a chance to shake out and see what it can really do. It's also worth noting here that managers should expect some of the experiments to "fail" with increased cycle time or to have no appreciable effect. This is unfortunately the nature of the scientific method--if we could be prescient we'd just jump straight to the optimized process--but this is a tried and true method for learning.

Now, process change requires effort to roll out, so a good question to ask here is how to find the time/people to carry this out. There's a related performance tuning concept here known as the Theory of Constraints, which I'll just paraphrase as "there's always a bottleneck somewhere." If we keep reducing work in-flight, and we have the end-to-end process visualized somewhere, we should be able to see where the bottleneck in the process is. The Theory of Constraints also says that you don't need to take on any more work than the bottleneck can process, which means, depending on your process and organizational structure, that we may find that we can apply folks both "upstream" and "downstream" of the bottleneck to a process change experiment without actually decreasing overall throughput. Furthermore, by identifying the bottleneck, we have a good starting point for selecting an experiment to run: let's try something that will alleviate the bottleneck (or, as the Theory of Constraints says, just move it elsewhere).
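
One crude way to spot the bottleneck from a visualized process is to look at where work piles up. The sketch below assumes that the stage with the most waiting items is the likely constraint; that is a heuristic, and the stage names and counts are made up.

```python
# Hypothetical count of items waiting to enter each stage of the process.
waiting_by_stage = {"analysis": 1, "development": 2, "testing": 7, "deployment": 1}

# Assumption: work accumulates in front of the slowest stage, so the
# longest queue is a reasonable first guess at the bottleneck.
bottleneck = max(waiting_by_stage, key=waiting_by_stage.get)
print(f"Likely bottleneck: {bottleneck}")
```

In this example, the people in the stages other than testing are the ones who might have slack to spend on a process change experiment.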

Conclusion

In this article, we've seen that managers really only need a few easy-to-collect metrics on an end-to-end software delivery flow to enable them to optimize throughput:

total number of items in-flight
number of "idle" in-flight items (not actively being worked)
number of in-flight items that are failure demand
end-to-end average cycle time

We've also identified several mechanisms, ranging from reducing work-in-progress to root cause fixes of failure demand, that can enable managers to perform optimizations on their process at a pace that suits the business. This is the classic empirical process control ("inspect and adapt") model that has been demonstrated to work effectively time and again in many settings, from the shop floor of Toyota factories to the team rooms of agile development organizations.
