Saturday, November 27, 2010
Managing Software Development Flow

This is the third post in a series about software development flow, which I'm describing as the conversion of customer requests (both new feature requests and bug reports) into working software. In the first post on this topic, we talked about how a software development organization can be viewed as a request processing engine, and how queuing theory (and Little's Law in particular) can be applied to optimize overall throughput and time-to-market (also called "cycle time"). In the second article, we revisited these same concepts with a more intuitive explanation and started to identify the management tradeoffs that come into play. This article focuses mostly on that last area: what metrics are important for management to understand, and what are some mechanisms/levers they can apply to try to optimize throughput?

I am being intentionally vague about specific processes here, since these principles apply regardless of the particular process you use. I will also not talk about particular functional specialties like UX, Development, QA, or Ops; whether you have a staged waterfall process or a fully cross-functional agile team, the underlying theory still applies. Now, the adjustments we talk about here will have the greatest effect when applied at the greatest organizational scope (for example, including everything from customer request intake all the way to delivering working software), but Little's Law says they can also be applied to individual subsystems (for example, perhaps just Dev/QA/Ops taken together, or even just Ops). Of course, the more you focus on individual parts of the system, the more likely you are to optimize locally, perhaps to the detriment of the system as a whole.

Managing Queuing Delay

As we've seen previously, for at least some of the time a customer request spends moving through our organization, it isn't being actively worked on; this time is known as queuing delay. There are different potential causes of queuing delay, including:

  • batching: if we group individual, independent features together into batches (like sprints or releases), then some of the time an individual feature will either be waiting for its turn to get worked on or it will be waiting for the other features in the batch to get finished
  • multitasking: if people can have more than one task assigned, they can still only work on one thing at a time, so their other tasks will be in a wait state
  • backlogs: these are explicit queues where features wait their turn for implementation
  • etc.

The simplest way to observe queuing delay is to measure it directly: what percentage of my in-flight items don't have someone actively working on them? If your process is visualized, perhaps with a kanban board, and you use avatars for your people to show what they are working on, then this is no harder than counting how many in-flight items don't have an avatar on them.

[ Side note: if you have also measured your overall delivery throughput X, Little's Law says:

N_Q / N = X·R_Q / (X·R) = R_Q / R

In other words, your queuing delay R_Q is the same percentage of your overall cycle time R as the number of queued items N_Q is to the overall number of in-flight items N. So you can actually measure your queuing delay pretty easily this way.]
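To make the side note concrete, here's a minimal sketch of turning a board snapshot into a queuing delay estimate. This isn't from any particular tool; the function name and the numbers are made up for illustration:

    # Little's Law gives N_Q / N = R_Q / R, so R_Q = R * (N_Q / N).

    def estimate_queuing_delay(items_in_flight, items_idle, avg_cycle_time_days):
        """Return an estimate of queuing delay R_Q in days.

        items_in_flight     -- total in-flight items (N)
        items_idle          -- in-flight items with no avatar on them (N_Q)
        avg_cycle_time_days -- measured average cycle time (R)
        """
        idle_fraction = items_idle / items_in_flight
        return avg_cycle_time_days * idle_fraction

    # Example: 20 items in flight, 12 of them idle, 15-day average cycle time.
    print(estimate_queuing_delay(20, 12, 15))  # -> 9.0 days of a 15-day cycle time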

The primary lever, then, for reducing queuing delay is to reduce the number of in-flight items allowed in the system. One simple way to manage this is to adopt a "one-in-one-out" policy that admits a new feature request only when a previous feature has been delivered; this puts a cap on the number of in-flight items N. We can then periodically (perhaps once a week, or once an iteration) reduce N by taking a slot out: in essence, when we finish one feature request, we don't admit a new one, thus reducing the overall number of requests in-flight.
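As a rough sketch, the one-in-one-out policy with periodic slot retirement might look something like this; the class and method names are invented for illustration, not a prescribed tool:

    class AdmissionGate:
        """Cap the number of in-flight feature requests (N) with simple slots."""

        def __init__(self, slots):
            self.slots = slots      # current cap on in-flight items
            self.in_flight = 0

        def try_admit(self):
            """Admit a new request only if a slot is free; otherwise it waits."""
            if self.in_flight < self.slots:
                self.in_flight += 1
                return True
            return False

        def finish(self, retire_slot=False):
            """Called when an item ships; optionally retire the freed slot
            instead of backfilling it, which reduces N over time."""
            self.in_flight -= 1
            if retire_slot:
                self.slots -= 1

    # Example: start with ten slots and retire one per iteration.
    gate = AdmissionGate(slots=10)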

Undoubtedly there will come a time when a high-priority request shows up and the opportunity cost of waiting for something else to finish before it can be admitted would be too high. One possibility here is to flag this request as an emergency, perhaps by attaching a red flag to it on a kanban board to note its priority, and temporarily allow the new request in, with the understanding that when the emergency feature finishes, we will not admit a new request to backfill its slot.

Managing Failure Demand

Recall that failure demand consists of requests we have to deal with because we didn't deliver something quite right the first time--think production outages or user bug reports. Failure demand can be quite expensive: according to one estimate, fixing a bug in production can be more than 15 times as expensive as correcting it during development. In other words, having work show up as failure demand is probably the most expensive possible way to get that work done. Cost aside, however, any failure demand that shows up detracts from our ability to service value demand--the new features our customers want.

From a monitoring and metrics perspective, we simply compute the percentage of all in-flight requests that are failure demand. Now the question is how to manage that percentage downward so that we aren't paying as much of a failure demand tax on new development.

To get rid of existing failure demand, we need to address the root causes of these issues. Ideally, we would do a root cause analysis and fix for every incident (this is the cheapest long-term way to deal with the problems), but for organizations already experiencing high failure demand, this might temporarily drag new feature development to a halt. An alternative is to have a single "root cause fix" token that circulates through the organization: if it is not in use, the next incident to arrive gets the token assigned. We do a root cause analysis and fix for that issue, and when we've finished, the token frees up and we look for the next issue to fix. This approach caps the labor investment in root cause analysis and fixing, and will, probabilistically, end up fixing the most common issues first. Over time, this will gradually wear away at the causes of existing failure demand. It's worth noting that you may not have to chase down the ultimate root cause to have a positive effect--just fixing the issue in a way that makes it less likely to occur again will still reduce failure demand.
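To illustrate the token idea, here is a small sketch of the policy as described above; the class and names are hypothetical:

    class RootCauseToken:
        """A single token that limits root cause work to one issue at a time."""

        def __init__(self):
            self.current_issue = None   # None means the token is free

        def on_incident(self, issue):
            """Decide how to handle a newly arrived incident."""
            if self.current_issue is None:
                self.current_issue = issue
                return "root cause analysis and fix"
            return "normal triage/fix only"

        def on_root_cause_fixed(self):
            """Free the token so the next incident to arrive can claim it."""
            self.current_issue = None

Because the most frequent problems are the most likely to be the next arrival when the token frees up, this naturally biases the root cause work toward the biggest sources of failure demand.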

However, we haven't yet addressed the upstream sources of failure demand; if we chip away at existing failure demand but continue to pile more on via new feature development, we'll ultimately lose ground. The primary cause of new failure demand is trying to hit an aggressive deadline with a fixed scope--something has to give, and what usually gives is quality. There may well be reasons that this is the right tradeoff to make; perhaps there are marketing campaigns scheduled to start or contractual obligations that must be met (we'll save the discussion of how those dates got planned for another time). At any rate, management needs to understand the tradeoffs that are being made, and needs the readouts to govern the process responsibly. "Percent failure demand" turns out to be a pretty simple and informative metric.

Managing Cycle Time

Draining queuing delay and tackling failure demand are pretty much no-brainers: they are easy to track, and there are easy-to-understand ways to reduce both. However, once we've gotten all the gains we can out of those two prongs of attack, all that is left is trying to further reduce cycle time (and hence raise throughput) via process change. This is much harder--there are no silver bullets here. Although there are any number of folks who will claim to "know" the process changes that are needed here, ranging from Agile consultants, to other managers, to the folks working on the software itself, the reality is that these ideas aren't really guaranteed solutions. They are, however, a really good source of process experiments to run.

Measuring cycle time is important because, thanks to queuing theory and Little's Law, it directly corresponds to throughput in a system with a fixed amount of work in-flight. Furthermore, average cycle time is very easy to measure; the data can be collected by hand and run through a spreadsheet with little real effort (see the short sketch after the list below). This makes it an ideal metric for evaluating a process change experiment:

  • if cycle time decreases, keep the process change as an improvement
  • if cycle time increases, revert back to the old process and try something different
  • if cycle time is not affected, you might as well keep the change but still look for improvement somewhere else
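Here is the sketch mentioned above: a few lines that turn recorded start/ship dates into an average cycle time. The dates are illustrative, not real measurements:

    from datetime import date

    # (started, shipped) pairs collected by hand from a board or spreadsheet.
    items = [
        (date(2010, 10, 4), date(2010, 10, 20)),
        (date(2010, 10, 6), date(2010, 11, 1)),
        (date(2010, 10, 11), date(2010, 10, 29)),
    ]

    cycle_times = [(shipped - started).days for started, shipped in items]
    average_cycle_time = sum(cycle_times) / len(cycle_times)
    print(average_cycle_time)  # average R in days for this sample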

Keeping "no-effect" process changes in place sets the stage for a culture of continual process improvement; it encourages experimentation if nothing else (and the cycle time measurements have indicated it hasn't hurt). Now, regardless of the experiment, it's important to set a timebox around the experiment so that we can evaluate it: "let's try it this way for a month and see what happens". New processes take time to sink in, so it's important not to run experiments that are too short--we want to give the new process a chance to shake out and see what it can really do. It's also worth noting here that managers should expect some of the experiments to "fail" with increased cycle time or to have no appreciable effect. This is unfortunately the nature of the scientific method--if we could be prescient we'd just jump straight to the optimized process--but this is a tried and true method for learning.

Now, process change requires effort to roll out, so a good question to ask here is how to find the time/people to carry this out. There's a related performance tuning concept here known as the Theory of Constraints, which I'll just paraphrase as "there's always a bottleneck somewhere." If we keep reducing work in-flight, and we have the end-to-end process visualized somewhere, we should be able to see where the bottleneck in the process is. The Theory of Constraints also says that you don't need to take on any more work than the bottleneck can process, which means, depending on your process and organizational structure, that we may find that we can apply folks both "upstream" and "downstream" of the bottleneck to a process change experiment without actually decreasing overall throughput. Furthermore, by identifying the bottleneck, we have a good starting point for selecting an experiment to run: let's try something that will alleviate the bottleneck (or, as the Theory of Constraints says, just move it elsewhere).

Conclusion

In this article, we've seen that managers really only need a few easy-to-collect metrics on an end-to-end software delivery flow to enable them to optimize throughput:

  • total number of items in-flight
  • number of "idle" in-flight items (not actively being worked)
  • number of in-flight items that are failure demand
  • end-to-end average cycle time

We've also identified several mechanisms, ranging from reducing work-in-progress to root cause fixes of failure demand, that can enable managers to perform optimizations on their process at a pace that suits the business. This is the classic empirical process control ("inspect and adapt") model that has been demonstrated to work effectively time and again in many settings, from the shop floor of Toyota factories to the team rooms of agile development organizations.

Thursday, November 25, 2010
Intuitions about Software Development Flow

In a previous post, I described the underlying theory behind optimizing the throughput of a software development organization, which consists of a three-pronged attack:

  1. remove queuing delay by limiting the number of features in-flight
  2. remove failure demand by building in quality up front and fixing root causes of problems
  3. reduce average cycle time by experimenting with process improvements

In this article, I'd like to provide an alternative visualization to help motivate these changes. Let's start with some idealized flow, where we have sufficient throughput to deal with all of our incoming customer requests. Or, if we prefer, the rate at which our business stakeholders inject requests for new features is matched to the rate at which we can deliver them.

Queuing Delay

Now let's add some queuing delay, in the form of some extra water sitting in the sink:

If we leave the faucet of customer requests running at the same rate that the development organization can "drain" them out into working software, we can understand that the level of water in the sink will stay constant. Compared to our original diagram, features are still getting shipped at the same rate they were before; the only difference is that now for any particular feature, it takes longer to get out the other side, because it has to spend some time sitting around in the pool of queuing delay.

Getting rid of queuing delay is as simple as turning the faucet down slightly so that the pool can start draining; once we've drained all the queuing delay out, we can turn the faucet back up again, with no net change other than improved time-to-market (cycle time). There's a management investment tradeoff here; the more we turn the faucet down, the faster the pool drains and the sooner we can turn the faucet back up to full speed at a faster cycle time. On the other hand, that requires (temporarily) slowing down feature development to let currently in-flight items "drain" a bit. Fortunately, this is something that can be done completely flexibly as business situations dictate--simply turn the knob on the faucet as desired, and adjust it as many times as needed.

Failure Demand

We can model failure demand as a tube that siphons some of the organization's throughput off and runs it back into the sink in the form of bug reports and production incidents:

Our sink intuition tells us that we'll have to turn the faucet down--even if only slightly--if we don't want queuing delay to start backing up in the system (otherwise we're adding new requests plus the bug fixing to the sink at a rate faster than the drain will accommodate). Now, every time we ship new features that have bugs or aren't robust to failure conditions (particularly common when rushing to hit a deadline), it's like making the failure demand siphon wider; ultimately we're stealing from our future throughput. When we fix the root cause of an issue, it's like making the failure demand siphon narrower, and we not only get happier customers, but we reclaim some of our overall throughput.

Again, there are management tradeoffs to be made here: fixing the root cause of an issue may take longer than just triaging it, but it is ultimately an investment in higher throughput. Similarly, rushing not-quite-solid software out the door is ultimately borrowing against future throughput. However, it's not hard to see that if we never invest in paying down the failure demand, eventually it will consume all of our throughput and severely reduce our ability to ship new features. This is why it is important for management to have a clear view of failure demand in comparison to overall throughput so that these tradeoffs can be managed responsibly.

Process Change

The final thing we can do is to improve our process, which is roughly like taking all the metal of the drain pipe (corresponding loosely to the people in our organization) and reconfiguring it into a shorter, fatter pipe:

This shows the intuition that if we focus on cycle time (length of the pipe) for our process change experiments, it will essentially free up people (metal) to work on more things (pipe width) at a time, thus improving throughput. There is likewise a management tradeoff to make here: process change takes time and investment, and we'll need to back off feature development for a while to enable that. On the other hand, there's simply no way to improve throughput without changing your process somehow; underinvestment here compared to our competitors means eventually we'll get left in the dust, just as surely as failing to invest cash financially will eventually lead to an erosion of purchasing power due to inflation.

Summary

Hopefully, we've given some intuitive descriptions of the ways to improve time-to-market and throughput for a software development organization to complement the theory presented in the first post on this topic. We've also touched on some of the management tradeoffs these changes entail and some of the information management will need to guide things responsibly.


Credits: Sink diagrams are available under a Creative Commons Attribution-ShareAlike 2.0 Generic license and were created using photos by tudor and doortoriver.


Friday, November 19, 2010
How to Go Faster

Ok, I'm going to tell you how to make your software development organization go faster. I'm going to tell you how to get more done without adding people while improving your time to market and increasing your quality. And I'm going to back it all up with queuing theory. [ By actually explaining the relevant concepts of queuing theory, not just by ending sentences with "...which is obvious from queuing theory", which is usually a good bluff in a technical argument being had over beers. Generally a slam dunk in mixed technical/non-technical company. But I digress. ]

An Important Perspective

It's worth saying that this article assumes you've figured out how to deliver software incrementally somehow, even if that's just by doing Scrumfall. The point is that you are familiar with breaking your overall feature set down into discretely deliverable minimum marketable features (MMFs), user stories, epics, tasks, and the like. If you have any customers, you are probably also familiar with production incidents and bugs, which are also discrete chunks of work to do. Now, here's the important perspective:

Your software development organization is a request processing system.

In this case, the requests come from customers or their proxies (product managers, etc.), and the organization processes a request by delivering the requested change as working software. This could end with a deployment to a live website, publishing an update to an app store, or just plain cutting a release and posting it somewhere for your customers to download and use. At any rate, the requests come into your organization, the software gets delivered, and then the request is essentially forgotten (closed out). Now, looking at your organization this way is important, because it means you can understand your capacity for delivery in terms borrowed from tuning other request processing systems (like websites, for example) for performance and scale. Most importantly, though, the mysterious branch of mathematics called queuing theory applies to your organization (just as it applies to any request processing system).

A Little Light Queuing Theory

One of the basic principles in queuing theory is Little's Law, which says:

N = XR

where N is the average number of requests in the system (in flight, whether being worked on or waiting), X is the transaction rate (requests processed per unit time), and R is the average response time (how long it takes to get one request from arrival to completion). In a software development setting, R is sometimes called cycle time.

To put this in more familiar terms, suppose we have a walk-in bank with a number of tellers on staff. If customers arrive at an average rate of one person per minute (X), and it takes a teller an average of 2 minutes to serve a customer (R), then Little's Law says we'll have XR = 1(2) = 2 tellers busy (N) at any given point in time, on average. We can similarly flip this around: if we have 3 tellers on staff, what's the maximum average customer arrival rate we can handle?

X = N/R = 3/2 = 1.5 customers per minute
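If it helps, here's the same arithmetic as a tiny sketch; there's nothing in it beyond Little's Law itself, and the helper names are just made up:

    # Little's Law: N = X * R, so X = N / R and R = N / X.

    def busy_tellers(arrival_rate, service_time):
        """N = X * R: average number of tellers busy."""
        return arrival_rate * service_time

    def max_arrival_rate(tellers, service_time):
        """X = N / R: the most customers per minute a given staff can absorb."""
        return tellers / service_time

    print(busy_tellers(arrival_rate=1.0, service_time=2.0))   # -> 2.0 tellers busy
    print(max_arrival_rate(tellers=3, service_time=2.0))      # -> 1.5 customers per minute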

Ok, the last thing we need to talk about is: what happens if we suddenly get a rush of customers coming in? Anyone who has entered a Starbucks or visited Disneyland knows the answer to this: a line forms. (The time a customer spends waiting in line is known as "queuing delay" if you want to get theoretical about it.) Let's go back to our bank. Suppose we just have 5 people suddenly walk in all at once, in addition to our regular arrival of one person per minute. What happens? Well, we get a line that is 5 people long. But if we only have 2 tellers on staff, then people come off the line at exactly the same rate that new people are entering from the back, which means: the line never goes away and always stays 5 people long.

What does this look like from the customers' point of view? Well, we know they'll spend 2 minutes with the teller once they get up to the front of the line, and we know that it will take 5 minutes to get to the front of the line, so the average response time is:

R = R_V + R_Q = 2 + 5 = 7

where R_V is the "value added time" where the request (customer) is actually getting worked on/for, and R_Q is the amount of time spent waiting in line (queuing delay). Now we can see that on average, we'll have:

N = XR = X(R_V + R_Q) = 1(2 + 5) = 7

people in the bank on average: two at the tellers, and five waiting in line. We all know how frustrating an experience that is from the customer's point of view. Now, let me summarize this section (if you didn't follow all the math, don't worry; the important thing is that you understand these implications, which the tiny simulation after the list illustrates):

  1. If you try to put more requests into a system than it can handle, lines start forming somewhere in the system.
  2. If the request rate never falls below the system's max capacity, the lines never go away.
  3. Time spent waiting in a line doesn't really serve much useful purpose from the customer's point of view.
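Here is the tiny simulation promised above. The setup mirrors the bank example; the code itself is just an illustration I've added, minute by minute:

    def simulate_bank(minutes=60, tellers=2, service_time=2, initial_rush=5):
        """One customer arrives per minute, plus an initial rush of five.
        Each teller takes two minutes per customer. Returns the final line length."""
        line = initial_rush
        busy_until = [0] * tellers          # minute at which each teller frees up
        for minute in range(minutes):
            line += 1                       # the regular one-arrival-per-minute
            for i in range(tellers):
                if busy_until[i] <= minute and line > 0:
                    line -= 1               # next customer steps up to this teller
                    busy_until[i] = minute + service_time
        return line

    print(simulate_bank())   # the line is still about 5 people long after an hour

Because customers leave the front of the line at the same rate new ones join the back, the rush never drains; the only way to shrink the line is to slow arrivals or speed up service.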

Software development as customer request processing

If your experience is anything like mine, there is an infinite supply of things the business stakeholders would like the software to do, which means the transaction rate X can be as high as we actually have capacity for. This means one of the primary goals of the organization is figuring out how to get X as high as possible so we can ship more stuff. At the same time, we're also concerned with getting R as low as possible, since this represents our time-to-market and can be a major competitive advantage. If we can ship a feature in a week but it takes our competitors a month to get features through their system, who's more reactive? Every time the competition throws up a compelling feature, we can match them in a week. Every time we ship a compelling feature, it takes them a month to catch up. Who's going to win that battle?

Now, one of the tricky things here is that software development is often far more complicated than our example bank with tellers, since we tend to staff folks with different skillsets. If I have a team of one graphic designer, three developers, a tester, and a sysadmin, it's really hard to predict how long it will take that team to ship a feature, because they will have to collaborate. If I want to hire someone to help them, is it better to hire another tester or another designer? Probably I can't tell a priori, because it depends on the nature of the features being worked on, and it's really hard to measure things like "this user story was 10% design, 25% development, 50% testing, and 15% operations." Nonetheless, we can look at this from another point of view, which is that I have a fixed number of people in the organization, and each person can only be working on one thing at a time (just as a teller can only actively serve one person at a time), and they are probably (hopefully) collaborating on them.

This means the maximum number of things you can realistically be actively working on is no greater than the number of people in the organization (and, if they're collaborating, it's even less).

If we have more things in flight than that, we know at least some of the time those things are going to be sitting around waiting for someone to work on them (queuing delay). Perhaps they are sitting on a product backlog. Perhaps they are simply marked "Not Started" on a sprint taskboard. Perhaps they are marked "Done" on a sprint taskboard but they have to wait for a release to be rolled at the end of the sprint to move onwards towards production or QA. As we saw above, this queuing delay doesn't increase throughput, it just hurts our time-to-market. Why would we want that?

First optimization: get rid of queuing delay

Ok, as we saw above, we know that the total response time R consists of two parts: actual value-adding work (R_V) and queuing delay (R_Q). Typically, it's really hard and time-consuming to try to measure these two pieces separately without having lots of annoying people running around with stopwatches and taking furious notes. Fortunately, we don't have to resort to that. It is really easy to measure R overall for a feature/story: mark down when the request came in (e.g. got added to a backlog) and then mark down when it shipped. Simple.

Now, let's think back to our bank example where we had a line of people. Most software development organizations have too much in flight, and they have lines all over the place inside, many of which aren't even readily apparent because that's just "the way we do things around here." Lines are bad. Now, we know the only way to drain these queues is if the incoming feature request rate is less than the rate at which we ship them. Sometimes we can try hiring more "tellers", but in a recession that's not always an option. Instead, for many organizations, the best option is admission control, which is to say that we don't take on a new request until we've shipped one out the other side. You can think of this as having a certain number of feature delivery "slots" available, and you can't start something new until you've freed up a slot. This at least prevents you from having your lines get any bigger.

In order to drain the lines out of the system, the easiest thing to do is to periodically retire a slot when the item occupying it ships. In other words, don't let something new in just that once. This reduces the overall number of things in flight, and since presumably everyone is still working hard, what we've just gotten rid of must be queuing delay. Magic! So we can just keep doing this, draining queuing delay out of the system and improving our time to market, without necessarily having to change anything else about the way we do things. When do we stop? We stop once we have people standing around with nothing to do. At that point, all the queuing delay is out of the system (for now), and we know that we're at a level where all of our "tellers" are busy. To summarize:

  1. We can remove queuing delay from our delivery process simply by limiting and reducing the amount of work in-flight; this improves time-to-market without having to change anything else.
  2. We can keep doing this until people run out of things to work on; at that point we've squeezed all the queuing delay out.

Second optimization: reduce failure demand

The next thing to realize is that the N things we have in flight actually come in two flavors: value demand and failure demand. In our case, value demand consists of requests that create value for the customer: i.e. new and enhanced features. Failure demand, on the other hand, consists of requests that come from not doing something right previously. These are primarily things like website outages (production incidents), bug reports from users, or even support calls from users asking if you've fixed the problem they previously reported. If you have someone collecting these, then these are requests that your organization as a whole has to deal with. And for each failure demand request, someone is busy triaging or fixing it when they could be creating new value. In other words:

N = N_V + N_F

where N_V is value demand and N_F is failure demand. Or, if we look at things this way:

X = N/R = (N_V + N_F)/R = N_V/R + N_F/R

we can see that the failure demand is stealing a portion (N_F/R) of our organization's throughput! This is, incidentally, why spending extra energy on quality up front results in lower overall costs (as Toyota showed); failure demand essentially requires rework.
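As a quick sketch of that split, with illustrative numbers only and a made-up helper name:

    def throughput_split(n_value, n_failure, avg_cycle_time):
        """X = N/R = N_V/R + N_F/R: split throughput into value and failure parts."""
        x_value = n_value / avg_cycle_time
        x_failure = n_failure / avg_cycle_time
        failure_share = n_failure / (n_value + n_failure)
        return x_value, x_failure, failure_share

    # Example: 14 value items and 6 failure items in flight, 10-day cycle time.
    x_v, x_f, share = throughput_split(14, 6, 10)
    print(x_v, x_f, share)   # -> 1.4 0.6 0.3  (30% of throughput goes to failure demand)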

This means that another way to improve the overall throughput of the organization is to reduce failure demand, reclaiming that portion of your throughput that's getting siphoned off. One way to do this is figuring out how to "build quality in" on new development, but since software development is a creative process (different every time for every feature), it isn't possible to prevent bugs entirely. That said, there are many techniques, like test-driven development and user experience testing, that can help improve quality. The other way to reduce failure demand is to vigorously fix root causes of failures as we experience them. In other words, when we fix a problem for a customer, we should fix it in a way that prevents that type of problem from ever occurring again, for any customer. This keeps overall failure demand down by preventing certain classes of it, thereby reserving that precious organizational throughput for delivering new value. To summarize this section:

  1. Improve value delivery capacity by reducing failure demand (production incidents and bug reports).
  2. The cheapest way to reduce failure demand is by building in quality up-front.
  3. When serving a failure demand request, we can reduce overall failure demand by also fixing the root cause of the problem.

Final optimization: cycle time reduction

Ok, now we've gotten to the point where R_Q = 0 (or near zero), so R = R_V. At this point, let's look back at Little's Law:

N = XR

By draining out our queuing delay in the first phase, we've already established what our target N (number of requests in-flight) is. But we still want to ship more with the same number of people; we want X to go up. Recall that:

X = N/R

If our N is fixed by the number of people we have on staff, then the only way to increase throughput is to reduce R. This is where we start to look at process changes and automation: how do we make it so that it takes people less time to handle a request? Focusing on this improves not only time to market but also overall throughput. Furthermore, if we are measuring R over time, we have an easy way to do this: change the process in a way you think will help, and then measure whether R went down. If it didn't help, try something else. If it made things worse, go back to the old way. Rinse, repeat. The things to try will be different for every organization, and one of the best sources of ideas will be the folks actually doing the work. None of this requires any kind of high-tech tracking software -- post-it notes on walls with the start and end dates written on them are more than sufficient to measure R and carry these experiments out.
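In the spirit of post-it notes, the whole "experimental lab" can be as simple as this sketch; the numbers are made up:

    def average(xs):
        return sum(xs) / len(xs)

    # Cycle times in days, read straight off the post-it notes.
    before_change = [14, 21, 9, 17, 12]       # old process
    during_experiment = [11, 13, 8, 15, 10]   # timeboxed experiment

    r_before, r_after = average(before_change), average(during_experiment)
    if r_after < r_before:
        print("Keep the change: R improved from", r_before, "to", r_after)
    elif r_after > r_before:
        print("Revert: the change made cycle time worse")
    else:
        print("No effect: keep it and try a different experiment")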

  1. As failure demand and queuing delay are squeezed out of the system, the only way to improve throughput is by reducing response time.
  2. Response time can only be reduced by process changes.
  3. By measuring response time, we have a convenient experimental lab to understand if process changes help or not.

Say, haven't I heard this all before?

Well, yes. You may have heard pieces of this from all sorts of places. The feature "slots" we were talking about before are a means to limit "work-in-progress" (WIP) and are often called kanban. The notion of continually adapting your process to improve it is a tenet of Scrum. Test-driven development and pair programming are methods from Extreme Programming (XP) for building in quality up front. Failure demand is sometimes called out as a form of technical debt, and the list goes on and on.

Hopefully what I've done here, though, without putting a name on any kind of methodology, is explain why all these things are good ideas (or are good ideas to try). Ultimately, practices won't help unless they do one of three things:

  1. drive out queuing delay (R_Q);
  2. reduce value-adding response time (R_V); OR
  3. reduce failure demand (N_F/R)

In general, the easiest way to do these for an organization is:

  1. reduce the number of things in-flight
  2. aggressively beat back failure demand by fixing root causes and building in quality up-front
  3. measure response (cycle) time and improve via process experimentation

Fortunately, all of those things are very, very easy to measure. If you can mark a request as either value or failure demand, if you can count the number of things in-flight, and if you can measure the time between starting something and shipping it, that's all you need.

Update: See the next post on this topic for a more intuitive motivation of the theory presented in this article.