Friday, September 19, 2008
Simple Backups for your Mac

You are probably well aware of the need for offsite backups; as a technology professional, this is one of the first arrangements I look into for any permanent storage of business information. When I started two years ago at an internal "startup" within a large company, one of the first things we did was set up an SVN repository and then work out an arrangement with an offsite data storage provider. However, the cobbler's children have no shoes: I've never set up a proper backup scheme for my own data at home, and it's about time to take care of business.

Now that we've moved to using Macs at home, and with the advent of cheap UN*X hosting providers, there's no excuse to keep putting this off. The scheme here is pretty simple: get a hosted Linux server from someone like 1and1.com, dreamhost.com, or rackspace.com, where the storage is backed up and they take care of security updates for the OS. Then set up a pretty simple combination of the UN*X utilities rsync, ssh, cron, and bash scripts to get secure nightly backups going. Just to make it more fun, I'm going to challenge myself to have this all working in under an hour! I'll keep notes as I go on how long it takes, not counting the writeup before or afterward.

I decided to register a domain name with a hosting provider, since it was included. My basic requirements were:

  • SSH access
  • rsync installed
  • enough storage for my data

Be sure to acquire the following information from your hosting provider:

  • username/password with SSH access (preferably root, if you want to use the server for other purposes, but this is not necessary)
  • IP address
  • SSH host key of the server

I ended up registering a new domain at dreamhost.com for $9.95/month. As it happened, DreamHost was running a promotion with unlimited disk space and bandwidth for the lifetime of my account. Score! I did have to email tech support to get the ssh host key. If you find yourself in a similar position, you can ask for the output of:

$ ssh-keygen -l -f ssh_host_rsa_key.pub
2048 0e:c2:f6:f4:d9:86:9d:4b:c4:3d:77:e7:a4:bb:59:14 ssh_host_rsa_key.pub

Ok, great! Now you have a destination for your offsite storage. Next step is to make sure we can securely log in over the network (we'll use ssh for this). On the Mac you want to back up, open up a Terminal window, and ssh into your server using your username and the server's hostname, as in the following. N.B. Do not finish connecting if the ssh server host key you got from your hosting provider does not match the key you see when you try this!

macbook:~ jonm$ ssh jonm@backup.dreamhost.com
The authenticity of host 'backup.dreamhost.com (67.205.39.2)' can't be established.
RSA key fingerprint is 0e:c2:f6:f4:d9:86:9d:4b:c4:3d:77:e7:a4:bb:59:14.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'backup.dreamhost.com,67.205.39.2' (RSA) to the list of known hosts.
jonm@backup.dreamhost.com's password:
[backup]$

Ok, so far so good. Now we need to make sure we can do it without needing a password; this is where user ssh keys come into play. First, let's create an ssh key to use for backups. We'll want to do this as the root user on our Mac, so that when we run the backup script out of cron, we won't run into permissions problems. You can use the "sudo" command to become root on your Mac:

macbook:~ jonm$ sudo su -

WARNING: Improper use of the sudo command could lead to data loss
or the deletion of important system files. Please double-check your
typing when using sudo. Type "man sudo" for more information.

To proceed, enter your password, or type Ctrl-C to abort.

Password: <enter jonm's password on my mac>
macbook:~ root#

Now we need to create an SSH public/private key pair; this is a similar concept to PGP email encryption/signing; you can read a really interesting description of the chronology behind public key cryptography in the book Crypto by Steven Levy. We'll keep the private key locally on our Mac, and take a copy of the public key and copy it securely up to our backup server; then ssh will use the private key when we connect, allowing the backup server to verify using the public key that we are who we say we are, without having to send a password. Nice.

Specifically, we will want to do the following (still as root):

macbook:~ root# mkdir .ssh
macbook:~ root# chmod 700 .ssh
macbook:~ root# ls -ld .ssh
drwx------  2 root  wheel  68 Sep 19 22:01 .ssh
macbook:~ root# ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (/var/root/.ssh/id_dsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /var/root/.ssh/id_dsa.
Your public key has been saved in /var/root/.ssh/id_dsa.pub.
The key fingerprint is:
fd:47:1d:a6:ac:d0:7d:fb:a5:17:cf:e2:8a:93:a5:30 root@jon-moores-macbook.local
macbook:~ root#

Use an empty passphrase (i.e. just hit return when prompted for the passphrase), as this will allow the ssh program to load the key without interaction from you. Also note, however, that anyone who gets root access to your Mac will be able to ssh into your backup server at will. Given that our backup server contains a copy of what this would-be hacker would be able to see on the actual Mac anyway, I don't really see this being a big risk....
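If that risk does bother you, sshd does let you restrict what a key can be used for. When we install the public key in authorized_keys on the server (below), you could prefix the key line like this (a minimal sketch; the from= address is a made-up example, so substitute your own home IP):

from="203.0.113.5" ssh-dss AAAAB3...rest-of-your-public-key... root@macbook

With that prefix on the key line, the server will refuse logins with this key from any other address.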

Now, we need to copy the public key over to the backup server:

macbook:~ root# scp .ssh/id_dsa.pub jonm@backup.dreamhost.com:
jonm@backup.dreamhost.com's password:
id_dsa.pub                             100%  619     0.6KB/s   00:00
macbook:~ root#

You'll have to verify the server SSH key one more time, because now you are connecting as root rather than from your normal user account. Now we'll tell the backup host to accept a login from this key pair (note that we append to authorized_keys, in case the account already has other keys installed):

[backup]$ mkdir -p .ssh
[backup]$ chmod 700 .ssh
[backup]$ cat id_dsa.pub >> ~/.ssh/authorized_keys
[backup]$ chmod 600 ~/.ssh/authorized_keys
[backup]$ exit

Now, we should be able to log in without a password from our Mac:

macbook:~ root# ssh jonm@backup.dreamhost.com
[backup]$

Sweet. Now we create a directory where our mirrored filesystems will live:

[backup]$ mkdir mac-backups
[backup]$ chmod 700 mac-backups
[backup]$ exit
macbook:~ root#

The utility we'll use to do the mirroring is rsync, which can be invoked to run securely over ssh. This makes a nice backup tool for regular use, as the rsync protocol is pretty smart about finding just the small subset of data that changed since the last sync; after the first big sync, for most personal file use, there won't be much work to do each night.

For now, let's set up a test directory on our local Mac.

macbook:~ root# mkdir /tmp/back-me-up
macbook:~ root# echo "data" > /tmp/back-me-up/afile.txt

Now, to make the magic happen, we do this (-a asks rsync for "archive" mode, which preserves permissions, timestamps, and symlinks; -v is verbose; -z compresses data on the wire; -e ssh runs the transfer over ssh):

macbook:~ root# rsync -avz -e ssh /tmp/back-me-up jonm@backup.dreamhost.com:mac-backups
building file list ... done
back-me-up/
back-me-up/afile.txt

sent 116 bytes  received 40 bytes  62.40 bytes/sec
total size is 5  speedup is 0.03
macbook:~ root#

Now we can keep a window open on our backups host, and we should see everything show up there:

[backup]$ ls -lR mac-backups
mac-backups:
total 4
drwxr-xr-x 2 jonm pg1807352 4096 2008-09-19 19:11 back-me-up/

mac-backups/back-me-up:
total 4
-rw-r--r-- 1 jonm pg1807352 5 2008-09-19 19:11 afile.txt
[backup]$

Just for fun, run the same rsync command above and see that nothing happens if there have been no changes (or rather, just that a very small amount of data gets exchanged to verify no changes).

Let's just make sure changes show up:

macbook:~ root# echo "changed-data" > /tmp/back-me-up/afile.txt
macbook:~ root# rsync -avz -e ssh /tmp/back-me-up jonm@backup.dreamhost.com:mac-backups

(other window)

[backup]$ cat mac-backups/back-me-up/afile.txt
changed-data
[backup]$

Ok, looking good. The next step is to identify all the directories you want to back up; let's keep a list of them in a config file on our Mac:

macbook:~ root# mkdir -p /usr/local/etc
macbook:~ root# cat - > /usr/local/etc/backups.conf
/Users/jonm/Documents
/tmp/back-me-up
macbook:~ root#

Note that it is important *not* to have trailing slashes on these directory names, as this changes rsync's behavior slightly in a way that you will probably find annoying (it won't copy the directory name over, just the contents).
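To make the difference concrete, here's what each form would do with our test directory:

# no trailing slash: the directory itself is recreated on the server
rsync -avz -e ssh /tmp/back-me-up jonm@backup.dreamhost.com:mac-backups
#   => mac-backups/back-me-up/afile.txt
# trailing slash: only the contents get copied
rsync -avz -e ssh /tmp/back-me-up/ jonm@backup.dreamhost.com:mac-backups
#   => mac-backups/afile.txt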

Ok, now the next step is to set up a script that can sync each of the directories:

macbook:~ root# mkdir -p /usr/local/bin
macbook:~ root# touch /usr/local/bin/do-backups
macbook:~ root# chmod 700 /usr/local/bin/do-backups
macbook:~ root# cat - > /usr/local/bin/do-backups
#!/bin/sh
# read one directory per line; the quotes around "$dir" keep paths
# containing spaces (common on Macs) intact
while IFS= read -r dir; do
  rsync -avz -e ssh "$dir" jonm@backup.dreamhost.com:mac-backups
done < /usr/local/etc/backups.conf
macbook:~ root#

Now we run it once by hand to make sure it works:

macbook:~ root# /usr/local/bin/do-backups

Finally, we install this in root's crontab as follows:

macbook:~ root# crontab -l > /tmp/root.cron
macbook:~ root# cat - >> /tmp/root.cron
# take a backup every day at 3am
0 3 * * * /usr/local/bin/do-backups >/dev/null
macbook:~ root# crontab /tmp/root.cron
macbook:~ root# rm /tmp/root.cron

Nice and simple. Now the backups are off and running every night without your intervention.
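If you'd rather keep a record of each night's run than discard it, you could point the cron job's output at a log file instead (the log path here is just a suggestion):

# take a backup every day at 3am, keeping a log of what was synced
0 3 * * * /usr/local/bin/do-backups >> /var/log/do-backups.log 2>&1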

If you ever need to restore from the backup, you can always reverse the rsync process like this:

macbook:~ root# rsync -avz -e ssh jonm@backup.dreamhost.com:mac-backups/back-me-up /tmp

for each of the directories you have backed up over there.
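Or, as a small sketch that drives the whole restore from the same config file we set up above (paths containing spaces may need extra escaping on the remote side):

#!/bin/sh
# restore each backed-up directory next to where it originally lived
while IFS= read -r dir; do
  rsync -avz -e ssh "jonm@backup.dreamhost.com:mac-backups/$(basename "$dir")" "$(dirname "$dir")"
done < /usr/local/etc/backups.conf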

Enjoy, and sleep well tonight....

P.S. Total elapsed time for the exercise was 2 hours from the time I placed the hosting order to the time the crontab was installed, but I took a one hour break in the middle for dessert and bedtime with the kids. So I'll claim this really did only take one hour of "CPU time" for me.

Tuesday, September 2, 2008
Measure your improvements

Metrics are an important part of any development group's toolset. If we want to continually improve our ability to develop software (through a lean engineering kaizen approach, or simply as a learning organization), then we need to have a way to figure out:

  • what parts of our process need improvement?
  • when we make a change, did it help or hurt?

This is where process metrics come into play. I'll start with my definition of a metric, which is a numerical measurement of something. If you can count it, it can be a metric. So "number of outstanding bugs" is a metric, but "software quality" is not. The term "quantitative metric" is redundant, and "qualitative metric" is an oxymoron.

There are generally two types of metrics we can capture:

  1. causal metrics: these are metrics that have a direct business impact: for example, ROI for a feature, unique monthly visitors, click-through ad rate, etc.
  2. symptomatic metrics: these are metrics that do not directly affect ROI (although we might believe they do) but are downstream indicators for one or more causal metrics. The number of outstanding bugs in a product, the number of bugs caught in a certain phase of development, percentage of code covered by unit tests, etc. are all symptomatic metrics.

My general observation, based on reading articles around the use of metrics for improving your processes, is that a lot of metrics-based improvement projects fail to distinguish between these two types of metrics. Partly, I think this is because while the causal metrics properly align your improvement efforts with your business's interests, they are also harder to define and measure. By contrast, a lot of symptomatic metrics are easy to find and measure, but their relationship to the business may be less clear.

For example, consider the balance between software quality and time to market. You can take a longer time when developing a feature or product to reduce the number of bugs that show up at deployment, or you can ship a feature more quickly, knowing that there may be both known and unknown bugs present. In this case, you can measure both number of known bugs at deployment time, and you can measure overall time-to-production for a feature (time from feature conception to deployment).

Now, if you can decrease time-to-production without increasing the bugginess of your code, that's a win. Similarly, if you can reduce bugginess without lengthening your production time, that's a win. However, both of these things will probably require some effort to implement. Another interesting possibility would be to simply adjust where you sit on this balance. For example, simply spend more time looking for and fixing bugs in your QA phase, trading fewer bugs for a slower time-to-market. Or vice versa to get to market more quickly, possibly with more bugs. Both of these adjustments are probably relatively painless to implement, in that no one has to change what they do, just how long they do it for.

So the question is, which one of these things ought we to do? My argument is that bugginess and time-to-production, both being symptomatic metrics, don't give us the answer directly. It all depends on our product environment. For example, when producing software to run medical equipment, a company reputation for quality might be more important than shipping new features quickly; or, in a highly innovative internet space, time to market might be king in terms of how much market share you can capture.

It's management's job to both help implement win-win changes as well as to set the "slider" of the quality/time tradeoff at the right spot. The trick, of course, is that it might be hard to measure this directly; there are several symptomatic things we could measure, including:

  • time spent in development
  • time spent in QA
  • time spent in deployment
  • number of outstanding bugs
  • profit for the product at a given point in time

Now really, the profit over some window of time (e.g. for a website, revenue vs. development spend for a given month) is the thing we want to optimize. The interesting idea here is for management to be able to run a series of experiments: if I increase/decrease QA or development time, how does it affect ROI for my product? How does the relative bugginess of a release affect its profitability? For certain "sliders" in the business, it is relatively simple to take a series of measurements to find a current "sweet spot".

An interesting idea here is that sometimes we work through things backwards. For example, we try to estimate "how long will it take to fully regression test a release", or "how long will it take to code up a feature", rather than "how buggy will the release be if we test it for X amount of time", or "how much of this functionality can you develop in X amount of time." In other words, rather than deriving the time-to-market from a set of estimates for all the steps, instead set the time to market by timeboxing those steps, and see what the outcome is. This is a powerful notion of metrics-based management that is hinted at (in the notion of Scrum timeboxed iterations) but which I have not seen explicitly suggested anywhere[1]. (Please post all the references to things that I've missed in the comments section--I'm sure there are plenty).

At the end of the day, however, it is hard to optimize things we can't measure. I think important metrics to gather are:

  • the levers we have available to manipulate our process (e.g. timeboxing)
  • causal metrics that affect our business (ROI, product profit)

We need to be aware of which metrics are causal, and which are merely symptomatic, so that we are measuring things that directly affect the business somehow. This approach permits empirical management--adjust something you can control, see how it affects your causal metrics, rinse, repeat.


[1] Scrum timeboxes an entire iteration, but does not timebox an individual feature, so a team may be able to spend all their time on one feature, or spread their effort across many features. The closest thing I've seen here is the notion of the "Small" in INVEST user stories, where stories are limited to a certain amount of complexity. However, the story points in this case are still estimates of the work involved, rather than timeboxes around how much time to spend implementing a feature; the "small" requirement is really there to permit more accurate estimation rather than to timebox the amount of effort (and although it does secondarily have a timeboxing effect, I've not seen that stressed in articles on the subject).

Sunday, August 3, 2008
Cracking down on technical debt

"Simplicity is the ultimate sophistication." --Leonardo da Vinci.

"Everything should be made as simple as possible, but no simpler." --Albert Einstein

"A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away." --Antoine de Saint-Exupery

I've written before about the notion of technical debt. In this post, I want to discuss a few specific sources of technical debt that are easy to accrue, particularly in an agile, iteration-based setting.

Incomplete technology transitions

These can arise when a technical decision gets made to transition from one technology/architecture/design to another, and the transition happens incrementally. What can end up happening is that an agile team, say one operating under the Scrum framework, does not complete its incremental transition during the current sprint. Now, although the code is in a working state, there is a good chunk of technical debt arising from having code operating under two separate systems. This transition debt is problematic for a few reasons:

First, this can complicate debugging efforts -- when there is a problem with the system, someone has to determine under which scheme the code in question was written. Typically this can mean looking in two different source code hierarchies, or looking through two separate sets of configuration. The system is, as a result, more complicated than it needs to be.

Secondly, this can be an attractor for additional debt; if the old system is still around, and a developer is more familiar with the old system than the new, there is a very strong temptation to make changes/additions in the old system. This work simply adds to the outstanding transition work, and despite the developer's familiarity, is likely to be implemented in a more difficult or less efficient way (assuming, of course, there were valid technical reasons for making the transition in the first place).

Finally, this can cause extra work during feature development that touches/interacts with the subsystem in transition, because either the cooperating subsystems have to special-case two different interaction styles, or an adaptation layer has to be built to handle both subsystems and abstract their existence away from clients' concerns. Either way, you are writing more code than you would have if you had completed the transition and had just one implementation of the subsystem.

Teams working on an iteration-based methodology need to do several things to avoid the pitfalls from transition debt:

  1. when a technical decision for a transition has been made, it must be communicated clearly to the whole development team, including the reasons for the transition. This can help prevent the unintentional accrual of additional transition debt.
  2. plan for more refactoring time when signing up for work, to leave time to complete transitions before an iteration ends.
  3. communicate the existence of the transition debt to the Product Owner at the review, so that completing the transition can be scheduled as a backlog item. Furthermore, stress the priority of this carryover work to ensure that the transition debt exists for the shortest amount of time possible.

Obsolete/extraneous configuration

We'll call this type of technical debt configuration debt. There are a couple of sources of this type of debt:

  1. transitional runtime configuration that still exists after the transition. For example, when a data partner was making an id space transition to extend the length of their ids, we had a flag to govern whether to use old or new ids with that partner, so that we could decouple our code releases from the partner's transitions. Over a year later, the flag still exists, but it is always set to use "new" ids, so there is certainly unneeded code to handle this.
  2. exposing properties that would only change with a code drop as runtime configuration. In this case, the values of the properties would really only change if we rolled a new code release, so they could just as easily be compile-time constants that would not require the scaffolding to make them runtime properties, no matter how simple that scaffolding might be.

Unnecessary code hurts you in several ways:

  • it took someone time to write it in the first place
  • you have to compile it or run its unit tests over and over again while you're developing (death of a thousand cuts)!
  • people need to keep it in their mental model of the system instead of leaving room for parts of the system that actually do something useful

The easiest way to prevent the accrual of configuration debt is to review any new runtime configuration parameters at the end of each sprint (which you probably have to do anyway so your operations folk know how to properly configure the new system). Then, where possible:

  • turn as many runtime parameters into compile-time constants as possible
  • ask under what conditions the parameters will no longer be needed (for example, for configuration that assists with an external transition, add an item to the product backlog to clean up the codebase after the transition is successful)

Obsolete/insufficient architecture and design

Architectural debt is probably the most nefarious, because this is debt that doesn't actually get created when the code is first written. Instead, this is usually caused by external factors such as:

  • the business environment changes, and actual traffic is significantly different from what was originally anticipated. This can leave you either with an overly complex, over-engineered system, or with a too-simple system that can't easily scale.
  • product direction changes, and the architecture is not flexible along the new axis of change, so that new development is overly difficult.
  • expected performance of a new provisional architecture is invalidated by experience

Basically, as soon as you realize you need to change your architecture, you have magically "created" technical debt out of all the code that depended on the first architecture. In reality, this debt is probably unavoidable, and what you've really done is convert your inability to perfectly predict the future into a set of work that incorporates new knowledge about the problem domain.

This can also be hard to identify by scrutinizing the code, but there are some external symptoms of it:

  • difficulty meeting desired performance or scalability targets, especially when concentrated in a certain feature subsystem
  • adding new instances of a certain class of feature does not get easier over time
  • lots of bugs being generated by a specific subsystem
  • increased time-to-market for new features
  • acceleration of bug creation rates
  • accrual of standard operational processes that require manual intervention/support

So when you have some or (gulp) all of those symptoms, you probably have architectural debt lurking in your system. Once you have identified it and have a new target architecture, a lot of this will get converted into transitional debt while you are making the changes.

Technical debt vs. technical investment

I want to be careful here to distinguish between two sorts of non-functional requirements that might show up on a "tech backlog":

  • technical debt: this is current brokenness or unneeded complexity in the system that is actively slowing down the business of turning product backlog into working software for your customer.
  • technical investment: these are things that are not necessarily broken per se, but which could speed things up for someone. A good example of this would be automating a manual process.

Technical investments can probably be put off while you have existing technical debt, although it can sometimes be hard to distinguish between the two. Clear technical debt should probably be prioritized at the top of a product backlog, unless there are really high ROI items that might trump it. In general, getting rid of technical debt will increase the ROI of everything else on the backlog, simply by decreasing the "I" part. It can also make estimation more accurate by reducing the complexity of the system to which new functionality will be added.

Whose responsibility is technical debt?

Generally, as the folks with the technical ability to recognize it, it is the development team's responsibility to try to avoid accruing technical debt while producing product. Failing that, it is their responsibility to recognize/document existing debt and to advocate for its removal. However, note that there are often symptoms of technical debt, such as those I've listed above for architectural debt, that can be recognized by non-technical folks too.

On the flip side, business folks / product owners need to be able to trade off short term wins that accrue technical debt vs. taking longer to produce a product with less debt. Communication with the tech team is of vital importance here; undoubtedly there will be times when a short-term win will be important (especially with a first-to-market situation), but it needs to be accompanied by a plan to eliminate the accrued debt. i.e. Treat your technical debt like credit card debt that should be paid down ASAP, and not as a long-term mortgage.

The interest on your technical debt is probably not tax-deductible.

Friday, April 18, 2008
The Crucible

We've recently re-adjusted our sprint length to be 3 weeks, down from 4. The reasoning behind this was to allow us to align with sprints on other products, to give us a uniformity of scheduling and to permit inter-product developer swaps from sprint to sprint, if needed.

One of the side effects is that our planning process now takes up proportionately more time of the sprint and so the actual work time of the sprint is quite compressed. Couple this with a handful of developers being out for personal reasons (wedding, health issues, impending birth of a child) and suddenly this sprint had a lot of high priority sprint backlog and not a lot of available story point capacity on the teams.

But then something magical happened--what I'll call the "crucible moment." The status of the sprint was that there were a bunch of high priority, smaller maintenance tasks (cleanup of existing features, releasing the previous sprints' work to production, etc.) and then one big new feature. As the teams were working their way down the list of user stories and filling up their story points, everyone quickly realized that that big new feature might not fit into the sprint.

With the purifying fire of timeboxing (forgive my overly dramatic metaphors here), the sprint teams and product teams immediately began self-organizing and negotiating. No one wanted to have a sprint without a killer feature, and so the horse trading began. Some of the higher priority stories were off the "tech backlog" -- non-functional investments in build infrastructure, etc. Developers began identifying tech backlog items that could stand to wait. Product owners began reconsidering some of the higher priority smaller stuff, conceding that some of them might not be so important after all. Different development teams tried to juggle stories between themselves so that the large story could get onto one team's sprint backlog (we prefer not to split stories across teams, to minimize cross-team dependencies). Some stories were scoped down to take less time.

And then, when it was all done, we discovered there was enough room for two pretty big features.

This was scrum at its best--focusing on the art of the possible to squeeze as much as possible into a fixed amount of time. No time for extraneous explorations, just a focus on extracting the maximum product value out of the available resources, with a brutal willingness to critically re-examine priorities. What's more, it was carried out in a context of teamwork between the product owners and the teams; everyone was trying to figure out how to get those features in.

So, a great day for our product and for our organization. We might actually be getting the hang of this scrum thing....

Wednesday, March 26, 2008
Sprint Review

Just wanted to follow up on my previous post about how we were going to get back to scrum basics this sprint. We had our sprint review and sprint retrospectives today, and the sprint was viewed as a great success.

Here are some things that I think contributed to our success:

Better estimation: We used past task estimation history plus Monte Carlo simulation, as I described earlier, to sign up for a set of achievable tasks. The end result was that we showed up at the review with 90-95% of the committed work done. The parts that didn't get done were due to external blockers we couldn't do anything about (e.g. an external partner didn't have a data feed ready) or were due to too many round-trips through the design-IA-product-eng collaboration cycle (essentially, deviating from the plan we set forth at the beginning of the sprint) that took time away from other things.

Commitment: Towards the end of the set of product backlog our team was planning, we hit a fairly large task that was hard to estimate out. However, we had two team members who were very lightly booked, and we simply asked them if they were willing to commit to finishing the task somehow by the end of the sprint. They signed up, and they got it done. There was really no talk of punting things off the sprint, although there were a couple of rescopings and/or a change of plan due to time constraints that happened to get things done by the review. In general, I think these changes actually just meant the abstract functionality in the user story got built with fewer resources -- i.e. we completed them more efficiently thanks to the pressure of timeboxing.

Empowerment: Thanks to being committed to their tasks, my teammates basically kept themselves unblocked most of the sprint by engaging with other departments (design, product, IA, ops, QA) early on to make sure they could get what they needed. Most of the time things stayed unblocked by getting the right folks together in one room. One of my teammates mentioned that one of the things I was best able to do was to simply identify which people needed to be in the room! Then he went off and arranged the meetings and got things done himself.

Shippable product: We planned explicit tasks for each feature to test/verify it, to show QA how it worked, and to demo it and get explicit signoff from the product owner. Our product owner gave me the feedback that this resulted in most of the user stories we called "done" being potentially shippable.

Self-organization: The team dynamically swapped tasks with one another to load balance effectively during the sprint. There was such appreciation for this help that people started buying each other donuts as thank-yous, and I think we ended up with around 4 dozen donuts being exchanged by the end of the sprint. Bad for the waistline, but good for team morale!

On another note, this was a re-entry for me into the dual role of scrum master and team member (in that I signed up for development sprint tasks as well as being scrum master). For other developers who find themselves being called on to carry the scrum master mantle, here are some tips to help you survive:

  • Only book yourself half-time. The rest of the time will be eaten up by scrum mastering, and if you don't scale back on your coding commitments, you're just going to end up staying up late working all sprint.
  • Tell people how to get unblocked rather than trying to unblock them yourself. This seems pretty simple, but is a big time saver. For one thing, it's probably quicker to describe what to do (e.g. just set up a meeting with X, Y, and Z) than it is for you to send out the meeting invite yourself. Plus, you remove yourself as a bottleneck -- so your teammate doesn't have to wait for you to send the invite. Furthermore, after you do this a few times, your teammates will pick it up and will be able to just solve more of their problems themselves without asking you for help.
  • Get good reports. I invested 5-6 hours towards the beginning of the sprint to get some automated reports set up to get info out of our ticketing system. I really ended up just using two reports:
    1. Storyboard: for each story, identify who has tickets against it, and identify whether each ticket is "Not Started", "In Progress", or "Finished". This provides a quick way to scan the status of each story and reminds you to ask questions like: what do we need to do to close out that story again? why are we working on lower priority things instead of the higher priority ones? etc.
    2. Burndown count: for each story, add up time remaining on each ticket, and then provide a total for the amount of hours left for the sprint.
    So what I did was: right before scrum, I pulled up the Storyboard report to check status so I could ask followup questions during scrum. Right after scrum, I hit the burndown count and then actually plotted by hand on a big piece of flipchart paper the burndown graph. That was pretty much the extent of the reporting I needed.

So, all in all, things turned out pretty well. I'm excited to see the teams back in a good sprint cycle groove and turning things out (our review meeting lasted almost 5 hours, mainly because so much had gotten done across the multiple teams working on the product that it took a while to go over it all). We'll be mostly keeping things the same going into the next sprint, so I'll be curious to see if the teams are able to accelerate now that they're used to working this way.

Sunday, March 16, 2008
REST: Unix programming for the Web

I've been giving some thought to REST-style architectures lately, and recently re-read some of The Art of Unix Programming by Eric S. Raymond. ESR notes that some of the characteristics of Unix-style programming include:

  • Do one thing, and do it well (attributed to Doug McIlroy)
  • Everything is a file.
  • Compose complex systems by connecting smaller, simpler programs (e.g. Unix pipes).

Unix-style systems have had undeniable success and remarkable stickiness for a technology; I have an old copy of my father's System V manual from when he worked at Bell Labs (yeah, they actually printed out the man pages and bound them!), and I pretty much recognize everything in there. Sure, there are many new commands available, new kernels, new distributions, etc., but they are all very recognizably Unixy.

I've been thinking about how this philosophy applies to the Web 2.0 world. I think this list turns into:

  • Do one thing, and do it well.
  • Everything is a RESTful service.
  • Compose complex systems by interconnecting smaller, simpler services.

For one thing, "do one thing, and do it well" isn't limited to Unix, this is a key part of abstracting and decomposing a technical problem. But we see it everywhere: I read my mail on Gmail, keep my to-do lists on Remember The Milk, store photos on Flickr, keep my browser bookmarks on del.icio.us, etc. All these sites adhere to this principle.

But taking things down a layer, what does this mean to someone architecting or implementing a Web 2.0 service? For one thing, I think this means that it would make sense to break your overall service down into individual microservices that are individually maintained and deployed; i.e. break things down to the smallest level where they still make coherent sense.

As for "everything is a REST service", the everything-is-a-file abstraction worked for Unix because there was a small set of common operations that applied to files (open, close, dup, read, write). Sounds a lot like doing REST over HTTP, with HEAD, GET, PUT, POST, DELETE.

Combining various small REST microservices into larger services is already being done on a one-off basis (this is, after all, basically what iGoogle is). The main question is: what is the Web 2.0 equivalent of the pipe? Namely, is there an easily understood abstraction for composing webservices? Sounds like a topic that might be ripe for some kind of logical calculus (like the relational calculus for databases or the pi-calculus for concurrent processes), e.g. the REST calculus. If there were a couple of easily understood and specifiable combinators, these could be built into some language-specific libraries for use in quickly building new macro-services.
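Just to make the pipe analogy concrete, here is a sketch of what composing two microservices looks like at the command line today, using curl and two entirely hypothetical services:

# fetch a user's location from one service, feed it to another as input
curl -s http://geo.example.com/users/jonm/location \
  | curl -s -X POST -H 'Content-Type: application/xml' -d @- \
      http://weather.example.com/forecasts

A real combinator library would presumably abstract away the plumbing (content negotiation, error handling, retries) the way the shell abstracts file descriptors.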

I might try to interest some of my old colleagues from my programming language research days to see if they have anything to say about the matter....

Comments definitely welcome here. Is this a new idea? Are others espousing this? Is this even a good idea?

Thursday, February 28, 2008
Return on INVESTment

At our last pre-planning meeting, we made a point of putting all the user stories into INVEST format, and I was pleased that there was a general consensus that this worked well.

I think this took quite a while, partially because we were all getting used to evaluating the statement of the stories critically, but I think this was worth it. We had product, IA, and engineering folks suggesting wordings for stories, and at least some of our stories came in in a canonical format:

"As <who>, I want <feature>, so that <value>."

Having the full INVEST filter did help us with a few things:

  • Independent: We did not have a lot of stories that were dependent, but we did end up with a very few that were cross-team. Since this was just a handful of the stories, we figured it would be ok to leave the story as is (since we couldn't quickly come up with a way to rewrite it) and track the dependencies through our scrum-of-scrums.
  • Negotiable: I think we generally nailed this one. We did not look at any IA wireframes or Design mockups during pre-planning, and were able to identify the high-level goals here. This has already become useful in planning, where my team has suggested a new / faster approach to at least one story, different from how it was originally conceived, but our Product Owner agreed that we still hit the abstract requirement.
  • Valuable: There were a few stories that were originally cast as abstract technical requirements (e.g. make the middleware support foo), but we refactored them in a way that expressed value to the end user (which, incidentally, will make it more obvious how to test).
  • Estimable: We've already agreed not to do a small handful of stories because we realized there were too many unknowns; for one of these, there was a middleware task that depended on IA we hadn't seen and on data we didn't have in the DB yet. We actually spent a lot of time trying to talk about how to tackle this before we realized we couldn't estimate it. Since it was not critical for implementation this sprint, we agreed with the product owner to turn this into a story where we would do a feasibility study and rough design by the end of the sprint instead, which was something we could commit to. (The value here is to the product owner, who will then be able to write an Estimable story about the feature!)
  • Small: We did a good job here of refactoring large stories into multiple pieces and then figuring out how to get Value out of each piece. We only had a handful of medium to large stories.
  • Testable: We're asking each team during planning to make sure they hash out "how to demo" for the feature with the Product Owner, so we should know if we've gotten there by the end of sprint planning.

With the up-front work on the stories we did, our team has found it easier to negotiate a specific solution and in some cases to actually plan without the product owner, which is actually convenient (he's doing multiple duty so he's having to bounce back and forth between multiple sprint planning sessions!).

Sunday, February 24, 2008
INVESTing in user stories

In my previous post I made reference to the INVEST acronym for evaluating user stories:

  • (I)ndependent
  • (N)egotiable
  • (V)aluable
  • (E)stimable
  • (S)mall
  • (T)estable

I'd like to spend a little bit of time talking about each of these characteristics, and motivating why each is important through some anecdotes about what happens when each characteristic is not attained.

Independent: The idea here is that stories are free of dependencies on one another. (A good test would be to ask if you could implement them in any order, rather than just in the priority order generated by the Product Owner.) We want this for multiple reasons: first, on a multi-team project, it allows stories to be load-balanced across teams more easily, since in theory any one story could get moved around, rather than having to move an entire batch of them. Secondly, and on a related note, it means that when sprint teams sign up for work, they can draw the commit line anywhere. Thirdly, it helps avoid cross-team dependencies (although a scrum-of-scrums and specific attention to dependent situations can handle it, it is still much easier to handle all your dependencies intra-team).

For a specific example, consider if you have teams that are layered horizontally by functional layer (e.g. a database team, and a webapp team). If you have two stories which are the "halves" of building some functionality, and assign one to each team, you take on the following risks:

  • teams must coordinate closely around this feature (extra communication/tracking overhead)
  • if one team finishes but the other doesn't, you may have extra "clean up" work at the end of the sprint to make the software still work, and you may have had a team do work that ultimately had no product impact that sprint (e.g. middleware can handle some new data and display it, but the data didn't actually appear in the DB, so no actual behavioral difference)

We'll see that some of the other INVEST characteristics actually help track this down as well.

Negotiable. To me, this means that the requirements given are as abstract as possible, so that the actual details of what is going to get built are determined by the team and Product Owner during sprint planning and modified as needed over the course of the sprint. For example, "user can change timezone with one click" vs. "user timezone is shown in a 100x25 dropdown box in the header". (Of course, if the latter language actually represented a contractual obligation, then that might be as abstract as you can make it, but you get the idea...).

This is primarily important for two reasons: to allow the team to exercise maximal creativity and to allow the team to adjust for the unpredictable events that will happen mid-sprint. In the former case, the dropdown box might be tacked on to the story as a starting point or suggestion, but the team may come up with an approach that is easier to implement or which actually satisfies the Product Owner more. In the latter case, it leaves some room to negotiate if the team has under-estimated and is behind mid-sprint, and still be able to "finish" the user story.

The danger of not being able to negotiate the functionality is really twofold: first, not completing the full set of work in a sprint eventually crushes team morale; not being able to complete an assignment is a big downer, especially when you have worked hard and, due to the inherent complexity of software development, things just took longer than you thought. Eventually you get to the point where your team will just shrug, and say, oh well, I guess we won't finish that stuff. I have seen this happen first-hand, and it is demotivating.

Secondly, there is danger to the product roadmap. By not having requirements couched abstractly, you run the risk of slipping features or having unfinished, "carryover" work, which adds up and pushes the product roadmap back. By giving the team the freedom to brainstorm a way to satisfy the requirement in less time, you are not letting them off the hook -- you are using the deadline of the end of the sprint as pressure to encourage the team to develop the functionality as efficiently as possible. The resulting functionality may not be as complex or deep as originally hoped for or conceived, but if the spirit of the story is met, then you have some aspect of it ready to go out as shippable product.

Valuable. Each user story should provide value somehow. The original articles I read about this amended this to "provide value to the end user / customer." While I think noting the value explicitly can help product owners check off that the story will help their Key Performance Indicators (KPIs), I think the more important thing this brings is that the team is aware of why this story is important. This can help constrain the solution space for Negotiation in an important way. Finally, this is just a cross-check for Independence; if this story is completed, and no other, does it generate value, or do we also need another story to be finished in order to get the value?

On a side note, the phrase "to the end user" is an interesting one. We keep a "tech backlog" of infrastructure/refactoring ideas around, and prior to each sprint, we evaluate which ones are critical to current development, and ask for those to be prioritized in with the current product backlog. Generally, we have been very judicious about this, maybe in a nod to providing direct benefit to an end user -- we usually wait until an infrastructure adjustment is needed or desirable for implementing new functionality before taking it on. This suggests that other infrastructure work get carried out during lab days, to scratch those developers' itches. I'm still up in the air over how strict we should be about "to the end user", but I do lean towards requiring that most of the time.

Estimable. If you can't put an estimate on it, either the requirements are too vague (see Testable), or the technical solution is unknown (e.g. "I don't even know if this is feasible!"). In the latter case, a suggested tactic would be to alter the user story into a feasibility study / research effort for this sprint, which could be easily timeboxed; the original story could then be revisited in a later sprint, when there will be less cloudiness around it, and it can be properly estimated.

Where this can bite you is in signing up for work you don't know you can finish. This sets up an expectation mismatch with your Product Owner, and also prevents you from making efficient use of your time; you run the risk of attempting to estimate it anyway, and either grossly underestimating it, putting all the lower-priority sprint backlog at risk, or grossly overestimating it and not signing up for enough work, or padding the estimate to try to account for the risk and then spending more time on it than the feature is really worth.

Small. We want things that will take no more than a full sprint to do, and hopefully less. While the full scope of a feature vision may require more than one sprint, you want to refactor it somehow so that you get something out of even the very first sprint (see Valuable); then you can be sure that you reap some result from your effort in shippable product, as opposed to doing some partial work, and then having the remaining work deprioritized for several months before you can extract value.

Finally, a purely pragmatic reason for keeping the stories small is to reduce the risk of gross under-estimation. I've personally been way off on a single big story (as everyone is from time to time), and when it's been a big one, I've simultaneously trashed my personal life for a month of overtime and sleep deprivation, while requiring a bunch of load-balancing and rejuggling across multiple sprint teams due to the fact that the rest of the sprint backlog I signed up for was now at risk. So don't do that -- take smaller bites, just like Mom used to say.

Testable. This primarily serves three purposes: ensuring that Product Owner expectations are in-tune with what the team thinks it is delivering (see also Negotiable), giving the team a way to know when it can stop working on the story, and giving the testers a starting point for writing their test cases. Beware of non-quantifiable adjectives like "good" or "acceptable" in your user story descriptions. For a while, when we were doing sprint planning on spreadsheets, we had a "How to Demo" column--this worked great while we did it, but we never had enough discipline to follow through here and continue doing it. This is one of the things I'm hoping to bring back during my Scrum revival next sprint.

There's nothing worse than showing up to a sprint review and having your Product Owner say, "but that's not at all what I asked for, or that's not what I meant." Big morale crusher for everyone involved (team and stakeholders).

Finally, this gives the Scrum master a hook to save the team from perfectionism or unbounded creativity. For example, if you've gotten the feature to the point where it satisfies the acceptance tests and has been built up to your standards of quality, just stop working on it, and start working on the next user story. This is your old friend, the Pareto principle, at work -- would you rather spend a day mining the long tail of a functionally complete feature, or would you rather spend it getting the up-front meat of a new feature? The other place this helps is when you finish a feature, and you and the Product Owner are looking at it, and you now see something totally awesome that is now possible -- stop, ship the feature you have, and queue the good idea up as a user story for the next sprint so it can be properly prioritized with everything else. Again, this is about efficient use of the time in the sprint.


I'm anticipating having this be a little painful as we work through this together with the product team the first time; we've all signed off on this in principle, but we've never actually attempted to make each story adhere to INVEST. I suspect, like all similar things, it will be a bit robotic for the first few stories, but we'll quickly get the hang of it and be able to move on during pre-planning. I'll let you know how it goes.

Saturday, February 23, 2008
Back to Basics

I've been re-reading "Agile Software Development with Scrum" to see if it has any insight for some of our current product struggles. Fortunately, I think it does, and I've been getting fired up for a great Scrum revival. I'm going to be reprising my role as Scrum Master for my team next sprint as well, so hopefully I'll be able to transmit some of these values.

Looking forward to getting back to some Scrum fundamentals, including:

  • sprint goal: let's set an overarching sprint goal we can be working towards
  • INVEST user stories: in particular, the "N" -- negotiable -- means we have maximum flexibility to react with agility during the sprint
  • commitment: the team signs up for a set of work, and commits to completing it by the end of the sprint. No carryover. If burndown shows we are behind, we need to collaborate with the Product Owner to revise functionality / approach so that we have an achievable set of work. Also, we need to have the team take responsibility for completing the work -- no more reporting blockers and then giving up without trying something else.
  • product increment: end of the sprint, we have a potentially shippable product. No half-baked / half-finished pages on the site.

I'm pretty excited for the upcoming sprint. We've got a lot of talented folks, we just need to really get them motivated and self-organized, and then awesome stuff will erupt.

Wednesday, February 13, 2008
REST-based service design, v2

I want to revisit the basic "favorite food" service from last post, in light of some further discussion I've had with colleagues.

http://host/path/{userid}/favorites
  • GET: returns a list of the user's favorite foods, in some representation. Returns 404 NOT FOUND if the user has never specified favorites.
  • PUT: sets the list of the user's favorite foods to be the value sent, creating it if necessary.
  • DELETE: removes the user's favorite list. This is different than doing a PUT with an empty list.
  • POST: returns 405 METHOD NOT ALLOWED.
http://host/path/{userid}/favorites/{food}
  • GET: returns 200 OK if the food is a favorite of the user. Returns 404 NOT FOUND if the food is not a favorite of the user.
  • PUT: makes the food one of the user's favorites
  • DELETE: makes the food *not* a favorite
  • POST: returns 405 METHOD NOT ALLOWED

So, PUT can create a resource if it doesn't exist, and DELETE of a URI means the next GET of it (in the absence of any other operations) will be a 404 NOT FOUND.
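To illustrate that lifecycle with curl (host and path are placeholders, and the exact success codes are up to the implementation):

# create, observe, delete, observe
curl -s -o /dev/null -w "%{http_code}\n" -X PUT http://host/path/jonm/favorites/spaghetti     # 200 or 201
curl -s -o /dev/null -w "%{http_code}\n" http://host/path/jonm/favorites/spaghetti            # 200
curl -s -o /dev/null -w "%{http_code}\n" -X DELETE http://host/path/jonm/favorites/spaghetti  # 200 or 204
curl -s -o /dev/null -w "%{http_code}\n" http://host/path/jonm/favorites/spaghetti            # 404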

One note is that, as a client of this API, I want to use the most atomic operation available. For example, to make "spaghetti" a favorite food, I could either:

  1. GET http://host/path/{userid}/favorites
  2. PUT http://host/path/{userid}/favorites (new list with spaghetti added in)

or I could just:

  1. PUT http://host/path/{userid}/favorites/spaghetti

Note that in the first case, I might have an atomicity issue in the presence of concurrent access, so I might need to build in some sort of optimistic locking protocol, where the representation of the favorites list has a version number on it that is checked by the PUT operation. However, if I just use the second method, I don't have this issue, because the server handles all my concurrency/transactional stuff for me.
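One standard way to implement that version check over HTTP is with entity tags; here is a sketch of the exchange (headers abbreviated, host and path still placeholders):

# GET returns the list plus a version in the ETag header
curl -si http://host/path/jonm/favorites
#   HTTP/1.1 200 OK
#   ETag: "v42"
#   ...list representation...
# the PUT sends that version back via If-Match; if someone else changed
# the list in the meantime, the server answers 412 Precondition Failed
curl -si -X PUT -H 'If-Match: "v42"' --data @new-list http://host/path/jonm/favorites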

Tuesday, February 12, 2008
REST-based service design

Just wanted to walk through a quick REST-style API design exercise. Let's suppose I want a service that lets folks maintain a list of favorite foods. Just as a quick strawman design, let's walk through some URIs and supported operations:

http://host:port/path/{userid}/favorites
  • GET: returns a list of the current user's favorites. The list might be empty.
  • PUT: overwrites an existing list of favorites.
  • DELETE: sets the list of favorites to the empty list.
http://host:port/path/{userid}/favorites/{food}
  • GET: returns something (really, we just want a 200 status code) if the food is a favorite, or returns 404 if it is not.
  • PUT: makes the food a favorite.
  • DELETE: un-favorites the food for the user.

So, questions for the audience:

  1. Does this look reasonable?
  2. Does the use of PUT look right, or would you use POST here?
  3. How about the use of DELETE to set the full list to the empty list? Does it make sense to DELETE a URI and then be able to GET it and have something be there?