Friday, September 19, 2008
Simple Backups for your Mac

You are probably well aware of the need for offsite backups; as a technology professional, this is one of the first arrangements I look into for any permanent storage of business information. When I started two years ago at an internal "startup" within a large company, one of the first things we did was set up an SVN repository and work out an arrangement with an offsite data storage provider. However, the cobbler's children have no shoes: I've never set up a proper backup scheme for my own data at home, and it's about time to take care of business.

Fortunately, now that we've moved to using Macs at home, and with the advent of cheap UN*X hosting providers, I have no excuse to keep putting this off. The scheme here is pretty simple: get a hosted Linux server from someone like 1and1.com, dreamhost.com, or rackspace.com, where the storage is backed up and they take care of security updates for the OS. Then set up a simple combination of the UN*X utilities rsync, ssh, cron, and a bash script to get secure nightly backups going. Just to make it more fun, I'm going to challenge myself to have this all working in under an hour! I'll keep notes as I go on how long it takes, not counting the writeup before or afterward.

I decided to register a domain name with a hosting provider, since it was included. My basic requirements were:

  • SSH access
  • rsync installed
  • enough storage for my data

Be sure to acquire the following information from your hosting provider:

  • username/password with SSH access (preferably root, if you want to use the server for other purposes, but this is not necessary)
  • IP address
  • SSH host key of the server

I ended up registering a new domain with dreamhost.com at $9.95/month. As it happened, DreamHost was running a promotion with unlimited disk space and bandwidth for the lifetime of my account. Score! I did have to email tech support to get the ssh host key. If you find yourself in a similar position, you can ask for the output of:

$ ssh-keygen -l -f ssh_host_rsa_key.pub
2048 0e:c2:f6:f4:d9:86:9d:4b:c4:3d:77:e7:a4:bb:59:14 ssh_host_rsa_key.pub

Ok, great! Now you have a destination for your offsite storage. The next step is to make sure we can securely log in over the network (we'll use ssh for this). On the Mac you want to back up, open a Terminal window and ssh into your server using your username and the server's hostname, as in the following. N.B. Do not finish connecting if the SSH host key you got from your hosting provider does not match the key you see when you try this!

macbook:~ jonm$ ssh jonm@backup.dreamhost.com
The authenticity of host 'backup.dreamhost.com (67.205.39.2)' can't be established.
RSA key fingerprint is 0e:c2:f6:f4:d9:86:9d:4b:c4:3d:77:e7:a4:bb:59:14.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'backup.dreamhost.com,67.205.39.2' (RSA) to the list of known hosts.
jonm@backup.dreamhost.com's password:
[backup]$

Ok, so far so good. Now we need to make sure we can do it without needing a password; this is where user ssh keys come into play. First, let's create an ssh key to use for backups. We'll want to do this as the root user on our Mac, so that when we run the backup script out of cron, we won't run into permissions problems. You can use the "sudo" command to become root on your Mac:

macbook:~ jonm$ sudo su -

WARNING: Improper use of the sudo command could lead to data loss
or the deletion of important system files. Please double-check your
typing when using sudo. Type "man sudo" for more information.

To proceed, enter your password, or type Ctrl-C to abort.

Password: <enter jonm's password on my mac>
macbook:~ root#

Now we need to create an SSH public/private key pair; this is a similar concept to PGP email encryption/signing (you can read a really interesting account of the history behind public key cryptography in the book Crypto by Steven Levy). We'll keep the private key locally on our Mac and copy the public key securely up to our backup server; ssh will then use the private key when we connect, allowing the backup server to verify via the public key that we are who we say we are, without a password ever being sent. Nice.

Specifically, we will want to do the following (still as root):

macbook:~ root# mkdir .ssh
macbook:~ root# chmod 700 .ssh
macbook:~ root# ls -ld .ssh
drwx------  2 root  wheel  68 Sep 19 22:01 .ssh
macbook:~ root# ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (/var/root/.ssh/id_dsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /var/root/.ssh/id_dsa.
Your public key has been saved in /var/root/.ssh/id_dsa.pub.
The key fingerprint is:
fd:47:1d:a6:ac:d0:7d:fb:a5:17:cf:e2:8a:93:a5:30 root@jon-moores-macbook.local
macbook:~ root#

Use an empty passphrase (i.e. just hit return when prompted for the passphrase), as this will allow the ssh program to load the key without interaction from you. Also note, however, that anyone who gets root access to your Mac will be able to ssh into your backup server at will. Given that our backup server contains a copy of what this would-be hacker would be able to see on the actual Mac anyway, I don't really see this being a big risk....
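If that residual risk bothers you, one optional hardening step (not part of this setup, just a thought) is to restrict what the key may do on the backup server by prefixing its line in ~/.ssh/authorized_keys (you'll create that file in a moment) with ssh options, along these lines:

no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-dss AAAA...rest of public key...

rsync needs none of those facilities, so the backups keep working, while the key becomes much less convenient for interactive snooping. (A full lockdown would need a command= restriction, which is fiddlier to get right with rsync.)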

Now, we need to copy the public key over to the backup server:

macbook:~ root# scp .ssh/id_dsa.pub jonm@backup.dreamhost.com:
jonm@backup.dreamhost.com's password:
id_dsa.pub                             100%  619     0.6KB/s   00:00
macbook:~ root#

You'll have to verify the server SSH key one more time, because now you are connecting from root rather than from your normal user account. Now we'll tell the backup host to accept a login from this key pair:

[backup]$ mkdir -p .ssh
[backup]$ chmod 700 .ssh
[backup]$ cat id_dsa.pub >> ~/.ssh/authorized_keys
[backup]$ chmod 600 ~/.ssh/authorized_keys
[backup]$ exit

Now, we should be able to log in without a password from our Mac:

macbook:~ root# ssh jonm@backup.dreamhost.com
[backup]$

Sweet. Now we create a directory where our mirrored filesystems will live:

[backup]$ mkdir mac-backups
[backup]$ chmod 700 mac-backups
[backup]$ exit
macbook:~ root#

The utility we'll use to do the mirroring is rsync, which can be invoked to run securely over ssh. It makes a nice backup tool for regular use, as the rsync protocol is pretty smart about transferring just the small subsets of data that have changed since the last sync; after the first big sync, for most personal file use, there won't be much work to do each night.

For now, let's set up a test directory on our local Mac.

macbook:~ root# mkdir /tmp/back-me-up
macbook:~ root# echo "data" > /tmp/back-me-up/afile.txt

Now, to make the magic happen, we do this:

macbook:~ root# rsync -avz -e ssh /tmp/back-me-up jonm@backup.dreamhost.com:mac-backups
building file list ... done
back-me-up/
back-me-up/afile.txt

sent 116 bytes  received 40 bytes  62.40 bytes/sec
total size is 5  speedup is 0.03
macbook:~ root#

Now we can keep a window open on our backups host, and we should see everything show up there:

[backup]$ ls -lR mac-backups
mac-backups:
total 4
drwxr-xr-x 2 jonm pg1807352 4096 2008-09-19 19:11 back-me-up/

mac-backups/back-me-up:
total 4
-rw-r--r-- 1 jonm pg1807352 5 2008-09-19 19:11 afile.txt
[backup]$

Just for fun, run the same rsync command above and see that nothing happens if there have been no changes (or rather, just that a very small amount of data gets exchanged to verify no changes).
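If you do, you should see no file names listed at all, just the banner and a tiny transfer summary, something like this (exact byte counts will vary):

macbook:~ root# rsync -avz -e ssh /tmp/back-me-up jonm@backup.dreamhost.com:mac-backups
building file list ... done

sent 64 bytes  received 20 bytes  56.00 bytes/sec
total size is 5  speedup is 0.06
macbook:~ root#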

Let's just make sure changes show up:

macbook:~ root# echo "changed-data" > /tmp/back-me-up/afile.txt
macbook:~ root# rsync -avz -e ssh /tmp/back-me-up jonm@backup.dreamhost.com:mac-backups

(other window)

[backup]$ cat mac-backups/back-me-up/afile.txt
changed-data
[backup]$

Ok, looking good. The next step is to identify all the directories you want to back up; let's keep a list of them in a config file on our Mac (type the list, then press Ctrl-D to end the cat command):

macbook:~ root# mkdir -p /usr/local/etc
macbook:~ root# cat - > /usr/local/etc/backups.conf
/Users/jonm/Documents
/tmp/back-me-up
macbook:~ root#

Note that it is important *not* to have trailing slashes on these directory names, as this changes rsync's behavior slightly in a way that you will probably find annoying (it won't copy the directory name over, just the contents).
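To see the difference concretely (careful: don't actually run the second command against your real backup area, unless you want a stray afile.txt sitting in mac-backups):

macbook:~ root# rsync -avz -e ssh /tmp/back-me-up jonm@backup.dreamhost.com:mac-backups
(the directory itself is copied: arrives as mac-backups/back-me-up/afile.txt)
macbook:~ root# rsync -avz -e ssh /tmp/back-me-up/ jonm@backup.dreamhost.com:mac-backups
(only the contents are copied: arrives as mac-backups/afile.txt)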

Ok, now the next step is to set up a script that can sync each of these directories (again, end the cat with Ctrl-D):

macbook:~ root# mkdir -p /usr/local/bin
macbook:~ root# touch /usr/local/bin/do-backups
macbook:~ root# chmod 700 /usr/local/bin/do-backups
macbook:~ root# cat - > /usr/local/bin/do-backups
#!/bin/sh
# mirror each directory listed in the config file up to the backup server
while read dir; do
  rsync -avz -e ssh "$dir" jonm@backup.dreamhost.com:mac-backups
done < /usr/local/etc/backups.conf
macbook:~ root#

Now we run it once by hand to make sure it works:

macbook:~ root# /usr/local/bin/do-backups
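
One caveat worth knowing: as written, rsync never removes files from the server that you have deleted locally, so the mirror only ever grows. If you want deletions to propagate, rsync's --delete option will do it; use it with care, since a local mistake will then be faithfully mirrored to your backup. The script's rsync line would become:

  rsync -avz --delete -e ssh "$dir" jonm@backup.dreamhost.com:mac-backups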

Finally, we install this in root's crontab as follows:

macbook:~ root# crontab -l > /tmp/root.cron
macbook:~ root# cat - >> /tmp/root.cron
# take a backup every day at 3am
0 3 * * * /usr/local/bin/do-backups >/dev/null
macbook:~ root# crontab /tmp/root.cron
macbook:~ root# rm /tmp/root.cron
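
You can double-check that the entry took:

macbook:~ root# crontab -l
# take a backup every day at 3am
0 3 * * * /usr/local/bin/do-backups >/dev/null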

Nice and simple. Now the backups are off and running every night without your intervention.

If you ever need to restore from the backup, you can always reverse the rsync process like this:

macbook:~ root# rsync -avz -e ssh jonm@backup.dreamhost.com:mac-backups/back-me-up /tmp

for each of the directories you have backed up over there.
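
If you ever need to restore everything at once, the same config file can drive the loop. Here's a quick sketch under the same layout assumptions as the backup script (careful: it overwrites local files with the server's copies):

#!/bin/sh
# pull each backed-up directory from the server back to where it came from
while read dir; do
  rsync -avz -e ssh "jonm@backup.dreamhost.com:mac-backups/$(basename "$dir")" "$(dirname "$dir")"
done < /usr/local/etc/backups.conf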

Enjoy, and sleep well tonight....

P.S. Total elapsed time for the exercise was 2 hours from the time I placed the hosting order to the time the crontab was installed, but I took a one hour break in the middle for dessert and bedtime with the kids. So I'll claim this really did only take one hour of "CPU time" for me.

Tuesday, September 2, 2008
Measure your improvements

Metrics are an important part of any development group's toolset. If we want to continually improve our ability to develop software (through a lean engineering kaizen approach, or simply as a learning organization), then we need to have a way to figure out:

  • what parts of our process need improvement?
  • when we make a change, did it help or hurt?

This is where process metrics come into play. I'll start with my definition of a metric, which is a numerical measurement of something. If you can count it, it can be a metric. So "number of outstanding bugs" is a metric, but "software quality" is not. The term "quantitative metric" is redundant, and "qualitative metric" is an oxymoron.

There are generally two types of metrics we can capture:

  1. causal metrics: metrics that have a direct business impact, such as ROI for a feature, unique monthly visitors, or click-through ad rate.
  2. symptomatic metrics: metrics that do not directly affect ROI (although we might believe they do) but are downstream indicators for one or more causal metrics. The number of outstanding bugs in a product, the number of bugs caught in a certain phase of development, the percentage of code covered by unit tests, etc. are all symptomatic metrics.

My general observation, based on reading articles around the use of metrics for improving your processes, is that a lot of metrics-based improvement projects fail to distinguish between these two types of metrics. Partly, I think this is because while the causal metrics properly align your improvement efforts with your business's interests, they are also harder to define and measure. By contrast, a lot of symptomatic metrics are easy to find and measure, but their relationship to the business may be less clear.

For example, consider the balance between software quality and time to market. You can take a longer time when developing a feature or product to reduce the number of bugs that show up at deployment, or you can ship a feature more quickly, knowing that there may be both known and unknown bugs present. In this case, you can measure both number of known bugs at deployment time, and you can measure overall time-to-production for a feature (time from feature conception to deployment).

Now, if you can decrease time-to-production without increasing the bugginess of your code, that's a win. Similarly, if you can reduce bugginess without lengthening your production time, that's a win. However, both of these things will probably require some effort to implement. Another interesting possibility is simply to adjust where you sit on this balance: for example, spend more time looking for and fixing bugs in your QA phase, trading a slower time-to-market for fewer bugs, or vice versa to get to market more quickly, possibly with more bugs. Both of these adjustments are relatively painless to implement, in that no one has to change what they do, just how long they do it for.

So the question is, which of these things ought we to do? My argument is that bugginess and time-to-production, both being symptomatic metrics, don't give us the answer directly. It all depends on our product environment. For example, when producing software to run medical equipment, a company reputation for quality might be more important than shipping new features quickly; or, in a highly innovative internet space, time to market might be king in terms of how much market share you can capture.

It's management's job both to help implement win-win changes and to set the "slider" of the quality/time tradeoff at the right spot. The trick, of course, is that this might be hard to measure directly; there are several symptomatic things we could measure, including:

  • time spent in development
  • time spent in QA
  • time spent in deployment
  • number of outstanding bugs
  • profit for the product at a given point in time

Now really, the profit over some window of time (e.g. for a website, revenue vs. development spend for a given month) is the thing we want to optimize. The interesting idea here is for management to be able to run a series of experiments: if I increase/decrease QA or development time, how does it affect ROI for my product? How does the relative bugginess of a release affect its profitability? For certain "sliders" in the business, it is relatively simple to take a series of measurements to find a current "sweet spot".

An interesting idea here is that sometimes we work through things backwards. For example, we try to estimate "how long will it take to fully regression test a release" or "how long will it take to code up a feature", rather than asking "how buggy will the release be if we test it for X amount of time" or "how much of this functionality can you develop in X amount of time?" In other words, rather than deriving the time-to-market from a set of estimates for all the steps, set the time-to-market by timeboxing those steps, and see what the outcome is. This is a powerful notion of metrics-based management that is hinted at in Scrum's timeboxed iterations, but which I have not seen explicitly suggested anywhere[1]. (Please post references to anything I've missed in the comments section--I'm sure there are plenty.)

At the end of the day, however, it is hard to optimize things we can't measure. I think the important things to track are:

  • the levers we have available to manipulate our process (e.g. timeboxing)
  • causal metrics that affect our business (ROI, product profit)

We need to be aware of which metrics are causal and which are merely symptomatic, so that we are measuring things that directly affect the business. This approach permits empirical management--adjust something you can control, see how it affects your causal metrics, rinse, repeat.


[1] Scrum timeboxes an entire iteration, but does not timebox an individual feature, so a team may spend all its time on one feature or spread its effort across many. The closest thing I've seen here is the notion of the "Small" in INVEST user stories, where stories are limited to a certain amount of complexity. However, the story points in this case are still estimates of the work involved rather than timeboxes around how much time to spend implementing a feature; the "small" requirement is really there to permit more accurate estimation rather than to cap the effort (and although it does secondarily have that effect, I've not seen it stressed in articles on the subject).