Saturday, November 28, 2015
VerySillyMUD: Continuous Integration

This post is part of my "VerySillyMUD" series, chronicling an attempt to refactor an old MUD into working condition[1].

In our last episode, we got an autotools-driven build set up, which gets us going along the path towards being able to use a continuous integration tool like Travis CI. In this episode, we'll see if we can get it the rest of the way there. I'll be referencing the getting started guide, the custom build instructions, and the C project instructions.

It seems like the easiest way to proceed is to try to just use the default C build, which I expect not to work, but then to massage it into shape by looking at the build errors. It seems like this minimal .travis.yml is a good starting point:
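A minimal .travis.yml along the lines the docs suggest (the exact contents here are my sketch, not the original snippet):

```yaml
# Minimal Travis CI config for a C project; the default build runs
# roughly: ./configure && make && make test
language: c
compiler: gcc
```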

As an editorial comment: the autotools toolchain uses "make check" to run tests, but Travis CI's default C build, despite assuming autotools, runs..."make test". I kind of wonder how this happened; my suspicion is that most projects define a make test target (and so that's what Travis CI assumes) but that GNU autotools defined an opinionated standard of make check that ignored existing practice.

Anyway, back to the build. As expected, this failed because there wasn't a configure script to run! This is interesting--I had not checked in any of the files generated by autoconf or automake under the general principle of "don't check in files that get generated". Ok, we should be able to just run autoreconf to get those going:

This one fails too, with the following error from autoreconf:

Hmm, very mysterious. This seems to be due to a missing AC_CHECK_HEADER_STDBOOL macro, and at least one mailing list post suggested it could simply be removed, so let's try that.

Ok, that build got further, and in fact built the main executable, which is great; we just failed when trying to find the Criterion header files. Since we haven't done anything to provide that dependency, this isn't surprising. I also notice that there are a lot more compiler warnings being generated now (this would appear to be based on having a different compiler, gcc vs. clang). Now we just need to decide how to provide this library dependency. The build container is an Ubuntu system, which means it uses apt-get for its package management, but there doesn't seem to be a .deb package for it provided. The options would seem to be:

  • vendor the source code for it into our repository
  • figure out how to build a .deb for it and host it somewhere we can install it via apt-get
  • download a Linux binary distribution of Criterion
  • download the source code on the fly and build it locally

I'm not crazy about vendoring, as that makes it harder to get updates from the upstream project; I'm not crazy about the binary download either, as that may or may not work in a particular environment. My preference would be to build a .deb, although I haven't done that before. I assume it would be similar to building RPMs, which I have done. Downloading and building from source is perhaps a good initial method, as I know I could get that working quickly (I have built the Criterion library from source before). If I ever get tired of waiting for it, I can always revisit and do the .deb.

According to the Travis CI docs, the usual way to install dependencies is to customize the install step with a shell script. We'll try this one to start, based on the instructions for installing Criterion from source:
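A sketch of that install step, following Criterion's build-from-source instructions (the exact commands and paths are assumptions):

```yaml
# .travis.yml -- build and install Criterion from source before our own build
install:
  - git clone --recursive https://github.com/Snaipe/Criterion.git
  - mkdir Criterion/build
  - (cd Criterion/build && cmake .. && make)
  - (cd Criterion/build && sudo make install)
```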

Ok, this does pretty well as far as starting some of the compilation process, but it still fails with this error:

Hmm, seems to be looking for some copyright information; we'll try copying Criterion's LICENSE file into the debian.copyright file it seems to be looking for. Ok, that build succeeded in building and installing the library, and in fact the later ./configure found it, but then wasn't able to load the shared library. We need to add an invocation of ldconfig after installing the library, I think. Wow, that did it! We have a passing build!

Let's record our success by linking to the build status badge from our README, so that people visiting the repo on GitHub know we have our act together!

Now, while debugging this build process, I got several notices that the project was running on Travis CI's legacy infrastructure instead of their newer container-based one, which purports to have faster build start times and more resources, according to their migration guide. It seems like for us the main restriction is not being able to use sudo; we use this in exactly three places at the moment:

  1. to install the check package via apt-get
  2. to run make install after building the Criterion library
  3. to run ldconfig after installing the Criterion library

It seems like there are built-in ways to install standard packages via apt, so then the question is whether we can directly compile and link against the Criterion library from the subdirectory where we downloaded and built it; if we can then we don't need sudo for that either. Ok, it looks like we just need to find the include files in the right place, by adding an appropriate -I flag to CFLAGS and then to find the built shared library by pointing the LD_LIBRARY_PATH environment variable to the right place. Nope. Nope. Nope. Nope. Ok, this answer on StackOverflow suggests we need to pass arguments through to the linker via -W options in LDFLAGS. Still nope. Maybe if we pass the Criterion build directory to the compiler via -L and to the runtime linker via -R? Bingo!
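Concretely, the winning combination looked something like this (the paths are assumptions, with the Criterion checkout built in a Criterion/build subdirectory):

```shell
# sketch: tell the compiler where the headers are (-I), the link-time
# linker where the library is (-L), and the runtime linker where to find
# it later (-R, passed through with -Wl,)
./configure CFLAGS="-I$PWD/Criterion/include" \
            LDFLAGS="-L$PWD/Criterion/build -Wl,-R$PWD/Criterion/build"
```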

Ok, now we just need to see if we can install the check submodule via the apt configuration directives in .travis.yml. That works, and here's the final working CI configuration:
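A reconstruction of that final configuration (the clone URL and paths are assumptions; the shape follows the steps described above):

```yaml
language: c
compiler: gcc
addons:
  apt:
    packages:
      - check            # installed via the apt addon, no sudo needed
install:
  - git clone --recursive https://github.com/Snaipe/Criterion.git
  - mkdir Criterion/build
  - (cd Criterion/build && cmake .. && make)
script:
  - autoreconf --install
  - ./configure CFLAGS="-I$PWD/Criterion/include"
      LDFLAGS="-L$PWD/Criterion/build -Wl,-R$PWD/Criterion/build"
  - make
  - make check
```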

It seems that by setting CFLAGS explicitly here, we "fixed" the compiler warnings; I suspect we need to come back around and add -Wall to CFLAGS and then fix all those warnings too. But that can wait for the next episode...

[1] SillyMUD was a derivative of DikuMUD, which was originally created by Sebastian Hammer, Michael Seifert, Hans Henrik Stærfeldt, Tom Madsen, and Katja Nyboe.

Friday, November 20, 2015
VerySillyMUD: Adding autotools

This post is part of my "VerySillyMUD" series, chronicling an attempt to refactor an old MUD into working condition[1].

In our last episode, we got all the remaining compiler warnings fixed. While I was merging the pull request for it, I noticed GitHub was prompting me to get continuous integration set up. I had run across Travis CI before, and knew that it was free to use for open source projects. A quick peek around their documentation shows that they support C, but that they assume the use of GNU autotools for building. Since a friend had already identified weirdness from the runs of makedepend I had done on my own computer and checked in, I actually already had an issue open for this. Seems like the universe is trying to tell me something!

Conveniently, autotools comes with a sample project and a pretty good step-by-step tutorial. We also have a working Makefile that we can use for reference--for now I'm just going to make a temporary copy of it as Makefile.orig so that I can have it for easy reference, and then clean it up later during commit/PR structuring. Since automake is going to overwrite the Makefile, this will be convenient, even though I know a copy of the Makefile is safely tucked away in version control. Ok, let's start with the toplevel Makefile.am, which for now just has to point to the src/ directory:
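That toplevel file can be as small as a single line; a sketch:

```makefile
# toplevel Makefile.am: just recurse into the source directory
SUBDIRS = src
```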

Then we need another Makefile.am for the src/ directory. In this case, it looks like the bare minimum is to identify the final executable name and then list the .c source files. Not sure if we need to add the .h ones or not yet; it could be that autoconf will find those later. Anyway, let's try this:
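A minimal sketch of that src/ file (source list abbreviated here; the full list appears later in this post):

```makefile
# src/Makefile.am: name the program and its sources
bin_PROGRAMS = dmserver
dmserver_SOURCES = comm.c db.c handler.c interpreter.c utility.c main.c
# ...plus the rest of the .c files
```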

As for the configure.ac in the source directory, we can adapt the sample one from the tutorial and try this:
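A sketch of that first configure.ac, adapted from the tutorial's sample (the package name, version, and details are assumptions):

```m4
AC_INIT([sillymud], [1.0])
AM_INIT_AUTOMAKE
AC_PROG_CC
AC_CONFIG_HEADERS([config.h])
AC_CONFIG_FILES([Makefile src/Makefile])
AC_OUTPUT
```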

Now, per the instructions, we're supposed to run "autoreconf --install":

Hmm, I had thought that all the CFLAGS would go in the AM_INIT_AUTOMAKE macro, but I guess not. Let's just put -Werror in there for now and try again:

Ok, this is much closer. Looks like we just have some missing files. For now, I'll create empty versions of NEWS, README, AUTHORS, and ChangeLog, and remember to create an issue to fill those in. As for COPYING, that's traditionally the license file, so we'll just make a copy of doc/license.doc and use that. Now when we run `autoreconf --install` it completes successfully! Ok, let's try running ./configure:

Wow, that worked. Ok, let's try doing a make:

Ah, failure. A quick look at the output shows we're missing some CFLAGS; this might be the source of the compilation error, since one of the compilation flags was -DGROUP_NAMES and that might be what the gname field is for. A quick look at the declaration of struct char_special_data in structs.h confirms this is the case. Ok, so we just need to figure out how to get the CFLAGS declared properly. According to this answer on StackOverflow, it seems we can just add them in the Makefile.am file.


In the process of looking for advice on adding the CFLAGS, I ran across a description of the autoscan tool, which will scan your source code and suggest missing macro invocations for your configure.ac. A quick run shows that we're mostly missing detection of the header files we include and the library functions we call, so we'll just add that in too:

AC_CHECK_HEADERS([arpa/inet.h fcntl.h malloc.h netdb.h netinet/in.h
  stdlib.h string.h strings.h sys/socket.h sys/time.h unistd.h])


AC_CHECK_FUNCS([bzero gethostbyaddr gethostbyname gethostname
  getpagesize gettimeofday inet_ntoa isascii memset select socket
  strchr strdup strrchr strstr])

It builds without errors! Now, there are still a few things missing here, like actually using the defined macros in the config.h that the configure script generates; we also haven't gotten the tests running yet, or looked at what "make install" wants to do. So let's get started with the tests. The first thing we're going to want to do is pull the common source code files out into their own variable:

common_sources = comm.c act.comm.c act.move.c act.obj1.c \
 act.obj2.c act.other.c act.wizard.c handler.c \
 db.c interpreter.c utility.c spec_assign.c shop.c limits.c mobact.c \
 fight.c modify.c weather.c spells1.c spells2.c spell_parser.c \
 reception.c constants.c spec_procs.c signals.c board.c magic.c \
 magic2.c skills.c Opinion.c Trap.c magicutils.c multiclass.c hash.c \
 Sound.c Heap.c spec_procs2.c magic3.c security.c spec_procs3.c \
 create.c bsd.c parser.c intrinsics.c
dmserver_SOURCES = $(common_sources) main.c

At this point, while perusing the automake manual to figure out how to do tests, I discovered there was a better way to define the symbols than adding them to CFLAGS in Makefile.am: there's an automake variable for this called AM_CFLAGS, so we just move the flags over to that instead. But in the meantime, the next step towards tests would be to correctly find the header files and library for Criterion in the configure script, so that the generated Makefile looks for them in the right place. We can do this by adding the following to configure.ac:

  echo "***WARNING***: unit tests will not be runnable without Criterion library"
  echo "  See"
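In full, the check presumably looked something like this, with the echoed warnings as the action-if-not-found (the probe symbol is an assumption; leaving the action-if-found blank gives AC_CHECK_LIB's default behavior):

```m4
AC_CHECK_HEADERS([criterion/criterion.h], [],
  [echo "***WARNING***: unit tests will not be runnable without Criterion library"])
AC_CHECK_LIB([criterion], [main], [],
  [echo "***WARNING***: unit tests will not be runnable without Criterion library"])
```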

After a little more perusing through the autotools manual, it turns out instead of the echo command there's a canonical way to do this using the AC_MSG_WARN macro, as in:

  AC_MSG_WARN(unit tests will not be runnable without Criterion library)

Now, when we run make, we find the Criterion library, but the final dmserver executable gets linked with -lcriterion, which we don't want because, as you may recall, that library has a main() function in it that is going to try to run test suites; so we don't actually want the default action of AC_CHECK_LIB. Instead, we need to fake it out:

  AC_MSG_WARN(unit tests will not be runnable without Criterion library)
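The trick, sketched: supply an explicit (here, no-op) action-if-found so AC_CHECK_LIB skips its default behavior of prepending -lcriterion to LIBS for every executable:

```m4
AC_CHECK_LIB([criterion], [main],
  [:],
  [AC_MSG_WARN(unit tests will not be runnable without Criterion library)])
```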

And now we can go ahead and build the unit test program and indicate that's what should run during "make check" by adding the following to src/Makefile.am:

# unit tests
check_PROGRAMS = tests
tests_SOURCES = $(common_sources) test.act.wizard.c
tests_LDADD = -lcriterion
TESTS = tests

Sure enough, we can run the tests:

Nice. Ok, now these hardcoded AM_CFLAGS are still bothering me. Really, we ought to be able to opt into and out of them via feature enabling/disabling from configure. My friend Dan would probably, at this point, say "Why?" in incredulity, but this is not an exercise in practicality, per se... The way we do that is to add these flags to configure.ac, which will cause configure to output them into config.h. We can do that with stanzas like the following:

        [Define as 1 to restrict each level 58+ god to one site or set of
    [turn off site-locking for gods])],
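Filling in the gaps around the fragments above with guesses (the option name, shell variable wiring, and SITELOCK symbol are all hypothetical), a complete stanza of this kind reads:

```m4
AC_ARG_ENABLE([site-locking],
  [AS_HELP_STRING([--disable-site-locking],
    [turn off site-locking for gods])])
AS_IF([test "x$enable_site_locking" != "xno"],
  [AC_DEFINE([SITELOCK], [1],
    [Define as 1 to restrict each level 58+ god to one site or set of sites])])
```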

Ok, then we just need to go around and include config.h in all the .c and .h files, and then we can remove the AM_CFLAGS from Makefile.am. Cleaner! At this point, the last thing to do is to get make install to work. It turns out the default action for the MUD server itself is the right one, but we also need to collect the data files and install them. This can be done by creating Makefile.ams in various subdirectories. For example, here's the one for the top-level lib/ directory:
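A sketch of what such a lib/Makefile.am looks like (the install directory name and the data file list are assumptions):

```makefile
# install the game data under $(localstatedir)/sillymud
sillymuddir = $(localstatedir)/sillymud
dist_sillymud_DATA = tinyworld.wld tinyworld.mob tinyworld.obj
```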

Then the last thing is to make the compiled server look there by default. The configure script takes a --localstatedir argument to customize where those data directories get created; we really want the game to have that path compiled into it as a default. After much noodling through StackOverflow and the automake manuals, it looks like the best way to do this is a multi-step process, in order to get the right amount of variable expansion. First, we have to bring our friend AM_CFLAGS back and pass the Makefile-level variable $(localstatedir) in as a symbol definition:

AM_CFLAGS = -DLOCAL_STATE_DIR=\"$(localstatedir)\"

Then, we can add the following to configure.ac:

  [Default location to look for game data files])
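Given the fragment above and the config.h output described next, the stanza was presumably along these lines:

```m4
AC_DEFINE([DEFAULT_LIBDIR], [LOCAL_STATE_DIR "/sillymud"],
  [Default location to look for game data files])
```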

...which results in the following showing up in config.h:

/* Default location to look for game data files */
#define DEFAULT_LIBDIR LOCAL_STATE_DIR "/sillymud"

This makes whatever was passed to the --localstatedir argument of configure (which defaults to /usr/local/var) show up as the string literal LOCAL_STATE_DIR. In C, two adjacent string literals get concatenated, so this results in DEFAULT_LIBDIR being something like /usr/local/var/sillymud. At this point, we're able to run make install and run /usr/local/bin/dmserver. I think for our next episode, it's time to do a little play testing to see how well things are working and what else needs fixing!

[1] SillyMUD was a derivative of DikuMUD, which was originally created by Sebastian Hammer, Michael Seifert, Hans Henrik Stærfeldt, Tom Madsen, and Katja Nyboe.

Saturday, November 14, 2015
VerySillyMUD: Fixing the Rest of the Compiler Warnings

This post is part of my "VerySillyMUD" series, chronicling an attempt to refactor an old MUD into working condition[1].

In our last episode, we used unit tests to document a function's behavior so we could refactor it to fix a security warning. Now we are able to pick up where we left off three episodes ago: fixing compiler warnings to find any gotchas. To get started, let's re-enable the -Werror compiler flag. Now, back to fixing warnings. I'll just highlight the ones here that are novel cases we haven't seen before. First up:

Ok, let's get some context; here's the function where that warning/error appears:

So we see that d is a struct descriptor_data *. A little grepping around shows that's defined in structs.h:

Sure enough, we can see that d->host is a statically-allocated array. Now, this seems like a recipe for a buffer overrun to me, but one thing at a time. We'll maybe find or fix that later by running lint or valgrind. For now, though, I'll create an issue so we don't forget, then we can just fix this to a simpler test and move on.

Well, it turns out that was the most interesting warning/error to highlight. In almost all cases it was pretty clear how to fix the warning, and often the compiler's error message told me exactly how to fix it! Since there were a LOT of changes to get here, we'll spare you the details. At this point, the game starts without most of the warnings we used to see. Hurrah! Now the only question is what to do for the next episode...

[1] SillyMUD was a derivative of DikuMUD, which was originally created by Sebastian Hammer, Michael Seifert, Hans Henrik Stærfeldt, Tom Madsen, and Katja Nyboe.

Sunday, November 8, 2015
VerySillyMUD: Understanding a Function with Tests

This post is part of my "VerySillyMUD" series, chronicling an attempt to refactor an old MUD into working condition[1].

In our last episode, we got a basic framework for unit tests in place. Just to refresh our context (and which layer of yak we are shaving at the moment): we need unit tests so we can (a) capture the current behavior of a particular function, dsearch(), so that we can (b) refactor it safely, so that we can (c) fix a compiler warning about it, so that we can (d) finish fixing all the compiler warnings in case there are other latent bugs being hidden. Whew!

So let's take a look again at the function in question:

What can we figure out about it? Well, first, it seems to take two string arguments, helpfully named "string" and "tmp". It seems like both are potentially modified by the function here; tmp gets written on line 11 via strcpy(), and string gets written on line 24 via sprintf(). The i variable seems to be a flag indicating whether we're done, as the main while loop tests for it and the only modification to it is in line 10 where it is set to 1. So let's start with the simplest case, where we only go through the loop once; in order for that to be true, we need the call to strchr in line 9 to find no instances of the tilde ('~') character in string. In this case, it looks like we just copy string into tmp and are done, which suggests that our first unit test can be:

So let's try to run that:

Now, what about a more complicated test? It seems like tmp might be the output variable here, so let's see what happens when we feed it a string with a tilde in it, by writing an intentionally failing test:

And then run the test:

Hmm, a little less than I was hoping for; the Criterion library macro-interpolates the C source expression into the error message. In the case of string comparison, I'd rather have seen the string values here instead. At this juncture I took a quick browse of the Criterion source code to see if this was a quick fix, but it wasn't, so I decided to pass on this for now and just fix it by doing a good ol' printf to see what the value was. In this case, it looks like the behavior is to strip out the tilde and the following space, because this test passes:

A quick perusal of the source shows we must be in the default: clause of the switch statement. Now, like any good tester, let's see what happens if we have a tilde as the last character:

Ah, cool! That one passes. So now let's look at some of the other cases, starting with a "~N". I'm going to guess that this substitutes the string "$n;" in for "~N":

Indeed, this passes. Similarly, it looks like "~H" will turn into "$s". (It does, although I won't show that test here just for brevity). Ok, what should we test next? How about two tildes in a row? I think that this should just get stripped out because the second tilde won't be either 'N' or 'H':

Yep. Ok, now we have a loop, so this suggests we should be able to do multiple substitutions:

Check. That would seem to cover most of the output behavior on the second argument to dsearch(), which is labeled tmp. As we noted, though, the first parameter, string, sometimes also gets written, in the sprintf on line 24. So we had better document its behavior too. However, notice that sprintf(string, tmp) essentially copies tmp into string, as long as there are no format expansions like %d in tmp, and if there are, then we don't have any arguments for them! So this is likely a bug, especially if string comes from an untrusted input source. Now, if you remember a couple episodes ago, this was the exact compiler warning we ran into:

Ok, so this means that we should pretty much just add tests that show that string has the same output behavior as tmp, and then I think we can refactor it.

At this point I feel pretty confident I understand what this function does. Let's rewrite it more simply. What we'll do is we'll build up the substituted string in tmp and then copy it back into string at the end:

Sure enough, the tests pass, so we'll call this a victory for now. In the next episode, we'll turn the -Werror flag back on and continue fixing compiler warnings as we were trying to do previously. Fortunately, we now have the beginnings of some unit tests we can use if we encounter anything that doesn't look like it has an immediately obvious fix.

[1] SillyMUD was a derivative of DikuMUD, which was originally created by Sebastian Hammer, Michael Seifert, Hans Henrik Stærfeldt, Tom Madsen, and Katja Nyboe.

Monday, October 19, 2015
VerySillyMUD: Unit Tests

This post is part of my "VerySillyMUD" series, chronicling an attempt to refactor an old MUD into working condition[1].

In our last episode, we encountered a function we needed to refactor:

Only trouble is: we don't have unit tests in place, and I don't really have any idea what this function is supposed to be doing. Looks like we have to get some tests! Now, I am used to the wonderful JUnit family of unit testing frameworks, but it's unclear which C unit testing framework to use. I decided to opt for the Criterion library, as it was xUnit-styled and seemed pretty straightforward.

The first step is to figure out how to run a basic test. In the short term, I'll have to disable the -Werror flag that treats compiler warnings as errors; in order to write and run unit tests against the current code, I'm going to need that code to compile first! Recall that we've cleared the true compilation errors already, so this should work.

Now, the way Criterion unit tests work is that you compile your code under test along with your unit tests and then link in the Criterion library into a single executable; when you run that executable it runs your unit tests. So let's try to get a basic unit test written against the dsearch function:

There are a few things to note here: first, I explicitly included the prototype for dsearch. The project currently puts all the function prototypes into a single protos.h file and includes that everywhere, but I ran into some conflicts trying to do that here. At some point once I'm in a cleaner project it will be worth going back through to move each source file's prototypes out into a separate .h file so that they can be included exactly where appropriate, and so that incremental changes don't require rebuilding everything (right now, if there is a change to protos.h we have to recompile everything). Second, Criterion's Test macro takes two arguments; the first is the name of a test suite and the second is the name for the test. I used the name of the C source file act.wizard.c as the basis for the suite name and just chose an initial test name for now. I will probably go back and rename it to something that reflects the property this test is checking once I understand dsearch a little better.

Now, let's get a make test target implemented so that it's easy for us to run the unit tests. My initial attempt at creating a test executable tried to just link act.wizard.o (the code under test) with test.act.wizard.o and -lcriterion, but it turns out that the act.wizard.c code refers to external symbols in other source code files, so linking failed. Rather than sort out exactly which object files I need, I decided to just link them all in together into one fat unit test executable. Unfortunately, -lcriterion contains a main() function in it (so it can run the test suite), so the rest of the linked object code needs not to have one in it. Right now, comm.c has the main function for the MUD, so first what we'll do is rename that function and then create a main.c file that has a main for the MUD that calls the one in comm.c; then we simply won't include main.o in the test executable.

Next, we can set up some new Makefile variables for TEST_SRCS and TEST_OBJS and then create two targets: one to build the test executable tests and one to actually run it. Finally, we need to run makedepend to update all the dependencies. I note that when I did this I got a lot more detailed dependencies than were in the Makefile originally. The way to do this nowadays would be through automake and autoconf, but I won't tackle that right now; I'll just create an issue on the GitHub repo.

Wow, ok. Running unit tests! The next step will actually involve diving into the guts of the dsearch function to figure out what it currently does and to document its behavior with tests, which we'll do in the next episode.

[1] SillyMUD was a derivative of DikuMUD, which was originally created by Sebastian Hammer, Michael Seifert, Hans Henrik Stærfeldt, Tom Madsen, and Katja Nyboe.

Sunday, October 11, 2015
VerySillyMUD: Cleaning Up the Warnings

This post is part of my "VerySillyMUD" series, chronicling an attempt to refactor an old MUD into working condition[1].

In our last episode, we discovered a bug that the compiler had actually warned us about, but which we ignored because we were trying to just fix errors and get on with running the game. Lesson learned here; time to go clean up all those warnings. As there are a lot of them, and they are likely going to be somewhat repetitive, I will just highlight the particularly interesting ones here. However, you can see the full set of cleanups here.

The first thing we want to do is to recompile all the object files to see what the warnings are. I was about to make clean except that the Makefile doesn't seem to have a "clean" target! Let's fix that. Then, let's make sure we don't make this mistake again by adding the -Werror flag to the CFLAGS variable to treat compiler warnings as errors. Now, the first type of error we encounter is:

Basically, we have not explicitly declared a prototype for the atoi library function, nor have we #included one. This is simply a matter of adding the right system include file in the case of C library functions. Next:

When you are not explicit about your braces, and you have nested if statements, you can have unintentional paths through your code when an else gets associated with the wrong if. We can correct these by adding explicit braces, taking indentation in the source code as a hint as to what was probably intended. Next we find that adding in some missing #include files created some conflicts:

A global variable reboot is used to track whether the game is supposed to restart itself after an intentional, non-crash shutdown. This conflicts with the reboot system call that actually reboots the whole machine! We can handle this by renaming the variable, taking care to find all locations of it in other files.

We also fixed some cases where functions were declared with a non-void return type yet there was no explicit return statement. As we saw in a previous episode, we can fix these by changing the declaration to return void if no callers ever check for a return value. Our next error involves the type of one of the arguments passed to the bind() system call:

A glance at the code shows that sa is declared as a struct sockaddr_in (an Internet socket address), but bind wants a pointer to a struct sockaddr. Here's one of those cases where C expects you to cast one struct to another with the same prefix, for as much confusion as that might cause. This is a standard C TCP/IP socket programming idiom, however. The next error complains of an implicit function definition, but is a little more involved:

With a little judicious grepping, we find that CAN_SEE_OBJ is declared in utility.c, where we see:

Ugh. This apparently was a macro at one point but was redefined as a function, although someone conveniently left the old macro definition in here for us, but compiled out with "#if 0". At any rate, we can clean that out and then add the missing prototype. For the next error, the compiler helpfully gives us options about how to fix it:

We can fix this by explicitly comparing the value of the assignment to a null character ('\0') instead. I also encountered some type errors showing up mismatches between a format string and an argument:

Again, the compiler happily provided a useful suggestion in these cases to update the format string. I think this is an area where the C compilers have improved since I was doing a lot of C programming; I'm not sure they would always catch these for me in the past. But I might also be misremembering!

Next, we have a couple spots where the developers wanted to be extra sure about order of operations:

In this case, a quick perusal of the code suggests that this was indeed meant to be a comparison and not an assignment, so we can just get rid of the extra parens.

Now, at one point I ran into some type errors surrounding function pointers:

Ugh, function pointers. I can never get the typing right on these and always have to look up how to do it. In this case, it looks like the programmers just got lazy, calling all the function pointers void (*)(), but we can propagate the correct types through the code instead.

And then I ran into this error:

Hmm, taking a look at the code shows:

Ok, I can see the problem, which is essentially some kind of nested sprintf calling. This needs to be refactored, but there's no way this function is clear enough to me to refactor it without unit tests. Guess we'll have to get that going in the next post.

[1] SillyMUD was a derivative of DikuMUD, which was originally created by Sebastian Hammer, Michael Seifert, Hans Henrik Stærfeldt, Tom Madsen, and Katja Nyboe.

Saturday, October 10, 2015
VerySillyMUD: A Lesson About Compiler Warnings

This post is part of my "VerySillyMUD" series, chronicling an attempt to refactor an old MUD into working condition[1].

In our last post, we managed to get the MUD executable to build, and attempted to run it, with somewhat lackluster results:

Fortunately, we get a file and line number with this error, so let's take a look:

Hmm, we seem to not be able to change into whatever directory is represented by dir. Scrolling up through the source we see that this can be set via a command-line argument:

...but we didn't provide one, so it must have a default value. A little further up and we see it gets set to DFLT_DIR; a quick grep gives us:

Aha! It's looking for the "lib" directory, which doesn't exist in the src directory but does exist as a top-level directory in the project. Let's try pointing it there:

Great! There are a lot of error messages in here about "unknown auxiliary codes" but the MUD seems to have started. Let's try to connect as a client:

Ok! Let's proceed and create a character to play:

A quick look over at the terminal where the server was running shows this in the log:

Oh, joy. Looks like we'll have to run this in a debugger to see where it's barfing.

Uh, this is segfaulting in a C library routine? What's going on here? The source code line where this is happening is:

A little more poking around in the debugger shows that all the pointers here (d->pwd, arg, and d->character-> are all well-formatted, null-terminated strings. What's going on?

I will spare you, good reader, quite a bit of head scratching and Googling, but here's ultimately what happened: there is no function prototype for crypt() defined or included in this source code file, so the C compiler assumes it returns an int, which is a 32-bit quantity. In actuality, crypt returns a char *, which on a 64-bit machine like mine is a 64-bit value; the return value gets truncated, and when that truncated value is in turn passed to strncpy, which expects a pointer, the pointer passed in is essentially garbage. Now that I've figured out what the issue is, I would not be surprised if we had a compiler warning about this. Sure enough:

Remember when I said previously that I was generally a fan of "treat compiler warnings as errors"? Looks like I should have taken my own advice. In the next episode, I'll go ahead and clean all of those up.

[1] SillyMUD was a derivative of DikuMUD, which was originally created by Sebastian Hammer, Michael Seifert, Hans Henrik Stærfeldt, Tom Madsen, and Katja Nyboe.

Saturday, October 3, 2015
VerySillyMUD: Remaining compilation errors

This post is part of my "VerySillyMUD" series, chronicling an attempt to refactor an old MUD into working condition[1].

In this edition, we'll try to fix all the remaining compilation errors. The next error we have is in handler.c:

As before, we'll fix this by converting it into a comment. The code clearly isn't used, but this is an unfamiliar codebase so I don't want to throw code away (yet). Easy enough. Next error:

Now this is truly frightening. Someone is calling malloc with the wrong number of arguments; this is almost certainly the source of some kind of nefarious memory bug. What's going on here? A glance at the source code enlightens us:

The associated realloc() call, which takes a pointer to the old allocation (script_data[top_of_scripts].script), makes the parallel structure of the code clear. That pointer is not needed for the initial malloc() call, so we can remove it. Next up we have:

Ok, that's a straightforward error, but how do we fix it? Let's look at this function:

The error occurs at the first return statement. This function appears to reference special procedures; as I recall, certain areas or creatures ("mobs") in the MUD had unusual behaviors or triggers. The rest of the function body looks for special procedures and fires them, returning 1 if one was found. Since the first return statement doesn't fire any, returning 0 there is probably the correct value. Next error:

Here the code is using a deprecated header file (malloc.h) and we can just include stdlib.h instead. Fixing that let the compilation of utility.c get further:

Ah, our good friend "#ifdef 0" again. Let's turn you into a comment instead. Next:

Hmm, we'll have to take a look at this shop_keeper function, which presumably operates the many places in the MUD where your character could buy items.

Ok, what seems to be happening here is that the function returns TRUE when the shopkeeper has performed some special behavior, and FALSE otherwise. The three return statements causing the errors are the early exits at the top of the function; returning FALSE there is a reasonable guess, but we have no way to be sure without a much closer understanding of the codebase, which we're not going to attempt to get yet. Next error:

Another missing return value; we'll guess we should supply another zero here for this early-exit case.

More missing return values! This time, it turns out all the return statements in the DamageMessages() function don't have return values; this leads me to suspect the return type of this function is wrong. A quick grep confirms that none of the callers capture or inspect a return value, so we'll just change the return type to void. Next:

As with the shop_keeper() function we saw above, the missing return value is on an early exit from the function, which manages characters interacting with the bank and returns TRUE if such an interaction occurs. The early exit happens if IS_NPC() is true for the character. An NPC is a non-player character, i.e. one played by the program and not by a human; presumably NPCs can't interact with the bank, so it seems returning FALSE in this case would be right. Next:

We saw log_msg() in our previous post. It's defined this way:

Note the commentary left by a developer on this one-liner, which is near the top of its file and hence looked like a function prototype. At any rate, we see that log_msg() is really shorthand for calling a different logging function, log_sev(), with a priority of 1. It seems like the error in board.c was either meant to call log_sev() directly or just accidentally gained an extra argument:

Since the other error message here just invokes log_msg() for what would seem to be an equivalently serious error, we'll change the problematic invocation to match. Next:

More missing return values. We'll default them to returning FALSE. Next:

If we supply one more FALSE then the whole thing compiles into an executable called dmserver. Let's try to run it!

Well, not so hot, but we'll leave fixing it to the next episode.

[1] SillyMUD was a derivative of DikuMUD, which was originally created by Sebastian Hammer, Michael Seifert, Hans Henrik Stærfeldt, Tom Madsen, and Katja Nyboe.

VerySillyMUD: Compilation

This post is part of my "VerySillyMUD" series, chronicling an attempt to refactor an old MUD into working condition[1].

The first goal is just to get the thing to compile. The source code is early 1990s C, and I have this C compiler on my computer at home:

The compiler doesn't quite like the source code here; it generates 21 warnings and 20 errors on the first file when we run make! I haven't included the full output here for readability's sake, but you can peruse the full gist if you like. Our goal here is to get compilation to succeed while making only minor edits that we think are safe and do not change intended functionality.

The first error is here:

The error in question points out a conflict between the prototype included here and the system-level include file string.h. A quick grep of the source code shows that strdup() is not implemented in the source code here, so this should just be a matter of removing those prototypes wherever we find them and picking up the declaration from the system-level header instead, as seen in this commit. The next error/warning is:

If we look at the source code in question, we see:

This could either be a mistake in the function definition (i.e. it should return void instead of int) or this is an oversight and the function ought to be returning some kind of status code as an integer. A quick grep shows this is actually only called from one place (elsewhere in comm.c) where the return value is not captured or inspected. Therefore we can take care of this in the short term by correcting the function declaration, as captured in this commit. What's next?

Ok, from looking at the error message, the compiler seems to be confusing a locally-defined log() function that outputs some kind of logging message with the one from the math.h library for computing logarithms. The easiest way to fix this is to rename the local function for clarity to something like log_msg(), as captured in this commit. Ah, rats, now we see that this function didn't actually have a prototype anywhere, or at least not in enough places:

Adding the prototype to the protos.h include file seems to fix that up. Now, I just noticed at this point the compilation of comm.c still generates 25 warnings but only 2 errors! It turns out the very first fix we did for strdup cleaned most of them up. As I am running out of time for this session, let's press on to try to clean up the last two errors and at least get one file compiling. I'm generally a proponent of the "treat compiler warnings as errors" rule of thumb, but I think we're not quite ready to add that rule. I'd really like to get to the point where we can get some unit tests in place before doing too many more modifications. The first remaining error is:

Hmm, this is a new one for me. The bind() function connects a socket to a particular address so that we can start accepting connections from clients there. The system library's bind() on Mac OS X only takes 3 arguments, as does Linux's. A look at the source code shows there's an extra 0 argument being passed. Maybe this was for an older version of bind? According to running man 2 bind on my Mac OS X computer, in the HISTORY section it says "The bind() function call appeared in 4.2BSD." It turns out if we actually look at the man page for bind() from 4.2BSD it also only takes 3 arguments! What the heck? Without really knowing what the compilation target was for the codebase, it's hard to say what to do here. I'm going to assume one of the compilation targets had a bind() call that maybe took some optional flags as a fourth argument, but the source code didn't opt into any of them (hence the zero). According to this, extra arguments to functions were probably safely ignored, so I think we can just get rid of the extra argument here and see if that works. It'll certainly clean up the error at least. Ok, the last error, then is:

I think this dates me as a C programmer, because I thought that was still a valid way to #ifdef-out a section of code. I'll just comment the code out instead. Update: when I mentioned this on Twitter, Steve Vinoski pointed out:

This was interesting because Steve was right in that I had forgotten the idiom I was used to (#if 0). But it was also interesting that this code, which apparently worked at one point, used this incorrect idiom (#ifdef 0). A little experiment shows that this is not valid in C89 ("ANSI C"):

If I attempt to compile this using C89 compatibility, I get:

So this was written for a compiler with non-ANSI C support. It doesn't even appear to be valid K&R C, according to the list of gcc incompatibilities between it and ANSI C (the version of gcc I have installed is based on the clang compiler, which doesn't support the -traditional flag, and I'm not motivated to go install an older version to try it out). There are hints in the source code that this was targeted towards Sun computers; perhaps their early-1990s-vintage proprietary system C compilers accepted this syntax. Anyway, it's an easy enough fix, so let's not dwell further on it.

(Partial) success! At this point, the make run actually builds 10 object files (including comm.o) before erroring out (full gist). Next time out we'll continue fixing compilation errors in the hopes of getting at least a clean compilation run.

[1] SillyMUD was a derivative of DikuMUD, which was originally created by Sebastian Hammer, Michael Seifert, Hans Henrik Stærfeldt, Tom Madsen, and Katja Nyboe.

VerySillyMUD: An Exercise In Refactoring: Intro

When I was in college, I played a MUD called SillyMUD, which was in turn a derivative of DikuMud, which was written by Sebastian Hammer, Michael Seifert, Hans Henrik Stærfeldt, Tom Madsen, and Katja Nyboe. I happened to mention it to one of my kids, who expressed interest in seeing/playing it.

I found an old source code archive of it, and it turns out it was distributed under an open source-style license. Unfortunately, it doesn't even compile as-is, and as I recall there were several scalability and memory bugs present in the codebase even when I played it.

Therefore, I decided that I would undertake gradually refactoring the thing into shape and perhaps modernizing some of the technology choices in the process, while chronicling it here on the blog to capture it for educational purposes as an example.

The posts in this series so far are:

  1. Compilation
  2. Remaining Compilation Errors
  3. A Lesson About Compiler Warnings
  4. Cleaning Up the Warnings
  5. Unit Tests
  6. Understanding a Function With Tests
  7. Fixing the Rest of the Compiler Warnings
  8. Adding autotools
  9. Continuous Integration

More to come!

Tuesday, July 14, 2015
Notes from QCon New York 2015

I had the opportunity to speak at--and hence attend--QCon New York last month and thought I would share some highlights and notes from some of the talks I was able to see. Overall, this remains a high-quality conference, with multiple sessions where I had to make tough choices about which talk to attend. I was not able to attend talks in every session due to responsibilities from the "home front" at work, and did not always take notes (sometimes I just listened and tweeted particular points). I would definitely consider both attending and speaking at QCon again.

Chris Sanden: Automating Operational Decisions in Real-time

This talk was given by Chris Sanden, Senior Analytics Engineer at Netflix, and discussed three main areas where they apply automated analysis and machine learning in their architecture. By way of context, at this point in time, Chris said Netflix has over 700 services, with thousands of total server instances distributed across 3 AWS regions. From these instances, they collect millions of different metrics, which is plenty of data to feed to your favorite machine learning algorithm.

Automated Canary Analysis. The first area they applied machine learning of sorts was to assist with their canary deployment schemes. The idea here is that new versions of software are gradually rolled out into production, with a series of checkpoints where the new version is analyzed to make a go/no-go decision for advancing to the next checkpoint. Early on, the new version takes a small percentage of the overall service traffic, eventually having more instances deployed and taking a correspondingly greater portion of the traffic.

An interesting detail here was that they did not just run "mostly old" and "a little new" at the beginning; instead, they also designated a similarly-sized group of the old version in order to have an apples-to-apples comparison. For example, they might have 88 servers being the main production group on the old version, the canary deployment running on 6 servers with the new version, but then a third control group of 6 servers also running the old version. The canary would be compared against this control group instead of to the primary production group.

The rough scoring algorithm works as follows: select a set of metrics; compute the mean value of each metric for the canary and control groups every X minutes; calculate the ratio of mean values between canary and control; characterize the ratio as "high", "low", or "same"; and calculate the final canary score as the percentage of metrics that match or outperform the control. They then set a go/no-go threshold for the number of passing metrics in order to progress to the next stage of deployment, with a larger canary group (and a correspondingly larger control group). They found in practice that up to 16% of versions were assigned a failing score, although humans still have the final decision on whether to forge ahead or roll back. At any rate, they seemed satisfied that this analysis was finding about the right amount of "things to be considered by a human".

They did note that there were several things that made this somewhat of an art: selecting the right metrics, choosing the right frequency of analysis, choosing the right number of metrics (neither so few that you miss stuff nor so many that the analysis is prohibitively expensive); deciding whether a failing score was a false positive or not. They also noted that there were some caveats: a single outlier instance in either the control or canary group can skew the analysis (because they are using means), and their analysis did not reliably work well for canary groups smaller than six instances (which meant it was mostly useful for services with a larger number of deployed instances).

Server Outlier Detection. When you have a large number of server instances, outliers can be hidden in aggregated metrics. For example, a single abnormally slow service instance may not impact mean or 90th percentile latencies, but you still care about finding it, because your customers are still getting served by it. The next automated analysis application Chris described was specifically designed to find these outliers.

Their technique was to apply an "unsupervised machine learning" algorithm, DBSCAN (density-based spatial clustering of applications with noise) to group server instances into clusters based on their metric readings. Conceptually, if a point belongs to a cluster it should be near lots of other points as measured by some concept of "distance"; the servers that aren't close enough to a cluster are marked as outliers.

In practice, their procedure looks like:

  1. collect a window of measurements from their telemetry system
  2. run DBSCAN
  3. process the results, applying custom rules (e.g. ignore servers that have already been marked as out-of-service)
  4. perform an action on outliers (e.g. terminate instance or remove from service)

Chris noted that the DBSCAN algorithm requires two configuration parameters, which need to be tuned for each application. Since they want application owners to be able to use the system without in-depth knowledge of DBSCAN itself, they instead ask the owners to estimate how many outliers they expect (based on past experience) and then auto-tune the parameters via simulated annealing until DBSCAN finds the right number of outliers.

Anomaly Detection. Here they use machine learning to identify cases where aggregate metrics--ones that apply across all instances of a server--have become "abnormal" in some fashion. This works by training an automated model on historical metrics so that it learns what "normal" means and then loading it into a real time analytics platform to then watch for anomalies. The training happens by having service owners "tag" various anomalies as they occur in the dataset. Because what is "normal" for a service can drift over time as usage patterns and software versions change, they automatically evaluate each model against benchmark data nightly, retraining the model when performance (accuracy) has degraded, and then automatically switching to a more accurate model when one is found. They also try to capture when their users (other developers) think a model has drifted and try to make it easy to capture and annotate new data for testing.

Chris identified that bootstrapping the data set and initial model can be done by generating synthetic data and then intentionally injecting anomalies. He also mentioned that Yahoo has released an open source anomaly benchmark dataset. In addition, there are now multiple time-series databases available under open source licenses. One gotcha they have run into is that because humans are the initial source of training data (via classifying metrics as anomalous or not), they can sometimes be inaccurate or inconsistent when tagging data.

Lessons Learned. Chris indicated that the "last mile" can be challenging, and that "machine learning is really good at partially solving just about any problem." It's easy to get to 80% accuracy, but there are diminishing returns on effort after that. Of course, for some domains, 80% accuracy is still good enough/useful. Finally, he suggested not letting "Skynet out into the wild without a leash"--if the machine learning system is actually going to take operational actions, you need to make sure there are safeguards in place ("Hmm, I think I will just classify *all* of these server instances as outliers and just terminate them...") and to make sure that the safeguards have been well tested!

Mary Poppendieck: Design vs. Data

Mary Poppendieck (of Lean Software Development fame) asked: How do we get generative architectural designs that evolve properly? She cited examples from physical architecture (it turns out she comes from a family filled with architects), particularly Gothic cathedrals whose construction spanned decades and even centuries in some cases. Such projects certainly spanned multiple architects and master masons. In some cases this is obvious, as with cathedrals whose towers to the left and right of the main entrance do not look even remotely similar. In other cases (and Notre Dame de Paris was identified as one), the building nonetheless has an overall consistency to it. How does this work?

Mary reviewed Christopher Alexander's (yep, the pattern language guy) "Theory of Centers" that described fifteen properties of wholeness that good architecture should have. Mary proposed that ten of these--(1) levels of scale; (2) strong centers; (3) boundaries; (4) local symmetries; (5) alternating repetition (recursion); (6) echoes (patterns); (7) positive space; (8) good shape; (9) simplicity; and (10) not-separateness (connectedness)--had analogues for software architecture. Her hypothesis is:

Learning through ongoing experimentation is not an excuse for sloppy system design. On the contrary: strong systems grow from a design vision that helps maintain "Properties of Wholeness" while learning through careful analysis and rigorous experiments.

She suggested the Android Design Principles were a good example of this concept.

Mary then moved on to propose an architectural design language set up to allow for incremental learning and development while maintaining an overarching "wholeness". The main principles were:

  1. Understand data and how to use it. Data must be central to an architecture, and "a picture is worth a thousand data points". It's important to understand the difference between analysis (examining data you already collect) vs. experimentation (specifically collecting metrics to prove or disprove a hypothesis). Everyone must be on the same team, from data scientists to engineers to architects. Mary suggested these principles echoed Alexander's properties of shape, boundaries, and connectedness.
  2. Simplify the job of data scientists. Data pipelines must be wide and fast. Experiments need design and structure. Access to data must be provided through APIs that support learning and control. Alexander parallels: space, simplicity, levels of scale.
  3. See, Think, Gain Amazing Insights. Be conversant with the best tools and analytical models. Be explicit about assumptions. Make it easy to share the search for patterns and outliers. Test insights rigorously. Alexander parallels: patterns, symmetry, recursion.

Mary classified our uses of data into four main categories: monitoring, control, simulation, and prediction. A good data architecture will support all of these, and so it must provide the following set of capabilities:

  • fast pipelines
  • data wrangling
  • analytics
  • visualization
  • designed experiments
  • machine learning
  • adaptable business systems and processes (i.e. you must be ready and able to use insights gained). [Incidentally, I suspect this is the most challenging for many businesses to achieve].

In summary, Mary suggests we have to design the entire system, not just the code: technical architecture must also account for its data by-products and the surrounding processes we need to be able to support.

Finally, Mary had some choice quotes during the Q&A period after her talk:

Additional resources: (recommended by Mary)

Kovas Boguta & David Nolan: Demand-Driven Architecture

This talk was given jointly by Kovas Boguta and David Nolan. They correctly observed that the proliferation of different clients for many APIs has put a lot of pressure on the server side of those APIs: the server wants to present a one-size-fits-all RESTful interface, yet the clients often need customized versions of those resources to deliver polished experiences. In particular, many clients often need to present what are essentially joins across multiple resources. With N clients, you end up with N front end teams "attacking" the service team with N different sets of demands, resulting in what they described as a "Christmas tree service." The speakers suggested this was only going to get worse, not better, with the continued proliferation of mobile and IoT devices.

David observed that RDBMSes solved a similar problem previously: building a generalized interface (SQL) and allowing clients to issue requests that were queries specifying what data they wished to receive. Of course, we know well that exposing a SQL interface is rife with security problems, but perhaps the overall pattern can still be applied with a restricted "query language" of sorts that is easier to reason about.

The principles they proposed were:

  • the client must specify exactly what it wants, no more, no less, including specifying in what shape the data is returned. The request is basically a skeleton or template of what is desired for the response.
  • composition: the demand (query) is specified as a recursive data structure, which allows for variation/substitution. They proposed a JSON-based format, so it also supports batching as a core construct (via arrays).
  • interpretation: the service interprets/decides how to satisfy the specified demand; the client should not care how data is sourced behind the covers. The query language they proposed is less expressive than SQL and is hoped to be more amenable to inspection to understand security properties.

David then showed some Clojurescript source code that used the "pull syntax" from Datomic for the query language. In this code, he was able to annotate views with the queries that were needed to populate them; this allows for full client flexibility while making maintenance tractable. David pointed out that this doesn't mean you don't need a backend; on the contrary, you still need to worry about security, routing, and caching implications.

[Jon's commentary] Netflix tackled this same problem in a slightly different way, which was to build a scriptable API facade. Client teams were still able to customize device-specific APIs via Groovy scripts built on data/services that were exposed as libraries in their API facade application. This avoids exposing a more general query interface, which makes security analysis easier, although it does still require the client teams to implement and maintain the server sides of their APIs.

Additional material (via Kovas and David):

  • Datomic provides a pull syntax and an evolvable schema; the client can trivially receive change sets to keep a dataset up-to-date
  • Relay/GraphQL from Facebook: This is a layer over react.js that provides the illusion of having a monolithic application architecture (e.g. "pretend I have a single logical database").
  • JSONGraph/Falcor from Netflix: They were able to eliminate 90% of their client-side networking code by building against a more general server API.

Jesus Rodriguez: Powering the Industrial Enterprise: The Emergence of the IOT PaaS

Jesus Rodriguez, formerly of Microsoft and now a veteran of several startups, noted that Gartner says IoT is at the peak of inflated expectations (while also noting that Gartner is responsible for a lot of IoT hype!). Jesus also noted that 70% of IoT funding rounds from 2011-2013 were related to wearables, and there was almost no investment in platforms, which he saw as open territory.

Jesus suggested that enabling enterprise-scale IoT brings several challenges: large amounts of data; connectivity; integration; event simulation; scalability; security; and real time analytics. Therefore, he thought that there was a need for a new type of platform, an IoT platform-as-a-service (IoT PaaS). He thought we would see both centralized (interactions are orchestrated by some sort of centralized hub or service) and decentralized (devices interact directly) models develop, so there was a need perhaps for multiple types of PaaS here.

In the centralized model, smart devices talk to a central hub that provides backend capabilities but also manages and controls the device topology. In the decentralized model, devices operate without a central authority. Jesus felt that in this model the smart devices would host a version of the IoT PaaS itself; in this setting I presume it would be some sort of library, framework, or co-deployed process.

For the remainder of the talk, Jesus identified several capabilities that he felt ought to be provided by an IoT PaaS, as well as providing pointers to some existing technology. Since I wasn't familiar with a lot of the technologies he mentioned, this was an exercise in "write everything down and look it up later" for me (although I was the only person who raised a hand when he asked if anyone had heard of CoAP, which I had learned about via the appendix of Mike Amundsen's RESTful Web APIs book).

Centralized capabilities

  1. device management service: managing smart devices in an IoT topology; device monitoring; device security; device ping (tech: XMPP discovery extensions; IBM IoT Foundation device management API)
  2. protocol hub: provide consistent data/message exchange experiences across different devices. Unify management, discovery, and monitoring interfaces across different devices (IOTivity protocol plugin model; IOTLab protocol manager; Apigee Zetta / Apigee Link)
  3. device discovery: registering devices in an IoT topology; dynamically discovering smart devices in IoT network (UDP - multicast/broadcast, CoAP, IOTivity discovery APIs)
  4. event aggregation: execute queries over data streams; compose event queries; distribute query results to event consumers. complex event processing. (Apache Storm, AWS Kinesis; Azure Event Hubs and Stream Analytics; Siddhi (WSO2))
  5. telemetry data storage: store data streams from smart devices; store the output of the event aggregator service; optimize access to the data based on time stamps; offline query (time series: openTSDB, KairosDB, InfluxDB; offline: Couchbase; IBM Bluemix Time Series API)
  6. event simulation: replay streams of data; store data streams that simulate real world conditions; detect and troubleshoot error conditions associated with specific streams (Azure Event Hubs; Kinesis; Apache Storm; PubNub)
  7. event notifications: distribute events from a source to different devices; devices can subscribe to specific topics (PubNub; Parse Notifications; MQTT)
  8. real time data visualizations: map visualizations; integrate with big data / machine learning. (MetricsGraphicsJS; Graphite / Graphene; Cube; D3JS)

Jesus thought that the adoption of a centralized IoT PaaS would be realized by having a standard set of services/interfaces but multiple implementations, ideally in a hosting-agnostic package. It would be important to be extensible and allow for third party integration support, while providing centralized management and governance. Jesus thought that CloudFoundry might be a good place to build this ecosystem (or at least could serve as a good model for how to do it).

Decentralized IoT PaaS Capabilities

  1. P2P secure messaging: secure, fully encrypted messaging protocol (Telehash)
  2. contract enforcement & messaging trust: express capabilities; enforce actions; maintain a trusted ledger of actions (Blockchain; Ethereum)
  3. file sharing: efficiently sending files to smart devices (firmware update); exchange files in a decentralized model; secure and trusted file exchanges (Bittorrent)

Other capabilities

As with any PaaS system, gathering historical analytics will be important. In addition, there will be a need for device cooperation (machine-to-machine, or "M2M") which gets into agent-based artificial intelligence sorts of systems.

Jesus saw the possibility for several types of companies to bring an IoT PaaS to market:

  • PaaS companies: these are already cloud-based and could provide standalone services for specific capabilities, with a focus on being easy to use and manage
  • API and integration vendors: these would have an experience advantage for integrating APIs with IoT telemetry data. Although they are missing key elements of an IoT platform, they are relatively simple to use and set up.
  • Telecom: (e.g. the Huawei Agile IoT platform). These would have a deep integration with a specific network operator and would be optimized for devices and solutions made available by the operator. However, he thought these would be complex to use and hence might not have a lot of mainstream adoption.
  • Hardware or networking vendors: (e.g. Cisco or F5). These would have a focus on networking and security and would have support integration with a specific network hardware topology.

Orchestrating Containers with Terraform and Consul

This talk was given by Mitchell Hashimoto, CEO and co-founder of HashiCorp. HashiCorp maintains several open source projects in this space (Terraform and Consul are two of them) while also selling related commercial software products. Mitchell defined orchestration as "do some set of actions, to a set of things, in a set order." The particular goal for orchestration in the context of his talk was to safely deliver applications at scale. He noted that containers solve some problems, namely packaging (e.g. Docker Image), image storage (e.g. Docker Registry), and execution (e.g. Docker Daemon), with a sidenote that image distribution might still be an open problem here.

However, containers do not solve other problems that are nonetheless important for application delivery, namely:

  • infrastructure lifecycle and provisioning: the modern datacenter interacts with cloud-based DNS, CDNs, even databases. It needs container hosts, storage, network, and external services. Infrastructure should support container creation (easy), update (hard; even harder to do update with minimal downtime), and destroy lifecycle events. Infrastructure has its own lifecycle events: canary infrastructure changes, rolling deployments, etc.
  • monitoring: needed at multiple levels from node/host level to container level to service level. This information must be able to be not only collected but propagated, as it can have utility for later/downstream orchestration actions.
  • service discovery and configuration: where is service "foo"? How do we provide runtime configuration of a service at the speed of containers, especially in an immutable world? Mitchell suggested that Chef and Puppet don't really have a good injection point for this information when running or launching containers.
  • security/identity: need to provide for service identity for secure service-to-service communication, as well as a way to store and retrieve secrets.
  • deployment and application lifecycle: there is a need to support canary deployments, rolling deployments, blue/green deployments, and others. In an immutable server setting, this requires support for "create before destroy". Users must be able to trigger a deployment and monitor an in-flight deployment.

Even as organizations adopt containers, though, there is still a need to continue to support legacy applications; the transition from non-containers to containers isn't going to be atomic, so orchestration needs to also include support for non-containerized systems. The time period for this transition will probably be years--and what about orchestration in a post-container world someday? Mitchell quoted Gartner and Forrester (~citation needed~) as estimating that the Fortune 500 would still be completing their transition to virtualization in 2015, over a decade after viable enterprise-grade virtualization became available. In other words, orchestration is an old problem; it's not caused by containers. However, the higher density and lifecycle speed of containers reveals and exacerbates orchestration problems. Modern architectures also include new patterns and elements like public cloud, software-as-a-service (SaaS), and generally a growing external footprint. Orchestration problems will continue to exist for the foreseeable future.


Terraform solves the infrastructure piece of an overall orchestration solution, providing the ability to build, combine, and launch infrastructure safely and efficiently. As a way of illustrating what problems Terraform solves, Mitchell asked:

  • What if I asked you to create a completely isolated second environment to run an app? (e.g. QA or staging)
  • What if I asked you to deploy or update a complex application?
  • What if I asked you to document how our infrastructure is architected?
  • What if I asked you to delegate some operations to smaller teams (e.g. the distinction between "core IT" and "app IT")? Mitchell noted it is too easy to launch stuff "around" your Tech Ops teams these days, resulting in "shadow ops", so rather than fight it, find a way to achieve it well.

Terraform permits you to create infrastructure with code, including servers, load balancers, databases, email providers, etc., similar to what OpenStack Heat provides; this includes support for SaaS and PaaS resources. With Terraform, a single command is used for both creating and updating infrastructure. It allows you to preview changes to infrastructure and save them as diffs; code plus diffs can therefore be used to treat infrastructure changes like code changes: make a PR, show diffs, review, accept and merge. Terraform has a module system that allows subdividing your infrastructure to enable teamwork without risking stability; the configuration system allows you to reference the dynamic state of other resources at runtime, e.g. ${digitalocean_droplet.web.ipv4_address}. The configuration format is human-friendly and JSON compatible, and as a text format it is version-control friendly. Since the configuration is declarative, Terraform can be idempotent and highly parallelized; the diff-based mechanism means that Terraform will only do what the plan says it will do, allowing you to examine what it will do ahead of time ("make -n", anyone?) as a clear, human-readable diff.
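The declarative, diff-based idea can be illustrated with a toy sketch (this is not Terraform's implementation, just the concept, with made-up resource names): a plan is the difference between the infrastructure you have and the infrastructure the configuration declares, so applying the plan does exactly and only what the preview showed.

```python
def plan(current, desired):
    """Compute a Terraform-style plan: the set of actions needed to make
    the current infrastructure match the declared (desired) state."""
    actions = []
    for name, config in desired.items():
        if name not in current:
            actions.append(("create", name))
        elif current[name] != config:
            actions.append(("update", name))
    for name in current:
        if name not in desired:
            actions.append(("destroy", name))
    return sorted(actions)

# Hypothetical example: one resource changes size, one is new, one goes away.
current = {"web": {"size": "1gb"}, "old-worker": {"size": "512mb"}}
desired = {"web": {"size": "2gb"}, "db": {"size": "4gb"}}
actions = plan(current, desired)
# The plan is reviewable as a diff before anything is applied.
```

Because the plan is computed before execution, it can be reviewed like a code diff, which is the workflow the talk describes.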


Consul is "service discovery, configuration, and orchestration made easy." It is billed as being distributed, highly-available, and datacenter-aware. In a similar fashion to his description of Terraform, Mitchell identified several questions that Consul can answer for you:

  • Where is service foo?
  • What is the health status of service foo?
  • What is the health of machine or node foo?
  • What is the list of all currently running machines?
  • What is the configuration of service foo?
  • Is anyone currently performing operation foo?

Consul offers a service lookup mechanism with both HTTP and DNS-based interfaces. For an example of the latter:

$ dig web-frontend.service.consul. +short

Consul can work for both internal and external services (the external ones can be manually registered), and incorporates failure detection/health-checking so that DNS won't return unhealthy services or nodes (the HTTP interface offers more detailed information about the overall health state of the managed catalog). Health checks are carried out via local agents; a health check can be an arbitrary shell script. Participating nodes then gossip health information to each other.

Consul provides key/value storage that can be used for application configuration, and watches can be set on keys (via long poll) to receive notification of changes. Consul also provides for ACLs on keys to protect sensitive information and allow for multi-tenant use. Mitchell suggested that the type of information best suited for Consul should power "the knobs you want to turn at runtime" such as port numbers or feature flags. There is an auxiliary project called consul-template that can regenerate configuration files from templates whenever underlying configuration data changes.
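As a toy illustration of the consul-template idea (not the real tool, which has its own templating language; the key names and template here are invented for the example), the following Python sketch re-renders a configuration file from a template whenever a watched key/value pair changes:

```python
from string import Template

# Hypothetical config template; placeholders are filled from the K/V store.
CONFIG_TEMPLATE = Template("port = $port\nfeature_x_enabled = $feature_x\n")

def render_config(kv_store):
    """Re-render the config file contents from current key/value data."""
    return CONFIG_TEMPLATE.substitute(
        port=kv_store["service/web/port"],
        feature_x=kv_store["service/web/feature_x"],
    )

# Simulate a watch firing: a key changes, so the config is regenerated.
kv = {"service/web/port": "8080", "service/web/feature_x": "false"}
before = render_config(kv)
kv["service/web/feature_x"] = "true"   # an operator turns a knob at runtime
after = render_config(kv)
```

In the real system, the "watch firing" is Consul's long-poll notification, and the regenerated file would typically be followed by a service reload.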

Consul provides multi-datacenter support as well, although from what I understood, each datacenter is essentially its own keyspace, since values are set by the strongly consistent Raft protocol, which generally doesn't run well across wide-area networks (WANs). Key lookups are local by default, but the local datacenter's Consul masters can forward requests to other datacenters as needed, so you can still view keys and values from all datacenters within one user interface.

In addition to basic key/value lookup, Consul also supports events that can be published as notifications, as well as execs, which are conceptually a "scalable ssh for-loop" in Mitchell's words. He said there are pros and cons to using each of events, execs, and watches, but that when used in appropriate settings they have found they can scale to thousands of Consul agents.

Camille Fournier: Consensus Systems for the Skeptical Architect

Camille Fournier, CTO of Rent the Runway (RTR), subtitled her talk "Skeptical Paxos/ZAB/Raft: A high-level guide on when to use a consensus system, and when to reconsider". Camille rhetorically asked: if new distributed databases like Riak, Cassandra, or MongoDB don't use a standalone consensus system like ZooKeeper (ZK), etcd, or Consul, are the latter consensus systems any good? She pointed out that the newer distributed databases are often focused on: high availability, where strong consistency is a tradeoff; fast adoption to pursue startup growth (e.g. "don't ask me to install ZooKeeper before installing your distributed database"); and were designed from the ground up as distributed systems by distributed systems experts. She also shared that RTR does not use a standalone consensus system, largely because their business needs and technical environment either don't require or aren't suitable for such a system. In the remainder of the talk, Camille shared some evaluation criteria that teams and organizations can use when trying to decide if systems like ZK et al. are a good fit.

First, we should evaluate where the system would run. If the environment does not require operational support for rapid growth and rapid change, then a standalone consensus system may be overkill. Consensus systems are often used to provide distributed locks or service orchestration for large distributed systems, but in Camille's words, "you don't always have an orchestra; sometimes you have a one-man band." Simpler alternatives to distributed service orchestration include load balancers or DNS; locks can be provided by databases (just use a transactional relational database, or something like DynamoDB that supports strongly consistent operations).
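A minimal sketch of the "just use a database" alternative, assuming SQLite for self-containment (the table and lock names are invented): an advisory lock implemented as a unique row, where acquiring the lock is an INSERT that the database's uniqueness constraint lets succeed for exactly one client.

```python
import sqlite3

def try_acquire(conn, lock_name, owner):
    """Attempt to take an advisory lock; returns True if we got it.
    The PRIMARY KEY constraint guarantees at most one holder per lock."""
    try:
        with conn:  # transaction: commits on success, rolls back on error
            conn.execute(
                "INSERT INTO locks (name, owner) VALUES (?, ?)",
                (lock_name, owner),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # someone else already holds this lock

def release(conn, lock_name, owner):
    """Release the lock by deleting our row (only our own row)."""
    with conn:
        conn.execute(
            "DELETE FROM locks WHERE name = ? AND owner = ?",
            (lock_name, owner),
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE locks (name TEXT PRIMARY KEY, owner TEXT)")

got_a = try_acquire(conn, "nightly-report", "worker-a")   # acquired
got_b = try_acquire(conn, "nightly-report", "worker-b")   # blocked: held by a
release(conn, "nightly-report", "worker-a")
got_b2 = try_acquire(conn, "nightly-report", "worker-b")  # acquired after release
```

A production version would also need lease expiry for crashed holders, which is exactly where the lock-assumption caveats later in the talk come in.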

Second, we should consider what primitives are needed for our application. Consensus systems provide strong consistency, and several have support for ephemeral nodes (which disappear when a client session disconnects) and notifications or watches. Different consensus systems provide these to different degrees, and later in the talk Camille summarized some of these tradeoffs, which I captured below. Perhaps her strongest point in this section was that consensus systems are not really a key/value store per se; they are designed to point to data, not to contain it. You can use them for limited pub-sub operations, and you can also fix things with duct tape and baling wire.

Camille also provided some analysis of the similarities and differences between ZK and etcd to illustrate some of the subtleties involved with choosing one or the other. Both obviously use a proven consensus algorithm (ZAB for ZooKeeper, Raft for etcd) to provide consistent state. With ZooKeeper, clients maintain a stateful connection to the cluster. While this can be powerful, it can be hard to do right--the ZooKeeper client state machine is complicated, and Camille recommended using the Curator library for Java instead of writing your own. This approach ensures a single system image per client. etcd, by contrast, has an HTTP-based interface, which is easy to implement against and does not require complex session management; however, you pay the overhead of the HTTP protocol, if you use temporary time-to-live nodes you have to implement heartbeats/failure detection in your clients, and achieving a "single system image" requires more work. The watch semantics differ as well: ZK watches do not guarantee that you will see the intermediate states of a watched node that undergoes multiple changes, whereas etcd watches are provided via long poll and can also show the change history within a certain timeframe.

Camille then wrapped up with a number of common challenges that can be faced when deploying consensus systems.

Odd numbers rule
Use 3 or 5 cluster members; there's no value in just having four, as you still need 3 available to gain a majority. This requires more servers to be up than with 3 cluster members while tolerating fewer failures than 5 cluster members--the worst of both worlds.
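The arithmetic behind the odd-numbers rule is simple to spell out: a majority quorum of an n-member cluster is ⌊n/2⌋+1, and the cluster tolerates n minus that many failures. A quick sketch:

```python
def quorum(n):
    """Smallest majority of an n-member cluster."""
    return n // 2 + 1

def faults_tolerated(n):
    """How many members can fail while a majority remains available."""
    return n - quorum(n)

for n in (3, 4, 5):
    print(f"{n} members: quorum {quorum(n)}, tolerates {faults_tolerated(n)} failure(s)")
# A 4-member cluster needs a larger quorum (3) than a 3-member cluster (2)
# yet tolerates the same single failure: more servers, no extra resilience.
```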
Clients can run amok
Camille also phrased this as "multi-tenancy is hard" and "hell is other developers." She suggested potentially not sharing the same consensus system deployment to guarantee resource isolation, doing lots of code review, and providing wrapper libraries for clients to ensure good client behavior.
Garbage collection (GC) and network burps
This is a warning about lock assumptions. Many distributed locks are advisory and are based on the concept of temporary leases that must be successfully renewed by the holder or the lock gets released. In some cases, a GC pause or network partition can exceed the lock lease timeout, which can result in two systems thinking they both hold the lock. Dealing successfully with this challenge requires validating lock requests at the protected resource in order to detect out-of-order use. Despite strong consistency, the realities of the physical world mean that ZK et al. can only provide advisory locking at best.
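One standard remedy for this (sometimes called "fencing tokens," a term popularized by Martin Kleppmann) is exactly the validation described above: each lock grant carries a monotonically increasing token, and the protected resource rejects requests bearing a token older than one it has already seen. A minimal sketch, with all names invented for illustration:

```python
class FencedResource:
    """A protected resource that rejects requests from stale lock holders."""
    def __init__(self):
        self.highest_token_seen = 0
        self.writes = []

    def write(self, token, data):
        # Reject any request whose fencing token is older than one we've
        # already seen: that holder's lease must have expired and been
        # reissued to someone else while it was paused or partitioned.
        if token < self.highest_token_seen:
            return False
        self.highest_token_seen = token
        self.writes.append(data)
        return True

resource = FencedResource()
ok_a = resource.write(33, "from client A")     # A holds lease with token 33
ok_b = resource.write(34, "from client B")     # A's lease expired; B got 34
ok_stale = resource.write(33, "from client A") # A wakes from GC pause: rejected
```

Note that this requires the resource itself to cooperate, which is why locks from ZK et al. remain advisory rather than mandatory.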
Look for the blood spatter pattern
Bugs will be in the features no one uses or the things that happen rarely. Camille shared a story where when she first tried to use ACLs and authentication in ZooKeeper--a documented feature--she found it didn't work at all, because no one actually used it!
Consensus owns your availability
Per the CAP theorem, there can be an availability tradeoff for consensus systems during partitions, even if the individual cluster members are up. If you make your application's availability dependent on the consensus system's availability, consensus can become a single point of (distributed!) failure.