Friday, August 13, 2010

RESTful Refactor: Combine Resources

I've been spending a lot of time thinking about RESTful web services, particularly hypermedia APIs, and I've started to discover several design patterns as I've begun to play around with these in code. Today, I want to talk about the granularity of resources, which is roughly "how much stuff shows up at a single resource". Generally speaking, RESTful architectures work better with coarser-grained resources, i.e., transferring more stuff in one response, and I'll walk through an example of that in this article.

Now, in my previous article, I suggested taking each domain object (or collection of domain objects) and making it a resource with an assigned URL. While following this path (along with the other guidelines mentioned) does get you to a RESTful architecture, it may not always be an optimal one, and you may want to refactor your API to improve it.

Let's take, for example, the canonical and oversimplified "list of favorite things" web service. There are potentially two resource types:

  • a favorite thing (/favorites/{id})
  • a list of favorite things (/favorites)
All well and good, and I can model all sorts of actions here:
  • adding a new favorite: POST to /favorites
  • removing a favorite: DELETE to the specific /favorites/{id}
  • editing a favorite: PUT to the specific /favorites/{id}
  • getting the full list: GET to /favorites
Fully RESTful, great. However, let's think about cache semantics, particularly the cache semantics we should assign to the GET to /favorites. This is probably the most common request we'd have to serve, and in fact it ought to be quite cacheable, as in practice (as with a lot of user-maintained preferences or data) there are going to be lots of read accesses between writes.

There's a problem here, though: some of the actions that would cause an update to the list don't operate on the list's URL (namely, editing a single entry or deleting an entry). This means an intermediary HTTP cache won't invalidate the cache entry for the list when those updates happen. If we want a subsequent fetch of the list by a user to reflect an immediate update, we either have to put 'Cache-Control: max-age=0' on the list and require validation on each access, or we need the client to remember to send 'Cache-Control: no-cache' when fetching a list after an update.
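To make the stale-list problem concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `ToySharedCache` and `ToyOrigin` are stand-ins I made up for illustration, the cache is assumed to invalidate only the entry for the URL an unsafe request was sent to, and the cached list is assumed to still be within its freshness lifetime.

```python
class ToyOrigin:
    """Stand-in origin server holding the favorites in memory (hypothetical)."""

    def __init__(self):
        self.favorites = {"1": "raindrops on roses", "2": "whiskers on kittens"}

    def handle(self, method, url, body=None):
        if method == "GET" and url == "/favorites":
            return sorted(self.favorites.values())
        if method == "DELETE" and url.startswith("/favorites/"):
            self.favorites.pop(url.rsplit("/", 1)[1], None)
            return None
        raise ValueError(f"unsupported request: {method} {url}")


class ToySharedCache:
    """Deliberately simplified intermediary cache (hypothetical).

    It stores GET responses by URL and, on an unsafe method, drops only
    the entry for the exact URL the request targeted.
    """

    def __init__(self, origin):
        self.origin = origin
        self.entries = {}

    def request(self, method, url, body=None):
        if method == "GET":
            if url not in self.entries:
                self.entries[url] = self.origin.handle(method, url, body)
            return self.entries[url]
        # Unsafe method: invalidate this URL only, then pass the request through.
        self.entries.pop(url, None)
        return self.origin.handle(method, url, body)


origin = ToyOrigin()
cache = ToySharedCache(origin)

before = cache.request("GET", "/favorites")   # fills the cache entry for the list
cache.request("DELETE", "/favorites/1")       # invalidates /favorites/1 only
stale = cache.request("GET", "/favorites")    # still answered from the cache
fresh = origin.handle("GET", "/favorites")    # what the origin actually holds now

print("cached list:", stale)
print("origin list:", fresh)
```

The cached list still contains the deleted item because nothing ever touched the cache entry keyed by /favorites.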

Putting 'Cache-Control: max-age=0' on the list resource really seems a shame. Most RESTful APIs are set up to cross WAN links, so a revalidation that comes back 304 Not Modified may still cost you most of the latency of a full fetch that returned 200 OK, especially if you have fine-grained resources that don't carry a lot of data (and a textual list of 10 or so favorite items isn't a lot of data!).

Requiring the client to send 'Cache-Control: no-cache' is also problematic: the cache semantics of the resources are really supposed to be the server's concern, yet we are relying on the client to understand something extra about the relationship between various resources and their caching semantics. This is a road that leads to tight coupling between client and server, thus throwing away one of the really useful properties of a REST architecture: allowing the server and client to evolve largely independently.

Instead, let me offer the following rule of thumb: if a change to one resource should cause a cache invalidation of another resource, maybe they shouldn't be separate resources. I'll call this a "RESTful refactoring": Combining Resources.

In our case, I would suggest that we only need one resource:

  • the list of favorites
We can still model all of our actions:
  • adding a new favorite: PUT to /favorites a list containing the new item
  • removing a favorite: PUT to /favorites a new list with the offending item removed
  • editing a favorite: PUT to /favorites a list containing an updated item
  • getting the full list: GET to /favorites
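On the client side, each of these actions becomes "take the list I have, change it, PUT the whole thing back". A rough sketch in Python follows; the helper names and the JSON representation are my own assumptions for illustration, and `put_request` only describes the request rather than sending it.

```python
import json


def add_favorite(favorites, item):
    """Return a new list with the item appended."""
    return favorites + [item]


def remove_favorite(favorites, item):
    """Return a new list without the given item."""
    return [f for f in favorites if f != item]


def edit_favorite(favorites, old, new):
    """Return a new list with one item replaced."""
    return [new if f == old else f for f in favorites]


def put_request(favorites):
    """Describe the PUT that carries the whole list (JSON body is an assumption)."""
    return {"method": "PUT", "url": "/favorites", "body": json.dumps(favorites)}


current = ["raindrops on roses", "whiskers on kittens"]
request = put_request(
    edit_favorite(current, "whiskers on kittens", "warm woolen mittens")
)
print(request)
```

Every write now targets the same URL that reads use, which is the property the rest of this article relies on.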
But now, I can put a much longer cache timeout on the /favorites resource, because if a client does something to change its state, it will do a PUT to /favorites, invalidating its own cache (assuming the client has its own non-shared/private cache). If the resource represents a user-specific list, then I can probably set the cache timeout considering:
  • how long am I willing to wait for another user to see the results of this user's updates?
  • if the same user accesses the resource from a different computer, how long am I willing to allow those two views to stay out of sync (bearing in mind that the user can usually, and pretty intuitively, hit refresh on a browser page that looks out of date)?
Probably these values are a lot larger than the zero seconds we were using via 'Cache-Control: max-age=0'. When you can figure out how to assign longer expiration times to your responses, you can get a much bigger win for performance and scale. While revalidating a cached response is probably faster than fetching the resource anew, not having to send a request at all to the origin is waaaaaaay better.
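As a back-of-the-envelope illustration, the sketch below counts how many reads of /favorites have to contact the origin at all (either a full fetch or a revalidation) for a given max-age. The read times are made up, no writes happen in between, and the freshness rule is simplified to "stale once the copy's age reaches max-age".

```python
def origin_requests(read_times, max_age):
    """Count the reads that must contact the origin (fetch or revalidation).

    read_times: seconds at which a client reads /favorites, ascending.
    max_age: freshness lifetime in seconds, as in 'Cache-Control: max-age=N'.
    """
    contacts = 0
    stored_at = None  # when the cached copy was last fetched or revalidated
    for now in read_times:
        if stored_at is None or now - stored_at >= max_age:
            contacts += 1
            stored_at = now
    return contacts


reads = [0, 30, 60, 90, 120, 400]  # hypothetical access pattern
always_revalidate = origin_requests(reads, max_age=0)
five_minutes = origin_requests(reads, max_age=300)

print("max-age=0:  ", always_revalidate, "origin contacts")
print("max-age=300:", five_minutes, "origin contacts")
```

With these made-up numbers, max-age=0 sends all six reads to the origin, while max-age=300 sends only two.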

The extreme case, here, of course, would be a web service where a user could just get all their "stuff" in one big blob with one request (as we modelled above). There are many domains where this is quite possible, and when you factor in gzip encoding, you can start to contemplate pushing around quite verbose documents, which can be a big win assuming your server can render the response reasonably quickly.
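If you want a feel for what gzip does to a verbose, repetitive document, something like the following Python sketch will do. The document shape (a JSON object with 200 made-up favorites) is purely an assumption for illustration, and the ratio you see will depend heavily on your real data.

```python
import gzip
import json

# Hypothetical "one big blob" of user data, pretty-printed to be extra verbose.
favorites = [
    {"id": i, "name": f"favorite thing number {i}", "notes": "added from the web client"}
    for i in range(200)
]
raw = json.dumps({"favorites": favorites}, indent=2).encode("utf-8")

# This is the kind of saving a server applying 'Content-Encoding: gzip' would get.
compressed = gzip.compress(raw)
ratio = len(compressed) / len(raw)

print(f"raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes ({ratio:.1%} of original)")
```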