Saturday, October 31, 2009

NoSQL East 2009 Redux

I just attended NoSQL East down in Atlanta over the last two days. This was a fantastic conference, well-organized, with not only great content and speakers, but also a very well-educated audience. There was a telling moment in the first non-keynote talk where the speaker asked the audience "How many people have read the Dynamo paper?" and easily 95% of the audience put their hands up.

I'd divide the major focus areas into the following groups: key-value stores, massive-scale data mining, column-oriented datastores, and document-oriented databases.

As one of the speakers put it, which one you need depends on the shape of your data; these are all isomorphic to each other in one way or another, just as Turing-complete languages are all essentially equivalent. But you still want to choose the right tool for the job. One nice touch was that a number of the systems were introduced not by one of the project committers, but rather by people who were using them to get stuff done. It was a very practical focus, much appreciated.

There were a couple of interesting tidbits from the conference, including a rollicking, confusing talk by John "My first network address was '12'" Day about network architecture and how TCP/IP had really gotten it all wrong. This guy was about 3 planes of existence above what I could follow at 9:15am--I had a distinct feeling he knew exactly what he was talking about, and was probably right, but there was no way he was making sense even to a very technical audience. Then there was the poor guy from Microsoft Research who had done some interesting distributed IDE work...for the .NET framework. In a sea of Mac and Linux laptops across the audience, he had a bunch of people who really couldn't make practical use of what he'd done.

Key-Value Stores

I've got to give a pretty big nod to Riak here for a straight-up key-value store; Justin Sheehy basically gave a talk about decentralized, scalable, lights-out system design that might well have been a good keynote, and it was clear that Riak was designed with high priority on those aspects. Major advantages over Voldemort in my mind are: ability to add/subtract nodes from running clusters, tunable (per-bucket, per-request!) quorum parameters (N,R,W), plus support for pluggable consistent hashing modules, partition distribution modules, and conflict resolution modules. For most of my use cases, with N=3, I'd probably do R=1,W=3 to optimize for read latency, as most of our writes would either be a user clicking "save" or would be the result of an asynchronous background process. On the other hand, if I wanted to build something like SQS on top of this (which I conjecture is possible), I'd probably do W=1,R=3 to optimize for the "post a message" latency.
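The tradeoff behind those N, R, W choices can be sketched in a few lines. This is an illustration of the general quorum rule (a read quorum and a write quorum must overlap whenever R + W > N), not Riak's actual API:

```python
# Hedged sketch of tunable quorum parameters: N replicas, R read acks,
# W write acks. Any read is guaranteed to see the latest acknowledged
# write whenever the two quorums must intersect, i.e. R + W > N.

def overlaps(n: int, r: int, w: int) -> bool:
    """True if every read quorum must intersect every write quorum."""
    return r + w > n

# Read-optimized choice from the post: every write hits all 3 replicas,
# so a single-replica read is enough to see the latest value.
assert overlaps(n=3, r=1, w=3)

# The SQS-style, write-optimized choice: cheap writes, expensive reads.
assert overlaps(n=3, r=3, w=1)

# A latency-lazy R=1, W=1 config gives no such guarantee:
assert not overlaps(n=3, r=1, w=1)
```

The per-request tunability means you can make this call separately for each operation rather than once for the whole cluster.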

While I was there, I had a great talk with them where we went over how they were going to expose conflict resolution up to the client via their HTTP/JSON interface. There's no out-of-the-box support for multiple-datacenter awareness, although it seemed possible to add via the pluggable partition distribution module; however, their inter-node communication leverages Erlang's native networking, meaning some kind of VPN/IPSec tunnel would be the main way to make this work across sites. I'm pretty sure that no matter what you want to use in this space, you'll probably end up using a VPN over the WAN to give the appearance of a seamless LAN cluster (although the Cassandra guys piped up in the conference IRC channel that Cassandra has multi-site support already).

I can't tell whether being written in Erlang is a plus or a minus. On the one hand, it won't plug in well to our JMX-based monitoring frameworks, and I'm pretty sure very few folks within our organization have ever built or deployed an Erlang system. On the other hand, Erlang is the perfect choice for highly concurrent, fault-tolerant programming, and would probably be right up the alley of several of our programmers. I'm proud to say that while we are primarily a Java shop, we have a lot of other language proficiencies in-house spanning Ruby, Python, Clojure, and Scheme; based on my brief escapade with Erlang last night, I'm confident that (1) we have a number of folks who could pick it up easily and (2) they would jump at the chance. This is probably a wash.

Very interesting in my mind is that some of the key-value stores have brought over some features from other design spaces; Riak allows some basic link structure in their values, including a version of targeted MapReduce that can follow links, which starts to make it feel like a graph-oriented database like Neo4j. Similarly, Cassandra has support for column-oriented indices like HBase does. It's clear there's probably a project out there to scratch your particular itch.
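To give a feel for why link-walking starts to feel graph-like, here is a toy sketch: a key-value store whose values carry tagged links, plus a traversal that follows them. The keys, link tags, and helper function are all invented for illustration; this is not Riak's actual link syntax:

```python
# Hypothetical key-value store where each value carries (tag, target-key)
# links -- roughly the flavor of link-following queries described above.

store = {
    "user/jon": {"name": "Jon", "links": [("wrote", "post/nosql-east")]},
    "post/nosql-east": {"title": "NoSQL East Redux",
                        "links": [("tagged", "tag/nosql")]},
    "tag/nosql": {"label": "nosql", "links": []},
}

def walk(key, tag):
    """Follow every link with the given tag out of one stored value."""
    return [target for (t, target) in store[key]["links"] if t == tag]

# Start at a user, follow "wrote" links, then "tagged" links -- the
# kind of multi-hop query a graph database makes first-class.
posts = walk("user/jon", "wrote")
tags = [t for p in posts for t in walk(p, "tagged")]
```

Once values can point at each other like this, a MapReduce phase that maps over the *targets* of links, rather than over a fixed key set, falls out naturally.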

Massive-scale data mining

We saw a couple of talks from folks who were using Hadoop for analyzing very large data sets. Furthermore, they were using frameworks like Pig and Cascading on top of Hadoop to run a lot of ad-hoc queries. Twitter, in fact, uses Pig for all of their interesting web analytics, and they've taken all of their analytics BAs who are used to writing ad-hoc SQL queries and trained them up in Pig with little trouble. This is probably somewhere on our horizon, although there are larger cats to skin at the moment.

Column-oriented datastores

Cassandra and HBase were the major players here. Digg is moving all of their backend storage over to Cassandra after some successful pilots, and we saw a great talk by Mark Gunnels about how he is using HBase because it makes it "simple to get sh*t done". Apparently recent releases of HBase have made strides in fault tolerance (there is now an election algorithm for some of the special name nodes) and latency (apparently performance optimization was a major focus of the 0.20 release). There's an interesting article that describes some wide-area replication schemes that are available with HBase that sound intriguing (although once you are beyond two datacenters, I am convinced that you are better off with the VPN LAN solution if you want to have any hope of achieving eventual consistency).
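For anyone fuzzy on what "column-oriented" means here, a rough sketch of the data model Cassandra and HBase share: rows are keyed, and each row holds sparse columns grouped into column families. The row key, family names, and helper below are invented for illustration:

```python
# Toy model of a column-family row: unlike a relational row, columns are
# sparse (a row stores only the columns it actually has) and are grouped
# into families that can be stored and tuned independently.

row = {
    "row_key": "user:jon",
    "families": {
        "profile": {"name": "Jon", "shop": "Java"},
        "counters": {"logins": 42},
    },
}

def get(row, family, column):
    """Look up one cell; missing columns simply aren't there."""
    return row["families"][family].get(column)
```

The sparseness is the point: two rows in the same "table" can have completely different column sets, which is what makes these stores a good fit for semi-structured data.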

Document-oriented databases

While the column-oriented datastores are good at organizing semi-structured data, the document-oriented guys are really all about organizing largely unstructured documents, and focus on doing some wild ad-hoc MapReduce queries. I still need to be convinced about the scaling, replication, and geographic distribution capabilities in this space, so it may be a while before we dip our toes in here.
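Those ad-hoc MapReduce queries work roughly like this sketch of a CouchDB-style map/reduce "view" over a pile of documents. The documents and field names are made up for illustration; only the map-then-reduce shape is the real idea:

```python
from collections import defaultdict

# A pile of largely unstructured documents; only some share a schema.
docs = [
    {"type": "talk", "topic": "key-value", "attendees": 80},
    {"type": "talk", "topic": "column", "attendees": 120},
    {"type": "talk", "topic": "key-value", "attendees": 60},
    {"type": "note", "text": "lunch was good"},
]

def map_fn(doc):
    # Emit (key, value) pairs for documents we recognize, as a
    # CouchDB-style map function would; others are simply skipped.
    if doc.get("type") == "talk":
        yield doc["topic"], doc["attendees"]

def reduce_fn(values):
    return sum(values)

buckets = defaultdict(list)
for doc in docs:
    for key, value in map_fn(doc):
        buckets[key].append(value)

view = {key: reduce_fn(vals) for key, vals in buckets.items()}
# view now groups attendance by topic across the ad-hoc document set
```

Because the map function decides per-document what to emit, you never have to declare a schema up front, which is exactly the appeal for unstructured data.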

Other

There was a very interesting talk about Tin ("the database so tiny they had to shorten its name"). This basically seemed to be a very clever approach to delivering stock data that leveraged a basic filesystem with range queries and rewriting rules via Sinatra. It turns out web servers serving static files go pretty fast, if that's a suitable representation for your stuff (it seems like his primary use case is delivering read-only data out to clients, where the data gets updated asynchronously in the background by the system). We actually have some datasets that are like that, so this is intriguing!

There was a talk from Neo4j, a graph-oriented database, which I just can't wrap my head around yet. I think I need to read up on some of the background research papers in this area. Certainly, the notion of a datastore based around the notion of link relations would likely be easy to expose as a REST HTTP interface, which is attractive. Our particular domain model can actually be nicely normalized, however (we are currently running off a traditional RDBMS after all), so I'm not sure we need the full semantic flexibility this offers.

There was also a talk from Redis; this is primarily an in-memory storage system with blazing-fast speeds, and the ability to write out to disk is really an afterthought. During the talk he showed a screen snapshot of a "top" running on a cloud-hosted virtual node with 60GB of memory. I cannot make up my mind whether this is ultimately just a badass memcached, or if he's on the cusp of something: if you have enough memory to hold your whole dataset, and you are sufficiently fault-tolerant, why bother writing to disk at all? Especially if you can easily get a periodic snapshot out to disk in the background that can be properly backed up for disaster recovery.
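The "periodic snapshot in the background" idea can be sketched in a few lines. This is an illustration of the pattern, not Redis's actual implementation (which forks the process rather than taking a lock), and the class and path names are invented:

```python
import json
import threading

class MemStore:
    """All writes land in memory; a snapshot copies state to disk."""

    def __init__(self, snapshot_path):
        self.data = {}
        self.lock = threading.Lock()
        self.snapshot_path = snapshot_path

    def set(self, key, value):
        with self.lock:
            self.data[key] = value  # acknowledged once it's in memory

    def get(self, key):
        with self.lock:
            return self.data.get(key)

    def snapshot(self):
        # Take a point-in-time copy under the lock, then do the slow
        # disk write outside it, so writers are never blocked on I/O.
        with self.lock:
            frozen = dict(self.data)
        with open(self.snapshot_path, "w") as f:
            json.dump(frozen, f)

store = MemStore("/tmp/memstore.json")
store.set("counter", 42)
store.snapshot()  # in practice, run on a timer or after N changes
```

The disaster-recovery story then reduces to backing up the snapshot file; the open question from the talk is whether that plus replication is durability enough.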

Conclusion

As folks pointed out, "NoSQL" might better be written "NOSQL" to mean "Not Only SQL". I didn't sense a lot of MySQL hatred in here; quite the contrary, many people were very complimentary, noting that there are certain things it does *really* well, and that the MySQL hammer was something that was staying firmly in their toolboxes. However, it is clear that there is a maturing community of very practically-minded folks that are looking for a new set of tools to drive their particular screws. Although to be sure (and this is something that I think some of the conference tweets corroborated), this also implies we all have a screw loose or two....

10 comments:

pcdinh said...

I believe that Redis should belong to "key-value stores" category

Tim Anglade said...

More info on tin: slides and soon, code. Cheers ;)

Ismael Juma said...

"Major advantages over Voldemort in my mind are: ability to add/subtract nodes from running clusters, tunable (per-bucket, per-request!) quorum parameters (N,R,W), plus support for pluggable consistent hashing modules, partition distribution modules, and conflict resolution modules."

I think this is a bit misleading. Voldemort has a pluggable RoutingStrategy that decides what nodes should receive a given request. The default is ConsistentRoutingStrategy.

Also, it has a pluggable InconsistencyResolver interface and the default is a chained resolver with TimeBasedInconsistencyResolver over a VectorClockInconsistencyResolver.

Regarding quorum parameters, it's true that Voldemort only allows you to set it per store. I also find that a bit limiting and filed an issue a while ago about allowing it per request (the issue was for gets, but maybe it should be allowed for everything). Internally, it's quite easy to allow for this, most of the work is purely mechanical.

Best,
Ismael

Aníbal Rojas said...

I agree with pcdinh that Redis should be included in the key-value category. It is blazing fast and the configurable async persistence mechanism is very flexible.

Unlike Memcached, it provides Lists, Sets, and many atomic commands like INCR. You can model real problems with these structures.

Also in the Data Structures side a Sorted Set (ZSET) is under heavy development.

Master-slave replication is blazing fast: 10MM keys (a 237M dump file and 1.5GB of RAM) takes just 21 seconds on an EC2 instance.

(And yes I like it very much ;)

Brian Bulkowski said...

Jon - what was your impression of Citrusleaf ?

Sam Johnston said...

+1 for NOSQL...

Jon Moore said...

@Ismael: Thanks for the updates and corrections on Voldemort. I was probably a little too loose when writing that paragraph; that single sentence you quoted, in my mind, started off with things Voldemort didn't have and trailed over to things I knew Riak had that I was excited about.

It's good to hear Voldemort has support for the pluggable strategies, although I think in my mind not being able to add/subtract nodes on the fly from a cluster is the biggest omission from an operational point of view (at least for my use cases).

Jon Moore said...

@pcdinh, @anibal: I think I would put Redis in the key-value store category if it weren't for the explicitly asynchronous persistence, i.e. when a PUT comes back with a successful response code, I want that to mean "it's been written to disk somewhere." There are, for sure, a lot of applications where "it'll be written to disk soon" is good enough, but it isn't for my use cases.

Also, it is expensive to scale for large datasets (disk is still much cheaper than memory).

Now, as I alluded to in the discussion of Redis, I'm open to the possibility that I need to rethink the need for to-disk persistence in a world of distributed, fault-tolerant architectures, but I'm not quite there yet (which is why I put Redis in the "not sure what to think about it yet" category).

Jon Moore said...

@Brian: Unfortunately, I didn't take in the Citrusleaf presentation due to a combination of "unconference" discussions going on and at-home work issues.

Ismael Juma said...

Hey Jon,

"It's good to hear Voldemort has support for the pluggable strategies, although I think in my mind not being able to add/subtract nodes on the fly from a cluster is the biggest omission from an operational point of view (at least for my use cases)."

Everyone seems to agree and that is Alex Feinberg's main priority as far as I understand. He recently joined LinkedIn and Voldemort is one of the main projects he will be working on.

Ismael