March 25, 2010
In the previous installment we looked at how your reasons for picking Riak affect how your schema should be designed, and how you might go about structuring your data at the individual object level. In this post we’ll look at how to design relationships on top of Riak.
Relationships? I thought Riak was key-value.
An even mildly-complicated application is going to have more than one type of data to store and manipulate. Those data are not islands, but have relationships to one another that make your application and its domain more than just arbitrary lists of things.
Yes, at its core, Riak is a key-value store or distributed hash-table. Because key-value stores are not very sophisticated at modeling more complicated relationships, Riak adds the concept of links between objects that are qualified by “tags” and can be easily queried using “link-walking”.
Now, the knee-jerk reaction would be to start adding links to everything. I want to show you that the problem of modeling relationships is a little more nuanced than just linking everything together, and that there are many ways to express the same relationship — each having tradeoffs that you need to consider.
Key correspondence
The easiest way to establish a relationship is to have some correspondence between the keys of the items. This works well for one-to-one and some one-to-many relationships and is easy to understand.
In the simplest case, your related objects have the same key, but different buckets. Lookups on this type of relationship are really efficient, you just change the bucket name to find the other item. How is this useful? Why not just store them together? One of the objects may get updated or read more often than the other. Their data types might be incompatible (a user profile and its avatar, for example). In either case, you get the benefit of the separation and fast access without needing link-walking or map-reduce; however, you really can only model one-to-one relationships with this pattern.
For one-to-many types of relationships, you might prefix or otherwise derive the key of the dependent (many) side of the relationship with the key of the parent side. This could be done as part of the bucket name, or as a simple prefix to the key. There are a couple of important tradeoffs to consider here. If you choose the bucket route, the number of buckets might proliferate in proportion to your data quantity. If you choose to prefix the key, it will be easy to find the parent object, but may be more difficult to find the dependent objects. The same reasons as having equivalent keys apply here — tight cohesion between the objects but different access patterns or internal structure.
De-normalization / Composition
A core principle in relational schema design is factoring your relations so that they achieve certain “normal forms”, especially in one-to-many sorts of relationships. This means that if your domain concept “has” any number of something else, you’ll make a separate table for that thing and insert a foreign key that points back to the owner. De-normalizing (or composing) your data often makes sense, both for the sake of performance and for ease of modeling.
How does this work? Let’s say your relational database had tables for people and for addresses. A person may have any number of addresses for home, work, mailing, etc, which are related back to the person by way of foreign key. In Riak, you would give your person objects an “addresses” attribute, in which you would store a list or hash of their addresses. Because the addresses are completely dependent on the person, they can be a part of the person object. If addresses are frequently accessed at the same time as the person, this also results in fewer requests to the database.
Composition of related data is not always the best answer, even when a clear dependency exists; take for instance, the Twitter model. Active users can quickly accrue thousands of tweets, which need to be aggregated in different combinations across followers’ timelines. Although the tweet concept is dependent on the user, it has more conceptual weight than the user does and needs to stand by itself. Furthermore, performance would suffer if you had to pull all of a user’s tweets every time you wanted to see their profile data.
Good candidates for composition are domain concepts that are very dependent on their “owner” concept and are limited in number. Again, knowing the shape of your data and the access pattern are essential to making this decision.
Links
Links are by far the most flexible (and popular) means for modeling relationships in Riak, and it’s obvious to see why. They hold the promise of giving a loose graph-like shape to your relatively flat data and can cleanly represent any cardinality of relationship. Furthermore, link-walking is a really attractive way to quickly do queries that don’t need the full power of map-reduce (although Riak uses map-reduce behind the scenes to traverse the links). To establish a relationship, you simply add a link on the object to the other object.
Intrinsically, links have no notion of cardinality; establishing that is entirely up to your application. The primary difference is whether changing an association replaces or adds/removes links from the associated objects. Your application will also have to do some accounting about which objects are related to other objects, and establish links accordingly. Since links are uni-directional, stored on the source, and incoming links are not automatically detected, your application will need to add the reciprocal links when traversals in both directions are needed (resulting in multiple PUT operations). In some cases, especially in one-to-many relationships where the “many” side is not accessed independently, you might not need to establish the reciprocal link. Knowing how your data will be accessed by the application — both reads and writes — will help you decide.
Links have a few other limitations that you will need to consider. First, although the tag part of the link can technically be any Erlang term, using anything other than a binary string may make it difficult for HTTP-based clients to deal with them. Second, since links are stored directly with the object in its metadata, objects that have many links will be slower to load, store, and perform map-reduce queries over. In the HTTP/REST interface as well, there are practical limitations simply because of the method of transport. At the time of writing, mochiweb — the library that is the foundation of webmachine, Riak’s HTTP interface — uses an 8K buffer for incoming requests and limits the request to 1000 header fields (including repeated headers). This means that each Link: header you provide needs to be less than 8K in length, and assuming you use the typical headers when storing, you can have at most about 995 individual Link: headers. By the time you reach the approximately 150,000 links that that provides, you’ll probably want to consider other options anyway.
Hybrid solutions
At this point, you might be wondering how your data is going to fit any of these individual models. Luckily, Riak is flexible, so you can combine them to achieve a schema that best fits your need. Here’s a few possibilities.
Often, either the number of links on an object grows large or the need to update them independently of the source object arises. In our Twitter example, updating who you follow is a significantly different operation from updating your user profile, so it makes sense to store those separately, even though they are technically a relationship between two users. You might have the user profile object and list of followed users as key-correspondent objects, such as users/seancribbs and following/seancribbs (not taking into account your followers, of course).
In relational databases you typically use the concept of a “join table” to establish many-to-many relationships. The intermediary table holds foreign keys back to the associated objects, and each row represents one individual association, essentially an “adjacency list”. As your domain becomes more complex and nuanced, you might find that these relationships represented by join tables become domain concepts in their own right, with their own attributes. In Riak, you might initially establish many-to-many relationships as links on both sides. Similarly to the “join table” issue, the relationship in the middle might deserve an object of its own. Some examples that might warrant this design: qualified relationships (think “friends” on Facebook, or permissions in an ACL scheme), soft deletion, and history (tracking changes).
Key correspondence, composition and linking aren’t exclusive ways to think of relationships between data in your application, but tools to establish the semantics your domain requires. I’ve said it many times already, but carefully evaluate the shape of your data, the semantics you want to impose on it, and the operational profile of your application when choosing how you structure your data in Riak.