September 12, 2013
Superfeedr provides a real-time API to any application that wants to produce (publishers) or consume (subscribers) data feeds without wasting resources or maintaining an expensive and changing infrastructure. It fetches and parses RSS or Atom feeds on behalf of its users and new entries are then pushed to subscribing applications using a webhook mechanism (PubSubHubbub) or XMPP. The Google Reader replacement is an example of a popular API built by Superfeedr that has backed up much of Google Reader.
Riak is used by Superfeedr to store the content from all feeds so users can retrieve past content (including the Google Reader API replacement), even if the feeds themselves may not include these entries anymore. This Riak datastore is referred to as “the cave.”
When Superfeedr first built “the cave” datastore, they opted for a cluster of large Redis instances (five servers with 8GB of memory each) due to its inherent speed. However, they realized that a more durable system was needed and the need to manually shard feeds across the cluster made it difficult to scale beyond storing a couple entries per feed. The scaling problem turned into an even larger issue because the average size of a stored entry was 2KB. Now, they had nearly 1,000 items per feed and 50 million feeds, translating to over 93TB of data and quickly growing.
They chose to move “the cave” to Riak due to its focus on availability (as delivering stale data was more important than delivering no data) and ease-of-scale. According to Superfeedr Founder, Julien Genestoux, “Riak solves the scalability problem elegantly. Through consistent hashing, our data is automatically distributed across the cluster and we can easily add nodes as needed.” While Riak does have a lower read performance than Redis, this proved to not be a problem as they found it easy to put caches in front of Riak if they needed to serve content faster.
Though Superfeedr found it easy to set up their Riak cluster, the default behavior for handling conflicts had to be adjusted for their use case. By working with Riak and the Riak community, they were able to find the right settings and optimize their conflict resolution algorithm. For more information on Riak’s configurable behaviors, check out our four-part blog series.
Superfeedr went into production in two phases: they started storing production data in the beginning of 2013 and began serving that data about two months later. During this period, Superfeedr was able to design their cluster infrastructure and thoroughly performance test it with actual production data.
Two types of objects are stored in Superfeedr’s Riak datastore: feeds and entries. Feeds are stored as a collection of internal feed ids, which correspond to the entries and include some meta-information, such as the title. Entries correspond to feed entries and are indexed by feedID-entryID, allowing them to store multiple entries for each feed. This indexing scheme allows entries to be retrieved, even if they lose track of the feed element, through a MapReduce job.
At write time, Superfeedr writes both the feed element and the entry element. When they query for a feed, they issue a MapReduce job to read both the feed element and the desired number of entry items. They also use a pagination mechanism to limit the resources consumed for each request, with an arbitrary limit of 50 entries.
Today, Superfeedr has served over 23 billion entries, with nearly one million more being published every hour. Their six-node Riak cluster (built on 16GB Linode slices) has allowed them to horizontally scale their cluster as their content and user base grows. “Riak is the right tool for us due to its scalability and always on availability,” said Genestoux. “We have refined it to fit our needs and can rest-assured that no data will ever be lost in our Riak ‘cave.’”
If you’re looking for a Google Reader replacement or interested in learning more about Superfeedr, check out their site: superfeedr.com/. For other examples of Riak in production, visit: riak.com/riak-users/