May 11, 2011
I’m on a plane to Goto Copenhagen from our electric post-Kill-Dash-Nine Board meeting in Washington, DC and, afterwards, an intense client meeting. I went to watch Pete Sheldon, our new Director of Sales, and Justin Sheehy at work. I finally had a chance to sit and study a proposal for the Riak product roadmap for the next year. This roadmap is both breathtakingly ambitious and oddly incremental, quotidian even.
In the next year we will solve problems endemic to distributed systems – groundbreaking work of the sort careers are surely made — and yet at the same time, these problems seem incremental and iterative; part of an ongoing process of small improvements. They seem both astounding and inevitable.
This led me to an interesting insight — doing this is not easy.
What we are doing is like digging a canal through bedrock. We are hacking away at hard problems — problems others encountered and, either died trying or, mopping their brows with their handkerchiefs, threw down their shovels and went shopping. A lot of cool companies are hacking away, too, so it is not like we are alone, but the honorable and the diligent are not what this post is about.
This post is about the ugly truth I have to call out.
To put it bluntly, if you are claiming the architectural challenges presented by applications with high write loads spread across multiple data centers are easy, you are lying. You do not, as Theo Schlossnagle remarked recently to us, “respect the problem.” You must respect the problem or you disrespect the necessary tradeoffs. And if you disrespect the tradeoffs, you disrespect your user. And if you disrespect your user, you are, inevitably, a liar. You say _this_ is easy. You promise free lunches. You guarantee things that turn out to be impossible. You lie.
What our technology generation is attempting is really hard. There is no easy button. You can’t play fast and loose with the laws of physics or hand-wave around critical durability issues. You can sell this stuff to your venture capitalist, but we’re not buying it.
Immutable laws are not marketing. And therefore, marketing can’t release you from the bonds of immutable laws. You can’t solve the intractable problems of distributed systems so eloquently summarized with three letters – C-A-P – by Google’s cloud architect (and Riak Board member) Dr. Eric Brewer (a man both lauded and whose full impact on our world has not yet been reckoned), with specious claims about downloads and full consistency.
To wit:
- Memory is not storage.
- Trading the RDBMS world for uptime is hard. There are no half-steps. No transitional phases.
- The geometry of a spinning disk matters for your app. You can’t escape this.
- Your hardware RAID controller is not perfect. It screws things up and needs to be debugged.
- Replication between two data centers is hard, let alone replication between three or 15 data centers.
- Easily adding nodes to a cluster under load impacts performance for a period determined by the amount of data stored on the existing nodes and the load on the system…and the kind of servers you are using…and a dozen other things. It looks easy in the beginning.
These are all sensible limitations. Like the speed of light or the poor quality of network television, these are universal constants. The point is: tradeoffs can’t be solved by marketing.
To be sure, there are faster databases than Riak. But do they ship with default settings optimized for speed or optimized for safety? We *ache* to be faster. We push ourselves to be faster. We tune and optimize and push. But we will never cross the line to lose data. While it is always tempting to set our defaults to *fast* instead of *safe*, we won’t do it. We will sacrifice speed to protect your data. In fact, if you prefer speed to preserving data, *don’t use Riak*. We tell the truth even if it means losing users. We will not lie.
Which is why others who do it make me ball my fists, score my palms, and look for a heavy bag to punch. Lying about what you can do – and spreading lies about other approaches – is a blatant attempt to replace the sacrifice of hard-core engineering and ops with fear, uncertainty, and doubt – FUD.
People who claim they are “winning NoSQL” with FUD are damaging our collective chance to effect a long-overdue change to the way data is stored and distributed. This opportunity is nothing short of a quantum shift in the the quality of your life if you are in development, operations, or are a founder who lives and dies by the minute-to-minute performance of your brainchild/application.
The FUD-spreaders are destroying this opportunity with their lies. They are polluting the well by focusing on false marketing – on being the loud idiot drunk – instead of solving the problem. They can screw this up with their failure. It is time for us to demand they drop the FUD – drop the “F” bomb – and stop lying about what they can do. Just tell the truth, like Riak does — admit this is a hard problem and that hardcore engineering is the answer. In fact, they should do the honorable thing and quit the field if they are not ready to invest in the work needed to solve this problem.
If we, collectively, the developer and sysadmin community running the infrastructure of the world economy, allow people to replace engineering with marketing lies, to trade coffee mugs for countless hours of debugging, and in doing so, to destroy the reputation of a new class of data storage systems before they have had a chance to gain a foothold in the technology world, we all lose.
There are many reasons why the FUD spreaders persist.
There are the smart folks who throw our hands up and cynically say that liars are by their nature better marketers. But marketing need not be lies, cynically accepted.
Then there are some of us who are too busy keeping projects or businesses afloat to really dig into the facts. But we sense that we are being lied to, and so we detach, saying this is all FUD. This can’t help us. Tragically, we miss the opportunity to make a big change
Most of us simply want to trust other developers and will believe claims that seem too good to be true. If we do this, we are in a small but serious way saying that our hard-won operational wisdom is meaningless, that anyone who has deployed a production application or contributed to an open-source project has no standing to challenge the loud-mouth making claims that up-time is easy.
Up-time is not easy. Sleeping through the night without something failing is a blessing. Do not – *do not* – let VCs and marketers mess up our opportunity to take weekends off and sleep through the night when we are on call. The database technologies of 1980 (and their modern apologists in NoSQL) should not shape the lives of technologists in 2011.
In Conclusion…
In the briefest terms, Riak won’t betray this revolution because we keep learning big lessons from our small mistakes. We are our harshest critics.
We will deliver a series of releases that allow you to tune for the entire spectrum of CAP tradeoffs – strong consistency to strong partition tolerance – while making clear the tradeoffs and costs. At the same time Riak will provide plugins for Memcache, secondary indices, and also a significant departure from existing concepts of MapReduce that allows for flexible, simple, yet massively distributed computation, and much more user-friendly error reporting and logging. (Everyone reading this understands why that last item merits inclusion on any list of great aspirations – usability can damn or drive a project.)
We will deliver these major innovations, improvements, and enhancements, and they will be hard for us and our community to build. And it will take time for us to explain it to people. And you will find bugs. And it will work better a year after it ships than it does on day one.
But we will never lie to you.
We call on others to please drop the FUD, to acknowledge the truth about the challenges we all face running today’s infrastructure, and to join us in changing the world for the better.