February 10, 2014
Kiyoto Tamura is Developer Evangelist at Treasure Data, the provider of a cloud-based, managed service for data. When he is not busy spreading the word about Fluentd and Treasure Data, he updates the documentation and answers questions about Fluentd on the mailing list and Twitter.
Today, I want to introduce you to Fluentd, the open source data stream processor, and how it can be used to store a variety of data in your Riak cluster in near real-time.
What’s Fluentd?
Fluentd was originally authored by Treasure Data to help acquire log data for their customers. Its key features are:
- Log everything in JSON: Fluentd subscribes to the philosophy that logs should be for both machines and humans, so input plugins transform all incoming data into well-structured JSON.
- Tag-based event routing: In Fluentd, each piece of data (called an event) has a tag that tells Fluentd what to do with it. This tag-based event routing makes data stream transfer, filtering, and aggregation more modular and manageable.
- Reliability Matters: Fluentd takes reliability and data integrity seriously, supporting file-based buffering as well as a high availability setup via redundancy.
- Keep the Core Small, Augmented by Plugins: Fluentd’s core program is small and very lightweight. It handles basic inputs and outputs like tailing a file and HTTP/TCP out of the box, but its full power comes from 200+ community-contributed plugins that let you integrate Fluentd with a variety of systems, including Riak!
Collecting Apache Logs into Riak with Fluentd
To show Fluentd’s flexibility and ease of use, I will demonstrate how to collect Apache web server logs into Riak using Fluentd.
We assume that…
- You are on OS X or Linux
- Riak is already installed and running
- You have an Apache web server log
Installing Fluentd and the Riak Output Plugin
If you already have gem (the RubyGems command-line tool) installed, you can install Fluentd with the following command:
gem install fluentd
Treasure Data also packages Fluentd as td-agent for the deb and rpm packaging systems.
We also need to install the Riak plugin so that we can output data to a Riak node. This is as easy as running:
gem install fluent-plugin-riak
(If you are using td-agent, you can run /usr/lib/fluent/ruby/bin/fluent-gem install fluent-plugin-riak to install the same plugin.)
Configuring Fluentd
Create a configuration file called fluent.conf and copy and paste the following lines into it:
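A minimal sketch of such a configuration, assuming only the settings described in this section. The log path, tag, match pattern, and node address come from the text. The output type name riak, and the use of /var/log/fluentd/apache2.access_log.pos as the tail plugin's pos_file, are assumptions on my part. Newer Fluentd releases spell the type parameter as @type.

```shell
# Write a minimal fluent.conf into the current directory.
# The quoted 'EOF' keeps the shell from expanding anything inside the heredoc.
cat > fluent.conf <<'EOF'
# Tail the Apache access log and tag each parsed line as riak.apache.
<source>
  type tail
  format apache2
  path /var/log/apache2/access_log
  # Assumption: this is where the tail plugin records its read position.
  pos_file /var/log/fluentd/apache2.access_log.pos
  tag riak.apache
</source>

# Send every event matched by riak.** to Riak.
<match riak.**>
  # Assumption: fluent-plugin-riak registers its output under the name "riak".
  type riak
  nodes localhost:8087
</match>
EOF
```

Running this in the directory where you plan to start Fluentd leaves a fluent.conf ready for the launch step.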
The <source>...</source> section tells Fluentd to tail an Apache2-formatted log file located at /var/log/apache2/access_log. Each line is parsed as an Apache access log event and given the tag riak.apache.
The <match riak.**>...</match> section matches all events whose tags start with "riak." and sends them to a Riak node located at localhost:8087. If you specify multiple nodes, as in nodes host1 host2 host3, Fluentd will try host2 if host1 is unavailable, host3 if host2 is also unavailable, and so forth.
Finally, launch Fluentd with the following command:
$ fluentd -c fluent.conf
Make sure you have the correct file access permissions to read the Apache log file and to write to /var/log/fluentd/apache2.access_log.pos (running Fluentd with sudo might help).
Now, you should start seeing data flow into your Riak cluster. We can check that by hitting Riak’s HTTP API:
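A rough sketch of such a check, assuming Riak's HTTP interface listens on its default port 8098 (port 8087 in the Fluentd configuration is the Protocol Buffers port). The bucket name fluentlog is an assumption on my part, so list the buckets first and set BUCKET to whatever shows up there.

```shell
# Base URL of Riak's HTTP interface (8098 is the default HTTP port).
RIAK_HTTP="http://localhost:8098"
# Assumption: the bucket the Riak output plugin writes to; adjust as needed.
BUCKET="fluentlog"

# List the buckets that currently hold data.
curl -s --max-time 5 "$RIAK_HTTP/buckets?buckets=true" \
  || echo "Riak is not reachable at $RIAK_HTTP"

# List the keys stored in the chosen bucket.
curl -s --max-time 5 "$RIAK_HTTP/buckets/$BUCKET/keys?keys=true" \
  || echo "Could not list keys in bucket $BUCKET"
```

Once you have a key, fetching $RIAK_HTTP/buckets/$BUCKET/keys/<key> returns the stored record. Listing keys walks the whole bucket, so it is fine for a quick check on a test node but best avoided on a production cluster.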
There it is! (The response JSON is formatted for readability)
What’s Next?
You can learn more about Fluentd on our website and documentation page. For Riak CS enthusiasts, our S3 output plugin is worth checking out.
If you have any questions, feel free to ask on our active mailing list or Twitter.
Happy logging!