Riak’s Engineering team is proud to announce that version 1.6.0 of the Spark-Riak connector has been released. Over the past several months we have added new features, upgraded existing features, and fixed bugs to enable our customers to take full advantage of the combined power of Riak and Apache Spark.
We’ve listened to our Community and included several key features:
- Python support for Riak KV buckets
- Support for Spark Streaming
- Failover support for when a Riak node is unavailable during a Spark job
We also added performance and testing enhancements, upgraded the example applications, and added documentation. A full list of changes can be found in the CHANGELOG. All new features will work with TS tables and KV buckets in Riak TS 1.3.1+, and support for Riak KV will come with the release of Riak KV 2.3. [Updated October 19, 2016]
Let's take a peek at one of the more exciting features of this release: Python support for Riak KV buckets. For the purposes of this quick demonstration, we will assume there is a Riak TS 1.4 node at 127.0.0.1:8087 and that the code will be run in a Jupyter notebook.
Before we can use Spark, we must set up a Spark context using the Spark-Riak connector:
```python
import os

import findspark
findspark.init()

import pyspark
import pyspark_riak

# Pull in the Spark-Riak connector package when the PySpark shell is launched
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages com.basho.riak:spark-riak-connector:1.6.0 pyspark-shell"

conf = pyspark.SparkConf().setAppName("My Spark Riak App")
conf.set("spark.riak.connection.host", "127.0.0.1:8087")

sc = pyspark.SparkContext(conf=conf)

# Patch the Spark context with the connector's Riak-specific methods
pyspark_riak.riak_context(sc)
```
Now that the Spark context has been properly initialized for use with Riak, let’s write some data to a KV bucket named ‘kv_sample_bucket’:
```python
# Build an RDD of {key: value} dicts and write it to Riak in parallel
sample_data = [
    {'key0': {'data': 0}},
    {'key1': {'data': 1}},
    {'key2': {'data': 2}}
]
kv_write_rdd = sc.parallelize(sample_data)
kv_write_rdd.saveToRiak('kv_sample_bucket')
```
Now that Spark has written our data in parallel to a KV bucket, let’s pull that data out with a full bucket read:
```python
# Read the entire bucket back into an RDD
kv_read_rdd = sc.riakBucket('kv_sample_bucket').queryAll()
print(kv_read_rdd.collect())
# prints [{'key0': {'data': 0}}, {'key1': {'data': 1}}, {'key2': {'data': 2}}]
```
There are several other ways to read data from a KV bucket, including simple key queries, 2i tag queries, and 2i range queries. Additionally, with 2i range queries, custom partitioning can be used to parallelize reads and increase performance. More information on Python support for the Spark-Riak connector can be found in the docs.
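As a rough sketch of what those query styles look like (method names follow the connector's Python documentation; the index name `data_index` and the key ranges below are hypothetical, so check the docs for exact signatures):

```python
# Assumes a numeric 2i index named 'data_index' exists on the objects
# stored in 'kv_sample_bucket'.

# Simple key query: fetch specific keys
keys_rdd = sc.riakBucket('kv_sample_bucket').queryBucketKeys('key0', 'key1')

# 2i range query: fetch all objects whose index value falls in [0, 2]
range_rdd = sc.riakBucket('kv_sample_bucket').query2iRange('data_index', 0, 2)

# 2i range query with custom partitioning: each range becomes its own
# Spark partition, so the reads run in parallel
partitioned_rdd = sc.riakBucket('kv_sample_bucket').partitionBy2iRanges(
    'data_index', (0, 1), (2, 3)
)

print(range_rdd.collect())
```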
To get started with the new version of the Spark-Riak connector, we encourage you to visit the GitHub repository and start playing around with all the new features.
Korrigan Clark
@kralCnagirroK