Introduction to Riak

Riak is a distributed database designed to deliver maximum data availability by distributing data across multiple servers. Riak is an open-source, distributed key/value database for high availability, fault-tolerance, and near-linear scalability.

Riak Components

Riak is a Key/Value (KV) database, built from the ground up to safely distribute data across a cluster of physical servers, called nodes.

Riak functions similarly to a very large hash space. Depending on your background, you may call it hashtable, a map, a dictionary, or an object. But the idea is the same: you store a value with an immutable key, and retrieve it later.

1) Key and value

2) Buckets

Key and Value

Key/value is the most basic construct in all of computerdom.

Buckets

Buckets in Riak  provide logical namespaces so that identical keys in different buckets will not conflict.

Buckets are so useful in Riak that all keys must belong to a bucket. There is no global namespace. The true definition of a unique key in Riak is actually        bucket + key.

For convenience, we call a bucket/key + value pair as an object, sparing ourselves the verbosity of “X key in the Y bucket and its value”.

Replication and Partitions

Distributing data across several nodes is how Riak is able to remain highly available, tolerating out-ages and  partitioning. Riak combines two styles of distribution to achieve this: replication and partitions.

Replication

Replication is the act of duplicating data across multiple nodes. Riak replicates by default.

The obvious benefit  of replication is that if one node goes down, nodes that contain replicated data remain available to serve requests. In other words, the system remains available with no down time.

The downside with replication is that you are multiplying the amount of storage required for every duplicate. There is also some network overhead with this approach, since values must also be routed to all replicated nodes on write.

Partitions

A partition is how we divide a set of keys onto separate physical servers. Rather than duplicate values, we pick one server to exclusively host a range of keys, and the other servers to host remaining non-overlapping ranges.

With partitioning, our total capacity can increase without any big expensive hardware, just lots of cheap commodity servers. If we decided to partition our database into 1000 parts across 1000 nodes, we have (hypothetically) reduced the amount of work any particular server must do to 1/1000th.

There’s also another downside. Unlike replication, simple partitioning of data actually decreases uptime.

If one node goes down, that entire partition of data is unavailable. This is why Riak uses both replication and partitioning.

Replication+Partitions

Since partitions allow us to increase capacity, and replication improves availability, Riak combines them. We partition data across multiple nodes, as well as replicate that data into multiple nodes.

n_val

The Riak team suggests a minimum of 5 nodes for a Riak cluster, and replicating to 3 nodes (this setting is called n_val, for the number of nodes on which to replicate each object).

The Ring

Riak applies consistent hashing to map objects along the edge of a circle (the ring).

Riak partitions are not mapped alphabetically (as we used in the examples above), but instead a partition marks a range of key hashes (SHA-1 function applied to a key). The maximum hash value is 2160 , and divided into some number of partitions—64 partitions by default (the Riak config setting isring_creation_size).

The Ring is more than just a circular array of hash partitions. It’s also a system of metadata that gets copied to every node. Each node is aware of every other node in the cluster, which nodes own which vnodes, and other system data.

N/R/W Values

N

With our 5 node cluster, having an n_val=3 means values will eventually replicate to 3 nodes, as we’ve discussed above. This is the N value. You can set other values (R,W) to equal the n_val number with the shorthand all.

R

Reading involves similar tradeoffs. To ensure you have the most recent value, you can read from all 3 nodes containing objects (r=all). Even if only 1 of 3 nodes has the most recent value, we can compare all nodes against each other and choose the latest one, thus ensuring some consistency. Remember when I mentioned that RDBMS databases were write consistent? This is close to read consistency. Just like w=all,however, the read will fail unless 3 nodes are available to be read. Finally, if you only want to quickly read any value, r=1 has low latency, and is likely consistent if w=all.

W

But you may not wish to wait for all nodes to be written to before returning. You can choose to wait for all 3 to finish writing (w=3 or w=all), which means my values are more likely to be consistent. Or you could choose to wait for only 1 complete write (w=1), and allow the remaining 2 nodes to write asynchronously, which returns a response quicker but increases the odds of reading an inconsistent value in the short term. This is the W value

Since Riak is a KV database, the most basic commands are setting and getting values. We’ll use the HTTP interface, via curl, but we could just as easily use Erlang, Ruby, Java, or any other supported language. The basic structure of a Riak request is setting a value, reading it, and maybe eventually deleting it. The actions are related to HTTP methods (PUT, GET, POST, DELETE).


 

PUT

The simplest write command in Riak is putting a value. It requires a key, value, and a bucket. In curl, all HTTP methods are prefixed with -X. Putting the value pizza into the key favorite under the food bucket is done like this:

The -d flag denotes the next string will be the value. Declaring it as text with the proceeding line -H ‘Content-Type:text/plain’

This declines the HTTP MIME type of this value as plain text. We could have set any value at all, be it XML or JSON—even an image or a video. Riak does not care at all what data is uploaded, so long as the object size doesn’t get much larger than 4MB.

GET

The next command reads the value pizza under the bucket/key food/favorite.

This is the simplest form of read, responding with only the value. Riak contains much more information, which you can access if you read the entire response, including the HTTP header. In curl you can access a full response by way of the -i flag.

POST

Similar to PUT, POST will save a value. But with POST a key is optional. All it requires is a bucket name, and it will generate a key for you.

Let’s add a JSON value to represent a person under the people bucket. The response header is where a POST will return the key it generated for you.

You can extract this key from the Location value. Other than not being pretty, this key is treated the same as if you defined your own key via PUT.

Body

You may note that no body was returned with the response. For any kind of write, you can add the returnbody=true parameter to force a value to return, along with value-related headers like X-Riak-Vclock and ETag.

DELETE

The Final basic operation is deleting keys, which is similar to getting a value, but sending the DELETE method to the url/bucket/key.

A deleted object in Riak is internally marked as deleted, by writing a marker known as a tombstone. Unless configured otherwise, another process called a reaper will later finish deleting the marked objects.

read/write ratios.

Lists

Riak provides two kinds of lists. The first lists all buckets in your cluster, while the second lists all keys under a specific bucket. Both of these actions are called in the same way, and come in two varieties.

The following will give us all of our buckets as a JSON object.

And this will give us all of our keys under the food bucket.

Adjusting N/R/W to our needs

N is the number of total nodes that a value should be replicated to, defaulting to 3. But we can set this n_val to less than the total number of nodes.

Any bucket property, including n_val, can be set by sending a props value as a JSON object to the bucket URL. Let’s set the n_val to 5 nodes, meaning that objects written to cart will be replicated to 5 nodes.

Symbolic Values

A quorum is one more than half of all the total replicated nodes (floor(N/2) + 1). This figure is important, since if more than half of all nodes are written to, and more than half of all nodes are read from, then you will get the most recent value (under normal circumstances).

Hooks

Another utility of buckets are their ability to enforce behaviors on writes by way of hooks. You can attach functions to run either before, or after, a value is committed to a bucket.

Functions that run before a write is called precommit, and has the ability to cancel a write altogether if the incoming data is considered bad in some way. A simple precommit hook is to check if a value exists at all.

I put my custom Erlang code files under the riak installation ./custom/my_validators.erl.


 

Then compile the file.(You need to install erlang before installing Riak)

erlc my_validators.erl

Install the file by informing the Riak installation of your new code via app.config (restart Riak).

Then you need to do set the Erlang module (my_validators) and function (value_exists) as a JSON value to the bucket’s precommit array {“mod”:”my_validators”,”fun”:”value_exists”}.

If you try and post to the cart bucket without a value, you should expect a failure.

Siblings

Siblings occur when you have conflicting values, with no clear way for Riak to know which value is correct. Riak will try to resolve these conflicts itself if the allow_mult parameter is configured to false, but you can instead ask Riak to retain siblings to be resolved by the client if you set allow_mult to true.

Siblings arise in a couple cases.

We used the second scenario to manufacture a conflict in the previous chapter when we introduced the concept of vector clocks, and we’ll do so again here.

Resolving Conflicts

When we have conflicting writes, we want to resolve them. Since that problem is typically use-case specific, Riak defers it to us, and our application must decide how to proceed.

For our example, let’s merge the values into a single result set, taking the larger count if the item is the same. When done, write the new results back to Riak with the vclock of the multipart object, so Riak knows you’re resolving the conflict, and you’ll get back a new vector clock.

Successive reads will receive a single (merged) result.

Will share more on this arctile soon.