Tuesday, July 19, 2011

Messing Around with NoSQL - Cassandra

Brief About NoSQL: -

There is no doubt that the invention of NoSQL was a breakthrough in data storage models.

It not only provided a shift away from the typical mindset of the relational model but also provided a way to store unstructured data in a persistent storage area, and that too in an efficient, queryable format.

If we dig into the history of data storage we will definitely find many models that were better suited than the RDBMS for certain solutions and worked very well.
XML databases and object databases were among the first to adopt the concepts behind NoSQL, providing a different structure for storing complex and semi-structured data.

It definitely was a breakthrough, and it involved a shift away from a mindset that was used to imagining everything in the form of rows and columns.

A Replacement for RDBMS?: -

At the same time, many misconceptions arose and people tried to implement NoSQL DB's for all kinds of solutions, which was itself the biggest mistake made by developers.

Relational models have their own world, and NoSQL was never meant to be a competitor to that.

The idea of NoSQL was simply to provide a solution that addresses the disadvantages of relational models and helps store unstructured or semi-structured data that is difficult to fit into a traditional relational model.

Going further in this article we will take up a popular NoSQL DB, Cassandra, and analyze how it solves various issues of the relational model.


Introducing Cassandra: -

Cassandra is one such NoSQL data store, invented by the Facebook folks for their social networking website and then open sourced to Apache under the Apache License.

Cassandra is a peer-to-peer, columnar and massively scalable data store.

Data Storage in Cassandra: -

Cassandra introduced the concept of storing data in an associative array which is sorted at the time of storage itself.

Associative Array - "An associative array is an abstract data type consisting of unique keys and a collection of values, where each key is associated with one or more values."
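As a rough sketch (plain Python dicts standing in for the storage model, not a real Cassandra API), an associative array is just a dict, and Cassandra layers them: a column family maps a row key to an associative array of columns, kept sorted by column name. The sample data mirrors the Vehicles example shown later in this article.

```python
# Illustrative model only: nested dicts standing in for Cassandra's
# storage layout, not a client API.

# A plain associative array: each unique key maps to a value.
capacity = {"Camry": "4", "Honda": "4"}

# Cassandra nests the idea: a ColumnFamily maps a row key to an
# associative array of columns, and columns are sorted by name.
vehicles_cf = {
    "Camry": {"Colour": "Black", "Capacity": "4"},
    "Honda": {"Colour": "Black", "Capacity": "4"},
}

# Columns inside a row come back ordered by column name.
columns = sorted(vehicles_cf["Camry"].items())
print(columns)  # [('Capacity', '4'), ('Colour', 'Black')]
```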

For better and easier understanding, here is how the terms used in Cassandra map to the RDBMS world: -

Keyspaces - same as Schema in Oracle
ColumnFamilies - Same as different Tables in a Schema
Rows - Same as Rows of a Table
Columns - Same as Columns referenced by different Rows


Here is how the keyspaces and Column Families are organized in Cassandra

Keyspaces have ColumnFamilies - Usually 1 KS per application
ColumnFamilies have Rows - Dozens of CFs per KS
Rows contain Columns - Many per CF
Columns - Contain name:value:timestamp - Many per Row

Now we will look at a typical JSON-style structure and see how the data is actually organized in Cassandra: -

Vehicles = {            // this is a ColumnFamily
  Camry: {              // this is the key to this Row inside the CF
    Capacity: "4",      // all these are columns and their values
    Colour: "Black",
    Extra-Tyres: "1",
    Push-Back-Seats: "Y",
    Power-Steering: "Y"
  },
  Honda: {              // this is the key to this Row inside the CF
    Capacity: "4",
    Colour: "Black",
    Extra-Tyres: "1",
    Push-Back-Seats: "Y",
    Power-Steering: "Y"
  }
}


There is also a concept called a "super column", but for the sake of simplicity I will skip it for the time being and move ahead with Cassandra internals and architecture.

For more details and explanations please refer to this article, which beautifully explains the way data is organized in Cassandra.

Cassandra Architecture: -

Over a period of time Cassandra has addressed various architectural concerns very well. These not only meet the requirements of a distributed environment but also work very well in a single-site deployment.

Below are a few of the architectural concerns which are very well handled in Cassandra: -

High Availability: -

All nodes in Cassandra are independent and *can* serve user requests independently (going forward we will see how this is achieved).
We developed a strategy where each Cassandra node was capable of responding to user requests irrespective of the state of the other nodes in the cluster.

Using low data consistency levels (we will read about these later) gives us a highly available environment where even a single node is self-sufficient to serve user requests.

CAP (Brewer's theorem): -

Consistency, Availability, Partition tolerance - As per the CAP theorem, no distributed system can provide all three attributes at the same time.
Every system can guarantee at most two of them.

Though there are many RDBMS which provide a highly consistent and partition-tolerant environment at the cost of availability, Cassandra at its best provides a highly available environment (i.e. "A") and partitioning of data (i.e. "P") in a cluster. At the same time it also provides a flavor of consistency by implementing the concept of "eventual consistency".

Eventual Consistency - Cassandra provides an excellent mechanism for replication and conflict resolution, which syncs the other nodes of a cluster asynchronously, providing an eventually consistent environment.

The biggest benefit of an eventually consistent environment is that the user response does not depend on the syncing of the other nodes in a cluster. The user gets an immediate response as soon as the data is written to at least one node (this is controllable, as we will see later in this article) and the rest of the nodes are synced asynchronously.

So, in other words, Cassandra provides an eventually consistent environment where data is replicated and eventually all nodes hold the same copy of the data.

Conflict resolution: -

Cassandra provides a timestamp-based conflict resolution strategy.

Every column is associated with a timestamp, and in a situation where different data has been written on different nodes for the same key, the data with the higher timestamp wins.

Though this is not the optimal algorithm for resolving conflicts, and there are surely better ways, like vector clocks, which provide an efficient way to manage conflicts, the Cassandra code does have the flexibility to support a custom-built conflict resolution mechanism.
In many such instances, especially in a distributed environment where we need to control concurrent reads/writes, Cassandra users can use libraries like ZooKeeper or Cages, which provide an out-of-the-box solution and an elegant way to synchronize distributed transactions.
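The timestamp rule above boils down to "last write wins". Here is a minimal sketch of that idea (my own illustration, not Cassandra's actual code): given the versions of a column seen on different replicas, the version with the highest timestamp is the one returned.

```python
# Last-write-wins conflict resolution, as a toy function.
def resolve(versions):
    """versions: list of (value, timestamp) pairs from replicas."""
    # The version with the highest timestamp wins the conflict.
    return max(versions, key=lambda v: v[1])

# Three replicas disagree about the same column for the same row key:
replica_versions = [("Black", 1000), ("Red", 1005), ("Black", 990)]
winner = resolve(replica_versions)
print(winner)  # ('Red', 1005) -- the most recent write wins
```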

Eventual Consistency: -

RDBMS use strong consistency (as they work on ACID properties), where data is not available to read until it has been replicated to all nodes in a cluster, and this is done at the cost of performance and availability. As mentioned earlier, Cassandra provides an eventually consistent environment where data across the nodes/clusters/data centers will become consistent over time, which in turn gives us a highly available environment and a definite performance boost.

There are 3 different mechanisms by which data is made consistent across the clusters/data centers: -

AntiEntropy - Compares all the replicas and updates each replica to the newest version.
Read Repair - Query results are compared against all the replicas of the key, and the most recent version is pushed to out-of-date replicas.
HintedHandOff - In case the replica node for the key is down during a write, Cassandra writes a hint to a live node indicating that the write needs to be replayed to the unavailable node once it is back.
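To make the read-repair mechanism concrete, here is a hedged sketch (an illustration of the idea, not Cassandra's implementation): compare the (value, timestamp) pairs the replicas return for a key, pick the newest, and note which replicas are stale and need that version pushed back to them.

```python
# Toy read repair: find the newest version and the stale replicas.
def read_repair(replica_data):
    """replica_data: dict mapping replica name -> (value, timestamp)."""
    newest = max(replica_data.values(), key=lambda v: v[1])
    # Any replica holding an older timestamp is out of date.
    stale = [r for r, v in replica_data.items() if v[1] < newest[1]]
    return newest, stale

replicas = {
    "node1": ("Red", 1005),
    "node2": ("Black", 990),   # this replica missed the latest write
    "node3": ("Red", 1005),
}
value, to_repair = read_repair(replicas)
print(value, to_repair)  # ('Red', 1005) ['node2']
```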

Cassandra does provide the flexibility to define various consistency levels, so that users can be sure a read or write is marked as successful only after the defined consistency level is met across the clusters/data centers, and the rest of the nodes can be synced asynchronously.

Write Consistency levels: -

ZERO - Ensures nothing. The write happens asynchronously in the background (until CASSANDRA-685 is fixed; if too many of these queue up, buffers will explode and bad things will happen).
ANY - Ensure that the write has been written to at least 1 node, including HintedHandoff recipients.
ONE - Ensure that the write has been written to at least 1 replica's commit log and memory table before responding to the client.
QUORUM - Ensure that the write has been written to N (ReplicationFactor) / 2 + 1 replicas before responding to the client.
LOCAL_QUORUM - Ensure that the write has been written to N / 2 + 1 nodes within the local datacenter (requires NetworkTopologyStrategy).
EACH_QUORUM - Ensure that the write has been written to N / 2 + 1 nodes in each datacenter (requires NetworkTopologyStrategy).
ALL - Ensure that the write is written to all N replicas before responding to the client. Any unresponsive replicas will fail the operation.
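The quorum arithmetic from the lists above is worth spelling out: with replication factor N, QUORUM is N / 2 + 1 (integer division), and writing and reading at QUORUM guarantees the write set and read set overlap in at least one replica, because W + R > N. A quick sketch:

```python
# Quorum size for replication factor n, per the definition above.
def quorum(n):
    return n // 2 + 1

n = 3
w = quorum(n)          # write quorum: 2 of 3 replicas must ack
r = quorum(n)          # read quorum: 2 of 3 replicas must reply
print(w, r, w + r > n) # 2 2 True -- the two sets must overlap

# With N = 5, three of five replicas must respond:
print(quorum(5))       # 3
```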

Read Consistency levels: -

ZERO - Not supported, because it doesn't make sense.
ANY - Not supported. You probably want ONE instead.
ONE - Will return the record returned by the first replica to respond. 
A consistency check is always done in a background thread to fix any consistency issues when ConsistencyLevel.ONE is used. 
This means subsequent calls will have correct data even if the initial read gets an older value. (This is called ReadRepair)
QUORUM - Will query all replicas and return the record with the most recent timestamp once it has at least a majority of replicas (N / 2 + 1) reported. Again, the remaining replicas will be checked in the background.
LOCAL_QUORUM - Returns the record with the most recent timestamp once a majority of replicas within the local datacenter have replied.
EACH_QUORUM - Returns the record with the most recent timestamp once a majority of replicas within each datacenter have replied.
ALL - Will query all replicas and return the record with the most recent timestamp once all replicas have replied. Any unresponsive replicas will fail the operation.

Users should keep in mind that the higher the consistency level, the lower the availability and performance, so we need to be very careful while defining the consistency levels for any application.

It is also nice to see that Cassandra lets clients define the consistency level, so the server code accepts the consistency level provided by the client application.

Failure detection, Replication, Scalability, Partitioning : -

Cassandra works very well in a distributed environment where reads and writes happen across locations. It natively enables replication across multiple datacenters.

One of the best strategies Cassandra lays out is the RackAwareStrategy (NetworkTopologyStrategy), where the user defines the layout of a cluster, which can consist of various nodes arranged in the same racks and the same or different data centers.

It works on a ring topology, and each Cassandra node knows about the other nodes in the ring.

Internally it uses Gossip to track the live nodes across the clusters, which may or may not be distributed over different geographical locations.
It is capable of re-establishing synchronization between two or more data stores, and once failed nodes come back online it automatically replicates the data to them, so that they can again become part of the nodes serving user requests.

The only difference between adding new nodes and bringing back failed nodes is the initial token assignment for new nodes, which defines their placement in the ring.

The token assignment can be done manually, or we can have the system assign tokens automatically to new nodes, which gives us a highly scalable system in which we can add/remove nodes without any hiccups.
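A minimal sketch of what manual token assignment looks like, assuming the RandomPartitioner's token space of 0 to 2**127 (covered under Partitioning below): spacing the initial tokens evenly around the ring keeps the key ranges, and hence the load, balanced across nodes.

```python
# Token space of the RandomPartitioner's ring (0 .. 2**127).
RING_SIZE = 2 ** 127

def initial_tokens(num_nodes):
    """Evenly spaced initial tokens for a cluster of num_nodes nodes."""
    return [i * RING_SIZE // num_nodes for i in range(num_nodes)]

# For a 4-node cluster, each node owns a quarter of the ring:
for i, token in enumerate(initial_tokens(4)):
    print(f"node{i}: {token}")
```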

Linear Scaling - Cassandra runs on commodity servers; there is no need for specialized hardware, so we have the advantage of deploying any number of Cassandra nodes in a single cluster.


The replication mechanism is configurable and can be defined per keyspace and per data center. Users have the flexibility to define the number of replicas to be present in a particular data center.

Now, which nodes in a data center will contain the replicas depends upon how the data is partitioned...

Partitioning - As of now Cassandra supports 2 different partitioning schemes
1. RandomPartitioner(RP)
2. OrderPreservingPartitioner (OP)

In both schemes Cassandra assigns each node a range of keys; how evenly the individual keys and their corresponding rows end up distributed over the nodes in the cluster differs, as we will see.

Every time a new node is added to the cluster, Cassandra assigns it a range of keys such that it takes responsibility for half the keys stored on the node that currently stores the most keys (more on overriding this default behavior later).

OP distributes the raw keys from the different column families over the cluster nodes. If the distributions of keys used by individual column families differ, their sets of keys will not fall evenly across the ranges assigned to nodes, which leads to hot spots and an unbalanced cluster where one node may be serving far more hot keys than the others.

RP works on a consistent hashing mechanism, using the MD5 hash of a key as the real key for locating the key and its data in the cluster. The distribution of the keys occurring within an individual column family doesn't matter.

It randomly maps any input key to a point in the 0...2**127 range. The result is that the keys from each individual column family are spread evenly across the ranges/nodes, meaning that data and access corresponding to those column families are evenly distributed across the cluster.
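A hedged sketch of the idea (an illustration of consistent hashing, not Cassandra's code): the MD5 hash of the row key becomes its token, and the key lives on the node owning the ring range that token falls into, i.e. the first node whose token is greater than or equal to it, wrapping around the ring.

```python
import bisect
import hashlib

def token_for(key):
    """Map a row key to a point in the 0 .. 2**127 token space via MD5."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 127)

def node_for(key, ring):
    """ring: sorted list of (token, node) pairs describing the cluster."""
    tokens = [t for t, _ in ring]
    # First node whose token >= the key's token, wrapping past the end.
    idx = bisect.bisect_left(tokens, token_for(key)) % len(ring)
    return ring[idx][1]

# A hypothetical 3-node ring with evenly spaced tokens:
ring = [(0, "node0"), (2 ** 125, "node1"), (2 ** 126, "node2")]
print(node_for("Camry", ring))
```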

The only drawback of RP is that range scans over the keys are not possible, but thanks to Lucandra/Solandra, there is a pretty good solution for this.

Users do have the flexibility to redefine their replication strategy in the future without losing any data, but in the case of partitioners it is very important to decide upon the right partitioning strategy up front, because the partitioner cannot be changed without erasing all the data from the nodes.

Performance: -

Cassandra is a highly performant system for both reads and writes.

That said, you will see that Cassandra outperforms on writes compared to reads, partly because read repairs are performed during reads.

For reads, too, it provides various levels of optimization which can be fine-tuned based on the usage of the various column families.

Based on data usage patterns there are various parameters, like KeyCache/RowCache, which need to be fine-tuned for each column family.

Cassandra itself provides different ways to improve performance, for example: -
1. Before any disk read it consults Bloom filters for all look-ups.
2. All writes go to the commit log and a memtable; memtables are flushed periodically (or based on size) to SSTables.
3. Compaction merges large files, rebuilds indexes, and deletes tombstones at regular intervals.
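To illustrate point 1, here is a toy Bloom filter (my own sketch, far simpler than Cassandra's): before touching disk, you ask the filter whether an SSTable might contain the key. A "no" is definite and skips the disk read entirely; a "yes" may be a false positive, so only "maybe" answers cost I/O.

```python
import hashlib

class BloomFilter:
    """A deliberately simple Bloom filter for illustration."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions per key by salting the hash.
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # All bits set -> "maybe present"; any bit clear -> definitely absent.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("Camry")
print(bf.might_contain("Camry"))  # True -- always true for added keys
print(bf.might_contain("Edsel"))  # False, barring a rare false positive
```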

There are many other parameters which are exposed to the users and can be tuned based on the analysis done on the data usage patterns.

Another aspect, the size of the data, doesn't matter much, as Cassandra has been proven to handle >1TB of data without any issues.
Though identifying data usage patterns, defining hot column families, and providing the right configuration for the various column families (key cache, row cache, memtable throughput, reader/writer threads etc.) should be done in order to handle large volumes of data efficiently.

Avoiding Data Loss - Data Backup's: -

One of the important drawbacks of NoSQL DB's is data backup policies and strategies, which at present are missing or ignored by most NoSQL providers.
Though in Cassandra we have ways like JSON backup and binary backup, these are still in very early stages and do not include the concept of incremental backups.

But we do have a reasonable workaround to minimize the risk of data loss: a higher replication factor across the clusters/data centers.

There is no doubt that in future, NoSQL providers (including Cassandra) need to provide data backup strategies on par with relational databases.

Where to Use/ Not to Use: -

In contrast to other NoSQL DB's, Cassandra is meant to provide a highly available cluster which, if configured carefully, can provide five-nines (99.999%) availability.
This is one of the major features of Cassandra and should be considered whenever we want to implement a highly available system.

Here are a few general questions which need to be answered before we even think of replacing an RDBMS with any NoSQL DB: -

1. Do we need a transactional system?
2. Do we need highly consistent data?
3. Do we have a distributed environment in which writes/reads would be happening from all locations?
4. Do we need data replication?
5. What is the size of the data?
6. What kind of data is it? Structured, semi-structured?
7. Is the data relational in nature, and do we need to maintain the relations along with the data?
8. Does cost matter?
9. Do we need unlimited scalability on commodity hardware?
10. Do we need some sort of analytics?

Answering all the above questions, and comparing the answers with the features provided by a NoSQL DB, will definitely help us identify whether we need an RDBMS or a NoSQL database.

I am sure folks would now agree that NoSQL definitely solves many issues of RDBMS, but it is not a replacement for them. Looking at the feature set provided by Cassandra, it would be the right choice for a distributed environment where high availability is one of the major deciding criteria.


Appendix: -

Clients for Cassandra

Cassandra Architecture

Cassandra Wiki

NoSQL Glossary

Download Cassandra

Comparing Cassandra and other NoSQL DB's

Cassandra - A Decentralized Structured Storage System
