12 Mar Cassandra 101 : Understanding what is Cassandra
As some of you may know, in my current role at Pythian I am tackling OSDB, and Cassandra is currently on my radar. One of the things I have been trying to do is learn what Cassandra is, so in this series I am going to share some of what I have picked up so far.
According to the whitepaper “Solving Big Data Challenges for Enterprise Application Performance Management”, Cassandra is a “distributed key value store developed at Facebook. It was designed to handle very large amounts of data spread out across many commodity servers while providing a highly available service without single point of failure allowing replication even across multiple data centers as well as for choosing between synchronous or asynchronous replication for each update.”
In layman’s terms, Cassandra is a NoSQL database written in Java. Among its many benefits, it is an open source DB with deep developer support. It is also a fully distributed DB, meaning that there is no master node, unlike a typical Oracle or MySQL setup, so the database has no single point of failure. It is also touted as linearly scalable: if 2 nodes give you a throughput of 100,000 transactions per second, adding 2 more nodes should give you 200,000 transactions per second, and so forth.
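To make the linear scalability claim concrete, here is a minimal sketch of the arithmetic. It assumes ideal linear scaling, which a real cluster will only approximate, and the function name `projected_throughput` is my own, not something from Cassandra.

```python
def projected_throughput(base_nodes, base_tps, new_nodes):
    """Project throughput for new_nodes, assuming ideal linear scaling."""
    # Under linear scaling, every node contributes the same share.
    per_node_tps = base_tps / base_nodes
    return per_node_tps * new_nodes


# 2 nodes at 100,000 TPS, scaled out to 4 nodes
print(projected_throughput(2, 100_000, 4))
```

With the numbers from the paragraph above, each node accounts for 50,000 transactions per second, so 4 nodes project to 200,000.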
Cassandra is based on 2 core technologies, Google’s Bigtable and Amazon’s Dynamo. Facebook built it to power their Inbox Search feature and released it as an open source project on Google Code. It was then incubated at Apache, where it is nowadays a Top-Level Project. Currently there are 2 versions of Cassandra:
- Community Edition.- This is distributed under the Apache™ License
- Enterprise Edition.- This is distributed by DataStax
Since Cassandra is a distributed system, it is subject to the CAP Theorem, which is awesomely explained here. It states that in a distributed system you can only have two out of the following three guarantees across a write/read pair:
- Consistency.- A read is guaranteed to return the most recent write for a given client.
- Availability.- A non-failing node will return a reasonable response within a reasonable amount of time (no error or timeout).
- Partition Tolerance.- The system will continue to function when network partitions occur.
Cassandra is also a BASE (Basically Available, Soft state, Eventually consistent) type of system, not an ACID (Atomicity, Consistency, Isolation, Durability) type of system. This means the system is optimistic and accepts that database consistency will be in a state of flux, unlike ACID, which is pessimistic and forces consistency at the end of every transaction.
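To get a feel for what "eventually consistent" means, here is a toy sketch of synchronous versus asynchronous replication. It is not how Cassandra replicates internally. The class `ToyCluster` and its methods are hypothetical names I made up, and each replica is just a Python dict.

```python
class ToyCluster:
    """Toy model: replica 0 takes every write, the others may lag behind."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # (replica_index, key, value) not yet applied

    def write(self, key, value, synchronous):
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            if synchronous:
                # Synchronous: every replica has the value before we return.
                self.replicas[index][key] = value
            else:
                # Asynchronous: remember the update and apply it later.
                self.pending.append((index, key, value))

    def read(self, key, replica_index):
        return self.replicas[replica_index].get(key)

    def propagate(self):
        """Apply all queued updates, bringing the replicas back in sync."""
        for index, key, value in self.pending:
            self.replicas[index][key] = value
        self.pending.clear()


cluster = ToyCluster()
cluster.write("user:1", "alice", synchronous=False)
print(cluster.read("user:1", 0))  # the replica that took the write
print(cluster.read("user:1", 2))  # a replica that has not caught up yet
cluster.propagate()
print(cluster.read("user:1", 2))  # the same replica after catching up
```

Between the asynchronous write and the call to `propagate()`, a read can return stale data depending on which replica answers. That window is the "state of flux" a BASE system accepts.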
Cassandra stores data according to the column family data model where:
- Keyspace is the container for your application data, similar to a schema in a relational database. Keyspaces are used to group column families together. Typically, a cluster has one keyspace per application. The keyspace also defines the replication strategy, and data objects belong to a single keyspace
- Column Family is a set of one or more individual rows with a similar structure
- Row is a collection of sorted columns. It is the smallest unit that stores related data in Cassandra, and any component of a Row can store data or metadata
- Row Key uniquely identifies a row in a column family
- Column key uniquely identifies a column value in a row
- Column value stores one value or a collection of values
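One way I picture the terms above is as nested dictionaries. This is only a mental model, not how Cassandra lays out data on disk. All the names (`shop`, `users`, `alice`, the helper functions) are made up for illustration, and the replication settings are an example value.

```python
# Keyspace -> column families -> rows (by row key) -> columns (by column key)
keyspace = {
    "name": "shop",  # hypothetical keyspace, one per application
    "replication": {"class": "SimpleStrategy", "replication_factor": 3},
    "column_families": {
        "users": {  # column family: rows with a similar structure
            "alice": {"email": "alice@example.com", "city": "Ottawa"},
            "bob": {"email": "bob@example.com"},  # rows may differ in columns
        },
    },
}


def get_column(ks, column_family, row_key, column_key):
    """Look up one column value by its row key and column key."""
    row = ks["column_families"][column_family][row_key]
    return row[column_key]


def sorted_columns(ks, column_family, row_key):
    """Return a row's columns ordered by column key."""
    row = ks["column_families"][column_family][row_key]
    return sorted(row.items())


print(get_column(keyspace, "users", "alice", "city"))
print(sorted_columns(keyspace, "users", "alice"))
```

The row key picks the row within the column family, and the column key picks the value within the row, which mirrors the two "uniquely identifies" bullets above.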
Also we need to understand the basic architecture of Cassandra, which has the following key structures:
- Node is one Cassandra instance and is the basic infrastructure component in Cassandra. Cassandra assigns data to nodes in the cluster, and each node is assigned a part of the database based on the Row Key. A node usually corresponds to a host, but not necessarily, especially in Dev or Test environments.
- Rack is a logical set of nodes
- Data Center is a logical set of Racks. A data center can be a physical data center or a virtual data center. Replication is set by data center
- Cluster contains one or more data centers and is the full set of nodes which map to a single complete token ring
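The idea of nodes owning parts of a token ring based on the Row Key can be sketched as follows. This is a simplified illustration under my own assumptions: each node gets a single token derived from its name, and MD5 is used only because it ships with Python. Cassandra's actual partitioners and token assignment differ, and `token_for` and `TokenRing` are hypothetical names.

```python
import bisect
import hashlib


def token_for(text):
    """Map a string to a position (token) on the ring."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


class TokenRing:
    """Toy ring: a row key belongs to the first node token at or after it."""

    def __init__(self, node_names):
        # Assumption for this sketch: one token per node, from its name.
        self._ring = sorted((token_for(name), name) for name in node_names)
        self._tokens = [token for token, _ in self._ring]

    def owner(self, row_key):
        index = bisect.bisect_left(self._tokens, token_for(row_key))
        if index == len(self._ring):
            index = 0  # past the last token, wrap around the ring
        return self._ring[index][1]


ring = TokenRing(["node1", "node2", "node3"])
for key in ("alice", "bob", "carol"):
    print(key, "->", ring.owner(key))
```

The same row key always lands on the same node, and together the nodes cover the whole ring, which is what "a single complete token ring" refers to.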
Conclusion
Hopefully this will help you understand the basic Cassandra concepts. In the next post of this series, I will go over further architecture concepts: what a Seed node is, the purpose of the Snitch and topologies, what the Coordinator node is, replication factors, etc.