
NOSQL

Introduction:

We all know RDBMS, and we have been pretty happy with them. Transactions are well protected, recovery is good so we can come back from failures, and the row-column structure is a good data model: it structures the data at load time so that we can retrieve it easily. So why was a new approach called NoSQL needed? First, the motivation was the huge volume of unstructured data building up; loading it into an RDBMS with a fixed schema was a challenge. Second, RDBMS scale up well (give them more memory & CPU and they perform better), but they could not really scale out (horizontally, by adding more machines). Lastly, RDBMS focus more on data consistency than on performance, and the more you stress consistency, the more performance is impacted. This blog discusses some of the basic theories NoSQL systems are built upon and sets a base for generating further interest in these systems.


CAP Theorem:

Let us start with the CAP theorem: Consistency, Availability & Partition tolerance. The theorem says that a distributed system cannot guarantee all three at once; when a network partition happens, it has to choose between consistency and availability. These dimensions give an insight into what a particular DBMS is optimized for. Many NoSQL systems fall into the Availability & Partition tolerance category, so they relax the strict data consistency that RDBMS are very careful about.

Eventual Consistency:

RDBMS follow the ACID (atomicity, consistency, isolation, durability) rules for a transaction. The system ensures that any data modification made by a transaction is consistent & that everyone sees the same data. It does this by taking locks, which can lead to blocking and even deadlocks, and it offers different isolation levels to control this. In contrast, NoSQL systems say that your data will be eventually consistent. What this means is that right after an update the copies of the data may not all agree, but if no new updates arrive they eventually converge to the same value. In return, the data stays available & your NoSQL instance can scale out to many nodes, unlike RDBMS, which cannot easily scale out to thousands of servers. As a counterpart to ACID, NoSQL systems follow a less well-known theory called BASE (basically available, soft state & eventually consistent). So your data remains available even after multiple failures, the strict consistency requirement of RDBMS is given up, & the data becomes consistent at some point in time, without a guarantee of when that will be.
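To make the idea of eventual consistency concrete, here is a minimal sketch in Python. It is a toy model and not how any real NoSQL system is implemented: the class names (`Replica`, `EventuallyConsistentStore`) and the single replication queue are assumptions made up for illustration. A write is acknowledged after updating one copy, reads are always answered (availability), and the other copies catch up later.

```python
from collections import deque


class Replica:
    """One copy of the data (hypothetical name, for illustration only)."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class EventuallyConsistentStore:
    """Toy store: a write lands on one replica and is queued for the rest."""

    def __init__(self, replica_names):
        self.replicas = [Replica(n) for n in replica_names]
        self.pending = deque()  # (target replica, key, value) not yet applied

    def write(self, key, value):
        # Acknowledge the write after updating only the first replica.
        first, *others = self.replicas
        first.data[key] = value
        for replica in others:
            self.pending.append((replica, key, value))

    def read(self, key, replica_index):
        # Reads are always served, even if this replica is stale.
        return self.replicas[replica_index].data.get(key)

    def propagate(self):
        # Background replication: apply queued updates in write order.
        while self.pending:
            replica, key, value = self.pending.popleft()
            replica.data[key] = value

    def is_consistent(self, key):
        # True when every replica holds the same value for the key.
        return len({r.data.get(key) for r in self.replicas}) == 1


store = EventuallyConsistentStore(["a", "b", "c"])
store.write("balance", 100)
stale = store.read("balance", 2)              # replica "c" has not seen the write yet
consistent_before = store.is_consistent("balance")
store.propagate()                             # replication catches up
fresh = store.read("balance", 2)              # now every replica agrees
```

Between `write` and `propagate` the store is in the "soft state" that BASE talks about: it answers every read, but different replicas may give different answers.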

Major impacting NoSQL systems:

Some systems made a major impact on the NoSQL world, & most later systems follow a model taken from these three:

Memcached – First demonstrated that in-memory indexes can be highly scalable, distributing & replicating objects over multiple nodes. It uses the consistent hashing technique.

Dynamo – Developed at Amazon, it demonstrated that eventual consistency is a way to achieve high availability & scalability.

BigTable – Developed by Google, it demonstrated that persistent record storage can be scaled to thousands of nodes.
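Consistent hashing, mentioned for Memcached above, can be sketched in a few lines of Python. This is only an illustration of the technique, not the code of any Memcached client: the `HashRing` class, the `vnodes` parameter and the node names are assumptions for the example. Nodes and keys are hashed onto the same ring, and a key belongs to the first node found clockwise from its position.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # Map any string to a position on the ring (MD5 used only for spreading).
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node name)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node is placed on the ring at several positions.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def get_node(self, key):
        if not self._ring:
            raise ValueError("ring has no nodes")
        # First ring position at or after the key's hash, wrapping around.
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["cache-1", "cache-2", "cache-3"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.get_node(k) for k in keys}

ring.add_node("cache-4")
after = {k: ring.get_node(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

The point of the design is in `moved`: when a node is added, the only keys that change owner are the ones that now belong to the new node, so the rest of the cache stays valid. With a plain `hash(key) % number_of_nodes` scheme, most keys would move.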

Cassandra:

Apache Cassandra is a kind of hybrid DB: it takes its distributed design from Dynamo & its data model from BigTable.
Major features of Cassandra include:

  •     Decentralization
  •     Linear scalability
  •     Tunable consistency
  •     MapReduce support (also works with Pig & Hive)
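Tunable consistency in the list above is easiest to see with a little arithmetic. Cassandra lets each read and write choose a consistency level such as ONE, QUORUM or ALL, where a quorum is (replication factor / 2) + 1, rounded down. A read is guaranteed to overlap with the latest acknowledged write when the replicas written (W) plus the replicas read (R) exceed the replication factor (N), that is R + W > N. The sketch below covers only these three levels, and the helper function names are made up for this example.

```python
def replicas_required(level: str, replication_factor: int) -> int:
    """How many replicas must respond for a given consistency level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported level in this sketch: {level}")


def reads_see_latest_write(write_level: str, read_level: str,
                           replication_factor: int) -> bool:
    """True when R + W > N, so every read set overlaps every write set."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


# With three copies of the data (N = 3):
quorum_both = reads_see_latest_write("QUORUM", "QUORUM", 3)  # 2 + 2 > 3
one_both = reads_see_latest_write("ONE", "ONE", 3)           # 1 + 1 > 3 fails
write_all = reads_see_latest_write("ALL", "ONE", 3)          # 3 + 1 > 3
```

So the application can pay for stronger consistency with slower, less available operations (QUORUM or ALL), or accept eventual consistency in exchange for speed (ONE).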

Comparison between Cassandra & RDBMS

Apache Cassandra is an open source, distributed NoSQL data system. The table below compares some of its features with those of an RDBMS:

| Cassandra | RDBMS (MS SQL Server) |
| --- | --- |
| Can scale out to thousands of servers to store huge amounts of data that do not fit on a single server | Mainly scales up for better performance: performance depends on the RAM & CPU of one server, & huge amounts of data cannot easily be spread across multiple servers |
| Supports a primary index (primary key), which determines how data is partitioned across nodes | Primary indexing is supported. A B-tree index structure is created |
| Supports secondary indexes for faster data retrieval. Uses an in-memory indexing concept to store index details in memory | Secondary indexes are supported, again as a B-tree index structure |
| Column-family (wide-column) data model | Row-oriented data model |
| Does not support the full ACID properties of a transaction | ACID properties are supported |
| Data is generally de-normalized for better performance | Data is normalized to avoid redundancy |
| Highly tunable. Cassandra can be tuned at any level based on application requirements | Can be tuned to some extent without compromising the ACID properties of a transaction |
| Not recommended when data must be highly consistent, such as financial / banking data | Recommended if consistency is required |
| Does not support ad hoc queries from the application side. The queries need to be planned & known at design time | Supports ad hoc queries |
| No single point of failure. The application can still work if nodes are down (highly available by design) | The server is the single point of failure. If the server goes down for some reason, the DB is down (a standby needs to be set up for high availability) |
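The rows on de-normalization and planned queries go together, and a small Python sketch may help. It uses plain dictionaries rather than a real database, and the sample users, orders and the `orders_by_user` name are invented for illustration. In the normalized layout the answer is assembled at read time with a join; in the de-normalized layout the data is stored already shaped for one known query, keyed by the value that query looks up.

```python
from collections import defaultdict

# Normalized layout (RDBMS style): each fact is stored once.
users = {1: {"name": "Asha"}, 2: {"name": "Ben"}}
orders = [
    {"order_id": 10, "user_id": 1, "item": "book"},
    {"order_id": 11, "user_id": 2, "item": "lamp"},
    {"order_id": 12, "user_id": 1, "item": "pen"},
]


def orders_for_user_with_join(user_id):
    # Ad hoc query: scan the orders and join to users at read time.
    return [
        {"order_id": o["order_id"], "item": o["item"],
         "user_name": users[o["user_id"]]["name"]}
        for o in orders
        if o["user_id"] == user_id
    ]


# De-normalized layout (Cassandra style): one "table" per planned query.
# The user name is copied into every order row, and rows are grouped by
# user_id, which plays the role of the partition key in this sketch.
orders_by_user = defaultdict(list)
for o in orders:
    orders_by_user[o["user_id"]].append(
        {"order_id": o["order_id"], "item": o["item"],
         "user_name": users[o["user_id"]]["name"]}
    )

joined = orders_for_user_with_join(1)
looked_up = orders_by_user[1]  # single lookup by key, no join needed
```

Both give the same rows, but the second form only answers the question it was designed for. A new question, such as "all orders for a given item", would need another table written with the same data, which is why the queries have to be known at design time.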

Conclusion:

So after reading all this, one thing is pretty clear: RDBMS are not going away, as many applications require the data to be consistent & cannot afford to rely on inconsistent data at any point in time. NoSQL systems, on the other hand, do a great job of scaling out & providing great performance with fault tolerance.
