In this article, we look at NoSQL concepts at a high level, as the NoSQL family plays a vital role in the big data market. We also discuss Apache Cassandra, one of the leading NoSQL databases, along with its architecture, key components, and use cases.
1. Introduction to NoSQL Databases
The term "NoSQL" refers to a database approach that differs from traditional relational database management systems (RDBMS). To understand NoSQL, it helps to first understand SQL, the query language used by relational databases.
Tables, columns, rows, and schemas are used to organise and retrieve data in relational databases. NoSQL databases, on the other hand, do not rely on these structures and instead use more flexible data models.
NoSQL databases have been adopted by mainstream organisations as RDBMS have progressively failed to match the performance, scalability, and flexibility requirements of next-generation, data-intensive applications.
"Not SQL" or "not only SQL" are two terms that can be used to describe NoSQL.
Unstructured data grows at a much faster rate than structured data and does not fit neatly into the relational structures of an RDBMS, so NoSQL is particularly useful for storing it.
Examples of unstructured data include user and session data; chat, messaging, and log data; time series data such as IoT and device data; and large objects such as videos and photographs.
2. Types of NoSQL Databases
2.1. Key Value Store
This is often thought of as the most basic form of NoSQL database. This schema-free data model is arranged as a dictionary of key-value pairs, with a key and a value for each item.
It's frequently used for caching and storing user session data like shopping carts. When you need to pull numerous records at once, though, it's not optimal.
To provide scalability and availability, data is partitioned and replicated throughout a cluster. As a result, key-value stores often offer only limited support for transactions that span multiple keys.
They are well suited to scaling applications that deal with high-speed, non-transactional data. Amazon DynamoDB, Redis, and Riak are popular key-value stores.
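The partitioning and replication described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the class `TinyKeyValueCluster`, its methods, and the "home node plus next node" placement rule are hypothetical and do not reflect how DynamoDB, Redis, or Riak are implemented.

```python
import hashlib


class TinyKeyValueCluster:
    """Toy key-value store: each key is hashed to a home node and copied to the next ones."""

    def __init__(self, node_count=3, replicas=2):
        self.replicas = replicas
        # Each "node" is just a dictionary held in memory.
        self.nodes = [{} for _ in range(node_count)]

    def _owners(self, key):
        # A stable hash (unlike the built-in hash()) picks the home node.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        home = int(digest, 16) % len(self.nodes)
        return [(home + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, value):
        # Write the value to every replica that owns the key.
        for index in self._owners(key):
            self.nodes[index][key] = value

    def get(self, key, failed_nodes=()):
        # Read from the first replica that is still reachable.
        for index in self._owners(key):
            if index not in failed_nodes:
                return self.nodes[index].get(key)
        return None


cluster = TinyKeyValueCluster()
cluster.put("session:42", {"cart": ["book", "pen"]})
home_node = cluster._owners("session:42")[0]
print(cluster.get("session:42"))
# The value is still readable when the home node is treated as failed.
print(cluster.get("session:42", failed_nodes={home_node}))
```

Because every operation addresses a single key, each request can be routed to its owners without coordinating with the rest of the cluster, which is what makes this model easy to scale.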
2.2. Document Store
Document databases take the key-value database concept a step further by grouping documents into collections. They allow queries on any attribute within a document and support nested key-value pairs.
Data is typically represented as a JSON-like document that maps directly to objects in application code, which makes the model efficient and intuitive for developers.
MongoDB and Amazon DocumentDB are examples of document databases.
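To make the idea of querying on any attribute, including nested ones, more concrete, here is a minimal Python sketch. The helpers `get_path` and `find` and the sample `users` collection are hypothetical; real document databases expose their own query languages and use indexes rather than scanning a list.

```python
def get_path(document, dotted_path):
    """Follow a dotted path such as 'address.city' through nested dictionaries."""
    current = document
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(collection, dotted_path, expected):
    """Return every document whose value at dotted_path equals expected."""
    return [doc for doc in collection if get_path(doc, dotted_path) == expected]


# A "collection" is a list of JSON-like documents; documents need not share a schema.
users = [
    {"_id": 1, "name": "Asha", "address": {"city": "Chennai", "zip": "600001"}},
    {"_id": 2, "name": "Ben", "address": {"city": "Leeds"}, "tags": ["admin"]},
    {"_id": 3, "name": "Chen"},
]

matches = find(users, "address.city", "Leeds")
print([doc["name"] for doc in matches])
```

Note that the third document has no `address` at all, yet it lives in the same collection; this is the schema flexibility that distinguishes the model from a relational table.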
2.3. Column Oriented
Column-oriented databases organise data by column rather than by row, and each column can be addressed separately. This makes queries that touch only a few columns efficient, and it keeps sparse rows cheap to store because missing values take no space. Aggregate queries such as SUM, COUNT, AVG, and MIN also tend to perform well, since the values of a column are stored together.
HBase and Apache Cassandra are well-known examples of this family; both are more precisely described as wide-column stores.
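The difference between a row layout and a column layout can be illustrated with plain Python data structures. This is a conceptual sketch with made-up sample data, not a description of how HBase or Cassandra store data on disk.

```python
# Row layout: one record per entry, every field stored together.
rows = [
    {"order_id": 1, "region": "north", "amount": 120.0},
    {"order_id": 2, "region": "south", "amount": 80.0},
    {"order_id": 3, "region": "north", "amount": 50.0},
]

# Column layout: one list per column; position i in each list belongs to record i.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one column only touches that column's values.
total_amount = sum(columns["amount"])
max_amount = max(columns["amount"])
order_count = len(columns["order_id"])
print(total_amount, max_amount, order_count)
```

In the column layout, computing the total never reads `region` or `order_id`, which is why aggregates over a few columns can be cheaper than scanning whole rows.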
2.4. Graph based
A graph database keeps track of both entities and their connections. The entity is represented as a node, while the connections/relationships are represented as edges.
An edge establishes a connection between nodes. A unique identifier is allocated to each node and edge. Graph databases are multi-relational in nature.
Neo4j, InfiniteGraph, and OrientDB are popular graph databases.
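A small Python sketch can show how nodes, edges, and relationship types fit together. The identifiers, names, and relationship labels below are invented for illustration, and the traversal is a plain breadth-first search rather than the query engine of any particular graph database.

```python
from collections import deque

# Nodes and edges each carry a unique identifier, as described above.
nodes = {"n1": "Alice", "n2": "Bob", "n3": "Carol", "n4": "Dave"}
edges = {
    "e1": ("n1", "FOLLOWS", "n2"),
    "e2": ("n2", "FOLLOWS", "n3"),
    "e3": ("n1", "WORKS_WITH", "n4"),
}


def neighbours(node_id, relation=None):
    """Node ids reachable over one outgoing edge, optionally filtered by relation type."""
    return [
        target
        for source, label, target in edges.values()
        if source == node_id and (relation is None or label == relation)
    ]


def reachable(start, relation):
    """Breadth-first traversal that follows only edges of the given relation type."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in neighbours(queue.popleft(), relation):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}


print(sorted(nodes[n] for n in reachable("n1", "FOLLOWS")))
```

Having several relationship types between the same set of nodes (`FOLLOWS` and `WORKS_WITH` here) is what "multi-relational" means in practice.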
3. What is Apache Cassandra?
Apache Cassandra is an open-source, column-oriented, distributed database management system that can handle large quantities of data across multiple data centers and the cloud.
Its ability to manage large volumes makes it especially useful for big businesses. As a result, several huge companies, including Apple, Instagram, Facebook, Uber, Twitter, Cisco, eBay, and Netflix, are now using it.
Highlights of Cassandra:
- Open Source
- Extremely scalable
- High Availability
- No Single Point of failure
- High Performance
- Fault tolerant
4. Cassandra Architecture
Apache Cassandra is built to manage large data workloads on several nodes with no single point of failure. Its design is built on the assumption that system and hardware failures are inevitable.
Cassandra solves the problem of failures by deploying a peer-to-peer distributed system between homogeneous nodes and distributing data across the cluster.
Cassandra's architecture is ring-based and it doesn't have any master nodes or single points of failure.
- Using the peer-to-peer gossip communication protocol, each node frequently exchanges state information about itself and other nodes across the cluster. To ensure data durability, each node has a sequentially written commit log that records write activity.
- The data is subsequently indexed and written to a memtable, an in-memory structure that resembles a write-back cache. Whenever the memtable is full, its data is flushed to disk in an SSTable data file.
- All writes are automatically partitioned and replicated throughout the cluster. Cassandra periodically runs a process called compaction to merge SSTables, discarding obsolete data that has been marked for deletion with a tombstone. Various repair mechanisms are used to keep data consistent across the cluster.
- Cassandra is a partitioned row store database in which rows are grouped into tables and a primary key is required. Using the Cassandra Query Language (CQL), any authorised user can connect to any node in any data center and access data. CQL has a syntax similar to SQL and works with table data.
- Any node in the cluster can receive client read or write requests. When a client sends a request to a node, that node becomes the coordinator for that specific client operation. Between the client application and the nodes that own the data being requested, the coordinator functions as a proxy. Based on how the cluster is designed, the coordinator selects which nodes in the ring should receive the request.
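The coordinator's job of picking the nodes that own a piece of data can be sketched as a walk around a token ring. The code below is a simplified illustration under stated assumptions: the `Ring` class, the node names, one token per node, MD5 as the hash, and "next nodes clockwise" as the replica rule are all choices made for this example, not Cassandra's actual partitioner or replication strategy.

```python
import bisect
import hashlib


def token_for(value):
    """Map a string to a position on the ring (MD5 is used only as a stable hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, node_names, replication_factor=3):
        self.replication_factor = replication_factor
        # Assumption: each node owns exactly one token on the ring.
        self.ring = sorted((token_for(name), name) for name in node_names)
        self.tokens = [token for token, _ in self.ring]

    def replicas_for(self, partition_key):
        """Walk clockwise from the key's token and collect the replica nodes."""
        start = bisect.bisect_left(self.tokens, token_for(partition_key))
        count = min(self.replication_factor, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]

    def coordinate(self, contacted_node, partition_key):
        """Any node can act as coordinator; it forwards the request to the replicas."""
        return {
            "coordinator": contacted_node,
            "replicas": self.replicas_for(partition_key),
        }


ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
plan = ring.coordinate("node-d", "user:1001")
print(plan)
```

Whichever node the client contacts, the set of replicas for a given key is the same, because it depends only on the key's position on the ring. That is why no master node is needed to route requests.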
5. Key components of Cassandra
- Node: The server where data is stored. It is the basic building block of Cassandra's infrastructure.
- Data Centre: A collection of related nodes.
- Cluster: One or more data centres make up a cluster.
- SSTable: The Sorted Strings Table is a file containing key/value pairs that have been sorted by key. SSTables are immutable: for each Cassandra table, new SSTables are written sequentially to disk rather than updated in place.
- Commit log: For the sake of durability, all data is first written to the commit log. Once all of its data has been flushed to SSTables, a commit log segment can be archived, removed, or recycled.
- Mem-Table: A mem-table is a memory-resident data structure. Data is written to the mem-table after it has been written to the commit log. At times there may be several mem-tables for a single column family.
- Bloom filters: Bloom filters are used by Cassandra to determine whether any of the SSTables is likely to have the desired partition key without having to read their contents.
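The way these components cooperate can be sketched as a toy storage engine. Everything in this example is a simplification made for illustration: `TinyStore`, `BloomFilter`, the memtable limit of two entries, and the single-pass `compact` method are hypothetical, and real Cassandra adds many details (for instance, it keeps tombstones for a grace period before purging them).

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size=256, hash_count=3):
        self.size, self.hash_count = size, hash_count
        self.bits = [False] * size

    def _positions(self, key):
        for seed in range(self.hash_count):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = True

    def might_contain(self, key):
        return all(self.bits[position] for position in self._positions(key))


TOMBSTONE = object()  # marker written in place of a deleted value


class TinyStore:
    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.commit_log = []  # append-only record of writes not yet flushed
        self.memtable = {}    # in-memory buffer
        self.sstables = []    # (bloom_filter, rows sorted by key), oldest first

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. durability first
        self.memtable[key] = value            # 2. then the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                      # 3. flush when full

    def delete(self, key):
        self.write(key, TOMBSTONE)  # a delete is a write of a tombstone

    def flush(self):
        bloom = BloomFilter()
        for key in self.memtable:
            bloom.add(key)
        rows = {key: self.memtable[key] for key in sorted(self.memtable)}
        self.sstables.append((bloom, rows))
        self.memtable, self.commit_log = {}, []  # flushed data no longer needs the log

    def read(self, key):
        value = self.memtable.get(key)
        if key not in self.memtable:
            for bloom, rows in reversed(self.sstables):  # newest SSTable first
                # The Bloom filter lets us skip SSTables that cannot hold the key.
                if bloom.might_contain(key) and key in rows:
                    value = rows[key]
                    break
        return None if value is TOMBSTONE else value

    def compact(self):
        merged = {}
        for _, rows in self.sstables:  # newer tables overwrite older ones
            merged.update(rows)
        live = {k: merged[k] for k in sorted(merged) if merged[k] is not TOMBSTONE}
        bloom = BloomFilter()
        for key in live:
            bloom.add(key)
        self.sstables = [(bloom, live)]


store = TinyStore()
store.write("k1", "v1")
store.write("k2", "v2")   # memtable reaches its limit and is flushed
store.write("k1", "v1b")
store.delete("k2")        # triggers a second flush
print(store.read("k1"), store.read("k2"), len(store.sstables))
store.compact()
print(store.read("k1"), store.read("k2"), len(store.sstables))
```

Writes only ever append to the log and update memory, which is one reason writes are cheap in this design; the cost is shifted to reads, which may have to consult several SSTables, and to compaction, which merges them in the background.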
6. Where to use Cassandra?
Cassandra has proven to be extremely useful in a variety of applications.
The following are some of the key considerations when choosing Cassandra:
- Cassandra is a good fit when the application is write-intensive rather than read-intensive. Data distribution between nodes is fast, and writes are inexpensive.
- Cassandra is also suitable for distributing data across multiple data centers and cloud availability zones.
- Cassandra can be a powerful backbone for real-time analytics when used in conjunction with Apache Spark and other tools. It also scales linearly, so if you expect your real-time data to grow, Cassandra is a strong option.
7. Ideal use cases of Cassandra
- Messaging: As Cassandra can handle enormous amounts of data, it is a good option for messaging services.
- Sensor applications: Cassandra is well suited to handling high-velocity data arriving from many devices and sensors.
- E-commerce applications: Many retailers rely on Cassandra for durable shopping cart data and fast reads and writes of product catalogue data.
- Recommendation Engines: Cassandra is especially well-suited to consumer analysis and recommendations, making it a popular choice among online businesses and social networking platforms.
In this article, we have gone through the overview of NoSQL database concepts and basic architecture of Cassandra. We will walk through the installation of Cassandra in upcoming articles.