Apache Cassandra’s Columnar database approach
Lately, I have been using Apache Cassandra as main database in a recent project and loved the way it has been designed.
Cassandra is widely used in many tench-firms. It’s in the likes of CERN, Netflix, Instagram etc.
The advantage of Cassandra lies in the fact that we can write data very fast in Cassandra. So, if we are dealing with a huge amount data being ingested into the application then Cassandra comes to the rescue. The reason for this efficiency is that Cassandra uses LSTM rather than B+ tree used by SQL dbs. For more details around it, please refer this article.
It’s basically a columnar database, which makes it reads very efficient.
In a normal row-oriented database, like MySQL when we make a select query it goes for an entire row-by-row scan, which is stored as a file on File system.
A columnar database attempts to make this scan faster.
The performance of any query is proportional to the amount of data it’s scanning. So, in a normal row-oriented database although in our where clause we just have 2–3 columns, but we need to iterate through entire set of columns to give it’s results, which makes the read as inefficient.
The data on hardware is stored on a disk. So, when we make a query in a normal row-oriented database, we have to pass through each row-record of that disk to complete the read operation. The higher the number of columns the slower will be the query now as the data is stored sequentially row-by-row.
A columnar database on the other hand, stored entire column in a separate file altogether. So, even if our table contains 100 columns, and we are dealing with only 3 columns. We are now dealing with a lesser amount of data thereby making the read really very fast.
The next big advantage of Columnar database, is the data compression which happens by combining consecutive same column values together
For ex: If we have rows like BMW,BMW,BMW, AUDI,AUDI,MERZ . It will be stored in DB as BMW*3,AUDI*2,MERZ, which implies that this row and next 2 rows have BMW as value, so we don’t need to iterate through them.
This data compression also makes these queries further efficient.
In a normal row-oriented database, if the table is very huge any “ALTER TABLE ADD COLUMN” command will become very very flow. However, in case of a columnar database, it’s just creation of a new file into the file system which makes it faster.
The real advantage of columnar database comes into picture when we have huge amount of data and a lot of columns. For a limited set of data, a row-oriented database is a perfect fit.
When to use a columnar database?
It’s mainly recommended for OLAP use-cases where we want to generate analytics from the given set of data. The performance of joins is not really great in columnar database. For OLTP use-cases, it’s always better to go for a normal row-oriented database.