Thursday, February 26, 2015

BASE

We know about acid and base from chemistry; they are opposite in nature...

I am not here to learn chemistry. Tell me something related to databases.

I am not talking chemistry either, but database engineers have nicely borrowed the term BASE as a contrast to ACID. ACID is the most discussed term in the database world, but it is the concept that the world of Relational Database Management Systems deals with. The NoSQL movement, in contrast, has moved the pH of database transactions towards BASE.

Database transaction pH Scale
Here, I would like to discuss the CAP Theorem a bit. This theorem was originally developed by Eric Brewer in 2000, hence it is also known as Brewer's Theorem.

CAP Theorem: This theorem deals with three desirable properties of a distributed system. These are:


  • Consistency: A read sees all previously completed writes.
  • Availability: Reads and writes always succeed.
  • Partition tolerance: Guaranteed properties are maintained even when network failures prevent some machines from communicating with others.
Now, the CAP theorem states that a distributed system can never guarantee all three of these simultaneously.

CAP Theorem

In reality, it is always a choice of 'two out of three': CP, CA or AP.

Two out of three
NoSQL databases are distributed systems, so they also have this limitation. They therefore follow a pattern known as BASE.
This moves the pH of database transactions to higher values.


You have already used this term quite a few times. Would you care to tell us what BASE is?
Well, NoSQL relies on a strategy that sticks to Brewer's Theorem. It consists of the following properties.

  • Basically Available: This means that the data will be available even in the presence of failures in the system.
    This is achieved by distributed computing. Instead of storing the data in a single store and trying to maintain fault tolerance, NoSQL spreads the data across multiple storage systems with a high degree of replication. This ensures availability of the data, and complete outages are very unlikely to occur.
  • Soft state: Soft state indicates that the state of the system may change over time, even without input. This is because of the eventual consistency model.
  • Eventual Consistency: This states that the data will become consistent over time. NoSQL systems ensure that at some future point in time the data reaches a consistent state.
    BASE is optimistic and accepts that database consistency will be in a state of flux. A small sketch of how replicas converge follows this list.
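
To make eventual consistency concrete, here is a minimal sketch in Python, assuming a toy in-memory replica model (the Replica class and the replicate helper are purely illustrative; no real NoSQL client or product is used):

class Replica:
    """One node of a replicated key-value store (toy model)."""
    def __init__(self):
        self.data = {}

    def write(self, key, value):
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)

def replicate(source, target):
    """Background synchronisation: copy the source's data onto the target."""
    target.data.update(source.data)

node_a, node_b = Replica(), Replica()
node_a.write("user:1", "created")     # the write lands on node A only
print(node_b.read("user:1"))          # None -- node B has not seen it yet (soft state)
replicate(node_a, node_b)             # replication eventually catches up
print(node_b.read("user:1"))          # "created" -- the replicas have converged

The point is simply that a read issued before replication completes may see stale or missing data, which is exactly the flux that BASE accepts.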
So, ACID is better. We have all the desired properties there.
Every technology has its own trade-offs, and so does NoSQL. It is up to the developer to design his/her system accordingly. As a developer, you have to analyse what your system requires, do a feasibility study on it and choose the technology accordingly.
For example, NoSQL is a bad choice where strong consistency is required (banking applications), while it is a good choice for systems that need to store and access big data without a strong consistency requirement (social networking applications).

We have talked a lot about CAP. For some real-world examples of CP, CA and AP databases, we can have a look at the following diagram,
Databases in CAP intersection
I have a chart as well with some additional examples,

  • Bigtable by Google - CP
  • HBase by Apache - CP
  • DynamoDB by Amazon - AP
  • SimpleDB by Amazon - AP
  • Voldemort by LinkedIn - AP
  • Cassandra by Facebook - AP

Well, that's all for now; we'll look into more in the next articles.

If this article gave you some more knowledge, would you like to share it with your network?

Wednesday, February 11, 2015

Introduction to NoSQL

We have already got an idea about Big Data, and we understand why we need to take care of it and not ignore it.

First, let us see why it is tough to handle Big Data in the traditional RDBMS way.
  1. We define relations in RDBMS, so interrelated objects are basically different tables joined together.
    For example, an object of User can have attributes like firstName, lastName, and an object of Address. Address in turn can have attributes like zip, city, state. Now, if we want to define the same in the RDBMS way, we get two different tables, USER INFO and ADDRESS INFO, which then need to be joined using foreign key relations (see the sketch after this list).

     
    Data representation in RDBMS and Memory

      So, think of read-write:
      i. While reading the data, we need to retrieve records from two different tables, merge the two results into one and represent it in memory.

      ii. While writing the data, we have to split the in-memory object, create two different representations and save them back to the data store.
      We end up doing unnecessary extra operations, and in real-life scenarios things are more complicated, with lots of joins in the query.
    2. We all do this one too: we define the structure of the data before we can work on it. In RDBMS, we need to define the data structure beforehand, so we cannot add anything dynamically. Suppose we need to add a contact to the User object. We need to add another table, establish a relation between Contact and User, and only then can we effectively use Contact in User. Nowadays, data mostly does not follow any structure, so RDBMS fails to process this data.
    3. RDBMS is linearly dependent on the processing power of a single machine. If we need to process more data, we need a more powerful machine. Vertical scalability can be achieved only up to a certain limit, so at some point we are bound to get saturated. A more powerful system also needs more knowledge to maintain, and cost is another key factor here: a more powerful machine is more costly.
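
To make the read/write overhead concrete, here is a hedged sketch in Python contrasting the two representations of the same User. The table names follow the example above, while the field values are purely illustrative:

# RDBMS view: two flat tables (shown here as dicts) tied together by a foreign key.
user_info = {"user_id": 1, "firstName": "John", "lastName": "Doe"}
address_info = {"address_id": 10, "user_id": 1,
                "zip": "12345", "city": "Springfield", "state": "XX"}

# Reading: the application must join the two rows to rebuild the in-memory object.
user_object = dict(user_info,
                   address={k: v for k, v in address_info.items()
                            if k not in ("address_id", "user_id")})

# Document view: the object is stored exactly as it is used in memory,
# so nothing has to be merged on read or split on write.
user_document = {
    "firstName": "John",
    "lastName": "Doe",
    "address": {"zip": "12345", "city": "Springfield", "state": "XX"},
}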


Now, what to do with the big data?
Well, technology has evolved in a sensible way, so we can think of a solution to this problem in a much more sophisticated way.

NoSQL to the rescue!!!

NoSQL is a new approach to processing Big Data; it defines data in a more logical way, and dealing with that data is far more sensible in NoSQL.

So, what is this NoSQL?
I cannot really define NoSQL; no formal definition can be provided here. It is a new-age concept to deal with the new-age problem of Big Data.

OK, does it deal with the problems?
Yes, the major problems of handling Big Data in RDBMS are dealt with in NoSQL in the finest way possible.


    1. NoSQL stores data in the same way we define it in memory, so no extra processing is required while retrieving or storing data.
    2. NoSQL does not need the structure of the data to be defined beforehand. We can simply store it as we want it to be, and still get what we need as and when required. (More on this is coming up later; a small sketch follows the diagram below.)
    3. NoSQL simply works on a distributed framework: when needed, we just add more similar machines to the existing network. No more powerful system is required to handle more data; NoSQL scales out instead of scaling up, so cost and maintenance are lower in the NoSQL case. The following diagram depicts the difference,
      Vertical vs Horizontal Scalability
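
As a small illustration of point 2, here is a sketch of schema-less storage. The document store is modelled as a plain list of Python dicts purely for illustration; no specific NoSQL product or API is implied:

# Existing documents, stored without any predefined schema.
users = [
    {"firstName": "John", "lastName": "Doe"},
]

# Adding a contact needs no ALTER TABLE and no new joined table:
# the new field simply appears on the documents that carry it.
users.append({"firstName": "Jane", "lastName": "Roe",
              "contact": {"email": "jane@example.com"}})

# Older documents stay valid; readers just handle the missing field.
for u in users:
    print(u["firstName"], u.get("contact", "no contact on record"))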
Well, good to know. But as far as I know, nothing is 100% perfect. There must be some catch with NoSQL as well.
Yes, nothing is perfect, and neither is NoSQL. Consistency is the issue that needs special care while working with NoSQL. I assume you know about ACID; NoSQL really lacks this compared to RDBMS. RDBMS is built on top of ACID, but NoSQL is quite different in this area.

What NoSQL deals with instead is what we call 'BASE'. (Well, some specialized NoSQL systems do show ACID properties.)

    'BASE'. Now, what is that ?
    We'll discuss this one in our next discussion.


    Palash Kanti Kundu

    Monday, February 9, 2015

    BIG BIG BIG DATA !!!!

OOOOOOooooo, enough is enough. On every website I get to see BIG DATA, and even you are using the name here. Can you please explain what it is?
    Big data is very big.

Just this much??? Can you explain what BIG DATA is?
    Well, Big Data is big data that is tough to handle.

Again a one-liner???
Big Data is data that can't be handled with traditional data management systems. 80% to 90% of Big Data is unstructured data, which is tough to analyse.

    UNSTRUCTURED ???
That means you cannot put it in a relational form.

Wait, there is something more to get on track: the 3 V's.

    Go On !
Well, Big Data is completely explained through the 3 V's (Volume, Velocity, Variety). No single term can define Big Data completely.

OK, that sounds awkward; how can data have speed?
Yes, it can have velocity. Let me explain each term the Big Data way!

Volume: The quantity of data that is being stored and processed. For example, a 5 MB photo you upload to Instagram.
Velocity: The speed at which data is being added to the digital ecosystem. My earlier article gave you an overview of it, but to make things clearer: 2.9 quintillion (18 zeros) of data is created every single day.
Variety: Every day, lots of different kinds of media are published in the digital ecosystem. The content can be text, documents, binary, images, video or anything else.

    Let's dig deeper,
    • When we are talking about volume, think of what 2.9 quintillion means.
    • Speed is another key factor of Big Data. Millions of social networking users and billions of machines create data every second.
    • You want to share almost everything on Facebook, maybe it's a new medal you have won, a good song you are listening to or a good blog post you are reading. The Internet really has become a huge information store.
Where the hell is all this data coming from?
There was a time in the early 1960s when people worked on computers that cost as much as a building, and scientists were deployed just to process a few bits of information.

Gone are those days. Bid them goodbye: advanced computing now sits in almost every pocket, thanks to technology and the era of smartphones.
Yes, we live in better days with lots of facilities at hand.

BUT everything has its own pros and cons, and the same is true for technology. With the advancement of computing, the data footprint has grown larger, creating a TSUNAMI of data.

HEY!!! Technology has been advancing almost since the beginning of humankind. Why is Big Data only here now?
I agree with your point. But dear reader, think back: computers are not so old, and neither is data. Even in the early days of computing, data processing was a tough task; more than one person would be required to process a few bytes of data. Storing the data was a nightmare with punched cards and floppy disks (if you are unaware of these things, I encourage you to quickly perform a Google search). It was only after the 1990s that data processing became easier, with advanced technologies in place. After that, the growth of the Internet, advancements in different technologies and the reduced cost of hardware led the world to produce more and more data. Nowadays, no human is even required to create data; data is created by machines themselves.

So we can think of it this way:
In the very beginning of the computing era only dedicated employees could create data, then with advancements in place general users were able to create it, and nowadays machines can create data as well.

Some of the places where data is created in a BIG (Volume, Velocity and Variety) way:
Retailer databases: different shopping malls, online shopping carts etc.
Industry: logistics, healthcare, IT etc. Every industry keeps records of its products, consumers, deliveries etc. to facilitate the end user.
Social media and cloud: we are happy to share good news, we are good at gossiping, we breathe a sigh of relief talking to a friend in bad situations, and we feel safe uploading our photos to Google Drive/OneDrive. IN A NUTSHELL, WE LOVE TO USE SOCIAL MEDIA AND THE CLOUD.
Internet of Things: many of today's smart gadgets deploy different sensors and are connected to the cloud. All these gadgets create data to store different information. Not only that, the different satellites deployed around the Earth also contribute to the BIG DATA platform.
Scientific data: advancement cannot happen without ongoing research, be it in the field of software, medicine or even the toy industry. Whatever it is, there is research going on, and with advanced computing we create data in the digital ecosystem.

So, now you can see in finer detail why Big Data is today's burning issue and not a decade-old problem. If not, I think you should get a coffee and re-read the article from the beginning.

OK, good. Wouldn't it be simpler to just delete the data instead of storing it?
Hmm, possibly a good idea, but step back. Why do we even bother to delete when we have plenty of resources available to store it?
Hardware has become very cost effective these days; anyone can carry more music than he/she could even listen to in a year.
So, data storage is really not a problem in today's world, but processing the data is a real challenge.

OK, so leave the data in storage and forget about it.
Another brilliant idea, but not really appreciated in today's world. We live in the Information Era; we can get information on most things these days. So, just storing the data and forgetting about it is really not an acceptable solution.

    So, what are you going to do with 80% to 90% UNSTRUCTURED DATA ?
    What about processing and analysing the data ?

WHAT ARE YOU TALKING ABOUT, AND HOW ARE YOU GOING TO PROCESS IT?
Let's think of WHY first, then WHAT and HOW.
A simple example can shed some light. You have 2 students, one excellent and one poor. You decide to monitor the poor one and get him a good routine, nutritious food etc. This poor student improves and scores well in the next exam.
Have you got the idea?

    NO.
OK, you are basically creating Big Data about your poor student, and then you process and analyse that data to get a good result.

Similarly, by processing Big Data we can get good results.
For example, Google gives search suggestions.

    Wikipedia gives some interesting stories of Big Data Analysis here.

Well, this depicts the facts, and this video depicts the scenario in a more real-world context.

Next we are going to talk a bit about processing Big Data. Till then, enjoy the videos.
Hope you enjoyed this article. Don't forget to share it with your friends!!!

    Palash Kanti Kundu

    Sunday, February 8, 2015

    Introduction to Big Data

    Nothing is static in this dynamic world...


So is the world of computers. In particular, the advanced and active research on the Internet is the biggest contributor to this.


For reference, check the Internet usage statistics,


    Image Courtesy: Wikipedia
Now, with the advancement of the Internet, data usage has also grown with time, thanks to different social media (like Facebook, Twitter, YouTube etc.).

Some statistics worth sharing: in every minute, you can find the following,


    • YouTube users upload 48 hours of video
    • Facebook users share 684,478 pieces of content
    • Instagram users share 3,600 new photos
    • Tumblr sees 27,778 new posts published.
    • The global Internet population now represents 2.1 billion people
    For more information, find below,

    Data usage every minute
      With every website browsed, status shared, and photo uploaded, we leave a digital trail that continually grows the hulking mass of big data.

So what? What am I supposed to know about all this?
From a user's perspective, you are not supposed to know anything. You are free to use Facebook, Twitter, YouTube or any other website you like. But from a developer's perspective, you have to know a little about these statistics, so that you can relate to them when the term Big Data comes up in an application and you have to handle it in your code.
OK, but what if I don't have to deal with Big Data in my application? I know my application is a small one; it has only a hundred customers and will not grow more than that. The data size will never exceed a few GB, and I am not interested in knowing what Facebook/Twitter/YouTube does for me.
Well, then you are free to ignore it; the choice is yours.

So, what am I supposed to get here?
Very elementary knowledge of Big Data, a little on the shortcomings of processing Big Data in RDBMS, a basic idea of the NoSQL evolution, the difference between NoSQL and RDBMS and the very basics of Big Data processing.
      OK, after that ?
If you are interested in Big Data, you can grow your knowledge from very basic to very advanced, and I am really no one to tell you how to do that. It solely depends on you how and where to do it.


Who are you to discuss Big Data?
Well, I am really not a well-known person in the Big Data field; I am a developer who is also learning Big Data and its processing. I don't dare to teach you anything; I just want to share what I am learning day by day, and this blog is just like an online diary to maintain that technical learning.


If it's your technical diary, then why are you sharing it?
The answer is here.

Palash Kanti Kundu