Hadoop, which was named after a toy elephant, is a serious framework for doing distributed data processing across thousands of nodes and petabytes of data. It’s loosely based on some big time Google projects, MapReduce, Big Table, and Google File System. A key thing to understand about Hadoop is that it’s not just one monolithic project. Hadoop has a number of different pieces that all work together to help you crunch some very big data.
Hadoop is all about processing data, and Hadoop’s MapReduce engine is what makes all this processing possible. MapReduce is a process that Google came up with for churning through petabytes of data on commodity hardware. Here’s how it works.
There is a master node that takes a problem and divides it into smaller problems that can be handled by slave nodes. This is the “Map” step. As the slave nodes finish their smaller problems, they hand their answers back to the master node, and then the master node combines all these smaller answers to get the initial problem’s solution. This is called the “Reduce” step.
Now to support all of this data processing and map reducing, you have to store your data somewhere, and Hadoop is ready to help in that department. If you were processing a large number of files, you would use HDFS (Hadoop Filesytem). HDFS is an official Hadoop subproject, and it provides you with a highly fault tolerant filesystem that is distributed over a multitude of nodes with replication builtin. Now, you have a distributed fault tolerant way to convert 50PB of images from their originals to thumbnails or some other other file format entirely!
Not all your data is files, so Hadoop also has a database called HBase. HBase is the official Hadoop database. You can use HBase to store data like our user comments and come up with answers to how many likes your users have gotten. Now this is easy to do with a SQL database when it’s relatively small, but when you are Twitter or Facebook, SQL just isn’t going to cut it.Read More...