There is a lot of jargon around Big Data. Here is a fast-track version (my version) of how Hadoop MapReduce's algorithm solves the big data problem, with an example for each piece of jargon.

The use case: aggregate SYSLOG data coming from thousands of Cisco ASA appliances, then perform data analysis on it using Hadoop MapReduce. The ASA log in this example is 2 TB in size, which Hadoop will slice into 2 GB chunks!!
| SR# | Keyword / Framework | Explanation |
|---|---|---|
| 1 | HDFS (core component of Hadoop) | Hadoop Distributed File System |
| 2 | Hadoop | The Java-based framework itself; stores its data on HDFS |
| 3 | HBase (core component of Hadoop) | Provides fast key-value lookups; built on top of HDFS |
| 4 | Sqoop | Data integration from SQL databases into Hadoop |
| 5 | Pig | Developed by Yahoo to analyse big data sets. Pig is designed to handle any kind of data, hence the name! Pig is made up of two components: the first is the language itself, called PigLatin, and the second is a runtime environment where PigLatin programs are executed. Think of the relationship between a JVM and a Java application. |
| 6 | Hive | SQL over Hadoop |
| 7 | MapReduce | A programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. It processes structured, semi-structured and unstructured data, its output is human-readable, and it works with key-value pairs through two major steps, Map and Reduce (see the walkthrough below). |

Here is how the MapReduce pipeline chews through our 2 TB of ASA logs, step by step:
**Input** – Reads from data storage: files, databases, images, logs. The input key might be the file name, the position in the file, or even null; the input value is the line itself. Let's say we have a large dataset of ASA firewall log files:

    ---- input log file asa.log-2TB.input ----
    20111011 /urlYYY    #line 1
    20111011 /urlZZZ    #line 2
    20111012 /urlYYY    #line 3
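For the record, here is what the Input step hands to your code in Hadoop's Java API: with the stock `TextInputFormat`, the key a mapper actually receives is the line's byte offset within the file (not the file name), and the value is the line itself. A minimal skeleton, with a class name of my own choosing:

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With TextInputFormat (Hadoop's default for plain-text files):
//   key   = LongWritable -> byte offset of the line within the file
//   value = Text         -> the line itself, e.g. "20111011 /urlYYY"
public class AsaLogMapperSkeleton extends Mapper<LongWritable, Text, Text, Text> {
    // map() runs once per input line; the Map step below fills in the body.
}
```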
**Slicer** – Chops the big data (damn this buzzword) up into smaller chunks. Our asa.log file arrives at 2 TB in size, so it is chunked into smaller pieces and written onto HDFS:

    ---- slice asa.log-2TB.input into smaller chunks ----
    asa.log.2gig-slice01.log
    asa.log.2gig-slice02.log
    asa.log.2gig-slice03.log
    asa.log.2gig-slice04.log
    ... and so on
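In real Hadoop there is no component literally called a "Slicer": the chopping is done by HDFS blocks plus MapReduce input splits. A hedged sketch of how you could force roughly 2 GB slices (the HDFS path is hypothetical):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SlicerSketch {
    public static void configure(Job job) throws Exception {
        // Point the job at the (hypothetical) 2 TB log file on HDFS.
        FileInputFormat.addInputPath(job, new Path("/logs/asa.log-2TB.input"));
        // Split size is max(minSize, min(maxSize, blockSize)); raising the
        // minimum above the default 128 MB HDFS block size yields ~2 GB
        // slices, one map task per slice (~1000 mappers for a 2 TB file).
        FileInputFormat.setMinInputSplitSize(job, 2L * 1024 * 1024 * 1024);
    }
}
```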
**Map** – Transforms the input data into key-value pairs for easy processing, so it can be aggregated later. Here is how the algorithm works on multiple datasets; the mapping happens per slice:

    ---- map asa.log.2gig-slice*.log, dataset #1..2..3..n ----
    key: urlYYY   value: 20111011    (#line 1)
    key: urlZZZ   value: 20111011    (#line 2)
    key: urlYYY   value: 20111012    (#line 3)

Still the same three lines, but mapped into key-value pairs.
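A minimal mapper sketch for the example above, assuming each log line looks like `20111011 /urlYYY` (the class and field names are mine):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AsaLogMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text url  = new Text();
    private final Text date = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        // "20111011 /urlYYY"  ->  key: urlYYY, value: 20111011
        String[] parts = line.toString().trim().split("\\s+");
        if (parts.length < 2) return;             // skip malformed lines
        date.set(parts[0]);
        url.set(parts[1].replaceFirst("^/", "")); // drop the leading slash
        ctx.write(url, date);
    }
}
```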
**Partition / Sort** – Sorts the output of the Map operation before transferring it to the reducers. Here we simply sort the output above so it is ready to hand over for further processing:

    ---- ready to transfer, asa.log sorted ----
    key: urlYYY   value: 20111011
    key: urlYYY   value: 20111012
    key: urlZZZ   value: 20111011

The Linux `sort -n` command, basically!
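The sort happens inside Hadoop's shuffle; the part you can control is which reducer each key lands on. A sketch that mirrors the logic of Hadoop's default `HashPartitioner`:

```java
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// All records with the same key (url) hash to the same reducer, and
// within each reducer the framework sorts records by key before
// reduce() runs.
public class UrlPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text url, Text date, int numReducers) {
        return (url.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```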
**Reduce** – Aggregates the values per key. Here the sorted data is processed and handed over to the merger:

    ---- aggregate now, asa.log aggregated ----
    key: urlYYY   values: 20111011, 20111012   (from dataset #1)
    key: urlZZZ   values: 20111011             (from dataset #1)
    key: urlYYY   values: 20111011, 20111012   (from dataset #2)

You see, the two urlYYY lines from dataset #1 are reduced down to one line. The new urlYYY line came off the other dataset #2.
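A matching reducer sketch: the framework has already grouped the sorted records, so `reduce()` receives each url once with all of its dates. It splits incoming values on commas and deduplicates, so the same class can also serve as the "Merger" below:

```java
import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AsaLogReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text url, Iterable<Text> dates, Context ctx)
            throws IOException, InterruptedException {
        // Collect distinct dates, sorted: urlYYY -> "20111011, 20111012".
        // Values are split on commas so already-merged values can safely
        // pass through this method a second time (combiner-friendly).
        Set<String> uniq = new TreeSet<>();
        for (Text d : dates) {
            for (String token : d.toString().split(",\\s*")) {
                uniq.add(token);
            }
        }
        ctx.write(url, new Text(String.join(", ", uniq)));
    }
}
```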
**Merger** – Merges two or more outputs of Reduce. Since our asa.log file is large (2 TB of it), we have many datasets, and there will be duplicate keys across them after the Reduce step; those duplicates are merged into one:

    ---- merge, asa.log.merge ----
    key: urlYYY   values: 20111011, 20111012
    key: urlZZZ   values: 20111011

You noticed: the two urlYYY entries from the Reduce step are now merged into one. That is how Hadoop's smart processing reduces the storage required for the big data problem.
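Strictly speaking, stock Hadoop has no separate user-visible "Merger": the shuffle already guarantees that all values for one key meet at a single reducer, and the reducer above deduplicates the dates, so keys never stay split across outputs. The closest knob you control yourself is a combiner, which pre-merges duplicate keys on the map side before data crosses the network. A sketch (reusing the reducer is safe here only because it splits and dedupes its input values):

```java
import org.apache.hadoop.mapreduce.Job;

public class MergerSketch {
    // Pre-merge duplicate urls on the map side; the full merge still
    // happens in the shuffle and in the reducer itself.
    public static void enableCombiner(Job job) {
        job.setCombinerClass(AsaLogReducer.class);
    }
}
```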
**Output** – Writes the output of the MapReduce job to disk in human-readable format:

    ---- output, asa.log.output ----
    key: urlYYY   values: 20111011, 20111012
    key: urlZZZ   values: 20111011

Now programmers can use the Pig programming language to write programs that dive deeper into the data.
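Pulling the whole pipeline together: a hedged sketch of a job driver that wires up the mapper, partitioner, combiner and reducer from the sketches above and writes human-readable text output (all paths and class names are from those sketches, i.e. hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class AsaLogDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "asa-log-url-dates");
        job.setJarByClass(AsaLogDriver.class);

        job.setInputFormatClass(TextInputFormat.class);    // Input step
        job.setMapperClass(AsaLogMapper.class);            // Map step
        job.setPartitionerClass(UrlPartitioner.class);     // Partition/Sort step
        job.setCombinerClass(AsaLogReducer.class);         // map-side "Merger"
        job.setReducerClass(AsaLogReducer.class);          // Reduce step
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);  // human-readable Output

        FileInputFormat.addInputPath(job, new Path("/logs/asa.log-2TB.input"));
        FileOutputFormat.setOutputPath(job, new Path("/logs/asa.log.output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```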
**8. Use Case – who uses Hadoop MapReduce and where**

- LinkedIn – if you are a LinkedIn user, you may have noticed the "People you may know" section. That is a hint of Hadoop MapReduce processing at work.
- eHarmony – finding your matches (weekend fun).
- Facebook – the recommendations section.
- Yahoo – a big user: search assist and its regular search engine.