Big Data: Jargon Dictionary, and How the Hadoop Algorithm Solves the Data Problem

Posted: May 24, 2012 in BIGDATA, LINUX *NIX

There is a lot of jargon around Big Data. Here is a fast-track version (my version) of how Hadoop's sophisticated MapReduce algorithm solves the big data problem, with an example under each piece of jargon.

I use the case of aggregating syslog data coming from thousands of Cisco ASA appliances and then performing data analysis on it using Hadoop MapReduce. The ASA log in this example is 2 TB in size, processed in 2 GB slices!!

1. HDFS (core component of Hadoop): the Hadoop Distributed File System.

2. Hadoop: the Java-based framework itself; it stores its data on HDFS.

3. HBase (core component of Hadoop): provides fast key-value lookups; built on top of HDFS.

4. Sqoop: data integration from SQL databases to Hadoop.

5. Pig: developed by Yahoo to analyse big data sets. The Pig programming language is designed to handle any kind of data, hence the name! Pig is made up of two components: the first is the language itself, called Pig Latin, and the second is a runtime environment where Pig Latin programs are executed. Think of the relationship between a JVM and a Java application.

6. Hive: SQL over Hadoop.

7. MapReduce: Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. It is used for processing structured, semi-structured and unstructured data, and its output is human-readable. It works with key-value pairs and has two major steps, Map and Reduce (walked through below).
Input: reads data from storage (files, databases, images, logs). The input key is the file name (or may be null); the input value is the line itself. Let's say we have a large dataset of ASA firewall log files:

----- input log file: asa.log-2TB.input -----
20111011 /urlYYY    #line 1
20111012 /urlYYY    #line 2
20111011 /urlZZZ    #line 3
Slicer: chops the big data (damn this buzzword) up into smaller chunks. Since the asa.log file arrives at 2 TB in size, it is chunked into smaller pieces, which are then written to HDFS:

----- slice asa.log-2TB.input into smaller chunks -----
asa.log.2gig-slice01.log
asa.log.2gig-slice02.log
asa.log.2gig-slice03.log
asa.log.2gig-slice04.log ... and so on
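
To make the slicing concrete, here is a minimal Python sketch of the idea. This is an illustration only: in real Hadoop, HDFS blocks and InputSplits do this for you (and, unlike this naive version, they respect record boundaries). The file names and the 2 GB chunk size are just this article's example.

----- slicer.py (illustrative sketch) -----
# Conceptual only: HDFS/InputSplit normally handles this for you.
CHUNK_SIZE = 2 * 1024 ** 3  # 2 GiB per slice (example value)

def slice_log(path="asa.log-2TB.input", chunk_size=CHUNK_SIZE):
    with open(path, "rb") as big:
        n = 1
        while True:
            chunk = big.read(chunk_size)
            if not chunk:
                break
            # naive: may split a log line across two slices
            with open("asa.log.2gig-slice%02d.log" % n, "wb") as out:
                out.write(chunk)
            n += 1

if __name__ == "__main__":
    slice_log()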

Map: transforms the input data into key-value pairs for easy processing and aggregation. Mapping of the data happens here, running over every slice (data set #1, #2, #3 ... #n) in parallel:

----- map asa.log.2gig-slice*.log, data sets #1..2..3..n -----
20111011 /urlYYY    #line 1
20111012 /urlYYY    #line 2
20111011 /urlZZZ    #line 3

Total 3 lines, but the mapped data becomes something like this:

Key1: urlYYY value: 20111011
Key2: urlYYY value: 20111012
Key3: urlZZZ value: 20111011

Still the above 3 lines, but mapped into key-value pairs (key and value).
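
Here is a minimal mapper sketch in the Hadoop Streaming style (read lines on stdin, emit tab-separated key-value pairs on stdout). The "date /url" line format is this article's simplified example, not a real ASA syslog format.

----- mapper.py (sketch, Hadoop Streaming style) -----
import sys

# Input lines look like: "20111011 /urlYYY" (simplified example format).
# Emits "url<TAB>date" so the URL becomes the key and the date the value.
for line in sys.stdin:
    parts = line.split()
    if len(parts) < 2:
        continue  # skip malformed lines
    date, url = parts[0], parts[1]
    print("%s\t%s" % (url.lstrip("/"), date))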

Partition or Sort: sorts the output of the Map operation before transferring it to the reducers (basically the Linux sort command!). Here we simply sort the above output so the data is ready for further processing:

----- ready to transfer: asa.log.sorted -----
Key1: urlYYY value: 20111011
Key2: urlYYY value: 20111012
Key3: urlZZZ value: 20111011
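
Outside a cluster, you can simulate this shuffle-and-sort phase with a plain sort by key; a sketch:

----- sort.py (sketch) -----
import sys

# Reads "key<TAB>value" lines and prints them sorted by key,
# the same effect as sort -k1,1 on the command line.
pairs = [line.rstrip("\n").split("\t", 1)
         for line in sys.stdin if line.strip()]
for key, value in sorted(pairs, key=lambda kv: kv[0]):
    print("%s\t%s" % (key, value))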
Reduce: aggregates the data. Here we process the sorted data and hand it over to the merger:

----- aggregate now: asa.log.aggregated -----
Key1: urlYYY value: 20111011, 20111012 (from data set #1)
Key3: urlZZZ value: 20111011 (from data set #1)
Key9: urlYYY value: 20111011, 20111012 (from data set #2)

You see, Key1 and Key2 (both urlYYY) are reduced down to one line. The new line, Key9, came off the other data set, #2.
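
A matching reducer sketch: because its input arrives already sorted by key, it can collect all the dates for one URL just by grouping consecutive lines:

----- reducer.py (sketch, Hadoop Streaming style) -----
import sys
from itertools import groupby

# Expects "url<TAB>date" lines already sorted by url.
def parse(lines):
    for line in lines:
        if line.strip():
            yield line.rstrip("\n").split("\t", 1)

for url, group in groupby(parse(sys.stdin), key=lambda kv: kv[0]):
    dates = sorted({date for _, date in group})  # de-duplicate dates
    print("%s\t%s" % (url, ", ".join(dates)))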
Merger: merges the outputs of two or more Reduces. Since we have a large asa.log file (2 TB of it), we have many data sets, and there will be duplicate keys across them after the reduce function above. We merge those duplicates down into one:

----- merge: asa.log.merged -----
Key1: urlYYY value: 20111011, 20111012
Key3: urlZZZ value: 20111011

You noticed: Key1 and Key9 (from the reduce step) are now merged into one. That is how Hadoop's smart processing reduces the storage required for the big data problem.
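
The merge can be sketched the same way: feed the outputs of several reducers through one more group-by-key pass, so duplicate keys from different data sets (Key1 and Key9 above) collapse into a single line:

----- merger.py (sketch) -----
import sys
from collections import defaultdict

# Each input line: "url<TAB>date1, date2, ..." from one reducer's output.
merged = defaultdict(set)
for line in sys.stdin:
    if not line.strip():
        continue
    url, dates = line.rstrip("\n").split("\t", 1)
    merged[url].update(d.strip() for d in dates.split(","))

for url in sorted(merged):
    print("%s\t%s" % (url, ", ".join(sorted(merged[url]))))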
Output: writes the output of the MapReduce job to disk in human-readable format:

----- output: asa.log.output -----
Key1: urlYYY value: 20111011, 20111012
Key3: urlZZZ value: 20111011

Now programmers can use the Pig programming language to write programs that dive deeper into the data.
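
Putting the sketches together, the whole flow can be simulated locally before moving to a real cluster. This little driver is hypothetical glue (mapper.py and reducer.py are the sketches above), chaining map, sort and reduce over one slice the way Hadoop chains them across the cluster:

----- pipeline.py (hypothetical local simulation) -----
import subprocess

# Chain: slice -> map -> sort -> reduce, all on one machine.
with open("asa.log.2gig-slice01.log", "rb") as src, \
     open("asa.log.output", "wb") as dst:
    mapper = subprocess.Popen(["python", "mapper.py"],
                              stdin=src, stdout=subprocess.PIPE)
    sorter = subprocess.Popen(["sort", "-k1,1"],
                              stdin=mapper.stdout, stdout=subprocess.PIPE)
    reducer = subprocess.Popen(["python", "reducer.py"],
                               stdin=sorter.stdout, stdout=dst)
    mapper.stdout.close()  # let sorter see EOF when mapper exits
    sorter.stdout.close()
    reducer.wait()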

8. Use Case: who uses Hadoop MapReduce, and where?

LinkedIn: if you are a LinkedIn user, you may have noticed the "people you may know" section. That is a hint of Hadoop MapReduce processing at work.
eHarmony: finding your matches (weekend fun).
Facebook: the recommendations section.
Yahoo: a big user; search assist and the normal search engine.