
There is a lot of jargon around big data. Here is a fast-track version (my version) of how Hadoop MapReduce tackles the big data problem, with an example for each piece of jargon.

The use case is aggregating SYSLOG data coming from thousands of Cisco ASA appliances and then analysing it with Hadoop MapReduce. The ASA log in this example is 2 TB in size!

Keyword / Framework: Description



HDFS (core component of Hadoop): Hadoop Distributed File System.


Hadoop (the Down): Java-based code that stores data on HDFS.


HBase (part of the Hadoop ecosystem): provides fast key-value lookups and is built on top of HDFS.


Sqoop: data integration from SQL databases to Hadoop.


Pig programming: Developed by Yahoo to analyse big data sets. The Pig programming language is designed to handle any kind of data, hence the name! Pig is made up of two components: the first is the language itself, which is called Pig Latin, and the second is a runtime environment where Pig Latin programs are executed. Think of the relationship between a JVM and a Java application.


Hive: SQL over Hadoop.


MapReduce: Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. It is used for processing structured, semi-structured and unstructured data. Its output is human readable. It works with key-value pairs and has two major steps: Map and Reduce (see below).
Input: Reads data from storage: files, databases, images, logs.
Input key: the file name, or it could be null.
Input value: the line itself.
Example: let’s say we have a large dataset of ASA firewall log files.
----- Input log file asa.log-2TB.input -----
20111011 /urlYYY #Line1
20111011 /urlZZZ #Line3
20111012 /urlYYY #Line2
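A minimal Python sketch of the Input step, assuming each log line is just a date and a URL path. The function name read_records and the in-memory sample are my own illustration, not part of Hadoop; the key is left as None, as described above.

```python
import io

def read_records(stream):
    """Yield (key, value) input records: key is None, value is the raw line."""
    for line in stream:
        line = line.rstrip("\n")
        if line:  # skip blank lines
            yield None, line

# In-memory stand-in for asa.log-2TB.input (assumed sample data, not a real file).
sample = io.StringIO("20111011 /urlYYY\n20111011 /urlZZZ\n20111012 /urlYYY\n")
records = list(read_records(sample))
print(records)
```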
Slicer: Chops the big data (damn this buzzword) into smaller chunks. If the asa.log file arrives at 2 TB in size, it is chunked into smaller pieces!
----- slice asa.log-2TB.input into smaller chunks -----
asa.log.2gig-slice01.log

asa.log.2gig-slice04.log
and so on, and then writes them to HDFS.
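The slicing idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: it slices by line count, whereas HDFS splits files into fixed-size blocks of bytes, and slice_lines is a hypothetical helper name.

```python
def slice_lines(lines, lines_per_slice):
    """Split a list of log lines into consecutive slices of bounded size."""
    return [lines[i:i + lines_per_slice]
            for i in range(0, len(lines), lines_per_slice)]

# Assumed sample data standing in for the big asa.log file.
log_lines = [
    "20111011 /urlYYY",
    "20111011 /urlZZZ",
    "20111012 /urlYYY",
]
slices = slice_lines(log_lines, lines_per_slice=2)
for number, chunk in enumerate(slices, start=1):
    print(f"slice{number:02d}: {chunk}")
```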

Map: Transforms the input data into key-value pairs for easy processing, ready to be aggregated. Here is how it works.
Example 1: with multiple datasets, the mapping of data happens here.
----- transform asa.log.2gig-slice*.log, dataset #1..2..3..n -----
20111011 /urlYYY #Line1
20111011 /urlZZZ #Line3
20111012 /urlYYY #Line2
There are 3 lines in total, but the mapped data becomes something like this:

Key1: urlYYY value: 20111011
Key2: urlYYY value: 20111012
Key3: urlZZZ value: 20111011

These are still the 3 lines above, but mapped into key-value pairs.
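The mapping above can be sketched in Python like this. It assumes every line is "date /url" separated by whitespace; map_line is a hypothetical name of mine, and real Hadoop mappers are usually written in Java.

```python
def map_line(line):
    """Turn one 'date /url' log line into a (url, date) key-value pair."""
    date, path = line.split()
    return path.lstrip("/"), date

# Assumed sample lines in the same shape as the ASA log example.
log_lines = [
    "20111011 /urlYYY",
    "20111011 /urlZZZ",
    "20111012 /urlYYY",
]
mapped = [map_line(line) for line in log_lines]
for key, value in mapped:
    print(f"key: {key} value: {value}")
```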

Partition or Sort: Sorts the output of the Map operation before it is transferred to the reducers. Here we simply sort the above output so it can be handed on for further processing.
----- ready to transfer: asa.log.sorted -----
Key1: urlYYY value: 20111011
Key2: urlYYY value: 20111012
Key3: urlZZZ value: 20111011
Basically the Linux sort -n command!
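In Python, the sort step is a one-liner. A rough sketch, assuming the mapped pairs are (url, date) tuples; sorting tuples compares the key first and then the value, so pairs with the same key end up next to each other.

```python
# Assumed map output: (key, value) pairs in arrival order.
mapped = [
    ("urlYYY", "20111011"),
    ("urlZZZ", "20111011"),
    ("urlYYY", "20111012"),
]

# Sort by key, then by value, so equal keys become adjacent for the reducer.
sorted_pairs = sorted(mapped)
for key, value in sorted_pairs:
    print(f"key: {key} value: {value}")
```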
Reduce: Aggregates the data. Here we process the sorted data so it can be handed over to the merger.
----- aggregate now: asa.log.aggregated -----
Key1: urlYYY value: 20111011, 20111012 (from dataset1)
Key3: urlZZZ value: 20111011 (from dataset1)
Key9: urlYYY value: 20111011, 20111012 (from dataset2)
You see, Key1 and Key2 are reduced down to one line. The new line, Key9, came from the other dataset (dataset2).
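A small Python sketch of what the Reduce step does to one sorted dataset, assuming the input is already sorted by key. The name reduce_pairs is hypothetical; it simply collects every value that shares a key into one list.

```python
from itertools import groupby
from operator import itemgetter

def reduce_pairs(sorted_pairs):
    """Group the values of adjacent pairs that share the same key."""
    return {
        key: [value for _, value in group]
        for key, group in groupby(sorted_pairs, key=itemgetter(0))
    }

# Assumed sorted map output for a single dataset.
sorted_pairs = [
    ("urlYYY", "20111011"),
    ("urlYYY", "20111012"),
    ("urlZZZ", "20111011"),
]
reduced = reduce_pairs(sorted_pairs)
print(reduced)
```

Note that groupby only groups neighbouring items, which is exactly why the sort has to happen first.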
Merger: Merges two or more Reduce outputs. Since we have a large asa.log file (2 TB of it), we will have many datasets, and there will be duplicate entries across them after the Reduce step above. We merge those duplicates into one.
----- merge: asa.log.merge -----
Key1: urlYYY value: 20111011, 20111012
Key3: urlZZZ value: 20111011
Notice that Key1 and Key9 (from the Reduce step) are now merged into one. That is how Hadoop's smart processing reduces the storage required for the big data problem.
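Here is a rough Python sketch of the merge, assuming each Reduce output is a dict of key to list of values. The function merge_outputs is my own illustration of the idea, not a Hadoop API; it drops duplicate values for the same key across datasets.

```python
def merge_outputs(*reduced_outputs):
    """Merge several reduce outputs, removing duplicate values per key."""
    merged = {}
    for output in reduced_outputs:
        for key, values in output.items():
            merged.setdefault(key, set()).update(values)
    # Return keys and values in sorted order for stable, readable output.
    return {key: sorted(values) for key, values in sorted(merged.items())}

# Assumed reduce outputs from two datasets, as in the example above.
dataset1 = {"urlYYY": ["20111011", "20111012"], "urlZZZ": ["20111011"]}
dataset2 = {"urlYYY": ["20111011", "20111012"]}
merged = merge_outputs(dataset1, dataset2)
print(merged)
```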
Output: Writes the output of MapReduce to disk in human-readable format.
----- output: asa.log.output -----
Key1: urlYYY value: 20111011, 20111012
Key3: urlZZZ value: 20111011
Now programmers can use the Pig programming language to write programs that dive deeper into the data.
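The Output step can be sketched in Python as writing one readable line per key. This assumes the merged result is a dict of key to list of dates; write_output is a hypothetical helper, and it writes to an in-memory buffer here rather than to HDFS. The tab between key and value mirrors the default separator of Hadoop's text output.

```python
import io

def write_output(merged, stream):
    """Write one 'key<TAB>value, value' line per key, in key order."""
    for key in sorted(merged):
        stream.write(f"{key}\t{', '.join(merged[key])}\n")

# Assumed merged result from the previous step.
merged = {"urlYYY": ["20111011", "20111012"], "urlZZZ": ["20111011"]}
buffer = io.StringIO()  # stand-in for a real output file
write_output(merged, buffer)
output_text = buffer.getvalue()
print(output_text, end="")
```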


Use case: who uses Hadoop MapReduce, and where
LinkedIn: if you are a LinkedIn user, you may have noticed the section “you may also know these people”. That comes from Hadoop MapReduce processing.
eHarmony: find your matches (weekend fun)
Facebook: recommendation section
Yahoo: big user, for search assist and the normal search engine

If you live in Australia you have probably heard about the mining boom, and before that there was the famous “DotCOM bubble” way back in 199x. Recently, the new buzzword “BigData” in the IT data mining space seems to have become a technology trend for many enterprises and telcos. They want big data implemented to solve their traditional issues ASAP. As we all know, cloud computing has made life so much easier, and now it looks like we’ll be doing clouds forever! At the end of the day, who wants to wait for their servers to be delivered, racked and stacked, and then built? It takes forever! This is a new era in the application space after the cloud boom, one that brings new IT jobs and solves the traditional problem of big data sets. Huh? New jobs! Good for the IT industry; we’ll never be out of a job.

Most big data products are built around Hadoop, Splunk or Cloudera and use smart algorithms to index the data and present it to IT analysts in a human-readable format. Hadoop dominates the big data space. Splunk and Cloudera are both top products available today, and Cloudera's distribution is built on Hadoop.

I have some experience with Hadoop deployment. A year ago (before this bubble started) I had the opportunity to deploy a 37-node Hadoop cluster for parallel processing of unstructured data. I know the hurdles and challenges we went through deploying in this domain. I think the Hadoop code has matured over time. IT engineers who work in the big data domain have the following titles/roles:
1. DATA SCIENTIST  (probably eq. to CCIE?)
2. DATA ANALYST (probably eq. to CCNA)

Data scientist: you must be joking! Nope! Keep reading…
Here are my thoughts on these newly defined roles for folks who are or will be working in the big data domain. Who are these data scientists working on big data? Industry-accelerated PhDs? People with multiple master’s degrees? Folks who have no university degree? Well, the simple answer is “anybody”: you don’t need a PhD to get the title of “Scientist”. Funny, but it is very true! Personally I love the job title “Data Scientist”. It has certainly made job titles glamorous for the folks who are really smart and work in the IT industry without any degree! It has given both name and fame to the role. Don’t get me wrong here, but many organizations have started hiring data scientists to solve their structured and unstructured data problems. Mostly, data scientists work on futuristic products. New product development needs data to correlate, and those inputs come from big data correlation.
DJ Patil, the co-inventor of this term, defines a data scientist as:
“A data scientist is that unique blend of skills that can both unlock the insights of data and tell a fantastic story via the data.”
Jake Porway, of Data without Borders and the New York Times, defines it as:
“A data scientist is a rare hybrid, a computer scientist with the programming abilities to build software to scrape, combine, and manage data from a variety of sources and a statistician who knows how to derive insights from the information within. S/he combines the skills to create new prototypes with the creativity and thoroughness to ask and answer the deepest questions about the data and what secrets it holds.”
A data scientist needs strong data skills, strong knowledge of statistics and the ability to program algorithms.
Who invests in BigData products? Big banks, telcos, managed IT service companies, etc.
Data scientists are not cheap either! Obviously, you get what you pay for.

Data Analyst: Not every big data problem needs a data scientist. If you are just starting out in the big data domain, you might have to start your career as a data analyst. Remember your time in a NOC engineer role: the first few months on the job, the night shifts, and the time spent staring at the screen for SNMP/SYSLOG traps? Well, this role is a somewhat upgraded version of that, but you have the opportunity to become a data scientist, not to mention the opportunity to learn from data scientists as well. Every analyst needs to be able to tell and sell the story that comes out of big data analysis. A data analyst is not expected to have the programming skills to build algorithms, but needs strong SQL skills in addition to a good understanding of analytics packages. Typically a data analyst is cheaper than a data scientist.


1. Online Platform companies.
2. Content sites
3. Big banks: fraud detection, app logs, data correlation, et al.
4. Parallel processing of data that cannot be processed by traditional databases (SQL, Oracle, Informix, et al.)
5. Share market, and the list goes on! Infinite possibilities.
Not to mention that the original users (or abusers) of Hadoop have been using it for ages! Yes: LinkedIn, Facebook, Google, Yahoo.
Please leave your comments. What do you think of this new buzzword? In the next post (when? probably when this bubble is gone), I will cover building a career in the big data domain.

Cheers, Push
2xCCIE (Voice/Security)