Wednesday, November 30, 2011

What is Big Data

A new buzzword which is catching on nowadays; all the top-notch companies are eager to showcase their different offerings in the BigData domain.

Did I say "Domain"? Yes, BigData is a domain and not a technology.

I am fortunate enough to have been working in this domain for nearly a year, keeping track of the updates happening on the technology front and the market dynamics which are leading most of the big companies to adopt the BigData domain as their next business strategy.

Dealing with different kinds of BigData clients has given me a different exposure to the domain, the teams and, of course, the technology too.

Going forward in this article I have laid out my views about BigData, which should help not only the newbies but also the existing folks to understand this domain, and should also provide a perspective which one can adopt for facing the challenges thrown up by BigData.

A brief about BigData: -

BigData as such is not a technology; it is a domain which requires a much deeper understanding of the problems, solutions and technology, and that too from a perspective far different from normal web-based architectures.

As the name itself defines - "large data" - and it really is large, beyond the capability of a normal web-based architecture to handle.

BigData spans 3 different dimensions, popularly known as the 3 V's of BigData: -

1. Volume - BigData is really big, really large, with sizes ranging from TB's to PB's to EB's.
2. Velocity - It's really time sensitive; most of the time it is streaming and needs to be analyzed as quickly as possible, so that maximum value can be extracted from it.
3. Variety - It extends beyond structured data and also includes unstructured data of all varieties: text, audio, video, click streams, log files and more.

The 4th "V" -
Recently it was heard that a 4th "V" has been added - "Veracity" - meaning "doubtful data".
It denotes data whose authenticity or correctness is in doubt, or which cannot be validated.

Though somewhere it reflects and extends the concepts of data warehousing, which in primitive days (not more than 2 years back, but 2 years is almost the stone age for IT) used ETL concepts and high-priced data warehousing solutions like Informatica, Pentaho, IBM InfoSphere etc.

BigData is much more than the concepts of Data warehousing/ ETL.

It has taken these concepts to different heights, where analytical capabilities can be applied to continuous, free-flowing structured/un-structured data sized in PB's or EB's.

Not only have the technologies used the capabilities of parallel processing, they have also provided data processing and analytics in much shorter time, and that too on commodity boxes (leveraging cloud deployments), removing the need for dedicated boxes or data centers.

The BigData technology stack aims to provide scalable solutions where computations can be parallelized and distributed over thousands of nodes, utilizing the processing power of each commodity box, and at the same time is capable of detecting and surviving failures.
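That scatter/gather pattern can be sketched on a single machine with Python's standard library; `process_chunk` and `merge` below are hypothetical stand-ins for per-node work and the final combine step, while the real frameworks add distribution across machines and failure recovery on top:

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Hypothetical per-node task: count words in one slice of the data.
    counts = {}
    for word in chunk.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def merge(partials):
    # Gather step: combine the partial counts produced by every worker.
    total = {}
    for partial in partials:
        for word, n in partial.items():
            total[word] = total.get(word, 0) + n
    return total

chunks = ["big data is big", "data keeps growing"]
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(process_chunk, chunks))
result = merge(partials)
print(result)  # {'big': 2, 'data': 2, 'is': 1, 'keeps': 1, 'growing': 1}
```

The key design point is that each chunk is processed independently, which is exactly what lets the real stacks spread the work over thousands of commodity boxes.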

If this is not enough - all of this is available in the open source world, which has tremendously brought down the cost of the overall solution.

Technology used in BigData: -

As said earlier - BigData solutions are built on a carefully chosen technology stack, and above all it is really important to fit the right technology at the right time. Using a technology just to make someone happy, or just because one knows it, ruins the whole solution and adds complexity which is really hard to get rid of later in the implementation.

Still, a few of the prominent technology players who have made it possible are: -

Hadoop - Based on the M/R (MapReduce) concept developed by Google, Hadoop has emerged as a leading solution for large data processing. It was developed under Apache and named "Apache Hadoop", but at the same time many companies chimed in and provided their own compatible stacks of technologies over their own Hadoop variants (extending Apache Hadoop itself),
e.g. Cloudera, Yahoo (recently Yahoo spun off a new company by the name of Hortonworks for its BigData offerings) etc.
There is much more to talk about Hadoop - its architecture, tuning, HDFS etc. - but I will talk more about this in my next article.
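The M/R idea itself can be sketched in a few lines of Python; the functions and sample input below are mine - a toy single-machine illustration of the map and reduce phases, not an actual Hadoop job:

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word seen in the input.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: the framework delivers pairs grouped by key; here we
    # sort ourselves, then sum the counts for each distinct word.
    for word, group in groupby(sorted(pairs), key=lambda p: p[0]):
        yield word, sum(n for _, n in group)

lines = ["hadoop stores large data", "hadoop processes large data"]
counts = dict(reducer(mapper(lines)))
print(counts)  # {'data': 2, 'hadoop': 2, 'large': 2, 'processes': 1, 'stores': 1}
```

In a real cluster the mapper runs on the nodes holding each block of data and the sorted shuffle between the two phases is done by the framework itself.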

NoSQL - I already explained a bit about NoSQL in my last article, and yes, it is already a part of the BigData technology stack.
Based on the concepts of Google Bigtable/Amazon Dynamo, it has not only provided a 10x increase in retrieval and insertion speed but also a different and unique approach for varied kinds of needs.

Some of the popular NoSQL DB's are: -

1. Cassandra - Please refer to my last article

2. HBase - Built over HDFS (the Hadoop Distributed File System), leveraging HDFS itself as its underlying data store, it provides tight integration with Hadoop and enjoys many of the tuning parameters which Hadoop uses.

3. GraphDB's - As the name itself defines - a graph database uses graph-like structures with nodes, edges, and properties to represent and store data.
A graph database is any storage system that provides index-free adjacency. Nodes are very similar in nature to the objects that we use in object-oriented programming.
Graph databases don't have rigid schemas and are popularly used to model schemas which evolve over a period of time.
They do not require expensive join operations, and thus can scale more naturally with large data sets.
A typical use case would be the data model of a social networking website.
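Index-free adjacency can be illustrated with a toy structure where each node keeps direct references to its neighbours; this is a sketch of the concept in Python, not any particular GraphDB's API:

```python
class Node:
    def __init__(self, name, **properties):
        self.name = name
        self.properties = properties
        self.edges = []  # direct references to neighbours: index-free adjacency

    def connect(self, other, relation):
        # The edge lives on the node itself, so traversal needs no index lookup.
        self.edges.append((relation, other))

    def neighbours(self, relation):
        return [n for r, n in self.edges if r == relation]

# A tiny social-network model, the typical graph-database use case.
alice = Node("alice", city="Delhi")
bob = Node("bob")
carol = Node("carol")
alice.connect(bob, "friend")
bob.connect(carol, "friend")

# Friends-of-friends without any join: just follow the edges.
friends_of_friends = [n.name for f in alice.neighbours("friend")
                      for n in f.neighbours("friend")]
print(friends_of_friends)  # ['carol']
```

Notice there is no join and no schema: any node can carry any properties, which is why such models cope well with schemas that evolve over time.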

Couple of popular GraphDB's are: -

Neo4j, InfoGrid, hypergraphDB, Bigdata

4. Document DB - Databases which specialize in storing documents, i.e. semi-structured information in the form of documents.

Though the implementation of each document database differs, in general they all assume documents and encode them in some standard format/encoding. It could be XML, YAML, JSON, BSON, or binary formats like PDF, DOC, Excel etc.
In comparison to relational models they are less rigid and are not required to adhere to a standard schema.
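The schema-less document model can be sketched as a small in-memory store that encodes JSON documents of differing shapes; this is a toy illustration of the idea, not the CouchDB or MongoDB API:

```python
import json

class DocumentStore:
    """Toy document database: JSON-encoded documents keyed by id, no schema."""

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, doc):
        # Documents are stored as encoded JSON; no schema is enforced.
        self._docs[doc_id] = json.dumps(doc)

    def get(self, doc_id):
        return json.loads(self._docs[doc_id])

    def find(self, field, value):
        # Scan for documents where the field matches; documents that
        # lack the field entirely are simply skipped.
        return [json.loads(d) for d in self._docs.values()
                if json.loads(d).get(field) == value]

store = DocumentStore()
store.put("1", {"type": "user", "name": "sumit", "city": "Delhi"})
store.put("2", {"type": "log", "msg": "disk full"})  # different shape, same store
print(store.find("type", "user"))
```

The point of the example is the second `put`: two documents of completely different shapes live happily in the same store, which a relational table would not allow without schema changes.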

Couple of popular Document-DB's are: -

CouchDB, MongoDB

Connectors - There are several connectors provided to transfer data from a source and dump it into HDFS (the Apache Hadoop file system) for further processing.
For example, imagine web server logs producing millions of lines every minute; this data needs to be seamlessly transferred to HDFS for further processing and analytics.

Couple of popular Connectors are: -

Sqoop - Used to extract data from structured datastores.
Flume - Works in a distributed mode and is used for collecting, aggregating, and moving large amounts of log data.
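The collect-aggregate-ship pattern these connectors implement can be sketched as a batching loop; here the sink is a plain Python list standing in for an HDFS writer, and the class is a made-up illustration rather than Flume's actual API:

```python
class LogCollector:
    """Toy aggregator: buffers incoming log lines and flushes them in batches."""

    def __init__(self, sink, batch_size=3):
        self.sink = sink              # stands in for an HDFS writer
        self.batch_size = batch_size
        self.buffer = []

    def collect(self, line):
        self.buffer.append(line)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink.append(list(self.buffer))  # one "file" per batch
            self.buffer = []

hdfs = []  # pretend destination
c = LogCollector(hdfs, batch_size=2)
for line in ["GET /a 200", "GET /b 404", "POST /c 500"]:
    c.collect(line)
c.flush()  # ship whatever is left in the buffer
print(hdfs)  # [['GET /a 200', 'GET /b 404'], ['POST /c 500']]
```

Batching matters because HDFS prefers a smaller number of large files over millions of tiny ones; the real connectors add reliability, fan-in from many agents, and back-pressure on top of this basic loop.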

Workflows - The next thing in this technology stack is a workflow engine which can be used to schedule jobs (time based or on data availability), create workflows of M/R jobs, and also keep track of all these workflows.

Couple of popular Workflow Engines are: -
Oozie - Oozie is a scalable, reliable and extensible system; it works very well and hasn't been replaced by any other workflow engine till now.

Querying API's: -
Very well integrated with Hadoop, these provide an efficient way to query records lying on HDFS.

Couple of popular Querying API's are: -
Pig, Hive
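What Pig and Hive express declaratively - typically a filter plus a group-by over records lying on HDFS - boils down to something like the following sketch; the record fields here are invented purely for illustration:

```python
from collections import Counter

# Invented web-log records of the kind one might query on HDFS.
records = [
    {"page": "/home", "status": 200},
    {"page": "/buy",  "status": 200},
    {"page": "/home", "status": 500},
    {"page": "/home", "status": 200},
]

# Roughly: SELECT page, COUNT(*) FROM logs WHERE status = 200 GROUP BY page
hits = Counter(r["page"] for r in records if r["status"] == 200)
print(dict(hits))  # {'/home': 2, '/buy': 1}
```

The value of Pig/Hive is that this same filter-and-aggregate logic is compiled down to M/R jobs and run over TB's of records instead of a four-element list.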

Analytics - The most important piece of BigData technology Stack!

Analytics basically deals with predictions made by carefully analyzing past trends, e.g.:
1. Predicting the behavior of each user
2. Recommending various products to the user (see the recommendations provided by Amazon)
3. Analyzing the effectiveness of an advertisement
4. Analyzing public sentiment from social networking websites (Facebook, LinkedIn, G+, Orkut etc.)

And a lot more...

All the above information presents analysts with the various trends available, so that they can strategize for the future.

It is easy to get analytics on structured data, but it is difficult to get analytics on semi-structured or un-structured data, and here is the real value created by API's like Mahout or NLP toolkits.
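The product-recommendation example above can be sketched with simple item co-occurrence, one of the basic ideas behind collaborative filtering of the kind Mahout provides; the purchase data below is made up:

```python
from collections import Counter
from itertools import combinations

# Made-up purchase histories: user -> set of items bought.
purchases = {
    "u1": {"camera", "tripod"},
    "u2": {"camera", "tripod", "bag"},
    "u3": {"camera", "bag"},
}

# Count how often each ordered pair of items is bought together.
co_occurrence = Counter()
for items in purchases.values():
    for a, b in combinations(sorted(items), 2):
        co_occurrence[(a, b)] += 1
        co_occurrence[(b, a)] += 1

def recommend(item, owned):
    # Rank items most often co-purchased with `item`, skipping ones already owned.
    scores = Counter({b: n for (a, b), n in co_occurrence.items() if a == item})
    return [i for i, _ in scores.most_common() if i not in owned]

print(recommend("camera", {"camera"}))
```

At BigData scale the co-occurrence counting is itself an M/R job over billions of transactions, which is precisely where libraries like Mahout come in.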

Couple of popular Analytics API's: -

Mahout, NLP, R

MPP (Massively Parallel Processing) frameworks - This is something new that has emerged in the BigData technology stack; though they use the same concepts, the whole stack is tightly integrated to provide better performance and better analytics too.

Couple of popular MPP's are: -

Greenplum, ParAccel DB

BigData Clients: -

BigData clients are pretty different from the normal clients which we used to have in typical web-based systems.
Here the clients are pretty technical in nature, and it would not be wrong to say that here we deal with enterprise architects who: -

1. Are highly technical and can talk about architecture in the middle of the night.
2. Know their business very well.
3. Are aware of the next steps for their TB's/PB's of data.
4. Have all done a few POC's with Hadoop/MPP, or at least are well aware of what Hadoop/MPP can and cannot provide.

Now, if these customers have already identified their needs and are also aware of the technology/domain, then what do they need from us?

Yes, that's really a big question in the world of BigData, and here is what I have experienced about their needs: -
"They need people who understand, implement and are expert in BigData solutions, and who can provide analytics over TB's of structured/semi-structured/unstructured data for timely decisions, forming new strategies, and acquiring and focusing on new businesses/markets and customers on an ongoing basis"
In the next section we will talk more about the team expertise and skill set required to meet the above needs.

Expertise required by a BigData team: -

The team working on BigData solutions not only needs to be technically competent but should also be well versed in the challenges of large data (in TB's or PB's etc.).

Any candidate chosen for the BigData domain requires the following 3 things: -

Expertise in technology: -

1. Technically competent (software/hardware/networks), understands large data, and possesses good experience in dealing with the architectural issues/decisions encountered in processing and parsing large data.
2. Innovation - that's the key behind any technology creator or implementer. Don't limit your thoughts and don't do things just because someone told you to; put your mind to it, bring up new ideas and make things work in a better way.

Analytics: -

1. Understands the world of analytics and is well versed in the different solutions available for providing analytics over large data. This also requires an understanding of the dynamics and constraints imposed by the data.
2. Should not be language specific (it can be Python/shell/C or C++); rather, the focus should be on the problem.

Domain/ business Significance: -

1. Should have worked in some domain where he has dealt with and understands the dynamics/behavior and role of the data in the solutions; at the least, he should have dealt with some kind of reporting problems arising from the size of the data and the complexity of the reports.
2. Has implemented cases where technology is used to solve real-world problems in various verticals (finance/insurance/risk management/hospitality etc.), and not just hypothetical cases.

Competitors in BigData: -

The BigData market is still emerging, and that's the reason nobody has a defined list of competitors; but every single company that deals (or has dealt) with data challenges, and has the right set of folks to provide solutions over them, can be termed a competitor in the BigData domain.

A few of the major players who have already established themselves as leaders in this domain are: -

1. Cloudera - One of the early entrants in the BigData market. It also offers its BigData compatibility stack - CDH
2. Hortonworks - Backed by Yahoo and Benchmark Capital, it provides its own BigData compatibility stack - the Hortonworks Data Platform
3. EMC - A quite renowned company in the world of IT infrastructure, which has tied up with a couple of major players to provide its offerings in BigData.
4. Oracle
5. IBM

Even the majority of the data warehousing players have realized the future and have already started their own offerings: -

1. Pentaho
2. Informatica
3. JasperSoft

The discussion doesn't end here; BigData is still a focus area for most companies, and the future in 2012 looks much brighter and more competitive.


Vikas Srivastava said...

Great post Sir,

Sumit said...

Thanks Vikas

yogi said...

Very nicely written..
Great work