Abstract New

Haduping the datachondriacs

Lest I get a ton of hate mail, let me get this out of the way - there are a lot of benefits to using Hadoop. Hadoop will eventually bring about world peace. Sarcasm! Let me start again. Hadoop is an awesome technology framework/ecosystem for distributed storage and processing of large data. Like everything else in the universe, Hadoop is useful only where it is applicable. Better?

Two issues need addressing:
  • businesses craving Hadoop without rhyme or reason, a.k.a. datachondriacs
  • technologists using lame and often inappropriate examples to push Hadoop

What follows is a sample conversation between a customer and a nameless Hadoop enthusiast - HE (or the Serious Hadoop Enthusiast - SHE for the politically correct).

  • HE: How much data do you have?
  • Customer: SQL Server with 2 GB of data. Been accumulating over the last 15 years.
  • HE: So clearly, you are not planning for the explosion of data.
  • Customer: What data?
  • HE: You know, the usual - petabytes of information... um...Facebook say and Twitter....
  • Customer: Something concrete?
  • HE: You know like word count, analytics
  • Customer: I don't get it but it sorta kinda sounds good. I have always wanted to Hadoop.
  • HE: Do you want your business to succeed?
  • Customer: Yes
  • HE: Then say it with me - Hadoop! Hadoop! Hadoop!
  • Customer: Hadoop! Hadoop! Hadoop!

I have, unfortunately, seen many such conversations. So, where do we begin? We can begin by thinking about Hadoop the right way. The right way to think about Hadoop is as a platform. A platform for

  • distributed storage and parallel computing
  • leveraging commodity hardware to give you fault-tolerant, scale-out solutions

It also is a platform that

  • needs to be learnt and administered
  • is open source
  • is typically used to crunch, clean and prep large (read: petabytes, or at the least many terabytes) amounts of data

An important distinction to make here is that Hadoop is a platform and not a solution. Whether adopting Hadoop makes sense in your organization largely depends on the business and the technologists on your payroll.

A tip for businesses

Get a reality check on your business needs and your data strategy. Nothing enlightening here, but this is the place where a ton of slip-ups happen. Do you currently generate or capture tremendous amounts of data (at least terabytes), both structured and unstructured, or plan to in the future? More importantly, does this data help your business? Without a sound data strategy, it is usually useless to jump on the Hadoop bandwagon. If you feel the phrase "data strategy" itself sounds like hot air, you already have a leg up on building a successful business.

A tip for technologists

A scalable architecture can be designed and maintained without Hadoop. Scale-out and scale-up strategies can both be put together without involving Hadoop. In fact, it can become easier if you decide to adopt cloud ecosystems. Different cloud ecosystems already have solutions for distributed storage of data - think Azure Table storage, AWS DynamoDB etc. Even for real-time applications, an architect can always use sharding, queues and jobs on commodity hardware. I hear the Hadoop Enthusiast screaming for me to use Storm.
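For the sharding remark, here is a minimal sketch of what I mean, in plain Python and with no Hadoop in sight. The names `shard_for` and `partition` are made up for illustration, and the SHA-256-modulo scheme is just one simple assumption; a real system would also have to think about resharding when the number of shards changes.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a record key to a shard index using a stable hash (hypothetical helper)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer so the result is stable across runs.
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(keys, num_shards):
    """Group keys into num_shards buckets, one bucket per shard."""
    shards = [[] for _ in range(num_shards)]
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


if __name__ == "__main__":
    # Illustrative keys only; each bucket could live on its own commodity box.
    customer_ids = [f"customer-{i}" for i in range(10)]
    for index, bucket in enumerate(partition(customer_ids, 3)):
        print(index, bucket)
```

Each bucket can then be handed to its own database, queue or worker job. Not glamorous, but often enough.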

Finally, another word of caution. Machine learning on Hadoop is the opiate of choice for technologists. Before you taste the heady brew, consider this for a second - a million patient records take up about 1.5 GB of data. If all you want to do is analyze that, you probably do not need Hadoop.
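If you want to run that reality check yourself, the arithmetic is short enough to sketch. The record count and the 1.5 GB come from the paragraph above; the 16 GB of RAM and the function name `fits_in_memory` are my own assumptions for illustration, and I am treating 1 GB as 10^9 bytes.

```python
RECORDS = 1_000_000
TOTAL_BYTES = 1.5 * 10**9  # about 1.5 GB, taking 1 GB = 10^9 bytes

# Average size of one patient record implied by the figures above.
bytes_per_record = TOTAL_BYTES / RECORDS


def fits_in_memory(num_records: int, record_bytes: float, ram_bytes: float) -> bool:
    """Rough check: does the raw data fit in RAM on one machine? (hypothetical helper)"""
    return num_records * record_bytes <= ram_bytes


if __name__ == "__main__":
    assumed_ram = 16 * 10**9  # assumption: a single machine with 16 GB of RAM
    print(f"{bytes_per_record:.0f} bytes per record")
    print("fits on one machine:", fits_in_memory(RECORDS, bytes_per_record, assumed_ram))
```

That works out to roughly 1.5 KB per record, and the whole lot fits in memory on one machine with room to spare. This ignores indexes and in-memory overhead, so treat it as a sanity check and not a capacity plan.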

You say - "Yes yes yes. What about patient info from Facebook and Twitter and all that stuff?" I hear you. So I shall bin/stop-all.sh

© 2014 - 2015 abstract new. All rights reserved.