What so big about Hadoop?

The Real Reason Hadoop Is Such A Big Deal In Big Data Hadoop is the poster child for Big Data, so much so that the open source data platform has become practically synonymous with the wildly popular term for storing and analyzing huge sets of information.

While Hadoop is not the only Big Data game in town, the software has had a remarkable impact. But exactly why has Hadoop been such a major force in Big Data? What makes this software so damn special – and so important?

Sometimes the reasons behind something success can be staring you right in the face. For Hadoop, the biggest motivator in the market is simple: Before Hadoop, data storage was expensive.

Hadoop, however, lets you store as much data as you want in whatever form you need, simply by adding more servers to a Hadoop cluster. Each new server (which can be commodity x86 machines with relatively small price tags) adds more storage and more processing power to the overall cluster. This makes data storage with Hadoop far less costly than prior methods of data storage.

datacenter

Spendy Storage Created The Need For Hadoop We’re not talking about data storage in terms of archiving… that’s just putting data onto tape. Companies need to store increasingly large amounts of data and be able to easily get to it for a wide variety of purposes. That kind of data storage was, in the days before Hadoop, pricey.

And, oh what data there is to store. Enterprises and smaller businesses are trying to track a slew of data sets: emails, search results, sales data, inventory data, customer data, click-throughs on websites… all of this and more is coming in faster than ever before, and trying to manage it all in a relational database management system (RDBMS) is a very expensive proposition.

Historically, organizations trying to manage costs would sample that data down to a smaller subset. This down-sampled data would automatically carry certain assumptions, number one being that some data is more important than other data. For example, a company depending on e-commerce data might prioritize its data on the (reasonable) assumption that credit card data is more important than product data, which in turn would be more important than click-through data.

Assumptions Can Change

That’s fine if your business is based on a single set of assumptions. But what what happens if the assumptions change? Any new business scenarios would have to use the down-sampled data still in storage, the data retained based on the original assumptions. The raw data would be long gone, because it was too expensive to keep around. That’s why it was down-sampled in the first place. Expensive RDBMS-based storage also led to data being siloed within an organization. Sales had its data, marketing had its data, accounting had its own data and so on. Worse, each department may have down-sampled its data based on its own assumptions. That can make it very difficult (and misleading) to use the data for company-wide decisions.

More Hadoop Benefits

Hadoop also lets companies store data as it comes in – structured or unstructured – so you don’t have to spend money and time configuring data for relational databases and their rigid tables. Since Hadoop can scale so easily, it can also be the perfect platform to catch all the data coming from multiple sources at once. Hadoop’s most touted benefit is its ability to store data much more cheaply than can be done with RDBMS software. But that’s only the first part of the story. The capability to catch and hold so much data so cheaply means businesses can use all of their data to make more informed decisions.