Virtualizing Hadoop in Large Scale Infrastructure

 

An interesting federation whitepaper recently hit my inbox, and the subject line grabbed my attention immediately.  It was a whitepaper called “Virtualizing Hadoop in Large-Scale Infrastructure”, written in conjunction with our customer Adobe Systems out of Utah, and it was right in line with some of the comments I’ve been making recently with customers and at conferences/VMUGs.  This paper further convinces me that Hadoop and object-based storage (think Amazon S3) are the future as it relates to storage consumption.  If you are in IT, and you plan to be in IT for the next 3 or more years, now would be a great time to start getting up to speed on Hadoop as well as object-based storage.  While I certainly think block- and file-based storage will be around for a long time, it appears the new “cool kids” on the block are HDFS and object.

Anyway, back to this whitepaper.  The focus of the white paper is Adobe’s IT department wanting to be more agile and responsive to the needs of the business.  They specifically called out a key objective: “Build a virtualized HDaaS environment to deliver analytics through a self-service catalog to internal Adobe customers”. They wanted to utilize their Cisco UCS blades, EMC VNX, and EMC Isilon (Isilon was used for the Hadoop storage – more on that in a future blog post), as well as VMware’s “Big Data Extensions” (BDE). In addition, Adobe is convinced (as am I) that companies can gain a significant competitive advantage by mining the vast amount of information they collect on their sites – to the tune of over 8PB (PETABYTES!!!).  THAT’S CRAZY!  This data is mostly collected from site visits and web traffic, and then tied back to revenue.  It’s just one of the examples they used in the document.

The whitepaper outlines some of the key objectives of an HDaaS offering, as well as their sincere desire to figure out the possible performance consequences of virtualizing and then scaling Hadoop.  It also points out some of the lessons learned, or what I like to call “banging your head against the wall” issues.  Looking through the whitepaper, it’s clear that memory settings were really important.  The paper also does an excellent job of sharing the various other whitepapers and documents used for guidance and recommendations.
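To give a sense of what “memory settings” means in a virtualized Hadoop context, tuning typically revolves around YARN container memory and JVM heap sizes in yarn-site.xml and mapred-site.xml. The property names below are standard Hadoop/YARN settings, but the values are purely illustrative and are not taken from the Adobe whitepaper:

```xml
<!-- yarn-site.xml: total memory a node offers to YARN containers.
     On a virtualized node this must fit inside the VM's allocation,
     leaving headroom for the OS and the Hadoop daemons themselves.
     Values are illustrative only, not Adobe's actual settings. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>24576</value>
</property>

<!-- mapred-site.xml: per-task container size and JVM heap.
     The heap (-Xmx) is kept below the container size so the JVM's
     off-heap overhead doesn't cause YARN to kill the container. -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1638m</value>
</property>
```

The general point is that these numbers have to be sized against the VM’s memory reservation rather than against raw physical hosts, which is why getting them wrong is a classic “banging your head against the wall” issue.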

So if you are interested in learning more about Hadoop, or have already started down the path of implementing Hadoop, take a moment to read through this customer whitepaper.  If nothing else, you might get some ideas on how you could utilize Hadoop in your own environment.

 

2 thoughts on “Virtualizing Hadoop in Large Scale Infrastructure”

  1. == Disclaimer: Pure Storage Employee ==

    Tommy,

    Great post. While Hadoop, Spark, and other big data elements are maturing, I think we in the storage industry have seen the struggles early adopters have faced with shared-nothing architectures at scale (like HDFS on DAS). My comments may stoke the passions of some in this space, but I expect that as big data technology matures we will see a move toward storage infrastructures that are best adapted to storing and serving the needs of multiple services, above and beyond big data engines.

    I suspect momentum is building in this space.

    — cheers,
    v

  2. I have been directed to this site by our local EMC SE. I was one of the main drivers and participants in this project at Adobe. It was a ton of work, we learned so much, and we had a great experience working with EMC resources (including Isilon, VMware, Pivotal, etc…).

    I’m really excited to see such a positive response to our work, and especially to the whitepaper. We continually found ourselves figuring things out, or assimilating ideas from various Hadoop knowledge sources, and we were very excited to share the lessons learned with anybody else who might be interested. This isn’t even close to the end, as if we had figured it all out; rather, it represents the beginning of what will likely be a continuing journey to provide a powerful and flexible environment for the business to discover hidden insights in the troves of data stored in our lake.

    For some more particulars, one of the participants on the team (and a BDE guru) has recorded and shared some of the specific settings and configurations we had to discover:

    http://virtualelephant.com/category/hadoop/
