FAST Cache - the answer to the Automated Storage Tiering Problem

This is sort of a rant blog post 🙂  Not like the ones Chuck Hollis does, but more of a nit that I hope to shed some light on.  I'm a HUGE fan of EMC's FAST Cache feature as part of our FAST Software Suite, but I don't think we at EMC do a good enough job of explaining to customers just how powerful this feature can be, especially when we spend most of our presentations on how FAST saves customers money by moving stale data down to lower-cost/lower-performing media.  While that is a great capability, it's really only half of the FAST Software Suite story.  My other rant is that when we do get around to talking about FAST Cache, it's usually in the discussion around Virtual Desktops.  In fact, the slide for FAST Cache typically shows the value being closely aligned to reducing spindle count in Virtual Desktop deployments (by a lot in most cases), and while that's great and all, it's not what I think is REALLY REALLY cool about FAST Cache!!!

So what is FAST Cache, you might ask, and why am I talking about it in this blog?  EMC's @StorageZilla did a pretty good overview of it here: "FAST Cache for EMC Unified Storage".  Essentially, FAST Cache offers an "I/O TurboBoost" for your busiest workloads.  It's really easy to get up and running on the CX and VNX platforms.  You just add in a couple of Enterprise Flash Drives (EFD, or SSD as others call it) and set them up as a FAST Cache pool.  Then you can go to different LUNs in Unisphere and enable the ability for each LUN to utilize this SSD/EFD space for read and write caching (unlike other solutions that are READ only).  Easy peasy, so let's get into the discussion of why I think we are missing the boat on the positioning of it.
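To make that read-and-write distinction concrete, here is a tiny Python sketch of the idea. It is purely a toy model of mine (made-up chunk bookkeeping, not EMC's actual implementation or chunk size); the point it illustrates is that a read/write flash cache can acknowledge bursty writes from flash and destage them to disk later, while a read-only cache still makes every write wait on spinning disk.

```python
# Toy model only - not the real FAST Cache code, chunk size, or eviction policy.
flash = {}   # hot chunks living on EFD/SSD
disk = {}    # the backing spinning-disk LUN

def read(lun, chunk, fast_cache_enabled=True):
    """Serve re-reads of hot chunks from flash when the LUN has the feature enabled."""
    if fast_cache_enabled and (lun, chunk) in flash:
        return flash[(lun, chunk)]           # sub-millisecond-class response
    return disk.get((lun, chunk))            # spindle-class response

def write(lun, chunk, data, fast_cache_enabled=True):
    """A read/write cache can acknowledge the write from flash and destage it later."""
    if fast_cache_enabled:
        flash[(lun, chunk)] = data           # burst absorbed by flash
        return "ack from flash"
    disk[(lun, chunk)] = data                # a read-only cache still pays the disk write
    return "ack from disk"
```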

First, let's give credit where credit is due!!  Compellent created and pioneered something they called "Data Progression" (DP); EMC calls this F.A.S.T. (Fully Automated Storage Tiering).  If you are not familiar with DP or FAST, it's "Automated Storage Tiering" (AST), or simply the ability to move blocks of data up and down different tiers of storage.  Jeramiah Dooley did a great overview of it from a Service Provider perspective on his blog: "FAST and FAST Cache for the Service Provider".  Most storage companies that support AST recommend mixing Tier 1 (15k RPM or SSD), Tier 2 (15k/10k RPM) and Tier 3 (NL-SAS, 7.2k RPM drives) into one big pool and then using their software to migrate the blocks of data up and down the stack.  Now, where we may differ is the block size that moves up and down.  In VNX, FAST VP works at a more controller-friendly 1 GB movement size (VMAX is at 768 KB – wow!!), and Compellent's is something like 2 MB.  From a competitive positioning standpoint, it's always funny to see "whose is better".  The net-net is that we all essentially do it the same way, and we all have a set policy on when this process takes place and how long it takes to move the data down.
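Just to put those movement sizes in perspective, here is a bit of back-of-envelope arithmetic (my own illustrative numbers, not vendor benchmarks): a coarser extent means far fewer pieces for the controller to track and schedule, but every relocation drags a lot more cold neighbouring data along with the one block that is actually hot.

```python
# Back-of-envelope comparison of tiering granularities for a 1 TiB LUN (illustrative only).
lun_kib = 1024 * 1024 * 1024          # 1 TiB expressed in KiB

granularities = {
    "VNX FAST VP slice (1 GiB)":  1024 * 1024,
    "Compellent default (2 MiB)": 2 * 1024,
    "VMAX extent (768 KiB)":      768,
}

for name, chunk_kib in granularities.items():
    extents = lun_kib // chunk_kib    # pieces the array has to track and relocate
    dragged = chunk_kib - 8           # cold KiB carried along when one hot 8 KiB block moves
    print(f"{name:30s} {extents:>9,d} extents, ~{dragged:,d} KiB moved per hot 8 KiB block")
```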

The value of this feature, and the way it is typically positioned, is the ability to take stale (or infrequently accessed) data and move it down to lower-cost, slower media, allowing more frequently accessed/important data to stay on Tier 1.  Before you unload on me, I do not think frequently accessed data = important data, but that's how it is sometimes positioned or implied, so don't shoot the messenger.

(Figure: tiering down the stack)

The drawing above is a good example of how this process works.  Data is normally written in at Tier 1, and then every X amount of time (in most cases 24 hours, but it can be adjusted – btw, you want to pay attention to the toll it takes on your controller resources) the blocks are reviewed, and if they haven't been touched they are marked to progress down to a lower tier of storage that night.  Every X amount of time this process starts up again and the data moves down the stack based on access time.  In practice this feature is AWESOME, and EMC, Compellent and various other solutions on the market are doing really well with it.
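If it helps to see that policy as code, here is a deliberately over-simplified Python sketch of a scheduled demotion pass. The tier names, the one-tier-per-cycle rule and the data structures are my own assumptions for illustration, not any vendor's actual relocation engine:

```python
# Toy relocation pass: anything untouched since the last window drops one tier.
TIERS = ["tier1_flash_or_15k", "tier2_10k", "tier3_nl_sas"]

extents = {
    # extent_id: current tier index and whether it was accessed this window
    "A": {"tier": 0, "touched": True},
    "B": {"tier": 0, "touched": False},
    "C": {"tier": 1, "touched": False},
}

def relocation_pass(extents):
    """Runs once per policy window (often nightly) and costs controller cycles."""
    for ext in extents.values():
        if not ext["touched"] and ext["tier"] < len(TIERS) - 1:
            ext["tier"] += 1          # demote one tier this cycle
        ext["touched"] = False        # reset for the next window

relocation_pass(extents)
print({eid: TIERS[e["tier"]] for eid, e in extents.items()})
# A stays put, B and C each drop one tier; climbing back up takes just as many cycles.
```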

My biggest rant with this feature is that it's always positioned, or talked about, around the ability to move data from Tier 1 down to Tier 3, which is fantastic, but what happens when that data sitting on Tier 3 – RAID 5 (in some cases (not EMC), a 9-drive RAID 5 SATA stripe set) – becomes "hot" again?  If the process runs every X amount of time, that means it could take 2 or more days to move back up to Tier 1.  Anyone see a problem with that?  If my Oracle, SQL, X-Database, Exchange or Virtual Desktop recompose process kicks off, I'm going to be sitting on some of the worst performance you can imagine.  RAID 5, SATA.  UGH.
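Some quick, rough arithmetic on why that hurts. The figures below are rule-of-thumb assumptions on my part (roughly 80 random-read IOPS per 7.2k drive, one relocation pass per day, one tier climbed per pass), not measurements from any particular array:

```python
# Rough, assumed numbers - illustrative only.
passes_per_day = 1        # nightly relocation window
tiers_to_climb = 2        # Tier 3 back up to Tier 1
print("days until the data is back on Tier 1:", tiers_to_climb / passes_per_day)

# While it waits, the "hot" data lives on a 9-drive RAID 5 set of 7.2k RPM SATA/NL-SAS.
iops_per_7k2_drive = 80   # common rule of thumb for random reads
print("ballpark random-read IOPS for the whole 9-drive set:", 9 * iops_per_7k2_drive)
# A single EFD can serve an order of magnitude more than that, at sub-millisecond
# latency, which is exactly the gap the rest of this post is about.
```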

This is where EMC could do a better job of positioning FAST Cache.  Right now, we position it as an ability to "absorb" front-end read and write bursts from hosts/servers/applications, which allows us to design storage arrays for "steady state" performance rather than "peaks".  Again, VDI is a great use case for this (think boot storms, logon storms, AV storms).  BUT I think the better way to look at it is the "oh crap, my data has progressed down to Tier 3 and it needs performance (response time, IOPS, etc.) quickly" scenario.  Think about it: you need to run a report and crunch some numbers on blocks that haven't been touched in a couple of days/weeks and have either started the progression process or are already sitting on Tier 3.  Do you simply live with the performance for a couple of days while the data works its way back up (think hundreds of milliseconds of response time)?  Do you run to the storage console and kick off a manual move?  I don't think so.  That means you have 2 choices.  You can live with the performance impact for a couple of days, or you can simply not turn AST on for that volume.  Option 2 doesn't sound very appealing, since it would seem counter to the reason you wanted that feature in the first place!!

So the answer to this issue is to use something like FAST Cache to bridge the performance impact of the AST process.  Simply put, if the block of data you need is sitting in Tier 3 and it gets "touched" 3 times, it will be promoted directly into the FAST Cache pool (EFD/SSD), and the response times and IOPS capability go from the outhouse to the penthouse with ZERO user intervention.  It is the turbo boost your application needs at that immediate time.  Once that data gets stale again, it moves back down the stack.  EASY FREAKING PEASY, and the way FAST Cache should be talked about!!
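In toy Python form, that promotion rule looks roughly like the sketch below. The chunk naming, counters and lack of aging are my own simplifications; the only behaviour taken from the description above is "copy the chunk up into the flash pool once it has been touched three times".

```python
# Minimal sketch of promote-on-third-touch (not EMC's code; aging/eviction omitted).
from collections import defaultdict

PROMOTE_AFTER = 3
hit_counts = defaultdict(int)
in_flash = set()

def access(chunk_id):
    """Count touches on a chunk still sitting on spinning disk; promote on the third."""
    if chunk_id in in_flash:
        return "served from FAST Cache"
    hit_counts[chunk_id] += 1
    if hit_counts[chunk_id] >= PROMOTE_AFTER:
        in_flash.add(chunk_id)        # copied up into the EFD/SSD pool
        return "promoted to FAST Cache"
    return "served from Tier 3 (slow)"

for touch in range(1, 5):
    print(touch, access("report_block_42"))
# Touches 1-2 come off slow disk, touch 3 promotes the chunk, touch 4 is served from flash.
```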

As you can see, Automated Storage Tiering (FAST VP or Data Progression) is really only positioned as a way to save CAPEX on Tier 1 storage, and it's usually only talked about in the sense of moving the data out of Tier 1 and down into Tier 2 and Tier 3 to help save money.  That's really misleading, in the sense that when the data does get "hot" again it takes a while to move back up, and I think that is where we do a disservice to our customers by not explaining the performance impact of data sitting in Tier 3.  Features like FAST Cache bridge that huge performance gap and help solve that performance issue.

So, if you are a HUGE fan (as I am) of Automated Storage Tiering, you really need to not get so wrapped up in the progression of blocks down the stack, because at this point everyone does it.  You really need to understand and ask things like: "In a hands-off scenario, what is the process to move the data back up the stack, how long could that take, and what is the performance impact during that time?"  If the answer is "well, it depends", then you may want to do some customer reference checking on what happens in the real world!!

Okay, that’s it for my rant 🙂

@vTexan

20 thoughts on "FAST Cache - the answer to the Automated Storage Tiering Problem"

  1. Tommy,

    Great post. Didn't XIOTECH do some tiering as well, since the founders of Compellent were also part of the core founders group of XIO?

  2. Back in the good ol' days of the Magnitude (Classic) we did have the ability to do that – in fact, if memory serves me correctly, they played around with it in engineering, but we never rolled it out and I think it died on the vine once the founders left.

  3. Great article! Explains a lot about FastCache that I did not know. Absolutely love this line: “response times and IOPS capability goes from the outhouse to the penthouse”

  4. Great post. Question: You mention that FastCache is read/write while others are read-only. Why did EMC go read/write and what tradeoffs are you making? Why are others read-only?

  5. Great post! Hope you don’t mind a short question.

    I wonder what your view is on the IT operations (or causes) that would make a large data set on a SAN "become hot again", and how likely such an event is to happen without anyone planning it?

    When the server/application requests the 'cooled' data, we normally speak about some number of pages (N megabytes per page * M pages = total dataset to heat up). I just can't imagine a conscious IT operation that would heat up a large dataset… IT administration _should_ move the data up as per a change request – as in your article, where you say "kick off a manual move". Otherwise, if "do nothing" is the case, we are speaking about only the M pages that by chance became 'hot', and that impacts the average latency (every one of those pages may add its 100 ms), but is it significant in the grand scheme of things?

    Thanks again, – great article!

    1. That's the point, I think – data 'heat' is not controlled by IT, it's controlled by the business. If they are doing an acquisition and want to run some financial numbers that have been cold, how is IT supposed to see that in their crystal ball? Unless you have an AST technology like FAST in the Symmetrix VMAX that constantly moves the data, you need something to absorb that spike. Also, how do you manually move subsets of LUNs (pages) proactively? How do you know which pages contain what data? You don't.

      I can't picture that scenario – "Hi Bob in IT, this is Gary in accounting. I may need to run some queries in relation to an audit on some older data. It's in these tables, which are on these pages. If you could promote that data tonight so it's ready for tomorrow, that would be great."

      1. @VMTyler you can take FAST to the next level with ISVs that have leveraged the FAST APIs on VMAX and VNX to do exactly what you're asking… i.e. predictive FAST movements of blocks to prepare for the new workload after the next scheduled move. See http://www.precise.com – for them, FAST = Fully Aware Storage Tiering. 🙂

  6. Nice post Tommy! Compellent’s block size is 2 MB by default but can be changed to 512 KB or 4 MB depending on needs.

    1. Yeah, I was aware they can change it higher or lower, but I still feel way too much time/effort/marketing material focuses on the migration down the stack, and no one really speaks about what happens when/if the blocks get hot again.

  7. "My biggest rant with this feature is that it's always positioned, or talked about, around the ability to move data from Tier 1 down to Tier 3, which is fantastic, but what happens when that data sitting on Tier 3 – RAID 5 (in some cases (not EMC), a 9-drive RAID 5 SATA stripe set) – becomes "hot" again? If the process runs every X amount of time, that means it could take 2 or more days to move back up to Tier 1. Anyone see a problem with that?"

    Yeah, I see a problem with that. And the problem is that EMC is recommending we purchase FAST Cache to solve a problem that the shortcomings of EMC's, and in particular the VNX's, AST implementation created in the first place, due to its lack of real-time or near-real-time reaction…

    1. Eric – first and foremost thank you for being a customer and thank you very much for voicing your concern. I’m sure you are currently working with your EMC account team on this, but I’ll toss in my thoughts. Please note, I’m doing this without any background on your situation so if I’m way off, please forgive me.

      So EMC, not unlike Compellent, IBM, Isilon, 3PAR, etc., supports an AST implementation in some way, shape or form along the lines of how I outlined it in the blog, and we all suffer the same issue. AST is not real-time, and because of that it's rarely positioned as a "performance" boost – it's always positioned as a CAPEX/cost-savings feature. In other words, most customers buy AST because it can clearly save CAPEX dollars on Tier 1 storage. Before AST, your choices were to run everything on Tier 1, or to manually figure out your performance requirements for each application and then place that app on the appropriate tier of storage – whether that was putting your SQL/Exchange/Oracle on fast Tier 1 and putting your end-user data on Tier 3. A storage and application admin had to put a lot of thought into the storage design layout. AST helped change a lot of that dynamic. Now, instead of having to focus on where a certain application lives (within a tier), you simply create a volume that spans all the tiers and you let the array figure out where to move the data based on how frequently the block of data is being hit (hot vs. cold/stale data). Today that process runs as a policy, usually a batch job that runs on a predetermined time frame. The default is usually 24 hours, but if you have active data you can adjust that down to hours/minutes. The tradeoff is that those processes require controller CPU cycles and other resources, so you have to figure out where your sweet spot is. You have clearly run into a situation where we may need to evaluate when that process is run and how often it is run.

      Now, if you don't want to mess with that, then FAST Cache can be used as your real-time performance adjustment tool. It is 100% focused on performance boosts. In your case, if a block of data is sitting in Tier 2 or Tier 3 and gets referenced (read or write) 3 times in X amount of time, it's mirrored right into the SSD/EFD "FAST Cache pool" and you get your turbo boost. It does this without you having to manage, monitor, or even think/worry about it. This can be turned on and off at a storage LUN level, so you can apply it where you feel you need it the most. We can even work with you on seeing what sort of data could benefit from FAST Cache. We do this a lot for customers that are on the fence about adding FAST Cache to their array. I can tell you that very few storage vendors can offer FAST Cache-type features (real-time performance tuning). Most of the time they just encourage you (as I did above) to adjust when the policy is run, or they may just recommend not using AST for that particular application (boooo!!)

      So, if you are not interested in FAST Cache (I'd love an opportunity to change your mind), talk with your TC about adjusting the times when the policy is run.

      If you like, I would love an opportunity to speak with you about this issue. My e-mail address is tommy . trogden at emc com – feel free to drop me an e-mail and I’ll see what I can do to help you. I might not know the answer, but I’m pretty good about finding the right resource for you !!

      Again, thank you for being an EMC Customer and thanks for the comment.
      Tommyt

  8. Great post Tommy! Whenever I talk about the FAST Suite I always talk about FAST VP dealing with the longer term profile (heat) of the data to aid capacity efficiency. FAST CACHE is of course designed to deal with the immediate heat profile of data, i.e. performance efficiency. I think what you capture very well here is the combination of the two, especially in the use case where data that was cold becomes instantly hot again due to business demand. This reheating of data is usually something that IT staff are concerned about because they know fine well that what the business want to do with data is far from predictable. The FAST suite deals with long term and short term performance and therefore provides a suitable answer regardless of the use case.

  9. Hi
    I just want to get clear on this. Does the FAST Cache feature move blocks of data from the initial tier to the FAST Cache (tier 0)? If so, this must mean that the FAST Cache device has to be at least mirrored, or? And if it moves the blocks, would you then call it a cache? A cache is by definition a copy of the most recently used data, which requires FAST Cache to actually copy the blocks and not move them.

    Regards Niklas.

  10. Great article!

    The only thing I would add is how Compellent always (using the recommended progression policy) commits new and modified blocks to the highest tier. If blocks become hot in terms of reads only, it will rely on frequency of access to move them up. If a block is hot because it's now being written to, that is a zero-delay elevation: the new block is written in the top tier, pointers are updated, and the old blocks are released.

    Additionally, Compellent will tier RAID types inside a disk tier, i.e. SSD R10/R5, 15K R10/R5.

    Either case — the ability to tier well is where it's all headed, unless we figure out how to create cheap quantum storage!

    1. Hey Chris,

      Thanks for the reply and congrats on the role at Dell/Compellent – heard they picked you up!! Still wish we could have landed you at Xio !! 🙂

  11. FAST Cache is nice… but unfortunately it's not a fit-all solution.
    What if your FAST Cache is full? Then you get the same problem again.

    1. Hey Sam – thanks for the comment. So, FAST Cache is always full, and blocks get aged out and new ones fill in their place, not unlike controller cache. As far as being a fit for all applications you are correct. The great news is we usually have a good idea as to which ones work best.

      Thanks again for the comment !!

  12. Hi Tommy,

    Thanks for your post.

    We are proposing FAST Cache to customers in many cases, and here I have a doubt: what if my active data set is larger, let's say 10 TB of data, while the maximum FAST Cache size on a VNX is 4 TB? In that case, how does the 10 TB of data get promoted to FAST Cache?

    Thanks

