Monday, June 3, 2013

Upgrading Your Storage Microcode

Folks,

I was just reading a posting by Chris Evans on this topic at http://architecting.it/2013/06/03/managing-microcode-upgrades/ and he makes a lot of great points. I agree with everything that Chris posted, but I would go even further and say that, based on my experience, having a regular process for upgrading your storage microcode is critical to managing any storage environment.

There seem to be three competing philosophies that cause problems on this topic "in the wild":


  1. "If it Ain't Broke, Don't Fix It!" - This is the idea that you should only patch or upgrade your storage infrastructure if you run into a problem.  I run into this approach more often than you would think, and invariably what this means is that you will run into every problem that exists in the microcode and have to deal with it on an "emergency" basis. It also means that you will often go for long periods of time without patching or updating, and then when you hit a problem, you have a huge jump, which almost always means that you also have a lot of servers that need HBA firmware and/or driver updates.  This usually ends up being  aHUGE and painful project, that, in some people's minds simply confirms why they are avoiding doing the storage microcode upgrades in the first place.  What they don't realize is that the main reason it's so painful is that they are so far behind. If they actually kept up, then the pain would be less and spread over time.
  2. "Pick a standard, and keep it as long as possible" - This approach is one I see fairly often as well. Here the storage team picks a "standard" version of the OS, ans sticks to it only patching it when there is a problem, or until they are forced to change because new hardware doesn't support than version of the OS any longer. Then they adopt the new version of the OS as their standard, and bring everything up to that level. It's actually similar to #1, and suffers from the same sorts of issues.
  3. "Apply every patch and/or upgrade the vendor releases as soon as it becomes GA" - I see this much less frequently mainly because people are afraid, often rightfully so, that patching/upgrading this frequently will cause more problems than it solves.


The process that Chris outlines in his blog post is, in my opinion, the right way to go. Apply your patches either quarterly or twice per year in predefined upgrade windows. This doesn't mean that you can't apply patches to resolve specific issues as they arise.

But I would go a bit further in my definition of the process.  Specifically, I would have a process that works something like this:


  1. Between upgrades (i.e., during the quarter or six-month period between upgrade windows) I would pull down every patch and upgrade that the storage manufacturer releases into the storage team lab and apply it to a lab box. I would then run a set of regression tests to validate that the patch/upgrade works in my environment, with my servers, HBAs, etc.
  2. About a week prior to my upgrade window I would pull together an "upgrade" package where I decide which patches/upgrades I am going to apply to the storage, as well as any that are required for the HBAs, host OSes, etc. Note that it's critical that the host HBAs be upgraded to the latest drivers and firmware supported by the patches/upgrades you are going to roll out, to avoid issues (a simple compatibility check like the sketch after this list covers most of that legwork). Server-side upgrades are often avoided even more than the storage OS upgrades, since they are usually the source of outages (a reboot is required) and, in many cases, it's not the storage team doing those upgrades.
  3. I would actually have two windows if possible: one for arrays that support dev/test, and one for arrays that support production. I would then roll out the patches to dev/test, let them bake there for a week or two, and then roll them out to production. This isn't 100% necessary, especially if you've done good testing in your lab, but it would be nice.
  4. Go to step #1 and start the process all over again.
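
To give a flavor of the compatibility check I mentioned in step #2, here's a minimal sketch in Python. Everything in it is hypothetical (the version formats, the SUPPORT_MATRIX and LAB_HOSTS structures, the microcode level name); in practice you'd feed it your vendor's actual interoperability matrix and your own host inventory, but the shape of the check is the same:

    # Hypothetical sketch: compare lab hosts' HBA driver/firmware levels against
    # the minimums a candidate microcode release is qualified for. The data
    # structures and version tuples here are assumptions, not any vendor's format.

    # Minimum HBA driver/firmware the candidate microcode is qualified against
    SUPPORT_MATRIX = {
        "array_microcode_7.2.1": {
            "qlogic": {"driver": (9, 1, 2), "firmware": (8, 8, 0)},
            "emulex": {"driver": (12, 0, 1), "firmware": (12, 0, 0)},
        }
    }

    # What the host inventory says is actually installed (assumed format)
    LAB_HOSTS = [
        {"name": "labdb01", "hba": "qlogic", "driver": (9, 1, 2), "firmware": (8, 8, 0)},
        {"name": "labapp02", "hba": "emulex", "driver": (11, 4, 0), "firmware": (12, 0, 0)},
    ]

    def check_hosts(candidate, hosts, matrix):
        """Return a list of (host, component, installed, required) gaps."""
        gaps = []
        required = matrix[candidate]
        for host in hosts:
            mins = required.get(host["hba"])
            if mins is None:
                gaps.append((host["name"], "hba model", host["hba"], "not qualified"))
                continue
            for component in ("driver", "firmware"):
                if host[component] < mins[component]:
                    gaps.append((host["name"], component, host[component], mins[component]))
        return gaps

    if __name__ == "__main__":
        for name, component, installed, required in check_hosts(
                "array_microcode_7.2.1", LAB_HOSTS, SUPPORT_MATRIX):
            print(f"{name}: {component} {installed} is below required {required}")

Run against your real inventory, the output of something like this becomes the to-do list of host-side driver/firmware updates that go into the upgrade package in step #2.
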
When I've described this process to people I often get push back like "Hey, that means we will constantly be either testing or performing upgrades!" This is especially the case if you decide to go with the quarterly schedule. My response is "Yup, because that's part of what a storage team does, and why you have a storage team." Frankly, the team's time is better spent on this than on, say, doing a lot of LUN allocations, which you can automate, and even delegate, once it's automated.
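
As an aside, when I say LUN allocations can be automated, I mean something as simple as the sketch below. It assumes a purely hypothetical REST endpoint on the array (the URL, payload fields, and token auth are all made up; every vendor's API or CLI looks different), but it shows how small the building block really is once you wrap it up:

    # Hypothetical sketch of an automated LUN allocation against a made-up
    # array REST API. Endpoint, payload, and auth are assumptions; substitute
    # your array vendor's actual API or CLI.
    import requests

    ARRAY_API = "https://array.example.com/api"   # placeholder URL
    TOKEN = "..."                                  # however you authenticate

    def allocate_lun(pool, size_gb, host_group):
        """Create a LUN in the given pool and map it to a host group."""
        resp = requests.post(
            f"{ARRAY_API}/luns",
            headers={"Authorization": f"Bearer {TOKEN}"},
            json={"pool": pool, "size_gb": size_gb, "host_group": host_group},
            timeout=30,
        )
        resp.raise_for_status()
        return resp.json()["lun_id"]

    if __name__ == "__main__":
        lun_id = allocate_lun(pool="gold_pool01", size_gb=500, host_group="app_cluster_a")
        print(f"Allocated LUN {lun_id}")

Once something like that exists, the allocation work can be handed to the server or application teams through a request form or ticket workflow, and the storage team's hours go to the testing and upgrade cycle instead.
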

The bottom line is, it's a "pay me now, or pay me later" situation, and I would rather do as much of my patching/upgrading as possible in a proactive manner, rather than in a reactive manner where it's a big emergency and a big project with a lot of downtime all at once.