Tuesday, January 6, 2009

IBM XIV Could Be Hazardous to Your Career

So, I haven't blogged in a while. I guess I should make all of the usual excuses about being busy (which is true), etc. But the fact of the matter is that I really haven't had a whole heck of a lot that I thought would be of interest; certainly there wasn't a lot that interested me!

But now, I have something that really gets my juices flowing: the new IBM XIV. I don't know if you've heard about this wonderful new storage platform from the folks at IBM, but I'm starting to bump into a lot of folks who are either looking seriously at one, or have one or more on the floor now. It's got some great pluses:

  • It's dirt cheap. On top of that, I heard that IBM is willing to do whatever it takes on price to get you to buy one of these boxes, to the point that they are practically giving them away. And, as someone I know and love once said, "what part of free isn't free?"
  • Fibre Channel performance from a SATA box. I guess that's one of the ways they keep the price so low.
  • Tier 1 performance and reliability at a significantly lower price point.

So, that's the deal, but like with everything in this world, there's no free lunch. Yes, that's right, I hate to break it to you folks, but you really can't get something for nothing. The question to ask yourself is, is the XIV really too good to be true? The answer is yes, it is.

But the title of this post is pretty harsh, don't you think? Well, I think that once you understand that the real price you are paying for the "almost free" XIV could be your career, or at least your job, then you might start to understand where I'm coming from. How can that be? Well, I think that in most shops, if you are the person who brought in a storage array that eventually causes a multi-day outage in your most critical systems, your job is going to be in jeopardy. And that's what could happen to you if you buy into all of the above from IBM regarding the XIV.

What are you talking about, Joerg?!? IBM says that the XIV is "self healing" and that it can rebuild the lost data on a failed drive in 30 minutes or less. So how can what you said be true? Well folks, here's the dirty little secret that IBM doesn't want you to know about the XIV. Due to its architecture, if you ever lose two drives in the entire box (not a shelf, not a RAID group, the whole box, all 180 drives) within 30 minutes of each other, you lose all of the data on the entire array. Yup, that's right, all your Tier 1 applications are now down, and you will be reloading them from tape. This is a process that could take you quite some time, I'm betting days if not weeks to complete. That's right, SAP down for a week, Exchange down for 3 days, etc. Again, do you really think that if you brought that box in, your career at this company wouldn't be limited after something like that?

So, IBM will tell you that the likelihood of that happening is very small, almost infinitesimal. And they are right, but it's not zero, so you are the one taking on that risk. Here's another thing to keep in mind. Studies done at large data centers have shown that disk drives don't fail in a completely random way. They actually fail in clusters, so the chances of a second drive failing within the 30-minute window after that first drive failed are actually a lot higher than IBM would like you to believe. But, hey, let's keep in mind that we play the risk game all the time with RAID-protected arrays, right? The big difference here is that the scope of the data loss is so much greater. If I have a double failure in a 4+1 RAID-5 group, I'm going to lose some LUNs, and I'm going to have to reload that data from tape. However, it's not the entire array! So I've had a much smaller impact across my Tier 1 applications, and the recovery from that should be much quicker. With the XIV, all my Tier 1 applications are down, and they all have to be reloaded from tape.
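To put some rough numbers on that comparison, here's a quick back-of-the-envelope sketch in Python. The 3% annual failure rate, the 8-hour RAID-5 rebuild window, and the independence assumption (which, given the clustering studies I just mentioned, is the optimistic case for the XIV) are my assumptions, not anyone's published figures.

# Compare the chance of a second, data-losing drive failure during the
# rebuild window, assuming independent failures and an assumed 3% annual
# failure rate per drive.
HOURS_PER_YEAR = 8760
AFR = 0.03                            # assumed annual failure rate per drive
rate = AFR / HOURS_PER_YEAR           # per-drive failure probability per hour

def p_second_failure(surviving_drives, window_hours):
    # Probability that at least one surviving drive fails inside the window.
    return 1 - (1 - rate) ** (surviving_drives * window_hours)

p_xiv = p_second_failure(179, 0.5)    # any other drive in the box, 30-minute window
p_r5  = p_second_failure(4, 8.0)      # the 4+1 group's survivors, assumed 8-hour rebuild

print(f"XIV:    {p_xiv:.1e} per incident, scope = the whole array")
print(f"RAID-5: {p_r5:.1e} per incident, scope = one RAID group's LUNs")

On those assumptions the two per-incident probabilities come out within a small factor of each other (roughly 3e-4 versus 1e-4); the difference that matters is the blast radius when you do lose: one RAID group's LUNs versus every volume on the box.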

Just so you don't think that I'm entirely negative about the XIV, let me say that what I really object to here is the use of an XIV with Tier 1 applications, or even Tier 2 applications. If you want to use one for Tier 3 applications (i.e., archive data), I think that makes a lot of sense. Having your archive down for a week or two won't have much in the way of a negative impact on your business, unlike having your Tier 1 or Tier 2 applications down. The one exception to that I can think of is VTL. I would never use an XIV as the disks behind a VTL. Can you imagine what would happen if you lost all of the data in your VTL? Let's hope that you have second copies of the data!

Finally, one of the responses from IBM to all of this is "just replicate the XIV if you're that worried." They're right, but that doubles the cost of the storage, right?

25 comments:

Chris M Evans said...

Joerg

I discussed a related subject, RSS (redundant storage sets), which reduces the RAID failure risk, on my blog here:

http://storagearchitect.blogspot.com/2008/10/understanding-eva-revisited.html

However, I'd caveat what you say here. Firstly, you're saying that you need a total failure of the drive - most arrays these days pre-fail disks before they physically fail. Also, as XIV is writing 1MB blocks (from memory) across the whole array, blocks can potentially be recovered to repair data before a drive fails totally. This would limit the impact of the failure and the amount of data to be recovered. It would be good to see the management software indicating the potential impact of a disk failure, so when a disk does fail, it is easy to determine what's affected.

Joerg Hallbauer said...

Chris,

I agree with what you said with regard to arrays failing drives before they hard fail. Some vendors are much more aggressive about that than others, so I'm wondering how aggressive IBM's XIV is in this regard?

The bottom line is that if IBM is marketing the XIV as an "enterprise" array, then they really need to address this issue in some reasonably cost-effective way. All of the other "enterprise" array vendors allow their customers to pick and choose different protection levels based on what the customer's needs are. Some let you mix different protection levels in the same box so that you can match the protection level to the application's importance to your business. The XIV has a single way to protect data, and that's it. All of the arguments end up being about whether that single protection method is good enough for "enterprise" data or not.

--joerg

Anonymous said...

Joerg
You're a classic! Just how much EMC disk do you have anyway? This FUD is exactly what the EMC reps were telling us yesterday. They are marketing geniuses and you're their stooge.

Storage Pimp said...

Regardless of vendor, any time you utilize pooled storage/wide stripes, a double disk failure that spans your RAID protection takes out the entire pool...

Anonymous said...

The chance of a failure on ANY hardware system is never 0, no matter who the manufacturer is.

It is true that with XIV, any simultaneous two-disk failure will render all the data on the array faulty: unlike other systems, any large enough data set will likely span all drives on the array.

However, the very rapid (relatively speaking) rebuild time reduces the probability of "simultaneous failure", because it reduces the window of simultaneity.

Any other system suffers similar risk: simultaneous failure of components will make some data unavailable. The fact that it's "all the data" in the XIV case and "only" some of the data in other cases is not entirely relevant: this "some of the data" may contain the one or two absolutely vital datasets that bring down the operation of the organization.

With XIV, you get increased exposure because simultaneous failure of *any* two disks is fatal, while with more traditional RAID-5 systems you get increased exposure because "simultaneous" can sometimes mean "within the same day" or even "within the same week", due to ever-growing rebuild windows.


(Of course, it is also possible that the failure modes of two disks in the array are not independent of each other, and hence the simultaneity will be real, i.e. the two devices will fail at exactly the same moment (physical trauma to the enclosure, power surge, uncontrolled temperature rise etc.) Unfortunately, in those cases it is highly likely that more than just two will fail, and I believe that in those cases the XIV is on the same footing as anyone else.)

In order to claim that XIV is less safe than other systems, one needs to actually calculate the risk and weigh the larger spatial exposure of the XIV ("any two disks in the whole array") against the larger temporal exposure of RAID-5 ("any two failures in the same RAID set within the same week"), or that of any other technology (e.g., replace "any two" with "any three" in the sentence above for RAID-6).

I did not perform this calculation, but claiming that "XIV is less safe" without actually calculating the risk is just FUD.

This article helps readers better understand the risk factor involved with XIV, kudos for that.
At the same time, the article seems to imply that there are other systems where the risk is 0, which is just not true.
If you want to claim that one system carries higher risk than another, you should come up with something better than finger-in-the-air reasoning.

Anonymous said...

From the IBM Redbook regarding the XIV:

"Important: The system will tolerate multiple hardware failures, including up to an entire
module in addition to three subsequent drive failures outside of the failed module, provided
that a new goal distribution is fully executed before a subsequent failure occurs. If the
system is less than 100% full, it can sustain more subsequent failures based on the
amount of unused disk space that will be allocated at the event of failure as a spare
capacity."

It does not explicitly state whether the 3 "subsequent" disk failures occur simultaneously, but obviously the entire module would mean 12 disks. I have only read about the XIV and seen sales presentations. But until I have the chance to either ask an engineer about this, or put a couple hundred GB onto one and try pulling two drives in separate modules myself (which I will definitely do prior to purchasing one), I have to believe that they would have taken that scenario into account.

Anonymous said...

Joerg,

Have you or has anyone else been able to confirm this?

Thanks!

Storage King said...

It looks like EMC stopped funding Joerg's blog and he went silent. Anyone believing this should do some real research before coming to any false conclusions.

First, the XIV box never writes its redundant data onto another drive in the same module. So, if an entire module dies, which is more likely to happen than drives failing randomly in the system, all data is still preserved, and the XIV box immediately re-spreads the data to make it redundant again. If a drive fails, it has pieces of its data on 168 other drives, and the XIV box immediately re-mirrors that data to make it redundant. If a second drive fails outside of the module, the system has that data already mirrored, so it re-mirrors the second drive while the first is re-mirroring, which takes seconds to occur. The 30-minute rebuild time is the only thing you said that's correct. Rebuild only happens when the replacement drive is put into the system, and that data is already redundant anyway when data is re-applied to the new drives. The XIV box ALWAYS re-mirrors all data after a drive loss. That's also why there is so much "extra" space on the XIV, to make sure it can always ensure the data is redundant. There is NO WAY the entire array would come down because of a couple of drives going down. Even the OS lives on modules 4, 5 and 6, so I could lose an OS module and keep going.

Please do some more research from a reputable source before making a decision regarding XIV.

Anonymous said...

Joerg, you would have been better off staying silent; a whole page about one not very well researched function of the XIV... thanks to the medium of the internet, people like you are wasting our time.

Anonymous said...

Do you really believe what the Redbook says? I take the Redbook as a reference; it has never been an official doc for machine specifications.

When the disk system gets older, it will be common to have more disks fail within a short period. So it is possible to have two disks fail around the same time.

Also take note that a fully loaded XIV will take some time for data rebuild.

Anonymous said...

I know your interview with XIV did not go well, but come on...

Anonymous said...

You guys are crazy! Talk about fanboys!

If I were to come into the market today and present you with a NEW revolutionary RAID system that had, as a core element, a 164-drive RAID stripe, you would blow me out of the water. When IBM does it, it's a great thing? So they have a 30-minute rebuild time for a single drive failure... that's great! The likelihood of a second drive failure is definitely low... not zero, but low. And the data is reprotected across other modules... great! What happens when you have a module go down during production hours? Isn't a module just a server? Don't servers fail during production sometimes? What happens then? I think I can guess. You need to rebuild ALL of the data from all of the drives in the set to all of the other drives. Fast, yes... 30 minutes? No way... test it for yourself. It will take hours. And in that time you can't have ANY OTHER FAILURES of ANY OTHER DRIVES, or MODULES! If you want to talk rationally, then look at failure rate analysis in large systems. What IBM is doing here is pushing one variable (reprotection time) lower, but they are pushing another variable (ratio of drives to redundancy) way higher.

You have to admit that if it wasn't IBM doing this, the market would have blown up this idea a long time ago.

Anonymous said...

In Joerg's original post from January of 2009 he said XIV was "hazardous":

"...if you ever lose two drives in the entire box (not a shelf, not a RAID group, the whole box all 180 drives) within 30 minutes of each other, you lose all of the data on the entire array."

Of course this is nonsense...and Joerg has since understood his mistake...since he subsequently posted this three months later:

"...create a "best of breed" approach for your storage environment. Here's an example...IBM XIV storage. The XIV provides wide striped storage on SATA disks and makes it all very easy to manage. This is where I would put the bulk of my data..." http://joergsstorageblog.blogspot.com/2009/04/real-cost-of-storage.html

But then, in the most recent comment here... an anonymous and pitifully misinformed person said (with much blustery wind):

"You guys are crazy! Talk about fanboys!...You need to rebuild ALL of the data from all of the drives in the set to all of the other drives...And in that time...you can't have ANY OTHER FAILURES of ANY OTHER DRIVE, or MODULES!"

It's amazing how pitifully misinformed some people still are about disk system reliability mechanisms in the XIV.

Back in 2003, IBM co-funded this research paper:

"Reliability Mechanisms for Very Large Storage Systems"; Proceedings of the 20 th IEEE/11th NASA Goddard...by Q Xin - 2003 - Cited by 93.

It is now THE most respected and most widely cited work on disk system reliability EVER PUBLISHED.

Anyone who wants to understand how XIV reliability works should read it... especially regarding the technique called "Mirror3"... and preferably do so before spouting any more nonsense.

Anonymous said...

Ummmmm... yes, it has been tested. We filled an array and pulled two drives within a few minutes of each other. We lost the whole array. XIV does some clever things, but the protection scheme and data layout almost guarantee that you lose data if you have a dual failure on a full system.

A few qualifiers:
(a) if the disks are in the same module, you won't lose data
(b) at least one drive has to be in one of the 9 "data only" modules (that's 108 drives)
(c) if there is little or no real (non-zero) data on the system, rebuild will be within seconds and you may not see data loss

Hector Servadac said...

My only concern is about the dispersed storage algorithm. Think about this: 1 disk = 1 TB, 60 pieces of 17 GB each, in 1 MB chunks, with 2 copies in different modules.
It sounds good until you realize you have about 17,000 chunks per piece, and you have from 60 to 180 disks. It means that you can have more than one chunk from the same piece on the same disk, unless you have mirrored nodes or all the chunks from the same piece mirrored on the same disk.
The documentation talks about "randomness", and that sounds scary...
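To put a rough number on why that randomness matters, here is a minimal sketch. It assumes uniform random placement of each chunk's mirror on the drives outside the source module, which is a simplification, not XIV's documented distribution algorithm; the 180-drive layout and the resulting 168 cross-module drives follow the figures quoted earlier in this thread.

# Sketch: under assumed uniform random placement of mirror chunks on the 168
# drives outside a drive's module, how likely is it that some other drive
# shares no data with it?
chunks_per_drive = 1_000_000          # ~1 TB of data in 1 MB chunks
drives_in_other_modules = 168         # 180 drives minus the 12 in this module

expected_shared = chunks_per_drive / drives_in_other_modules
p_no_overlap = (1 - 1 / drives_in_other_modules) ** chunks_per_drive

print(f"Expected chunks shared with a given cross-module drive: ~{expected_shared:.0f}")
print(f"P(a given cross-module drive shares nothing): {p_no_overlap:.1e}")
# Thousands of shared chunks and a no-overlap probability that underflows to
# zero: under this model, losing any two drives in different modules before
# re-mirroring completes means some chunk loses both of its copies.

Which, for what it's worth, is consistent with the pulled-drive test reported a few comments up.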

Anonymous said...

Joerg,

My question to you is very simple. With all of this personal opinion you have on XIV, have you ever implemented and used it in your datacenter?

If not, then I can only say that you claim to know so much about a technology you have never even tried and tested. I believe that a customer testimonial would be more credible than the personal opinion of a person who knows storage. How come you never talk about EMC? How the DMX fails when the laptop inside breaks down. Isn't it that you won't be able to do anything at all within the box, and it takes aeons to do the manual configuration? What about the CLARiiON: what happens to the 5 disks where the FLARE software (which happens to be Windows software) sits? Isn't it that if one of those disks fails, your data is gone with the wind? What I'm trying to say here is that you pick on IBM and their XIV and you have so much opinion about it. Why don't you share with us your personal opinion on the EMC DMX4 and VMax? I would love to hear about them.

Anonymous said...

Jeorg,

This is FUD! IBM has disproved it by having sold 1,700 units of XIV worldwide, and still growing. None of the clients that IBM sold XIV to have ever lost their jobs; in fact, some of them got promoted and got fat bonuses for helping the company save a lot of money... and these people are heroes to their companies today.
The whole article is centred on double drive failure. IBM XIV has ample collateral to discredit your assessment and would be happy to explain this to you, or even put you in touch with one of their biggest clients... one that has over 2PB now. Do you think a client that size would keep buying if it failed? Ever?

Chuck said...

Jeez, people, you sound like you're talking about health care reform and it's liberals vs. conservatives.

How about some real data? The people posting that Joerg is absurd seem to be biased towards IBM, and those that tend to agree with him seem to be backing their view with technical data. I'm not partial either way, just looking for the facts. I know IBM will not tell me.

But it sounds like, technically, if you lose two drives at the same time you will lose the entire array.
So I still don't know, as I compare midrange arrays. Just looking for the facts, just the facts. Check your bias at the door, please.

Anonymous said...

Hi all,

I think this is all about probabilities. We all assume that if a second disk of the same RAID 1/5 group fails during the rebuild time, we will lose the data in that RAID group.

As disk capacities grow and grow, the probability of losing data on high-capacity disks increases, because the time to rebuild them is longer.


Most of us use OS-level wide striping to achieve better performance for our DBs, true? We all know the probability of losing 2 disk drives at the same time, or during the rebuild time, is very low, so we assume the risk. If the rebuild time of an FC drive is several hours, we run the risk of losing all our data.

FC disk rebuild time is about 4-8 hours... if we talk about 1TB SATA drives, the risk is much higher, as it will usually take more than 20 hours...


Imagine we want to use EMC virtual provisioning for all the capacity in a DMX or VMAX storage array; it uses wide striping, so what risk are we assuming if the rebuild time is several hours, not minutes?

ORACLE ASM does something similar...

The probability of losing all your data in the previous 3 examples increases as the rebuild time increases.

I think that XIV has reduced this probability to almost none, as its rebuild time is better than that of any other storage array (worst case 30 minutes).

My point of view is that in the future RAID technology will not be an option. Imagine how long it will take to rebuild a 2TB disk, or a 4TB one? Days? Weeks?
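To put a rough number on that rebuild window, here is a small sketch; the 25 MB/s effective rebuild rate is an assumed figure for an array that throttles rebuild to protect host I/O, not a vendor specification.

# How the traditional RAID rebuild window grows with drive capacity,
# using an assumed effective rebuild rate of 25 MB/s on a busy array.
REBUILD_MBPS = 25  # assumed throughput; real arrays vary widely

for capacity_tb in (1, 2, 4):
    hours = capacity_tb * 1024 * 1024 / REBUILD_MBPS / 3600
    print(f"{capacity_tb}TB drive: ~{hours:.0f} hours of exposure per rebuild")
# Roughly 12, 23 and 47 hours: every doubling of drive size roughly doubles
# the window during which a second failure in the same RAID group is fatal.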

My 2 cents...
Regards.

Anonymous said...

I don't think you can say 30 minutes is the maximum rebuild time. I have seen a fully loaded XIV; 30 minutes may be the minimum rebuild time in that case.

I think it is only reasonable to assume that no equipment is 100% foolproof, especially when it gets older.

Anonymous said...

Anyone ever had a customer engineer come in to replace a failed drive and mistakenly pull the wrong one?

Anonymous said...

I agree with you, the XIV solves this problem as it is able to come back to full redundancy in minutes.

I have never seen customer engineers in my datacenter within minutes of a disk failing... Usually they come after 2-4 hours (best case). So by the time they arrive, there is no chance for that mistake to matter.

Regards

Anonymous said...

Why are we worried about double disk failure? Many disk vendors claim it is unlikely to happen, but in the real world it does happen, especially when your disk system gets older. For XIV, I would be more worried about a double module failure at the same time, as this means you lose 24 disks.

Anonymous said...

Hi folks,
Joerg was right! In a customer's datacenter we now have exactly this case! And the box is nearly new (6 weeks!).
Maybe I get a new boss now :-)
Stop talking about the theory - start looking at the reality...
It's dirty and ugly, and IBM is talking sh... all the time.
