Saturday, May 17, 2014

The Task of Democratizing Big Data

Companies that fail to take advantage of the opportunities presented by "big data" management and analytics technologies can expect to fall behind the competition and possibly go out of business altogether.

The world is just getting started with big data technologies like Hadoop and MapReduce, and several obstacles -- such as a dearth of skills and old-fashioned thinking about data -- continue to stand in the way of their adoption.

But, companies that embrace the concept now are the ones who will lead the way in the not-too-distant future when entry barriers are not so high. Companies that exploit big data will gain the ability to make more informed decisions about the future and will ultimately bring in more money than those that do not.

The phrase "big data" is most often used to refer to the massive amounts of both structured and unstructured information being generated by machines, social media sites and mobile devices today. The phrase is also used to refer to the storage, management and analytical technologies used to draw valuable business insights from such information. Some of the more well-known big data management technologies include the Apache Hadoop Distributed File System, MapReduce, Hive, Pig and Mahout.

There is certainly no shortage of hype around big data management technologies, but actual adoption levels remain low for two main reasons. First, Hadoop and other big data technologies are extremely difficult to use, and the right skill sets are in short supply. Today, organizations often hire PhDs to handle the analytics side of the big data equation, and those well-educated individuals command high salaries.

The skills used to manage, deploy and monitor Hadoop are not necessarily the same skills that an Oracle DBA might have. For instance, if you want to be a data scientist on the analytics side, you need to know how to write MapReduce jobs, which is not the same as writing SQL queries by any means.
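To make that contrast concrete, here is a minimal sketch of the classic word-count job written as a Hadoop Streaming mapper and reducer in Python. The file names are illustrative; the point is that even this trivial aggregation takes two programs and a distributed sort.

```python
#!/usr/bin/env python
# ---- mapper.py ----
# Reads raw text on stdin; emits one "word<TAB>1" line per word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word.lower(), 1))
```

```python
#!/usr/bin/env python
# ---- reducer.py ----
# Hadoop Streaming sorts the mapper output by key before this runs,
# so all the counts for a given word arrive on consecutive lines.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

The equivalent in SQL is roughly `SELECT word, COUNT(*) FROM words GROUP BY word` -- one declarative statement versus two programs plus cluster plumbing. That gap is exactly the skills problem.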

The second major obstacle standing in the way of increased adoption centers on the notion that most companies currently lack the mindset required to get the most out of big data.

Most large companies today are accustomed to gaining business insights through a combination of data warehousing and business intelligence (BI) reporting technologies. But the BI/data warehousing model is about using data to examine the past, whereas big data technologies are about using data to predict the future. Taking advantage of big data requires a shift -- a very basic shift in some organizations -- to actually trusting data and going where the data leads you. Big data is about looking forward, making predictions and taking action.

As with all emerging technologies, big data management and analytics will eventually become more accessible to the masses -- or democratized -- over time. But some important things need to happen first.

For starters, new tools and technologies will be needed to reduce the complexity associated with working with big data technologies. Several companies -- like Talend, Hortonworks and Cloudera -- are working to reduce big data difficulties right now. But, more innovation is needed to make it easier for users to deploy, administer and secure Hadoop clusters and create integrations between processes and data sources.

Right now you need some pretty sophisticated skills around MapReduce and other languages, or around statistical packages like SAS, to be a top-flight data scientist. We need tools that can abstract away some of that expertise so that you don't need to have a PhD to really explore big data.
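As one example of the kind of abstraction I mean (my choice here, not something the vendors above are necessarily shipping): the open-source mrjob library lets a Python programmer express the same word-count job as a single class, without hand-wiring the mapper, reducer, and sort step shown earlier.

```python
# A minimal sketch using the open-source mrjob library (pip install mrjob).
# mrjob handles the plumbing -- stdin/stdout framing, the shuffle/sort,
# and submission to a local runner or a Hadoop cluster.
from mrjob.job import MRJob

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # Called once per input line; yields (word, 1) pairs.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Called once per word with all of its counts; yields the total.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()
```

`python wordcount.py input.txt` runs it locally for testing; adding `-r hadoop` submits the same code to a cluster. That is the direction the tooling needs to go.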

The task of democratizing big data will also require a great deal of user training and education on topics like big data infrastructure, deploying and managing Hadoop, integration, and scheduling MapReduce jobs. We really need to tackle the problem from both ends: one is to make the tools and technologies easier to use, but we also have to invest in training and education resources to help DBAs and business analysts up their game and operate in the big data world.

Monday, May 12, 2014

Modernizing Your Backups

This week, I'd like to spend a little time talking about backup modernization, or as I prefer to call it, data protection modernization. The process we use for traditional backups hasn't really changed much in 20 or 30 years. We do a full backup once per week and take some kind of incremental backup of our data every day in between. These backups are always copied to some other storage mechanism, like tape or, these days, disk, and a retention is attached to each backup that defines how long we need to keep it. Those retentions are important, since they define things like how much dedicated backup disk we need or how many tapes we need to have on hand. They also play an important role later on when/if we decide to change the way we do backups.
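As a rough illustration of how retention drives capacity, here is a back-of-the-envelope sketch. Every number in it (data size, daily change rate, retention) is an assumption for illustration, not a recommendation.

```python
# Rough capacity estimate for a weekly full + daily incremental scheme.
# All inputs are illustrative assumptions -- plug in your own numbers.
data_tb = 100.0        # size of the protected data set, in TB
daily_change = 0.05    # fraction of the data that changes per day
retention_weeks = 4    # how long each backup must be kept

fulls = retention_weeks * data_tb                            # one full per week
incrementals = retention_weeks * 6 * data_tb * daily_change  # six per week

print("Fulls on hand:        %.0f TB" % fulls)         # 400 TB
print("Incrementals on hand: %.0f TB" % incrementals)  # 120 TB
print("Total backup storage: %.0f TB" % (fulls + incrementals))  # 520 TB
```

Stretch the retention from 4 weeks to 12 and the storage requirement triples. That is why retentions matter so much in any modernization plan.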

But first, let's talk about the fact that traditional backup processes are really beginning to become more and more problematic. Why? There are actually a number of reasons. First, and perhaps most obvious, data sets are becoming larger and larger every day. This means that either the backups take longer and longer to complete, or more and more backup infrastructure needs to be put in place. Dedicated 10GbE connections, backup to disk, and more and faster tape drives all need to be put into place just to keep up. Yet it's a losing battle. The data sets just keep getting bigger. For example, a NAS array that holds a petabyte of data isn't terribly unusual today, though it was not all that long ago. These bigger and bigger data sets are now beginning to outstrip the ability of the storage system to send data to the backup system in a timely manner. Things like NDMP are just not able to keep up with these very large data sets. So data set size is certainly one of the more pressing reasons that people are beginning to look into modernizing their backups.

Another reason that people are beginning to look at modernizing their backup processes is that backup windows are getting smaller and smaller, and in some cases, closing completely. Back in the day, we had all night to run backups. Yes, of course we had to dodge in between the batch jobs, but that was easy enough to do when you had 12 or more hours to work with. Those days are pretty much over. Today you are lucky to get any time at all to back up the data, and as I said above, in some cases you really don't have a window at all.

Finally, Recovery Time Objectives (RTOs -- how quickly you must be able to restore) are getting shorter and shorter, and Recovery Point Objectives (RPOs -- how much data you can afford to lose) are getting smaller and smaller. An RPO of one hour, for example, means a nightly backup is no longer good enough; you need a recovery point at least every hour. What this means for the backup administrator is that they must take more backups, and must be able to restore from those backups more quickly.

So, what to do? The first step that many of my customers have taken is to start to include snapshots as part of the backup process. This addresses the issue of RTOs and RPOs, since you can take those snapshots quickly, and you can recover from them quickly. You can also take multiple snapshots per day, so you have a much more fine-grained ability to recover data to a particular point in time. However, most people continue to do their regular backups as well, based on the premise that snapshots aren't backups, since they don't make a full copy of the data to another storage medium. But for some customers it's becoming so problematic to do those traditional fulls and incrementals that they are revisiting this position. Specifically, if they were to have a problem with their storage array such that they lost data and couldn't recover from a snapshot, isn't that the definition of a disaster in the data center? If you accept that premise, then you can start to consider a combination of snapshots and, say, data replication for disaster recovery as a viable, complete backup solution, and drop traditional backups entirely.
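To make the snapshot piece concrete, here is a hypothetical sketch of snapshot scheduling with retention-based pruning. The `array-cli` command, its subcommands, and the naming scheme are all invented for illustration; a real array has its own CLI or API, and most shops would drive this from the array's scheduler or their backup software rather than a hand-rolled script.

```python
# Hypothetical sketch: take a timestamped snapshot on each run and prune
# snapshots older than the retention window. "array-cli" is an invented
# CLI standing in for whatever your storage array actually provides.
import subprocess
from datetime import datetime, timedelta

VOLUME = "prod_db_vol"        # illustrative volume name
RETENTION = timedelta(days=7)
PREFIX = "auto-"

def take_snapshot():
    name = PREFIX + datetime.now().strftime("%Y%m%d-%H%M%S")
    subprocess.check_call(["array-cli", "snapshot", "create", VOLUME, name])

def prune_snapshots():
    out = subprocess.check_output(["array-cli", "snapshot", "list", VOLUME])
    for name in out.decode().split():
        if not name.startswith(PREFIX):
            continue  # leave manually taken snapshots alone
        taken = datetime.strptime(name[len(PREFIX):], "%Y%m%d-%H%M%S")
        if datetime.now() - taken > RETENTION:
            subprocess.check_call(
                ["array-cli", "snapshot", "delete", VOLUME, name])

if __name__ == "__main__":
    take_snapshot()    # run from cron every hour for an hourly RPO
    prune_snapshots()
```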

A move to nothing but snapshots and replication as your data protection mechanism solves a number of issues. It addresses the ever-growing backup infrastructure, for example, by leveraging space you already have on your storage array and a DR plan (replication) you may very well already have in place. Admittedly, for some longer retentions it might mean you need a bit more disk space in your array, but because of the nature of snapshots it's probably the same or less space than you would need for disk-based backups. If you are already backing up to an external backup-to-disk array like a Data Domain, you can repurpose that budget and add the space you need to your storage array to hold all of the snapshots you need/want.
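To see why, here is a quick, illustrative comparison using the same assumed numbers as the earlier sizing sketch. It ignores deduplication on the backup target, which would narrow the gap, but the basic point stands: snapshots only consume space for changed blocks.

```python
# Illustrative comparison: 4 weeks of retention held as array snapshots
# versus weekly fulls + daily incrementals on a backup-to-disk target.
data_tb = 100.0
daily_change = 0.05
retention_days = 28

# Snapshots share unchanged blocks with the live volume; each day of
# retention costs roughly one day's worth of changed blocks.
snapshot_tb = retention_days * data_tb * daily_change      # 140 TB

# Weekly fulls plus daily incrementals, as in the earlier estimate.
backup_tb = 4 * data_tb + 4 * 6 * data_tb * daily_change   # 520 TB

print("Snapshot space:       %.0f TB" % snapshot_tb)
print("Backup-to-disk space: %.0f TB" % backup_tb)
```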

Another method now beginning to become popular for modernizing your backups is to leverage changed block tracking. This is a mechanism in which the backup application, the storage array, or the hypervisor keeps track of the specific blocks that have changed, and the backup application only "backs up" those changed blocks. This can reduce the amount of backup traffic from the storage array to the backup infrastructure significantly, thus addressing the issue of the ever-growing backup data sets. If you couple this with CDP (Continuous Data Protection) or near-CDP functionality, it will also address the RPO issues, and since recovery from this kind of backup often means sending less data back to the storage array/application, it can also address the RTO issues.
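Here is a self-contained toy sketch of the idea behind changed block tracking: hash fixed-size blocks, compare against the hashes recorded at the last backup, and ship only the blocks that differ. Real implementations live in the hypervisor or the array and track writes as they happen rather than re-reading and hashing everything, but the data-reduction principle is the same.

```python
# Toy changed-block backup: only blocks whose hash differs from the
# previous run are copied to the destination directory.
import hashlib
import json
import os

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB blocks; an illustrative choice

def block_hashes(path):
    """Hash every fixed-size block of the file at `path`."""
    hashes = []
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            hashes.append(hashlib.sha256(block).hexdigest())
    return hashes

def backup_changed_blocks(source, state_file, dest_dir):
    old = []
    if os.path.exists(state_file):
        with open(state_file) as f:
            old = json.load(f)  # hashes recorded at the last backup
    new = block_hashes(source)
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    changed = 0
    with open(source, "rb") as src:
        for i, digest in enumerate(new):
            if i < len(old) and old[i] == digest:
                continue  # block unchanged since the last run; skip it
            src.seek(i * BLOCK_SIZE)
            with open(os.path.join(dest_dir, "block-%06d" % i), "wb") as out:
                out.write(src.read(BLOCK_SIZE))
            changed += 1
    with open(state_file, "w") as f:
        json.dump(new, f)  # becomes the baseline for the next run
    print("Backed up %d of %d blocks" % (changed, len(new)))

# Example: backup_changed_blocks("/data/vm.img", "vm.img.hashes", "run1/")
```

With a 5% daily change rate, a run like this moves 5 TB instead of 100 TB, which is the whole appeal.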

However, since you are probably already doing some kind of backup, most likely a traditional backup, the question becomes: how do I get from my current traditional backups to one of these more modern backup techniques? While on the surface it may seem simple enough, there are a number of issues to consider. First, you need to consider your existing backups. Those backups have a retention, and so you need to keep your existing backup software/mechanism in place, at least until the retentions on those existing backups have expired. One question that often crops up in this regard is: what if I have backups with very long retentions, like 7 years? Does this mean I need to keep my existing backup mechanism in place for 7 years? Well, that's certainly one way to handle the problem. One way to mitigate the issue a little, if you can, is to P2V your existing backup servers once you've switched all your backups to the new method. You can then shut down those VMs, and only spin them up if you need to get back at that old data for some reason.

Another way to address the issue is to recognize that backups with long retentions are often not backups at all; they are actually archives, and they probably shouldn't have been backups in the first place. This is the perfect opportunity to start a dialog with your customers about the difference between backup and archive, and to get an archive mechanism in place to handle that data. The difference between archive and backup is a topic near and dear to my heart, but it's also beyond the scope of this posting. Just keep it in mind when you go to do your backup modernization planning.

The other issue that you should consider when planning to modernize your backups is management. Much of the utility of today's backup software, such as CommVault, NetBackup, and TSM, is around managing the backups: scheduling them, monitoring that they complete successfully, and reporting on them, both from an administrative perspective and up the management tree to your customers, so that everyone is assured that their data is protected. Many people think that moving to a new, more modern backup process means getting rid of these tried and true software programs. However, there may be an advantage to keeping them in place. For example, that reporting mechanism that is so important to your business then also stays in place. Considering that snapshots, for example, are often managed by software provided by the array manufacturer, which frequently only manages the snapshots on one array at a time, you could end up in a situation where your backups are modernized but your backup management has taken a step back in time. This is also true if you bring on several different techniques to back up your data. For example, I know of customers who use snapshots and replication for their databases, and then use something like Veeam to back up their virtual infrastructure. This has the potential to create an even bigger management/administrative/reporting headache.

So, if you can leverage your current backup software to manage your snapshots and/or perform CDP-like functions via changed block tracking, then I believe you've hit on the best of both worlds. The good news is that most of the backup software vendors have recognized this, and are moving aggressively to add these kinds of features to their products. Admittedly, some are further ahead in some areas than others, but it's not like you have to change overnight, so implementing the features as they appear in your backup software isn't necessarily a bad thing.

Saturday, May 3, 2014

It takes courage to say "yes"

Today I want to talk about something a little different. While my posts on here have, in the past, all been technical, some of us are also in leadership roles. So I think that occasionally I might share some of my nearly 30 years of experience in that regard as well.

What I want to talk about in this post is that, from a leadership point of view, it really does take courage to say "yes", especially to a new idea. “Definitely not” is quicker, simpler, and easier than saying, “Tell me more.” But a quick “no” devalues and deflates teammates.

Some of the reasons that leaders are constantly saying "no" include:

  1. They think that it makes them look weak when they say "yes" too often.
  2. They prefer the "safety" of the status quo. This is another way of saying they are afraid of change, or at least that it makes them uncomfortable.
  3. They haven’t clearly articulated mission and vision. Off-the-wall suggestions indicate the people in the ranks don’t see the big picture.
There are some dangers to offhanded yeses, however. Offhanded yeses can dilute your resources, divide energy, and distract focus. So, what do good leaders do? They explore "yes". I know that takes time, but I believe the time spent is a good investment.

Here are 8 questions to ask on the way to "yes":
  1. What are you trying to accomplish?
  2. How does this align with mission or vision?
  3. Who does this idea impact? How?
  4. How will this impact what we are currently doing?
  5. What resources are required to pull this off?
  6. How does this move us toward simplicity and clarity? But remember, new ideas often feel complex at first.
  7. Is a test-run appropriate?
  8. How will we determine success or failure?
Leaders who say yes end up doing what others want, and that’s a good thing. Remember too that courageous leaders are willing to risk being wrong sometimes in order to be right most of the time. They know that decisions move the organization forward. They know that the lack of a decision is in fact a decision: a decision to do nothing, which is almost always wrong and at times catastrophic.

So, are you a leader that says "yes"?