Big Data: What Could Go Wrong and What We Need to Do About It

Big Data storage poses challenges, yet those challenges can be addressed – with the right strategy.

Everyone is talking about massive data growth, but not enough of us are talking about the challenges of understanding that data and creating actionable information from it to make better decisions.

The problem is way more complex than has been presented, and the stakes couldn’t be higher. To help you make better sense of your data and prepare for the future, I’ll address a number of Big Data challenges that have to be solved before we can say we have learned everything we can about our data – and taken all the steps we need to protect ourselves and others.

First, some assumptions:

Network performance is not going to keep pace with the volume of data being collected, so we likely cannot move all the data to one place.

Given the declining Kryder rate, storage costs are not decreasing at rates seen in previous decades.

With those two limitations in mind, what are the challenges for understanding, learning and making decisions from all the data we are collecting?

Challenge 1: Getting the data to the right tier

For years, people have been talking about storage tiering, but I think the future is going to be processor tiering combined with storage tiering. The cost of storing all the data we are collecting is enormous, and we are learning that discarding data is likely a bad idea if we cannot easily recollect it. Given my first assumption that networks are not fast enough for today's sensor density and performance, we are going to have to preprocess data before we get to the point of data consolidation.

Getting the data to the right processor tier in time to make the right decision is a very difficult problem. Take a simple case like cell phone collection of weather observations, or sensor data on vibrations for detecting earthquakes. You could send all of the data every quarter second, or you could send the initial base observations and then only the changes at quarter-second intervals.

This is a very simple case of preprocessing data. It suggests that using the processor closest to the sensor to evaluate and triage the data will significantly reduce network bandwidth, but will it? The key is that you must know what is important about the data before you can figure out what to ship and when to ship it. Figuring out what data is important and when you need it is the hard part, and in most cases you cannot do that without first having all of the data in one place to develop the knowledge model. So you have a chicken-and-egg scenario that makes modeling difficult.
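To make the "base observation plus changes" idea concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than a description of any real sensor platform: the class names DeltaEncoder and DeltaDecoder, the field names, and above all the threshold value. Choosing that threshold is exactly the "what is important" question, and it is the part the code cannot answer for you.

```python
from typing import Dict, Optional

Reading = Dict[str, float]


class DeltaEncoder:
    """Runs next to the sensor (hypothetical). Sends the first reading in
    full, then only the fields that moved by more than `threshold`."""

    def __init__(self, threshold: float = 0.0) -> None:
        # Assumption: one threshold for every field. Real data would
        # likely need a per-field value derived from a knowledge model.
        self.threshold = threshold
        self._last_sent: Optional[Reading] = None

    def encode(self, reading: Reading) -> Reading:
        if self._last_sent is None:
            # Base observation: ship everything once.
            self._last_sent = dict(reading)
            return dict(reading)
        # Compare against the last value actually sent, so slow drift
        # eventually crosses the threshold and gets reported.
        changes = {
            key: value
            for key, value in reading.items()
            if key not in self._last_sent
            or abs(value - self._last_sent[key]) > self.threshold
        }
        self._last_sent.update(changes)
        return changes


class DeltaDecoder:
    """Runs at the consolidation point (hypothetical). Rebuilds the
    current state from the base observation plus later changes."""

    def __init__(self) -> None:
        self._state: Reading = {}

    def decode(self, message: Reading) -> Reading:
        self._state.update(message)
        return dict(self._state)


encoder = DeltaEncoder(threshold=0.5)
decoder = DeltaDecoder()

first = encoder.encode({"temp": 20.0, "pressure": 1013.0})   # full reading
second = encoder.encode({"temp": 20.1, "pressure": 1013.0})  # nothing to send
third = encoder.encode({"temp": 21.0, "pressure": 1013.0})   # only temp

for message in (first, second, third):
    state = decoder.decode(message)
```

With a threshold above zero the scheme is lossy: the receiver never sees the 20.1 reading. Whether that loss matters is the chicken-and-egg problem in miniature, because you cannot know which small changes are safe to drop until you have studied the full data.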

Challenge 2: Our knowledge changes over time

More than 30 years ago, I worked on a project modeling a chromosome to evaluate changes from long-term radiation exposure. The geneticist I was working with called all of the information between the genes “junk DNA” and told me that we did not have to worry about this as part of our model.

Lo and behold, we have learned that the DNA between genes is not really junk and has value for things like replication. So the Big Data question becomes: what if we had not kept the data we collected because we wrongly assumed that it was not useful? There are huge numbers of examples of this over the last hundred years, and the pace has accelerated over the last decade.

Our knowledge changes based on observation of data and a historical look at the information we collect. New algorithms, new data, faster computers and old saved data have all contributed to this new knowledge, but the question is what will even more data, faster computer tiers and the old data provide us in the future?

Challenge 3: Decisions, decisions

With all of this data, we should be able to make decisions more quickly and more accurately, right? This, of course, is the goal, but right now decisions are made without all the data, and often by people. As we move forward, more and more decisions will be made for us with the best data available, but what if the network is interrupted (by sunspot activity, for example), or what if some other event disrupts communication?

What is the fallback when the decision process cannot deliver a decision in the real time that is required, whether that is flying a plane, driving a car, operating a train, or any of the thousands of other potential applications of the future?

The other issue is a legal one. If I make a bad decision driving a car, I am responsible and my insurance rates will reflect my poor judgment. What happens if there is a programming error or missing scenario, a momentary network outage, a computer error, or silent data corruption? How do you track back and place blame, or do I as the driver assume all of the risk? How are you going to track back a silent data corruption, caused by a gamma radiation burst or a bad channel, that leads your car to kill someone in an accident?

As we leave decisions to the complexity of Big Data, we are going to need to figure out what to do when things go awry.

Challenge 4: Security

Some of the decision algorithms are open source software and some of them are closed, but either way there are security issues that need to be considered and well thought out. What if someone could hack into your car (wait, that has already happened)? What if the new form of skyjacking is to hack the plane of the future? These threats are not that farfetched. Who is responsible for ensuring that the software stack cannot be taken over, with a host of bad things to follow, from ransom to death? Are we, as creators of actionable information, up to the challenge of protecting that information from those who would use it against us?

There are countless examples of sloppy security practices today, so how can we ensure that systems of the future cannot be taken over by someone with malicious intent? The stakes will only get higher in the Big Data world of the future.

Challenge 5: The Luddite Mentality

How many times have you heard people say they can make a better decision than a computer? That might be true, but we have already seen Watson win on Jeopardy. There are some people who believe that we have had enough progress, and that more jobs will be lost with these new innovations. Because I believe history more often than not repeats itself, I think there will be resistance from some, even though we are already seeing decisions made by machines. As with every other technological revolution, people will have to reinvent themselves. There has never been any stopping progress.

The Future is Coming, Be Ready

Some of the big thinkers who got us to where we are today are concerned about our future dependency on machine decisions and how those decisions will be made. I am too, but I am not a Luddite. There is no going back. But going forward without a clear understanding of the potential risks, and without addressing security issues and other risks, would be both unwise and naïve. The future is coming whether we like it or not. We need to walk carefully toward it with reason and without fear.

—

This article also appears on Enterprise Storage Forum, featuring in-depth technical and business insights focusing exclusively on the information needs of data storage professionals.