1. Big Data. 2. Now what?

February 2, 2015

True confessions: I am learning R, finally, after living with and loving SPSS for many years. I am no expert in Big Data, having only been responsible for about 40 GB (about a million records of a thousand variables each) of data goodness in a previous work-life. [For an answer to What is Big Data, see this article.]

Stages of Big Data

We have been approached by clients with a certain impression of Big Data that goes something like:

![Ladyhawk Rolling in Scent](/wp-content/uploads/2020/12/2015-02-Ladyhawk-Rolling-in-Scent.jpg)

Ladyhawk rolling in scent, courtesy of http://www.wolfhaven.org/category/wolves-of-wolf-haven/

1. As a company, we have [discovered] Big Data. 2. As a manager, I should have a dashboard and traffic lighting and pushbutton-ready reports that find everything of importance and nothing trivial or misleading.

Whoa there, big fella.

I’m sorry, but you are not going to be able to skip the step of having a relatively smart person who has time to roll in the data ’til they smell like it. And yes, because they are smart, they will probably have an agenda, and it may be different from yours.

A smart person in the sense of Big Data is a subject matter expert. Take someone who loves what they do, loves what your company does. Or find a smart person who loves learning. Can they at least handle Excel? Sure they can. It is not much harder to get them going on MATLAB or SAS or SPSS [ask me about R in a few months] and they will learn the software as they ‘learn’ the data. Before long, you’ll have a fine data geek, er, analyst.

Now you need to give them that time. Rolling in the data begins very simply. Run the basic things: combine, sort, filter, look at frequencies, binning, distributions, mean/median/mode. Regressions, yes, but they can wait. The most important thing to do, right away, is to throw up some X-Y scatter plots. There is nothing like quick visuals to know where to look for the most sensitive variables, the biggest players, the outliers and what they mean.
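For the curious, here is a minimal sketch of what that first roll in the data can look like, using only Python's standard library. The records, field names and bin width are all invented for illustration, and the text scatter plot is a crude stand-in for whatever plotting tool your analyst actually prefers.

```python
from collections import Counter
from statistics import mean, median, mode

# Hypothetical records; field names and values are invented for illustration.
records = [
    {"region": "east", "units": 12, "revenue": 240.0},
    {"region": "west", "units": 7, "revenue": 150.0},
    {"region": "east", "units": 12, "revenue": 260.0},
    {"region": "north", "units": 3, "revenue": 55.0},
    {"region": "west", "units": 9, "revenue": 185.0},
    {"region": "east", "units": 40, "revenue": 90.0},  # looks like an outlier
]

# Sort and filter.
by_revenue = sorted(records, key=lambda r: r["revenue"], reverse=True)
east_only = [r for r in records if r["region"] == "east"]

# Frequencies of a categorical variable.
region_counts = Counter(r["region"] for r in records)


def bin_label(value, width=10):
    """Label the fixed-width bin an integer falls into, e.g. 12 -> '10-19'."""
    low = (value // width) * width
    return f"{low}-{low + width - 1}"


# Binning a numeric variable to see its rough distribution.
unit_bins = Counter(bin_label(r["units"]) for r in records)

# Mean, median and mode.
units = [r["units"] for r in records]
summary = {"mean": mean(units), "median": median(units), "mode": mode(units)}


def ascii_scatter(points, cols=20, rows=8):
    """Draw a rough text scatter plot of (x, y) pairs, y increasing upward."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * cols for _ in range(rows)]
    for x, y in points:
        col = round((x - x_min) / (x_max - x_min) * (cols - 1)) if x_max > x_min else 0
        row = round((y - y_min) / (y_max - y_min) * (rows - 1)) if y_max > y_min else 0
        grid[rows - 1 - row][col] = "*"
    return "\n".join("".join(line) for line in grid)


plot = ascii_scatter([(r["units"], r["revenue"]) for r in records])
print(region_counts, unit_bins, summary, sep="\n")
print(plot)
```

Even on six made-up rows, the point sitting alone in the lower right of the plot is the one worth asking questions about, which is the whole idea.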

This data-smelling analyst will groom [munge, wrangle] your data. No data is perfectly accessible, complete or high quality, so it will need reformatting, work-arounds and pruning. Be aware that this is 50 to 80% of the cost == time of Big Data. The analyst will make certain assumptions in order to do this cleaning; make sure you know what those assumptions are, and that you agree with them. (Beware of the pony assumption; it leads to spurious correlations.) The analyst will tell you what data you can stop collecting and what data you need to start collecting. And therein lies the problem. You won’t like what you hear, I guarantee it.
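One way to keep those cleaning assumptions out in the open is to have the cleaning step report them alongside the cleaned data. The sketch below is only an illustration: the field names (`customer_id`, `discount`) and the two rules it applies are hypothetical, and your analyst's real rules are exactly the ones you should be reading and agreeing to.

```python
def clean(rows):
    """Return (cleaned_rows, assumptions) so every cleaning choice is visible."""
    cleaned = []
    dropped = 0
    filled = 0
    for row in rows:
        # Hypothetical rule 1: a record with no customer_id cannot be used.
        if row.get("customer_id") is None:
            dropped += 1
            continue
        fixed = dict(row)  # copy, so the raw data is left untouched
        # Hypothetical rule 2: a missing discount means no discount was given.
        if fixed.get("discount") is None:
            fixed["discount"] = 0.0
            filled += 1
        cleaned.append(fixed)

    assumptions = []
    if dropped:
        assumptions.append(f"Dropped {dropped} row(s) with no customer_id")
    if filled:
        assumptions.append(f"Treated {filled} missing discount value(s) as 0.0")
    return cleaned, assumptions


# Invented sample data for illustration.
raw = [
    {"customer_id": 1, "discount": 0.1},
    {"customer_id": None, "discount": 0.2},
    {"customer_id": 2, "discount": None},
]

cleaned, assumptions = clean(raw)
for note in assumptions:
    print(note)
```

The second rule is the kind to argue about: treating a missing discount as zero is a guess about the business, not a fact in the data.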

A good analyst will start bubbling up an agenda out of the data. This agenda will involve fixing problems, relieving unfairness, balancing trade-offs. This agenda will involve poking sore spots, where good data is not being collected because of political constraints, and it will involve identifying weaknesses – issues that have been papered over because people have felt that some problems can’t be approached or solved.

Listen to this person. You don’t have to like them; no one liked Cassandra (although they seem to like Cassandra). But if you want the promised results of Big Data, you’re going to have to swallow the medicine, whatever it turns out to be. You can’t conform Big Data to your agenda; the data will set the agenda. Otherwise, don’t even go down the Big Data route. Like teaching a pig to sing, it will waste your time and money, and annoy the pig.

The analyst’s findings will inform your development team what features should be on a dashboard and what reports are worth canning. Are you done with the analyst now? I hope not. Because the only thing certain in life is change, and you are going to change data collection and analysis behaviours based on your data-informed results, and keep changing as the business climate evolves. Your analyst is going to keep rolling in the data, and will be testing theories, preparing one-time reports, suggesting new data to collect, and so on, so that s/he can keep the dashboard and scheduled reports relevant for you.

So the real Stages of Big Data are:

  1. Collect what you can collect. You may not need it all, and you may need more, but start quick and simple with what you have or can get easily.
  2. Roll in the data. Set your analyst on the path to finding meaningful relationships and informative outliers. Be patient and helpful with assumptions and cleaning/pruning. A little common sense and love of subject go a long way in making a good analyst; you don’t need a rock star.
  3. Construct a dashboard for managers and C-suite. You and your analyst will know what should be on it by this time. [I know a good company who can help you with 1-3.]
  4. Do not force your analyst to use the dashboard. The analyst must have different tools better suited for exploratory analysis. The analyst’s job is to update and improve your dashboard. They can’t make new connections if they are limited to just the dashboard.
  5. Collect new, better data as you are able. Your analyst keeps analyzing, and s/he can guide you on dropping old metrics and adding new ones as the world changes and you improve your business processes.
  6. Yes, it’s never done. You will need to keep changing your dashboard, but that is a good thing. No one needs metrics on selling vinyl any more, but the music keeps playing and somebody’s making money.
  7. Help everyone adjust. You are the manager and you may not like the results of Big Data; how much worse for your people who don’t have your vision? Go forth and Change-Manage. Good *Agile* use of Big Data will give you the confidence to lead those transformations and your people the confidence to follow you.