The cover story of the July 2008 issue of Wired, “The End of Theory” by Chris Anderson, forecasts a new age of science, the Petabyte Age: an age in which data becomes more important than the frameworks we use to understand it. The supposition is that, with the massive amounts of data becoming available, you will be able to run statistical correlations that reveal relationships, and that the existence of those relationships will be enough to form the basis of decisions.
This model of the future is heavily influenced by the approach to searching the Web: from the perspective of a search, it doesn’t matter why a page is the most relevant; that it is the most relevant is all that matters.
This is a dangerous path.
Every time I talk to clients about statistics I bring up a study published in 2006 by business professors Michael Waldman, Sean Nicholson, and Nodir Adilov. The study statistically correlated autism with watching cable television. Cable viewing was inferred to occur at the highest rates in areas where cable subscriptions and bad weather were both highest: if the weather is bad, children were assumed to be more likely to stay inside and, therefore, watch more cable television.
Statistically, there’s nothing wrong with the study, but does it reflect reality? It’s impossible to tell. First, the inferred chain from bad weather and cable subscriptions to increased cable viewing must hold. Second, the study assumes a direct relationship between watching television and autism, rather than something like increased autism testing in the areas where cable subscriptions are highest. It’s a study that opens questions rather than settling one.
More data, even petabytes of data, isn’t going to solve the main problem with statistics: correlations can exist without true causality. All that more data provides is greater certainty that a correlation exists. And more data brings a problem of its own: it increases background noise. In other words, it can mask relationships that are there and surface ones that don’t exist.
With advanced analytical methods the odds that a real relationship will be masked are small, since the software will likely test every possible combination. But those same methods may surface relationships that don’t exist. When a data analysis produces multiple results, how, using the data alone, can you know which is correct? Is it the one with the highest statistical correlation? Does the difference between a possible error of 0.001% and one of 0.002% make the first result somehow more true?
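The trawling problem is easy to demonstrate. Below is a minimal, hypothetical simulation (not drawn from any study mentioned here): it correlates a thousand purely random variables against a purely random outcome and keeps the strongest correlation found. With enough candidates, noise alone produces a correlation that looks convincing.

```python
# Hypothetical sketch: with enough candidate variables, pure noise yields
# an impressive-looking "best" correlation. Nothing here is real data.
import random
import statistics

random.seed(42)

N_SAMPLES = 30      # observations per variable
N_VARIABLES = 1000  # candidate variables to trawl through

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# A random "outcome" with no relationship to anything.
outcome = [random.gauss(0, 1) for _ in range(N_SAMPLES)]

# Trawl every candidate variable and keep the strongest correlation found.
best = max(
    abs(pearson([random.gauss(0, 1) for _ in range(N_SAMPLES)], outcome))
    for _ in range(N_VARIABLES)
)

print(f"strongest correlation found in pure noise: {best:.2f}")
```

Run it and the winning correlation typically lands well above 0.5, strong enough to headline a study, even though every variable was generated at random. The more combinations an analysis tests, the more such accidental winners it will find.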
This is the main problem with search engines, and why the problem of search, as even Google recognizes, is nowhere near solved. Search engines assume that the strongest correlation is the one you’re looking for. In other words, because most people searching for “Robert” want “Robert Scoble” as the result, the assumption is that you do too. Correlation, rather than reality, becomes king.
In a 2004 talk for the TED conference, Malcolm Gladwell spoke about his friend Howard Moskowitz, an experimental psychologist and president of Moskowitz Jacobs, Inc., a consumer insights research firm. Moskowitz did research for Prego to discover the best type of tomato sauce. That research was heavily influenced by a study he had conducted years before for Diet Pepsi: how much aspartame should be added to the mix to create the ideal Diet Pepsi? The experiment was inconclusive; the data was all over the place. Years later Moskowitz made sense of it: there isn’t one ideal Pepsi; there are only ideal Pepsis. In other words, there should be multiple categories. It’s this thinking that he took to Prego, and it resulted in the creation of the much-beloved category of chunky tomato sauce.
What would happen if this data were analyzed using the philosophy of the Petabyte Age? Either the data would be inconclusive, or the highest correlation would be declared the ideal mix. In the first case the data would be useless; in the second it would be wrong, since multiple categories for multiple taste preferences is the ideal solution. Only by understanding what the data means does it become useful; on its own, the number crunching tells us nothing.
You’re probably wondering what this has to do with business. The majority of marketing research has been, and still is, conducted by finding statistical patterns and then, dangerously, using those statistics to make future decisions.
For example, imagine a hypothetical, underperforming lawnmower manufacturer trying to decide what percentages of red and green lawnmowers to ship to Lowe’s. It analyzes last year’s data and sees that nine green lawnmowers sold for every red one, so it changes production to make 90% of its lawnmowers for Lowe’s green and 10% red. When the sales figures come in, hardly any of its lawnmowers have sold.
Repeated statistical analyses show no cause for the relative increase in red lawnmower sales. The company hires a consumer insight firm to discover what went wrong. The firm looks at the Lowe’s stores and at the purchasing decisions of Lowe’s customers. In the stores, the firm finds that the previous year Lowe’s had displayed green lawnmowers at the front of the store, but this year there was no such display. Asked what color they wanted their lawnmower to be, most customers answered red. But when the firm showed customers different colors and asked them to select their favorite lawnmower color from the group, 80% said orange, a color no lawnmower company was making. The next year the company released a slew of orange lawnmowers and outsold every other lawnmower maker in Lowe’s stores.
Analyzing the manufacturer’s data alone would never have revealed any of this. Sense was made of nonsense by coming up with questions to ask and seeking the answers from both the retail stores and the customers.
Just because a lot of data is out there doesn’t mean anyone has collected the relevant data. This is exactly what Howard Moskowitz discovered with tomato sauce: no focus group for Ragu or Prego ever came up with chunky tomato sauce as a type they would like until they were given the option. And no amount of sales data would reveal that green lawnmowers had been displayed at the front of the store the year before.
Only by understanding the customers can we give them what they want. On their own, they don’t know. This has been a guiding force for Steve Jobs at Apple: “You can’t just ask customers what they want and then try to give that to them. By the time you get it built, they’ll want something new … If we’d given customers what they said they wanted, we’d have built a computer they’d have been happy with a year after we spoke to them, not something they want now.”
Observation and questioning give us insight into what customers want. Statistical analysis only shows what they’re doing, and it is best used as a check to make sure the observations you made and the questions you asked were the right ones. Don’t let anyone fool you into believing it’s the other way around.