Last week we read an interesting paper and post about Google Flu Trends (GFT) and its foibles. The paper points out a couple of lessons that those of us living in the big data analytics world have learned the hard way, but the dangers are worth revisiting as tools like ours (AnalyticsPBI for Azure) begin to move big data analytics into the mainstream of organizational practices. After all, our tool (and others like it) makes it easy and even fun for analytics junkies to use all those available zettabytes of data to answer questions that they’ve long wondered about. But the paper also reminded us of the dangers of ignoring the natural cycles of an analytics process that we talked about in this recent post. If Google had followed the PatternBuilders Analytics Methodology, it might have avoided many of the errors that GFT is now spitting out. In fact, the authors of the paper point out that:
“Although not widely reported until 2013, the new GFT has been persistently overestimating flu prevalence for a much longer time. GFT also missed by a very large margin in the 2011-2012 flu season and has missed high for 100 out of 108 weeks starting with August 2011… This pattern means that GFT overlooks considerable information that could be extracted by traditional statistical methods.”
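To get a feel for how lopsided "high for 100 out of 108 weeks" is, here is a minimal sketch of one traditional statistical check, a one-sided sign test. It is our illustration, not something taken from the paper: the function name `prob_at_least` is ours, and the test assumes each week's miss is independent of the others, which is almost certainly not true for flu data, so treat the result as a rough intuition pump rather than a rigorous p-value.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Figures quoted from the paper: 100 overestimates in 108 weeks.
WEEKS_TOTAL = 108
WEEKS_HIGH = 100

# Assumption (ours): an unbiased forecast is equally likely to miss high or
# low in any given week, and weeks are independent of each other.
p_value = prob_at_least(WEEKS_HIGH, WEEKS_TOTAL)
print(f"Chance of {WEEKS_HIGH}+ high misses in {WEEKS_TOTAL} weeks: {p_value:.2e}")
```

Even allowing for the independence caveat, a forecast that was merely noisy rather than biased would essentially never produce a run like this, and that is the kind of signal a routine review of forecast errors should surface.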
This overestimation is attributed to two primary factors: data hubris and algorithm dynamics.
We run into data hubris quite often or, as the authors state: “The idea that data is a substitute for, rather than a supplement to, traditional data collection and analysis.” When operating in the big data world, it is quite easy to be seduced by the sheer amount of data available and fall into the trap of not fully formulating the question you are trying to answer. This is why we spend a lot of time with data science teams (see our many posts on this topic) trying to understand their goals and objectives before even considering the available data sources and how we might use them. Because if you aren’t asking the right questions or, as in the case of GFT, asking your question and then not “analyzing” the answer you get, you can easily head down the wrong path. Or as our soft-spoken CEO is fond of saying: “Improperly used big data analytics just lets you be stupid faster.”
Having lots of data doesn’t mean you can get away with inadequate analytics processes and project management. Analytics processes must be sound and iterative because answers to questions most often create more questions. As a result, stopping the process without circling back to your original goals, the things that are important for success, will get you into trouble. Any analyst who has built a forecasting model can tell you that exploratory data analysis and “test and learn” rule their lives because:
- No model is ever completely done.
- Things change and factors that may have been important when we started can easily stop being meaningful.
Or as John Brownstein noted in an article on nature.com:
“You need to be constantly adapting these models, they don’t work in a vacuum.”
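One simple way to put "constantly adapting" into practice is to watch a model's recent errors and raise a flag when they start leaning in one direction. The sketch below is only an illustration under our own assumptions: the function name `drift_alerts`, the four-period window, the 10% tolerance, and the toy numbers are all made up for the example and have nothing to do with how GFT was actually monitored.

```python
from collections import deque
from statistics import mean


def drift_alerts(forecasts, actuals, window=4, tolerance=0.10):
    """Return the indexes where the rolling mean relative error exceeds tolerance.

    Assumes every actual value is non-zero (we divide by it).
    """
    recent = deque(maxlen=window)  # keeps only the last `window` errors
    alerts = []
    for i, (forecast, actual) in enumerate(zip(forecasts, actuals)):
        recent.append((forecast - actual) / actual)
        if len(recent) == window and abs(mean(recent)) > tolerance:
            alerts.append(i)
    return alerts


# Toy, made-up numbers: the forecast tracks reality at first, then runs high.
toy_actuals = [2.0] * 8
toy_forecasts = [2.0, 2.1, 1.9, 2.0, 2.6, 2.8, 3.0, 3.0]

print(drift_alerts(toy_forecasts, toy_actuals))
```

An alert here is not a verdict. It is a prompt to circle back to the original question and check whether the factors in the model still mean what they meant when the model was built.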
Worse, if we start with the wrong factors (our personal hypothesis for GFT), then it is absolutely imperative that the first set of results (i.e., the first set of answers) be torn apart to ensure that we have a true correlation and not a simple association at work in our model. This is the traditional analytical process and, as we’ve stated in many posts based on our own experiences, that process is even more critical for big data analysis.
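To make "tearing apart" a first result a little more concrete, here is a minimal sketch of one such check: two series that look tightly correlated only because they share a seasonal cycle. Everything in it is hypothetical and constructed for the illustration (the series names `searches` and `cases`, the four-step cycle, the helper functions); it is not GFT data or Google's method.

```python
from statistics import mean


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def deseasonalize(series, period):
    """Subtract the average value seen at each position in the cycle."""
    phase_means = [mean(series[p::period]) for p in range(period)]
    return [v - phase_means[i % period] for i, v in enumerate(series)]


PERIOD = 4  # hypothetical cycle length, e.g. four "seasons"
season = [10 * (t % PERIOD) for t in range(24)]

# Made-up, unrelated wiggles layered on top of the shared seasonal pattern.
wiggle_a = [1 if (t // 4) % 2 == 0 else -1 for t in range(24)]
wiggle_b = [1 if (t // 8) % 2 == 0 else -1 for t in range(24)]

searches = [s + w for s, w in zip(season, wiggle_a)]
cases = [s + w for s, w in zip(season, wiggle_b)]

raw = pearson(searches, cases)
adjusted = pearson(deseasonalize(searches, PERIOD), deseasonalize(cases, PERIOD))
print(f"raw correlation: {raw:.2f}, after removing seasonality: {adjusted:.2f}")
```

In this constructed case the raw correlation is very high, yet it all but disappears once the shared cycle is removed. The two series were never telling us anything about each other, only about the calendar.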
Additionally, as the authors point out, the very algorithm defining GFT’s search world changed but GFT did not. So it got left behind and became meaningless with regard to defining flu trends. This is why an “analytics team” in any organization can’t be made up of just data scientists (if your analytics tools are so hard that only data scientists can use them, you have a different set of problems). You also have to have smart people from across all functions involved in figuring out the right questions to ask, the data needed for those questions, and whether the answers make sense.
For example, guess what happens when someone in the Supply Chain group of a manufacturer changes the way that orders are coded and doesn’t tell the people responsible for analyzing customer growth? Without the correct information, your analytics team may very well be crunching the wrong data in the wrong way. As an interesting side note, the nature.com article also mentions that:
“Brownstein is one of many researchers trying to harness the power of the web to establish sentinel networks made up not of physicians, but of ordinary citizens who volunteer to report when they or someone in their family are experiencing symptoms of ILI. ‘Flu Near You’, a system run by the HealthMap initiative co-founded by Brownstein at Boston Children’s Hospital, was launched in 2011 and now has 46,000 participants, covering 70,000 people.”
Crowdsourcing of this nature extends the idea that the analytics process and team aren’t just made up of big data sources and data geeks: small data can be just as important to answering questions as big data is. Regardless, crowdsourcing, like other equally imaginative approaches, also requires an analytics tool that is both powerful and easily accessible to all stakeholders. This is why we created AnalyticsPBI for Azure.
The authors end their paper with a plea for more transparency in how “big data tools” like GFT are constructed and managed. They rightly point out that big data and big analytics have the potential to help make the world a better place. But when it comes to public policy (the greater good), as the Guardian points out, we must not confuse correlation with causation (yes, we’ve pointed this out many times as well, but they’re much more eloquent):
“Google doesn’t know anything about the causes of flu. It just knows about correlations between search terms and outbreaks. But as every GCSE student knows, correlation is quite different from causation. And causation is the only basis we have for real understanding… Big data enthusiasts seem remarkably untroubled by this. In many cases, they say, knowing that two things are correlated is all you need to know. And indeed in commerce that may be reasonable. I buy stuff both for myself and my kids on Amazon, for example, which leads the company to conclude that I will be tempted not only by Hugh Trevor-Roper’s letters but also by new releases of hot rap artists. This is daft, but does no harm. Applying the kind of data analytics that produces such absurdities to public policy, however, would not be funny. But it’s where the more rabid big data evangelists want to take us. We should tell them to get lost.”
W. Edwards Deming is credited with the slogan “In God we trust, all others bring data.” Behind this simple notion stood a powerful statistical methodology that helped Japan transform its business and manufacturing processes as well as solidify its reputation for producing innovative, high-quality products. Unfortunately, the slogan, and not the meaning behind it, has made its way into the big data vernacular. The issues with GFT highlight an ongoing problem in our community: data by itself will not solve all our social ills. It is simply one of the tools we can use to derive insights. But our job is not finished once an insight is derived. It is up to us to apply a rigorous, iterative analytics process to establish causation. Perhaps it’s time to amend this slogan: In God we trust, all others bring data, sound iterative analytical processes, and well-rounded data science teams. Not such a simple notion but certainly closer to the truth.