Wednesday, March 12, 2014

Why might Einstein call Big Data “spooky”?



Human beings, and scientists in particular, are conditioned to look for causes. One of the most brilliant scientists of our time, Einstein, found his nemesis in quantum mechanics (also called quantum physics), the study of the smallest particles known to us. Einstein’s theory of relativity and quantum theory contradict each other in many ways, and they have still not been satisfactorily reconciled into one cohesive theory.
For example, at the quantum level, a particle can potentially be in two places at the same time! It defies all logic, and yet it happens. A quantum particle can behave as a wave or as a point-like particle, and this is observable in the lab. Other kinds of strange behavior of these smallest particles, which in turn make up all things big and small, have been known for a long time and can be replicated. There is little doubt that this strangeness actually occurs.
Einstein himself called this behavior “spooky” (his famous phrase for entanglement was “spooky action at a distance”), because we could all see what was happening but were unable to explain how it was happening. Today, these issues are still not fully resolved.
How does this relate to Big Data? To understand this, let’s consider:
Until recently we lived in the era of “Small Data”, which was built on the premise of exactitude. Databases followed very specific formats and rules so that they could be relied upon to return accurate results that perfectly matched the queries run against them. If you used a spreadsheet to analyze data, this is what you did. Data sets were small because, among other things, computing and storage costs were very high, which in turn meant that databases had to be precise in order to be useful. They tended to answer the causal question: what is happening, and why? Working with data this way took time, and actionable intelligence was, in some cases, delayed. But that was the cost of applying analytics to small data.
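As a minimal illustration of that exactitude (the table and query below are invented, not drawn from any particular system), a traditional database query either matches precisely or returns nothing at all:

```python
# A tiny sketch of the "Small Data" mindset: a rigidly structured table
# and a query that demands an exact match. (Schema and rows are invented.)
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "Acme Corp", 120.00), (2, "Globex", 75.50)])

# An exact query: the answer is precise, but only for the exact question asked.
row = conn.execute(
    "SELECT amount FROM orders WHERE customer = ?", ("Acme Corp",)
).fetchone()
print(row)  # (120.0,) -- a perfect match, or nothing
```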
Today, the costs of data storage have shrunk dramatically. Computing power is enormous and climbing. Enter the era of “Big Data”.
One of the earlier examples of Big Data was in healthcare, as Viktor Mayer-Schönberger and Kenneth Cukier point out in their bestseller “Big Data”. Back in 2007, the CDC painstakingly collected data points, aggregated them, and used the results to track the spread of flu across the states, as it had done for years. All of this took time: data had to be gathered from doctors’ offices and hospitals, and by the time the CDC statistics were compiled and published, weeks had gone by and the data was sometimes already stale.

Around that time, Google was testing algorithms and mathematical models against its search data. It found that a specific combination of 45 search terms was a good predictor of the flu and closely tracked the CDC data. Just like the CDC, Google could now estimate how flu was spreading nationwide, but unlike the CDC, its data was available in near real time. And Google did not have to visit doctors’ offices to get data; it extrapolated from its users’ searches.
Google’s data model was not built on the concept of exactitude. Rather, it was based on patterns and correlations that tell us what is happening, not necessarily how it is happening.
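To make this concrete, here is a minimal, hypothetical sketch in Python of what such a correlation-based model looks like. The search terms, frequencies and flu figures below are invented, and Google’s real model (with its 45 terms and vastly more data) was far more sophisticated; the point is only that the model learns a statistical relationship between search volume and flu activity, without any notion of why the two move together.

```python
# A minimal, hypothetical sketch of a correlation-based flu model:
# regress past CDC-reported flu activity on the weekly frequencies of a few
# search terms, then use the fitted weights to "nowcast" the current week.
import numpy as np

# Rows = past weeks, columns = relative frequency of each search term.
search_term_freqs = np.array([
    [0.8, 1.2, 0.5],
    [1.1, 1.6, 0.7],
    [1.9, 2.4, 1.3],
    [2.6, 3.1, 1.8],
    [1.4, 1.9, 0.9],
])
cdc_flu_rate = np.array([1.0, 1.4, 2.3, 3.0, 1.7])   # official figures, published weeks later

# Fit weights by least squares (with an intercept column).
X = np.column_stack([np.ones(len(search_term_freqs)), search_term_freqs])
weights, *_ = np.linalg.lstsq(X, cdc_flu_rate, rcond=None)

# Estimate this week's flu activity from today's search frequencies,
# without waiting for official statistics.
this_week = np.array([1.0, 2.2, 2.8, 1.5])           # leading 1.0 is the intercept
print("estimated flu activity:", this_week @ weights)
```

Once the weights are fitted, an estimate for the current week is available immediately instead of weeks later, which is exactly the appeal.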
The Bureau of Labor Statistics is responsible for computing the CPI, or Consumer Price Index. To do this, it relies on hundreds of employees who report approximately 80,000 prices on everything from home rentals to airline tickets. Many things depend on the CPI, including wages and Social Security payments, so it matters a great deal. Yet again, by the time these numbers come out, they too are already old.
Then two economists from MIT came up with a Big Data alternative. Their Billion Prices Project collected prices on roughly 500,000 products in the US, far more than the Bureau of Labor Statistics was tracking, and did so using only the web. Admittedly, their data was “messy” in that it was not exact, but it made it possible to compute a CPI-like measure far more quickly. And because they had so much more data, individual anomalies mattered less. The project has since become a commercial venture, and every day financial institutions make decisions based on this work.
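As a toy illustration (with invented numbers, not the MIT methodology), the sketch below simulates a large set of messily observed web prices and still recovers the underlying inflation rate from their price relatives; with enough products, the noise in any single observation washes out.

```python
# A minimal, hypothetical sketch of a web-scraped price index.
# Simulate many products whose true prices rise 2% between two periods,
# observed with messy per-product noise (scraping errors, promotions, etc.),
# and estimate inflation as the geometric mean of the price relatives.
import numpy as np

rng = np.random.default_rng(0)
n_products = 500_000
true_inflation = 0.02

base_prices = rng.uniform(1, 500, n_products)                  # last period's prices
noise = rng.lognormal(mean=0.0, sigma=0.10, size=n_products)   # messy multiplicative noise
new_prices = base_prices * (1 + true_inflation) * noise

# Price relative per product, aggregated with a geometric mean
# (a common way to combine item-level price changes).
relatives = new_prices / base_prices
estimated_inflation = np.exp(np.log(relatives).mean()) - 1

print(f"estimated inflation: {estimated_inflation:.4f}")       # lands close to 0.02
```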
In that sense, Big Data is the quantum mechanics of this storyline. We can see what is happening, but we don’t really know (or maybe even care) exactly how it is happening. The price of getting actionable intelligence in a timely way is giving up precise causal questions and answers. Many are still uncomfortable with this lack of precision, and we are only in the second or third inning of the Big Data story. More discomfort is in store.
Today, a significant percentage of trades on Wall Street are driven by computerized Big Data orders. Healthcare, retail, airlines, hotels, almost everything you can imagine is being affected by Big Data. And we haven’t seen anything yet.
In the era of Small Data, accuracy was paramount because only small data sets were feasible, so they had to be exact. But as we delve deeper into this new world of Big Data, we are all making a tradeoff: the quality of any individual piece of information matters less because the sheer volume of data evens things out. What becomes paramount is the speed of decision-making and the new insights we gain, and all of this is being driven by Big Data’s patterns and correlations.
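The sketch below illustrates that tradeoff with invented numbers: a small set of careful measurements and a vastly larger set of sloppy ones both estimate the same underlying quantity about equally well, provided the errors are not systematically biased.

```python
# A minimal sketch (invented numbers) of the Small Data vs. Big Data tradeoff:
# a small sample of carefully measured values versus a huge sample of noisy
# ones, both trying to estimate the same underlying quantity.
import numpy as np

rng = np.random.default_rng(1)
true_value = 100.0

# Small Data: 200 careful measurements with little noise per observation.
small_precise = true_value + rng.normal(0, 1, 200)

# Big Data: 2 million sloppy measurements with 25x the noise per observation.
big_messy = true_value + rng.normal(0, 25, 2_000_000)

print("small & precise:", round(small_precise.mean(), 2))
print("big & messy:   ", round(big_messy.mean(), 2))
# Both land close to 100: volume compensates for the lower quality of each
# individual observation, as long as the errors are not systematically biased.
```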

Prof. Einstein would likely find Big Data “spooky” too, because just like quantum mechanics, we know what it does. But don’t go looking for too many answers about exactly how and why it got there. You may be disappointed.