Human beings, and scientists in particular, are conditioned to look for causes. One of the most brilliant scientists of our time, Einstein, met his nemesis in quantum mechanics (also called quantum physics), the study of the smallest particles we know of. Einstein's theory of relativity and quantum theory contradict each other in many ways, and they have still not been satisfactorily reconciled into one cohesive theory.
For example, at the quantum level a single particle can effectively be in two places at the same time. It defies all logic, and yet it happens. Observe a particle and it can shift from behaving like a wave to behaving like a point; this is demonstrable in the lab. Other strange behaviors of these smallest particles, which in turn make up all things big and small, have been known for a long time and can be replicated. There is little doubt that this strangeness actually occurs.
Einstein himself famously dismissed this behavior as "spooky": we could all see what was happening, but we could not explain how it was happening. Today, these questions are still unresolved.
How does this relate to Big Data? To understand that, consider how we used to work with data.
Until recently we lived in the era of "Small Data", which was built on the premise of exactitude. Databases followed very specific formats and rules, so that they could be relied upon to return accurate results that perfectly matched the queries run against them. Anyone who computed figures in a spreadsheet worked the same way. Data sets were small, partly because computing and storage costs were very high, which in turn meant that a database had to be precise to be useful. Databases tended to answer the causal question: what is happening, and why? Working with data this way took time, and actionable intelligence was sometimes delayed. But that was the cost of using small data and applying analytics to it.
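To make the idea of exactitude concrete, here is a minimal sketch in Python; the table, products, and figures are invented for illustration. A small, rigidly structured table answers an exact query with an exact result.

```python
# A minimal sketch of the "Small Data" mindset: a rigidly structured table
# queried for exact matches. The schema and values are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (region TEXT, product TEXT, quarter TEXT, revenue REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Northeast", "Widget", "2007-Q1", 120000.0),
        ("Northeast", "Widget", "2007-Q2", 135000.0),
        ("Midwest",   "Widget", "2007-Q1",  98000.0),
    ],
)

# The query must match the schema exactly; the answer is precise, and causal
# questions ("what happened in this region, this quarter?") are easy to pose.
row = conn.execute(
    "SELECT revenue FROM sales WHERE region = ? AND product = ? AND quarter = ?",
    ("Northeast", "Widget", "2007-Q2"),
).fetchone()
print(row[0])  # 135000.0 -- an exact answer to an exact question
```

The query either matches or it does not; there is no room for messiness, and no tolerance for it either.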
Today, the costs of data
storage have shrunk dramatically. Computing power is enormous and climbing.
Enter the era of “Big Data”.
One of the earlier examples of Big Data was in healthcare, as Viktor Mayer-Schönberger and Kenneth Cukier point out in their bestseller "Big Data". Back in 2007, the CDC tracked the spread of flu across the states by meticulously collecting data points and aggregating them, as it had done for years. All of this took time: data had to be gathered from doctors' offices and hospitals, and by the time the CDC statistics were compiled and published, weeks had gone by and the data was often already stale.
Around that time, Google was testing mathematical models against its search data. It found that a specific combination of 45 search terms tracked the CDC's flu figures closely and could serve as a predictor of the flu. Like the CDC, Google could now estimate how flu was spreading nationwide, but unlike the CDC, its picture was essentially real-time. And Google did not have to go to doctors' offices to get data; it inferred the signal from its users' searches.
Google's data model was not built on the concept of exactitude. It relied instead on patterns and correlations, which tell us what is happening, not necessarily how it is happening.
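To give a flavor of what correlation-driven prediction looks like, here is a hedged sketch in Python with entirely synthetic numbers. It is not Google's actual model or its 45 terms; it simply ranks hypothetical search terms by how closely their weekly volumes track an official flu series.

```python
# A toy sketch (synthetic data, not Google's method): rank candidate search
# terms by how strongly their weekly volume correlates with an official
# flu-rate series, keeping the best-correlated terms as predictors.
import random
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length series."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(0)
weeks = 52
# Simulated official flu rates: higher in one half of the year, plus noise.
cdc_flu_rate = [abs(random.gauss(5 + 3 * (w % 26 < 13), 1)) for w in range(weeks)]

# Hypothetical weekly search volumes: some track the flu curve, some are noise.
terms = {
    "flu symptoms":  [r * 100 + random.gauss(0, 40) for r in cdc_flu_rate],
    "fever remedy":  [r * 80 + random.gauss(0, 60) for r in cdc_flu_rate],
    "cheap flights": [random.gauss(500, 50) for _ in range(weeks)],
}

ranked = sorted(
    ((pearson(volumes, cdc_flu_rate), term) for term, volumes in terms.items()),
    reverse=True,
)
for corr, term in ranked:
    print(f"{term:15s} correlation with flu rate: {corr:+.2f}")
```

The terms that rise and fall with the flu bubble to the top; nothing in the exercise explains why people type them, only that the pattern holds.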
The Bureau of Labor Statistics is responsible for computing the Consumer Price Index (CPI). To do this, it employs thousands of people who report approximately 80,000 prices on everything from home rentals to airline tickets. A great deal rides on the CPI, including wage adjustments and Social Security payments. Yet again, by the time these numbers come out, they too are already old.
Then two economists from MIT came up with a Big Data alternative. They collected prices on some 500,000 products in the US, far more than the Bureau of Labor Statistics was tracking, and did so using only the web. Admittedly their data was "messy" in that it was not exact, but it made it possible to produce a CPI equivalent much more quickly. And because they had so much more data, individual anomalies mattered far less. The project has since become a commercial venture, and every day thousands of financial institutions make decisions based on this work.
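Why is messiness tolerable? Because errors that point in random directions tend to cancel out when you average across enough observations. The sketch below uses simulated prices, not the economists' data or method, to show a huge, noisy sample landing on roughly the same answer as a small, carefully measured one.

```python
# A hedged illustration with synthetic data: individual web-scraped price
# observations are noisy, but averaging over many of them cancels the noise.
import random
import statistics

random.seed(1)
true_inflation = 0.03  # assume prices actually rose 3% year over year

def observed_price_change(noise_sd):
    """One product's observed year-over-year price change, with measurement error."""
    return true_inflation + random.gauss(0, noise_sd)

# A small, carefully curated sample with little noise vs. a huge, messy one.
small_clean = [observed_price_change(0.01) for _ in range(80)]
big_messy   = [observed_price_change(0.20) for _ in range(500_000)]

print(f"small clean sample estimate: {statistics.fmean(small_clean):.4f}")
print(f"big messy sample estimate:   {statistics.fmean(big_messy):.4f}")
# Both land near 0.03; the messy estimate gets there because sheer volume
# averages out the individual errors.
```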
In that sense, Big Data plays the role of quantum mechanics in this storyline. We can see what is happening, but we don't really know (or perhaps even care) exactly how it is happening. The price of getting actionable intelligence in a timely way is giving up the precise causal questions and answers. Many are still uncomfortable with this lack of precision, and we are only in the second or third inning of the Big Data story. More discomfort is in store.
Today, a significant percentage of trades on Wall Street are driven by computerized orders built on Big Data. Healthcare, retail, airlines, hotels, almost everything you can imagine, is being reshaped by Big Data. And we haven't seen anything yet.
In the era of Small Data, accuracy was paramount: data sets were necessarily small, so they had to be exact to be useful. But as we move deeper into this new world of Big Data, we are all making a tradeoff: the quality of any individual piece of information matters less, because the sheer volume of data evens things out. What now matters most is the speed of decision-making and the new insights we gain, and both are being transformed by Big Data's reliance on patterns and correlations.
Professor Einstein would likely find Big Data "spooky" too: just like quantum mechanics, we know what it does, but don't go looking for too many answers about exactly how and why it got there. You may be disappointed.