Tuesday, February 18, 2014

The Abundance of Big Data

The other day I was facing a long drive with nothing to read, so I went onto Audiobooks and grabbed the first thing that came up: Big Data: A Revolution That Will Transform How We Live, Work, and Think (2013) by Viktor Mayer-Schönberger and Kenneth Cukier (Houghton Mifflin Harcourt, Boston and New York).

I learned in the book about the algorithm by which Amazon is able to "recommend books to users based on their individual shopping preferences ... Amazon analyst Greg Linden saw a new way of doing things ... What if the site could make associations between products themselves rather than compare the preferences of people with other people? In 1998 Linden and his colleagues applied for a patent on ‘item-to-item’ collaborative filtering and the shift in approach made a big difference – a big data difference."
(The passage above is quoted from the book, by way of a blog post.)
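For readers curious about what "associations between products themselves" might look like in practice, here is a minimal sketch. It is not Amazon's patented method, whose details the book does not give; it simply scores pairs of items by how often they turn up in the same customer's purchases (cosine similarity over co-purchase counts). The customer names, item labels, and the function names `item_similarities` and `recommend` are all made up for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarities(baskets):
    """Score item pairs by cosine similarity of their co-purchase counts.

    `baskets` maps a (hypothetical) customer name to the items they bought.
    """
    item_count = defaultdict(int)   # how many customers bought each item
    pair_count = defaultdict(int)   # how many customers bought both items
    for items in baskets.values():
        unique = set(items)
        for item in unique:
            item_count[item] += 1
        for a, b in combinations(sorted(unique), 2):
            pair_count[(a, b)] += 1

    sims = defaultdict(dict)
    for (a, b), together in pair_count.items():
        score = together / sqrt(item_count[a] * item_count[b])
        sims[a][b] = score
        sims[b][a] = score
    return sims


def recommend(sims, item, k=3):
    """Return up to k items most associated with `item` (ties broken by name)."""
    ranked = sorted(sims.get(item, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]


# Invented purchase histories, for illustration only.
baskets = {
    "ann": ["big_data", "statistics_primer"],
    "bob": ["big_data", "statistics_primer", "privacy_law"],
    "cat": ["big_data", "privacy_law"],
    "dan": ["cookbook", "statistics_primer"],
}

sims = item_similarities(baskets)
print(recommend(sims, "big_data", k=2))
```

Note that nothing here compares one customer with another directly; the recommendation comes entirely from which items tend to travel together, which is the shift in approach the quotation describes.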

Linden is himself quoted in the book as saying the ideal algorithm would not show you dozens of recommended books but only the very book you were going to buy next. This is exactly what happened when I went onto the site, saw the first book offered, recalled having heard about it on http://democracynow.org, possibly in connection with Edward Snowden's revelations of NSA spying, and with one click handed Audiobooks another data point. That Audiobooks could pinpoint my interests so accurately suggests that the NSA is not the only entity interested in my data. That we collectively take this in stride, along with everything that Facebook, Amazon, Google, Walmart, and any given phone provider know about us, shows how much we accept it as normal behavior, and the book Big Data details how pervasive and normal it is. In fact, most of us accept websites tracking us as a fair trade: our data for their free services. We are only slightly annoyed when we find that corporations are doing this extensively, as when Apple was found to be tracking users' movements via the GPS on their newly purchased iPads without their knowledge (as reported in the book).

Transparency is in fact the issue here. The problem with NSA spying, as Michael Geist points out, is that the government conceals and dissimulates about what it is doing with its harvest of big data. Writing in the Canadian context, he reports that a Canadian 'official' "remarked that in the wake of the Snowden revelations the political risk did not lie with surveillance itself, since most Canadians basically trusted their government and intelligence agencies to avoid misuse. Rather, the real concern was with being caught lying about the surveillance activities. This person was of the view that Canadians would accept surveillance, but they would not accept lying about surveillance programs."

Canada's neighbor to the south has not instilled much confidence in its government's integrity lately, but that aside, the book Big Data is mind-opening in explaining that this government's approach to data mining is not at all unusual. It is in fact the norm for making use of the abundance of data available in our era, and it is certainly what we can expect more of in the future.

The book explains the shift in statistical analysis that big data has brought about. In the past, when data were tediously collected and analyzed, the empirical approach was to form a hypothesis and then attempt to support it by constructing an experiment to establish causality from one variable to the next, drawing on a random sample and extrapolating the results to larger populations. Random sampling was shown to be reasonably reliable, provided the sample size N was large enough, for making accurate predictions about the population at large.
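The claim about sample size can be tried out with a short simulation. The sketch below is my own illustration, not something taken from the book: it invents a population of 100,000 made-up values, draws random samples of increasing size, and reports each sample's estimate of the population mean together with its standard error (sample standard deviation divided by the square root of N). The function name `sample_estimate` and all the numbers are assumptions chosen for the demonstration.

```python
import random
import statistics


def sample_estimate(population, n, rng):
    """Estimate the population mean from a simple random sample of size n.

    Returns (sample mean, standard error of that mean).
    """
    sample = rng.sample(population, n)
    mean = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / n ** 0.5
    return mean, std_error


# A made-up population: 100,000 values spread evenly between 0 and 100.
rng = random.Random(2014)
population = [rng.uniform(0, 100) for _ in range(100_000)]
true_mean = statistics.fmean(population)

print(f"true population mean: {true_mean:.2f}")
for n in (25, 400, 10_000):
    mean, std_error = sample_estimate(population, n, rng)
    print(f"N={n:>6}: estimate {mean:.2f}, standard error {std_error:.2f}")
```

The standard error shrinks roughly with the square root of N, which is why a well-drawn sample of a few thousand could stand in for a population of millions in the era when collecting data was expensive.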

However, where the availability of data approaches infinity, and N equals "all" (all available data can be aggregated and analyzed through computer algorithms), it turns out that the approach to research is not to form a hypothesis at all, but to examine correlations in the data and see what patterns emerge. Thus the emergent approach to research in education, to take the instance that is the topic of this blog, is not toward replicating old experiments and inventing new ones with inevitable shortcomings in data collection methods, where generalizability to wider populations is always in doubt, but toward harvesting as much data as possible and seeing what pops out, as practiced in "learning analytics".

Where the number of data points is massive, and the amount of data is almost limitless, the results produced this way are exceedingly predictive, to the point where real-time pictures of unfolding phenomena (like the spread of flu outbreaks) can be inferred by correlating data points, and to the point where it is becoming impossible to compete in markets without an edge over rivals in data aggregation, storage, and algorithms for analysis.
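To make "seeing what pops out" a little more concrete, here is a toy sketch of a hypothesis-free scan: it computes the Pearson correlation for every pair of columns in a small table and ranks the pairs by strength. The weekly figures and column names are invented for illustration, and real systems work at a vastly larger scale with far more careful statistics, so this is only meant to show the shape of the idea.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_correlations(columns):
    """Correlate every pair of columns and sort by absolute strength."""
    pairs = [
        (a, b, pearson(columns[a], columns[b]))
        for a, b in combinations(sorted(columns), 2)
    ]
    return sorted(pairs, key=lambda pair: abs(pair[2]), reverse=True)


# Invented weekly counts, for illustration only.
weekly = {
    "flu_searches": [12, 15, 22, 40, 63, 58, 35, 20],
    "clinic_visits": [5, 6, 9, 17, 27, 25, 14, 8],
    "umbrella_sales": [30, 12, 25, 18, 22, 28, 15, 24],
}

for a, b, r in rank_correlations(weekly):
    print(f"{a} vs {b}: r = {r:+.2f}")
```

No hypothesis about flu was stated in advance; the strong pairing between searches and clinic visits simply rises to the top of the list, while the unrelated column falls to the bottom.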

Big Data takes pains to point out that correlation does not imply causality (it is what it is; when this and that are present then something else tends to happen as well, and the data show where this has historically been true, though they do not tell us why or how). However, it is possible to arrive at hypotheses that explain observed trends and then continue to observe whether subsequent data support those hypotheses. For example, Ray Kurzweil has collected copious data to support the contention that technology improves on an exponential curve, which on closer examination is seen to be composed of repeated S-curves as paradigm boundaries are crossed. This prediction is akin to Moore's Law, which was first stated in 1965 as a doubling of the number of transistors on integrated circuits every year and was revised in 1975 to a doubling approximately every two years, and which has roughly held ever since. Kurzweil postulates from such data that computers should move beyond human comprehension at a point called the Singularity, which is predicted as early as 2030, or by Kurzweil's reckoning, 2045 (more information on Wikipedia and at http://www.singularity.com/, and in Kurzweil's words in a TED Talk, below).

However, lines can be crossed. The point is made more than once early in the book Big Data, and elaborated on in a later chapter, that such analysis can help authorities predict who will commit crimes before they happen. If arrests (or assassinations) are carried out on the basis of such models, is this itself a crime, a violation of the supreme law of the land? Having read this book, I can see more clearly why governments venture into this grey area in an era where all sides are seeking to leverage big data, or risk being one-upped (though some matters of conscience and justice remain unchanged, or should, and therein lies the conundrum). In its last chapters, Big Data explores the risks and implications for individual freedom and privacy.