11 February 2015

Looking beyond Big Data: Are we approaching the death of hypocrisy?


“Come, let us build ourselves a city, and a tower whose top is in the heavens.” The biblical Tower of Babel story is often used to illustrate Man’s hubris, the idea that we humans have of being better than we are. Even if notions of arrogance aforethought (and a conspicuously vindictive God) are taken out of the mix, the tale still reflects on our inability to manage complexity: the bigger things get, the harder they are to deal with. Cross a certain threshold and the whole thing comes crashing down.

Sounds familiar, tech people? The original strategists behind the ill-fated tower might not have felt completely out of place in present-day, large-scale information management implementations. Technologies relating to analytics, search, indexing and so on have had a nasty habit of delivering poor results in the enterprise; while we generally no longer accept the notion of a callous super-being to explain away our failures, we still buy into the mantras that the next wave of tech will create new opportunities, increase efficiency and so on. And thus, the cycle continues.

The latest darling is a small, cuddly elephant called Hadoop. We are told by those who know that 2015 will be the year of ‘Hadooponomics' - I kid ye not - as the open-source data management platform goes mainstream and the skills shortage disappears. Allegedly.

Behind the hype, progress is both more sedate and more profound. Our ability to process information continues to develop along two paths - the first tries to build an absolute (so-called 'frequentist') picture of what is going on, while the second is more relativistic. Mainstream analytics tends towards the former model, based on algorithms which crunch data exhaustively and generate statistically valid results. Nothing wrong with that, I hear you say.

As processing power increases, larger data sets can reveal bigger answers at a viable cost. When I caught up with French information management guru Augustin Huret a couple of years ago, just after he had sold his Hypercube analytics platform to consulting firm BearingPoint, he explained how the economics had changed - a complex analytical task (roughly 15 × 10⁹ floating point operations, or FLOPs) would have required three months on a Cray supercomputer a decade or so earlier. In the intervening period, the task length has been reduced to days, then hours, and the work can now run on much cheaper, GPU-based hardware.

This point is further emphasised by the fact that the algorithms Augustin was working on had first been worked out by his father in the 1970s - long before computers existed that could have handled the data processing required. “The algorithms have become much more accessible for a wider range of possibilities,” he told me - such as identifying and minimising the causes of malaria. In 2009 he worked with the French Institut Pasteur on an investigation into malaria transmission, drawing on data from some 47,000 events across an 11-year period. Using 34 different potential variables, the study was able to identify the most likely target group: children under the age of five with type AA haemoglobin and fewer than 10 episodes of Plasmodium malariae infection.

The race is on: researchers and scientists, governments and corporations, media companies and lobby groups, fraudsters and terrorists are all working out how to reveal similar needles hidden in the information haystack. Consulting firm McKinsey estimates that Western European economies could save more than €100bn ($118bn) by using Big Data to support government decision-making.

Even as we become able to handle larger pools of data, we will always be behind the curve. Data sets are themselves expanding, given our propensity to create new information (to the extent, even, that IDC once predicted we would run out of storage by 2007 - relax, it didn’t happen). This issue is writ large in the Internet of Things - a.k.a. the tendency of Moore’s Law to spawn smaller, cheaper, lower-power devices that are able to generate information. Should we add sensors to our garage doors and vacuum cleaners, hospital beds and vehicles, we will inevitably increase the amount of information we create - Cisco estimates a fourfold increase in the five years from 2013, to reach over 400 zettabytes (a zettabyte being 10²¹ bytes).

In addition to the fact that we will never be able to process all that we generate, exhaustive approaches to analytics still tend to require human intervention - for example to scope the data sets involved, to derive meaning from the results and then, potentially, to hone the questions being asked. “Storytelling will be the hot new job in analytics,” says independent commentator Gil Press in his 2015 predictions. For example, Hypercube was used to exhaustively analyse the data sets of an ophthalmic (glasses) retailer - store locations, transaction histories, you name it. The outputs indicated a strong and direct correlation between the amount of shelf space allocated to children and the quantity of spectacles sold to adults. The very human interpretation of this finding: kids like to try on glasses, and the more time they spend doing so, the more likely their parents are to buy.

As analytical problems have become knottier, attention has turned towards relativistic approaches - that is, models which do not require the whole picture to derive inference. Enter Nonconformist minister Thomas Bayes, who first came up with such models in the 18th century. Bayes’ theorem, which works on the basis of thinking of an initial value and then improving upon it (rather than trying to calculate an absolute value from scratch), celebrated its 250th anniversary in 2013.
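For those who want the formula behind the words: Bayes’ theorem says that the probability of a hypothesis H given evidence E is P(H|E) = P(E|H) × P(H) / P(E). The ‘initial value’ is the prior, P(H); the ‘improvement’ is the posterior, P(H|E), which can in turn serve as the prior when the next piece of evidence arrives.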

Long before this age of electronic data interchange, Bayesian methods were being denigrated by scientists - indeed, they continue to be. “The 20th century was predominantly frequentist,” remarks Bradley Efron, professor of Statistics at Stanford University. The reason was, and remains, simple - as long as data sets exist that can be analysed using supposedly more objective, frequentist means, approaches that start from a subjective prior belief have traditionally been seen as inferior. The advent of technology has in some way forced the hands of the traditionalists, says security data scientist Russell Cameron Thomas: “Because of Big Data and the associated problems people are trying to solve now, pragmatics matter more than philosophical correctness.”

The Reverend Bayes can rightly be seen as the grandfather of companies such as Google and Autonomy, the latter sold to HP for $11bn (an acquisition which is still in dispute). “Bayesian inference is an acceptance that the world is probabilistic,” says Mike Lynch, founder of Autonomy. “We all know this in our daily lives. If you drive a car round a bend, you don’t actually know if there is going to be a brick wall around the corner and you are going to die, but you take a probabilistic estimate that there isn’t."

Through their relativistic nature, Bayesian models are more attuned to looking for interpretations behind data - conclusions which are then fed back to enable better interpretations to be made. A good example is how Google’s search term database has been used to track the spread of influenza - by connecting the fact that people are looking for information about the flu, and their locations, with the reasonable assumption that an incidence of the illness has triggered the search. While traditional analytical approaches may constantly lag behind the curve, Bayesian inference permits analysis that is very much in the ‘now’ because - as with this example - it enables quite substantial leaps of interpretation to be derived from relatively small slices of data, quickly.
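To make that feedback loop concrete, here is a minimal sketch in Python - not Google’s actual model, and every probability in it is an illustrative assumption - showing how a prior belief about a local flu outbreak is revised when a spike in flu-related searches is observed, with the posterior becoming the prior for the next day’s data.

```python
# A minimal sketch of the Bayesian update described above - not Google's actual
# method, and every probability here is an illustrative assumption.

def bayes_update(prior: float, p_evidence_if_true: float, p_evidence_if_false: float) -> float:
    """Return P(hypothesis | evidence) via Bayes' theorem."""
    evidence = p_evidence_if_true * prior + p_evidence_if_false * (1 - prior)
    return p_evidence_if_true * prior / evidence

prior_outbreak = 0.02           # assumed baseline chance of a local flu outbreak this week
p_spike_if_outbreak = 0.90      # assumed chance of a search spike given an outbreak
p_spike_if_no_outbreak = 0.10   # assumed chance of a spike anyway (news coverage, general worry)

posterior = bayes_update(prior_outbreak, p_spike_if_outbreak, p_spike_if_no_outbreak)
print(f"P(outbreak | one day of elevated searches) = {posterior:.2f}")   # roughly 0.16

# Feed the posterior back in as the new prior when the next day's data arrives -
# the 'better interpretations' loop described in the paragraph above.
posterior = bayes_update(posterior, p_spike_if_outbreak, p_spike_if_no_outbreak)
print(f"P(outbreak | two days of elevated searches) = {posterior:.2f}")  # roughly 0.62
```

Two days of modest, indirect evidence shift a two per cent prior to over 60 per cent - exactly the kind of substantial leap from a small slice of data described above.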

Lynch believes we are on the threshold of computers achieving a breakthrough with such interpretations. "We’re actually just crossing a threshold - the algorithms have reached a point where they can deal with the complexity and are able to solve a whole series of new problems. They’ve got to the point where they have enabled one which is much less understood - the ability of machines to understand meaning.” This is not to downplay the usefulness of exhaustive approaches - how else would we know that vegetarians are less likely to miss their flights - but we will never be able to analyse the universe exhaustively, molecule by molecule, however much we improve Heisenberg's ability to measure.

Top down is as important as bottom up, and just as science is now accepting the importance of both frequentist and Bayesian models, so too can the rest of us. The consequences may well be profound - to quote Mike Lynch: “We are on the precipice of an explosive change which is going to completely change all of our institutions, our values, our views of who we are.” Will this necessarily be a bad thing? It is difficult to say, but we can quote Kranzberg’s first law of technology (and the best possible illustration of the weakness in Google’s “don’t be evil” mantra) - “Technology is neither good nor bad; nor is it neutral.” To be fair, we could say the same about kitchen knives as we can about CCTV.

We are, potentially, only scratching the surface of what technology can do for, and indeed to, us. The “human layer” still in play in traditional analytics is actually a fundamental part of how we think and work - we are used to being able to take a number of information sources and derive our own interpretations from them, whether or not those interpretations are correct. We see this as much in the interpretation of unemployment and immigration figures as in consumer decision making, industry analysis, pseudoscience and science itself - ask Ben Goldacre, a maestro at unpicking poorly planned inferences.

But what if such data were already, automatically and unequivocally, interpreted on our behalf? What if immigration could be proven without doubt to be a very good, or a very bad thing? Would we be prepared to accept a computer output which told us, in no uncertain terms, that the best option was to go to war? Meanwhile, at a personal level, how will we respond to our every action being objectively analysed and questioned? “In an age of perfect information… we are going to have to deal with a fundamental human trait, which is hypocrisy,” says Lynch. While we may still be able to wangle ways to be dishonest with each other, it will become increasingly hard to fool ourselves.

The fact is, we may not have to wait long for these questions to be answered. The ability of computers to think, a question for another time, may be further off than some schools of thought believe. But the ability of computers to really question our own thinking, to undermine our capacity to misuse interpretation to our own ends, may be just round the corner. To finish with a lyric from Mumford and Sons’ song, Babel: "'Cause I know my weakness, know my voice; And I'll believe in grace and choice; And I know perhaps my heart is fast; But I'll be born without a mask."

Jon Collins is principal advisor at Inter Orbis
