Automating experience

Nowadays, evaluating large amounts of data enables us to examine issues for which, until recently, there were no viable theories. Two ETH computer science professors, Joachim M. Buhmann and Donald Kossmann, explain how this will change society.

This article has been published in Globe, no. 2/June 2014.

The term “big data” is on everyone’s lips, but not everyone understands the same thing by it. What does the term mean for you?

Donald Kossmann: My favourite definition of big data is the “automation of experience.” Essentially, this means that you learn from the past with an eye on the future and avoid making the same mistake twice.

Donald Kossmann (l.) and Joachim M. Buhmann advocate developing new models for how we handle sensitive data as a society. (Photo: Tom Kawara)

And why does this need vast amounts of data?

Kossmann: Large amounts of data help because experiences are varied. With large amounts of data, you can show not only what's obvious, what keeps happening, but rare phenomena too. So the more data, the better.

Joachim Buhmann: In artificial intelligence there is the strategy of “case-based reasoning” – a concept borrowed from the justice system. If you have to judge a case, you use precedent as a yardstick. That’s a sound approach because it’s usually easy to spot whether a case is similar. Scientific theories normally help describe phenomena globally. That doesn’t work in all fields.
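
Case-based reasoning can be sketched in a few lines of code: store past cases, find the most similar ones, and let them vote. The cases, features and outcomes below are invented purely for illustration.

```python
# Case-based reasoning in miniature: judge a new case by letting its
# most similar precedents vote. All data here is invented.
from collections import Counter

# Each stored case: (feature vector, outcome decided in the past).
precedents = [
    ((1.0, 0.2, 3.0), "outcome_a"),
    ((0.9, 0.1, 2.8), "outcome_a"),
    ((0.1, 2.0, 0.5), "outcome_b"),
]

def similarity(x, y):
    # Negative squared Euclidean distance: larger means more similar.
    return -sum((a - b) ** 2 for a, b in zip(x, y))

def judge(new_case, k=2):
    # Rank precedents by similarity and let the k closest vote.
    ranked = sorted(precedents, key=lambda c: similarity(new_case, c[0]),
                    reverse=True)
    votes = Counter(outcome for _, outcome in ranked[:k])
    return votes.most_common(1)[0][0]

print(judge((0.95, 0.15, 2.9)))  # -> outcome_a
```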

For example?

Buhmann: In medicine or sociology. While the human race has been working on developing viable theories here since the year dot, these have a negligible predictive value in practice. The best we can do in this situation is to say, “We haven’t got a global theory so we memorise individual cases.” And the more individual cases we consult, the better the picture will be.

Can you back that up with a concrete example?

Kossmann: Yes, a very vivid one: Google Translate. This translation service is based on the principle that a large number of examples from translated texts have been pooled. No one can describe a language conclusively. But you can achieve astonishing results if you take known individual sentence components and reassemble them.

Buhmann: Formalising the aspect of language that isn’t covered by grammar is incredibly complicated. But remembering examples and then saying, “Well, the machine makes a compromise” – that’s possible today.
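
The example-based principle the two describe can be caricatured in a few lines: keep a table of known fragments and reassemble a new sentence from the longest matches. The phrase table below is invented; real systems score millions of learned fragments statistically.

```python
# Toy example-based translation: reassemble a sentence from known
# fragments. The phrase table is invented for illustration.
phrase_table = {
    "guten morgen": "good morning",
    "wie geht es": "how are",
    "ihnen": "you",
}

def translate(sentence):
    words = sentence.lower().split()
    output, i = [], 0
    while i < len(words):
        # Greedily match the longest known fragment starting at position i.
        for j in range(len(words), i, -1):
            fragment = " ".join(words[i:j])
            if fragment in phrase_table:
                output.append(phrase_table[fragment])
                i = j
                break
        else:
            output.append(words[i])  # unknown word passes through
            i += 1
    return " ".join(output)

print(translate("guten morgen wie geht es ihnen"))
# -> "good morning how are you"
```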

Do you look to human learning for inspiration when developing such systems?

Buhmann: Machine learning has a lot to do with human learning. Thanks to evolutionary pressure, however, we are built to recognise patterns as quickly as possible, not as faithfully as possible, which is why we have a tendency to see patterns in random data even when they don’t exist. With big data, we can study phenomena so complex that we are no longer able to grasp them because the correlations are hidden away in the databases. In actual fact, we can process them, but just not in the rational part of our brain. We often judge based on experience and sub-rational thinking. Consequently, a rule-based system for diagnosing illnesses built on the explanations of doctors works less effectively than a system where you let the doctors work, then imitate them. The trick is to mimic as many doctors as possible.
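
Read as code, “let the doctors work, then imitate them” is plain supervised learning: fit a model to the doctors’ recorded decisions rather than to their stated rules. A minimal sketch, with invented measurements and labels:

```python
# Imitating experts instead of encoding their rules: fit a classifier
# to doctors' recorded decisions. All data here is invented; a real
# system would pool decisions from many doctors.
from sklearn.linear_model import LogisticRegression

# Each row: a patient's measurements (e.g. temperature, pulse);
# each label: the diagnosis a doctor actually made for that patient.
cases = [[38.9, 120], [37.0, 80], [39.5, 130], [36.8, 75]]
doctor_labels = ["ill", "healthy", "ill", "healthy"]

model = LogisticRegression().fit(cases, doctor_labels)
print(model.predict([[39.0, 125]]))  # expected: ['ill']
```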

Kossmann: There are also unsuitable applications. Big data tries to look into the future, using the past. This shouldn’t be applied wherever it doesn’t make sense. Financial markets are one example. If we want to learn what the future will be from the past, we automatically alter people’s behaviour – and then you can no longer predict the future anyway. So big data isn’t a formula to get rich on the stock exchanges – at least not for any length of time.

There are very different kinds of data. How can you combine these optimally?

Buhmann: Data fusion – that’s a major issue. One of the most important questions in mathematics is what objects are and how they are compared. Typically, you begin with a definition: A equals B. The next step is to ask what’s similar. This enables me to form classes of equivalent objects and to develop theories on these classes. However, this requires a vastly complex mathematical apparatus.
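
One hedged illustration of this step from “A equals B” to classes: given pairwise equivalence judgements, a union-find structure groups objects into equivalence classes. The records and matches below are invented.

```python
# Forming equivalence classes from pairwise "A equals B" judgements
# with a tiny union-find. Records and matches are invented.
from collections import defaultdict

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps chains short
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)  # merge the two classes

# Pairwise judgements, e.g. produced by a similarity test:
for a, b in [("record1", "record2"), ("record2", "record3"),
             ("record4", "record5")]:
    union(a, b)

classes = defaultdict(list)
for x in list(parent):
    classes[find(x)].append(x)
print(list(classes.values()))
# -> [['record1', 'record2', 'record3'], ['record4', 'record5']]
```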

Does data still need to be available in a standardised form or have we moved on from there?

Kossmann: Unfortunately, 70 per cent of the work still involves cleaning up and preparing the data. If you want to find out whether Joachim Buhmann is a good researcher, for instance, the problem you face is that he sometimes publishes as Joachim Buhmann, sometimes merely as J. Buhmann. So it isn’t easy to find out which publications stem from him. Another difficulty is that data is recorded with different levels of precision and resolution. One person might take your temperature with an electronic thermometer every hour, another once a day using his hand. Collating this different data still requires a lot of effort.
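
The author-name problem Kossmann mentions is a classic record-linkage task. A crude sketch of one clean-up heuristic, normalising names to “initial + surname” before comparing (the publication list is invented, and the heuristic would also wrongly merge different people who share an initial and a surname):

```python
# Crude record linkage for author names: normalise to "initial. surname"
# so that "Joachim Buhmann" and "J. Buhmann" compare equal.
def normalise(name):
    parts = name.replace(".", "").split()
    return parts[0][0].lower() + ". " + parts[-1].lower()

# Invented publication list for illustration.
publications = [
    ("Joachim Buhmann", "Paper A"),
    ("J. Buhmann", "Paper B"),
    ("J. Bauer", "Paper C"),
]

key = normalise("Joachim M. Buhmann")
print([title for author, title in publications if normalise(author) == key])
# -> ['Paper A', 'Paper B']
```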

Buhmann: But ultimately, that’s more of a technical problem than a conceptual one. If I have to compare extremely different health data, it does become tricky, of course.

That’s precisely the aim of the “e-Health” initiative, which is looking to standardise patient data acquisition.

Buhmann: This initiative is necessary as it is the only way we will get a sufficient number of cases to be able to study rare diseases. We have just completed a study on schizophrenia with Klaas Enno Stephan. From the outside, the patients’ symptoms might seem similar. But because schizophrenia is a spectrum disorder, different mechanisms operate in different patients’ brains. If we succeed in dividing such a disease into sub-types, it will be a major step forward. That’s what big data’s all about: getting enough cases to have sufficient information on the rarer sub-types.
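
Computationally, dividing outwardly similar patients into sub-types is a clustering problem. A toy sketch with invented symptom scores; a real study would use far richer data and careful validation:

```python
# Splitting outwardly similar patients into candidate sub-types by
# clustering. Symptom scores are invented for illustration.
import numpy as np
from sklearn.cluster import KMeans

# Each row: one patient's scores on a few symptom scales.
patients = np.array([
    [8, 1, 2], [7, 2, 1], [9, 1, 1],   # one apparent sub-type
    [2, 8, 7], [1, 9, 8], [2, 7, 9],   # another
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(patients)
print(labels)  # e.g. [0 0 0 1 1 1]: two candidate sub-types
```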

And if patients have qualms about data privacy?

Buhmann: Data security has to be guaranteed, of course. I’m convinced that we need a new social contract. As someone in good health, I supply my data for research and reap the benefits as a patient. But we haven’t even created the ethical preconditions for this yet. How do I respond to someone who hasn’t made provisions by donating data as a healthy individual? Do I want to withhold knowledge from this person when they are ill? This is how it works with health insurance: if I don’t take out any insurance as a healthy person, I won’t get any help when I’m ill.

Mr Kossmann, do you share this view?

Kossmann: Yes and no. This is a typical question of the common good versus personal rights. In my opinion, data basically belongs to the individual, which is why I see the tax analogy: while my money belongs to me, I accept that I have to surrender some of it for the common good. The same applies to data. We simply haven’t got the right instruments yet.

So as an individual, how can I prevent this data from being used against me further down the line?

Kossmann: You have to guarantee that the data can’t be used improperly, which isn’t all that easy to define. What is proper in healthcare? Where is the purpose still served? And at what point have you gone beyond the purpose? That all needs to be regulated. A lot can already be achieved, even without any major rules or a tax model. Many people are willing to provide their data if they trust the institution in question. One idea, for example, is for people to contribute their data as members of a cooperative and thus control for themselves how their data is used. That might be a better model than the tax example I mentioned earlier.

All the same: as a Facebook user, for instance, I am constantly bombarded with new terms and conditions that I barely even comprehend. How can trust develop here?

Kossmann: Facebook is an extreme model: I provide a service that you can use and I can do what I want with your data in return. The national tax model is another extreme example: I tell you what data you need to give me. In both cases, the individuals lose control over their data. If we want to earn people’s trust, we have to give them back control over how their data is used, and create new ways in which people can benefit from the utilisation of their data.

Buhmann: And we should show more composure. Taxes have been levied for thousands of years, but there has only been a well-founded taxation policy since the Enlightenment. It simply takes a long time for a social contract to be negotiated.

So it all boils down to a fundamental societal debate: How do we handle our data?

Buhmann: We humans are not solitary egomaniacs. Instead, the value of our lives largely consists in companionship, i.e. interaction. And that’s where the question becomes blurred regarding who owns the data. Who owns the information I share on Facebook? The collective I interact with? Or only me personally? Those are things we need to clarify. The old values we adhered to when using rudimentary technology can’t just be applied to new, high technology with its unknown possibilities.

Kossmann: As I said, we need to create new ideas and try them out. What do we enjoy? What works? Whatever we like will surely prevail. I’m optimistic that the human race will find a positive approach to the issue.

Interviewees:

Joachim M. Buhmann is a Professor of computer science and runs the Machine Learning Laboratory at ETH Zurich. His research focuses on pattern recognition and data analysis, specialising in methodological questions of machine learning, statistical learning theory and applied statistics.

Donald Kossmann is a Professor of computer science at the Institute of Information Systems at ETH Zurich. In his research, he studies the optimisation and scalability of database and information systems.
