The big idea: the next scientific revolution
A visitor walking the halls of Microsoft Research's campus in Redmond, Washington, is likely to overhear discussions not only about computer science but about a surprising variety of other subjects, from which way a galaxy rotates, to a new AIDS vaccine, to strategies for managing the planet's precious supply of fresh water.
What could these issues possibly have in common? And why would Microsoft, ostensibly a software company, be involved with them? The simple answer is data: vast amounts of data. So vast that when we run the programs that analyze some of the databases, the temperature of the building that houses 10,000 microprocessors shoots up several degrees. Today our computer scientists find themselves in partnership with leading scientists in a wide array of disciplines (astronomy, biology, chemistry, hydrology, oceanography, physics, and zoology, just to name a few), working on efforts such as drug development, alternative energy, and health care cost containment. And, yes, even commercial software projects. We believe that a new generation of powerful software tools, which support collaboration and data exploration on an unprecedented scale, is about to enable revolutionary discoveries in these fields.
For decades computer scientists have tried to teach computers to think like human experts by embedding in them complex rules of linguistics and reasoning. Up to now, most of those efforts have failed to come close to generating the creative insights and solutions that come naturally to the best scientists, physicians, engineers, and marketers. The most talented experts not only have a deep understanding of data but also are able to see the possibilities "between the columns"; they can find the nonobvious connections within or between disciplines that make all the difference.
Let's start with an example of the kind of thinking that drives this type of research. In the 1980s my colleague Eric Horvitz, while training at a Veterans Administration hospital as part of his medical education, observed a disturbing phenomenon. During the holiday season, the hospital experienced a surge in admissions for congestive heart failure. Each year, some patients who had otherwise successfully managed their health despite a weakened heart would reach a tipping point after a salty holiday meal. That extra salt caused their bodies to retain additional fluids, which would lead to lung congestion and labored breathing, and often to a visit to the emergency room.
More than two decades later, Eric and his colleagues at Microsoft Research have developed analyses that can predict with impressive accuracy whether a patient with congestive heart failure who is released from the hospital will be readmitted within 30 days. This feat is not based on programming a computer to run through the queries a given diagnostician would ask or on an overall estimate of how many patients return. Rather, this insight comes from what we call "machine learning," a process by which computer scientists direct a program to pore through a huge database: in this instance, hundreds of thousands of data points involving hundreds of evidential variables of some 300,000 patients. The machine is able to "learn" the profiles of those patients most likely to be readmitted by analyzing the differences between cases for which it knows the outcome. Using the program, doctors can then plug in a new patient's data profile to determine the probability of his or her "bouncing back" to the hospital.
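For readers who want to see the idea in miniature, here is a rough sketch of learning from cases with known outcomes and then scoring a new patient. It is not the Microsoft Research system: the article does not say which algorithm the team used, so plain logistic regression stands in as an illustration. The two features (`lives_alone`, `prior_admissions`), the function names, and the randomly generated records are all hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(weights, bias, profile):
    # Estimated probability of readmission for one patient profile.
    return sigmoid(bias + sum(w * x for w, x in zip(weights, profile)))

def train(records, outcomes, epochs=500, learning_rate=0.5):
    # Fit logistic regression by batch gradient descent on cases
    # whose outcome (1 = readmitted, 0 = not) is already known.
    n = len(records)
    weights = [0.0] * len(records[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for profile, outcome in zip(records, outcomes):
            error = score(weights, bias, profile) - outcome
            for j, x in enumerate(profile):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

# Synthetic, made-up training data (an assumption for illustration only).
# Each profile is [lives_alone (0 or 1), prior_admissions scaled to 0..1].
rng = random.Random(0)
records, outcomes = [], []
for _ in range(500):
    lives_alone = rng.randint(0, 1)
    prior_admissions = rng.randint(0, 5) / 5
    true_risk = sigmoid(-2.0 + 2.0 * lives_alone + 3.0 * prior_admissions)
    records.append([lives_alone, prior_admissions])
    outcomes.append(1 if rng.random() < true_risk else 0)

weights, bias = train(records, outcomes)

# "Plug in" two new hypothetical patients and compare their estimated risk.
p_low = score(weights, bias, [0, 0.0])   # lives with others, no prior admissions
p_high = score(weights, bias, [1, 1.0])  # lives alone, many prior admissions
print(f"lower-risk profile:  {p_low:.2f}")
print(f"higher-risk profile: {p_high:.2f}")
```

A real system would draw on hundreds of variables and on far more careful validation than this toy does; the point is only that the program infers the pattern from past outcomes rather than from hand-written diagnostic rules.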
In one sense we owe this project to a human expert spotting a nonobvious connection: Eric not only earned his MD but also has a PhD in computer science, and he realized that machine-learning techniques similar to the ones he and his team had used to analyze Seattle traffic patterns could work for this important health care challenge. In 2003 they had developed methods of predicting traffic jams by analyzing massive quantities of data, which included information on the flow of traffic over highways, weather reports, accidents, local events, and other variables that had been gathered over several years. The team's new program compared data about patients who were and were not readmitted, and unearthed relationships among subtle evidence in a patient's clinical history, diagnostic tests, and even socioeconomic factors, such as whether the patient lived alone. This integration was not trivial: Information on a patient's living situation, for example, may reside in a social worker's report, not on a medical chart. It is unlikely that a single clinician involved in a patient's care could ever process the volume of variables sufficient to make a prediction like this.
The economic impact of this prediction tool could be huge. If physicians or hospitals understand a patient's likelihood of being readmitted, they can take the right preventive steps. As Eric explains: "For chronic conditions like congestive heart disease, we can design patient-specific discharge programs that provide an effective mix of education and monitoring, aimed at keeping the patients in stable, safe regimes."
On Wall Street, massive data-mining programs are already tracking "sympathetic movements," or related trading patterns among different investment vehicles. Hedge funds and large money managers are placing millions of dollars in bets every day based on these data-discovered relationships.
On the operational side of business, the possibilities are endless. Companies will be able to do massive analyses of customers and business opportunities using programs that unearth patterns in price, buying habits, geographic region, household income, or myriad other data points. The large quantities of available data on advertising effectiveness, customer retention, employee retention, customer satisfaction, and supply chain management will allow firms to make meaningful predictions about the behavior of any given customer or employee and the likelihood of gaps in service or supply. And more and more, we find companies using data techniques to spot irregularities in payments and receivables.
With all those business opportunities, some ask why Microsoft Research is working on so many global health and environmental projects. After all, aren't those projects that the Bill & Melinda Gates Foundation might fund? Yes, but the reason Microsoft Research has several dozen computer scientists working on them is that they involve some of the most enormous data stores imaginable and constitute an invaluable testing ground. We need to expand our own thinking and the capabilities of our tools by working on the biggest problems out there, which happen to be of immense importance to humanity. Tackling these problems also opens more opportunities for collaboration and experiments.