University of Wisconsin geologist Shanan Peters was frustrated by how much he didn’t know.
Most geological discoveries were locked away in troves of research journals so voluminous that he and his colleagues could read only a fraction of them. The sheer magnitude of existing research forced most geologists to limit the scope of their work so they could reasonably grasp what had already been done in the field. Research that received little notice when it was published was too often consigned to oblivion, wasting away in dusty journals, even if it could benefit contemporary scientists.
A decade ago, Peters would have had to accept his field’s human limitations. That’s no longer the case. In the summer of 2012, he teamed up with two University of Wisconsin computer scientists on a project they call GeoDeepDive.
The computer system built by professors Miron Livny and Christopher Re will pore over scanned pages from pre-Internet science journals, generations of websites, archived spreadsheets and video clips to create a database comprising, as nearly as possible, the entire universe of trusted geological data. Ultimately, the system will use contextual clues and technology similar to IBM’s Watson to turn those massive piles of unstructured and often forgotten information—what Livny and Re call “dark data”—into a database that Peters and his colleagues could query with questions such as: How porous is Earth’s crust? How much carbon does it contain? How has that changed over the millennia?
The benefits of GeoDeepDive will be twofold, Peters says. First, it will give researchers a larger collection of data than ever before with which to attack problems in the geosciences. Second, it will allow scientists to broaden their research because they will be able to pose questions to the system that they lack the expertise to answer on their own.
“Some problems were kind of off limits,” Peters says. “You couldn’t really think about reasonably addressing them in a meaningful way in one lifetime. These new tools have that promise—to change the types of questions we’re able to ask and the nature of answers we get.”
Order From Chaos
GeoDeepDive is one of dozens of projects that received funding from a $200 million White House initiative launched in March 2012 to help government agencies, businesses and researchers make better use of what’s called “big data.”
Here’s what that means: Data exist all over the world, in proliferating amounts. Satellites beam back images covering every square mile of Earth multiple times each day; publishers crank out book after book; and 4.5 million new URLs appear on the Web each month. Electronic sensors record vehicle speeds on the Interstate Highway System, weather conditions in New York’s Central Park and water activity at the bottom of the Indian Ocean. Until recently, scientists, sociologists, journalists and marketers had no way to make sense of all these data. They were like U.S. intelligence agencies before the Sept. 11 terrorist attacks. All the information was there, but no one was able to put it together.
Three things have brought order to that cacophony in recent years. The first is the growth of massive computer clouds that virtually bring together tens or hundreds of thousands of servers and trillions of bytes of processing capacity. The second is a new brand of software that can link hundreds of those computers together so they effectively act like one massive computer with a nearly unlimited hunger for raw data to crunch.
The third element is a vastly improved capacity to sort through unstructured data. That includes information from videos, books, environmental sensors and basically anything else that can’t be neatly organized into a spreadsheet. With that capacity, computers can act more like humans, pulling meaning from complex information such as Peters’ geosciences journals without, on the surface at least, reducing it to a series of simple binary questions.
“For a number of years we’ve worked really hard at transforming the information we were collecting into something that computers could understand,” says Sky Bristol, chief of Science Information Services at the U.S. Geological Survey. “We created all these convoluted data structures that sort of made sense to humans but made more sense to computers. What’s happened over the last number of years is that we not only have more powerful computers and better software and algorithms but we’re also able to create data structures that are much more human understandable, that are much more natural to our way of looking at the world.
“The next revolution that’s starting to come,” he says, “is instead of spending a lot of energy turning data into something computers can understand, we can train computers to understand the data and information we humans understand.”
Big data has hit the digital world in a big way. The claims for its power can seem hyperbolic. A recent advertisement for a launch event for the book Big Data: A Revolution That Will Transform How We Live, Work, and Think (Eamon Dolan/Houghton Mifflin Harcourt, 2013) promised the authors would explain why the “revolution” wrought by big data is “on par with the Internet (or perhaps even the printing press).”
Big data’s promise to transform society is real, though. To see its effect one need not look to Gutenberg but to Zuckerberg, Page and Brin. Each day Facebook and Google chew through millions of pages of unstructured text embedded in searches, emails and Facebook feeds to deliver targeted ads that have changed how sellers reach consumers online.
Retailers are mining satellite data to determine what sort of customers are parking in their competitors’ parking lots, when they’re arriving and how long they’re staying. An official with Cisco’s consulting arm recently suggested big box retailers could crunch through security camera recordings of customers’ walking pace, facial expressions and eye movements to determine the optimal placement of impulse purchases or what store temperature is most conducive to selling men’s shoes.
Big data is making an appearance in international aid projects, in historical research and even in literary analysis.
Re, the University of Wisconsin computer scientist, recently teamed with English professor Robin Valenza to build a system similar to GeoDeepDive that crawls through 140,000 books published in the United Kingdom during the 18th century. Valenza is using the tool to investigate how concepts such as romantic love entered the English canon. Ben Schmidt, a Princeton University graduate student in history, has used a similar database built on the Google Books collection to spot linguistic anachronisms in the period TV shows Downton Abbey and Mad Men. His assessment: The Sterling Cooper advertising execs of Mad Men may look dapper in their period suits but they talk about “keeping a low profile” and “focus grouping”—concepts that didn’t enter the language until much later.
The ‘Holy Grail’
The White House’s big data investment was spawned by a 2011 report from the President’s Council of Advisors on Science and Technology, a group of academics and representatives of corporations including Google and Microsoft. The report found private sector and academic researchers were increasingly relying on big data but weren’t doing the sort of basic research and development that could help the field realize its full potential.
The council wasn’t alone. The McKinsey Global Institute, the research arm of the consulting firm McKinsey & Company, predicted in May 2011 that by 2018 the United States will face a 50 percent to 60 percent gap between demand for big data analysis and the supply of people capable of performing it. The research firm Gartner predicted in December 2011 that 85 percent of Fortune 500 firms will be unprepared to leverage big data for a competitive advantage by 2015.
The White House investment was funneled through the National Science Foundation, the National Institutes of Health, and the Defense and Energy departments, among other agencies. The grants are aimed partly at developing tools for unstructured data analysis in the private, academic and nonprofit worlds but also at improving the way data is gathered, stored and shared in government, says Suzi Iacono, deputy assistant director of the NSF’s Directorate for Computer and Information Science and Engineering.
As an example, Iacono cites the field of emergency management. New data storage and analysis tools are improving the abilities of the National Weather Service, FEMA and other agencies to predict when and how major storms such as Hurricane Sandy are likely to hit the United States. New Web and mobile data tools are making it easier for agencies to share that information during a crisis.
“If we could bring together heterogeneous data about weather models from the past, current weather predictions, data about where people are on the ground, where responders are located— if we could bring all this disparate data together and analyze them to make predictions about evacuation routes, we could actually get people out of harm’s way,” she says. “We could save lives. That’s the Holy Grail.”
In the near term, one of big data’s largest effects on government programs is likely to be cutting down on waste and fraud, according to a report from the industry group TechAmerica released in May 2012.
The Centers for Medicare and Medicaid Services launched a system in 2011 that crunches through the more than 4 million claims it pays daily to determine the patterns most typical of fraud and possibly deny claims matching those patterns before they’re paid out. The government must pay all Medicare claims within 30 days. Because it lacks the resources to investigate all claims within that window, CMS typically has paid claims and then investigated later, an inefficient practice known as “pay and chase.”
The board that tracks spending on President Obama’s 2009 stimulus package used a similar system to weed out nefarious contractors.
Big data is having an impact across government, though, in areas far afield from fraud detection. The data analysis company Modus Operandi received a $1 million Army contract in late 2012 to build a system called Clear Heart, which would dig through hundreds of hours of video—including footage from heavily populated areas—and pick out body movements that suggest what officials call “adversarial intent.” That could mean the posture or hand gestures associated with drawing a gun or planting a roadside bomb or the gait of someone wearing a suicide bombing vest.
The contract covers only the development of the system, not its implementation. But Clear Heart holds clear promise for drone surveillance, Modus Operandi President Richard McNeight says. It could be used to alert analysts to possible dangers or to automatically shed video that doesn’t show adversarial intent, so analysts can better focus their efforts.
The technology also could have domestic applications, McNeight says.
He cites the situation in Newtown, Conn., where a gunman killed 20 elementary school students and six adults. “If you’d had a video camera connected with this system it could have given an early warning that someone was roaming the halls with a gun,” McNeight says.
Big data’s greatest long-term effects are likely to be in the hard sciences, where it has the capacity to change hypothesis-driven research fields into data-driven ones. During a panel discussion following the announcement of the White House big data initiative, Johns Hopkins University physics professor Alex Szalay described new computer tools that he and his colleagues are using to run models for testing the big-bang theory.
“There’s just a deluge of data,” the NSF’s Iacono says. “And rather than starting by developing your own hypothesis, now you can do the data analysis first and develop your hypotheses when you’re deeper in.”
Coupled with this shift in how some scientific research is being done is an equally consequential change in who’s doing that research, Iacono says.
“In the old days if you wanted to know what was going on in the Indian Ocean,” she says, “you had to get a boat and get a crew, figure out the right time to go and then you’d come back and analyze your data. For a lot of reasons it was easier for men to do that. But big data democratizes things. Now we’ve got sensors on the whole floor of the Indian Ocean, and you can look at that data every morning, afternoon and night.”
Big data also has democratized the economics of conducting research.
One of NIH’s flagship big data initiatives involves putting information from more than 1,000 individual human genomes inside Amazon’s Elastic Compute Cloud, which stores masses of nonsensitive government information. Amazon is storing the genomes dataset for free. The information consumes about 200 terabytes, roughly the capacity required to continuously play MP3 audio files for 380 years and far more storage than most universities or research facilities can afford. The company then charges researchers to analyze the dataset inside its cloud, based on the amount of computing required.
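The MP3 comparison is easy to sanity-check. The sketch below is an illustration only: it assumes a constant bit rate of 128 kilobits per second and decimal terabytes (10^12 bytes), neither of which the article specifies, and the function names are hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # average year, including leap days
BYTES_PER_TERABYTE = 10**12               # decimal terabyte (assumption)


def terabytes_for_playback(years, kbps=128):
    """Storage, in terabytes, needed to play audio nonstop for `years`.

    `kbps` is the assumed MP3 bit rate in kilobits per second.
    """
    bytes_per_second = kbps * 1000 / 8
    return years * SECONDS_PER_YEAR * bytes_per_second / BYTES_PER_TERABYTE


def playback_years(terabytes, kbps=128):
    """Years of nonstop audio that fit in `terabytes` of storage."""
    bytes_per_second = kbps * 1000 / 8
    return terabytes * BYTES_PER_TERABYTE / bytes_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(f"380 years of audio: about {terabytes_for_playback(380):.0f} TB")
```

At the assumed 128 kbps, 380 years of continuous audio comes to a little under 200 terabytes. A different bit rate shifts the figure proportionally.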
This storage model has opened up research to huge numbers of health and drug researchers, academics and even graduate students who could never have afforded to enter the field before, says Matt Wood, principal data scientist at Amazon Web Services. It has the potential to drastically speed up the development of treatments for diseases such as breast cancer and diabetes.
Over time, Wood says, the project also will broaden the scope of questions those researchers can afford to ask.
“If you rewind seven years, the questions that scientists could ask were constrained by the resources available to them, because they didn’t have half a million dollars to spend on a supercomputer,” he says. “Now we don’t have to worry about arbitrary constraints, so research is significantly accelerated. They don’t have to live with the repercussions of making incorrect assumptions or of running an experiment that didn’t play out.”