Welcome to the Data-Driven World

University of Wisconsin geologist Shanan Peters was frustrated by how much he didn’t know.

Most geological discoveries were locked away in troves of research journals so voluminous that he and his colleagues could read only a fraction of them. The sheer magnitude of existing research forced most geologists to limit the scope of their work so they could reasonably grasp what had already been done in the field. Research that received little notice when it was published was too often consigned to oblivion, wasting away in dusty journals, even if it could benefit contemporary scientists.

A decade ago, Peters would have had to accept his field’s human limitations. That’s no longer the case. In the summer of 2012, he teamed up with two University of Wisconsin computer scientists on a project they call GeoDeepDive. 

The computer system built by professors Miron Livny and Christopher Re will pore over scanned pages from pre-Internet science journals, generations of websites, archived spreadsheets and video clips to create a database comprising, as nearly as possible, the entire universe of trusted geological data. Ultimately, the system will use contextual clues and technology similar to IBM’s Watson to turn those massive piles of unstructured and often forgotten information—what Livny and Re call “dark data”—into a database that Peters and his colleagues could query with questions such as: How porous is Earth’s crust? How much carbon does it contain? How has that changed over the millennia?
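
GeoDeepDive's actual pipeline is far more elaborate, but the core move, turning sentences into facts a database can answer questions about, can be shown with a toy sketch. The Python snippet below pulls invented porosity figures out of invented sentences with a regular expression and loads them into SQLite; none of the text, patterns or column names comes from the Wisconsin system.

    import re
    import sqlite3

    # Toy stand-ins for journal text; the real system ingests scanned pages.
    sentences = [
        "Sandstone samples from the basin showed porosity of 18 percent.",
        "The shale unit exhibited porosity of 6 percent at depth.",
        "Field notes mention heavy rainfall during the 1987 expedition.",
    ]

    # Contextual pattern: a rock type mentioned near a porosity figure.
    pattern = re.compile(r"(\w+)\s+(?:samples|unit).*?porosity of (\d+) percent",
                         re.IGNORECASE)

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE porosity (rock TEXT, percent REAL)")

    for sentence in sentences:
        match = pattern.search(sentence)
        if match:  # sentences with no extractable fact are simply skipped
            conn.execute("INSERT INTO porosity VALUES (?, ?)", match.groups())

    # Once the facts are structured, a question becomes a one-line query.
    average = conn.execute("SELECT AVG(percent) FROM porosity").fetchone()[0]
    print(f"Average extracted porosity: {average:.1f} percent")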

The benefits of GeoDeepDive will be twofold, Peters says. First, it will give researchers a larger collection of data than ever before with which to attack problems in the geosciences. Second, it will allow scientists to broaden their research because they will be able to pose questions to the system that they lack the expertise to answer on their own. 

“Some problems were kind of off limits,” Peters says. “You couldn’t really think about reasonably addressing them in a meaningful way in one lifetime. These new tools have that promise—to change the types of questions we’re able to ask and the nature of answers we get.”

Order From Chaos

GeoDeepDive is one of dozens of projects that received funding from a $200 million White House initiative launched in March 2012 to help government agencies, businesses and researchers make better use of what’s called “big data.”

Here’s what that means: Data exist all over the world, in proliferating amounts. Satellites beam back images covering every square mile of Earth multiple times each day; publishers crank out book after book; and 4.5 million new URLs appear on the Web each month. Electronic sensors record vehicle speeds on the Interstate Highway System, weather conditions in New York’s Central Park and water activity at the bottom of the Indian Ocean. Until recently, scientists, sociologists, journalists and marketers had no way to make sense of all this data. They were like U.S. intelligence agencies before the Sept. 11 terrorist attacks. All the information was there, but no one was able to put it together. 

Three things have brought order to that cacophony in recent years. The first is the growth of massive computer clouds that virtually bring together tens or hundreds of thousands of servers and trillions of bytes of processing capacity. The second is a new brand of software that can link hundreds of those computers together so they effectively act like one massive computer with a nearly unlimited hunger for raw data to crunch. 

The third element is a vastly improved capacity to sort through unstructured data: information from videos, books, environmental sensors and basically anything else that can’t be neatly organized into a spreadsheet. With that capacity, computers can act more like humans, pulling meaning from complex information such as Peters’ geosciences journals without, on the surface at least, reducing it to a series of simple binary questions. 
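
The second of those ingredients, software that makes many machines behave like one, follows a split-the-work-then-merge-the-results pattern that can be sketched on a single computer, with Python's standard multiprocessing module standing in for a cluster of hundreds of servers. The documents and word-count task below are invented purely for illustration.

    from collections import Counter
    from multiprocessing import Pool

    # Stand-ins for the raw data a cluster would divide among its servers.
    documents = [
        "satellite images of the earth",
        "sensor readings from the indian ocean",
        "images and readings from the earth",
    ] * 1000

    def count_words(doc):
        """The 'map' step: each worker summarizes its own slice of the data."""
        return Counter(doc.split())

    if __name__ == "__main__":
        # Four local processes play the role of hundreds of networked machines.
        with Pool(processes=4) as pool:
            partial_counts = pool.map(count_words, documents)

        # The 'reduce' step: merge the partial answers into one overall count.
        totals = sum(partial_counts, Counter())
        print(totals.most_common(3))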

“For a number of years we’ve worked really hard at transforming the information we were collecting into something that computers could understand,” says Sky Bristol, chief of Science Information Services at the U.S. Geological Survey. “We created all these convoluted data structures that sort of made sense to humans but made more sense to computers. What’s happened over the last number of years is that we not only have more powerful computers and better software and algorithms but we’re also able to create data structures that are much more human understandable, that are much more natural to our way of looking at the world.  

“The next revolution that’s starting to come,” he says, “is instead of spending a lot of energy turning data into something computers can understand, we can train computers to understand the data and information we humans understand.”

Big Promises 

Big data has hit the digital world in a big way. The claims for its power can seem hyperbolic. A recent advertisement for a launch event for the book Big Data: A Revolution That Will Transform How We Live, Work, and Think (Eamon Dolan/Houghton Mifflin Harcourt, 2013) promised the authors would explain why the “revolution” wrought by big data is “on par with the Internet (or perhaps even the printing press).” 

Big data’s promise to transform society is real, though. To see its effect one need not look to Gutenberg but to Zuckerberg, Page and Brin. Each day Facebook and Google chew through millions of pages of unstructured text embedded in searches, emails and Facebook feeds to deliver targeted ads that have changed how sellers reach consumers online. 

Retailers are mining satellite data to determine what sort of customers are parking in their competitors’ parking lots, when they’re arriving and how long they’re staying. An official with Cisco’s consulting arm recently suggested big box retailers could crunch through security camera recordings of customers’ walking pace, facial expressions and eye movements to determine the optimal placement of impulse purchases or what store temperature is most conducive to selling men’s shoes. 

Big data is making an appearance in international aid projects, in historical research and even in literary analysis. 

Re, the University of Wisconsin computer scientist, recently teamed with English professor Robin Valenza to build a system similar to GeoDeepDive that crawls through 140,000 books published in the United Kingdom during the 18th century. Valenza is using the tool to investigate how concepts such as romantic love entered the English canon. Ben Schmidt, a Princeton University graduate student in history, has used a similar database built on the Google Books collection to spot linguistic anachronisms in the period TV shows Downton Abbey and Mad Men. His assessment: The Sterling Cooper advertising execs of Mad Men may look dapper in their period suits but they talk about “keeping a low profile” and “focus grouping”—concepts that didn’t enter the language until much later.
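
Schmidt’s work rests on the Google Books corpus, but the underlying check is straightforward: count how often a phrase appears in texts from each decade and see whether it existed at all in the era a show depicts. The sketch below runs that check over a tiny invented corpus; the dates and snippets are placeholders, not Google Books data.

    from collections import defaultdict

    # (year, text) pairs standing in for a corpus of dated publications.
    corpus = [
        (1961, "the firm hoped to keep the account quiet"),
        (1968, "executives met the client for lunch downtown"),
        (1987, "analysts advised keeping a low profile after the scandal"),
        (1995, "the agency ran a focus group before the launch"),
    ]

    def decade_counts(phrase, corpus):
        """Count occurrences of a phrase in each decade of publication."""
        counts = defaultdict(int)
        for year, text in corpus:
            counts[year // 10 * 10] += text.count(phrase)
        return dict(counts)

    # A phrase with no 1960s occurrences would be an anachronism on a 1960s show.
    print(decade_counts("low profile", corpus))   # {1960: 0, 1980: 1, 1990: 0}
    print(decade_counts("focus group", corpus))   # {1960: 0, 1980: 0, 1990: 1}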

The ‘Holy Grail’

The White House’s big data investment was spawned by a 2011 report from the President’s Council of Advisors on Science and Technology, a group of academics and representatives of corporations including Google and Microsoft. The report found private sector and academic researchers were increasingly relying on big data but weren’t doing the sort of basic research and development that could help the field realize its full potential. 

The council wasn’t alone. McKinsey Global Institute, the research arm of the consulting firm McKinsey & Company, predicted in May 2011 that by 2018 the United States will face a 50 percent to 60 percent gap between demand for big data analysis and the supply of people capable of performing it. The research firm Gartner predicted in December 2011 that 85 percent of Fortune 500 firms will be unprepared to leverage big data for a competitive advantage by 2015. 

The White House investment was funneled through the National Science Foundation, the National Institutes of Health, and the Defense and Energy departments, among other agencies. The grants are aimed partly at developing tools for unstructured data analysis in the private, academic and nonprofit worlds but also at improving the way data is gathered, stored and shared in government, says Suzi Iacono, deputy assistant director of the NSF’s Directorate for Computer and Information Science and Engineering. 

As an example, Iacono cites the field of emergency management. New data storage and analysis tools are improving the abilities of the National Weather Service, FEMA and other agencies to predict when and how major storms such as Hurricane Sandy are likely to hit the United States. New Web and mobile data tools are making it easier for agencies to share that information during a crisis.

“If we could bring together heterogeneous data about weather models from the past, current weather predictions, data about where people are on the ground, where responders are located— if we could bring all this disparate data together and analyze them to make predictions about evacuation routes, we could actually get people out of harm’s way,” she says. “We could save lives. That’s the Holy Grail.”

One of the largest impacts big data is likely to have on government programs in the near term is by cutting down on waste and fraud, according to a report from the industry group TechAmerica released in May 2012. 

The Centers for Medicare and Medicaid Services launched a system in 2011 that crunches through the more than 4 million claims it pays daily to determine the patterns most typical of fraud and possibly deny claims matching those patterns before they’re paid out. The government must pay all Medicare claims within 30 days. Because it lacks the resources to investigate all claims within that window, CMS typically has paid claims first and investigated later, an inefficient practice known as “pay and chase.” 
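
CMS hasn’t published the system’s internals, but the approach described, scoring each incoming claim against patterns associated with past fraud before the money goes out, can be sketched in a few lines. The claim fields, rules and threshold below are hypothetical, chosen only to show the shape of the technique.

    # Hypothetical incoming claims; the real system sees millions each day.
    claims = [
        {"id": 1, "provider": "A", "amount": 240.0, "procedures": 2},
        {"id": 2, "provider": "B", "amount": 9800.0, "procedures": 41},
        {"id": 3, "provider": "A", "amount": 310.0, "procedures": 3},
    ]

    def fraud_score(claim):
        """Invented rules standing in for patterns learned from past fraud."""
        score = 0
        if claim["amount"] > 5000:       # unusually large billing
            score += 1
        if claim["procedures"] > 20:     # implausible number of procedures
            score += 1
        return score

    REVIEW_THRESHOLD = 2

    for claim in claims:
        if fraud_score(claim) >= REVIEW_THRESHOLD:
            print(f"Claim {claim['id']}: hold for review before payment")
        else:
            print(f"Claim {claim['id']}: pay within the 30-day window")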

The board that tracks spending on President Obama’s 2009 stimulus package used a similar system to weed out nefarious contractors. 

Big data is having an impact across government, though, in areas far afield from fraud detection. The data analysis company Modus Operandi received a $1 million Army contract in late 2012 to build a system called Clear Heart, which would dig through hundreds of hours of video—including footage from heavily populated areas—and pick out body movements that suggest what officials call “adversarial intent.” That could mean the posture or hand gestures associated with drawing a gun or planting a roadside bomb or the gait of someone wearing a suicide bombing vest. 

The contract covers only the development of the system, not its implementation. But Clear Heart holds clear promise for drone surveillance, Modus Operandi President Richard McNeight says. It could be used to alert analysts to possible dangers or to automatically shed video that doesn’t show adversarial intent, so analysts can better focus their efforts. 

The technology also could have domestic applications, McNeight says. 

He cites the situation in Newtown, Conn., where a gunman killed 20 elementary school students and six adults. “If you’d had a video camera connected with this system it could have given an early warning that someone was roaming the halls with a gun,” McNeight says. 

Big data’s greatest long-term effects are likely to be in the hard sciences, where it has the capacity to change hypothesis-driven research fields into data-driven ones. During a panel discussion following the announcement of the White House big data initiative, Johns Hopkins University physics professor Alex Szalay described new computer tools that he and his colleagues are using to run models for testing the big-bang theory.

“There’s just a deluge of data,” the NSF’s Iacono says. “And rather than starting by developing your own hypothesis, now you can do the data analysis first and develop your hypotheses when you’re deeper in.”

Coupled with this shift in how some scientific research is being done is an equally consequential change in who’s doing that research, Iacono says.

“In the old days if you wanted to know what was going on in the Indian Ocean,” she says, “you had to get a boat and get a crew, figure out the right time to go and then you’d come back and analyze your data. For a lot of reasons it was easier for men to do that. But big data democratizes things. Now we’ve got sensors on the whole floor of the Indian Ocean, and you can look at that data every morning, afternoon and night.”

Big data also has democratized the economics of conducting research. 

One of NIH’s flagship big data initiatives involves putting information from more than 1,000 individual human genomes inside Amazon’s Elastic Compute Cloud, which stores masses of nonsensitive government information. Amazon is storing the genomes dataset for free. The information consumes about 200 terabytes—that’s roughly the capacity required to continuously play MP3 audio files for 380 years—far more storage than most universities or research facilities can afford. The company then charges researchers to analyze the dataset inside its cloud, based on the amount of computing required. 
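
For readers curious what working with data in Amazon’s cloud looks like, the snippet below uses the boto3 library to list a few files from the project’s public storage bucket. The bucket name comes from AWS’s public-dataset listings and the anonymous-access setup is a common pattern; treat both as a plausible starting point rather than a documented piece of the NIH project.

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Anonymous client: public datasets don't require AWS credentials.
    s3 = boto3.client(
        "s3",
        region_name="us-east-1",  # region where the public bucket is hosted
        config=Config(signature_version=UNSIGNED),
    )

    # "1000genomes" is the bucket listed for the project's public data.
    response = s3.list_objects_v2(Bucket="1000genomes", MaxKeys=5)

    for obj in response.get("Contents", []):
        # Each entry is genome data a researcher can analyze in place.
        print(obj["Key"], f"{obj['Size'] / 1e6:.1f} MB")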

This storage model has opened up research to huge numbers of health and drug researchers, academics and even graduate students who could never have afforded to enter the field before, says Matt Wood, principal data scientist at Amazon Web Services. It has the potential to drastically speed up the development of treatments for diseases such as breast cancer and diabetes.

Over time, Wood says, the project also will broaden the scope of questions those researchers can afford to ask.

“If you rewind seven years, the questions that scientists could ask were constrained by the resources available to them, because they didn’t have half a million dollars to spend on a supercomputer,” he says. “Now we don’t have to worry about arbitrary constraints, so research is significantly accelerated. They don’t have to live with the repercussions of making incorrect assumptions or of running an experiment that didn’t play out.”
