Promising Practices
A forum for government's best ideas and most innovative leaders.

Got Big Data? First Define Your Big Question


Big Data is big. Really BIG. Indeed, the definition from the McKinsey Global Institute, which coined the phrase “big data,” is “data sets whose size is beyond the ability of typical database software tools to capture, manage and analyze.” Big Data is so big that your organization (almost by definition) cannot cope with it.

If, however, your organization does have big software, it might be able to mine some big data for some analytical nuggets. Such data mining, to again quote from McKinsey, is “a set of techniques to extract patterns from large data sets by combining methods from statistics and machine learning with database management.”

But what kind of patterns might your organization seek to extract? If you are looking for crime patterns in your city, you don’t start with sophisticated software. For policing, as CompStat illustrated, an excellent first-order analytical tool is dots on a map. When the data are presented this way, you don’t need a degree in statistics to interpret them.
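The dots-on-a-map idea can be sketched in a few lines of code. This is a hypothetical illustration, not any real CompStat system: the incident coordinates are invented, and the "map" is just a text grid. Yet the cluster jumps out with no statistics at all.

```python
# Hypothetical incident coordinates (invented for illustration),
# plotted as dots on a text grid -- a first-order analytical tool.
incidents = [(2, 3), (2, 4), (3, 3), (3, 4), (3, 5), (8, 1)]

WIDTH, HEIGHT = 10, 7
grid = [["." for _ in range(WIDTH)] for _ in range(HEIGHT)]
for x, y in incidents:
    grid[y][x] = "*"          # one dot per reported incident

for row in grid:
    print("".join(row))       # the cluster around x=2-3, y=3-5 is obvious
```

No degree in statistics required: anyone looking at the printed grid can see where the dots pile up.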

About a decade ago, I was at a party with a bunch of young quants. They were getting (or had already gotten) their Ph.D.s from MIT or Harvard in some quantitative discipline. One of these Ph.D.s had deserted his intellectual field to work for a supermarket chain. He was charged with mining all of the chain’s data on sales and product placement to determine where in its stores to display which products. For example, which ones should be given those priority spaces at the end of which aisle? To answer this question, the chain had lots of data and lots of computers.

I confess that I thought this analytical task had a very low meaning quotient. I long ago figured out that every grocery store puts the milk at the very back. Everyone needs milk. Indeed, some people come into the store for the single purpose of buying milk. And, if in doing so, they walk past cookies or soup they might make an impulse purchase.

But notice: For this chain’s effort to mine its big data, it had already defined its big question.

But how do we go mining for something that we don’t know is there? For something that we may not know exists? Before people go data mining, they have to do serious data thinking.

During World War II, the Allies were analyzing the bullet-hole data from bombers returning from missions over continental Europe. The analysts were not, however, randomly mining the data. They were trying to answer a specific question: How could they improve these planes’ survivability? What parts of the aircraft should they reinforce with armor?

All of the analysts observed where the planes had been hit: primarily on the wings and the tail. So they recommended reinforcing these sections. Like Sherlock Holmes’ Watson, they could see, but they did not observe.

One statistician, however, dissented. Abraham Wald observed that the data came only from the planes that returned. These were not, however, the only planes that took off. Some had failed to return. Why?

Wald was the Sherlock Holmes of this analytical team. He noted that the returning planes did not have many bullet holes in the engines or core fuselage. Assuming that the Axis artillery wasn’t very accurate—that their hits on Allied airplanes were essentially random—Wald reasoned that the planes that failed to return were the ones that had been hit in the fuselage and engines.

Yes. Wald was “mining” the data. But to do that intelligently, he first had to think. And once he had done his thinking, he didn’t need a big computer to mine big data. For the important data were not the locations of the holes that were captured in some big data set. The key data were the holes “that didn’t bark”: the ones missing from the planes that made it home.
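Wald’s reasoning can be checked with a tiny simulation. Everything here is assumed for illustration: the four aircraft sections, the uniform (inaccurate) flak, and the made-up probabilities that a plane hit in each section returns. The point is the bias, not the numbers.

```python
import random

random.seed(42)

# Assumed sections and assumed return probabilities: a plane hit in
# the engine rarely makes it home; a plane hit in the wings usually does.
SECTIONS = ["wings", "tail", "fuselage", "engine"]
RETURN_PROB = {"wings": 0.95, "tail": 0.95, "fuselage": 0.6, "engine": 0.2}

actual = {s: 0 for s in SECTIONS}     # holes across ALL planes that flew
observed = {s: 0 for s in SECTIONS}   # holes counted on returning planes only

for _ in range(10_000):
    hit = random.choice(SECTIONS)     # inaccurate flak: hits are random
    actual[hit] += 1
    if random.random() < RETURN_PROB[hit]:
        observed[hit] += 1            # analysts inspect only the returners

print(actual)    # roughly equal counts across sections
print(observed)  # engine holes are badly under-represented
```

The true hits are spread evenly, but the inspected sample shows few engine holes. An analyst who armors where the observed holes are would armor exactly the wrong places; Wald’s insight was to armor where the missing holes were.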

As is almost always the case: Data thinking is much more important than data mining. And such thinking always starts with purpose: What are we trying to accomplish? Sell cookies and soup? Save planes and pilots?

Often, data thinking starts with small data. What patterns do we observe in a few data points? What patterns might we observe if we add more data? What did we learn from the few data points? What might we learn if we looked at different data?

What is a big number? A small number? Some short division with a few data points may be revealing. Simple, yet analytical, data thinking can reveal the size of the problem. Or the nature of the problem. Simple, yet analytical, thinking can suggest in what mine to look for what ore.

The supermarket chains are lucky. They know precisely what they want to accomplish. They have been pursuing this objective for a long time. They have accumulated lots of data. And they have people who have been thinking about these data. Thus, they know what questions their mining of their big data might answer.

Before you go mining big data, you have to think analytically with some small data. It’s data thinking that can prove to be really big.

Robert D. Behn, a lecturer at Harvard University’s John F. Kennedy School of Government, chairs the executive education program Driving Government Performance: Leadership Strategies That Produce Results. His book The PerformanceStat Potential will be published by Brookings in 2014. (Copyright 2014 Robert D. Behn)
