Data Warehouses Hit the Web

Life was relatively tranquil for the Environmental Protection Agency's information systems staff until the 1988 Superfund Reauthorization Act prompted a reengineering frenzy that radically altered how the agency handles data. That legislation contained a provision requiring the agency to make toxic release reports from more than 675,000 companies available to the public on a regular basis.

"Suddenly, everyone from the Sierra Club to the Boy Scouts wanted timely inventories about the amount of toxic substances released into the environment," says Pat Garvey, deputy director of EPA's information management division. "At first, we tried putting the data on CD-ROMs, but that medium wasn't conducive to monthly updates. We even photocopied thousands of pages and faxed them out to people."

At about the same time, EPA management began a reorganization campaign to make the agency less vertically focused. Instead of having divisions studying specific categories such as air, land, waste and water, EPA decided to focus on geographic regions. Researchers in the Chesapeake Bay division, for example, would study all elements affecting the environment in that area.

The only problem was that information about those elements was stored on numerous databases, each containing its own special account number, programming language, security codes and protocols. Especially troubling were the different nomenclatures used on the databases. One division, for example, would use the term "sites" while another would use "facilities" and yet another would use a numerical designation.

"We had to create a standard data structure so that someone other than [information systems] experts could access the information," says Garvey. "That's when we got the idea to build a centralized data warehouse and link it to the World Wide Web so that information could be made available to the general public and to EPA's 17,000 employees."
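The article doesn't describe Envirofacts' actual schema, but the standardization problem Garvey describes can be sketched in miniature: two divisions store the same concept under different names ("sites" versus "facilities"), and a warehouse load step copies both into one standard structure. All table and column names here are hypothetical.

```python
import sqlite3

# Two source databases with divergent nomenclature (illustrative only).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE air_sites (site_name TEXT, state TEXT)")
con.execute("CREATE TABLE waste_facilities (facility TEXT, state TEXT)")
con.execute("INSERT INTO air_sites VALUES ('Acme Plant', 'MD')")
con.execute("INSERT INTO waste_facilities VALUES ('Beta Works', 'VA')")

# One standard structure for the warehouse, with the source recorded.
con.execute(
    "CREATE TABLE facilities (facility_name TEXT, state TEXT, source TEXT)")
con.execute(
    "INSERT INTO facilities SELECT site_name, state, 'air' FROM air_sites")
con.execute(
    "INSERT INTO facilities SELECT facility, state, 'waste' FROM waste_facilities")

rows = con.execute(
    "SELECT facility_name, state FROM facilities ORDER BY facility_name"
).fetchall()
print(rows)  # [('Acme Plant', 'MD'), ('Beta Works', 'VA')]
```

Once every division's records land in the same table under the same column names, a researcher studying one geographic region can query a single structure instead of six.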

Using a relational database management system from Oracle Corp., Garvey and his team of seven computer specialists replicated information from six EPA databases onto a Web site called Envirofacts (www.epa.gov/enviro/). Reports on everything from air pollution levels and water-discharge permit compliance to hazardous waste site assessments can be retrieved quickly and easily. Instead of memorizing complex database languages, users can send queries by completing online forms requesting facility names and addresses.
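Envirofacts' internals aren't described here, but the form-to-query idea is straightforward to sketch: the user fills in whichever fields they know, and the site translates those fields into a parameterized SQL query on their behalf. The schema and field names below are invented for illustration.

```python
import sqlite3

# A toy facilities table standing in for the warehouse (data invented).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE facilities (name TEXT, city TEXT, state TEXT)")
con.executemany("INSERT INTO facilities VALUES (?, ?, ?)", [
    ("Acme Plant", "Baltimore", "MD"),
    ("Beta Works", "Richmond", "VA"),
])

def facility_query(form):
    """Build a query from whichever form fields the user filled in."""
    clauses, params = [], []
    for field in ("name", "city", "state"):
        if form.get(field):
            clauses.append(f"{field} LIKE ?")
            params.append(f"%{form[field]}%")
    sql = "SELECT name, city, state FROM facilities"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return con.execute(sql, params).fetchall()

# A citizen who only knows the state still gets an answer.
print(facility_query({"state": "MD"}))  # [('Acme Plant', 'Baltimore', 'MD')]
```

Parameterized placeholders (the `?` marks) keep user input out of the SQL text itself, which matters once a database is exposed to the public Web.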

The Envirofacts site is averaging about 125,000 hits (the number of times information elements within the site are accessed) per month. Internally, EPA employees use the site to closely track hazardous waste and a variety of air, water and ground pollutants. External users rely on the site to keep up to date on regulatory data.

"Envirofacts has empowered environmental groups to get the information they need to educate themselves and others in their communities," says Garvey. "It also has been an excellent way for EPA to fulfill its right-to-know requirements. Since building the data warehouse and linking it to the Web, most of our Freedom of Information Act requests have gone away."

Data warehousing is revolutionizing the way federal agencies access and use information. Government organizations are consolidating disparate databases running on incompatible computer systems and forming centralized data repositories that enable quick information retrieval. Specialized tools are being used to manipulate data in those repositories to reveal patterns that can help executives make decisions and ultimately boost efficiency and cut costs.

Best of all, data warehouses make a variety of information readily available to a variety of people. Budget analysts at Los Alamos National Laboratory in New Mexico, for example, used to wait weeks for financial reports to be generated from three separate databases. Information retrieval via mainframes was so complex that only about 150 analysts had access to the data. But with the lab's new data warehouse, reports can be generated in about 20 minutes and made available to more than 1,000 people.

"In the past, project and line managers were always a little bit in the dark," says Charles Lee, database administrator at the computing information and communications division at Los Alamos. "Financial reports, if they could be seen at all, were always at least a month old. Now the information is always up to date and can be manipulated and customized in a way the old reports couldn't."

Agencies are using data warehousing to do everything from detecting fraud and abuse (Agriculture, Health Care Financing Administration and IRS) to analyzing statistics (Census Bureau, Education and Labor) and monitoring public safety (Federal Aviation Administration and Justice). The CIA has a data warehouse to interpret and compare intercepted communications, while the Social Security Administration employs the technology for multidimensional analysis. And the Army and Air Force Exchange Service's data warehouse of point-of-sale information recently uncovered inaccuracies in product bar codes.

The Data Warehousing Institute, a research and education group in Bethesda, Md., predicts the federal data warehousing market will be valued at about $1 billion by 2000. Roughly a quarter of all federal organizations own or are building data warehouses, according to the institute.

"Many organizations, including government agencies, have been technologically condemned to be informational wastelands, structurally incapable of providing useful business intelligence to management," says Ramon Barquin, founder of the Data Warehousing Institute. "But data warehouses offer a way out of that situation because they can deliver integrated information about organizations. They are knowledge delivery machines."

More Sophisticated Warehouses

Despite huge success stories at companies such as MCI, United Parcel Service and Wal-Mart, data warehousing was initially slow to catch on in the private sector. Part of the reason was cost. With price tags running from $100,000 up to $5 million, data warehouses were not exactly impulse buys for agencies. The 1990 Chief Financial Officers Act, however, put pressure on agencies to do a better job analyzing data. Soon thereafter, warehouse pilots began to spring up all over government.

First generation data warehouses were generally experimental projects built for internal use with products obtained from existing contract vehicles. These early efforts involved specialized query "engines" bolted onto existing databases. Some integrated information from various platforms, but few did heavy-duty analysis of relationships between data sets.

Second generation warehouses, by contrast, are generally larger and enable complex queries and forecasting. Some of the more sophisticated warehouses, such as those at the Defense Department and NASA, hold as much as a trillion bytes of data, the equivalent of more than a billion business letters.

"The [1996 Information Technology Management Reform Act] has helped to further drive the federal data warehousing market because every agency now must be able to measure performance," says Barquin. "And to do that you have to be able to integrate information sources."

Some of the new data warehouses, like the one at EPA, are linked to the Web so non-sensitive information can be made available to Net surfers around the world, 24 hours a day, seven days a week. Makers of database programs have added Web gateways to their platforms so users can mine data warehouses with Web browsers. And many decision-support software products now offer Web protocols. Internet firewalls and encryption devices are being used to safeguard data warehouses from cyberbandits.

The Housing and Urban Development Department's Web warehouse integrates community planning and housing project information from various sources and arranges it on maps so that it can be easily understood. The data is available (www.hud.gov) to city planners, community residents, architects, builders, schools and libraries.

The Transportation Department uses its Web warehouse (www.bts.gov) to share information with other government agencies.

Tools of the Trade

The key to a successful data warehouse, on or off the Web, is the proper integration of products. The most critical components are servers, from companies such as Compaq, Digital Equipment Corp., Hewlett-Packard, NCR and Sun Microsystems, and relational databases from suppliers such as Informix, Oracle, Sybase and Microsoft. Middleware, a type of software from firms such as Information Builders and Trinzic, aids warehouse management by pulling information from database engines and delivering it to client computers.

Middleware links query tools to multiple databases working on different platforms. Informix's new product, DataBlades, lets users analyze relationships between data from disparate sources without having to use special applications.

Extraction tools from companies such as Carlton and IBM work to "scrub" data so that it is consistent. Reporting tools enable users to pull the data and drop it into specific formats such as charts. Data mining products from firms such as Cross/Z International and Information Discovery use algorithms to refine raw data. These products use inductive, as opposed to deductive, reasoning to identify patterns and predict trends.
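The vendors' actual scrubbing tools aren't detailed here, but the idea behind the "scrub" step can be sketched: records pulled from different systems spell the same facility inconsistently, and a normalization pass makes them comparable before they enter the warehouse. The rules and data below are illustrative assumptions, not any product's algorithm.

```python
def scrub(record):
    """Normalize spacing, capitalization and state codes (toy rules)."""
    name = " ".join(record["facility"].split()).title()
    state = record["state"].strip().upper()
    return {"facility": name, "state": state}

# Two records from different source systems describing the same plant.
raw = [
    {"facility": "ACME  PLANT", "state": "md"},
    {"facility": "acme plant",  "state": "MD "},
]
cleaned = [scrub(r) for r in raw]

# After scrubbing, the duplicate is detectable by simple comparison.
assert cleaned[0] == cleaned[1]
print(cleaned[0])  # {'facility': 'Acme Plant', 'state': 'MD'}
```

Real extraction tools apply far richer rules (address standardization, code-table lookups), but the principle is the same: inconsistent inputs, one consistent output.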

Online analytical processing products from suppliers such as MicroStrategy, Oracle and Pilot Software take data mining to its highest level. These tools, which do multidimensional analysis in separate relational databases, can answer complex queries and discover otherwise elusive data associations.
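A minimal sketch of what "multidimensional analysis" means in practice: the same release figures are totaled along every combination of two dimensions (region and pollutant), so an analyst can slice the data either way without re-querying the sources. The figures are invented, and real OLAP engines precompute such cubes far more efficiently.

```python
from collections import defaultdict

# Invented fact records: (region, pollutant, tons released).
releases = [
    ("Northeast", "benzene", 10),
    ("Northeast", "lead",     5),
    ("Southeast", "benzene",  7),
]

# Build a tiny "cube": totals for every combination of dimension values,
# with "*" standing for "all values of this dimension".
totals = defaultdict(float)
for region, pollutant, tons in releases:
    for key in [(region, pollutant), (region, "*"), ("*", pollutant), ("*", "*")]:
        totals[key] += tons

print(totals[("Northeast", "*")])  # 15.0  (all pollutants, one region)
print(totals[("*", "benzene")])    # 17.0  (one pollutant, all regions)
```

Answering "total benzene across all regions" becomes a dictionary lookup rather than a fresh scan, which is the essence of how OLAP tools make complex queries fast.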

Visualization products, such as the latest tool kit from Advanced Visual Systems, help users create three-dimensional graphical images of data extracted from warehouses. Those images and other data can be stored on disk arrays and other media from companies such as Data General and EMC Corp.

Some firms now offer packages that combine a variety of products for data extraction and mapping. SAS Institute, for instance, has a tool kit that helps users both build and maintain warehouses. And those interested in smaller warehouses, known as data marts, can use tools from companies such as IBM and Microsoft to create simple applications quickly and inexpensively.

Lessons Learned

Data warehousing projects of any size can run over budget and fail to meet objectives unless they are carefully planned. Agencies must outline clear objectives and anticipate rapid expansion by reserving adequate storage capacity and skilled personnel. Experts recommend starting with small, manageable applications.

"Do a pilot first and remember to target your audience," says Charles Lee at Los Alamos. "Initially, don't put anything critical on the warehouse and don't turn it loose to everybody."

Security is another consideration, as is maintenance. As executives dream up new applications, warehouse staffers must maintain the ability to integrate data from multiple sources. One of the toughest problems is the conversion of metadata (detailed information about the data itself) because companies use different standards. But a newly formed consortium of vendors, known as the Metadata Coalition, is hoping to solve problems associated with the integration of software tools.

Possibly the most daunting obstacles to data warehousing are the cultural problems that result during the building phase. Some employees may put up resistance to reengineering required before warehouse construction. Others may raise questions about which departments own data.

"At the outset, it's critical to put together an interdisciplinary team involving everyone from the systems folks to the program managers," says EPA's Garvey. "Clear definitions must be established about who owns the data and controls the warehouse. The turf wars soon disappear once people see the warehouse in action-then they all fight to get their databases on it."
