Searching for Order


In the ongoing battle against terrorism, information about potential attacks must be confirmed or discredited quickly and thoroughly. If pertinent information isn't found fast, or if pieces of vital data are left undisclosed, the future of the country could hang in the balance.

Suppose an intelligence analyst is asked to investigate rumors of a planned attack. Ideally, he will have access to terabytes of data in hundreds of classified and unclassified databases throughout the government. He sits down at his computer and types the words "terrorist," "al Qaeda" and "flight school" into his system, hoping to pull up valuable data on flight school registrants, terrorists affiliated with al Qaeda, known aliases of confirmed terrorists and other vital information. It's imperative that he miss nothing.

Whether he succeeds depends heavily on the search and retrieval technology his agency is using. That could be off-the-shelf information retrieval software that organizes content, performs searches and presents results. Or it could be a Web-based technology, such as Google, Yahoo or Oracle's UltraSearch, all of which categorize, index and rank results.

Each method has staunch supporters. In the first category, Verity Inc. of Sunnyvale, Calif., has customers at the State and Defense departments, among others, while competitor Convera (formerly Excalibur) of Vienna, Va., has customers at the Social Security Administration, IRS, and the Agriculture, Defense and State departments. Other competitors include Thunderstone Software LLC of Cleveland, with customers at Defense, the National Weather Service and Agriculture; and OpenText Corp. of Waterloo, Ontario, with users at the Air Force and Navy.

The Defense Technical Information Center (DTIC), which creates libraries and information retrieval systems for organizations throughout Defense, uses information retrieval software. Depending on a customer's requirements, DTIC might choose technology from Verity, Thunderstone or Convera to develop the search capability, notes Carlynn Thompson, DTIC's director of component information support.

"Each one of them does certain things more robustly," she says. "If you have a dedicated user community that needs [to do complex] searches, we might go one direction, but if users need sound bite-type retrieval, we might go in another." In the case of GulfLink, a system that covers issues related to the Gulf War, DTIC developed a search and retrieval system that incorporates all three vendors' systems. To guide users, DTIC provides a checklist of capabilities and directs users to the most appropriate system based on the answers it receives.

The approach has merit, says Rob Rowello, a manager in the Washington office of management consulting firm Pittiglio Rabin Todd & McGrath. Technology offered by companies such as Verity and Convera can predict what users need, so information can be ready in an accessible format before they ask for it, he says. And increasingly, these tools can deal with both structured and unstructured data, up to a point.

One example is Verity's Export technology, which attempts to add more structure to unstructured data through categorization and personalization. "Verity-like solutions have the ability to predict what you might need based on your prior search history. This predictive capability is useful to people who might not know exactly what they need, or who are trying to narrow in on a specific 'needle' of information within a large 'haystack' of data," Rowello says.
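Verity's actual personalization features aren't documented here, but the prediction idea Rowello describes can be illustrated with a minimal sketch: rank candidate queries by how much they overlap with terms from a user's prior searches. The SearchHistory class and its scoring rule below are hypothetical, not Verity's API.

```python
from collections import Counter

class SearchHistory:
    """Hypothetical sketch: suggest likely queries from a user's prior searches."""

    def __init__(self):
        self.term_counts = Counter()

    def record(self, query: str) -> None:
        # Track how often each term shows up in the user's past queries.
        self.term_counts.update(query.lower().split())

    def score(self, candidate: str) -> int:
        # Score a candidate query by overlap with historically frequent terms.
        return sum(self.term_counts[t] for t in candidate.lower().split())

    def suggest(self, candidates: list[str], top_n: int = 3) -> list[str]:
        return sorted(candidates, key=self.score, reverse=True)[:top_n]

history = SearchHistory()
for past in ["al Qaeda flight school", "terrorist aliases", "flight school registrants"]:
    history.record(past)

print(history.suggest(["flight school Florida", "weather reports", "al Qaeda finances"]))
```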

Search-and-retrieval engines are newer to the federal space and claim fewer customers, but their use is growing. One of the top search engine firms, Google, recently released its Google Search Client, which is used by several military and intelligence organizations. While this model sometimes returns irrelevant data and is less likely to correctly categorize information, it can draw from more sources and can index information it doesn't own or store, notes Tim Hoechst, senior vice president of technology for Oracle Government, Education and Healthcare.
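Hoechst's indexing point can be made concrete with a minimal sketch: an inverted index keeps only terms and document identifiers, so an engine can answer queries over content it never stores in full. The structures below are illustrative assumptions, not any vendor's implementation.

```python
from collections import defaultdict

def build_index(documents: dict[str, str]) -> dict[str, set[str]]:
    """Map each term to the set of document IDs (e.g., URLs) that contain it.
    Only the index is retained; the full documents can live elsewhere."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index: dict[str, set[str]], query: str) -> set[str]:
    # Return documents containing every query term (simple AND semantics).
    terms = query.lower().split()
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

docs = {
    "http://agency.example/report1": "flight school registration records",
    "http://agency.example/report2": "known aliases of confirmed terrorists",
}
index = build_index(docs)
print(search(index, "flight school"))
```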

Both information retrieval systems and search engines that categorize data in a Web-based infrastructure can be made secure, but security often is more of a concern in the first model, where content is centrally stored. In the search engine model, federal systems typically run on closed networks, addressing security concerns.

Those who see Web-based search engines as the basis of intelligent information retrieval say their popularity is destined to grow. Such systems "can organize information in a simple way that benefits users immediately," says John Piscatello, a product manager at Google, which is based in Mountain View, Calif. "Inevitably information changes, new and better sources become available, or something gets reorganized and doesn't work anymore." With this type of solution, "software algorithms can identify the best quality information to deliver the top results," he says.

Some believe the real answer lies in a combination approach, taking the best from both of the main types of information retrieval technology. "If we first use search engines to cull out what we know from what we have, we can then build more centralized, structured sources from what we know," Hoechst says. "If I were doing this, I would set up a content management system with a data librarian to store and manage my organization's core documents so everything could be neatly organized, thoroughly maintained and easily found. I would also set up a search capability of the second type to be used as a secondary search mechanism when further searching is required."
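Hoechst's combination approach might be pictured, under loose assumptions, as a two-tier lookup: query the curated, librarian-managed store first and fall back to a broader full-text pass only when the core store misses. The class and data below are hypothetical, not a description of any deployed system.

```python
class CombinedRetrieval:
    """Hypothetical two-tier lookup: curated repository first, broad index second."""

    def __init__(self, curated: dict[str, str], broad_index: dict[str, str]):
        self.curated = curated          # librarian-managed core documents
        self.broad_index = broad_index  # wider, less structured collection

    def find(self, query: str) -> list[str]:
        q = query.lower()
        # Tier 1: matches in the managed repository.
        hits = [doc_id for doc_id, text in self.curated.items() if q in text.lower()]
        if hits:
            return hits
        # Tier 2: fall back to the broader search only when the core store misses.
        return [doc_id for doc_id, text in self.broad_index.items() if q in text.lower()]

retrieval = CombinedRetrieval(
    curated={"policy-001": "arms control treaty verification procedures"},
    broad_index={"web-404": "news article on treaty verification talks"},
)
print(retrieval.find("verification"))
```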

Whatever the approach, one thing is clear: more and more government entities will take advantage of advanced information retrieval systems. In a recent study, Delphi Group of Boston forecast that the market will expand at a rate in excess of 20 percent through 2004. Driving that growth, the study says, will be the adoption of such technology by both government and leading companies.

The FBI is just beginning to create an information retrieval system, replacing its mainframe architecture with a Web-based approach. Contractor SAIC will work to develop a search and retrieval system that incorporates the reams of information that are available today only on paper. The final product, called the Virtual Case File, will allow FBI personnel to search case information, including both text and images, says Mark Tanner, acting assistant director of the FBI's information resources management division.

"If I want to know what the FBI knows about Mark Tanner, I could put in the name and get back a linked diagram that shows all of Mark Tanner's identifying data, that he was associated with this person because he worked with her and with that person because he made a phone call to him," Tanner says. "Then I could begin to mine through that data to see what the investigative activity was that drew those relationships." Tanner's team has not yet chosen a specific information retrieval technology, leaving it up to the contractor when the time comes.

ACHIEVING THE PINNACLE

The problem with today's search technologies, experts say, is that often they don't scour the entire body of knowledge available to an agency. Although a typical search may gather valuable information from myriad sources, it may not include relevant data from systems outside the agency or unstructured data in the form of white papers, news reports or e-mail messages.

The problem gets worse in the case of cross-agency searches. Developing an intelligent search and retrieval system that works across agency boundaries is fraught with difficulties. Not only must such a system handle both structured and unstructured data, it must be monitored for constant updates and deal with the varied security clearances of federal employees seeking information.
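One way to picture the cross-agency problem is a federated query that fans out to each agency's index and filters the merged results against the requester's clearance. The clearance levels, agencies and records below are simplified assumptions, not any agency's actual architecture.

```python
# Hypothetical federated search: fan out to agency indexes, filter by clearance level.
CLEARANCE_ORDER = {"unclassified": 0, "secret": 1, "top secret": 2}

agency_indexes = {
    "State": [{"title": "Treaty summary", "classification": "unclassified"}],
    "Defense": [{"title": "Threat assessment", "classification": "top secret"}],
}

def federated_search(query: str, user_clearance: str) -> list[dict]:
    allowed = CLEARANCE_ORDER[user_clearance]
    results = []
    for agency, documents in agency_indexes.items():
        for doc in documents:
            # Drop anything above the requester's clearance before merging results.
            if (CLEARANCE_ORDER[doc["classification"]] <= allowed
                    and query.lower() in doc["title"].lower()):
                results.append({"agency": agency, **doc})
    return results

print(federated_search("treaty", "secret"))
```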

The State Department has worked hard to meet such challenges. State, charged with managing a data repository on all arms control treaties signed by the United States, has spent significant time and effort determining how best to make records available to people within and outside the department who work on arms control and international relations. Eventually, department officials migrated from a home-grown search system to a Web-enabled system from Convera. "We needed something where we could incorporate new databases very easily, and something that could maintain a distributed system but still provide a single point of access," says Ned Williams, deputy director of verification operations in State's Bureau of Verification and Compliance.

Williams and his team first did a proof-of-concept test with Convera in 1999, as the agency prepared to deal with the Y2K computer problem. The group set up a task force to monitor Y2K events worldwide from the State Department's perspective. "Since we had so many different agencies involved, that meant a lot of different formats for information and a lot of printed text," Williams says. "We needed something that would allow us to cross-index across different platforms."

CHALLENGES AHEAD

Integrating information from a variety of sources is clearly the primary challenge facing the federal government in developing systems to search and retrieve information, but it's just one of many. A second daunting task is dealing with the inherent complexities of security levels and access.

One way to get a handle on the challenge of integrating information from a variety of sources is to enforce the Government Information Locator Service (GILS) standard. While the standard is required of all federal search and retrieval systems, vendors are just beginning to realize the merits of complying with it, Thompson says.

Once systems are GILS-compliant, agencies will be better able to connect with databases from other agencies. More importantly, they will be able to easily transfer data and the way it is cataloged to another search tool without much retrofitting.

"We want to move as quickly as possible toward interoperable interfaces, because when you have that, you can pull in resources from lots of different maintainers and collection owners," says Eliot Christian, an architect of GILS, which is headquartered at the U.S. Geological Survey. He cites the example of FirstGov, the governmentwide Web portal. FirstGov, he says, started out by running information retrieval technology from Inktomi Corp. If FirstGov had used the GILS standard when it dropped Inktomi and went with FastSearch, a Norwegian information retrieval technology, it would have been able to leave all of its Web pages and cataloging alone. Instead, FirstGov had to re-catalog pages and re-enter search criteria, slowing the transition process.

DTIC is one of the agencies pushing for GILS compatibility. "As long as there is a GILS standard in the search engine, external organizations can come in and grab your data," Thompson says. "From a homeland defense perspective, we want to be able to share a lot of data among emergency management organizations, and using the GILS standard would help."

GILS could help combat the lack of standardization of search and retrieval tools even within agencies. In the State Department alone, for example, some bureaus use Convera while others use Verity.

Once federal agencies have comprehensive, functional and intelligent retrieval systems in place, the next step would be to use them to extract important pieces of information from text sources and mine them for unusual confluences.

"If you see, for example, that there is a co-occurrence of two terms within a certain window of time, you start to make progress in the war between terabytes of information and thousands of people processing that information," says Prabhakar Raghavan, chief technology officer at Verity. It's a relatively new idea, and one that Verity plans to add to the next release of its search tool, due in 2003. Raghavan says there is interest in the concept from the intelligence community.

The concept of further processing retrieved content holds great promise, DTIC's Thompson says. To make it work for most agencies, information retrieval vendors must combine the best of today's analysis tools (which search for the most frequently used words and perform document comparisons) and visualization tools (which can be used to search for emerging concepts that appear across multiple documents). As these tools become incorporated into more and more systems, search and retrieval will begin living up to its potential, she says.
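The analysis tools Thompson mentions can be approximated with two familiar measures: term frequency to surface the most-used words, and cosine similarity over word counts to compare documents. A minimal sketch under those assumptions:

```python
from collections import Counter
from math import sqrt

def top_terms(text: str, n: int = 3) -> list[tuple[str, int]]:
    """Most frequently used words in a document."""
    return Counter(text.lower().split()).most_common(n)

def similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between two documents' word-count vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

doc1 = "treaty verification procedures for arms control treaty compliance"
doc2 = "arms control compliance report on verification"
print(top_terms(doc1))
print(round(similarity(doc1, doc2), 2))
```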


Karen D. Schwartz is a freelance writer specializing in technology and business issues. She has written for numerous publications, including Business 2.0, CIO, InformationWeek, Mobile Computing & Communications and Electronic Business.

