Keyword searches are no big deal on small text files, but plowing through millions of electronic documents full of typographical errors is another matter. Finding and retrieving particular information is difficult, if not impossible, without special software that scans database documents for key words or phrases. Such programs cost anywhere from $100 to $100,000, depending on their level of sophistication.
Search and retrieval software can index both structured and unstructured data. Structured data, such as Census lists, is organized into predetermined fields containing names, addresses, Social Security numbers and other information. A search can then find, for instance, all the people named Smith living within a particular ZIP code.
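To make the idea of a fielded search concrete, here is a minimal Python sketch of the Smith-by-ZIP-code lookup. The record layout, the sample entries and the function name are all invented for illustration; they do not come from any real Census file or commercial package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CensusRecord:
    # Hypothetical fields; a real list would carry many more.
    surname: str
    given_name: str
    address: str
    zip_code: str


# Made-up sample data, for illustration only.
SAMPLE_RECORDS = [
    CensusRecord("Smith", "Ann", "12 Oak St.", "22030"),
    CensusRecord("Smith", "Raj", "9 Elm Ave.", "22031"),
    CensusRecord("Jones", "Lee", "4 Pine Rd.", "22030"),
    CensusRecord("smith", "Eva", "77 Main St.", "22030"),
]


def find_by_surname_and_zip(records, surname, zip_code):
    """Return records whose surname (ignoring case) and ZIP code both match."""
    wanted = surname.strip().lower()
    return [
        r for r in records
        if r.surname.strip().lower() == wanted and r.zip_code == zip_code
    ]


if __name__ == "__main__":
    for person in find_by_surname_and_zip(SAMPLE_RECORDS, "Smith", "22030"):
        print(person.given_name, person.surname, person.address)
```

Because every record keeps its surname and ZIP code in known fields, the search comes down to a simple comparison on two of them.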
Unstructured data, such as maps or regulatory documents, is considerably more difficult to catalog. Relational database management systems from companies such as Informix and Oracle can be used to break down data into various tables that are cross-indexed. Matrixes developed from the Navy's aircraft maintenance records, for instance, highlight information such as plane identification numbers, engine overhaul dates and flying weather. The tables can be linked to answer questions such as "What is the average fuel consumption for helicopter landings in windy versus calm weather?"
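The linking of cross-indexed tables can be sketched with Python's built-in sqlite3 module standing in for a commercial relational database. The table names, column names and fuel figures below are assumptions made up for the example; they are not drawn from actual Navy maintenance records.

```python
import sqlite3


def build_sample_db():
    """Create an in-memory database with two linked, hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE aircraft (
            plane_id TEXT PRIMARY KEY,
            kind     TEXT NOT NULL
        );
        CREATE TABLE landings (
            landing_id   INTEGER PRIMARY KEY,
            plane_id     TEXT NOT NULL REFERENCES aircraft(plane_id),
            weather      TEXT NOT NULL,
            fuel_gallons REAL NOT NULL
        );
        """
    )
    # Invented sample rows, for illustration only.
    conn.executemany(
        "INSERT INTO aircraft VALUES (?, ?)",
        [("H-1", "helicopter"), ("H-2", "helicopter"), ("J-1", "jet")],
    )
    conn.executemany(
        "INSERT INTO landings (plane_id, weather, fuel_gallons) VALUES (?, ?, ?)",
        [
            ("H-1", "windy", 120.0),
            ("H-2", "windy", 130.0),
            ("H-1", "calm", 100.0),
            ("H-2", "calm", 110.0),
            ("J-1", "windy", 500.0),
        ],
    )
    conn.commit()
    return conn


def average_helicopter_fuel_by_weather(conn):
    """Join the two tables and average fuel use per weather condition."""
    rows = conn.execute(
        """
        SELECT l.weather, AVG(l.fuel_gallons)
        FROM landings AS l
        JOIN aircraft AS a ON a.plane_id = l.plane_id
        WHERE a.kind = 'helicopter'
        GROUP BY l.weather
        """
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    print(average_helicopter_fuel_by_weather(build_sample_db()))
```

The join on plane_id is what ties a landing's weather and fuel figures to the type of aircraft, so the helicopter question can be answered without storing aircraft details on every landing row.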
High-end search and retrieval packages from companies such as Excalibur Technologies Corp. and Future Tech Systems employ a type of artificial intelligence known as fuzzy logic, which enables users to search for words even if they are misspelled. Most large-volume imaging applications use optical character recognition devices that convert documents into digital formats. But OCR scanning has about a 5 percent error rate, resulting in lots of misread characters in big jobs.
Fuzzy logic programs use techniques such as adaptive pattern recognition processing to search for patterns in digital data instead of searching for specific words. Thus, if similar characters (i's and l's, for instance) are misread during OCR scanning, the software can deduce that "lmuemtery" is really "inventory" and retrieve the relevant data.
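The general idea of tolerating misread characters can be illustrated with a much simpler stand-in: plain string similarity from Python's standard difflib module. This is not adaptive pattern recognition processing, and it is not how any vendor's product works; the vocabulary, the function name and the 0.6 cutoff are assumptions chosen for the sketch.

```python
import difflib

# Hypothetical index vocabulary; a real system would build this from its documents.
VOCABULARY = ["inventory", "invoice", "engine", "overhaul", "helicopter"]


def find_term(ocr_word, vocabulary=VOCABULARY, cutoff=0.6):
    """Return the vocabulary word most similar to ocr_word, or None.

    Uses difflib's similarity ratio, so a word with one misread
    character can still be matched to the intended term.
    """
    matches = difflib.get_close_matches(
        ocr_word.lower(), vocabulary, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


if __name__ == "__main__":
    # "lnventory" simulates an OCR pass that read the leading "i" as "l".
    print(find_term("lnventory"))
```

A single swapped character leaves most of the word intact, so the similarity score stays high. Heavily garbled strings can score below the cutoff with this simple measure, so the threshold would need tuning against real OCR output.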