Fighting a War of Words

March 1, 2004


Intelligence and military leaders have made progress in capturing terrorists, but they really can't finish the job without understanding the plans, thought processes and movements of terrorist organizations.

Interpreting cryptic correspondence, much of it written on scraps of stained, crumpled paper or locked in e-mails, is the key. Combine that intelligence with a search of Web sites, newspapers and other documents for names, locations and additional information, and officials have the ammunition they need to stamp out terrorism.

That sounds like a good plan, but it's easier said than done. Software tools are available to help translate information, find spelling variations in names, analyze sentences and concepts, and search for terms across multiple languages. But many think these technologies aren't mature enough to keep vital information from falling through the cracks.

Software vendors have little interest in many of the languages the government must deal with, such as Pashtu and Somali, says Melissa Holland, leader of the multilingual research program at the Army Research Laboratory. Because of low market demand, the development of text-based multilingual technology has been slow to nonexistent. Other hurdles to creating reliable translation software include having to decipher torn, faded or faxed documents; nuance, tone and colloquial expressions of language; and the vast number of documents that must be evaluated. Reams of information have been recovered from caves in Afghanistan alone.

Federal agencies have been developing programs to analyze commercial products, create proprietary products, and in some cases, tweak existing ones to meet specific language-based needs. And commercial software vendors have been responsive to agencies' requests for more comprehensive translation technologies. The result has been a handful of well-received tools that agencies use with varying degrees of success.

TURN OF PHRASE

The most prominent area of text-based multilingual technology is direct translation of text from one language to another. The concept might seem straightforward, but it's far from simple when idiomatic expressions, nuance, tone, inflection, humor and dialects are factored into the equation.

"Language is the most complex of human behaviors, so it's the biggest challenge when trying to reproduce those language behaviors," says Ray Clifford, chancellor of the Defense Language Institute, which trains translators, most of whom become signals intelligence officers or debriefers of prisoners of war, in dozens of languages. "How do you make such a translation by machine for a concept that doesn't exist in the first language?"

The Defense Department's Language and Speech Exploitation Resources (LASER) program is trying to improve machine translation. About halfway into a five-year project, officials are evaluating commercial technologies and developing ways to tailor them to the government's needs. John Kovarik, chairman of LASER's text-to-text integrated project team and a senior language technology expert at the National Security Agency, wants to develop machine translation technology that will work for everyone from an FBI analyst at headquarters to a sergeant on the battlefield.

Pushing the envelope, the LASER team invested in leading machine translation providers such as Systran Software of San Diego. Today, Systran provides machine translation for a variety of languages, including most Western European languages, Chinese, Japanese, Korean and Arabic.

Under the direction of the Army Research Laboratory, the Army's Communications and Engineering R&D Command, and the Defense Information Systems Agency, LASER worked with MITRE Corp. of Bedford, Mass., to refine its Translingual Instant Messaging (TrIM) system. They developed an instant-messaging protocol that would allow U.S. troops on joint exercises with Japan in the Pacific to exchange messages using machine translation.

The team continued to upgrade the system, adding user-defined dictionaries that address vocabulary unique to a specific exercise or situation. But the accuracy rate still isn't what the government would like it to be: it is often lower than 80 percent, some vendors and industry experts say.

In an effort to provide speedy and consistent machine translation services for a variety of languages, the LASER team decided to augment traditional machine translation with new technology. Partnering with Language Weaver of Marina del Rey, Calif., a portfolio company of the CIA-funded venture firm, In-Q-Tel, the LASER team plans to develop machine translation that uses a statistical instead of traditional rules-based approach. The goal, Kovarik says, is to capture English phrases that align with their foreign language equivalents in parallel text.
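To make the statistical idea concrete, here is a minimal sketch in Python. It is not Language Weaver's or LASER's method; it only counts how often words appear together in a tiny, made-up English-Spanish parallel text and scores each pairing with a Dice coefficient. The function names and the sample sentences are assumptions for illustration, and real systems align whole phrases using far larger corpora and probabilistic models.

```python
from collections import Counter

def dice_table(parallel):
    """Score every (source word, target word) pair by co-occurrence (Dice)."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in parallel:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        # Count each pairing once per aligned sentence pair.
        pair_count.update((s, t) for s in src_words for t in tgt_words)
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in pair_count.items()
    }

def best_translation(word, table):
    """Return the target word with the highest score for a source word."""
    candidates = {t: score for (s, t), score in table.items() if s == word}
    return max(candidates, key=candidates.get) if candidates else None

# Hypothetical toy parallel text (English, Spanish).
PARALLEL = [
    ("the house", "la casa"),
    ("the book", "el libro"),
    ("a house", "una casa"),
    ("a book", "un libro"),
]

table = dice_table(PARALLEL)
print(best_translation("house", table))  # casa
print(best_translation("book", table))   # libro
```

No grammar rules are written by hand here; the pairings fall out of the counts, which is the appeal of the statistical approach for languages where rule-writing expertise is scarce.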

COMBINATION APPROACH

Although direct text translation has been the primary focus of new technologies, others have emerged to address hurdles such as multicultural name recognition.

Consider the case of Pakistani immigrant Mir Amal Kansi, a terrorist executed in Virginia in 2002 for killing two people and wounding three more outside CIA headquarters in 1993. At the time, The Washington Post decried the crime, noting that he passed through immigration checkpoints at John F. Kennedy International Airport in New York with a passport and visa listing his name as Mir Aimal Kasi.

Incidents like this have prompted some vendors to improve automated name recognition across languages. "We had to come up with a way to identify the culture of a name. Because it's clear that if you have a Chinese name with short syllables versus a Hispanic name, there is so much more syntax involved in the Hispanic name, that it's difficult to invent one algorithm to handle both of them," says Jack Hermansen, CEO of Language Analysis Systems of Herndon, Va. Vendors such as Language Analysis Systems and Basis Technology Inc. of Cambridge, Mass., have developed tools to bridge that gap for many agencies, including the Bureau of Customs and Border Protection, Homeland Security Department and intelligence organizations.
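A rough sense of how software can flag spelling variants comes from a generic string-similarity check. The sketch below uses Python's standard difflib module. It is a swapped-in, culture-blind technique, not the culture-specific approach Hermansen describes, and the function names, watch list and 0.85 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Return a similarity score between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def possible_matches(name, watch_list, threshold=0.85):
    """Return watch-list entries whose spelling is close to the given name."""
    return [entry for entry in watch_list
            if name_similarity(name, entry) >= threshold]

# Hypothetical watch list containing one spelling of the name.
WATCH_LIST = ["Mir Amal Kansi", "John Smith"]

# The passport spelling differs by two letters but still scores as a match.
print(possible_matches("Mir Aimal Kasi", WATCH_LIST))
```

A single threshold like this treats every name the same way, which is exactly the limitation the vendors quoted above set out to overcome.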

Another translation tool making inroads is multilingual information retrieval, a sort of Google search engine. By typing in an English phrase, a user can retrieve documents containing that phrase in other languages. The latest versions of this technology use Unicode, which assigns a unique number to every character, independent of platform, program or language. The standard, adopted by industry leaders such as Apple, HP, IBM, Microsoft, Oracle, SAP, Sun, Sybase and Unisys, allows multilingual information retrieval systems to work in ways they never could before, says Carl Hoffman, CEO of Basis Technology, a Cambridge, Mass.-based firm that produces the technology.
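The "unique number" Unicode assigns is called a code point, and it can be inspected directly. The short Python sketch below uses only the standard library; the helper name describe is an assumption for illustration.

```python
import unicodedata

def describe(text):
    """Return (character, code point, official Unicode name) per character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# One Latin, one Arabic and one Chinese character, written as escapes.
for ch, code_point, name in describe("A\u0628\u4e2d"):
    print(ch, code_point, name)
# A U+0041 LATIN CAPITAL LETTER A
# ب U+0628 ARABIC LETTER BEH
# 中 U+4E2D CJK UNIFIED IDEOGRAPH-4E2D
```

Because each character has the same number on every platform, a search system can compare text from different scripts without first guessing which regional encoding a file uses.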

"Let's say you have a document written in Arabic script that contains the name of an al Qaeda leader, and that document is buried on a hard disk among tens of thousands of other documents," Hoffman says. "You want to search the hard disk for any document containing that name. All you have to do is type that name in Latin letters, and it will match it as written in the Arabic script."
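One simplified way to picture the cross-script match Hoffman describes is to reduce both the Latin query and the Arabic text to a shared "consonant skeleton" and compare those. The Python sketch below is a toy under stated assumptions, not Basis Technology's method: the letter table is hypothetical and deliberately partial, vowels are simply dropped, and the sample documents are invented. Production systems rely on much richer linguistic rules.

```python
# Hypothetical, partial Arabic-to-Latin consonant table (illustration only).
ARABIC_TO_LATIN = {
    "\u0628": "b",  # beh
    "\u062d": "h",  # hah
    "\u062f": "d",  # dal
    "\u0631": "r",  # reh
    "\u0633": "s",  # seen
    "\u0644": "l",  # lam
    "\u0645": "m",  # meem
    "\u0646": "n",  # noon
}

def skeleton(word):
    """Reduce a Latin or Arabic word to a rough consonant skeleton."""
    out = []
    for ch in word.lower():
        if ch in ARABIC_TO_LATIN:
            letter = ARABIC_TO_LATIN[ch]
        elif ch.isascii() and ch.isalpha() and ch not in "aeiouwy":
            letter = ch
        else:
            continue  # skip vowels, punctuation and unmapped letters
        if not out or out[-1] != letter:  # collapse doubled consonants
            out.append(letter)
    return "".join(out)

def search(query, documents):
    """Return documents containing a word whose skeleton matches the query."""
    target = skeleton(query)
    if not target:
        return []
    return [doc for doc in documents
            if target in {skeleton(token) for token in doc.split()}]

# Invented documents: the first holds the common name "Muhammad" in Arabic script.
DOCUMENTS = ["\u0627\u0644\u0633\u064a\u062f \u0645\u062d\u0645\u062f",
             "report on shipping"]

print(search("Muhammad", DOCUMENTS) == [DOCUMENTS[0]])  # True
print(search("Mohamed", DOCUMENTS) == [DOCUMENTS[0]])   # True
```

Dropping vowels makes the toy tolerant of variant romanizations such as "Muhammad" and "Mohamed," at the cost of false matches that a real system would have to filter out.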

Another approach is multilingual information extraction, in which users can pinpoint names, places, dates and other words and phrases in a variety of sources, such as e-mail, documents and the Web. The Defense Advanced Research Projects Agency's Translingual Information Detection, Extraction and Summarization program develops systems that focus on languages such as Arabic and Chinese, in which extraction can be more difficult. Basis Technology and other vendors offer similar software.
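In its simplest English-only form, extraction can be sketched with pattern matching and a lookup list. The Python example below is an illustration under assumptions, not the DARPA program's or any vendor's software: the place list (a "gazetteer"), the date pattern and the sample sentence are all invented, and it says nothing about the harder Arabic and Chinese cases the article mentions.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Matches dates written like "March 1, 2004".
DATE_RE = re.compile(rf"\b(?:{MONTHS})\s+\d{{1,2}},\s+\d{{4}}\b")

# Hypothetical gazetteer of place names to look for.
PLACES = {"Kabul", "Karachi", "New York"}

def extract(text):
    """Pull dates and known place names out of free text."""
    dates = DATE_RE.findall(text)
    places = sorted(p for p in PLACES
                    if re.search(rf"\b{re.escape(p)}\b", text))
    return {"dates": dates, "places": places}

sample = "The courier left Karachi on March 1, 2004, and reached Kabul on March 9, 2004."
print(extract(sample))
# {'dates': ['March 1, 2004', 'March 9, 2004'], 'places': ['Kabul', 'Karachi']}
```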

Government language experts say the best way to increase efficiency of translation technologies is to use a combination approach: not just machine translation or multilingual entity extraction, for example, but two or more techniques. That's especially true of machine translation, says Doug Naquin, director of the Foreign Broadcast Information Service, a CIA organization that translates the text of daily broadcasts, government statements and news articles from non-English sources. "The technology has gotten better, but not to the point where most of the people we hire to do language work would feel comfortable saying it can all be done via machine translation," he says. Machine translation technology is most useful as a filtering technique, Naquin says, "because no matter how many people we hire, we can't keep up with the volume."

Other tool combinations also have proved useful. The Foreign Broadcast Information Service staff, for instance, integrates translation tools with search engines. "By combining multilanguage information extraction and information retrieval, for example, we can enter a search term like 'SARS in Asia,' and use the extraction entity to give us responses both in English and the original language," Naquin says.

"We've come a long way," the Defense Language Institute's Clifford says. "Today, we've got multiple enterprises tasked to deal with this issue, and that's an important step."

Karen D. Schwartz is a writer specializing in technology and business issues. She has written for numerous publications, including Business 2.0, CIO, InformationWeek, Electronic Business and Mobile Computing & Communications.
