Wednesday, October 15, 2008

The language barrier in KM

While learning about KM, I noticed that I don't know much about some tools or websites that most of my classmates are aware of, such as Facebook or MySpace. Instead, I'm aware of some web sites that they probably don't know. The reason is very simple. I usually use this blog thing to contact my friends in Korea, so I go to the web site which is popular among them and provides services in Korean. Moreover, I noticed that this web site started to provide services in English, Japanese and Chinese about a year ago. This fact made me think about the language barrier in KM and its solution. To understand the language barrier in KM more deeply, I did some research and found this interesting article about it.

The language barrier in KM

In recent years, due to the great spread of the Internet and the introduction of new communication technologies, the amount of information available online has quickly increased, and the need to use different languages while having conversations or accessing documents has become a potential barrier to knowledge diffusion. Users surfing the Internet for personal reasons face this problem in a way that limits their opportunity to access and fully understand the information they retrieve. Moreover, many companies and organizations have to deal with documents written in foreign languages or have to organize online events and meetings with partners of different nationalities, so the need to use and speak a common language is growing strongly.

Though most of the people connected to the Internet are not native English speakers (Spanish and Asian languages are in fact the most widespread), English is playing an increasingly dominant role in international communication and in shared written documents. However, the need to preserve linguistic and cultural diversity acts as a barrier that is difficult to overcome in the short and medium term, preventing the diffusion of English as the world's primary communication language. The lack of a common writing system is the second biggest issue. If we only think of how different Roman scripts and Chinese or Korean ideograms appear, it seems really difficult to find a common standard to be adopted by all countries.

In the last century many researchers and scientists dreamt of mechanical devices capable of translating between many languages in real time. Imagine having a phone call with someone who doesn't speak your language but still being able to understand each other, each speaking his own language, thanks to some hidden on-the-fly translation tool. Though still visionary, this idea is starting to look reasonably achievable in the near future.
The first attempts at using computers for language translation date back to the mid-fifties. The earliest systems were based on large bilingual dictionaries where, for every word of the source language, the software provided one or more equivalent terms in the target language. There were, obviously, many limitations to this approach, related in particular to formal grammar and the syntactic order of words in a sentence. After a twenty-year break in this research field, the early nineties were a major turning point for machine translation. Many companies (such as IBM, Sharp, NEC, Microsoft and so on) began investing in real-time translation systems with the aim of providing good-quality translations (often intended for publication) and support for Internet applications and web sites.
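The dictionary-based approach those earliest systems used can be sketched in a few lines. This is a toy illustration, not any historical system's actual code, and the tiny English-to-French dictionary is invented; it also shows exactly why the approach fails, since grammar and word order are never handled:

```python
# A minimal sketch of 1950s-style dictionary-based translation:
# every source word is replaced by its dictionary equivalent,
# with no handling of grammar, agreement, or word order.
BILINGUAL_DICT = {
    "the": "le",
    "cat": "chat",
    "eats": "mange",
    "fish": "poisson",
}

def word_for_word(sentence: str) -> str:
    """Translate by direct lookup; unknown words pass through unchanged."""
    return " ".join(BILINGUAL_DICT.get(w, w) for w in sentence.lower().split())

print(word_for_word("The cat eats fish"))  # -> le chat mange poisson
```

The output is intelligible but ungrammatical in the target language, which is precisely the limitation the article describes.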

Reading and translating an existing paper written in a foreign language and retrieving information from it are two quite different situations. The most common approach to both problems is so-called Machine Translation (MT). The first MT software packages analysed the source text and, through a hidden set of rules, attempted to translate it and generate the target text without human intervention. The translation provided by these systems was often very rough and couldn't be used as is. A human translator was still necessary to get better-quality results, even if some applications tried to improve the process by limiting the vocabulary through a dictionary and a set of predefined sentences and grammar. Some systems were also trained to learn from corrections.

Translation machines have improved their effectiveness by using specific language models that enable natural languages to be understood and processed in a useful and smart way. The translation process is made up of two complex steps: 1) decoding the meaning of the source text and 2) re-encoding this meaning in the target language. The decoding process begins with analysing and understanding the written text by matching single words against the built-in dictionary and by using a semantic tagger that annotates each word with a representation of its meaning. Then the re-encoding process produces the response in the target language. This sub-system can produce an entire translation, or it can be limited to producing predefined sentences or a summary. This last stage can still involve human interaction to correct and improve translation quality. Today, quality is also improved by customizing terminology, by creating ad hoc dictionaries for different disciplines (economics, engineering, literature and so on) and by building pre-processing and post-processing scripts to avoid known issues.
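The two-step decode/re-encode pipeline described above can be illustrated with a toy example. The meaning tags, dictionaries and target language here are all invented for illustration; a real semantic tagger is far more sophisticated, but the shape of the pipeline is the same:

```python
# Toy two-step MT pipeline:
# step 1 (decode): map each source word to a language-neutral meaning tag
# step 2 (re-encode): map each meaning tag to a word in the target language
DECODE = {  # English word -> meaning tag (hypothetical tag set)
    "dog":  "ANIMAL_CANINE",
    "runs": "ACTION_RUN",
}
ENCODE = {  # meaning tag -> Spanish word
    "ANIMAL_CANINE": "perro",
    "ACTION_RUN":    "corre",
}

def translate(sentence: str) -> str:
    # decoding: annotate every word with a representation of its meaning
    tags = [DECODE.get(w, f"UNK({w})") for w in sentence.lower().split()]
    # re-encoding: realise each meaning tag in the target language
    return " ".join(ENCODE.get(t, t) for t in tags)

print(translate("dog runs"))  # -> perro corre
```

Because the intermediate layer is language-neutral, adding a new target language in this scheme only requires a new ENCODE table, not a new decoder.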
Since these activities must be maintained over time, MT is strongly inter-linked with language engineering, terminology management, translation memory, and human translation processes. Yet problems still arise due to language ambiguities. Words can be lexically ambiguous when they have different meanings, and a sentence can be structurally ambiguous when changing the word order also changes its meaning. Many words are at least two-ways ambiguous (i.e. they have two possible meanings), and this is a problem because ambiguities multiply. For example, a sentence consisting of ten words, each two-ways ambiguous, and with just two possible structural analyses can have 2^10 × 2 = 2,048 different translations. In this sense, software applications can face several problems trying to understand single words or entire sentences. Different languages often use different structures for the same purpose, and the same structure for different purposes, which is in itself another difficult problem to cope with. In these cases machine translation needs human intervention to understand the full meaning of the text. Other problems are related to users' expectations and needs. Indeed, a few years ago researchers used to design and build software applications without understanding what users actually want from a language technology and from these kinds of machines. Moreover, it takes many years to fully implement an entire language, with its dictionary, ontology and phrasal verbs, and this requires so many resources that it often becomes very expensive for many companies and organizations.
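The combinatorial explosion behind the ten-word example is simple to compute: sense choices multiply per word, and structural parses multiply once more over the whole sentence:

```python
# Ambiguity multiplication: with s senses per word, n words, and p
# structural parses, the number of candidate readings is s**n * p.
def candidate_readings(words: int, senses_per_word: int, parses: int) -> int:
    return senses_per_word ** words * parses

# The article's example: 10 words, 2 senses each, 2 parses.
print(candidate_readings(10, 2, 2))  # -> 2048
```

Even modest per-word ambiguity therefore makes exhaustive disambiguation intractable for long sentences, which is why MT systems rely on language models and, ultimately, human intervention.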

Software and Technology

In a general overview of translation tools and MT systems, it is important to distinguish two main user needs: dissemination and interchange. The first seems most important for companies that publish or produce documents and information in different languages. As we saw before, MT systems produce translations that need to be post-edited by a human translator, even if some systems have been developed to handle a limited range of language styles and text content in order to reduce human revision. To this end, MT systems have also been improved by integrating translation tools (dictionary, terminology database, translation memories and so on) with authoring and publishing processes. These translation workstations are now able to produce very high-quality translation output. The second need concerns participants involved both in online communication (telephone, audio/video conference) and in translation of Internet-based text (e-mails, newsgroups, web sites). Here, MT systems are not designed to solve the problem, since the need is to quickly understand the content of messages or to retrieve the right information and data from a document. Another issue concerns the development of systems able to understand spoken language and produce an acceptable translation. This problem is still far from being solved, due to factors such as word pronunciation and voice inflection that make it difficult to develop universal systems without an initial training session. Over 30 years ago, SYSTRAN developed the first MT system for professional translations, and it soon became the state of the art for other developers.
Today, SYSTRAN offers a wide range of solutions for individual users and professional services, providing translation technologies for the Internet, PCs and network infrastructures that facilitate communication in 36 language pairs and 20 specialized domains. All SYSTRAN systems use one translation engine combined with the latest Natural Language Processing (NLP) technologies. They also make use of XML and Unicode over the HTTP protocol to speed up access to very large linguistic knowledge bases available online. These systems are based on a scalable and modular architecture, so they can be fully customized to the needs and requirements of different customers. Similar services are offered by LOGO systems, a translation company founded in 1979 to help and support multinational companies with technical documentation. LOGO systems tools are fully scalable and customizable and offer translation and language services, multilingual content and online terminology management, translation memory and so on. Logosys is a product specifically designed for business purposes, since it integrates both a project management and a knowledge management module. The first ensures fast turnaround times, the selection of the most qualified team and online access to project status, while the latter allows all translation memories, client knowledge bases, terminology databases and customized glossaries to be centralized and leveraged in one highly organized networked archive. Logo systems are based on WCGT (workflow, content, glossary, translation) management technology, with the support of powerful Oracle databases that store and retrieve format-independent translations to facilitate future updates. An alternative approach is used by Meaningful Machine, a company founded in 2000 to develop, patent and commercialize translation technologies and tools.
Its most important product is the Fluent Machine, based on a core technology that understands natural language and on AI-based semantic techniques. The Fluent Machine works with two different processes: the first enables a computer to automatically generate a database of translated words/strings, sentences and other language units by examining written text; the second connects translated words in a target language with human-quality accuracy. The system can automatically build many new, longer word-strings each time a new entry is made to the cross-language database. This synergy allows faster and more efficient translation of documents to and from many different languages. One significant benefit of this approach is that the cross-language database is built during the translation process itself and is continuously updated with new terminology and strings. The need for online translation of web pages, e-mails, and other Internet resources and documents has also facilitated, in recent years, the diffusion of on-demand Internet-based translation services. Some search engines have interfaces in different languages to help users seek only content and documents written in their own language, or provide access to tools for immediate translation. Some examples of these tools are Transphere (including Chinese, Arabic and Japanese), Easy Translator 4 (with 12 language pairs), AltaVista Babel Fish (based on the SYSTRAN engine) and WebSphere. The last was developed by IBM in partnership with Alis Technologies and offers a more comprehensive solution to break down language barriers by centralizing an organization's translation memories, glossaries, lexicons and other linguistic assets, including e-mail, web pages, chat, documents and other content; it can manage 18 different languages, with Chinese and Japanese support.
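A cross-language database that grows as translations are approved can be sketched roughly as follows. This is an assumption-laden illustration in the spirit of the description above, not Meaningful Machine's actual algorithm: it stores human-approved string pairs and greedily prefers the longest known word-string over word-by-word lookup, which is why longer stored strings improve quality over time.

```python
# Sketch of a growing cross-language string database (a translation-memory
# style approach). Stored pairs are hypothetical examples.
class CrossLanguageDB:
    def __init__(self) -> None:
        self.pairs: dict[str, str] = {}  # source word-string -> target string

    def learn(self, source: str, target: str) -> None:
        """Add a human-approved translation pair to the database."""
        self.pairs[source.lower()] = target

    def translate(self, sentence: str) -> str:
        words = sentence.lower().split()
        out, i = [], 0
        while i < len(words):
            # Greedy match: try the longest known word-string starting at i.
            for j in range(len(words), i, -1):
                chunk = " ".join(words[i:j])
                if chunk in self.pairs:
                    out.append(self.pairs[chunk])
                    i = j
                    break
            else:
                out.append(words[i])  # unknown word: pass through
                i += 1
        return " ".join(out)

db = CrossLanguageDB()
db.learn("good morning", "buenos días")
db.learn("friend", "amigo")
print(db.translate("Good morning friend"))  # -> buenos días amigo
```

Because every new approved pair is immediately available to later lookups, the database improves continuously during translation, which matches the benefit the article attributes to this approach.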

Future Scenarios

Over the next years, Machine Translation and Language Engineering will become more and more important, and the number of applications and tools designed and developed will increase at great speed. A new market will be driven by the need for text-to-speech (and vice versa) applications, and many industries will focus their efforts on breaking into it. Science fiction has often told us about vocal interfaces to any digital tool, and sometimes the envisaged scenario has made us dream of a keyboard-less world. Some applications, like Scansoft's Real Speak (a text-to-speech software based on L&H's language engine), now make this scenario closer than we expected. Motorola's Lexicus Division is working on its Message Connect system, a combination of e-mail and voice messaging that can read e-mail to recipients over the phone and then reply by dictating a message back. To help and support travellers, many industries are working on embedding machine translation engines in smartcards that, through a membrane microphone, can produce a real-time translation into a selected language. Moreover, the R&D department of L&H has produced a prototype of travel sunglasses able to translate road signs, marquees and other text in a specific implemented language. After many years of development, then, commercial systems are now able to satisfy the needs of many multilingual companies that deal with partners and customers around the world every day and, above all, of those companies looking for cost-effective production of good-quality translation for dissemination purposes. A gap still exists for minor languages, including those of Eastern Europe, Africa and India, even though many developers are working on dictionaries and thesauri for these languages to be added to existing systems.
Some skeptics still feel that a fully automated MT system is unachievable in the short term, but the above scenarios allow us to say that within the next 20 years the dream of the Tower of Babel may become reality.

1 comment:

Yeona said...

Very good observation.

Some large global corporations provide language translation capability in their KM systems to facilitate sharing.