Dr M Sasikumar
Despite the high degree of linguistic diversity and a very small fraction of the population being able to use English, software localisation is still a largely unknown element in India – in information technology or computer services (IT/CS) curricula, in academia and in industry. The difficulties in rendering Indian language scripts and the large number of widely spoken languages may be part of the reason. The predominantly export oriented IT industry looks for candidates and software applications with English capability and is relatively blind to Indian language requirements. Thus, IT education is also biased towards people learning English in order to use computers, rather than getting computers to converse in our native languages.
Today, Indian language rendering on computers is no longer a research problem. Effective solutions are available on most browsers and operating systems. Fonts are also available for many Indian languages. Large scale e-Governance initiatives at both central and state government levels, as well as the high penetration of mobile phones in the rural areas of India, are drawing the non-English speaking community into the IT space. Relying on an intermediary to access e-Governance services would offset a number of the major advantages of government-to-citizen (G2C) services. On the other hand, large scale use of these services directly by citizens requires a strong focus on software localisation – a tough technological challenge for a highly multilingual country like India.
Though not very widely known, localisation is not totally unknown to India. There have been attempts to get GNU/Linux desktops, office application suites, web browsers, etc fully localised into a few Indian languages. The major initiative of the Department of Information Technology, Government of India, to release a national CD in all official languages of India has popularised these efforts, in addition to encouraging more focused efforts in this direction by organisations such as the Centre for Development of Advanced Computing (CDAC). However, there are no studies yet indicating how many people who would not otherwise have used computers have started using ICT thanks to these efforts. One hopes that the increasing penetration of ICT applications into rural areas will help this happen more widely.
In this article, the author looks at the road ahead in making this happen, at the challenges and problems involved, and at how to address them. Indian language computing, so far largely dormant and pursued as an academic subject in a few institutions, is now picking up momentum, and the associated technologies will need to play a pivotal role in making the task manageable. This article looks primarily at the translation aspect of localisation, in the context of work in the area of Indian language computing.
Stages of Localisation
A prerequisite to localising any software into a given language is the availability of a suitable character encoding, fonts to display text in that language, a keymap to enter text, etc. Word processors and editors to create such text are also required to make this task feasible. The second stage, which for most applications is the last, is to convert all textual output from the application into the corresponding language. In suitably internationalised software, such text is kept in files separate from the source code files. This enables one to hand over these files to language translators, who then translate them into the desired language. Because of this separation, they do not have to see or deal with any aspect of programming. In theory, this is all you need to get your software in the desired language. In practice, it is a much harder problem. Some of these issues are discussed in the next section.
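The separation of messages from program code can be illustrated with a minimal Python sketch. The catalog layout, the message keys and the function name translate_message are assumptions made for this illustration, and the Hindi entries are romanised placeholders; a real project would keep each catalog in its own file for the translators.

```python
# Hypothetical message catalogs. In practice each language would live in its
# own file that is handed to translators; the program code never changes.
CATALOGS = {
    "en": {
        "file.open": "Open",
        "file.save": "Save",
        "greeting": "Welcome, {name}",
    },
    "hi": {
        # Romanised Hindi placeholders, for illustration only.
        "file.open": "kholen",
        "file.save": "sahejen",
    },
}


def translate_message(key: str, lang: str, **params: str) -> str:
    """Return the message for `key` in `lang`, falling back to English."""
    template = CATALOGS.get(lang, {}).get(key) or CATALOGS["en"][key]
    return template.format(**params)


print(translate_message("file.open", "hi"))
print(translate_message("greeting", "hi", name="Asha"))
```

The fallback to English means a partially translated catalog still yields a usable interface, which is one possible design choice and not the only one.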
In general, just changing the language in which messages are displayed does not suffice to make the software usable for the target users, namely, the relatively non-techie, non-English speaking community. A third stage, called cultural localisation, is needed to fill this gap. This includes adapting the icons used by the application, implicit cultural references in the messages, cultural conventions used by the software, etc. This is largely an open area today, with very little work done. Making these kinds of conventions explicit during software development, so that they become amenable to localisation, is a hard problem.
Translation for Localisation
Translation for localisation differs from translation of standard books or other literature. Most of the pieces of text to be translated are short; many, such as menu entries and commands, are just one word long. Given the natural ambiguity of words in a language, this poses a tricky problem. For example, the word ‘close’ may mean the opposite of ‘open’ (as seen in File menus) or the opposite of ‘far’ (in the sense of near). This information is not visible from the string ‘close’ alone. Normal translation does not face the same problem, since the rest of the sentence or the previous/following sentences provide the clues to identify the intended meaning. This indicates the need to capture the context of each text segment when externalising it during the development process. Defining the context is itself a difficult problem. A context is normally represented using some human understandable annotation, such as an alternative word with the same intended meaning (e.g. shut) or a sentence containing the word in its intended meaning. In the absence of useful annotations of this form, the translator needs to look at the source code where the text appears in order to disambiguate the word. Since those who translate and those who develop software are often disjoint groups of people, this is not desirable.
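One way to picture such annotations is a catalog keyed by both the source string and a developer-written context note, in the spirit of the message context field of GNU gettext catalogs. The sketch below is only an illustration: the Entry type, the context wording and the romanised Hindi targets are all assumptions.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    text: str     # source string shown in the user interface
    context: str  # human-readable hint written by the developer


# Hypothetical catalog: the same source word appears twice, told apart only
# by its context annotation. Targets are romanised Hindi placeholders.
TRANSLATIONS = {
    Entry("close", "File menu: shut the document"): "band karen",
    Entry("close", "Map view: near, opposite of far"): "nikat",
}


def translate_in_context(text: str, context: str) -> str:
    """Return the translation for this (text, context) pair, else the source text."""
    return TRANSLATIONS.get(Entry(text, context), text)


print(translate_in_context("close", "File menu: shut the document"))
print(translate_in_context("close", "Map view: near, opposite of far"))
```

With the annotation stored next to the string, the translator never needs to open the source code to find out which sense of the word is meant.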
Another concern is with respect to the choice between transliteration and translation. Usually, in translation tasks, proper nouns are transliterated (that is, the same word is written using the target language script), and other words are translated (that is, an equivalent word from the target language is found e.g. kaksh for classroom). However, for software applications this may not be very useful. Most technical words coined in Indian languages to correspond to usual technical vocabulary are less familiar to the users than their corresponding English words (e.g. words like phone, mobile, computer, etc), and hence, purely from a usability point of view transliteration may be more appropriate. This choice is to be carefully exercised keeping in mind multiple considerations.
Consistency of translation is another concern. Again, this is not as serious in normal translation tasks, since there is usually adequate context for the reader to decipher the intended meaning. In software localisation, however, most terms have a very specific meaning, and hence term usage needs to be very consistent across the software. For example, the terms teacher and faculty may not be interchangeable in general. The source application itself may use these terms to mean different roles, and hence separate terms need to be found for them in the target language as well. The mapping and distinction should be maintained across the entire application.
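Both kinds of inconsistency can be checked mechanically once the term pairs used by the translators have been collected. The following Python sketch assumes such pairs are already available as (source term, target term) tuples; the function names and the romanised Hindi sample data are hypothetical.

```python
from collections import defaultdict


def find_inconsistent_terms(term_pairs):
    """Source terms that were rendered by more than one target term."""
    targets = defaultdict(set)
    for source, target in term_pairs:
        targets[source.lower()].add(target)
    return {s: sorted(t) for s, t in targets.items() if len(t) > 1}


def find_merged_terms(term_pairs):
    """Target terms shared by distinct source terms, so a distinction is lost."""
    sources = defaultdict(set)
    for source, target in term_pairs:
        sources[target].add(source.lower())
    return {t: sorted(s) for t, s in sources.items() if len(s) > 1}


# Illustrative pairs gathered from different screens of one application.
pairs = [
    ("teacher", "shikshak"),
    ("faculty", "sankay"),
    ("Teacher", "adhyapak"),  # the same role translated differently elsewhere
    ("faculty", "sankay"),
]
print(find_inconsistent_terms(pairs))
print(find_merged_terms([("teacher", "shikshak"), ("faculty", "shikshak")]))
```

A report of this kind only flags candidates; whether two renderings are really in conflict remains a decision for the translator.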
Automating Localisation with Natural Language Processing
Given the complexity and size of the localisation task – primarily caused by the translation component – there has been interest in building tools and frameworks to automate this task as far as possible. A number of commercial and open source tools are available, with a varying range of capabilities. Support for Indian languages is not yet common among them, perhaps because the current market in India is small. In principle, one could use a machine translation system to perform these translations. However, this is not likely to work, for the reasons discussed below.
Machine translation is an extremely complex task involving a variety of difficulties. It has been a research problem the world over for more than half a century. There have been some notable successes, but a general purpose, good quality automated translation system is still not in view for most language pairs. For most software localisation, the translation will be from English to Indian languages. English and Indian languages are structurally very different, which makes the task particularly complex. A number of experimental systems address this problem. These include the Matra system from CDAC Mumbai, the Mantra system from CDAC Pune, Shakti from International Institute of Information Technology (IIIT) Hyderabad and Anusarak from Indian Institute of Technology (IIT) Kanpur. They differ in their approaches to translation. No reliable comparative performance analysis of these systems is available today. Given arbitrary sentences or fragments from a random domain, none of them would perform well, illustrating the complexity of the task. In general, for practical deployment, one resorts to either restricting the domain or accepting lower quality of translation. Mantra takes the former route, whereas Matra takes the latter.
For software localisation, this poses a problem. From a practical utility point of view, a translation tool for localisation cannot restrict itself to a single domain such as healthcare or education. It therefore has to deal with open domain sentences. However, it is not possible to accept low quality translation either, since the messages will be used regularly by novice end users, and if they are not clear, the user experience will be poor.
A popular approach to translation for open domains has been to follow the statistical approach. Google’s translation work is based on this approach. However, this requires a large corpus of bilingual closely matching sentences. For localisation, this is very unlikely to be available, since the sentence collection may vary widely from application to application.
As mentioned earlier, identifying the right choice for transforming a word – either through transliteration or translation – is a difficult issue. A widely accepted glossary of common technical and computer terms would be very helpful here. Ideally, we need glossaries at multiple levels. Firstly, we need a glossary of normal use terms (as in a dictionary) likely to be useful in software applications. This differs from a dictionary in its coverage of words: complex, colloquial and old-usage terms are not likely to be found in software messages. The second level would be a domain specific glossary. For example, all packages relating to education would need terms such as classroom, teacher, syllabus, curriculum, examination, etc, and this glossary could map these words into the corresponding words in the target language. This helps significantly in disambiguating words, since most application packages are concerned with only one or two domains. For example, consider the use of the term examination in a medical application and in an e-Learning application. The third level glossary would be for the particular application. This may introduce specific terms that are used only in that application. For most languages and application domains, such glossaries do not exist. For some languages, a general purpose dictionary is available.
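The three glossary levels can be thought of as a layered lookup in which the most specific level wins. A minimal Python sketch, assuming each glossary is a simple word-to-word mapping, is given below; the glossary contents are romanised Hindi placeholders and the names (build_glossary, APP_SPECIFIC, and so on) are hypothetical.

```python
from collections import ChainMap

# Hypothetical glossaries with romanised Hindi targets, for illustration only.
GENERAL = {"examination": "pariksha", "teacher": "shikshak", "file": "file"}
MEDICAL = {"examination": "jaanch"}
EDUCATION = {"classroom": "kaksha", "syllabus": "pathyakram"}
APP_SPECIFIC = {"gradebook": "ank-pustika"}


def build_glossary(application, domain, general):
    """Layer the glossaries so that the most specific level is consulted first."""
    # ChainMap searches its mappings from left to right.
    return ChainMap(application, domain, general)


elearning = build_glossary(APP_SPECIFIC, EDUCATION, GENERAL)
clinic = build_glossary({}, MEDICAL, GENERAL)

print(elearning["examination"])  # resolved by the general glossary
print(clinic["examination"])     # overridden by the medical glossary
```

Note that the general glossary keeps "file" as a transliteration, which reflects the earlier point that a familiar English word may be the more usable choice.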
Glossaries and other linguistic resources introduce another difficulty, which hinders the use of localisation tools across languages: the lack of standards for creating these resources. The variation may lie in the nature of storage – relational database, Extensible Markup Language (XML) structures, simple text formats, etc. – as well as in the variety of information stored per item. Formulating adequately general standards would help in sharing application programmes across languages and in sharing these language resources across different applications.
A related concern, in the context of standards, is the specification of context in the externalised source text file. Context here means any information that helps the user to disambiguate among the possible meanings of the text segment. However, this level of generality makes it hard for automated systems to make use of the information. Context handling is, in general, a hard problem in natural language processing. An appropriate framework for it, specific to software localisation, needs to be worked out.
All these make complete automation of translation in the context of localisation extremely difficult. Systems, therefore, normally resort to providing as much support as possible to the human translator. This can include the following:
Assistance with the formation of inflections for select words. This can be a major issue with inflection rich languages like Hindi, Malayalam, etc. These tools derive the probable inflected forms of a given word, enabling the user to simply select the required form rather than type it. Providing auto-completion of words given the first few characters, using a dictionary, is another useful support that reduces the keying in effort.
Looking up words automatically in the dictionary, and providing mouse-over or tool-tip selection of the target word is also easy to provide.
A richer degree of help would be identifying fragments of text from earlier translated sentences which also appear in the current segment for translation. Often, common fragments in two sentences would result in similar translation, and hence can be reused. However, this is not necessarily true, and hence, once again, human discretion is required to make the final choice.
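The auto-completion support mentioned in the list above can be sketched in a few lines of Python, assuming the dictionary is available as a sorted word list. The function name complete and the romanised Hindi word list are illustrative assumptions, and a real tool would work on the target script.

```python
from bisect import bisect_left


def complete(prefix, sorted_words, limit=5):
    """Return up to `limit` dictionary words that start with `prefix`."""
    # Binary search finds the first word that is not smaller than the prefix;
    # all completions, if any, follow it contiguously in a sorted list.
    start = bisect_left(sorted_words, prefix)
    matches = []
    for word in sorted_words[start:]:
        if not word.startswith(prefix):
            break
        matches.append(word)
        if len(matches) == limit:
            break
    return matches


# Illustrative dictionary of romanised Hindi words.
WORDS = sorted(["kaksha", "kalam", "kitab", "pariksha", "pathyakram", "shikshak"])

print(complete("ka", WORDS))
print(complete("pa", WORDS))
```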
Thus one finds that most commercial localisation support tools restrict their support to locating matching sentences from the memory. However, most of these are used for content localisation, where this is normally effective more often than in the case of software localisation. In the case of software localisation, the reuse can only come from prior versions of the same software, same class of software (two learning management systems from two different vendors), or software in the same application domain. If the source and target languages are grammatically similar, a higher degree of help can be provided easily, since most of the effort would be in word or phrase level translation.
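Locating matching sentences from a memory of earlier translations can be approximated with the standard Python difflib module, as in the sketch below. The memory contents, the function name suggest and the similarity threshold of 0.6 are assumptions made for illustration; commercial tools use their own matching schemes.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: earlier source segments with the
# translations that were accepted for them (romanised Hindi placeholders).
MEMORY = {
    "Save the file": "file sahejen",
    "Close the file": "file band karen",
    "Open a new window": "nayi window kholen",
}


def suggest(segment, memory, threshold=0.6):
    """Return (similarity, source, target) tuples, with the best match first."""
    scored = []
    for source, target in memory.items():
        ratio = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if ratio >= threshold:
            scored.append((ratio, source, target))
    return sorted(scored, reverse=True)


# The translator sees ranked suggestions and makes the final choice.
for ratio, source, target in suggest("Save the files", MEMORY):
    print(f"{ratio:.2f}  {source!r} -> {target!r}")
```

The suggestions are only candidates for reuse; as noted above, similar source text does not guarantee that the earlier translation fits.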
India needs to see a substantial increase in the effort in the area of software localisation, to ensure that the domestic consumption and exploitation of ICT capabilities reaches the common man. We need to urgently build shareable standardised linguistic resources such as dictionaries and grammar tools. Software and content localisation is an excellent application area for research work in natural language processing, focussing on adapting the current technologies and approaches to address the problems of the sort mentioned above. One may also be able to build specialised solutions for localisation.
The topic of software localisation needs to find its way into the curriculum of various IT related courses, to build more awareness and get more people to work in this area. A lot of resources are available in the open source community to learn about and practise software localisation. This is also an excellent area for volunteer efforts by NGOs, etc, and for student projects in Bachelor in Engineering (BE)/ Masters in Engineering (ME)/ Masters in Computer Applications (MCA) and other computer related courses. Apart from being an interesting field, work in this area will be a significant social contribution.
The Open Source Software Resource Centre (OSSRC), located at Mumbai (URL: http://ossrc.org.in), would be happy to assist any group interested in pursuing this area and to host their contributions so that they can be shared with potential users.