Making Computers Converse in Native Languages

By Elets News Network, 01-September-2008

Dr M Sasikumar

Despite India's high degree of linguistic diversity, and although only a very small fraction of the population can use English, software localisation remains a largely unknown element in India, whether in information technology or computer services (IT/CS) curricula, academia or industry. The difficulties in rendering Indian language scripts and the large number of widely spoken languages may be part of the reason. The predominantly export-oriented IT industry looks for candidates and software applications with English capability and is relatively blind to Indian language requirements. IT education is therefore also biased towards people learning English in order to use computers, rather than getting computers to converse in our native languages.

Today, Indian language rendering on computers is no longer a research problem. Effective solutions are available on most browsers and operating systems, and fonts are available for many Indian languages. Large scale e-Governance initiatives at both central and state government levels, as well as the high penetration of mobile phones in the rural areas of India, are drawing the non-English speaking community into the IT space. Using an intermediary to avail of e-Governance services would offset a number of major advantages of government-to-citizen (G2C) services. On the other hand, large scale use of these services directly by citizens requires a strong focus on software localisation – a tough technological challenge for a highly multilingual country like India.

Though not very widely known, localisation is not totally unknown in India. There have been attempts to get GNU/Linux desktops, office application suites, web browsers, etc fully localised into a few Indian languages. The major initiative of the Department of Information Technology, Government of India, to release a national CD in all official languages of India has popularised these efforts, in addition to encouraging more focused efforts in this direction by organisations such as the Centre for Development of Advanced Computing (CDAC). However, there are as yet no studies indicating how many people who would not otherwise have used computers have started using ICT thanks to these efforts. One hopes that the increasing penetration of ICT applications into rural areas will help this happen more often.

In this article, the author looks at the road ahead in making this happen, examining the challenges and problems and how to address them. Indian language computing, so far largely dormant and pursued as an academic subject in a few institutions, is now picking up momentum, and the associated technologies will need to play a pivotal role in making the task manageable. This article looks primarily at the translation aspect of localisation, in the context of work in the area of Indian language computing.

Stages of Localisation

A prerequisite to localising any software into a given language is the availability of a suitable character encoding, fonts to display text in that language, a keymap to enter text, etc. Word processors and editors to create such text are also required to make the task feasible. The second stage, which for most applications is also the last, is to change all textual output from the application into the corresponding language. In suitably internationalised software, such text is kept in files separate from the source code files. This enables one to hand these files over to language translators, who then translate them into the desired language. Thanks to this separation, they do not have to see or deal with any aspect of programming. In theory, this is all you need to get your software into the desired language. In practice, it is a much harder problem. Some of the issues are discussed in the next section.
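
The separation of text from code described here can be illustrated with a minimal Python sketch. It is not taken from any particular localisation framework: the catalogue layout, the message keys and the function name `message` are assumptions made for illustration, and the Hindi entries are romanised placeholders rather than vetted translations. In a real project the catalogues would live in separate resource files handed to translators.

```python
# Hypothetical message catalogues, one per language. They are inlined here
# only to keep the sketch self-contained; normally each would be a separate
# resource file that translators edit without touching the program.
CATALOGUES = {
    "en": {
        "file.open": "Open",
        "file.save": "Save",
        "greeting": "Welcome, {name}",
    },
    "hi": {
        # Romanised placeholder translations, for illustration only.
        "file.open": "kholen",
        "file.save": "sahejen",
    },
}


def message(key, lang, default_lang="en", **params):
    """Return the text for `key` in `lang`, falling back to `default_lang`
    when the language or the individual message has not been translated."""
    template = CATALOGUES.get(lang, {}).get(key)
    if template is None:
        template = CATALOGUES[default_lang][key]
    return template.format(**params)


print(message("file.open", "hi"))
print(message("greeting", "hi", name="Asha"))  # falls back to English
```

The program refers to text only through keys such as `file.open`, so adding a language means adding a catalogue, not changing code. The fallback to English for untranslated messages is one possible design choice among several.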

In general, just changing the language in which messages are displayed does not suffice to make the software usable for the target users, namely the relatively non-technical, non-English speaking community. A third stage, cultural localisation, is needed to fill this gap. This includes adapting the icons used by the application, implicit cultural references in the messages, cultural conventions used by the software, etc. This is largely an open area today, with very little work done. Making these kinds of conventions explicit during software development, so that they become amenable to localisation, is a hard problem.

Translation for Localisation

Translation for localisation differs from the translation of books or other literature. Most of the pieces of text to be translated are short, and many, such as menu entries and commands, are just one word long. Given the natural ambiguity of words in a language, this poses a tricky problem. For example, the word ‘close’ may mean the opposite of ‘open’ (as seen in File menus) or the opposite of ‘far’ (in the sense of near). This cannot be determined from the string ‘close’ alone. In normal translation, this does not create a similar problem, since the rest of the sentence or the preceding and following sentences provide clues to the intended meaning. This indicates the need to capture the context of each text segment when externalising it during the development process. Defining the context is itself a difficult problem. A context is normally represented using some human-understandable annotation, such as an alternative word with the same intended meaning (e.g. shut) or a sentence containing the word in its intended meaning. In the absence of useful annotations of this form, the translator needs to look at the source code where the text appears in order to disambiguate the word. Since those who translate and those who develop software are often disjoint groups of people, this is not desirable.
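
One way to picture such context annotations is a catalogue keyed by the pair (text, context) instead of by the text alone. The sketch below is a hypothetical illustration in Python: the catalogue contents, the context labels `shut` and `near`, and the function name `translate` are assumptions, and the target words are romanised placeholders rather than vetted translations.

```python
# Hypothetical catalogue keyed by (source text, context annotation).
# The annotation is an alternative word with the same intended meaning,
# as suggested in the text above.
CATALOGUE = {
    ("close", "shut"): "band karen",
    ("close", "near"): "nikat",
    ("save", "store on disk"): "sahejen",
}


def translate(text, context=None):
    """Translate `text`, using `context` to pick among several senses.

    Without a context the lookup succeeds only if the catalogue holds
    exactly one candidate translation; otherwise it refuses to guess.
    """
    if context is not None:
        return CATALOGUE[(text, context)]
    candidates = {
        target for (source, _ctx), target in CATALOGUE.items() if source == text
    }
    if len(candidates) != 1:
        raise LookupError(f"{text!r} is unknown or ambiguous without context")
    return candidates.pop()


print(translate("close", "shut"))
print(translate("save"))  # unambiguous, so no context is needed
```

Calling `translate("close")` without a context raises `LookupError`, which mirrors the situation of a translator who sees only the bare string.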

Another concern is the choice between transliteration and translation. Usually, in translation tasks, proper nouns are transliterated (that is, the same word is written using the target language script), and other words are translated (that is, an equivalent word from the target language is found, e.g. kaksh for classroom). However, for software applications this may not be very useful. Most technical words coined in Indian languages to correspond to the usual technical vocabulary are less familiar to users than the corresponding English words (e.g. phone, mobile, computer), and hence, purely from a usability point of view, transliteration may be more appropriate. This choice has to be exercised carefully, keeping multiple considerations in mind.

Consistency of translation is another concern. Again, this is not so serious in normal translation tasks, since there is normally adequate context for the reader to decipher the intended meaning. But in software localisation most terms have a very specific meaning, and hence term usage needs to be very consistent across the software. For example, the terms teacher and faculty may not be interchangeable in general. The source application itself may use these terms to mean different roles, and hence separate terms need to be found for them in the target language as well. The mapping and the distinction should be maintained across the entire application.
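
A consistency requirement of this kind lends itself to a mechanical check. The following Python sketch assumes that the term pairs used across an application have already been extracted as (source term, target term) tuples; that extraction step, the function name `find_inconsistencies` and the romanised target terms are all hypothetical.

```python
from collections import defaultdict


def find_inconsistencies(term_pairs):
    """Report terminology problems in a list of (source, target) pairs.

    Returns two dictionaries:
      split  - source terms rendered by more than one target term
      merged - target terms used for more than one source term
    """
    targets_by_source = defaultdict(set)
    sources_by_target = defaultdict(set)
    for source, target in term_pairs:
        targets_by_source[source].add(target)
        sources_by_target[target].add(source)
    split = {s: sorted(t) for s, t in targets_by_source.items() if len(t) > 1}
    merged = {t: sorted(s) for t, s in sources_by_target.items() if len(s) > 1}
    return split, merged


# Romanised placeholder terms, for illustration only.
pairs = [
    ("teacher", "shikshak"),
    ("faculty", "sankay"),
    ("teacher", "adhyapak"),   # a second rendering of "teacher"
    ("faculty", "shikshak"),   # "faculty" collapsed onto the word for "teacher"
]
split, merged = find_inconsistencies(pairs)
print(split)
print(merged)
```

A report like this only flags candidates. Whether two renderings are a genuine inconsistency still needs a human decision.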

Automating Localisation with Natural Language Processing

Given the complexity and size of the localisation task, caused primarily by the translation component, there has been interest in building tools and frameworks to automate it as far as possible. A number of commercial and open source tools are available with a varying range of capabilities. Support for Indian languages is not yet common among them, perhaps because the market in India is currently small. In principle, one could use a machine translation system to perform these translations. However, this is not likely to work, for the reasons discussed below.

Machine translation is an extremely complex task involving a variety of difficulties. It has been a research problem the world over for more than half a century. There have been some notable successes, but a general purpose, good quality automated translation system is still not in sight for most language pairs. For most software localisation in India, the translation will be from English to Indian languages. English and Indian languages are structurally very different, which makes the task particularly complex. A number of experimental systems that work on this problem are available. These include the Matra system from CDAC Mumbai, the Mantra system from CDAC Pune, Shakti from the International Institute of Information Technology (IIIT) Hyderabad and Anusarak from the Indian Institute of Technology (IIT) Kanpur. They differ in their approaches to translation. No reliable comparative performance analysis of these systems is available today. Given arbitrary sentences or fragments from a random domain, none of them would perform well, which illustrates the complexity of the task. In general, for practical deployment, one resorts either to restricting the domain or to accepting a lower quality of translation. Mantra takes the former route, whereas Matra takes the latter.

For software localisation, this poses a problem. To be of practical use, a translation tool for localisation cannot restrict its domain to healthcare, education, etc. It therefore has to deal with open domain sentences. However, it is not possible to accept low quality translation either, since the translated messages will be used regularly by novice end users, and if the messages are not clear, the user experience can be poor.

A popular approach to translation for open domains has been the statistical approach. Google’s translation work is based on this approach. However, it requires a large bilingual corpus of closely matching sentences. For localisation, such a corpus is very unlikely to be available, since the collection of sentences may vary widely from application to application.

As mentioned earlier, choosing how to render a word, whether through transliteration or translation, is a difficult issue. A widely accepted glossary of common technical and computer terms would be very helpful here. Ideally, we need glossaries at multiple levels. Firstly, we need a glossary of normal use terms (as in a dictionary) likely to be useful in software applications. This differs from a dictionary in its coverage of words: complex, colloquial and old-usage terms are normally not likely to be found in software messages. The second level would be a domain specific glossary. For example, all packages relating to education would need terms such as classroom, teacher, syllabus, curriculum, examination, etc, and this glossary could map these words into the corresponding words in the target language. This helps significantly in disambiguating words, since most application packages are concerned with only one or two domains. For example, consider the term examination, which means different things in a medical application and in an e-Learning application. The third level glossary would be for the particular application. This may introduce specific terms that are used only in that application. For most languages and application domains, such glossaries do not exist. For some languages, a general purpose dictionary is available.
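
The three glossary levels can be thought of as a chain of lookups in which the most specific level wins. A minimal Python sketch, assuming each glossary is a simple word-to-word mapping, uses the standard library's `collections.ChainMap`. The glossary contents, the application name `demo_lms` and the function name `build_glossary` are hypothetical, and the target words are romanised placeholders.

```python
from collections import ChainMap

# Level 1: general terms likely to occur in software messages.
GENERAL = {"teacher": "shikshak", "examination": "pariksha"}

# Level 2: one glossary per application domain.
DOMAIN = {
    "education": {"classroom": "kaksh", "examination": "pariksha"},
    "medical": {"examination": "jaanch"},
}

# Level 3: terms specific to a single application (hypothetical name).
APPLICATION = {"demo_lms": {"classroom": "online kaksh"}}


def build_glossary(domain, application=None):
    """Combine the three levels so that application terms override domain
    terms, which in turn override the general glossary."""
    return ChainMap(
        APPLICATION.get(application, {}),
        DOMAIN.get(domain, {}),
        GENERAL,
    )


print(build_glossary("medical")["examination"])
print(build_glossary("education")["examination"])
print(build_glossary("education", "demo_lms")["classroom"])
```

`ChainMap` searches its mappings in the order given, so the same source word resolves differently depending on the domain and application chosen, which is the disambiguation effect described above.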

Glossaries and other linguistic resources introduce another difficulty, which hinders the use of localisation tools across languages: the lack of standards for creating these resources. The variation may be in the nature of storage (relational database, Extensible Markup Language (XML) structures, simple text formats, etc.) as well as in the variety of information stored per item. Formulating adequately general standards for this would help in sharing application programmes across languages and language resources across different applications.

A related concern in the context of standards is the specification of context in the externalised source text file. Context here means simply any information that helps the user to disambiguate among the possible meanings of a text segment. However, this level of generality makes it hard for automated systems to make use of the information. Context handling is, in general, a hard problem in natural language processing. For software localisation specifically, an appropriate framework for this needs to be worked out.

All of this makes complete automation of translation in the context of localisation extremely difficult. Systems therefore normally resort to providing as much support as possible to the human translator. This can include the following:

Assistance with forming inflections of selected words. This can be a major issue with inflection-rich languages like Hindi, Malayalam, etc. These tools derive the probable inflected variants of a given word, enabling the user to simply select the form rather than having to type it. Auto-completion of words from their first few characters, using a dictionary, is another useful support that reduces the keying-in effort.

Looking up words automatically in the dictionary, and providing mouse-over or tool-tip selection of the target word is also easy to provide.

A richer degree of help would be identifying fragments of text from earlier translated sentences which also appear in the current segment for translation. Often, common fragments in two sentences would result in similar translation, and hence can be reused. However, this is not necessarily true, and hence, once again, human discretion is required to make the final choice.
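
Of the supports listed above, dictionary-based auto-completion is the simplest to sketch. The Python example below assumes the dictionary is available as a sorted list of words and uses a binary search to find the first candidate. The word list and the function name `complete` are hypothetical, and the words are romanised for readability.

```python
from bisect import bisect_left


def complete(prefix, sorted_words, limit=5):
    """Return up to `limit` words from `sorted_words` that start with `prefix`.

    `sorted_words` must be sorted; all words sharing a prefix then sit
    next to each other, starting at the position found by binary search.
    """
    start = bisect_left(sorted_words, prefix)
    matches = []
    for word in sorted_words[start:]:
        if not word.startswith(prefix):
            break
        matches.append(word)
        if len(matches) == limit:
            break
    return matches


# Romanised placeholder dictionary, kept in sorted order.
WORDS = sorted(["pustak", "pariksha", "parivar", "parinam", "kaksh"])
print(complete("pari", WORDS))
```

A production tool would work on the target language script and would probably rank candidates by frequency. This sketch simply returns them in dictionary order.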

Thus one finds that most commercial localisation support tools restrict their support to locating matching sentences from a memory of earlier translations. However, most of these tools are used for content localisation, where such matching is effective more often than in software localisation. In software localisation, reuse can come only from prior versions of the same software, from the same class of software (e.g. two learning management systems from two different vendors), or from software in the same application domain. If the source and target languages are grammatically similar, a higher degree of help can be provided easily, since most of the effort would be in word or phrase level translation.
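
Locating matching sentences in a memory of earlier translations can be sketched with approximate string matching. The Python example below uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure. Commercial tools use their own matching methods, so this is only an illustration under that assumption. The memory contents, the 0.75 threshold and the function name `best_match` are hypothetical, and the target strings are romanised placeholders.

```python
from difflib import SequenceMatcher


def best_match(segment, memory, threshold=0.75):
    """Find the stored source sentence most similar to `segment`.

    `memory` maps previously translated source sentences to their
    translations. Returns (source, translation, score), or None when
    nothing reaches `threshold`.
    """
    best_source, best_score = None, 0.0
    for source in memory:
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score > best_score:
            best_source, best_score = source, score
    if best_source is None or best_score < threshold:
        return None
    return best_source, memory[best_source], best_score


# Romanised placeholder translations, for illustration only.
MEMORY = {
    "Save the file": "file sahejen",
    "Open the file": "file kholen",
}
print(best_match("Save this file", MEMORY))
print(best_match("Delete all users", MEMORY))
```

The suggestion is offered to the translator rather than applied automatically, since a close match in the source does not guarantee that the earlier translation is appropriate.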

Conclusion

India needs to see a substantial increase in effort in the area of software localisation, to ensure that the domestic consumption and exploitation of ICT capabilities reaches the common man. We urgently need to build shareable, standardised linguistic resources such as dictionaries and grammar tools. Software and content localisation is an excellent application area for research work in natural language processing, focussing on adapting current technologies and approaches to address problems of the sort mentioned above. One may also be able to build specialised solutions for localisation.

The topic of software localisation needs to find its way into the curriculum of various IT related courses, to build more awareness and get more people to work in this area. A lot of resources are available in the open source community to learn about and practise software localisation. This is also an excellent area for volunteer efforts by NGOs, etc, and for student projects in Bachelor in Engineering (BE), Masters in Engineering (ME), Masters in Computer Applications (MCA) and other computer related courses. Apart from being an interesting field, work in this area will be a significant social contribution.

The Open Source Software Resource Centre (OSSRC), located in Mumbai (URL: http://ossrc.org.in), would be happy to assist any group interested in pursuing this area and to host contributions so that they can be shared with potential users.
