Su Tra : An Intelligent Translator Tool for Incremental Localisation

By Elets News Network, 01-September-2008

One of the most important goals of e-Government applications is to reach people in the most remote corners of the country. A main requirement for this is that the application's interface be in the language of the local people, and software localisation makes this possible. More and more people are working to make software applications available in multiple languages. But because new versions of applications come out very often, the translation effort spent on previous versions is rendered largely useless. SuTra is an intelligent translation suggestion tool for Hindi that suggests to translators the reusable parts of older-version translations.

Background

In this age of information technology, when every field is benefiting from the technology revolution, government organisations are not lagging behind. e-Governance applications are becoming increasingly popular, with both citizens and government showing increased confidence in them. The railway reservation website, the passport application automation system, online postal services and online systems for the administration of zilla panchayats are some examples of e-Governance applications.

The beneficiaries of e-Governance applications include citizens, who can use the services from anywhere they like, and government organisations, which gain more efficient ways of handling important data. A large share of ordinary citizen users live in rural areas, where English is rarely used. Having to learn English first in order to use an application would discourage them. Hence, it is essential to offer software applications in the language of the people. This means having the entire application interface in the local language and providing the means to type and store data in that language. Developing applications from scratch in multiple languages is too time-consuming and has inherent drawbacks. Localisation is a solution to this problem.

Introduction

Localisation is the process of customising and adapting software to the local market. This means adapting the software interface (commands, menus, messages, etc.) to the local language and providing the means to input and process local-language text. Locales, encodings, fonts, rendering engines and input methods are used to input, display and process text in any language. One of the most important processes in localisation is translation, and this is the focus of this article. In software localisation, the translation process involves translating all the displayed messages, which include menu names, error messages, welcome/user messages, feature lists, etc., into the local language.

Software localisation and software development generally happen in parallel, so that the localised version is available at the same time as the English version, or with minimum delay. For this purpose all the messages, menu names, etc. are extracted into separate files, called string files, and development proceeds independently of these string files. The string files are distributed to the various translators to translate into their respective languages. The translated files are then consolidated and kept in the corresponding language directories, from where the application reads them. For example, all English string files are kept in a directory named ‘en’ and all Hindi string files are kept in a directory named ‘hi’. When the user selects a language from the language option in the application, the strings from the corresponding language directory are fetched.

Every application has a number of string files, with the size of each file varying from 10 strings to 10,000 strings. Different applications use different string file formats, the most common being the PO file format. Each format gives, at a minimum, the original English string and its corresponding translation in the local language.
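To make the string file structure concrete, here is a minimal sketch of reading the English string (`msgid`) and its translation (`msgstr`) from PO text. It assumes only the simplest form of PO entries; real PO files also carry plural forms, contexts and escape sequences, which this sketch does not handle. The function name `parse_po` and the sample strings are illustrative and are not taken from SuTra.

```python
import re

# Matches a PO keyword line such as: msgid "Check the Status"
_ENTRY = re.compile(r'^(msgid|msgstr)\s+"(.*)"\s*$')
# Matches a continuation line such as: "more text"
_CONT = re.compile(r'^"(.*)"\s*$')


def parse_po(text):
    """Return a list of (msgid, msgstr) pairs from simple PO text."""
    pairs = []
    msgid, msgstr, current = None, None, None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        m = _ENTRY.match(line)
        if m:
            key, value = m.groups()
            if key == "msgid":
                # A new msgid closes the previous entry, if any.
                if msgid is not None:
                    pairs.append((msgid, msgstr or ""))
                msgid, msgstr, current = value, None, "msgid"
            else:
                msgstr, current = value, "msgstr"
            continue
        c = _CONT.match(line)
        if c and current == "msgid":
            msgid += c.group(1)
        elif c and current == "msgstr":
            msgstr += c.group(1)
    if msgid is not None:
        pairs.append((msgid, msgstr or ""))
    return pairs


# Illustrative sample: one translated and one untranslated entry.
SAMPLE = '''
# A translated entry
msgid "Check the Status"
msgstr "स्थिति जाँचें"

# An untranslated entry
msgid "Check the Reservation Status"
msgstr ""
'''

for english, translation in parse_po(SAMPLE):
    print(repr(english), "->", repr(translation))
```

An empty `msgstr` marks a string that still needs translating, which is the kind of string a translation assistance tool would try to find matches for.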

Translation of strings is not as easy as it sounds. It is a very tedious and time-consuming process. In most applications the number of words to be translated is huge; for example, the OpenOffice suite has approximately 120,000 words to translate. This is coupled with the need for the translations to be complete, grammatically correct and consistent in terminology across the whole application. These and other issues make the translation process lengthy, so the localised version of an application is released much later than its non-localised counterpart. In a world driven by technology, where new versions or new applications are released every other day, this delay means that localised software reaches its users too late.

These problems can be reduced to a great extent with translation assistance tools. The assistance can take several forms: quick reference to context-specific translations of words, help in maintaining consistent terminology, or automatic suggestions for translations of the strings. We have found that many of these are possible, especially the last.

Issues in Translation

As mentioned in the earlier section, string files have to be translated with a lot of care to ensure that there is no ambiguity for the users of the application. Before starting the translation, a translator needs to consider the following issues:

First, the translator should have a sound vocabulary as well as a good amount of technical knowledge. Technical knowledge is essential because applications use many technical terms, for instance ‘logout’ or ‘logged-in as’, that do not occur in ordinary day-to-day usage.
Often, different translations are used for the same word in similar contexts, confusing the end user. For example, the translation of the word ‘Cancel’ in the context of ‘Cancel Ticket’ can be either ‘radh karna’ or ‘nisprabhav karna’. One form should be used everywhere in the application.
Often the English word is better understood than its local counterpart. Hence the choice between translating and transliterating a word must be made and followed consistently throughout the application. For example, the term ‘email’ is popular as it is and will be better understood than its translation.
Consulting hard-copy translation references such as dictionaries is time-consuming. Dictionaries with an easy search facility and an exhaustive set of context-specific translations of words will be very useful.
Person and number should be retained in translations.
Translated messages can sometimes be longer than their English counterparts, thus spoiling the application interface. Such translations should be given special attention.

The above issues make the translation process slow and tedious. However, there are solutions possible for many of these issues, which will help make the process more efficient.

Most of the problems discussed above can be reduced to a large extent by having translation assistance systems. Assistance can range from providing glossary (a context specific translation dictionary) support to suggestions on translations to automatic translations. For every word, translations used by other translators in the system should be readily visible so that they can be reused in case of the word appearing again in the application, in the same context. Also, there should be easy mechanisms to update the glossary with specific context translations. Translation assistance tools can combine one or more of these components to provide a good tool to help efficient translation. Translation assistance tools are feasible today, thanks to sophisticated natural language processing and translation memory techniques.

The broad approach followed by translation assistance tools is as follows:

Prepare a repository of translated strings from earlier translations. Let us call this the old version set.
For every string in the file to be translated (let us call this the new version file), matches are obtained from the old version set.
The corresponding translations of the matched words or phrases are indicated at the appropriate positions in the new version file.
Glossary support is provided for quick reference to context-specific translations of words in the strings of the new version file. The glossary also addresses inadequate familiarity with vocabulary, since a translator with an average vocabulary has easy reference to the context-specific translations used by more experienced translators.
Translators can update the glossary with more translations for future use.

Existing Systems and our Approach

There are many translation tools that provide assistance at various levels. But most of them either do not handle Indian languages or support only one string file structure, the PO format. Also, almost all the tools require the user to translate every word at least once before it becomes available for reuse. These systems then replace all the matching phrases in all the strings with the corresponding translations.

The most popular open-source translation assistance tool is Kbabel. This is a stand-alone application which uses translation memory. The user can feed the translation memory with translated strings. Whenever the user asks for translations from the translation memory, Kbabel automatically replaces the matched phrases with the corresponding translations. Poedit is a PO file editor which helps in easy identification of translated text and fuzzy translations. It also features whitespace highlighting. visuallocalise needs the user to enter the translation of any word once and then replaces all occurrences of the same word with that translation, irrespective of the context of its usage. It does not support Indian languages. Trados Translator’s Workbench uses translation memory to indicate translations of matching phrases. It is a Windows platform editor with no support for Indian languages.

The approaches used by the systems discussed above have limitations. Most of the systems replace all occurrences of a phrase with the translation provided at one position, without considering the context in which the phrase is used. Also, the matching used is primitive and can be enhanced for better results. Most of them support only the PO structure, so not every application's string files can be translated with them. None of them has attempted automatic translation so far. These factors motivated us to develop SuTra.

Our Approach

We propose a multiuser translation assistance tool, SuTra, that makes intelligent suggestions to translators on possible reuse of translations from older versions of a system or from systems in similar domains. The translator opens the file to be translated (the new version file) in SuTra and also provides a zip of all translated files of the earlier version, or of another application in a similar domain (the old version set). The system then computes all the possible matches for every string in the new version file from the old version set. SuTra places every string of the new version into one of the following three categories:

Perfect matches – These untranslated strings have an identical copy in the old version set. Hence the translation of the corresponding string from the old set can be used as it is. For example, if there is a string ‘Check the Status’ in both the old set and new version then its translation can be used as it is.
Partial matches – These untranslated strings have strings in the older version set which partially match it. For example, the strings ‘Check the Status’ and ‘Check the Reservation Status’ match partially and hence the translation of one can be used partially when translating the other.
No match – Strings for which no matches were found in the old version set.
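The three categories above can be sketched in a few lines. This is only an illustration under stated assumptions: a perfect match is taken to be an exact string match, and a partial match to be any overlap of lowercased words. The article does not say how SuTra tokenises strings or handles case, and the name `classify` and the placeholder translation are hypothetical.

```python
def words(s):
    """Split a string into lowercase words (a simplifying assumption)."""
    return s.lower().split()


def classify(new_string, old_set):
    """Classify new_string against old_set.

    old_set maps an English string from the old version to its translation.
    Returns "perfect", "partial" or "none".
    """
    if new_string in old_set:
        return "perfect"
    new_words = set(words(new_string))
    for old_string in old_set:
        if new_words & set(words(old_string)):
            return "partial"
    return "none"


# Placeholder translation; a real old version set would hold Hindi strings.
old_set = {"Check the Status": "<translation of 'Check the Status'>"}

for s in ["Check the Status", "Check the Reservation Status", "Logout"]:
    print(s, "->", classify(s, old_set))
```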

Whenever translators choose to edit a string, they get the corresponding list of matches for that string for reference during translation. They can use all or part of a matching translation with a simple copy-paste mechanism. SuTra also provides glossary help for each word in the string. While translating, translators can refer to the glossary entry for any word with a simple mouse-over and can use any of the entries with just one mouse-click. This mechanism has been tested and found to be very time-efficient. Translators can also update the glossary with new entries.

In SuTra, the partial matches are further classified into arbitrary and consecutive, based on the pattern of matches.

Arbitrary matches – Old version strings that share some words with the new version string, irrespective of the positions of those words. For example, the strings ‘Book a Ticket’ and ‘Ticket Reservation Status’ qualify as arbitrary matches, because the word ‘Ticket’ occurs in both strings.
Consecutive matches – Old version strings that share some words with the new version string, with the relative positions of those words the same as in the new version string. For example, the strings ‘Controls the boxes that are displayed around the main content’ and ‘Customise the menus that are displayed at the top and/or bottom of the page.’ are consecutive matches, with the phrase ‘that are displayed’ being the match.
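The difference between the two kinds of partial match can be illustrated as follows. This is a sketch, not SuTra's matching code: it assumes that a consecutive match means the longest run of words that appears contiguously in both strings (which is what the example above suggests), and that words are compared in lowercase with surrounding punctuation stripped. The function names are hypothetical.

```python
_PUNCT = ".,;:!?\"'"


def tokens(s):
    """Lowercase words with surrounding punctuation removed."""
    stripped = (w.strip(_PUNCT).lower() for w in s.split())
    return [w for w in stripped if w]


def arbitrary_match(a, b):
    """Words that occur in both strings, regardless of position."""
    b_words = set(tokens(b))
    seen, shared = set(), []
    for w in tokens(a):
        if w in b_words and w not in seen:
            seen.add(w)
            shared.append(w)
    return shared


def consecutive_match(a, b):
    """Longest run of words that appears contiguously in both strings."""
    ta, tb = tokens(a), tokens(b)
    best_len, best_end = 0, 0
    prev = [0] * (len(tb) + 1)
    for i in range(1, len(ta) + 1):
        cur = [0] * (len(tb) + 1)
        for j in range(1, len(tb) + 1):
            if ta[i - 1] == tb[j - 1]:
                # Extend the common run that ended at the previous words.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return ta[best_end - best_len:best_end]


print(arbitrary_match("Book a Ticket", "Ticket Reservation Status"))
print(consecutive_match(
    "Controls the boxes that are displayed around the main content",
    "Customise the menus that are displayed at the top and/or bottom of the page.",
))
```

The consecutive computation is the standard longest-common-substring table applied to words instead of characters, so its cost grows with the product of the two string lengths.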

Translations of consecutively matching phrases can be reused more readily than translations of words at arbitrary positions. Often the consecutive match set is good enough, and there may be no need to refer to the arbitrary match set at all. Keeping the two sets separate saves a lot of computation overhead in such cases.

SuTra further has the facility for the translator to specify the minimum number of words that should match. This feature was provided keeping in mind that it may not be of much use for the translators to have only one or two words matching in a long string, because effort in reuse may not be worth it. Translators can also choose to instruct SuTra to ignore conjunctions, prepositions and articles when identifying matches.

However, this feature of ignoring articles, conjunctions and prepositions is not used when computing consecutive matches. This is because every word and its position matter when reusing translations of consecutively matching phrases; for example, the translations of ‘Ram is a boy’ and ‘Ram is the boy’ are very different.
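The two settings described above, the minimum number of matching words and the option to ignore articles, conjunctions and prepositions, can be sketched for arbitrary matches as follows. The stop-word list here is a small illustrative sample, not SuTra's actual list, and the function names and defaults are assumptions.

```python
# A small illustrative list of articles, conjunctions and prepositions.
STOP_WORDS = {
    "a", "an", "the",
    "and", "or", "but",
    "of", "in", "on", "at", "to", "for", "with", "by", "from",
}


def shared_words(new, old, ignore_stop_words=False):
    """Words of `new` that also occur in `old` (arbitrary matching)."""
    old_words = set(old.lower().split())
    shared = {w for w in new.lower().split() if w in old_words}
    if ignore_stop_words:
        shared -= STOP_WORDS
    return shared


def is_useful_match(new, old, min_words=2, ignore_stop_words=True):
    """True if enough words match to make reuse worth the effort."""
    return len(shared_words(new, old, ignore_stop_words)) >= min_words


# Only 'the' is shared, so the match disappears once stop words are ignored.
print(is_useful_match("Check the Status", "Book the Ticket",
                      min_words=1, ignore_stop_words=True))
print(is_useful_match("Check the Status", "Book the Ticket",
                      min_words=1, ignore_stop_words=False))
```

Following the text, such filtering would apply only to arbitrary matches; consecutive matching would compare the full word sequence.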

SuTra also lets translators change settings at runtime. For example, translators who are viewing arbitrary matches and are not happy with them can change the minimum word limit, switch to consecutive matches, and so on. This is essential for better reuse of translations.

SuTra also allows multiple users to work on the system at the same time. This feature was designed keeping in mind that a translation project is generally huge, with multiple translators working on different files of the system at the same time. The context-specific glossary helps translators maintain consistency throughout the application. Based on the existing translations, SuTra will provide automatic translations for new strings and automatic updating of the glossary.

SuTra at work

SuTra has been implemented as a web-based solution using JSP and JavaScript technologies. MySQL has been used as the backend database for storing user information, and XML has been used for the glossary. An XML structure is used because it is one of the popular and standard ways of representing data. Also, XML allows one to have custom-defined tags to represent the data. Currently the glossary supports the Hindi language alone. Below is an entry from the glossary showing various meanings of the word ‘open’ in Hindi:

The <english> tag represents the English word and the <hindi> tags represent the Hindi translations. Each <hindi> tag has two parts: <hi_word> holds the Hindi translation and <context> gives information on the context of the translation. SuTra can be installed on a central server with 512 MB or more of RAM and a Tomcat server installed on it. Users can connect to the system on this server using a web browser. It is best viewed in Mozilla Firefox. In Internet Explorer (IE), the glossary feature does not display well because IE does not support some parts of the CSS specification that are used in the SuTra implementation.
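Returning to the glossary structure, the sketch below shows what an entry built from the tags described above might look like, and how it could be read. It is a hypothetical reconstruction, not the actual SuTra glossary: only the tag names <english>, <hindi>, <hi_word> and <context> come from the description above, while the wrapping <glossary> and <entry> elements, the Hindi words chosen and the context labels are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical glossary fragment. The <glossary> and <entry> wrappers
# and the context wording are assumptions, not SuTra's real schema.
GLOSSARY_XML = """
<glossary>
  <entry>
    <english>open</english>
    <hindi>
      <hi_word>खोलें</hi_word>
      <context>action: open a file or window</context>
    </hindi>
    <hindi>
      <hi_word>खुला</hi_word>
      <context>state: something that is already open</context>
    </hindi>
  </entry>
</glossary>
"""


def lookup(xml_text, word):
    """Return (hindi_word, context) pairs for an English word."""
    root = ET.fromstring(xml_text.strip())
    results = []
    for entry in root.iter("entry"):
        english = entry.findtext("english", default="").strip().lower()
        if english != word.lower():
            continue
        for hindi in entry.findall("hindi"):
            results.append((
                hindi.findtext("hi_word", default="").strip(),
                hindi.findtext("context", default="").strip(),
            ))
    return results


for hi_word, context in lookup(GLOSSARY_XML, "open"):
    print(hi_word, "|", context)
```

Keeping the context next to each translation is what allows one English word to carry several translations without ambiguity.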

The process flow in Su Tra

The translator needs to give the necessary input, which includes the file to translate, a zip of the translated files, the choice of arbitrary or consecutive partial matches and the minimum number of words to be matched. Currently, only the PO file format is supported by the system. The translator can also instruct SuTra to ignore articles, conjunctions, prepositions, etc. Based on this input, SuTra processes the strings and indicates which strings have which type of match. The figure below shows the SuTra screen after this processing.

The green bar indicates strings with a perfect match, the red bar indicates strings with partial matches and the black bar indicates strings with no match.

To edit the translation of a string, translators click on the edit link below ‘msgstr’. This takes them to the editing page shown in illustration 2.
The user can enter the translation in the ‘translated Devnagiri text’ block. The ‘Keyboard’ link provides the language keyboard reference. Words highlighted in red have entries in the glossary; in the illustration, the popup menu on the word ‘error’ shows the glossary entry for it. Translators just need to click on a glossary entry to use it. A list of matched strings is provided in the ‘best possible matches’ region. This list is presented in descending order of the number of matching words, and translators can reuse the translations from it by a simple copy-paste mechanism. Currently only the Devanagari script is supported, so translations to Hindi and Marathi are possible using this system.
Translators can also update the glossary with context-specific translations using the ‘Update Glossary’ link.
After translating all the strings, the translator can download the translated PO file and work on the next file.
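The ordering of the ‘best possible matches’ list can be sketched as a simple ranking by the number of shared words. This is an illustration only: it assumes lowercase word overlap as the score, and the function name and the placeholder translations (T1, T2, T3) are hypothetical.

```python
def rank_matches(new_string, old_set):
    """Return (score, old_string, translation) tuples, best match first.

    old_set maps an English string from the old version to its translation.
    The score is the number of distinct words shared with new_string.
    """
    new_words = set(new_string.lower().split())
    scored = []
    for old_string, translation in old_set.items():
        score = len(new_words & set(old_string.lower().split()))
        if score:
            scored.append((score, old_string, translation))
    # Stable sort: equal scores keep the order of the old version set.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


old_set = {
    "Book a Ticket": "T1",
    "Check the Status": "T2",
    "Ticket Reservation Status": "T3",
}

for score, old_string, translation in rank_matches(
        "Check the Reservation Status", old_set):
    print(score, old_string, translation)
```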
Conclusion

The first version of SuTra is now ready. We tested SuTra by translating gedit, the open source text editor that comes with GNU/Linux. We used two versions of gedit, 2.20 and 2.22, one translated and one untranslated. The file to be translated had 1097 strings. For a translator with above-average vocabulary knowledge, the entire process took 8 hours without SuTra and 10 minutes using SuTra. This was because most of the strings had a perfect match in the other version and no string had ‘no match’ in the other version. Further tests are in progress.

Future Work and Enhancements

As a part of the future work we plan to implement the following features in SuTra:
- Provide support for more string file structures; currently only the PO file format is supported. The idea is to have a generic format for SuTra and a converter which will convert any string file format to this format.
- Provide support for more Indian language scripts. Currently only Devanagari is supported. It is envisaged that more Indian languages will be supported and that users will have the option of selecting the language for translation.
- Improve the suggestion techniques for partial matches.
  The idea is to provide automatic translation support using SuTra.
References:
1. Sasikumar M, Aparna R, Naveen K and Rajendra Prasad M. Guide to Localisation. http://www.iosn.net, 2004.
2. Bert Esselink. A Practical Guide to Localization. John Benjamins Publishing Company, Volume 4.
3. Kbabel: http://www.kbabel.org
4. Poedit: http://www.poedit.net
5. visuallocalize: http://wwwvisloc.com
6. Ulrich Drepper, Jim Meyering, Francois Pinard and Bruno Haible. GNU gettext tools, version 0.17. Native Language Support Library and Tools, Edition 0.17, 31 October 2007.
7. Leon J. Osterweil, Charles M. Schweik, Norman K. Sondheimer and Craig W. Thomas. Analyzing Processes for E-Government Application Development: The Emergence of Process Definition Languages. http://www.haworthpress.com/web/JEG. The Haworth Press, 2004.

Replication of Bihar RTI Call Centre Model Likely

United Progressive Alliance (UPA) Chief Sonia Gandhi called upon Congress Chief Ministers to adopt the model of a Right to Information (RTI) call centre run by the NDA government in the Indian state of Bihar. This would make information accessible to the poor and illiterate. However, the UPA government is weighing the system before replicating it at the Centre.

The Department of Personnel and Training (DoPT), the nodal agency for RTI, has given an “in principle” clearance to the call centre, but it is still unsure whether to follow the 2007 Bihar model and bear the cost of converting a phone call into an RTI application.
In case the government charges the cost of the call centre to the applicant, the total amount payable would be INR 115, instead of the usual INR 10 for a regular RTI application. Thus, the application fee would increase more than tenfold.

The Bihar RTI model may soon be replicated in the Congress-run states. Speaking on the non-subsidised option of filing RTI, Personnel Minister Prithviraj Chavan said that if the call centre route worked out to be more expensive, the citizen had the option of sending his RTI application by post. “Cost is not a major issue”.
