June 2004

PAN Localisation regional initiative: Developing local language computing


Information has now become such an integral part of our society that access to it is considered a basic human right. This is because the development of rural and urban populations in developing countries increasingly depends on access to information. This applies especially to Asia, which houses the largest developing population. ICTs, including the Internet, are the largest repository of this information. Although Asians have been the largest group of Internet users since 2001, these users still form only about 4.5% of the total Asian population. This shows that there is enormous potential for Internet usage in Asia.

However, in addition to being the most populous, Asia is also the most culturally and linguistically diverse region of the world. There are 2197 languages spoken in Asia, the largest number of languages spoken in any one region. Only about 20% of Asians can communicate in English. This makes the English language content available on ICTs inaccessible to a large majority of Asians, and particularly affects those living in the rural areas of developing countries in Asia.

Investments have been made in developing ICT infrastructure in Asia. Nevertheless, the persisting digital divide attests that the current path of providing connectivity and technology infrastructure alone will not enable the majority of Asian populations to benefit from the information now available. Multiple problems perpetuate this divide. One obvious reason is that these populations cannot get past the obstacle of English language content. Unless these large non-English speaking populations are able to generate and access content in their native languages, they will not be able to use ICTs effectively for their development.


Enabling ICTs in the local language of the user is known as “localisation”. Specifically, it means enabling the computing experience in the linguistic culture of the user. Linguistic culture is not limited to the language itself but extends to how the language is used in the user's environment. Thus, for Punjabi speakers in India, the computer should display the language in Gurmukhi script, while for Punjabi speakers in Pakistan, the same language should be displayed in Arabic script.
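The Punjabi example can be made concrete with a small Python sketch. It is an illustration only, not part of the PAN Localisation project: the helper name scripts_used is hypothetical, and reading the script from the first word of each character's Unicode name is a rough heuristic rather than a proper script lookup.

```python
import unicodedata

def scripts_used(text):
    """Return the set of script labels found in text.

    The label is the first word of each character's Unicode name,
    e.g. 'GURMUKHI LETTER PA' -> 'GURMUKHI'. This is a heuristic,
    not an official script property lookup.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if not ch.isspace()
    }

# The word "Punjabi" as written in India (Gurmukhi script).
punjabi_india = "\u0a2a\u0a70\u0a1c\u0a3e\u0a2c\u0a40"
# The same word as written in Pakistan (Arabic script).
punjabi_pakistan = "\u067e\u0646\u062c\u0627\u0628\u06cc"

print(scripts_used(punjabi_india))     # {'GURMUKHI'}
print(scripts_used(punjabi_pakistan))  # {'ARABIC'}
```

The same language therefore needs two entirely different sets of encoded characters, fonts and keyboard layouts depending on where its speakers live.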

Localisation of ICTs requires definition and implementation of standards. These standards include character set encoding, keyboard (and keypad) layout, collation/sorting sequence, locale and ICT terminology. In addition to definition of standards, applications also need to be developed for local language computing to support access and generation of local language content. There is a large variety of applications required, some being more fundamental in nature, while others are more advanced and complex but equally vital for end users. These applications include fonts, lexicon, thesaurus, spell checker, grammar checker, text-to-speech system, speech recognition system, machine translation system and optical character recognition system.

Survey of state of localisation in Asia

Localisation and development of applications is only starting for many Asian languages. The reasons have been a lack of commercial incentives (as these markets do not promise large financial returns for software vendors) and the complexity of the local Asian languages. A survey was conducted during a recent training on “Fundamentals of Local Language Computing” held as part of the PAN Localisation project (details presented later) at Lahore, Pakistan, in January 2004. Localisation experts and developers from 13 different countries participated in this training and provided the data collated in the following tables.

Before any content can be generated or any application is developed, some basic standards for encoding the language must be developed. These include character set encoding (e.g. Unicode), keyboard layout, key pad layout (e.g. for mobile telephones), collation sequence (to enable applications like databases), terminology translation and locale definition (to enable computer interface in local language). The survey responses are tabulated in Table 1.

This survey is limited to the countries from which representatives attended the PAN Localisation project training (see the “Training” link at www.PANL10n.net). The data was provided by the training participants and has not been independently verified, so the actual situation may vary from the responses received. The data is still representative of the bigger picture for the Asian region.

The responses indicate that the encoding and keyboard layouts are standardised for most languages. This allows basic desktop publishing capability to be developed, and has been achieved through national and international efforts, e.g. by organisations like the Unicode Consortium (www.unicode.org). However, much work needs to be done to define the other standards needed to process the data further. For example, collation sequences for the languages have to be defined to enable applications which sort linguistic data, such as voter lists. Based on the standards, applications may be developed on the Microsoft or Linux platforms, the two most popular end-user desktop operating systems. The survey also tried to determine the level of application support on these two platforms. The questions were divided into two categories of applications: basic applications, which realise the standards and allow basic desktop publishing for the end user, and advanced applications, which assist the user in generating and accessing content in local languages.
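To see why a defined collation sequence matters, consider the following Python sketch. It is a simplified illustration under stated assumptions: URDU_ORDER is only a small subset of the Urdu alphabet chosen for this example, collation_key is a hypothetical helper, and a real implementation would follow a full national or Unicode collation standard.

```python
# Illustrative subset of the Urdu alphabet in dictionary order.
# A real collation standard covers the whole alphabet and more.
URDU_ORDER = [
    "\u0627",  # ALEF
    "\u0628",  # BEH
    "\u067e",  # PEH
    "\u062a",  # TEH
    "\u0679",  # TTEH
    "\u062b",  # THEH
    "\u062c",  # JEEM
]
RANK = {ch: i for i, ch in enumerate(URDU_ORDER)}

def collation_key(word):
    """Map each letter to its rank in URDU_ORDER.

    Letters outside the subset sort after the known ones,
    falling back to code point order among themselves.
    """
    return [(RANK.get(ch, len(RANK)), ord(ch)) for ch in word]

words = [
    "\u062a\u0627\u062c",  # "taj"  (TEH ALEF JEEM)
    "\u067e\u062a\u0627",  # "pata" (PEH TEH ALEF)
    "\u0628\u0627\u067e",  # "baap" (BEH ALEF PEH)
]

by_code_point = sorted(words)                   # baap, taj, pata
by_alphabet = sorted(words, key=collation_key)  # baap, pata, taj
```

Sorting by raw character codes places the word starting with PEH last, because PEH was encoded later than the basic Arabic letters, whereas the alphabet places PEH before TEH. A voter list sorted without a collation standard would therefore appear out of order to its readers.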

The basic applications include utilities which realise the encoding standard (keyboard and fonts), sort and search data (collation and find/replace utilities) and provide basic word processing facilities, such as a spelling checker, a thesaurus and Natural Language Processing (NLP) components (e.g. a word/line break determiner for languages like Lao and Khmer, or bidirectional algorithms for Arabic script based languages like Urdu and Farsi). The responses for the Microsoft platform are tabulated in Table 2 and those for the Linux platform are given in Table 3.
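The need for a bidirectional algorithm can be illustrated with Python's standard unicodedata module, which exposes the direction class that the Unicode standard assigns to each character. This is only a sketch of the input such an algorithm works from, not an implementation of the algorithm itself; the helper name bidi_classes is hypothetical.

```python
import unicodedata

def bidi_classes(text):
    """Return the Unicode bidirectional class of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]

# The word "Urdu" in Arabic script, a space, then Latin letters.
sample = "\u0627\u0631\u062f\u0648 ABC"

print(bidi_classes(sample))
# ['AL', 'AL', 'AL', 'AL', 'WS', 'L', 'L', 'L']
```

Here 'AL' marks right-to-left Arabic letters, 'L' marks left-to-right letters and 'WS' marks whitespace. Text that mixes these classes on one line is what a bidirectional component has to reorder for display.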

As the responses indicate, there is currently more support on the Microsoft platform for the keyboards and fonts needed for basic local language data processing. Linux is catching up as more solutions are developed, but does not provide the same level of support at this time. It should be noted that not all the support indicated for the Microsoft platform is developed by Microsoft; some of it is developed by third parties.

Additional utilities, out of which collation is perhaps most necessary for data processing (NLP is also significant for some languages), are still missing for most of the languages and much work needs to be done to fill this gap. Japanese and Thai are ahead of all other languages surveyed.

To give the literate and illiterate non-English speaking populations of Asia wider access to content, and to speed up content generation, some advanced applications must be developed. Automatic machine translation systems can provide instant access to existing English data on the Internet. Text-to-speech systems can provide access to illiterate populations. Automatic speech recognition can help create local language and culturally meaningful content quickly, and similarly optical character recognition systems can help convert published material into electronic content for exchange. These applications can be instrumental in bridging the digital divide. The status of these applications for Asian languages is given in Table 4 (for the Microsoft platform) and Table 5 (for the Linux platform).

As the survey responses show, although many initiatives are underway for various languages, hardly any significant development has been completed in this area. Only Japanese language applications are currently available. For many of the languages, efforts in these areas have not even started. Research and development in these areas is usually guided by policies. The policies relevant to local language computing are a country's linguistic policies, its ICT policies and, specifically, its localisation policies. The survey also included questions about the existence of such policies. The responses are tabulated in Table 6.

Interestingly, though few countries have a localisation policy, many are working towards developing one. Most countries also have linguistic and IT policies, but these may or may not drive localisation policy. This needs to be investigated further.

As the survey indicates, localisation has not been taken up rigorously or pursued consistently in many Asian countries. To address these issues, the International Development Research Centre (IDRC), through its Pan Asia Networking (PAN) programme, has taken up a regional effort to develop local language computing capacity for Asia, in collaboration with the National University of Computer and Emerging Sciences (NUCES) through its Centre for Research in Urdu Language Processing (CRULP). This effort is called the PAN Localisation project.

PAN Localisation project

The PAN Localisation project focuses on documenting the problems and researching the solutions needed to enable localisation of ICTs. The project is unique in that it will be the first study of its kind to look at the common problems faced by the Asian region and to research a comprehensive solution. It is thus a timely and urgently needed initiative for Asia's underdeveloped populations and will be instrumental in providing equitable access to information in this digitally divided information society.

The core objectives of PAN Localisation project are to research into the following three fundamental dimensions of localisation for Asian languages:

  • To develop sustainable human resource capacity in the Asian region for R and D in local language technology
  • To raise current levels of technological support for Asian languages
  • To advance policy for local language content creation and access across Asia for development

In this project, CRULP is coordinating efforts across Asia among ICT researchers, practitioners, linguists and policy makers from the governmental agencies, universities and private sector of six countries of Asia, including:
  • Bangladesh: BRAC University, working for Bangla
  • Bhutan: Department of IT, Ministry of Information and Communications, working for Dzongkha

 

 

 

 

Table 7: Project Output Matrix for Local Language Software Being Developed Through PAN Localization Project
