June 2006

Language Observatory in Japan

This article presents an effort made by a consortium of universities and research centres in Asia to address the problem of the 'digital language divide' through the establishment of a World Language

Just as an astronomical observatory observes space for astronomical phenomena, a language observatory observes language phenomena in cyberspace. The mother Language Observatory (LO) in Japan periodically sends software agents, in the form of softbots, into cyberspace. Their purpose is to examine websites and identify their languages and contents, in an attempt to identify language communities in various regions of cyberspace and to report on the current language situation there, which has implications for education.

ICT education and mother tongue
The customised ubiquitous learning model sparks discovery activities that are student-centred and personalised. Personalised education also means that learning is best delivered in the natural language of the student. Although this model is pervasive and the technology is superb, we are still confronted with an age-old problem: the 'digital divide', or 'e-Exclusion'. The digital divide concerns more than direct access to technology; it also concerns the disparity in how different nations use ICT as a tool for social and economic development. Here, however, the focus is on the language-related side of the issue.

Language is an important tool for human communication, and the language now dominating ICT is English. According to the UDHR website, 322 million people speak English as their mother tongue. A 2003 study by O'Neill et al., which analysed random samples of web pages, found a higher proportion of English usage on the web: 72 percent of web pages.

There are certainly many merits in using a single de facto language like English, but studies have shown that, in many cases, instruction in the mother tongue is more beneficial for students with regard to the acquisition of language competencies, achievement in other subject areas, and even the learning of a second language.

According to the Sri Lanka country report by APDIP (2003), fewer than 10 percent of computers in Sri Lanka use Sinhalese and Tamil. The main operations are word processing and publishing, and usage in local languages is sadly insignificant. With such low usage of the mother languages, it is likely that English, given its competitive strength, will dominate and supersede them in Sri Lankan cyberspace.

The latest observatory analysis found 4,332 web servers under the .ac and .edu subdomains of Asian country code Top Level Domains (TLDs). These servers contributed more than one fifth of nearly 10 million text documents. Given such an infostructure, it is important to ensure that there is room for mother languages to be used, for the sake of their very existence.
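
To make the server count above concrete, the following minimal sketch shows one way a hostname could be tested for an academic subdomain under a country code TLD. The function names, the hostnames and the short list of country codes are illustrative assumptions, not the observatory's actual method or data.

```python
# Illustrative subset of Asian country code TLDs (an assumption, not the survey's list).
ASIAN_CCTLDS = {"jp", "lk", "in", "th", "my", "id"}
# Second-level labels that mark academic institutions.
ACADEMIC_LABELS = {"ac", "edu"}


def is_academic_host(hostname: str) -> bool:
    """Return True for names such as www.example.ac.jp or www.example.edu.my."""
    labels = hostname.lower().rstrip(".").split(".")
    if len(labels) < 3:
        # "ac.jp" alone is a subdomain, not a server name under it.
        return False
    return labels[-1] in ASIAN_CCTLDS and labels[-2] in ACADEMIC_LABELS


def count_academic_hosts(hostnames) -> int:
    """Count distinct academic hostnames, ignoring case and trailing dots."""
    unique = {h.lower().rstrip(".") for h in hostnames}
    return sum(1 for h in unique if is_academic_host(h))


if __name__ == "__main__":
    hosts = ["www.example.ac.jp", "WWW.EXAMPLE.AC.JP", "www.example.edu.my", "www.example.com"]
    print(count_academic_hosts(hosts))  # 2
```

A real survey would need the full list of ccTLDs and would have to handle countries that use other labels for academic institutions.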

Language and script diversity
Customised education has to cope with the tremendous diversity of world languages and scripts. The Office of the United Nations High Commissioner for Human Rights (UNHCHR) has translated a text of universal value, the Universal Declaration of Human Rights (UDHR), into as many as 328 different languages, covering the existing national languages. Among these, Chinese has the largest speaking population, at almost a billion people, followed by English, Russian, Arabic, Spanish, Bengali, Hindi, Portuguese, Indonesian and Japanese. The site also provides the estimated speaking population of each language.

From the viewpoint of complexity in localisation, the diversity of scripts is another problematic issue. Here, for the sake of simplicity, all Latin-based scripts, that is, the alphabet and its extensions used for various European languages, Vietnamese, Filipino, etc., are treated as one set. Chinese ideograms, Japanese syllabics and Korean Hangul are treated together as 'Hanzi'. The remaining languages use many kinds of diversified scripts. Among them, the 'Indic scripts' are taken as the third category. This category includes not only Indian language scripts such as Devanagari, Bengali, Tamil and Gujarati, but also four Southeast Asian language scripts: Thai, Lao, Cambodian (Khmer) and Myanmar. Languages based on the Arabic script are treated as one set, and likewise languages using the Cyrillic script.
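
The grouping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it relies on the prefixes of Unicode character names from the standard `unicodedata` module, the Indic set lists only the scripts named in the text, and the function names are hypothetical.

```python
import unicodedata
from collections import Counter

# Prefixes of Unicode character names, grouped into the five sets named in the text.
# Only the scripts mentioned in the article are listed; others fall into "Other".
SCRIPT_SETS = {
    "Latin": ("LATIN",),
    "Hanzi": ("CJK", "HIRAGANA", "KATAKANA", "HANGUL"),
    "Indic": ("DEVANAGARI", "BENGALI", "TAMIL", "GUJARATI",
              "THAI", "LAO", "KHMER", "MYANMAR"),
    "Arabic": ("ARABIC",),
    "Cyrillic": ("CYRILLIC",),
}


def script_set(ch: str) -> str:
    """Map one character to a script set by its Unicode character name."""
    name = unicodedata.name(ch, "")
    for label, prefixes in SCRIPT_SETS.items():
        if name.startswith(prefixes):
            return label
    return "Other"


def dominant_script_set(text: str) -> str:
    """Return the script set used by most letters in the text."""
    counts = Counter(script_set(ch) for ch in text if ch.isalpha())
    return counts.most_common(1)[0][0] if counts else "Other"


if __name__ == "__main__":
    print(dominant_script_set("Universal Declaration of Human Rights"))  # Latin
```

Matching on name prefixes is a shortcut chosen to keep the sketch within the standard library; a production tool would more likely use the Unicode Script property.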

ICT and multilingualism
A visitor to the website of the Office of the United Nations High Commissioner for Human Rights will find more than 300 different language versions of the Universal Declaration of Human Rights (UDHR). Unfortunately, many of the translations, especially those for languages with non-Latin scripts, are posted only as 'GIF' or 'PDF' files and not as encoded text. The table below clearly shows that languages that use Latin scripts are mostly represented in the form of encoded text. Languages that use non-Latin scripts, especially Indic and other scripts, are difficult to represent in encoded form. When a language is not represented in any of the three main forms provided, it is grouped as not available. Moreover, it is necessary to download special fonts to view these scripts properly. This difficult situation can be described as a digital divide among languages, or termed the 'digital language divide'.
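
As a rough illustration of how the forms mentioned above could be told apart, the sketch below inspects the first bytes of a downloaded file. The function name is hypothetical, and treating every file that is neither GIF nor PDF as encoded text is a simplifying assumption.

```python
def classify_form(data: bytes) -> str:
    """Classify a downloaded translation as gif, pdf, encoded text or not available."""
    if not data:
        # Nothing was retrieved for this language.
        return "not available"
    if data.startswith((b"GIF87a", b"GIF89a")):
        # GIF files begin with one of these two signatures.
        return "gif"
    if data.startswith(b"%PDF-"):
        # PDF files begin with "%PDF-" followed by a version number.
        return "pdf"
    # Simplifying assumption: everything else is text in some character encoding.
    return "encoded text"


if __name__ == "__main__":
    print(classify_form(b"%PDF-1.4\n"))  # pdf
```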

From a technical viewpoint, the major reason behind the digital language divide is the lack or non-availability of appropriate character encoding schemes. In internationally recognised directories of encoding schemes, such as the IANA registry of character sets or ISO-IR (the International Register of Coded Character Sets to Be Used with Escape Sequences), no encoding schemes for these languages can be found.

Unicode for a multilingual cyberspace
Internationally recognised character coding standards such as Unicode provide character encoding schemes for 50 writing systems, ranging from English through Kannada to Osmanya. Unicode, in its latest version 4.1.0, covers a vast system of encoding properties. The table below gives the percentage of Unicode-encoded documents found on web servers in Asian TLDs.
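
A percentage of this kind could be estimated along the following lines. This is a minimal sketch, not the observatory's procedure: it assumes that "Unicode encoded" means valid UTF-8, and the function names are hypothetical. Note that plain ASCII also decodes as UTF-8, so a real survey would need to treat ASCII-only pages separately.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if the byte string is valid UTF-8 (plain ASCII also qualifies)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def utf8_percentage(documents) -> float:
    """Percentage of documents whose raw bytes decode as UTF-8."""
    docs = list(documents)
    if not docs:
        return 0.0
    return 100.0 * sum(is_utf8(d) for d in docs) / len(docs)


if __name__ == "__main__":
    sample = [
        "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd".encode("utf-8"),  # Tamil text in UTF-8
        b"\x93\xfa\x96\x7b",                               # Japanese text in Shift_JIS
    ]
    print(utf8_percentage(sample))  # 50.0
```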

Establishment of the language observatory
The Language Observatory (LO) was launched in 2003 in recognition of the importance of monitoring language activities in cyberspace. It operates from the mother Language Observatory in Japan, which periodically releases crawler robots into cyberspace to examine websites and attempt to identify language communities in various regions of cyberspace.

The Language Observatory is planned to provide a means of assessing the usage level of each language in cyberspace, for instance by periodically producing a statistical profile of language, script, and character code usage in cyberspace.
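
Such a profile could be tallied as in the sketch below. The record layout, field names and sample values are hypothetical and stand in for whatever the crawler and language identifier actually report for each page.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class PageRecord(NamedTuple):
    """One crawled page; all fields are assumed outputs of earlier analysis steps."""
    url: str
    language: str
    script: str
    charset: str


def build_profile(records: Iterable[PageRecord]) -> dict:
    """Tally pages by language, script and character encoding."""
    rows = list(records)
    return {
        "pages": len(rows),
        "by_language": Counter(r.language for r in rows),
        "by_script": Counter(r.script for r in rows),
        # Charset names are case-insensitive, so normalise before counting.
        "by_charset": Counter(r.charset.lower() for r in rows),
    }


if __name__ == "__main__":
    sample = [
        PageRecord("http://example.org/a", "Tamil", "Tamil", "UTF-8"),
        PageRecord("http://example.org/b", "Pashto", "Arabic", "utf-8"),
        PageRecord("http://example.org/c", "English", "Latin", "ISO-8859-1"),
    ]
    profile = build_profile(sample)
    print(profile["by_language"]["Pashto"])  # 1
    print(profile["by_charset"]["utf-8"])    # 2
```

Running the same tally on successive crawls would show how the counts for each language and encoding change over time.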

Ideally, the following questions can then be answered: How many different languages are found in the virtual universe? Which languages are missing from the virtual universe? How many web pages are written in any given language, say Pashto? How many web pages are written using the Tamil script? What kinds of character encoding schemes (CESs) are employed to encode a given language, say Berber? How quickly is Unicode replacing the conventional and locally developed encoding schemes on the net? Along with such a survey, the project is expected to develop a proposal to overcome this situation at both a technical level and a policy level.

The information collected from such a study has implications for multilingual ICT education, such as customised ubiquitous learning. With a monitoring body such as the Language Observatory looking at the development of languages through, for example, their encoding systems, a more sophisticated understanding of the language scenario can be realised. Through these efforts, the LO hopes to make the world more aware of its living and dying languages in cyberspace. The LO is not a closed group, and interested parties are most welcome to participate in its activities.



