July 2007

Language Technology and Document Reading



Language technology, also called Human Language Technology (HLT), has Computational Linguistics (CL) and Speech Technology (ST) at its core. It is closely related to computer science and general linguistics.

Processing a language on a computer requires knowledge of several aspects of that language: its science and art as well as its grammar. From an object-oriented point of view, root words and their usages are important, and handwriting is regarded as one form of language usage.

Natural Language Generation (NLG) is the study of how computer programmes can produce high-quality natural language text from computer-internal representations of information. The field ranges from the entirely theoretical (linguistics, psycholinguistics) to the entirely practical (such as producing the output of computer programmes).

In the past two decades, language technology has produced a “single-sentence generation” capability and a “limited-purpose multi-sentence paragraph planning capability”. These capabilities are numerous and readily available.

Analysing documents with complex layouts, recognising printed text and deciphering running handwriting remain large research areas. The major challenges in handwriting recognition are separating words and lines, segmenting words into characters, recognising words when lexicons are large, and using language models to aid preprocessing and recognition.

Large textual units are understood by combining the meanings of smaller ones. The main goal of linguistic theory is to show how the meaning of larger units arises from the combination of smaller ones.

Analysing and describing handwritten documents yields information that can be used for many purposes and provides links to different areas of study and to past research.

Processing handwritten documents is not only a matter of recognising them and using the information they contain; it can also help children learn to write.

There are many new inventions in this area, including intelligent electronic notebooks, signature verification and other recognition systems for processing written documents.

Handwriting is a complex task that involves emotional, rational, linguistic and neuromuscular functions. These factors should be considered when implementing any “pen-based” system; in particular, control of the pen's movement and perception of the line images are important.

Even in the computer age, paper keeps its appeal: “paper is the most popular medium for sketching, note taking and form filling, because it offers a unique combination of features: light, cheap, reliable, available almost everywhere any time, easy to use, flexible, foldable, pleasing to the eye and to the touch, silent”.

“On-line handwriting recognition” means recognising handwriting as it is recorded with digitising equipment. Such recognition is difficult, and practical systems work under restrictions such as:

Restrictions on the number of writers.

Constraints on the writer: entering characters in boxes or in combs, lifting the pen between characters, observing a certain stroke order, entering strokes with a specific shape.

Constraints on the language: limiting the number of symbols to be recognised, limiting the size of the vocabulary, limiting the syntax and/or the semantics.
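The vocabulary constraint can be illustrated with a short sketch. The article does not say how a recogniser uses its vocabulary, so the approach here (snapping a raw recogniser output to the closest word in a small lexicon by edit distance) is only one plausible assumption, and the function names `edit_distance` and `snap_to_lexicon` are hypothetical.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def snap_to_lexicon(raw: str, lexicon: Iterable[str]) -> str:
    """Return the lexicon word closest to the raw recogniser output.

    Hypothetical helper: with a small vocabulary, a misrecognised string
    can often still be mapped to the intended word.
    """
    return min(lexicon, key=lambda word: edit_distance(raw, word))
```

With a three-word lexicon such as `["agenda", "address", "fax"]`, the misread string `"adress"` maps to `"address"`. The smaller the lexicon, the fewer near neighbours compete, which is why limiting the vocabulary makes recognition easier.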

Research on “on-line handwriting recognition” was largely academic until the beginning of the nineties, but the situation has changed, and in recent years the pen computing industry has grown rapidly. Many companies are working to produce new equipment for reading and writing words.

Converting “pen trajectory data” to pixel images allows documents to be analysed and processed by optical character recognition (OCR) recognisers. Handwriting recognition relies on a number of distinguishing features. To get the best results, the following should be considered:

“Preprocessing operations such as smoothing, de-slanting, de-skewing and de-hooking, and feature extraction operations such as the detection of line orientations, corners, loops and cusps, are easier and faster with the pen trajectory data than on pixel images.

Discrimination between optically ambiguous characters (for example, “j” and “;”) may be facilitated with the pen trajectory information.

Segmentation operations are facilitated by using the pen-lift information, particularly for hand printed characters.

Immediate feed-back is given by the writer whose corrections can be used to further train the recogniser”.
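The points above can be made concrete with a minimal sketch of three operations on pen trajectory data: segmenting strokes at pen lifts, smoothing a stroke, and converting strokes to a pixel image. The sample layout `(x, y, pen_down)` and the function names are assumptions made for illustration, not the interface of any particular digitiser or recogniser.

```python
from typing import List, Tuple

# One digitiser sample: x, y and whether the pen touches the surface.
# This layout is an assumption made for illustration.
Sample = Tuple[float, float, bool]
Point = Tuple[float, float]


def split_strokes(samples: List[Sample]) -> List[List[Point]]:
    """Group consecutive pen-down samples into strokes; a pen lift ends a stroke."""
    strokes: List[List[Point]] = []
    current: List[Point] = []
    for x, y, pen_down in samples:
        if pen_down:
            current.append((x, y))
        elif current:
            strokes.append(current)
            current = []
    if current:
        strokes.append(current)
    return strokes


def smooth(stroke: List[Point], window: int = 3) -> List[Point]:
    """Moving-average smoothing over a centred window, clipped at the stroke ends."""
    half = window // 2
    result: List[Point] = []
    for i in range(len(stroke)):
        neighbours = stroke[max(0, i - half):min(len(stroke), i + half + 1)]
        mean_x = sum(p[0] for p in neighbours) / len(neighbours)
        mean_y = sum(p[1] for p in neighbours) / len(neighbours)
        result.append((mean_x, mean_y))
    return result


def rasterise(strokes: List[List[Point]], width: int, height: int) -> List[List[int]]:
    """Mark the pixel under each sample point (coordinates assumed to be in pixels).

    Only the sample points are marked; no lines are drawn between them.
    Stroke order and pen lifts are lost in the resulting image.
    """
    image = [[0] * width for _ in range(height)]
    for stroke in strokes:
        for x, y in stroke:
            col, row = int(round(x)), int(round(y))
            if 0 <= row < height and 0 <= col < width:
                image[row][col] = 1
    return image
```

The sketch shows why the trajectory is the richer representation. `split_strokes` needs only the pen-lift flag and `smooth` works directly on ordered points, while the output of `rasterise` no longer contains either.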

Recognising cursive handwriting was a dream many years ago, but today it has become a reality. Of course, the recognisers are not yet perfect and still need improvement. Researchers and developers are also working to improve pen computers. The industry is currently focusing on two kinds of products:

“Data acquisition devices for form filling applications requiring only a limited alphabet and allowing very constrained grammars or language models. Users such as commercial agents would be willing to print characters in boxes or combs.

Personal Digital Assistants (PDA) combining agenda, address book and telecommunications facilities (phone, fax and mail). Users would want to use natural unconstrained hand writing, cursive or hand printed.”

Document analysis is mostly document image analysis, which interprets the content of documents; it is concerned with recognising written language in image form. Interpreting old documents to recover their content and organisation, and the knowledge inside them, is one of the concerns of our time.

Localisation World Conference

Localisation World is a conference and networking organisation dedicated to the language and localisation industries. It aims at providing a network for the exchange of high-value information in the language and translation services and technologies market.

Localisation World held its ninth conference at the Berliner Congress Center, Berlin, Germany, on 19-21 June 2007. The theme of the conference was “Local Language First!”. The conference drew participants from corporations, developers and practitioners across the world, and covered a wide range of issues in translation and localisation. Three concurrent tracks offered information and discussions about linguistic assets, with case studies from a wide range of applications. The sessions were geared towards professionals seeking to learn about new tools, methods and business practices in localisation and internationalisation. Opportunities for networking and for discussing shared problems abounded at the social gatherings, meals and breaks. Exhibitors provided information about their products and services to attendees at all levels.
