Using AVS/Express
This appendix describes facilities for building a worldwide application and adding to the standard fonts available for each locale within AVS/Express.
It is important for the productivity of the application developer and end user that visual interfaces are presented in the most natural way. AVS/Express allows developers and end users to work in their own languages. There are two high-level approaches to developing applications with AVS/Express for international markets:
Each of these project mechanisms is supported by an object property: localized projects use object aliases from the "user_name" property; internationalized projects use the "dictionary" property. The sections "Localization" and "Internationalization" describe these features and the associated project lifecycles.
You configure AVS/Express to run in a particular language through the locale model. This uses the concepts of locale, character set, encoding, and font. The sections on "Language Support" and "Locales" develop these ideas with background material and examples. AVS/Express uses aspects of the host platform's operating system and windowing system to provide worldwide language support. You should consult your operating system documentation to find out how to configure and use your platform with the language you require.
The section "Text Processing" presents an overview of strings in the V language, describes how they are exchanged between various components of the AVS/Express environment, and gives details of interfaces and formats relevant to worldwide language support.
This section is background material designed to help you to understand the mechanics of worldwide language support. Information relating specifically to support of these features in AVS/Express is provided in later sections of this chapter and in the Release Notes.
This section describes how languages are represented for computer processing. It starts with a discussion of the two main approaches (local character sets and Unicode) and describes the choice AVS/Express has made. It then summarizes the languages, character sets, and encodings used by AVS/Express.
A language is represented by using one or more character sets to enumerate the abstract symbols of the language. An encoding maps the symbols in a character set to numerical codes. Representations of the symbols, called glyphs, are bound with their encoded values in a font.
There are two approaches to the definition of character sets:
Standard local character sets are widely supported across different platforms. Local character sets allow worldwide language support to be implemented incrementally, by adding font handling and character conversion routines around existing core string processing.
The Unicode Consortium and the International Organization for Standardization (ISO) jointly developed a universal character set, known as Unicode. ISO10646 (1993) is a generalization of the Unicode standard (1991). With Unicode it is possible to develop multilingual applications that do not require language-specific text processing. Unicode is a fixed-width encoding, which means that every character is encoded by two bytes, even for languages such as English that use small alphabets. Unicode is only just becoming widely available. Some operating systems, such as Windows NT, use Unicode as their internal representation for all text information, but even these systems also provide interfaces and font mappings for local character sets.
AVS/Express uses local character sets.
The English alphabet can be encoded with 7 bits per character, while other European languages require 8-bit values:
These are known as Single Byte Character Sets (SBCS). Each single-byte character set is limited to 256 characters.
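As a rough illustration of why a single-byte string is only meaningful together with its character set, the following Python sketch decodes the same byte value under three of the ISO8859 character sets mentioned in this appendix. The codec names are Python's own and are not part of AVS/Express.

```python
# One 8-bit value interpreted under three single-byte character sets.
# The codec names below are Python's, used here only for illustration.
BYTE = bytes([0xE9])
CODECS = ("iso8859_1", "iso8859_5", "iso8859_7")

# Map each character set to the character it assigns to 0xE9.
decoded = {codec: BYTE.decode(codec) for codec in CODECS}

for codec, ch in decoded.items():
    # Print the code point rather than the glyph so the output is plain ASCII.
    print(f"{codec}: 0x{BYTE[0]:02X} -> U+{ord(ch):04X}")

# A 7-bit ASCII value decodes to the same character in all three sets.
ascii_same = {b"A".decode(codec) for codec in CODECS}
```

The 8-bit value maps to a different character in each set, while the 7-bit ASCII value is common to all of them.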
Chinese, Japanese and Korean use thousands of characters, so they require two or more bytes to encode each character. These are Multi-byte Character Sets (MBCS).
Multi-byte encodings are defined such that SBCS and MBCS can coexist in the same string. For example, you may use an ASCII English abbreviation in a Japanese sentence. They also have the important property that standard string processing for ASCII encoded English is valid in every locale.
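The coexistence of single-byte and multi-byte text can be sketched in Python for the EUC case. The sample string and the use of Python's euc_jp codec are assumptions made for illustration; the sketch only shows that ASCII bytes keep their 7-bit values while every byte of a multi-byte character has its most significant bit set, so byte-level ASCII processing such as splitting on spaces is unaffected.

```python
# ASCII text mixed with three kanji (written as escapes to keep this file ASCII).
text = "AVS/Express \u65e5\u672c\u8a9e v1"
encoded = text.encode("euc_jp")

# In EUC, ASCII characters stay below 0x80 and every byte of a
# multi-byte character is 0x80 or above.
ascii_bytes = bytes(b for b in encoded if b < 0x80)
multi_bytes = bytes(b for b in encoded if b >= 0x80)

# Ordinary byte-oriented ASCII processing still works on the mixed string:
# no byte of a multi-byte character can be mistaken for a space.
words = encoded.split(b" ")

print(len(encoded), len(ascii_bytes), len(multi_bytes))
```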
There are two basic approaches to MBCS encoding:
Chinese characters (hanzi) are used in Korean (hanja) and Japanese (kanji). Korean phonetic symbols are called hangul. Japanese syllabic characters are called kana; there are two forms: the cursive hiragana and the more angular katakana.
Chinese, Korean and Japanese languages have national and industry standard character sets. For more details, see Supported Locales on page A-11.
Locales are distinguished by the number of bytes per character in their principal character set: single-byte locales, based on European or Middle Eastern languages; and multi-byte locales, based on the Chinese, Japanese or Korean languages.
The combined factors of language and culture which apply in a particular territory are grouped together in a locale. There are many considerations within a locale, such as monetary units, date and time, personal name order, collation sequence and number format, but the most important is the language.
Locales are often defined for territories that share the same language. For example, French is an official language of Belgium, Luxembourg, Switzerland, Canada, various countries in West Africa, and parts of the Caribbean, as well as in France itself. Territories that share a language often have cultural differences that are reflected in the locale. For example, in Britain the date is written day/month/year but in the U.S.A. it is written month/day/year.
AVS/Express takes the current locale from the system environment at initialization time. The locale remains current throughout the AVS/Express session. The current locale is not represented by an AVS/Express object. The exact method for specifying the locale is system dependent. For more details, see Initialization on page A-7.
The default locale for AVS/Express is known as the "C" locale, and the default language is English. When AVS/Express is used in a non-default locale, it is said to use a local language. Text strings in this language will be referred to as local text or local strings.
AVS/Express supports input, display, storage and processing of local languages, but does not adapt to other factors in the locale, such as collation sequence.
There are three possible levels of support for a locale in AVS/Express:
This section describes initialization for a UNIX platform running the X/Motif window system.
The system locale is set using the LANG environment variable. The general format for LANG is:
where clauses in square brackets, '[ ]', are optional. Each platform has its own set of values for the fields in the locale name. See your platform release notes to find the value of LANG appropriate to your system in your locale. The default locale is called "C", which implies English language. The codeset can be a character set, an encoding, or a name which implies both. Modifiers adjust certain details of the locale, such as choosing between various collation sequences or input methods.
For example, this is a valid Chinese locale for DEC OSF/1:
The language is zh, which stands for zhong-guo-hua, meaning Chinese. The territory is CN, for the People's Republic of China. The codeset is dechanzi, a DEC-specific group of character sets for simplified Chinese characters; pinyin is a collation order based on the romanized Pinyin transliteration of Chinese words.
The AVS/Express locale is set during initialization and remains current for the rest of the session. AVS/Express uses a simplified locale name derived from the LANG environment variable. This provides a common naming convention across platforms. Optional modifiers do not affect the operation of AVS/Express, so they are ignored. To find the AVS/Express locale, the LANG value is truncated at the first '.' or '@', and the resulting name is looked up in a list of aliases within the AVS/Express locale database. When a match is found, the simplified name for that locale is used as the AVS/Express locale. The simplified format has two-letter abbreviations for both language and territory, separated by an underscore:
For example, ja_JP is the simplified name of the AVS/Express locale for Japanese. It is derived from platform-specific LANG variables such as: ja, japanese, ja_JP.EUC, and ja_JP.deckanji.
The default locale is an exception to this format rule; it is just called "C".
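The derivation of the simplified locale name can be sketched as follows. This is a minimal Python illustration, not the AVS/Express implementation: the function name simplify_locale and the alias table are hypothetical, and the table contains only the Japanese aliases quoted above plus "C". The real locale database is larger and is not reproduced here.

```python
import re

# Hypothetical stand-in for the AVS/Express locale database.
# Only the Japanese aliases named in the text are listed.
LOCALE_ALIASES = {
    "ja": "ja_JP",
    "japanese": "ja_JP",
    "ja_JP": "ja_JP",
    "C": "C",
}


def simplify_locale(lang_value, aliases=LOCALE_ALIASES):
    """Truncate LANG at the first '.' or '@' and look the result up.

    Returns the simplified locale name, or None when the truncated name
    is not in the alias table (the caller decides what to do then).
    """
    base = re.split(r"[.@]", lang_value, maxsplit=1)[0]
    return aliases.get(base)


for value in ("ja", "japanese", "ja_JP.EUC", "ja_JP.deckanji", "C"):
    print(value, "->", simplify_locale(value))
```

Returning None for an unknown name is a choice made for this sketch; how AVS/Express treats unrecognized locales is described in the sections on the C locale.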
AVS/Express uses the locale in two ways:
European languages usually have direct input methods from local keyboards, perhaps using shifted key sequences. Multi-byte character sets, however, require more complex methods. A separate application mediates between keyboard input and the target text widget. On UNIX platforms this application is called a Front End Processor (FEP), and on Windows NT it is called an Input Method Editor (IME). Only when the input interaction is finished will the FEP/IME send a local string to the AVS/Express application. Input methods determine where the raw keyboard input appears on the screen and how pre-edit operations are performed. Each FEP/IME supports different input methods.
Configuring an FEP/IME is platform dependent. See the window system release notes for your platform and your locale.
AVS/Express accepts string input in EUC, 7-bit (JIS) and Shift-JIS encodings for the relevant multi-byte locales. There is no configuration required; all of the encodings can be used in a single session of AVS/Express. Separate strings can have different encodings, but the encoding must be consistent throughout any individual string. There are some additional technical restrictions:
AVS/Express has an output encoding type which determines how local language strings are written. The output encoding is determined by the optional codeset field of the LANG environment variable. This value can be overridden by an independent environment variable, XP_MBCS_ENCODING.
Each supported encoding has a list of recognized values for the LANG codeset and the XP_MBCS_ENCODING environment variable:
The default output encoding, EUC, is used when the LANG codeset and XP_MBCS_ENCODING are unset or unrecognized.
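The precedence between XP_MBCS_ENCODING, the LANG codeset, and the EUC default can be sketched in Python. The table of recognized values below is hypothetical and stands in for the real list; the function names are also invented for this illustration. The sketch additionally assumes that an unrecognized XP_MBCS_ENCODING falls through to the LANG codeset, which the text does not state explicitly.

```python
import os

# Hypothetical recognized values, keyed in lower case. The real
# AVS/Express list of recognized codeset names is not reproduced here.
RECOGNIZED = {"euc": "EUC", "eucjp": "EUC", "jis": "JIS", "sjis": "SJIS"}


def lang_codeset(lang_value):
    """Return the optional codeset field of a LANG value, or None."""
    if "." not in lang_value:
        return None
    codeset = lang_value.split(".", 1)[1]
    # Drop an optional @modifier that follows the codeset.
    return codeset.split("@", 1)[0] or None


def output_encoding(env, recognized=RECOGNIZED, default="EUC"):
    """Pick the output encoding: XP_MBCS_ENCODING, then LANG codeset, then EUC."""
    candidates = (env.get("XP_MBCS_ENCODING"), lang_codeset(env.get("LANG", "")))
    for candidate in candidates:
        if candidate and candidate.lower() in recognized:
            return recognized[candidate.lower()]
    return default


# Report the choice for the current environment.
print(output_encoding(dict(os.environ)))
```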
If the locale for the LANG variable cannot be set on the system, AVS/Express defaults to using the C locale, and issues this message:
This means that your system does not have the correct configuration of Motif, X or C libraries to support the requested locale. Consult your operating system release notes for this locale.
If the LANG codeset or XP_MBCS_ENCODING is unrecognized, AVS/Express prints a warning message:
For the list of recognized values, see Input Encoding on page A-8.
There are several runtime errors that can be written by AVS/Express when parsing multi-byte text in various encodings. These relate to corrupted strings: 8-bit values in a 7-bit encoding; escape sequences in an 8-bit encoding; unrecognized 7-bit escape sequences, and so forth. AVS/Express does not test every byte value for validity within the current character set, so it is possible to produce unintelligible text without any error message.
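Two of the corruption cases named above can be illustrated with a small Python check. This is a sketch under stated assumptions, not the AVS/Express parser: the function name and message strings are hypothetical, and, like AVS/Express, the check does not test whether each byte is valid within the character set.

```python
ESC = 0x1B  # the ASCII escape character that introduces 7-bit escape sequences


def find_problems(data, seven_bit):
    """Return a list of problems found in an encoded multi-byte string.

    seven_bit=True  : treat data as a 7-bit (escape sequence) encoding.
    seven_bit=False : treat data as an 8-bit encoding such as EUC.
    """
    problems = []
    if seven_bit and any(b >= 0x80 for b in data):
        problems.append("8-bit value in a 7-bit encoding")
    if not seven_bit and ESC in data:
        problems.append("escape sequence in an 8-bit encoding")
    return problems


# A well-formed 7-bit string, a stray 8-bit pair, and an 8-bit string
# that contains an escape sequence.
print(find_problems(b"\x1b$@F|K\\8l\x1b(J", seven_bit=True))
print(find_problems(b"\xc6\xfc", seven_bit=True))
print(find_problems(b"\xc6\xfc\x1b(J", seven_bit=False))
```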
For more information about the LANG variable and the locales used for your session of AVS/Express, set the environment variable XP_LOCALE_DEBUG before running AVS/Express. This will force the LANG variable, system locale, AVS/Express locale and AVS/Express language name to be printed out. Here are some sample results:
Note that the system locale may be different from the LANG variable.
If the XP_DEBUG_LOCALE environment variable is set and the locale is a multi-byte locale, useful information about the encoding variables is printed to the AVS/Express terminal. For example, if the LANG variable is ja_JP.eucJP, these are examples of possible encoding information:
Notice that the XP_MBCS_ENCODING environment variable takes precedence over the LANG codeset.
The C locale is the default. The language used in the C locale is English. AVS/Express loads the default font for the ISO8859-1 character set.
There are three situations when AVS/Express uses the C locale:
For example, suppose VE is the territory code for Venezuela. You set the LANG variable to es_VE for Spanish language in Venezuela and your system accepts this value. AVS/Express will load an ISO8859-1 font. It will look for this dictionary pathname under the AVS/Express install directory, or another project directory in $XP_PATH:
You can either create a real subdirectory with that name to contain Spanish translations unique to Venezuela, or just make it a link to es_ES to find generic Spanish dictionaries:
This default mechanism allows AVS/Express to run in unrecognized locales based on Western European languages (ISO8859-1 character set).
These locales use the ISO8859-1 character set. The system locale is recognized by AVS/Express if it matches the AVS/Express locale name, its language name, or one of a list of other aliases. Codesets and modifiers are ignored. The supported locales are:
AVS/Express recognizes these Eastern European locales:
Optional LANG codesets and modifiers are ignored. AVS/Express loads a default font for the ISO8859-2 character set.
AVS/Express recognizes these additional single-byte locales:
Optional LANG codesets and modifiers are ignored. AVS/Express loads a default font for these character sets: ISO8859-5 for Russian; ISO8859-7 for Greek; and ISO8859-9 for Turkish.
AVS/Express recognizes these Japanese locales:
Optional LANG codesets and modifiers are ignored when determining the AVS/Express locale.
The AVS/Express Japanese locale loads default fonts for these character sets:
The choice between ISO8859-1 and JIS X 0201 is left to the platform window system. Usually it will choose a JIS Roman font for single-byte text. The following character sets are not supported:
AVS/Express supports all three input and output encodings: EUC, JIS, Shift-JIS. An optional LANG codeset is used to set the AVS/Express output encoding.
These JIS escape sequences are recognized in input:
The escape sequences written on output are:
In AVS/Express, Japanese text is displayed from left to right, in rows from top to bottom, the same as English.
AVS/Express recognizes these Korean locales:
Optional LANG codesets and modifiers are ignored when determining the AVS/Express locale.
The AVS/Express Korean locale loads default fonts for these character sets:
The AVS/Express Korean locale supports EUC and 7-bit encodings for input and output. There is no Shift-JIS encoding for Korean. The LANG codeset is used to determine the AVS/Express output encoding.
These 7-bit escape sequences are recognized in input and written in output:
In AVS/Express, Korean text is displayed from left to right, in rows from top to bottom, the same as English.
North Korea has abolished the use of borrowed Chinese characters (hanja); they are passing out of use in South Korea.
In 1956 the People's Republic of China (PRC) simplified the traditional Chinese characters in an effort to improve literacy. The traditional forms are still widely used outside the PRC: for Chinese in Taiwan, Hong Kong and Singapore; for Japanese in Japan (kanji); and for Korean in South Korea (hanja).
AVS/Express recognizes these Simplified Chinese locales:
Optional LANG modifiers are ignored when determining the AVS/Express locale.
The codeset is significant in determining the locale for Hong Kong. If the territory name is HK and the codeset is either absent, or one of a recognized set of simplified codeset aliases, then AVS/Express selects the Simplified Chinese locale. The recognized simplified codesets for Hong Kong are:
The AVS/Express Simplified Chinese locale loads default fonts for these character sets:
It is not an error if a default font is not found for GB Roman. If fonts are found for both ISO8859-1 and GB 1988-1980, then the choice of single-byte character set is left to the window system. Usually it will choose a GB Roman font when available.
The AVS/Express Simplified Chinese locale supports EUC and 7-bit encodings for input and output. There is no Shift-JIS encoding for Simplified Chinese. An optional LANG codeset is used to determine the AVS/Express output encoding.
These 7-bit escape sequences are recognized in input:
These 7-bit escape sequences are written in output:
In AVS/Express, Simplified Chinese text is displayed from left to right, in rows from top to bottom, the same as English.
AVS/Express enters, displays, and writes text in many ways. You must consider the following for worldwide language support in your application:
The V language is based on ASCII characters and the English language. Many components of the V and VCP streams will not change across locales.
Three pathways are not supported for international use:
The remaining pathways are supported for enabled locales.
Local language input to the User Interface and Network Editor is managed by the windowing system. AVS/Express expects to receive properly formed local strings from dialog and typein widgets, possibly via an FEP/IME.
Text display for the User Interface, Network Editor, 2D Graphics Display, and 3D software renderer is accomplished using the facilities of the underlying window system. Local language titles are rendered in window decoration by the local window manager.
The OpenGL renderer does support international 3D text on UNIX platforms. It borrows X Window fonts and renders 3D text as Z-buffered bitmapped images.
The Object Manager can read local language strings from V files, VCP terminal and dictionaries. In multi-byte locales, input and output can be in any appropriate encoding: EUC, 7-bit (JIS) and Shift-JIS (Microsoft Kanji). See Locales on page A-6 for more details.
The Object Manager is the hub of string processing in AVS/Express; most of the enabled pathways for local language strings radiate from the Object Manager. The next section explains how strings are defined in V and manipulated within the Object Manager, concentrating on those aspects important for worldwide language support. In a following section, the interfaces for writing V output are described.
There are three basic text items within the Object Manager:
The AVS/Express default language is English. The V language uses printable ASCII for its syntax, including all keywords, delimiters, and object basenames. V string literals are enclosed in double quotes and can contain characters that are not printable ASCII.
Consider this V fragment declaring string and cmethod objects:
The object basenames are message and update; the string object value is initialized to "Connect two objects"; the cmethod object has the src_file property set with string value "update.c".
Since object basenames are part of the V language syntax, they must use the ASCII character set according to rules for identifiers in V. Properties are implemented as string objects within the Object Manager, so the behavior for strings applies to properties as well. Filenames can occur in properties or strings; they can be local strings when the host file system supports local pathnames.
Some property strings are taken from a small set of predefined string values. These enumerated string values should not be translated or set with local language strings. For example, the property NEdisplayMode can take only the string values "NEopened", "NEclosed" or "NEmaximized".
String objects can get their value from several sources:
String literals are lists of bytes between double quotes. The bytes can take any value, so they can represent characters from any character set.
String values can be set directly with characters, or encoded in the ANSI "C" hexadecimal format. For example, <ESC> is a non-printable ASCII character whose value is decimal 27, hexadecimal 0x1b. The escape character looks like this in ANSI "C" hex format: \x1b
Successive bytes can be concatenated with this representation. For example, the Japanese EUC encoding uses a pair of bytes for each character and both bytes have their most significant bit set. A Japanese kanji string object for "nihongo", which means "Japanese", could be initialized in hex format:
The text could be entered explicitly in a V file with a Japanese editor or at the VCP prompt in a Japanese terminal running AVS/Express:
In either case, this is how the string object would appear in an application workspace of the Network Editor, opened, ready for editing the string value:
There is a restriction on hexadecimal format for the 7-bit (JIS) encoding in multi-byte locales: hexadecimal format cannot be used in multi-byte substrings or in escape sequences. For example, the JIS encoding for "Japanese" has two escape sequences: <ESC>$@ and <ESC>(J. The hexadecimal format for <ESC> is \x1b, and the kanji text, F|K\8l, contains a backslash. This combination cannot be parsed correctly if hex format is used for the <ESC> characters. The raw byte value must be used in the string. This is invalid:
The function that writes V files is:
All V syntax is printable ASCII. The default behavior for writing string literals is to use ANSI "C" hexadecimal format for all characters that are not printable ASCII. If the V file is read back in to AVS/Express the strings will be restored to their original form, but if the V file is viewed in a local text editor, the hexadecimal representation will be displayed instead of the original character. This will make all MBCS strings illegible and will corrupt accented characters in European language strings.
If you are displaying in a Japanese terminal that uses JIS Roman as its single-byte character set, not only will the strings be displayed in hexadecimal format, but the ASCII backslash will be displayed as a JIS Roman Yen symbol, making the output even more obscure.
To save V strings with raw byte values for characters that are not printable ASCII, pass the flag OM_SAVE_8BIT in the mode argument of the OMsave_obj function. This will allow local language strings to be saved intact, so that they can be viewed, and possibly changed, in a local text editor.
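The difference between the default hexadecimal output and raw output can be sketched in Python. The functions below are hypothetical and do not call or reproduce OMsave_obj; the save_8bit argument merely plays the role of the OM_SAVE_8BIT flag. The sketch assumes exactly two hexadecimal digits per escape and input that contains no double quote or backslash. The sample bytes are the EUC encoding of the three kanji for "nihongo".

```python
import re


def to_hex_format(raw, save_8bit=False):
    """Render string bytes the way a V writer might (illustrative only).

    Default: every byte that is not printable ASCII becomes \\xNN.
    save_8bit=True: bytes are written raw.
    """
    if save_8bit:
        return bytes(raw)
    out = bytearray()
    for b in raw:
        if 0x20 <= b <= 0x7E:
            out.append(b)            # printable ASCII is written as-is
        else:
            out += b"\\x%02x" % b    # everything else in hexadecimal format
    return bytes(out)


def from_hex_format(body):
    """Restore the original bytes from the hexadecimal format."""
    return re.sub(
        rb"\\x([0-9a-fA-F]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        body,
    )


nihongo_euc = b"\xc6\xfc\xcb\xdc\xb8\xec"
print(to_hex_format(nihongo_euc).decode("ascii"))
```

Reading the hexadecimal form back restores the original bytes, but only the raw form is legible in a local text editor.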
Commands entered at the VCP, such as $save or $print, which result in V being written to a file or to the VCP terminal, use the OM_SAVE_8BIT flag. If you are running AVS/Express in a local terminal and your V code contains local string literals, then the V output from $print will display legible local strings.
In the Network Editor, the pulldown menu options Save Application and Save Objects will write V files with the OM_SAVE_8BIT flag set on the call to OMsave_obj. They are safe to use with local strings in European and Asian locales.
The encoding used for V output is determined by the global MBCS output encoding. See Output Encoding on page A-9.
String objects and object properties can hold local strings. Raw local string values are displayed on components of the User Interface. Object basenames must be in ASCII English, but they can be assigned local aliases with the "user_name" property. The object username is displayed on the object's icon in the Network Editor. You can develop, save, and run applications in your local language. These applications can be used only in the same AVS/Express locale as the original development. Porting this local application to another language requires a new set of local V files, or a revision for internationalization.
A username is a display alias for an object. The username is used in preference to the basename for displaying and interacting with the object in the Network Editor. The username is a property, so it is saved and restored with the object definition in V. You can define a username for any type of object; for example:
The Object Manager will still use the basename to refer to the object, so the username is not restricted by the usual rules governing object identifiers: it does not have to be unique; it can contain punctuation and delimiters; it can start with a numeral (as in the example); and it does not even have to be in ASCII. The username is just a string, so it can have any format allowed for string literals in V, including ANSI "C" hexadecimal format, extended ASCII or multi-byte characters.
The name alias applies only in the Network Editor. The VCP interface will still use the basename. Navigation and reference at the VCP prompt require the basename, and the output from inquiries such as the $list command will return the basename.
The username itself is not part of the worldwide language support mechanism, but it will be translated just like any other string. It is not advisable to use translation on a username that is already in a local language. The main problem with this double-translation is for Asian languages where encodings must agree in the string and the dictionary key. For more information about dictionaries and translation strategies, see Internationalization on page A-26.
The Object Manager provides functions to access the username property:
They are wrappers for the underlying property inquiry routine:
The username is displayed on the object as it appears in the Network Editor: on the icon in a library palette, on the icon in the workspace, or as the title of an open or maximized object in the workspace.
You can set the user_name property from the Property Editor in the object icon's popup menu.
The object Rename operation changes the basename, not the username. The renamed object cannot have a local language basename.
A project is localized when it has locale-specific V files. These are files containing local language strings in object username properties or string object values. In general, it is not possible to internationalize these strings using the dictionary mechanism, so the V files must be edited by hand to port the application to a new locale. This is time consuming and error prone, and it creates a shadow set of V files for development, distribution and maintenance. This is not the recommended method of implementing internationalized applications. An application should be localized only when it is not intended to be run in another locale.
The benefit of localization is that development can take place in the local language. You can access all of the Network Editor and User Interface features in your local language, including text typeins and dialogs. There are two important development processes that cause local language strings to be written in the V files for the application, either by editing V files directly with a local text editor, or through the Network Editor's visual programming interface:
Other transient actions relating to the use of AVS/Express do not write V files and so cannot tie the application to the current locale. The localized V files will be written at any subsequent "Save Application" or "Save Objects". The locale name is not saved with the application. It is your responsibility to run a localized application in the correct locale.
These examples of localized objects are given for the Japanese locale.
The example object is the same as that used for Internationalized examples in the previous section: an integer called kanji initialized to 1. You localize this object basename by adding a translation in the username property. This can be added directly in V, either in a V file or at the VCP prompt, with local username in ANSI "C" hexadecimal format:
or explicitly in Japanese, using a Japanese editor on a V file, or at the VCP prompt when running AVS/Express in a Japanese terminal:
The username can also be added using the Properties Editor:
Note that the dialog user interface has been translated by the AVS/Express dictionaries.
The kanji object would now appear like this in the Network Editor: