Summary: Since its initial design stages, Microsoft® Windows NT® has
incorporated international support through the Unicode character encoding, system APIs
that retrieve language-specific information, and resource files that store user interface
(UI) elements in multiple languages. Windows 2000, a world-ready operating system that
supports more than 100 international locales, is the culmination of several years of
progressive improvements in the operating system's international support.
Microsoft Windows 2000 is a fully globalized operating system that will be released in
more than two dozen language editions.
With international support built into the system through the National Language Support API (NLSAPI), the Multilingual API (MLAPI), and Windows resource files, developers will find it easier to create globalized applications that support multilingual data and a multilingual UI without using special tools or multiple editions of the operating system, and without writing complex, specialized code. Using these APIs, developers can create applications that run on any language edition of Windows 2000 and allow for the editing and display of multiple languages.
For detailed information on the NLSAPI, please consult the Microsoft Windows Operating Systems NLSAPI Functional Specification. For detailed information on the MLAPI, please consult the Microsoft Windows 2000 Multilingual Functional Specification.
A Single, Worldwide Binary
All language editions of Windows 2000 are created from the same core code base. In previous editions of Windows NT, Asian and Middle Eastern editions were a superset of the core U.S. and European editions and contained additional APIs to handle more complex text input and layout requirements. In Windows 2000, all APIs are contained in all language editions, making possible scenarios that will be described later in this paper.
In addition, every language edition will ship with the components necessary to support the input, display, and formatting of text in all languages that Windows NT supports. For example, each CD will include at least one font to represent each script supported by the system. (Additional fonts may ship for the primary language of the packaged product.)
The following concepts are key to understanding international support in Windows 2000.
A locale is a set of user preference information related to the user's language and sublanguage. An example of a language is "French," where the sublanguage could be French as spoken in Canada, France, or Switzerland. Locale information includes currency symbol; date, time, and number formatting information; localized days of the week and months of the year; the standard abbreviation for the name of the country; and character encoding information. (For a more complete list see the NLSAPI specification.) Each Windows NT system has a default system locale and one user locale per user, which may be different from the default system locale. Both can be changed through the control panel. Applications can specify a locale on a per-thread basis when calling APIs.
Figure 1. The Regional Settings Properties in Windows 2000 Control Panel
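As a rough illustration of how a language and a sublanguage combine into one identifier, the sketch below packs and unpacks a 16-bit language ID in the layout used by the Win32 MAKELANGID, PRIMARYLANGID, and SUBLANGID macros (primary language in the low 10 bits, sublanguage in the high 6 bits). The Python function names are stand-ins chosen for this example and are not part of the NLSAPI.

```python
# Sketch of the Win32 language ID layout: low 10 bits hold the primary
# language, high 6 bits hold the sublanguage. Function names are illustrative.

LANG_FRENCH = 0x0C
SUBLANG_FRENCH = 0x01            # French as used in France
SUBLANG_FRENCH_CANADIAN = 0x03   # French as used in Canada
SUBLANG_FRENCH_SWISS = 0x04      # French as used in Switzerland


def make_lang_id(primary: int, sub: int) -> int:
    """Combine a primary language and a sublanguage into one language ID."""
    return (sub << 10) | primary


def primary_lang_id(lang_id: int) -> int:
    """Extract the primary language from a language ID."""
    return lang_id & 0x3FF


def sub_lang_id(lang_id: int) -> int:
    """Extract the sublanguage from a language ID."""
    return lang_id >> 10


if __name__ == "__main__":
    for name, sub in [
        ("French (France)", SUBLANG_FRENCH),
        ("French (Canada)", SUBLANG_FRENCH_CANADIAN),
        ("French (Switzerland)", SUBLANG_FRENCH_SWISS),
    ]:
        print(f"{name}: 0x{make_lang_id(LANG_FRENCH, sub):04X}")
```

All three identifiers share the same primary language, which is what lets an application fall back from one variety of French to another when an exact match is missing.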
A character encoding (also called a code page) is a set of numeric values, or code points, that represents a group of alphanumeric characters, punctuation, and symbols. Single-byte character encodings use 8 bits to encode 256 different characters. On Windows, the first 128 characters of all code pages consist of the standard ASCII set of characters. The characters from code point 128 to 255 represent additional characters and vary depending on the set of scripts represented by the character encoding (for a complete listing of character set tables, see Developing International Software for Windows 95 and Windows NT, published by Microsoft Press). Double-byte character encodings on Windows, used for Asian languages, use either 8 or 16 bits to encode each character. Computers exchange information encoded in character encodings and render it on screens using fonts.
Figure 2. Code Page 1256, the Arabic character encoding
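The split between the shared ASCII range and the code-page-specific upper range can be seen with a short experiment. The sketch below uses Python's built-in codecs for code pages 1253 (Greek), 1256 (Arabic), and 1251 (Cyrillic) as stand-ins for the Windows tables; the byte values are chosen arbitrarily for illustration.

```python
# The same two bytes decoded under three single-byte Windows code pages.
# 0x41 lies in the shared ASCII range; 0xE1 lies in the 128-255 range,
# whose meaning depends on the code page.

RAW = bytes([0x41, 0xE1])

CODE_PAGES = {
    "Greek (1253)": "cp1253",
    "Arabic (1256)": "cp1256",
    "Cyrillic (1251)": "cp1251",
}


def decode_all(raw: bytes) -> dict:
    """Decode the same bytes under every code page in CODE_PAGES."""
    return {name: raw.decode(codec) for name, codec in CODE_PAGES.items()}


if __name__ == "__main__":
    for name, text in decode_all(RAW).items():
        code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
        print(f"{name}: {code_points}")
```

The first character is identical everywhere, while the second maps to a different Unicode code point under each code page. This is why data in the upper range is ambiguous unless the code page travels with it.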
Windows NT supports OEM character encodings (those originally designed for MS-DOS®), ANSI character encodings (those introduced with Windows® 3.1) and Unicode. Unicode is a 16-bit character encoding that encompasses most of the scripts in wide computer use today (for more information on Unicode see The Unicode Standard published by the Unicode Consortium or visit http://www.unicode.org/). Windows 2000 uses Unicode as its base character encoding, meaning that all strings passed around internally in the system, including strings in Windows resource (.res) files, are encoded in Unicode. Windows NT also supports ANSI character encodings. Each API that takes a string as a parameter has two entry points: an "A" or ANSI entry point and a "W" or wide-character (Unicode) entry point.
Windows NT supports additional code pages for translating data to and from Unicode, including Macintosh, EBCDIC, and ISO encodings. It also contains translation tables for the UTF-7 and UTF-8 standards, which are commonly used to send Unicode-based data across networks, in particular across the Internet.
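To make the role of the transfer encodings concrete, the following sketch round-trips one multilingual string through UTF-8, UTF-7, and 16-bit Unicode using Python's standard codecs. The sample text and the helper name round_trip are assumptions made for this example; the system's own translation tables are not involved.

```python
# Round-trip a multilingual string through encodings commonly used to move
# Unicode data between systems. Uses only Python's standard codecs.

SAMPLE = "Français Ελληνικά Русский 日本語"


def round_trip(text: str, codec: str) -> bytes:
    """Encode text, check that decoding restores it, and return the bytes."""
    wire = text.encode(codec)
    if wire.decode(codec) != text:
        raise ValueError(f"{codec} did not round-trip the text")
    return wire


if __name__ == "__main__":
    for codec in ("utf-8", "utf-7", "utf-16-le"):
        wire = round_trip(SAMPLE, codec)
        seven_bit = all(b < 0x80 for b in wire)
        print(f"{codec}: {len(wire)} bytes, 7-bit safe: {seven_bit}")
```

UTF-7 produces only 7-bit bytes, which is why it suits transports that cannot carry 8-bit data, while UTF-8 keeps ASCII text unchanged.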
National Language Support
National Language Support in Windows NT consists of a set of system tables that applications can access through the NLSAPI. The NLSAPI retrieves the following types of information:
On Windows 2000, users can install National Language Support for any locale through the control panel (see Figure 1).
A localizable resource is any piece of information in a software program that will change from language to language. Although certain algorithms may change depending on language (for example, spelling or hyphenation), localizable resources are generally UI elements. Examples include menus, dialog boxes, help text, icons, and bitmaps. On Windows, most of these resources are stored in Windows resource files. In text form, Windows resource files have the extension .rc. When compiled, they have the extension .res. With today's tools (see Figure 3), resource files are compiled directly into the application executable. On Windows 2000, all language editions share the same binary code. Only the localizable resources change.
Figure 3. Editing a resource file in Developer Studio
Operating in a multilingual environment introduces technical issues that have traditionally been difficult to address:
Users cannot read a document unless the system and the applications they are running understand the character encoding used to create it. For example, a document created on a Japanese Windows 95 system can be displayed on an English Windows 95 system if that system has Japanese fonts installed; it cannot, however, be easily edited, because English Windows 95 supports neither a Japanese character encoding nor Japanese Input Method Editors (IMEs; programs that convert keystrokes into ideographic characters). In addition, applications have typically extended their file formats for Asian editions, making them unreadable in non-Asian editions of their products. To share data, then, users have been required to run compatible systems and applications, meaning their environments must support matching character encodings, fonts, and file formats.
Creating Multilingual Documents
The limitations of sharing data naturally extend to putting "incompatible" data in a single document. On Windows 3.1 and Windows 95, for example, it is not possible to combine Arabic, Greek, Russian, and Japanese text in a single document without special applications that contain complicated (and often proprietary) code. Windows 95 introduced a limited solution (similar to the Macintosh solution) that involves tagging data with font and character encoding information. Users could create and display multiscript documents that were portable in a rich text format. However, this solution did not enable portable documents stored in a plain-text format (important for communication across the Internet) and required that applications understand and support the font-tagging scheme. It limited multilingual documents to scripts of similar types (European text, for example) and did not support the mixing of complex scripts, such as Arabic, Japanese, and Chinese.
Supporting Multinational User Scenarios
Large organizations, such as banks, universities, or government agencies, often support staff, as well as customers, who speak more than one language. This may require that individuals who speak different languages use the same machine, or that a single individual use one machine to communicate in more than one language. Just as a user of an ATM may want to change the UI language of the ATM to conduct a transaction, PC users may want to change the UI language of either the application they are using or of the entire system without affecting data. Typically, a single PC will be dedicated to a single language: for example, a Japanese system running Japanese applications to handle Japanese data. Although Windows 95 and Windows NT 4.0 both enable applications to change their UI language and system locale settings on the fly, they do not allow users to change the UI language of the entire system. As mentioned above, it is also difficult to create an environment in which a user can run a system in one default language and run applications that support other languages. For example, on existing Windows-based systems it is possible, but not transparent, to run an application with a Greek UI on an English edition of Windows and enter Greek text. It is also difficult to run an application with an English UI on an English system and be able to enter and edit Japanese text.
Creating Multiple Language Editions of an Application
A large part of the reason that multilingual user scenarios have been difficult to set up is that comprehensive multilingual applications have been difficult to create, due to limitations in the operating system. Creating a Japanese-language application, for example, used to require a Japanese edition of Windows, special Japanese editions of programming tools, and a separate Software Development Kit (SDK). Fortunately, it is now possible to use standard tools (the English edition of Visual Basic® or Visual C++®, for example) to create Japanese-language applications. In addition, the SDK has been unified, and it is possible to compile Japanese applications on any language edition of Windows NT, as long as the proper national language tables have been installed. However, it has still been necessary to run Japanese applications on Japanese Windows because non-Japanese language editions of the operating system have not supported additional APIs for Input Method Editors. Thus, testing additional language editions of an application could still require additional installations of the operating system.
In addition, to support different languages, developers often had to customize or add code. For example, supporting Asian languages on Windows 95 required changing pointer arithmetic to handle double-byte character encodings and adding support for Input Method Editors. Supporting Arabic and Hebrew required customizing dialogs and menus with right-to-left controls, and adding code to handle ligatures and other text layout issues. Supporting multilingual documents required tagging data with font and language information. Although system flags, messages, and APIs exist in both Windows 95 and Windows NT 4.0 to handle text input, layout, and UI issues, not all the necessary mechanisms exist in all language editions of the operating systems.
How Windows 2000's Worldwide Binary Addresses These Issues
The unified architecture of Windows 2000's worldwide binary makes it much easier to create scenarios such as multilingual user environments, mixed language networks, and multilingual documents. Several key design decisions form the basis for the global operating system.
Windows NT is based on the Unicode standard
Support for the Unicode Standard was built into the Windows NT operating system from its early stages. The first release of Windows NT used Unicode as the system's base character encoding. Subsequent releases used Unicode as the basis for the file system, the UI, and for network communication. Windows 2000 supports version 2.0 of Unicode. It provides a Unicode-based application environment and includes forward migration tools for existing non-Unicode data (see the section "Windows NT provides a flexible application environment").
Unicode's most important benefit is that it allows for unambiguous plain text representation of data, ending the requirement of tagging text strings with code page information. As a uniformly 16-bit character encoding, it represents Asian languages without requiring the programming tricks necessary to support variable-width character encodings used in Windows 9x. As an industry standard, it simplifies sharing of data in mixed platform environments.
Windows 9x and Windows NT both contain tables for converting text from ANSI character encodings to Unicode and vice versa. Users and developers can add conversion tables for a variety of character encodings, including Macintosh and UNIX character encodings, through the regional settings control panel applet (see Figure 4). Conversion tables make it possible for non-Unicode enabled applications to operate in the Windows NT environment, and Unicode-enabled applications to operate in the Windows 9x environment. Although Windows 9x does not contain native support for Unicode, it supports several wide-character APIs, such as TextOutW.
Figure 4. The Advanced Regional Settings dialog allows users to install code page conversion tables for a variety of standards.
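The effect of a conversion table can be approximated with Python's codecs, which play the role of the installed tables in this sketch. The helpers ansi_to_wide and wide_to_ansi are hypothetical names for this example; they mirror the direction of the conversion only and do not reproduce the signatures of any Windows API.

```python
# Table-driven conversion between an ANSI code page and 16-bit Unicode.
# Python codecs stand in for the conversion tables; helper names are
# hypothetical.

def ansi_to_wide(data: bytes, code_page: str) -> bytes:
    """Convert code-page bytes to little-endian 16-bit Unicode."""
    return data.decode(code_page).encode("utf-16-le")


def wide_to_ansi(wide: bytes, code_page: str) -> bytes:
    """Convert 16-bit Unicode back to code-page bytes.

    Raises UnicodeEncodeError if a character has no slot in the code page.
    """
    return wide.decode("utf-16-le").encode(code_page)


if __name__ == "__main__":
    greek_ansi = "Ελληνικά".encode("cp1253")
    greek_wide = ansi_to_wide(greek_ansi, "cp1253")
    print(len(greek_ansi), len(greek_wide))
    print(wide_to_ansi(greek_wide, "cp1253") == greek_ansi)
```

The conversion toward Unicode always succeeds for valid input, while the reverse direction can fail, since a single code page covers only a small part of the Unicode repertoire.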
Windows NT includes transparent support for multiple languages
Developers can use system APIs to create generic code that will correctly handle data input, storage, and display for a wide range of languages. The National Language Support API (NLSAPI) contains functions for transforming strings, retrieving and manipulating code page information, and retrieving and manipulating locale information. These APIs are listed in Table 1. The NLSAPI functions allow applications to query the system for types of information that can change depending on language, country, or character encoding. For example, LCMapString converts a string to uppercase, lowercase, or to a sort key depending on the language parameter passed to the call. GetCurrencyFormat returns all the information an application needs to format a currency string for a particular country: what the currency symbol is, whether the symbol comes before the numerical amount or after, and so forth. MultiByteToWideChar will convert a string from an ANSI character encoding into the proper Unicode range.
Table 1. NLSAPI functions.
These APIs accept identifiers for languages, locales, or character encodings. Applications can therefore pass the system locale, user, or thread locale to an API, which will return the appropriate information from tables carried by the operating system. If the system or user locale changes, the application behavior will automatically adjust without requiring any code changes or action on the part of the user. Developers can set the locale of a thread before passing it to an API in order to retrieve information about a specific locale. For example, if one section of a document is tagged as German text, an application can set the thread's locale to German before calling GetDateFormat, so that any dates in this section of the document are formatted according to German conventions.
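The German-date example above can be modeled in a few lines. The sketch below keeps a per-thread locale ID and looks up a short-date pattern from a small table. The table contents, the function names, and the default locale are assumptions made for illustration; on a real system the patterns come from the operating system's own locale tables, and GetDateFormat itself is not called here.

```python
# A per-thread locale driving date formatting. The pattern table and all
# function names are illustrative; real patterns live in OS locale tables.
import threading
from datetime import date

SHORT_DATE_PATTERNS = {
    0x0409: "{m}/{d}/{y}",        # English (United States)
    0x0407: "{d:02}.{m:02}.{y}",  # German (Germany)
    0x0411: "{y}/{m:02}/{d:02}",  # Japanese (Japan)
}
DEFAULT_LOCALE = 0x0409           # assumed default for this sketch

_thread_state = threading.local()


def set_thread_locale(lcid: int) -> None:
    """Record the locale for the calling thread only."""
    if lcid not in SHORT_DATE_PATTERNS:
        raise ValueError(f"no data for locale 0x{lcid:04X}")
    _thread_state.lcid = lcid


def get_thread_locale() -> int:
    """Return the calling thread's locale, or the default if unset."""
    return getattr(_thread_state, "lcid", DEFAULT_LOCALE)


def format_short_date(value: date) -> str:
    """Format a date using the calling thread's locale."""
    pattern = SHORT_DATE_PATTERNS[get_thread_locale()]
    return pattern.format(y=value.year, m=value.month, d=value.day)


if __name__ == "__main__":
    day = date(1999, 12, 31)
    for lcid in (0x0409, 0x0407, 0x0411):
        set_thread_locale(lcid)
        print(f"0x{lcid:04X}: {format_short_date(day)}")
```

Because the formatting code never names a language, adding another locale means adding table data rather than changing code, which is the property the paragraph above describes.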
Applications can also create generic code for handling text input and display. Windows 9x and Windows NT allow users to install several keyboard layouts and change them on the fly, for example when creating a multilingual document. The Multilingual API contains functions for changing keyboard layout tables as well as fonts used to display text (see Table 2). It also contains APIs to handle text layout issues, for example, vertical text for Japanese or right-to-left text containing ligatures for Arabic. Applications that use these APIs will contain basic, transparent support for creating mixed-language documents. Supporting complex scripts such as Arabic, Hebrew, and Thai requires using these APIs (see Appendix B for details).
Table 2. Multilingual API functions
Through these APIs, developers can also create applications that can handle text input and display for any number of languages, even if a fully localized UI will not be available for all languages. For example, English-language applications running on Windows 2000 will automatically handle the input of Japanese text as long as the application is based on Unicode. This works on Windows 2000 because all APIs are fully functional in all language editions of the operating system. In the past, IME APIs were either unavailable or simply stubbed on non-Asian editions of Windows NT. Non-Unicode applications can easily handle the input of Japanese text by adding code to trap IME-related window messages.
Windows NT makes it easy to change the language of an application's UI
As mentioned before, traditional Windows applications store localizable resources in a resource (.res) file that is compiled into the application executable. With a resource file editor (such as the one built into Visual Studio) it is possible to create multiple language versions of localizable resources (tagged with language IDs) and compile them into the same .exe. It is also possible to extract a set of resources and replace them with a translated version.
The APIs listed in Table 3 are dependent on language. Several APIs (FindResourceEx, MessageBoxEx, and FormatMessage) accept a language ID as a parameter. Others retrieve the version of the menu, string, or icon that corresponds to the language ID of the calling thread. Since these APIs are language-sensitive, developers can create applications that display the UI in a different language depending on the user's locale ID or some other mechanism (for example, menu choice).
Table 3. APIs for retrieving UI elements
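The idea of resources tagged with language IDs can be sketched with a small lookup table. Everything in the example below is hypothetical: the resource names, the strings, the find_string helper, and in particular the fallback to a default language, which is a design choice made for this sketch rather than a description of how FindResourceEx resolves a request.

```python
# Language-tagged UI strings stored side by side, selected by language ID.
# Resource names, strings, and the fallback rule are hypothetical.

ENGLISH_US = 0x0409
GERMAN = 0x0407
FRENCH = 0x040C

RESOURCES = {
    ("IDS_OPEN", ENGLISH_US): "Open",
    ("IDS_OPEN", GERMAN): "Öffnen",
    ("IDS_OPEN", FRENCH): "Ouvrir",
    ("IDS_EXIT", ENGLISH_US): "Exit",
}


def find_string(resource_id: str, lang_id: int, default_lang: int = ENGLISH_US) -> str:
    """Return the string for the requested language, else the default one."""
    try:
        return RESOURCES[(resource_id, lang_id)]
    except KeyError:
        # Assumed policy: fall back to the default language if untranslated.
        return RESOURCES[(resource_id, default_lang)]


if __name__ == "__main__":
    for lang in (ENGLISH_US, GERMAN, FRENCH):
        print(find_string("IDS_OPEN", lang), "/", find_string("IDS_EXIT", lang))
```

Keeping all translations in one table is what allows a single executable to present its UI in whichever language the caller requests.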
Windows NT provides a flexible application environment
Windows 2000 can imitate the Win32® application environment of any non-Unicode language edition of Windows, for example, any edition of Windows 95. This allows Win32 applications that are not enabled for Unicode to run on any language edition of Windows 2000. For example, a Win32 application that uses Code Page 1253 (Greek) can run on French Windows 2000 with the proper system settings and tables. The major limitation is that multiple language applications cannot run at the same time if they use different character encodings (for example, a Japanese application that expects Code Page 932 and a Russian application that expects Code Page 1251). Windows NT will require the user to reboot the system before changing application environments. Unicode-based applications are not subject to this limitation: running two Unicode-based applications side-by-side does not require resetting the system locale.
The flexible application environment allows users to run localized, non-Unicode applications, but its major benefit may be to application developers, who can now test a myriad of localized applications on a single machine. It is no longer necessary to maintain several machines with different language editions of the operating system for development and testing.
Figure 5. English Windows 2000 running Arabic Word for Windows. The system locale is Arabic, which allows Word to run correctly. The user locale, however, is Japanese. Note the date in the bottom right hand corner, formatted in Japanese.
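The limitation on non-Unicode applications follows from the fact that their data carries no record of its code page. The sketch below, which uses Python codecs in place of the system's tables and a hypothetical helper name, shows Russian text stored in Code Page 1251 being read under the matching and under a mismatched system code page, and then the same text stored as Unicode.

```python
# Why a non-Unicode application depends on the system code page.
# Python codecs stand in for the system tables; the helper is hypothetical.

def read_as_ansi(data: bytes, system_code_page: str) -> str:
    """Interpret raw application bytes under the given system code page."""
    return data.decode(system_code_page, errors="replace")


RUSSIAN = "Привет"
STORED_ANSI = RUSSIAN.encode("cp1251")        # what a Russian ANSI app writes
STORED_UNICODE = RUSSIAN.encode("utf-16-le")  # what a Unicode app writes

if __name__ == "__main__":
    print("system = 1251:", read_as_ansi(STORED_ANSI, "cp1251"))
    print("system = 1253:", read_as_ansi(STORED_ANSI, "cp1253"))
    print("Unicode data: ", STORED_UNICODE.decode("utf-16-le"))
```

Only the first ANSI reading reproduces the original text. The Unicode form decodes the same way regardless of any system setting, which is why two Unicode-based applications can run side by side.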
Sharing the Same Machine with Users Who Speak Different Languages
Today is your first day on the job at a multinational bank in New York City. Your native language is German. Because of space constraints, you have to share a machine with another part-time worker whose native language is Russian. When you arrive at your desk, the Russian worker is finishing her tasks. You notice that the machine is running in Russian: the UI is Russian, and when she types into a dialog box, the text appears in Russian characters. Before she leaves for the day, she logs off.
After she leaves, you log on to the machine. Instead of a Russian UI, however, the system appears with an English UI. When you launch an application and type, you notice that the keyboard behaves just like an English keyboard: characters appear in the English alphabet, not in Cyrillic characters. Any dates you insert into the document appear in English. You call the system administrator and tell him you would prefer a German machine. He tells you to go to the control panel, click on regional settings, and select "Deutsch" in the drop-down list labeled "UI Language" (see Figure 6). You do so and a dialog box appears informing you that the system settings will change the next time you log on. Do you want to log off now? You log off and then log on again. This time your UI appears in German. When you launch your application and type, the keyboard now behaves like a German keyboard. Any dates you insert into the document are formatted according to German conventions.
Figure 6. Setting the UI language for Windows 2000 in the Control Panel
How it works
When the system administrator set up this workstation, he installed a feature called "Multilingual UI" from a special CD that contains language resources and a special administration tool. When he ran the tool for this particular workstation, it told him what the system's base language was and gave him a list of available UI languages on the CD. He then chose to install French, German, Russian, and Spanish UIs. The necessary resource files were copied to the workstation, and the registry was updated to reflect which languages are present on the system. (Note: Installing additional fonts, keyboard layouts, and UI resources will increase system requirements.) Now whenever the user runs the control panel, an additional list box appears beneath the sorting option in the regional settings dialog, giving a choice of UI languages available on the machine.
The system UI is a user property. Different users can set the UI language to different defaults. Administrators, for example, can stipulate that the UI language for the administrator account is always a particular language. Therefore, if you are supporting a network containing machines running in six different European languages and you only speak English, you can administer each machine in English.
Handling Multilingual Data
You work for the European Union as a translator and speak six languages. You want to create a single document that contains translations of a recent meeting in English, French, German, Dutch, and Greek. You open a document that contains your French notes, edit it, and check its spelling. You then click your taskbar to change your "input language" to English. You translate the French text into English. As you type, the keyboard reacts as a French keyboard. When you are ready to begin the German section, you click the taskbar to select your German input language. The keyboard still reacts as a French keyboard. Before you begin the Greek section, you select the Greek input language. The keyboard now reacts as a Greek keyboard, and the text appears in your document in a Greek font. When you are done, you move the cursor to the beginning of the document, and check the spelling in the entire document. You find two minor spelling errors in the English section and one in the Greek section. You print the document and send it for proofing.
How it works
Different countries have different standard keyboard layouts. For example, compared with the U.S. keyboard layout, the French keyboard layout supports additional characters (e.g. for accented letters) and places others in different physical positions (on a French keyboard, for example, z and w are reversed relative to their position on the U.S. keyboard). People who speak different languages may be able to type in different languages, but they generally prefer to use one keyboard layout to enter text for all languages. When a language uses a different script (such as Russian and Greek), however, it is necessary to change keyboard layouts.
Windows stores keyboard layout information in tables that determine which character gets generated when the user presses a particular key on the keyboard hardware. Since the character generation is a software issue, Windows can control which keyboard layout is active for which user and which application at any given time. Users can go to the control panel and create "input locales," best described as a language-keyboard layout pair. For example, the user in the above scenario set up her machine so that any time she typed English, she would be using the French layout. It would also be possible for her to assign a different keyboard layout to each input locale (see Figure 7).
Figure 7. Adding an input locale and assigning a keyboard layout
Using the taskbar indicator (see Figure 8) or a shortcut key combination, she can switch between any of these input locales. When she changes input locales, Windows generates a WM_INPUTLANGCHANGEREQUEST message that applications can accept, reject, or ignore. If an application accepts the message, Windows generates another message that gives the application the locale ID of the new input locale. Applications can use this ID to tag text with a language property, which is useful for operations like spelling or grammar checking. An application may choose to reject the request, for example, if the system for some reason does not contain the proper fonts to display the requested language.
Figure 8. The taskbar indicator for input language
Windows NT stores locale-keyboard layout pairs as part of a user's profile. Different users may assign a different keyboard layout to a particular language. Each user session tracks current input locales by thread; that is, two applications running at the same time may be using different input locales. In addition, an application can change the input locale for the user. For example, if the translator in the above example moved her cursor from English text to Greek text, her application may choose to activate the Greek input locale.
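The translator's setup can be modeled as data. The sketch below represents an input locale as a language and keyboard layout pair and shows an application accepting or rejecting a switch request based on whether it can display the requested language. The class names, the method name, and the font check are assumptions made for this example; it does not use the Windows message loop or any Multilingual API function.

```python
# Input locales as (language, keyboard layout) pairs, with an application
# deciding whether to accept a switch request. All names are hypothetical.
from dataclasses import dataclass
from typing import Optional

ENGLISH_US = 0x0409
GERMAN = 0x0407
GREEK = 0x0408


@dataclass(frozen=True)
class InputLocale:
    lang_id: int   # language used to tag the text being typed
    layout: str    # keyboard layout that generates the characters


class EditorWindow:
    """Models one application thread and its current input locale."""

    def __init__(self, displayable_langs):
        self.displayable_langs = set(displayable_langs)
        self.active: Optional[InputLocale] = None

    def on_input_lang_change_request(self, requested: InputLocale) -> bool:
        """Accept the request only if the language can be displayed."""
        if requested.lang_id not in self.displayable_langs:
            return False
        self.active = requested
        return True


# The translator's profile: English and German typed on the French layout,
# Greek typed on the Greek layout.
PROFILE = [
    InputLocale(ENGLISH_US, "French"),
    InputLocale(GERMAN, "French"),
    InputLocale(GREEK, "Greek"),
]

if __name__ == "__main__":
    editor = EditorWindow(displayable_langs=[ENGLISH_US, GERMAN])
    for locale in PROFILE:
        accepted = editor.on_input_lang_change_request(locale)
        print(f"0x{locale.lang_id:04X} on {locale.layout}: accepted={accepted}")
```

Since each window object holds its own active input locale, two applications in the same session can use different input locales at the same time, as described above.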
Changing the UI Language
You work at an international research firm and are at the library, using the on-line catalog. The previous user ran the search application in Czech and left it running. You do not speak Czech. You right-click on a little globe icon in the corner of the application, and a list of languages pops up. You select Spanish. The application UI redraws and changes to Spanish. You run the application and then close it down when you are done. The next person who runs it sees a Spanish UI.
How it works
Applications can implement a multilingual UI in several different ways. They can base the UI language of the application on the system locale, on the user locale, or on a manually selected default. On the system described above, for example, the application may save information about the locale ID of the most recently selected UI language. The next time it is launched, it can call SetThreadLocale with that language ID so that any APIs that retrieve UI elements from the program files will retrieve elements in the appropriate language.
If the current user would like to change the UI language, they could do so from the application's menus or by using an application-supported hot key combination. This would in turn invoke a command to reset the thread locale. This scenario is useful if a number of people will be using the same machine with the same application running all the time, much like an ATM. If an application does not support menu or keyboard options, it is still possible for the user to change the application UI language by changing the user locale in the regional settings of the Windows NT control panel. If the application contains the proper language resources and retrieves resources based on the user locale, then it will automatically start drawing them in the language of the new user locale. This second type of mechanism is useful if more than one user is sharing the same machine, running the same applications but in separate sessions. When each person logs on, his user locale determines the UI language for the applications. This makes it possible to install one copy of an application with multiple language resources, rather than numerous copies of the same application in different language editions.
Figure 9. Notepad running in both English and Japanese on the same Windows NT Workstation
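The choices described in this section can be combined into a single selection routine. The sketch below tries a saved preference first, then the user locale, then the system locale, and uses the first one for which the application ships resources. The function name, the order of precedence, and the final fallback are assumptions for this example; an application is free to order these sources differently.

```python
# Choosing an application's UI language from several possible sources.
# The precedence order and the fallback are assumptions for this sketch.
from typing import Iterable, Optional

ENGLISH_US = 0x0409
CZECH = 0x0405
SPANISH = 0x0C0A
JAPANESE = 0x0411


def choose_ui_language(
    available: Iterable[int],
    saved: Optional[int] = None,
    user_locale: Optional[int] = None,
    system_locale: Optional[int] = None,
    fallback: int = ENGLISH_US,
) -> int:
    """Return the first candidate language the application has resources for."""
    shipped = set(available)
    for candidate in (saved, user_locale, system_locale):
        if candidate is not None and candidate in shipped:
            return candidate
    return fallback


if __name__ == "__main__":
    shipped = [ENGLISH_US, CZECH, SPANISH]
    # Library catalog: the last user picked Spanish from the globe menu.
    print(hex(choose_ui_language(shipped, saved=SPANISH, user_locale=CZECH)))
    # Shared workstation: no saved choice, so the user locale decides.
    print(hex(choose_ui_language(shipped, user_locale=CZECH)))
    # User locale has no matching resources, so the fallback applies.
    print(hex(choose_ui_language(shipped, user_locale=JAPANESE)))
```

The returned ID would then be used as the thread locale, so that subsequent resource lookups retrieve UI elements in that language.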
Running Applications that Require Different Language Environments
You are a student at a university taking a Japanese class. You are in the language lab, preparing to do your Japanese homework on a machine running English Windows NT Workstation. The teacher has provided an applet written for Japanese Windows 95 that will help you practice your Japanese characters. Following her instructions, you first set your system locale to Japanese in the Control Panel (see Figure 1). Then you reboot and run the application. The system UI remains in English, but the applet works perfectly, allowing you to read and type Japanese characters.
How it works
Since the application was written for Windows 95, it is based on the Shift-JIS character encoding (code page 932), and not Unicode. When the administrator set up the workstation, he installed support for the Japanese language: character tables, keyboard support, fonts, and locale-based information (sorting, date and time formatting, and so forth). When the student sets the system locale to Japanese, Windows NT loads the Shift-JIS character tables and, upon reboot, simulates the Win32 environment for Japanese, which is based on code page 932. The system behaves as if code page 932 is its local character set, even though the system environment is still in English, and Unicode-based applications still run unchanged.
The Japanese-language support includes IME support, which takes advantage of the same input locale-keyboard layout mechanism described earlier in this paper. Input Method Editors contain more intelligence than a simple keyboard layout table, but users can treat IMEs as they would any other input method, assigning a particular IME algorithm to an Asian input locale, and switching among Asian and non-Asian input locales using the taskbar indicator.
The implementation of international support in Windows 98 and Windows 2000 differs. Both operating systems support the NLSAPI and the MLAPI, both handle input locale switching and multilingual fonts, and both will be released in multiple language versions (Windows 98 will ship in a few more languages than Windows 2000). However, key architectural differences mean that Windows 98 will not support multilingual applications to the same degree that Windows NT does.
Since Windows 98 has evolved from the Windows 3.x code base, it does not contain native Unicode support, but instead uses ANSI character encodings. The lack of native Unicode support makes sharing data between machines running different character encodings more difficult. It is still possible to write a Unicode-based application that runs on Windows 98 (Word 97, for example), but with the exception of a small subset of wide-character APIs that Windows 98 supports, Unicode data must be translated before it is sent to system calls. One of the wide-character APIs, TextOutW, allows applications to display Unicode-encoded data. This is the API that Internet Explorer uses, for example, to display Japanese text on an English system.
Windows 98 and Windows NT share a common resource file format. It is therefore possible to create applications that can run on Windows 98 and change UI language. However, Windows 98 does not support multilingual user profiles or thread locales, so some mechanisms for automating the change of an application's UI language do not exist. In addition, Windows 98 does not support the ability to change the UI of the system itself.
Unlike Windows NT, localized editions of Windows 98 do not share a single binary. Asian and Middle Eastern editions are still supersets of the European editions of the system. Input Method Editor support is limited to Asian editions of Windows 98.
The foundation of Microsoft's multilingual platform is the international support contained in the Windows NT operating system. With Windows NT, it is possible to create a solution that supports multiple language data and a multiple language UI without requiring specialized applications or creating incompatibilities for users in different countries. Since it was first released, Windows NT has used Unicode as its base character encoding, which ensures the integrity of multilingual data shared across networks, in e-mail, or in document files. Windows NT contains the font support, the keyboard support, and the APIs to allow for both the display and input of multiple languages (French, Russian, and Greek, for example) in a single document. In addition, the system carries information for formatting dates and currencies and sorting text in more than 100 international locales.
Solutions built using Windows NT and international-aware applications like Microsoft Office and Internet Explorer allow for universal storage of data in the Unicode format (translated to local character encodings when necessary through tables provided by the system). Users of the system can use any language edition of Windows NT, Word, or Internet Explorer to display any document, as long as they have installed the appropriate language support (fonts and locale information) through the control panel. With Windows 2000, users will also be able to enter any language into a document. For example, they could run a Russian word processor on English Windows NT and enter Japanese text. The system offers users the additional flexibility of changing the language of the system's UI or the UI of any application that supports multiple languages. Because Windows NT supports user profiles, users sharing the same machine at different times can log on with different language preferences.
For the latest information on Windows NT Server, check out our World Wide Web site at www.microsoft.com/backoffice/ or the Windows NT Server Forum on the Microsoft Network (GO WORD: MSNTS).
For information on globalizing applications, visit www.microsoft.com/globaldev/default.asp.
You can find more details on software internationalization in the Microsoft Windows Operating Systems NLSAPI Functional Specification and the Microsoft Windows 2000 Multilingual Functional Specification, or in Developing International Software for Windows 95 and Windows NT by Nadine Kano, published by Microsoft Press, ISBN 1-55615-840-8.
Changing Input Language
This appendix presents design principles to consider when developing applications to support complex scripts such as Arabic, Hebrew, Thai, and Indic scripts. We will first go over the properties of these scripts that set them apart from traditional scripts used for written communication, such as the Latin and ideographic scripts. Then we will point out some conventional programming techniques that cause problems in processing complex scripts, and give guidelines on how to avoid these problems in your applications.
A script, as used in this document, is a collection of symbols used for written communication, usually with a common property that justifies their association as a set. For example, the Latin script consists of the uppercase letters A-Z and the lowercase letters a-z. Written English generally contains two scripts: Latin letters and Arabic numerals. Written Japanese can contain up to five scripts: Hiragana, Katakana, ideographs, Arabic numerals, and Latin script. Other examples are Hungarian (extended Latin script), Korean (Hangul and Hanja scripts), and Vietnamese (extended Latin script). These scripts all share the property that they are displayed as discrete glyphs, one per character, one after another, progressing from left to right (or vertically from top to bottom).
A complex script is one in which this assumption of linear layout, from left to right, does not hold. The following are some examples of nonlinear processing required of complex scripts, including example languages:
Remember the good old days, when all characters were ASCII, and there was only one locale ("C")? You could make all kinds of assumptions that simplified programming (for example, that everyone uses the same date format and the same decimal point indicator). You did not sell much software outside North America, either.
Nowadays, most software designers are aware of the need to eliminate assumptions about locale and language from their software to make it acceptable to users in other locales. The message in this section is simply that you probably need to eliminate more assumptions in your software.
Multiple scripts per document
In the past, when Windows did not support mixing of scripts very well, you could get by with a monolingual application, using 8-bit character strings to store text, assuming the same code page throughout your application. However, as explained in the body of this document, Windows 2000 supports multilingual applications, and many of your customers will demand the ability to mix scripts in a single text document.
One approach, of course, is to use multiple 8-bit code pages, enough for each of the scripts you wish to support. This is cumbersome at best, and quite unnecessary. Instead, use Unicode, as explained earlier in this paper.
The second assumption you need to discard is that a given character in a given font always looks the same, and has the same properties. Characters in languages such as Arabic change shape depending on the surrounding characters. Specifically, Arabic characters take one of four forms (initial, medial, final, and stand-alone), depending on where they occur in the text stream. Moreover, adjacent Arabic characters often ligate, meaning they combine together in a single glyph called a ligature.
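To make the dependence on context concrete, the following is a minimal sketch in portable C of how the four forms can be chosen from a character's neighbors. It is an illustration only, not the system's shaping engine: the names JoinType, Form, join_type, and contextual_forms are hypothetical, the classification covers only the basic Arabic letters U+0621 through U+064A, and ligatures and diacritics are ignored.

```c
#include <stddef.h>

/* Joining behavior of a character (a simplified view of Unicode joining types). */
typedef enum { JOIN_NONE, JOIN_RIGHT, JOIN_DUAL } JoinType;

/* Contextual form a letter takes once its neighbors are known. */
typedef enum { FORM_ISOLATED, FORM_INITIAL, FORM_MEDIAL, FORM_FINAL } Form;

/* Hypothetical helper: classify the basic Arabic letters U+0621..U+064A.
   Anything outside that set is treated as non-joining in this sketch. */
static JoinType join_type(unsigned cp)
{
    if (cp < 0x0622 || cp > 0x064A) return JOIN_NONE;   /* also covers U+0621 hamza */
    if (cp >= 0x063B && cp <= 0x063F) return JOIN_NONE; /* not in the basic set */
    switch (cp) {
    case 0x0622: case 0x0623: case 0x0624: case 0x0625:
    case 0x0627: case 0x0629: case 0x062F: case 0x0630:
    case 0x0631: case 0x0632: case 0x0648:
        return JOIN_RIGHT;  /* joins only to the letter that precedes it */
    default:
        return JOIN_DUAL;   /* joins to the letters on both sides */
    }
}

/* Fill out[i] with the form of text[i], given the whole run in logical order. */
void contextual_forms(const unsigned *text, size_t n, Form *out)
{
    for (size_t i = 0; i < n; i++) {
        JoinType cur = join_type(text[i]);
        /* Joins backward only if the previous letter can join forward. */
        int joins_prev = i > 0 && cur != JOIN_NONE &&
                         join_type(text[i - 1]) == JOIN_DUAL;
        /* Joins forward only if this letter is dual-joining and the next joins backward. */
        int joins_next = i + 1 < n && cur == JOIN_DUAL &&
                         join_type(text[i + 1]) != JOIN_NONE;

        if (joins_prev && joins_next) out[i] = FORM_MEDIAL;
        else if (joins_next)          out[i] = FORM_INITIAL;
        else if (joins_prev)          out[i] = FORM_FINAL;
        else                          out[i] = FORM_ISOLATED;
    }
}
```

Note that the form of each letter cannot be decided until the letter after it is known, which is why rendering one character at a time cannot shape text correctly.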
This means you cannot use the old trick of putting out characters one by one, as you get them in the wParam parameter from the WM_CHAR message. If you do, the system cannot do the contextual shaping for you, because when it comes time to render a character, the system does not know what characters precede or follow. It also means that you should not cache character widths and compute line lengths yourself, since the width of a character depends on its context. For example, code that draws each character as it arrives and then advances the x position by that character's width will produce incorrect results when displaying most complex scripts.
Instead, save the characters in a buffer, and put out the entire buffer each time a new character is typed.
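The buffering approach can be sketched as follows. This is a portable illustration under stated assumptions, not a Win32 listing: TextBuffer, DrawRunFn, on_char_typed, DrawLog, and log_run are hypothetical names, and the draw callback stands in for whatever call renders the run (in a Windows program, a single ExtTextOut call over the whole buffer).

```c
#include <stddef.h>

#define MAX_TEXT 256

/* Callback that renders a whole run of text at once. In a Windows program
   this role would be played by one text-output call over the entire buffer;
   here it is a hypothetical stand-in so the sketch stays portable. */
typedef void (*DrawRunFn)(const unsigned short *text, size_t length, void *ctx);

typedef struct {
    unsigned short chars[MAX_TEXT];  /* UTF-16 code units typed so far */
    size_t length;
} TextBuffer;

/* Append one typed character, then redraw the entire buffer so that
   shaping and ordering can take every neighbor into account.
   Returns 0 on success, -1 if the buffer is full. */
int on_char_typed(TextBuffer *buf, unsigned short ch, DrawRunFn draw, void *ctx)
{
    if (buf->length >= MAX_TEXT) return -1;
    buf->chars[buf->length++] = ch;
    draw(buf->chars, buf->length, ctx);
    return 0;
}

/* Example callback for experimentation: records what it was asked to draw. */
typedef struct {
    size_t calls;
    size_t last_length;
} DrawLog;

void log_run(const unsigned short *text, size_t length, void *ctx)
{
    DrawLog *log = (DrawLog *)ctx;
    (void)text;  /* a real callback would render the text here */
    log->calls++;
    log->last_length = length;
}
```

The point of the design is that the renderer always sees the complete run, never a single character in isolation, and the application never computes an x position per character.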
Another assumption is that a character always displays to the right of the characters that precede it in the text. The character-by-character approach described above makes this assumption when it moves the x position to the right after each character is input.
Correctly determining the position of the next character in the stream would require implementing the Unicode algorithm for layout of bidirectional text (BiDi algorithm), which is a major undertaking indeed. Instead, use ExtTextOut on the whole buffer, as described above, and let the system implementation of the BiDi algorithm handle layout.
However, there may be other places where your application assumes left-to-right (LTR) layout, such as the x position passed in the call to ExtTextOut. You can make the reading direction selectable by the user, and set the x value to match.
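One way to choose the x value from a user-selected direction is sketched below. The names ReadingDirection and text_origin_x, and the client_left, client_right, and margin parameters, are hypothetical. The sketch also assumes that for right-to-left text the drawing call anchors the text at its right edge (in Win32 this is typically arranged with SetTextAlign and the TA_RIGHT and TA_RTLREADING flags, which are not shown here).

```c
/* Reading direction chosen by the user. */
typedef enum { DIR_LTR, DIR_RTL } ReadingDirection;

/* Return the x coordinate at which a line of text should be anchored.
   client_left and client_right are the edges of the drawing area, and
   margin is the gap to keep from the edge (all hypothetical parameters).
   LTR text is anchored near the left edge, RTL text near the right edge. */
int text_origin_x(int client_left, int client_right, int margin,
                  ReadingDirection dir)
{
    return dir == DIR_RTL ? client_right - margin : client_left + margin;
}
```

For a drawing area 640 units wide with a margin of 10, this yields 10 for LTR text and 630 for RTL text.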
Complex cursor positioning, highlighting, and selection
Because modern graphical interfaces handle glyphs of various widths, most applications that display a cursor as they put out text take this into account. However, you may find that your software assumes it can move the cursor over one character at a time as the user types the left or right arrow keys. This does not work for Thai and some Indic scripts, some of whose characters may be displayed above, below, or to the left of previous characters. In Thai, for example, if the cursor is positioned after a base consonant, vowel, and tone mark, the cursor should skip back over all three characters when the user types the back arrow.
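The Thai case can be illustrated with a simplified sketch. It treats only the Thai marks that are drawn above or below a base character (U+0E31, U+0E34 through U+0E3A, and U+0E47 through U+0E4E) as part of the preceding character's cluster; it does not handle vowels displayed to the left of the consonant or any other script. The function names are hypothetical, and this is not how the system edit controls are implemented.

```c
#include <stddef.h>

/* Thai characters drawn above or below the preceding base character
   (vowel signs, tone marks, and other nonspacing marks). */
static int is_thai_combining(unsigned short ch)
{
    return ch == 0x0E31 ||
           (ch >= 0x0E34 && ch <= 0x0E3A) ||
           (ch >= 0x0E47 && ch <= 0x0E4E);
}

/* Given a cursor index between characters, return the index the cursor
   should move to when the user presses the left arrow: step back over
   any marks and stop in front of the base character they attach to. */
size_t cursor_left(const unsigned short *text, size_t pos)
{
    if (pos == 0) return 0;
    pos--;
    while (pos > 0 && is_thai_combining(text[pos])) pos--;
    return pos;
}

/* Counterpart for the right arrow: step over the base character and
   every mark that follows it. */
size_t cursor_right(const unsigned short *text, size_t length, size_t pos)
{
    if (pos >= length) return length;
    pos++;
    while (pos < length && is_thai_combining(text[pos])) pos++;
    return pos;
}
```

For the sequence U+0E01 (consonant), U+0E35 (vowel), U+0E49 (tone mark), a cursor at index 3 moves to index 0 on the left arrow, skipping all three characters at once.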
This is just one example of the kinds of problems you can run into when you support direct editing of text. Others include split highlighting and selection when the user drags the mouse over bidirectional text, and improper assumptions about word breaking when you wrap text.
A complete description of how to handle all cases for every script you encounter is beyond the scope of this paper. Suffice it to say that the most convenient way to handle these cases is to leave it up to the system by using an edit control. Both the simple edit control and the rich edit control have been enabled for complex scripts in the Arabic, Hebrew, and Thai versions of Windows NT 4.0, and in Windows 2000 with the appropriate locale support installed.
Summary of Guidelines
Here is a summary of the guidelines to process complex scripts correctly:
© Microsoft Corporation. All rights reserved.
The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.
This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS DOCUMENT.
Microsoft, BackOffice, the BackOffice logo, MS-DOS, Visual Basic, Visual C++, Win32, Windows, and Windows NT are registered trademarks of Microsoft Corporation.
Other product or company names mentioned herein may be the trademarks of their respective owners.