Designing Enterprise Applications
with the J2EE™ Platform, Second Edition



10.1 Internationalization Concepts and Terminology

Internationalization terminology is commonly used inconsistently, even within the internationalization field. This section presents definitions of common internationalization terms as they are used in the rest of the chapter. For more detail and precision, see Unicode Technical Report 17 (UTR-17; see the reference listed in Section 10.9 on page 345).


10.1.1 Internationalization, Localization, and Locale

The set of political, cultural, and region-specific elements represented in an application is called a locale. Applications should customize data presentation to each user's locale. Internationalization, also known as "I18n," is the process of separating locale dependencies from an application's source code. Examples of locale dependencies include messages and user interface labels, character set, encoding, and currency and time formats. Localization (also called "L10n") is the process of adapting an internationalized application to a specific locale. An application must first be internationalized before it can be localized. Internationalization and localization make a J2EE application available to a global audience.

Class java.util.Locale represents a locale in the Java 2 Platform, Standard Edition (J2SE). A Locale object is an identifier in a program that determines such factors as how numbers, currency, and time should be displayed, and the human language used to present data in a user interface.
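As a rough illustration of how a Locale object steers presentation, the following sketch formats one number for several locales with java.text.NumberFormat. The class name LocaleFormatSketch and the sample value are assumptions made for this example only.

```java
import java.text.NumberFormat;
import java.util.Locale;

final class LocaleFormatSketch {

    // Formats a number using the grouping and decimal conventions of the given locale.
    static String formatNumber(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    public static void main(String[] args) {
        double amount = 1234567.891; // hypothetical sample value
        Locale[] locales = { Locale.US, Locale.GERMANY, Locale.FRANCE };
        for (Locale locale : locales) {
            // The same value is presented differently for each locale.
            System.out.println(locale + " (" + locale.getDisplayLanguage(Locale.ENGLISH) + "): "
                    + formatNumber(amount, locale));
        }
    }
}
```

The program itself never hard-codes a separator character; the Locale object alone selects the presentation.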

An internationalized J2EE application does not assume a single locale. Requests from clients arrive with an associated locale, which implies a locale for the response. A J2EE application often serves requests for many locales simultaneously. Determining the request locale and enforcing an appropriate response locale are important issues covered in Section 10.3.1 on page 321.
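One possible way to derive a response locale from a request is sketched below. It assumes the client's preferences arrive as an HTTP Accept-Language style string, and it relies on Locale.LanguageRange and Locale.lookup, which were added to the platform later than the J2SE release this chapter describes. The supported-locale list, the fallback, and the class name are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class LocaleNegotiationSketch {

    // Hypothetical set of locales for which the application has been localized.
    static final List<Locale> SUPPORTED =
            Arrays.asList(Locale.ENGLISH, Locale.FRENCH, Locale.GERMAN);

    // Hypothetical locale used when nothing the client asked for is supported.
    static final Locale FALLBACK = Locale.ENGLISH;

    // Picks the best supported locale for a language-preference string
    // such as "fr-CH, fr;q=0.9, en;q=0.8".
    static Locale chooseResponseLocale(String acceptLanguage) {
        if (acceptLanguage == null || acceptLanguage.trim().isEmpty()) {
            return FALLBACK;
        }
        List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(acceptLanguage);
        Locale match = Locale.lookup(ranges, SUPPORTED);
        return match != null ? match : FALLBACK;
    }

    public static void main(String[] args) {
        System.out.println(chooseResponseLocale("fr-CH, fr;q=0.9, en;q=0.8"));
        System.out.println(chooseResponseLocale("ja"));
    }
}
```

Because the method takes the preference string as a parameter and keeps no per-request state, it can serve requests for many locales at the same time. A malformed preference string makes LanguageRange.parse throw IllegalArgumentException, which a real application would need to handle.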

In many projects, application internationalization is an afterthought. But internationalizing an existing application usually requires a great deal of refactoring. Because internationalization affects all J2EE application tiers, it is fundamentally an architectural issue. Internationalization is much easier to achieve if it is integrated into the application design. A project's design phase should identify and separate locale dependencies if the application might ever need to support multiple locales.

10.1.1.1 Standard Locale Naming Convention

The J2SE standard class java.util.ResourceBundle (see Section 10.2.1 on page 316) defines a naming convention for locales, which should be used whenever organizing resources by locale. A locale name consists of the international standard two-character abbreviations for language (ISO 639) and country (ISO 3166), and an optional variant, which is a browser- and vendor-specific code for identifying platform differences. The Java platform does not define the possible values and semantics of variants. Any of the three parts of a locale name may be empty; the parts that are present are separated by underscore characters (`_'). Examples of locale names include fr (French), de_CH (Swiss German), and en_US_POSIX (United States English on a POSIX-compliant platform). For more on this naming convention, see the J2SE javadoc for class java.util.ResourceBundle.
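The naming convention can be seen by asking the platform which resource names it would consider for a locale. The sketch below uses ResourceBundle.Control, which is newer than the J2SE release discussed in this chapter; the base name Messages and the class name are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.ResourceBundle;

final class BundleNameSketch {

    // Lists the bundle names derived from a base name and a locale,
    // from the most specific name to the bare base name.
    static List<String> candidateNames(String baseName, Locale locale) {
        ResourceBundle.Control control =
                ResourceBundle.Control.getControl(ResourceBundle.Control.FORMAT_DEFAULT);
        List<String> names = new ArrayList<>();
        for (Locale candidate : control.getCandidateLocales(baseName, locale)) {
            names.add(control.toBundleName(baseName, candidate));
        }
        return names;
    }

    public static void main(String[] args) {
        // Language "en", country "US", variant "POSIX".
        Locale posix = new Locale.Builder()
                .setLanguage("en").setRegion("US").setVariant("POSIX").build();
        System.out.println(posix);
        for (String name : candidateNames("Messages", posix)) {
            System.out.println(name);
        }
    }
}
```

Each name in the list drops one more part of the locale name, so a resource organized under a less specific name can stand in when a more specific one is missing.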


10.1.2 Character Sets

A character set is a set of graphic, textual symbols, each of which is mapped to a nonnegative integer. For example, ASCII (ANSI X3.4-1968, ISO 646) defines a character set that is commonly used for representing American English. Japanese schools and official Japanese government documents use the Joyo Kanji, a fixed set of 1,945 characters. For the purposes of this chapter, "character set" means "coded character set," as defined in UTR-17.

A character set assigns a nonnegative integer, called a code point, to each character. For example, the ASCII code point for lowercase `a' is 97 (hex 61).
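The mapping from character to code point can be inspected directly in a Java program. This is a minimal sketch; the class and method names are made up for the example.

```java
final class CodePointSketch {

    // Returns the code point of the first character in the text.
    static int firstCodePoint(String text) {
        return text.codePointAt(0);
    }

    public static void main(String[] args) {
        int codePoint = firstCodePoint("a");
        System.out.println("decimal " + codePoint + ", hex " + Integer.toHexString(codePoint));
    }
}
```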

10.1.2.2 ASCII

The most common character set used to represent American English is ASCII (American Standard Code for Information Interchange). Code points in 7-bit ASCII (called US-ASCII) range from 0 to 127. ASCII contains upper- and lower-case Roman alphabets, European numerals, punctuation, a set of control codes (nongraphic code points from decimal 0 to 31), and a few miscellaneous symbols.

Many early Internet protocols were based on 7-bit ASCII, greatly complicating Web application support of languages other than American English.
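To see where the 7-bit limit bites, a program can ask whether a piece of text fits in US-ASCII at all. The sketch below uses the java.nio.charset API; the class name and sample strings are assumptions.

```java
import java.nio.charset.StandardCharsets;

final class AsciiCheckSketch {

    // True if every character of the text has a code point in the US-ASCII range.
    static boolean fitsInUsAscii(String text) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(fitsInUsAscii("cafe"));       // unaccented text
        System.out.println(fitsInUsAscii("caf\u00e9"));  // final letter is e with acute accent
    }
}
```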

10.1.2.3 The 8859 Series

The ISO 8859 character set series was created to overcome some of the limitations of ASCII. Each ISO 8859 character set may have up to 256 characters. ISO 8859-1 ("Latin-1") comprises the ASCII character set, plus characters with diacritics (accents, diaereses, cedillas, circumflexes, and the like), and additional symbols. The ISO 8859 series defines thirteen character sets (ISO 8859-1 through -10 and ISO 8859-13 through -15) that can represent texts in dozens of languages.
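The following sketch shows both the reach and the limit of ISO 8859-1: an accented Latin letter fits in a single byte, while a character outside the set, such as the euro sign, cannot be encoded. Class and method names are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;

final class Latin1Sketch {

    // Encodes the text in ISO 8859-1 and returns the first byte as an unsigned value.
    static int firstLatin1Byte(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1)[0] & 0xFF;
    }

    // True if ISO 8859-1 contains every character of the text.
    static boolean fitsInLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // e with acute accent
        String euro = "\u20ac";   // euro sign
        System.out.println(Integer.toHexString(firstLatin1Byte(eAcute)));
        System.out.println(fitsInLatin1(eAcute));
        System.out.println(fitsInLatin1(euro));
    }
}
```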

10.1.2.4 Unicode

Unicode (ISO 10646) defines a standardized, universal character set with 21-bit code points. Unicode was designed to represent virtually all character sets currently in use around the world today and can be extended to accommodate additions. Unicode encompasses alphabetic scripts, ideographic writing systems, and phonetic syllabaries, and may be rendered in any direction.

The Java programming language internally represents characters and String objects as 16-bit encoded Unicode (version 3.0 for Java 1.4). Programs written in the Java programming language can therefore process data in multiple languages, natively performing localized operations such as string comparison, parsing, and collation.
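As one example of a localized operation, the sketch below contrasts plain String comparison, which orders by code point value, with java.text.Collator, which orders by the rules of a locale. The sample words and class name are assumptions.

```java
import java.text.Collator;
import java.util.Locale;

final class CollationSketch {

    // Compares two strings using the collation rules of the given locale.
    static int localizedCompare(String left, String right, Locale locale) {
        return Collator.getInstance(locale).compare(left, right);
    }

    public static void main(String[] args) {
        // Code point order puts every uppercase ASCII letter before every lowercase one.
        System.out.println("apple".compareTo("Banana"));
        // Locale-aware order places "apple" before "Banana", as a dictionary would.
        System.out.println(localizedCompare("apple", "Banana", Locale.US));
    }
}
```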

Unicode characters in Java program source files may be represented as escape sequences, using the notation \uXXXX, where XXXX is the character's 16-bit code point in hexadecimal. Unicode-escaped strings are very useful when program source files are not encoded as Unicode. Unicode escape sequences also provide support for multiple scripts using a single file encoding.
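A short sketch of escape sequences in practice follows. The source text is pure ASCII, yet the strings hold Japanese text and a character whose code point does not fit in 16 bits; the latter is written as two escapes (a surrogate pair), which is how later versions of the platform represent such characters. The class and field names are made up for this example.

```java
final class UnicodeEscapeSketch {

    // Three kanji, written with escapes so the source file stays ASCII-only.
    static final String JAPANESE_WORD = "\u65e5\u672c\u8a9e";

    // One character above the 16-bit range (code point 1F600 hex), as a surrogate pair.
    static final String SUPPLEMENTARY = "\uD83D\uDE00";

    public static void main(String[] args) {
        System.out.println(JAPANESE_WORD.length());
        System.out.println(Integer.toHexString(JAPANESE_WORD.codePointAt(0)));
        // Two 16-bit units, but only one code point.
        System.out.println(SUPPLEMENTARY.length());
        System.out.println(SUPPLEMENTARY.codePointCount(0, SUPPLEMENTARY.length()));
    }
}
```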


10.1.3 Encodings

An encoding maps a character set's code points to units of a specific width, and defines byte serialization and ordering rules. The Unicode 3.0 encoding UTF-32BE encodes Unicode code points as 32-bit unsigned integers with big-endian byte ordering. For the purposes of this chapter, "encoding" means "character encoding form serialized by character encoding scheme," as defined by UTR-17.

Many character sets have more than one encoding. For example, Java programs can represent Japanese character sets using the EUC-JP or Shift-JIS encodings, among others. Each encoding has rules for representing and serializing a character set.
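The difference between encodings becomes concrete when the same character is serialized several ways. The sketch below prints the bytes as hexadecimal. UTF-8 and UTF-16BE are always available on the Java platform; the other encodings are looked up by name and skipped if the runtime does not provide them. The class name and the hex helper are assumptions for illustration.

```java
import java.nio.charset.Charset;

final class EncodingDumpSketch {

    // Returns the bytes of the text in the given encoding as space-separated hex pairs.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String kanji = "\u65e5"; // a single Japanese character
        String[] names = { "UTF-8", "UTF-16BE", "UTF-32BE", "EUC-JP", "Shift_JIS" };
        for (String name : names) {
            if (Charset.isSupported(name)) {
                System.out.println(name + ": " + hex(kanji, Charset.forName(name)));
            } else {
                System.out.println(name + ": not available in this runtime");
            }
        }
    }
}
```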

J2SE package java.io contains classes that support reading and writing character data streams in various encodings. These classes all have names that end in Reader (for example, BufferedReader and InputStreamReader) and Writer (BufferedWriter, PrintWriter).
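A minimal round trip through these classes might look like the sketch below, which writes text to an in-memory byte stream in a chosen encoding and reads it back. The in-memory streams stand in for files or sockets, and the class and method names are assumptions.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ReaderWriterSketch {

    // Writes the text through a Writer that encodes characters into bytes.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out, charset))) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Reads one line through a Reader that decodes bytes back into characters.
    static String readLine(byte[] bytes, Charset charset) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String greeting = "gr\u00fc\u00df"; // German text with two non-ASCII letters
        byte[] encoded = write(greeting, StandardCharsets.UTF_8);
        System.out.println(encoded.length + " bytes");
        System.out.println(greeting.equals(readLine(encoded, StandardCharsets.UTF_8)));
        // Decoding with a different encoding than the one used for writing garbles the text.
        System.out.println(greeting.equals(readLine(encoded, StandardCharsets.ISO_8859_1)));
    }
}
```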

JSP pages and servlets both produce responses with a PrintWriter, which automatically performs encoding. Servlets may also output binary data with OutputStream classes, which perform no encoding. An application whose character set cannot be represented in the default encoding (ISO 8859-1) must explicitly set a different encoding. A reference to the encoding section of the J2SE documentation is listed in Section 10.9 on page 345.
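The effect of leaving the default encoding in place can be approximated outside a container. In the sketch below, a PrintWriter layered over an in-memory stream stands in for the response writer; this is an assumption made so the example runs with J2SE classes alone, not a depiction of the servlet API. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ResponseEncodingSketch {

    // Renders the body through a PrintWriter that encodes with the given charset.
    static byte[] render(String body, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        PrintWriter writer = new PrintWriter(new OutputStreamWriter(out, charset));
        writer.print(body);
        writer.flush();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String body = "\u65e5\u672c"; // two Japanese characters
        // ISO 8859-1 has no mapping for these characters, so the writer substitutes them.
        byte[] latin1 = render(body, StandardCharsets.ISO_8859_1);
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1));
        // An encoding that covers the character set preserves the text.
        byte[] utf8 = render(body, StandardCharsets.UTF_8);
        System.out.println(body.equals(new String(utf8, StandardCharsets.UTF_8)));
    }
}
```

The substitution happens silently, which is why an unsuitable default encoding tends to surface as question marks in the browser rather than as an error on the server.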

10.1.3.5 UTF-8

UTF-8 (Unicode Transformation Format, 8 bit form) is a variable-width character encoding that encodes each Unicode character as one to four 8-bit quantities; characters that fit in 16 bits need at most three. UTF-8 unifies US-ASCII with Unicode. A byte in UTF-8 is equivalent to 7-bit ASCII if its high-order bit is zero; otherwise, the byte is part of a multibyte sequence that encodes a single character. Another encoding, UCS-2, encodes each Unicode character in a fixed width of 16 bits. Documents that consist mostly of ASCII characters tend to be smaller in UTF-8 than in UCS-2, because those characters are encoded in one byte instead of two.
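The variable width is easy to observe by encoding characters from different ranges. In this sketch the sample code points and the class name are assumptions chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

final class Utf8WidthSketch {

    // Number of bytes UTF-8 uses for a single code point.
    static int utf8Length(int codePoint) {
        String text = new String(Character.toChars(codePoint));
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int[] samples = { 0x61, 0xE9, 0x65E5, 0x1F600 };
        for (int codePoint : samples) {
            System.out.println("U+" + Integer.toHexString(codePoint).toUpperCase()
                    + " takes " + utf8Length(codePoint) + " byte(s)");
        }
        // ASCII-only text takes half the space in UTF-8 that it takes in a 16-bit encoding.
        String ascii = "hello";
        System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length + " vs "
                + ascii.getBytes(StandardCharsets.UTF_16BE).length);
    }
}
```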

Many new Web standards specify UTF-8 as their character encoding. UTF-8 is compatible with the majority of existing Web content and provides access to the Unicode character set. Current versions of browsers and email clients support UTF-8. UTF-8 is one of the two required encodings for XML documents (the other is UTF-16). Encoding internationalized content in UTF-8 is a BluePrints recommendation.



Copyright © 2002 Sun Microsystems, Inc. All Rights Reserved.