Abstract
This article describes how supplementary characters are supported
in the Java platform.
Supplementary characters are characters in the Unicode standard whose
code points are above U+FFFF, and which therefore cannot be described
as single 16-bit entities such as the
char data type in
the Java programming language. Such characters are generally rare,
but some are used, for example, as part of Chinese and Japanese
personal names, and so support for them is commonly required for
government applications in East Asian countries.

The Java platform is being enhanced to enable processing of supplementary characters with minimal impact on existing applications. New low-level APIs enable operations on individual characters where necessary. Most text-processing APIs, however, use character sequences, such as the String class or
character arrays. These are now interpreted as UTF-16 sequences, and
the implementations of these APIs have been changed to correctly handle
supplementary characters. The enhancements are part of version 5.0 of
the Java 2 Platform, Standard Edition (J2SE).

Besides explaining these enhancements in detail, this article also provides guidelines for application developers for determining and implementing necessary changes to enable use of the complete Unicode character set.

Background
Unicode was originally designed as a fixed-width 16-bit character
encoding. The primitive data type
char in the Java
programming language was intended to take advantage of this design by
providing a simple data type that could hold any character. However,
it turned out that the 65,536 characters possible in a 16-bit encoding
are not sufficient to represent all characters that are or have been
used on planet Earth. The Unicode standard therefore has been
extended to allow up to 1,112,064 characters. Those characters that
go beyond the original 16-bit limit are called supplementary
characters. Version 2.0 of the Unicode standard was the first to
include a design to enable supplementary characters, but it was only
in version 3.1 that the first supplementary characters were assigned.
Version 5.0 of the J2SE is required to support version 4.0 of the
Unicode standard, so it has to support supplementary characters.

Support for supplementary characters is likely to also become a common business requirement in East Asian markets. Government applications are going to require them in order to correctly represent names that include rare Chinese characters. Publishing applications may need them in order to represent the full set of historical and variant characters. The Chinese government requires support for GB18030, a character encoding that encodes the entire Unicode character set, and so includes supplementary characters if Unicode version 3.1 or later is assumed. The Taiwanese standard CNS-11643 includes numerous characters that have been included in Unicode 3.1 as supplementary characters. The Hong Kong government defined a collection of characters that are needed for Cantonese, and some of these characters are supplementary characters in Unicode. Finally, some vendors in Japan are planning to use the large private use area in the supplementary character space for more than 50,000 kanji character variants in order to migrate from their proprietary systems to solutions based on the Java platform.

The Java platform therefore not only has to support supplementary characters, but it also has to make it easy for applications to do the same. Since supplementary characters break a fundamental assumption of the Java programming language and might require a fundamental change in the programming model, an expert group was convened under the Java Community Process to choose the right solution for the problem. The group is called the JSR-204 expert group, using the number of the Java Specification Request for Unicode Supplementary Character Support.
Technically, the decisions of the expert group only apply to the J2SE platform, but since the Java 2 Platform, Enterprise Edition (J2EE) sits on top of the J2SE platform, it benefits directly, and we expect that the configurations of the Java 2 Platform, Micro Edition (J2ME) will adopt the same design approach.

But before we can look at the solution that the JSR-204 expert group came up with, we need to learn some terminology.

Code Points, Character Encoding Schemes, UTF-16: What's All This?
The introduction of supplementary characters unfortunately makes
the character model quite a bit more complicated. Where in the past
we could simply talk about "characters" and, in a Unicode-based
environment such as the Java platform, assume that a character has 16
bits, we now need more terminology. We'll try to keep it relatively
simple -- for a full-blown discussion with all details you can read
Chapter
2 of The Unicode Standard or Unicode Technical Report 17
"Character Encoding
Model." Unicode experts may skip all but the last definition in
this section.
A character is just an abstract minimal unit of text. It doesn't have a fixed shape (that would be a glyph), and it doesn't have a value. "A" is a character, and so is "€", the symbol for the common currency of Germany, France, and numerous other European countries.

A character set is a collection of characters. For example, the Han characters are the characters originally invented by the Chinese, which have been used to write Chinese, Japanese, Korean, and Vietnamese.

A coded character set is a character set where each character has been assigned a unique number. At the core of the Unicode standard is a coded character set that assigns the letter "A" the hexadecimal number 0041 and the symbol "€" the hexadecimal number 20AC. The Unicode standard always uses hexadecimal numbers, and writes them with the prefix "U+", so the number for "A" is written as "U+0041".

Code points are the numbers that can be used in a coded character set. A coded character set defines a range of valid code points, but doesn't necessarily assign characters to all those code points. The valid code points for Unicode are U+0000 to U+10FFFF. Unicode 4.0 assigns characters to 96,382 of these more than a million code points.

Supplementary characters are characters with code points in the range U+10000 to U+10FFFF, that is, those characters that could not be represented in the original 16-bit design of Unicode. The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Thus, each Unicode character is either in the BMP or a supplementary character.

A character encoding scheme is a mapping from the numbers of one or more coded character sets to sequences of one or more fixed-width code units. The most commonly used code units are bytes, but 16-bit or 32-bit integers can also be used for internal processing. UTF-32, UTF-16, and UTF-8 are character encoding schemes for the coded character set of the Unicode standard.
UTF-32 simply represents each Unicode code point as the 32-bit integer of the same value. It's clearly the most convenient representation for internal processing, but uses significantly more memory than necessary if used as a general string representation.

UTF-16 uses sequences of one or two unsigned 16-bit code units to encode Unicode code points. Values U+0000 to U+FFFF are encoded in one 16-bit unit with the same value. Supplementary characters are encoded in two code units, the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). This may seem similar in concept to multi-byte encodings, but there is an important difference: The values U+D800 to U+DFFF are reserved for use in UTF-16; no characters are assigned to them as code points. This means that software can tell for each individual code unit in a string whether it represents a one-unit character or whether it is the first or second unit of a two-unit character. This is a significant improvement over some traditional multi-byte character encodings, where the byte value 0x41 could mean the letter "A" or be the second byte of a two-byte character.

UTF-8 uses sequences of one to four bytes to encode Unicode code points. U+0000 to U+007F are encoded in one byte, U+0080 to U+07FF in two bytes, U+0800 to U+FFFF in three bytes, and U+10000 to U+10FFFF in four bytes. UTF-8 is designed so that the byte values 0x00 to 0x7F always represent code points U+0000 to U+007F (the Basic Latin block, which corresponds to the ASCII character set). These byte values never occur in the representation of other code points, a characteristic that makes UTF-8 convenient to use in software that assigns special meanings to certain ASCII characters.

The following table shows the different representations of a few characters in comparison:

This article also uses the terms character sequence or char sequence in many places to summarize all the
containers of character sequences that the Java 2 Platform knows:
char[], implementations of
java.lang.CharSequence (such as the String
class), and implementations of
java.text.CharacterIterator.

This is a lot of terminology. What does all this have to do with supporting supplementary characters in the Java platform?

Design Approach for Supplementary Characters in the Java Platform
The main decision the JSR-204 expert group had to make was how to
represent supplementary characters in Java APIs, both for individual
characters and for character sequences in all forms. A number of
approaches were considered and rejected by the expert group.
Redefining char to have 32 bits, for example, might
have been very attractive for a brand-new platform, but for J2SE it
would have been incompatible with existing Java virtual machines [1], serialization, and other interfaces, not to mention that UTF-32 based
strings use twice as much memory as UTF-16 based ones. Adding a new
type char32 would have been easier, but would still have
created problems for virtual machines and serialization. Also,
language changes usually have longer lead times than API changes, so
the two previous approaches might have unacceptably delayed support
for supplementary characters. To help determine the winner among the
remaining ones, the implementation team actually implemented
supplementary character support in a substantial piece of code that
does low-level character processing (the java.util.regex
package) using four different approaches, comparing them in terms of
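ease of development and runtime performance.

In the end, the decision was for a tiered approach: low-level APIs represent code points as int values, char sequences in all forms are interpreted as UTF-16 sequences, and new APIs convert between the char and code point-based representations.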
With this approach, a char represents a UTF-16 code
unit, which is not always sufficient to represent a code point.
You'll see that the J2SE specifications now use the terms code
point and UTF-16 code unit where the representation is
relevant, and the generic term character where the
representation is irrelevant to the discussion. APIs usually use the
name codePoint for variables of type int
that represent code points, while UTF-16 code units of course have
type char.

We'll take a look at the actual changes in the J2SE platform in the next two sections -- one for the low-level APIs that work on individual code points, one for higher-level interfaces that work on character sequences.

Supplementary Characters in the Open: Code point-based APIs
The low-level APIs that were added fall into two broad categories:
methods that convert between various
char and code point-based
representations, and methods that analyze or map code points.

The most basic conversion methods are Character.toCodePoint(char high, char low),
which converts two UTF-16 code units to a code point, and
Character.toChars(int codePoint), which converts
the given code point to one or two UTF-16 code units, wrapped into a
char[]. However, since text most of the time comes in
the form of a character sequence, there are also
codePointAt and codePointBefore methods to
extract a code point from the various character sequence
representations: Character.codePointAt(char[] a,
int index) and String.codePointBefore(int
index) are two typical examples. For the most common cases of
inserting code points into a character sequence, there are
appendCodePoint(int codePoint) methods for the
StringBuffer and StringBuilder classes and
a String constructor that takes an int[]
representing code points.

A few methods that analyze code units and code points help in the conversion process: The isHighSurrogate and
isLowSurrogate methods in the Character class identify
the char values that are used to represent supplementary
characters, and the charCount(int codePoint) method
determines whether a code point needs to be converted to one or two
chars.

But most code point-based methods perform, for the complete range of Unicode characters, the functions that older char-based
methods performed for BMP characters. Here are some typical
examples:
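As an illustration, the following sketch calls a few code point-based methods of the Character class on a supplementary character. The methods themselves (isLetter, toUpperCase, and Character.UnicodeBlock.of, each taking an int) are part of the J2SE 5.0 API; the class name CodePointQueries, the helper describe, and the choice of U+10428 as sample input are assumptions made only for this example.

```java
// Sketch: code point-based Character methods applied to BMP and
// supplementary characters. Class and method names are hypothetical.
final class CodePointQueries {

    // U+10428 DESERET SMALL LETTER LONG I, chosen here as a sample
    // supplementary character that has an uppercase counterpart.
    static final int DESERET_SMALL_LONG_I = 0x10428;

    /** Summarizes what the int-based Character methods report for a code point. */
    static String describe(int codePoint) {
        return String.format("U+%04X letter=%b block=%s upper=U+%04X",
                codePoint,
                Character.isLetter(codePoint),
                Character.UnicodeBlock.of(codePoint),
                Character.toUpperCase(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(describe('a'));
        System.out.println(describe(DESERET_SMALL_LONG_I));
    }
}
```

The same calls work for BMP and supplementary characters because the argument is an int code point rather than a char.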
Note that most methods that accept code points do not check whether the given int value is in the range of valid
Unicode code points (as mentioned above, only the range from 0x0 to
0x10FFFF is valid). In most cases the value is produced in a way that
guarantees it is valid, and checking it repeatedly in these low-level
APIs might adversely affect system performance. In cases where
validity cannot be guaranteed, applications must use the
Character.isValidCodePoint method to make sure that the
code point is valid. The behavior of most methods for invalid code
points is intentionally unspecified and may vary between
implementations.

The API contains a number of convenience methods that could be implemented using other, lower-level APIs, but that the expert group felt would be used often enough to justify adding them to the J2SE platform. However, the expert group also rejected some proposed convenience methods, which gives us an opportunity to show how you can implement such methods yourself. For example, the expert group debated and rejected a new constructor for the String class which would create a
String holding a single code point. Here's a simple way
that an application can provide the functionality using the existing API:
    /**
     * Creates new String that contains just the given code point.
     */
    String newString(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

You'll notice that in this simple implementation the toChars method always creates an intermediate array,
which is used once and then immediately discarded. If the method
shows up in your performance measurements, you may want to optimize
for the very, very, very common case where the code point is a BMP
character:
    /**
     * Creates new String that contains just the given code point.
     * Version that optimizes for BMP characters.
     */
    String newString(int codePoint) {
        if (Character.charCount(codePoint) == 1) {
            return String.valueOf((char) codePoint);
        } else {
            return new String(Character.toChars(codePoint));
        }
    }

Or, if you need to create a large number of such strings, you may want to write a bulk version that reuses the array used by the toChars method:
    /**
     * Creates new Strings each of which contains one of the given
     * code points.
     * Version that optimizes for BMP characters.
     */
    String[] newStrings(int[] codePoints) {
        String[] result = new String[codePoints.length];
        char[] codeUnits = new char[2];
        for (int i = 0; i < codePoints.length; i++) {
            int count = Character.toChars(codePoints[i], codeUnits, 0);
            result[i] = new String(codeUnits, 0, count);
        }
        return result;
    }

However, it may turn out that you actually want an entirely different solution. The new constructor String(int codePoint)
was actually proposed as a code point-based
alternative to String.valueOf(char). In many cases this
method is used in the context of message generation, such as:
    // ch is a char variable holding the offending character
    System.out.println("Character " + String.valueOf(ch) + " is invalid.");

The new formatting API, which supports supplementary characters, provides a much simpler alternative:

    System.out.printf("Character %c is invalid.%n", codePoint);

Using this higher-level API is not only simpler, it has additional advantages: It avoids the concatenation, which would make the message very hard to localize, and reduces the number of strings that need to be moved into a resource bundle from two to one.

Supplementary Characters under the Hood: Functionality Enhancements
Most changes in the Java 2 Platform that enable the use of
supplementary characters are not reflected in new API. The general
expectation is that all interfaces that handle character sequences
will handle supplementary characters in a way that's appropriate for
their functionality. This section highlights some enhancements that
were made to meet this expectation.
Identifiers in the Java Programming Language
The Java Language Specification specifies that all Unicode letters
and digits can be used in identifiers. Many supplementary characters
are letters or digits, and so the Java Language Specification was
updated to refer to new code point-based methods to define the legal
characters in identifiers. The javac compiler and other tools that
need to detect identifiers were changed to use these new methods.
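A tool that needs to detect identifiers could use the code point-based methods roughly as sketched below. Character.isJavaIdentifierStart(int) and Character.isJavaIdentifierPart(int) are J2SE 5.0 methods; the class IdentifierCheck and the method isIdentifierShape are hypothetical names, and the check deliberately ignores keywords and literals such as "true" or "null", which a real compiler also has to exclude.

```java
// Sketch: checking identifier syntax by code point rather than by char.
// Class and method names are hypothetical.
final class IdentifierCheck {

    /**
     * Returns true if the text is non-empty, starts with a code point that
     * may start an identifier, and continues only with code points that may
     * be part of one. Keywords and literals are NOT excluded here.
     */
    static boolean isIdentifierShape(String text) {
        if (text.length() == 0) {
            return false;
        }
        int first = text.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        // Advance by one or two chars, depending on the code point just read.
        for (int i = Character.charCount(first); i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (!Character.isJavaIdentifierPart(codePoint)) {
                return false;
            }
            i += Character.charCount(codePoint);
        }
        return true;
    }

    public static void main(String[] args) {
        // U+10400 DESERET CAPITAL LETTER LONG I, a supplementary letter.
        String deseret = new String(Character.toChars(0x10400));
        System.out.println(isIdentifierShape("x" + deseret));
        System.out.println(isIdentifierShape("1x"));
    }
}
```

A loop that tested each char with the older char-based methods would reject the surrogate code units individually, which is why the index advances by Character.charCount.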
Supplementary Character Support in Libraries
Numerous J2SE libraries have been enhanced to support
supplementary characters through existing interfaces. Here are just a
few examples:
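Two behaviors that can be observed through existing interfaces are string case conversion and regular expression matching; they are picked here only as illustrations. The sketch below assumes the Deseret letters U+10428 and U+10400 as sample data, uses hypothetical names (LibrarySupport, upper, isSingleCharacter), and relies on Locale.ROOT, which was added in a release later than J2SE 5.0.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Sketch: existing String and regex interfaces operating on supplementary
// characters. Class and method names are hypothetical.
final class LibrarySupport {

    // U+10428 and U+10400: a lowercase/uppercase pair from the Deseret block.
    static final String SMALL = new String(Character.toChars(0x10428));
    static final String CAPITAL = new String(Character.toChars(0x10400));

    /** Case conversion through the ordinary String API. */
    static String upper(String text) {
        return text.toUpperCase(Locale.ROOT);
    }

    /** True if the regex "." (any single character) matches the whole text. */
    static boolean isSingleCharacter(String text) {
        return Pattern.matches(".", text);
    }

    public static void main(String[] args) {
        System.out.println(upper(SMALL).equals(CAPITAL));
        System.out.println(SMALL.length() + " chars, single character: "
                + isSingleCharacter(SMALL));
    }
}
```

Neither call site mentions code points: the char sequence is interpreted as UTF-16 inside the library.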
Character Conversion
There are only a small number of character encodings that can
represent supplementary characters. In the case of Unicode-based
encodings such as UTF-8 and UTF-16LE, the character converters in
previous releases of the J2RE already implemented the conversions in
a way that handled supplementary characters correctly. For J2RE 5.0,
the converters for other encodings that can represent supplementary
characters have been updated: GB18030, x-EUC-TW (which now implements
all planes of CNS 11643), Big5-HKSCS (which now implements
HKSCS-2001).
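The behavior of the Unicode-based converters can be checked with a short round trip. The sketch below encodes U+20000 with two encodings and prints the resulting bytes; the class ConversionRoundTrip and its methods are hypothetical names, and java.nio.charset.StandardCharsets is an assumption about the runtime, since it was added in a release later than J2SE 5.0 (on 5.0 the charset names "UTF-8" and "UTF-16LE" would be passed as strings instead).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Sketch: encoding and decoding a supplementary character with
// Unicode-based converters. Class and method names are hypothetical.
final class ConversionRoundTrip {

    // U+20000, a supplementary Han character.
    static final String U20000 = new String(Character.toChars(0x20000));

    /** Encodes the text and returns the bytes as space-separated hex. */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** True if encoding and then decoding yields the original text. */
    static boolean roundTrips(String text, Charset charset) {
        return new String(text.getBytes(charset), charset).equals(text);
    }

    public static void main(String[] args) {
        System.out.println(hex(U20000, StandardCharsets.UTF_8));    // F0 A0 80 80
        System.out.println(hex(U20000, StandardCharsets.UTF_16LE)); // 40 D8 00 DC
        // GB18030 may not be installed in every runtime, so it is guarded.
        if (Charset.isSupported("GB18030")) {
            System.out.println(roundTrips(U20000, Charset.forName("GB18030")));
        }
    }
}
```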
Representing Supplementary Characters in Source Files
In Java programming language source files, supplementary
characters are easiest to use if a character encoding is used that
can represent supplementary characters directly. UTF-8 is an
excellent choice. For cases where the character encoding used cannot
represent the characters directly, the Java programming language
provides a Unicode escape syntax. This syntax has not been enhanced
to express supplementary characters directly. Instead, they are
represented by the two consecutive Unicode escapes for the two code
units in the UTF-16 representation of the character. For example, the
character U+20000 is written as "\uD840\uDC00". You probably don't
want to figure out these escape sequences yourself; it's best to
write in an encoding that supports the supplementary characters that
you need and then use a tool such as native2ascii to convert to
escape sequences.
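To see that the two consecutive escapes really denote one supplementary character, the following sketch compares the escaped literal for U+20000 with a string built from the code point. It also includes a small helper that writes non-ASCII code units as escapes, similar in spirit to what a conversion tool does; the class EscapeForms and the method toEscapes are hypothetical names, and the helper is not a replacement for native2ascii.

```java
// Sketch: a supplementary character written as two Unicode escapes.
// Class and method names are hypothetical.
final class EscapeForms {

    // U+20000 written as the escapes for its two UTF-16 code units.
    static final String ESCAPED = "\uD840\uDC00";

    // The same character built from its code point.
    static final String FROM_CODE_POINT = new String(Character.toChars(0x20000));

    /** Replaces every non-ASCII UTF-16 code unit with its escape text. */
    static String toEscapes(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char unit = text.charAt(i);
            if (unit < 0x80) {
                sb.append(unit);
            } else {
                sb.append(String.format("\\u%04X", (int) unit));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(ESCAPED.equals(FROM_CODE_POINT));
        System.out.println(toEscapes("x" + FROM_CODE_POINT));
    }
}
```

Because the escape syntax works on code units, the helper can process the string one char at a time without knowing about surrogate pairs.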
Properties files, unfortunately, are still limited to ISO 8859-1 as their encoding (unless your application uses the new XML format). This means you always have to use escape sequences for supplementary characters, and again probably will want to write in a different encoding and then convert with a tool such as native2ascii.

Modified UTF-8
Modified UTF-8 is not new to the Java platform, but it's something
that application developers need to be more aware of when converting
text that might contain supplementary characters to and from UTF-8.
The main thing to remember is that some J2SE interfaces use an
encoding that's similar to UTF-8 but incompatible with it. This
encoding has in the past sometimes been called "Java modified UTF-8"
or (incorrectly) just "UTF-8". For J2SE 5.0, the documentation is
being updated to uniformly call it "modified UTF-8."
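The incompatibility can be observed by encoding the same strings both ways. The sketch below obtains modified UTF-8 from DataOutputStream.writeUTF (dropping the two-byte length prefix that writeUTF adds) and standard UTF-8 from String.getBytes. The class Utf8Variants and its methods are hypothetical names, and StandardCharsets is an assumption about the runtime, since it was added in a release later than J2SE 5.0.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch: comparing standard UTF-8 with modified UTF-8.
// Class and method names are hypothetical.
final class Utf8Variants {

    /** Standard UTF-8, as produced by the String class. */
    static byte[] standardUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Modified UTF-8, as written by writeUTF, without the length prefix. */
    static byte[] modifiedUtf8(String text) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeUTF(text);
            out.flush();
            byte[] all = buffer.toByteArray();
            // writeUTF starts with a 2-byte length; skip it.
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            // Not expected for an in-memory stream and short strings.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String supplementary = new String(Character.toChars(0x20000));
        String nul = String.valueOf((char) 0);
        System.out.println(standardUtf8(supplementary).length
                + " vs " + modifiedUtf8(supplementary).length);
        System.out.println(standardUtf8(nul).length
                + " vs " + modifiedUtf8(nul).length);
    }
}
```

For strings that contain neither U+0000 nor supplementary characters, both methods return identical bytes, which is why the difference is easy to overlook in testing.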
The incompatibility between modified UTF-8 and standard UTF-8 stems from two differences. First, modified UTF-8 represents the character U+0000 as the two-byte sequence 0xC0 0x80, whereas standard UTF-8 uses the single byte value 0x0. Second, modified UTF-8 represents supplementary characters by separately encoding the two surrogate code units of their UTF-16 representation. Each of the surrogate code units is represented by three bytes, for a total of six bytes. Standard UTF-8, on the other hand, uses a single four-byte sequence for the complete character.

Modified UTF-8 is used by the Java Virtual Machine and the interfaces attached to it (such as the Java Native Interface, the various tool interfaces, or Java class files), in the java.io.DataInput and DataOutput interfaces
and classes implementing or using them, and for serialization. The
Java Native Interface provides routines that convert to and from
modified UTF-8. Standard UTF-8, on the other hand, is supported by
the String class, by the
java.io.InputStreamReader and
OutputStreamWriter classes, the
java.nio.charset facilities, and many APIs layered on
top of them.

Since modified UTF-8 is incompatible with standard UTF-8, it is critical not to use one where the other is needed. Modified UTF-8 can only be used with the Java interfaces described above. In all other cases, in particular for data streams that may come from or may be interpreted by software that's not based on the Java platform, standard UTF-8 must be used. The Java Native Interface routines that convert to and from modified UTF-8 cannot be used when standard UTF-8 is required.

Supporting Supplementary Characters in Your Application
Now, the question that matters most to most readers: What changes
do you have to make to your application in order to support
supplementary characters?
The answer depends on what kind of text processing is done within the application and which Java platform APIs are used. Applications that deal with text only in the form of char sequences in all forms (char[],
implementations of java.lang.CharSequence,
implementations of java.text.CharacterIterator), and
only use Java APIs that accept and return such char
sequences, will likely not have to make any changes. The
implementation of the Java platform APIs should handle supplementary
characters for you.

Applications that interpret individual characters themselves, pass individual characters to Java platform APIs, or call methods that return individual characters, need to consider the valid values for these characters. In many cases it turns out that support for supplementary characters is not required. For example, if an application scans a char sequence for HTML tags,
checking each char individually, it knows that these
tags only use characters from the Basic Latin block. If the text
being scanned contains supplementary characters, then these
characters cannot be confused with the tag characters, because UTF-16
represents supplementary characters using code units whose values are
not used for BMP characters.

Only where applications interpret individual characters themselves, pass individual characters to Java platform APIs, or call methods that return individual characters, and these characters can be supplementary characters, does the application have to be changed. Where parallel APIs are available that use char
sequences, it is best to convert to use such APIs. In the remaining
cases, it will be necessary to use the new API to convert between
char and code point-based representations, and call code
point-based APIs. Unless, of course, you're lucky and find that there
are newer and more convenient APIs in J2SE 5.0 that let you support
supplementary characters and simplify your code at the same time, as
in the formatting sample above.

You might wonder whether it's better to convert all text into code point representation (say, an int[]) and process it in
that representation, or whether it's better to stick with
char sequences most of the time and only convert to code
points when needed. Well, the Java platform APIs in general certainly
have a preference for char sequences, and using them
will also save memory space.

For applications that need conversion to and from UTF-8, you will also need to consider carefully whether standard or modified UTF-8 is required, and use the proper Java platform facilities for each. The section "Modified UTF-8" provides the information needed to choose the right one.

Testing Your Application With Supplementary Characters
Whether the previous section led you to revise your application or
not, it's always a good idea to test that it behaves correctly. For
applications that don't include a graphical user interface, the
information on "Representing Supplementary
Characters in Source Files" helps in developing test cases.
Here's additional information on testing with graphical user
interfaces.
For text input, the Java 2 SDK provides a code point input method which accepts strings of the form "\Uxxxxxx", where the uppercase "U" indicates that the escape sequence contains six hexadecimal digits, thus allowing for supplementary characters. A lowercase "u" indicates the original form of the escape sequences, "\uxxxx". You can find this input method and its documentation in the directory demo/jfc/CodePointIM of the J2SDK.

For font rendering, you will need a font that can render at least some supplementary characters. One such font is James Kass's Code2001 font, which provides glyphs for scripts such as Deseret and Old Italic. Thanks to a new feature in the Java 2D library, you can simply install the font into the J2RE's lib/fonts/fallback directory, and it will be automatically added to all logical fonts used in 2D and XAWT rendering - you don't need to edit font configuration files.

And with that, you can call your application ready for supplementary characters!

Conclusion
Support for supplementary characters has been introduced into the
Java platform with an approach that enables most applications to
handle these characters without code changes. Applications that
interpret individual characters can use new code point-based API in
the
Character class and various
CharSequence implementations.
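As a closing illustration of that code point-based API, the following sketch walks a char sequence one code point at a time. Character.codePointAt, Character.charCount, and Character.isSupplementaryCodePoint are J2SE 5.0 methods; the class CodePointWalk and the method countSupplementary are hypothetical names used only here.

```java
// Sketch: iterating over a char sequence by code point.
// Class and method names are hypothetical.
final class CodePointWalk {

    /** Counts the supplementary characters in the given char sequence. */
    static int countSupplementary(CharSequence text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = Character.codePointAt(text, i);
            if (Character.isSupplementaryCodePoint(codePoint)) {
                count++;
            }
            // Step over one or two chars, depending on the code point.
            i += Character.charCount(codePoint);
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "a" + new String(Character.toChars(0x20000)) + "b";
        System.out.println("chars: " + text.length());
        System.out.println("code points: " + text.codePointCount(0, text.length()));
        System.out.println("supplementary: " + countSupplementary(text));
    }
}
```

The text stays in its char sequence form throughout; code points are extracted only at the point where individual characters have to be interpreted.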
Acknowledgments
Supplementary character support in the Java platform was designed
by the JSR-204 expert group within the Java Community Process. The
specification leads are Masayoshi Okutsu and Brian Beck (Sun
Microsystems); the other members of the expert group are Craig
Cummings (Oracle), Mark Davis (IBM), Markus Eble (SAP AG), Jere
Käpyaho (Nokia Corp.), Kazuhiro Kazama (NTT), Kenji Kazumura
(Fujitsu Limited), Eiichi Kimura (NEC Corp.), Changshin Lee (Tmax
Soft Inc.), and Toshiki Murata (Oki Electric Industry Co.). The
reference implementation was done by the Java Internationalization
team at Sun Microsystems with contributions from the IBM
Globalization Center of Competency, San José. The technology
compatibility kit for the specification is the Java Compatibility
Kit, implemented by the JCK Team at Sun Microsystems.
References
Masayoshi Okutsu, Brian Beck (ed.): Unicode Supplementary Character Support. Proposed Final Draft. Sun Microsystems, 2004.
Java 2 Platform Standard Edition 5.0 API Specification. Sun Microsystems, 2004.
The Unicode Consortium: The Unicode Standard, Version 4.0. Addison-Wesley, 2003.
Ken Whistler, Mark Davis: Character Encoding Model. Unicode Technical Report #17. The Unicode Consortium, 2000.
James Kass: Code2001, a Plane 1 Unicode-based Font.

Norbert Lindenberg is the technical lead for Java
Internationalization in Sun Microsystems' Java Web Services group.
Before joining Sun, he worked on a variety of internationalization
projects at General Magic and Apple Computer. He holds an M.S. degree
in Computer Science from Universität Karlsruhe, Germany.
Masayoshi Okutsu is an internationalization engineer in Sun Microsystems' Java Web Services group, and currently the specification lead for Java Specification Request 204, Unicode Supplementary Character Support. Before joining Sun Microsystems, he worked on a variety of internationalization projects at Digital Equipment Corporation. He holds a B.S. degree in Electronic Engineering from Yamagata University, Japan.

[1] As used on this web site, the terms "Java virtual machine" or "JVM" mean a virtual machine for the Java platform.