By John O'Conner, August 24, 2006
In This Issue
Welcome to the Core Java Technologies Tech Tips for August 24, 2006. Here you'll get tips on using core Java technologies and APIs, such as those in Java 2 Platform, Standard Edition (J2SE).
This issue covers:
» How long is your String object?
» How should I compare String objects?
These tips were developed using Java 2 Platform, Standard Edition Development Kit 5.0 (JDK 5.0). You can download JDK 5.0 at http://java.sun.com/j2se/1.5.0/download.jsp.
This issue of the Core Java Technologies Tech Tips was written by John O'Conner, a Sr. Writer at Sun Microsystems, Inc.
See the Subscribe/Unsubscribe note at the end of this newsletter to subscribe to Tech Tips that focus on technologies and products in other Java platforms.
Tech Tip #1: How long is your String object?
How long is your text string? You might need to know that answer
to check whether user input conforms to data field length
constraints. Database text fields usually make you constrain entries
to a specific length, so you might need to confirm text length
before submitting it. Whatever the reason, we all occasionally need
to know the length of a text field. Many programmers use a String
object's length method to get that information. In many
situations, the length method provides the right
solution. However, this isn't the
only way to determine a String object's length, and
it's not always the correct way either.
You have at least three common ways to measure text length in
the Java platform:
- number of char code units
- number of characters or code points
- number of bytes
Counting char Units
The Java platform uses the Unicode
Standard to define its characters. The Unicode Standard once
defined characters as fixed-width, 16-bit values in the range U+0000
through U+FFFF. The U+ prefix signifies a valid Unicode
character value as a hexadecimal number. The Java language
conveniently adopted the fixed-width standard for the char
type. Thus, a char value could represent any 16-bit
Unicode character.
Most programmers are familiar with the length
method. The following code counts the number of char
values in a sample string. Notice that the sample String
object contains a few simple characters and several characters
defined with the Java language's \u notation. The \u
notation defines a 16-bit char value as a hexadecimal number and
is similar to the U+ notation used by the Unicode
Standard.
String testString = "abcd\u5B66\uD800\uDF30";
int charCount = testString.length();
System.out.printf("char count: %d\n", charCount);
The length method counts the number of char
values in a String object. The sample code prints this:
char count: 7
Counting Character Units
When Unicode version 4.0 defined a significant number of new
characters above U+FFFF, the 16-bit char type could no
longer represent all characters. Starting with the Java 2 Platform,
Standard Edition 5.0 (J2SE 5.0), the Java platform began to support
the new Unicode characters as pairs of 16-bit char
values called a surrogate pair. Two char units act as a
surrogate representation of Unicode characters in the
range U+10000 through U+10FFFF. Characters in this new range
are called supplementary characters.
Although a single char value can still represent a
Unicode value up to U+FFFF, only a char surrogate pair
can represent supplementary characters. The leading or high value of
the pair is in the U+D800 through U+DBFF range. The trailing or low
value is in the U+DC00 through U+DFFF range. The Unicode Standard
allocates these two ranges for special use in surrogate pairs. The
standard also defines an algorithm for mapping between a surrogate pair
and a character value above U+FFFF.
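The mapping between a supplementary code point and its surrogate pair is plain bit arithmetic. The following is a minimal sketch of that arithmetic, not the platform's own implementation. The class name SurrogateMath and its two helper methods are hypothetical names chosen for illustration. The main method cross-checks the results against Character.toCodePoint and Character.toChars, which were added in J2SE 5.0.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of the Java platform API.
final class SurrogateMath {

    // Splits a supplementary code point (U+10000 through U+10FFFF)
    // into its high and low surrogate char values.
    static char[] toSurrogates(int codePoint) {
        int offset = codePoint - 0x10000;                // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // lower 10 bits
        return new char[] { high, low };
    }

    // Recombines a high/low surrogate pair into a single code point.
    static int toCodePoint(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x10330);             // GOTHIC LETTER AHSA
        System.out.printf("U+%X -> %04X %04X%n",
                0x10330, (int) pair[0], (int) pair[1]);

        // Cross-check against the library methods.
        System.out.println(Character.toCodePoint(pair[0], pair[1]) == 0x10330);
        System.out.println(Arrays.equals(pair, Character.toChars(0x10330)));
    }
}
```

In application code you would normally call the Character methods directly rather than reimplementing the arithmetic.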
Using surrogate pairs, programmers can
represent any character in the Unicode Standard. This special use of
16-bit units is called UTF-16, and the Java Platform
uses UTF-16 to represent Unicode characters. The char
type is now a UTF-16 code unit, not necessarily a complete Unicode
character (code point).
The length
method cannot count supplementary characters correctly because it
counts char units, not code points.
Fortunately, the J2SE 5.0 API has a new String method:
codePointCount(int beginIndex, int endIndex). This
method tells you how many Unicode code points (characters) are
between the two indices. The index values refer to code unit or char
locations. The value of the expression endIndex - beginIndex
is the number of char units in that range. When the indices span the
whole string, it is the same value provided by the length method. This
difference is not always the same as the value returned by the
codePointCount method. If the range contains
surrogate pairs, the two counts are definitely different. A
surrogate pair occupies two char units but defines a single code
point, so a code point can take either one or two char units.
To find out how many Unicode character code points are in a
string, use the codePointCount method:
String testString = "abcd\u5B66\uD800\uDF30";
int charCount = testString.length();
int characterCount = testString.codePointCount(0, charCount);
System.out.printf("character count: %d\n", characterCount);
This example prints this:
character count: 6
The testString variable contains two interesting
characters, which are a Japanese character meaning "learning"
and a character named GOTHIC LETTER AHSA. The Japanese character has
Unicode code point U+5B66, which has the same hexadecimal char
value \u5B66. The Gothic letter's code point is U+10330. In UTF-16,
the Gothic letter is the surrogate pair \uD800\uDF30. The pair
represents a single Unicode code point, and so the character
code point count of the entire string is 6 instead of 7.
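If you need to visit each character rather than just count them, you can step through the string by code point. The sketch below is one way to do that with the J2SE 5.0 methods String.codePointAt and Character.charCount. The class name CodePointWalk and the countCodePoints helper are hypothetical names used only for this illustration.

```java
// Hypothetical illustration class; not part of the original tip.
final class CodePointWalk {

    // Counts code points by advancing one whole character at a time.
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            i += Character.charCount(codePoint);  // 1 or 2 char units
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String testString = "abcd\u5B66\uD800\uDF30";
        for (int i = 0; i < testString.length(); ) {
            int codePoint = testString.codePointAt(i);
            int units = Character.charCount(codePoint);
            System.out.printf("index %d: U+%04X uses %d char unit(s)%n",
                    i, codePoint, units);
            i += units;
        }
        System.out.printf("code points: %d%n", countCodePoints(testString));
    }
}
```

The loop index advances by two at the Gothic letter, which is why the walk visits six characters in a string whose length method returns 7.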
Counting Bytes
How many bytes are in a String? The answer depends on the
byte-oriented character set encoding used. One common reason for asking
"how many bytes?" is to make sure you're satisfying string
length constraints in a database. The getBytes method
converts its Unicode characters into a byte-oriented encoding, and
it returns a byte[]. One byte-oriented encoding is
UTF-8, which is unlike most other byte-oriented encodings since it
can accurately represent all Unicode code points.
The following code converts text into an array of byte
values:
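// Requires: import java.io.UnsupportedEncodingException;
String testString = "abcd\u5B66\uD800\uDF30";
byte[] utf8 = null;
int byteCount = 0;
try {
    utf8 = testString.getBytes("UTF-8");
    byteCount = utf8.length;
} catch (UnsupportedEncodingException ex) {
    // Every Java platform implementation must support UTF-8.
    ex.printStackTrace();
}
System.out.printf("UTF-8 Byte Count: %d\n", byteCount);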
The target character set determines how many bytes are generated.
The UTF-8 encoding transforms a single Unicode code point into one
to four 8-bit code units (a byte). The characters a,
b, c, and d require a total
of only four bytes. The Japanese character turns into three bytes.
The Gothic letter takes four bytes. The total result is shown here:
UTF-8 Byte Count: 11
Figure 1. Strings have varying lengths depending on what you count.
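To see how the target character set changes the byte count, you can encode the same sample string in more than one encoding. The following sketch compares UTF-8 with UTF-16BE. The class name ByteLengths and the byteCount helper are hypothetical. The sketch also assumes Java SE 6 or later, because it uses the String.getBytes(Charset) overload, which JDK 5.0 does not have. On JDK 5.0, use getBytes(String) as in the example above.

```java
import java.nio.charset.Charset;

// Hypothetical illustration class; not part of the original tip.
final class ByteLengths {

    // Returns the number of bytes the text occupies in the named encoding.
    // Assumes Java SE 6 or later for String.getBytes(Charset).
    static int byteCount(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String testString = "abcd\u5B66\uD800\uDF30";
        System.out.printf("UTF-8:    %d bytes%n", byteCount(testString, "UTF-8"));
        System.out.printf("UTF-16BE: %d bytes%n", byteCount(testString, "UTF-16BE"));
    }
}
```

UTF-16BE writes each of the seven char units as two bytes with no byte order mark, so the same string takes 14 bytes there and 11 bytes in UTF-8.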
Summary
Unless you use supplementary characters, you will never see a
difference between the return values of length and
codePointCount. However, as soon as you use characters
above U+FFFF, you'll be glad to know about the different ways to
determine length. If you send your products to China or Japan,
you're almost certain to find a situation in which length
and codePointCount return different values. Database
character set encodings and some serialization formats encourage
UTF-8 as a best practice. In that case, the text length measurement
is different yet again. Depending on how you intend to use length,
you have a variety of options for measuring it.
Tech Tip #2: How should I compare String objects?
You can compare String objects in a variety of ways,
and the results are often different. The correctness of your result
depends largely on what type of comparison you need. Common
comparison techniques include the following:
- Compare with the == operator.
- Compare with a String object's equals method.
- Compare with a String object's compareTo method.
- Compare with a Collator object.
Comparing with the == Operator
The == operator works on String object
references. If two String variables point to the same
object in memory, the comparison returns a true result.
Otherwise, the comparison returns false, regardless of
whether the text has the same character values. The ==
operator does not compare actual char data. Without
this clarification, you might be surprised that the following code
snippet prints The strings are unequal.
String name1 = "Michèle";
String name2 = new String("Michèle");
if (name1 == name2) {
System.out.println("The strings are equal.");
} else {
System.out.println("The strings are unequal.");
}
The Java platform creates an internal pool for string literals and
constants. String literals and constants that have the exact same
char values and length will exist exactly once in the
pool. Comparisons of String literals and constants with
the same char values will always be equal.
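The pooling behavior can be observed directly. The sketch below compares two identical literals, a literal and a copy made with new String, and a literal and the interned copy. The class name StringPoolDemo and its method names are hypothetical. The accented letter is written as \u00E8 only so that the source file's encoding does not affect the result.

```java
// Hypothetical illustration class; not part of the original tip.
final class StringPoolDemo {

    // Two identical literals refer to the same pooled object.
    static boolean literalsShareObject() {
        String name1 = "Mich\u00E8le";
        String name2 = "Mich\u00E8le";
        return name1 == name2;
    }

    // new String always creates a distinct object outside the pool.
    static boolean copySharesObject() {
        String name1 = "Mich\u00E8le";
        String name2 = new String("Mich\u00E8le");
        return name1 == name2;
    }

    // intern() returns the pooled instance with the same char values.
    static boolean internedCopySharesObject() {
        String name1 = "Mich\u00E8le";
        String name2 = new String("Mich\u00E8le").intern();
        return name1 == name2;
    }

    public static void main(String[] args) {
        System.out.println("literal == literal:        " + literalsShareObject());
        System.out.println("literal == new String:     " + copySharesObject());
        System.out.println("literal == interned copy:  " + internedCopySharesObject());
    }
}
```

Even though the first and third comparisons succeed, relying on == for text comparison is fragile, because strings built at run time are not pooled unless you intern them.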
Comparing with the equals Method
The equals method compares the actual char
content of two strings. This method returns true when
two String objects hold char data with the
same values. This code sample prints The strings are equal.
String name1 = "Michèle";
String name2 = new String("Michèle");
if (name1.equals(name2)) {
System.out.println("The strings are equal.");
} else {
System.out.println("The strings are unequal.");
}
Comparing with the compareTo Method
The compareTo method compares char
values similarly to the equals method. Additionally,
the method returns a negative integer if its own String
object precedes the argument string. It returns zero if the strings
are equal. It returns a positive integer if the object follows the
argument string. For example, the compareTo method says that cat
precedes hat. The most important information to understand
about this comparison is that the method compares the char
values literally. It determines that the value of 'c' in cat
has a numeric value less than the 'h' in hat.
String w1 = "cat";
String w2 = "hat";
int comparison = w1.compareTo(w2);
if (comparison < 0) {
System.out.printf("%s < %s\n", w1, w2);
} else {
System.out.printf("%s < %s\n", w2, w1);
}
The above code sample demonstrates the behavior of the compareTo
method and prints cat < hat. We expect that result, so
where's the weakness? Where's the problem?
Producing Errors
A problem appears when you want to compare text as natural
language, like you do when using a word dictionary. The String
class doesn't have the ability to compare text from a natural
language perspective. Its equals and compareTo
methods compare the individual char values in the
string. If the char value at index n in
name1 is the same as the char value at
index n in name2 for all n in
both strings, the equals method returns true.
Ask the same compareTo method to compare cat
and Hat, and the method produces results that would confuse
most students. Any second grader knows that cat still
precedes Hat regardless of capitalization. However, the
compareTo method will tell you Hat < cat.
The method determines this because the uppercase letters precede
lowercase letters in the Unicode character table. This is the same
ordering that appears in the ASCII character tables as well.
Clearly, this ordering is not always desirable when you want to
present your application users with sorted text.
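The effect is easy to see when you sort a small word list. Arrays.sort on a String array uses the compareTo ordering, so the sketch below shows what users would see if you sorted with it. The class name CodeUnitOrder, the sortedCopy helper, and the sample words are hypothetical choices for illustration.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of the original tip.
final class CodeUnitOrder {

    // Returns a copy of the words sorted by String.compareTo,
    // that is, by the numeric values of their char units.
    static String[] sortedCopy(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // 'H' is 72 and 'c' is 99, so every uppercase word sorts first.
        System.out.printf("'H' = %d, 'c' = %d%n", (int) 'H', (int) 'c');
        String[] sorted = sortedCopy("cat", "Hat", "apple", "Zebra");
        System.out.println(Arrays.toString(sorted));
        // prints [Hat, Zebra, apple, cat]
    }
}
```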
Another potential problem appears when trying to determine string
equality. Text can have multiple internal representations. For
example, the name "Michèle" contains the Unicode
character sequence M i c h è l e. However, you
can also use the sequence M i c h e ` l e. The second
version of the name uses a "combining sequence" ('e' +
'`') to represent 'è'. Graphical systems that understand
Unicode will display these two representations so that they appear
the same even though their internal character sequences are slightly
different. A String object's simplistic equals
method says that these two strings have different text. They are not
lexicographically equal, but they are definitely equal
linguistically.
The following code snippet prints this: The strings are
unequal. Neither the equals nor compareTo
methods understand the linguistic equivalence of these strings.
String name1 = "Michèle";
String name2 = "Miche\u0300le"; //U+0300 is the COMBINING GRAVE ACCENT
if (name1.equals(name2)) {
System.out.println("The strings are equal.");
} else {
System.out.println("The strings are unequal.");
}
If you're trying to sort a list of names, the results of String's
compareTo method are almost certainly wrong. If you
want to search for a name, again the equals method will
subtly trip you up if your user enters combining sequences, or if
your database normalizes
data differently from how the user enters it. The point is that
String's simplistic comparisons are wrong whenever you are working
with natural language sorting or searching. For these operations,
you need something more powerful than simple char value
comparisons.
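For the equality case alone, one option is to bring both strings to the same Unicode normalization form before calling equals. The following sketch assumes Java SE 6 or later, because the java.text.Normalizer class is not available in JDK 5.0. The class name NormalizeBeforeEquals and the equalAfterNfc helper are hypothetical. This approach handles only canonical equivalence. It does not provide the locale-sensitive ordering that sorting requires.

```java
import java.text.Normalizer;

// Hypothetical illustration class; assumes Java SE 6 or later.
final class NormalizeBeforeEquals {

    // Normalizes both strings to NFC (composed form) and then
    // compares their char values with equals.
    static boolean equalAfterNfc(String a, String b) {
        String normalizedA = Normalizer.normalize(a, Normalizer.Form.NFC);
        String normalizedB = Normalizer.normalize(b, Normalizer.Form.NFC);
        return normalizedA.equals(normalizedB);
    }

    public static void main(String[] args) {
        String name1 = "Mich\u00E8le";   // precomposed letter
        String name2 = "Miche\u0300le";  // 'e' + COMBINING GRAVE ACCENT
        System.out.println("equals:          " + name1.equals(name2));
        System.out.println("equals after NFC: " + equalAfterNfc(name1, name2));
    }
}
```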
Using a Collator
The java.text.Collator class provides natural
language comparisons. Natural language comparisons depend upon
locale-specific rules that determine the equality and ordering of
characters in a particular writing system.
A Collator object understands that people expect
"cat" to come before
"Hat" in a dictionary. Using a collator comparison,
the following code prints cat < Hat.
Collator collator = Collator.getInstance(new Locale("en", "US"));
int comparison = collator.compare("cat", "Hat");
if (comparison < 0) {
System.out.printf("%s < %s\n", "cat", "Hat");
} else {
System.out.printf("%s < %s\n", "Hat", "cat" );
}
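Because Collator implements the Comparator interface, you can pass a collator directly to the standard sorting methods. The sketch below sorts a small word list with a US English collator. The class name CollatorSort, the sort helper, and the sample words are hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of the original tip.
final class CollatorSort {

    // Returns a copy of the words sorted by the locale's collation rules.
    static List<String> sort(List<String> words, Locale locale) {
        List<String> copy = new ArrayList<String>(words);
        Collections.sort(copy, Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("cat", "Hat", "apple", "Zebra");
        System.out.println(sort(words, Locale.US));
        // prints [apple, cat, Hat, Zebra]
    }
}
```

The same list sorted with compareTo would place Hat and Zebra ahead of the lowercase words.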
A collator knows that the character
sequence M i c h è l e is equal to M i c h
e ` l e in some situations, usually those in which natural
language processing is important.
The following comparison uses a Collator object. It
recognizes the combining sequence and evaluates the two strings as
equal. It prints this: The strings are equal.
Collator collator = Collator.getInstance(Locale.US);
String name1 = "Michèle";
String name2 = "Miche\u0300le";
int comparison = collator.compare(name1, name2);
if (comparison == 0) {
System.out.println("The strings are equal.");
} else {
System.out.println("The strings are unequal.");
}
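A collator also has a decomposition mode that controls how it treats composed and combining characters. If you want to be explicit about handling combining sequences, you can request canonical decomposition with setDecomposition. The following is a sketch under that assumption. The class name DecompositionDemo and the canonicallyEqual helper are hypothetical.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical illustration class; not part of the original tip.
final class DecompositionDemo {

    // Compares two strings with a US English collator that decomposes
    // canonically equivalent characters before comparing them.
    static boolean canonicallyEqual(String a, String b) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String name1 = "Mich\u00E8le";   // precomposed letter
        String name2 = "Miche\u0300le";  // 'e' + COMBINING GRAVE ACCENT
        System.out.println(canonicallyEqual(name1, name2));      // true
        System.out.println(canonicallyEqual(name1, "Michele"));  // false
    }
}
```

The Collator documentation describes the decomposition modes as a trade-off between comparison speed and correct handling of accented text, so choose the mode based on the text you expect.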
A Collator object can even understand several "levels"
of character differences. For example, e and d are
two different letters. Their difference is a "primary" difference. The
letters e and è are different too, but
the difference is a "secondary" one. Depending upon how you configure
a Collator instance, you can consider the words "Michèle"
and "Michele" to be equal. The following code will print The
strings are equal.
Collator collator = Collator.getInstance(Locale.US);
collator.setStrength(Collator.PRIMARY);
int comparison = collator.compare("Michèle", "Michele");
if (comparison == 0) {
System.out.println("The strings are equal.");
} else {
System.out.println("The strings are unequal.");
}
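To compare the strength levels side by side, you can run the same pairs of words through collators configured at each level. The sketch below does this for an accent difference and a case difference. The class name StrengthDemo and the equalAt helper are hypothetical. The results apply to the US English collator. The assignment of differences to levels is locale dependent.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical illustration class; not part of the original tip.
final class StrengthDemo {

    // Reports whether a US English collator at the given strength
    // considers the two strings equal.
    static boolean equalAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.US);
        collator.setStrength(strength);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        String plain = "Michele";
        String accented = "Mich\u00E8le";
        String lower = "michele";

        // Accent difference: ignored at PRIMARY, significant at SECONDARY.
        System.out.println(equalAt(Collator.PRIMARY, plain, accented));    // true
        System.out.println(equalAt(Collator.SECONDARY, plain, accented));  // false

        // Case difference: ignored at SECONDARY, significant at TERTIARY.
        System.out.println(equalAt(Collator.SECONDARY, plain, lower));     // true
        System.out.println(equalAt(Collator.TERTIARY, plain, lower));      // false
    }
}
```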
Summary
Consider when the equals method is more appropriate
than the == operator. Also, when you need to order
text, consider whether a Collator object's natural
language comparison is needed. After you consider the subtle differences
among the various comparisons, you might discover that you've been
using the wrong API in some places. Knowing the differences helps you make the
right choices for your applications and customers.
Subscribe/Unsubscribe: You can subscribe to other Sun Developer Network (SDN) publications here:
https://softwarereg.sun.com/registration/developer/en_US/subscriptions
- To subscribe, select the newsletters you want to subscribe to and click "Update."
- To unsubscribe, uncheck the appropriate checkbox, and click "Update."
IMPORTANT: Please read our Licensing, Terms of Use, and Privacy policies:
http://developer.java.sun.com/berkeley_license.html
http://www.sun.com/share/text/termsofuse.html
Privacy Statement: Sun respects your online time and privacy (http://www.sun.com/privacy).