Welcome to the Core Java Technologies Tech Tips, January 10, 2003. Here you'll get tips on using core Java technologies and APIs, such as those in Java 2 Platform, Standard Edition (J2SE). This issue covers:
- Using Charsets and Encodings
- Using Reflection to Create Class Instances
These tips were developed using Java 2 SDK, Standard Edition, v 1.4. This issue of the Core Java Technologies Tech Tips is written by Glen McCluskey.

USING CHARSETS AND ENCODINGS

Suppose that you're doing some Java programming, and have need to write characters to a file:
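A minimal sketch of such a program follows. The class name Encode1, the file name "out", and the string "testing" come from the discussion in this tip; the writeText helper and the rest of the body are assumptions, written so that the default platform encoding is used:

```java
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;

class Encode1 {
    // Hypothetical helper: writes text to a file using the platform's
    // default encoding, which is what FileWriter(String) applies.
    static void writeText(String filename, String text) throws IOException {
        Writer writer = new FileWriter(filename);
        try {
            writer.write(text);
        } finally {
            writer.close();
        }
    }

    public static void main(String[] args) throws IOException {
        writeText("out", "testing");
    }
}
```

Because no encoding is named, the bytes that end up in the file depend on the platform and locale where the program runs.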
When you run this program in the United States in the Solaris Operating Environment or on the Windows platform, the result is a text file "out" of 7 bytes. This is what you would expect. But there is an important issue here. Java characters are 16-bit, that is, each character is two bytes long. The Encode1 program writes a 7-character string to a file, and the result is a 7-byte file. You might ask: what happened to the other bytes, shouldn't there be 14 bytes written? This issue falls under the title "character encodings". The problem is how to map between 16-bit characters representing Java data, and 8-bit bytes stored in data files. And in fact, it's trickier than simply "widening" or "narrowing" the character between 8 and 16 bits because there are literally hundreds of different character encoding schemes in use around the world. This means that the specific sequence of 8-bit bytes needed to represent a particular Java string changes from platform to platform and from locale to locale.
The Java system solves this problem by allowing you to choose the particular encoding scheme that's required when writing out characters. It also provides a reasonable default encoding based on your platform and locale.

The Java system supports default encodings for performing I/O, as in the example above. In addition, you can specify other named encodings ("charsets"). These encodings are described by string names, such as "UTF-8", and by instances of the Charset class.

In the Encode1 example, one way of solving the encoding problem is to always write two bytes out for each character. However, the file will then have null bytes interspersed. Another approach is to throw away the high byte of each Java character. This will work in the example above, but it wouldn't work if you tried to write a string of Greek or Japanese instead. What actually happens in this example is that the second approach is used -- the high byte is discarded. If you change the output line in the Encode1 program from:
writer.write("testing");
to:
writer.write("testing\u1234");
the total output length will be 8 bytes instead of 7, even though the Unicode character \u1234 cannot be represented using a single byte.

"Discard" in the previous discussion can have a couple of meanings. If the high byte of a Java character is 0, as is the case for characters representing 7-bit ASCII, then discard means to omit the high byte. However, another meaning applies to the situation where you have a Java character that is not mappable using a particular encoding. In such a case the character (two bytes) may be replaced by a default substitution byte. In the case above, \u1234 is replaced with 0x3f (the ASCII "?" character).

Let's now look at how to use charsets, that is, mappings between characters and bytes. One basic question you might have is: what charsets are available? Here's a program that displays a list:
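One way such a listing program might be written is sketched below. The class name Encode2 and the charsetNames helper are assumptions; Charset.availableCharsets is the standard java.nio.charset call that returns the charsets known to the running Java platform, keyed by canonical name:

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

class Encode2 {
    // Returns the canonical names of all charsets available on this
    // platform, in the sorted order that availableCharsets() provides.
    static List<String> charsetNames() {
        return new ArrayList<String>(Charset.availableCharsets().keySet());
    }

    public static void main(String[] args) {
        for (String name : charsetNames()) {
            System.out.println(name);
        }
    }
}
```

The list varies from one Java installation to another, so the output on your system will likely differ in length and content.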
The output should look something like this (but without the "*" character):
The "*" is shown here to identify charsets that must be supported on all Java platforms: US-ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE, and UTF-16.

Another basic question: what is the default charset on my local system? Here's a program that displays the name of the default:
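A sketch of one way to do this is shown below. The class name Encode3 comes from this tip, but the body is an assumption: it asks an OutputStreamWriter created without an explicit encoding which encoding it is using. The commented-out lines are modeled on the charset comparison described later in this tip. Note that releases after the one this tip was written for (J2SE 5.0 and later) also offer Charset.defaultCharset().

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

class Encode3 {
    // An OutputStreamWriter created without a charset uses the default,
    // and getEncoding() reports the name of that encoding.
    static String defaultCharsetName() {
        OutputStreamWriter writer =
            new OutputStreamWriter(new ByteArrayOutputStream());
        return writer.getEncoding();
    }

    public static void main(String[] args) {
        System.out.println("default charset is: " + defaultCharsetName());

        // Cp1252 is not a required charset, so this comparison is left
        // commented out; forName throws if the name is not supported.
        // Charset cs1 = Charset.forName("windows-1252");
        // Charset cs2 = Charset.forName("Cp1252");
        // System.out.println(cs1.equals(cs2));
    }
}
```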
When you run this program, you might see a result like this:
default charset is: Cp1252
Notice that this charset is not on the list of required charsets that every Java implementation must support. There is no requirement that the default charset must be one of the required charsets.

This example also has some commented-out logic that shows how you can determine whether two charsets are equal or not. It turns out that "windows-1252" and "Cp1252" are in fact names for a single charset. The logic is commented out because there is no requirement that the Cp1252 charset be supported, and so the logic here might not be meaningful to you.

You may have seen other ways to get the default local charset name, such as querying the "file.encoding" system property. This approach might work, but this property is not guaranteed to be defined on all Java platforms.
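In the Encode3 program, the commented-out logic looks up charsets by name. The Encode4 program does the same for a charset name given on the command line.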
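A minimal sketch of what a program like Encode4 might look like is shown here. Only the class name and the command-line usage come from this tip; the lookup helper and the messages printed are assumptions. Charset.isSupported and Charset.forName are the standard calls for checking and obtaining a charset by name:

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

class Encode4 {
    // Hypothetical helper: returns the Charset for a name, or null if the
    // name is not supported or is not a legal charset name.
    static Charset lookup(String name) {
        try {
            return Charset.isSupported(name) ? Charset.forName(name) : null;
        } catch (IllegalCharsetNameException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java Encode4 charsetname");
            return;
        }
        Charset cs = lookup(args[0]);
        if (cs == null) {
            System.out.println(args[0] + " is not supported");
        } else {
            System.out.println(args[0] + " is supported as " + cs.name());
        }
    }
}
```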
If you run the program, like this:
$ java Encode4 XYZ
it will check whether "XYZ" is a supported Charset on the local system, and if so, obtain the Charset object. Given all this background, how do you actually make use of charsets? Here's a rework of the first example, Encode1:
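A sketch of such a rework follows. The class name Encode5 and the choice of UTF-8 come from this tip; the writeText helper is an assumption. The key difference from Encode1 is that the writer is an OutputStreamWriter with the charset named explicitly:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

class Encode5 {
    // Hypothetical helper: writes text to a file as UTF-8, regardless of
    // the platform's default encoding.
    static void writeText(String filename, String text) throws IOException {
        Writer writer =
            new OutputStreamWriter(new FileOutputStream(filename), "UTF-8");
        try {
            writer.write(text);
        } finally {
            writer.close();
        }
    }

    public static void main(String[] args) throws IOException {
        writeText("out", "testing");
    }
}
```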
The Encode1 program is not portable. It applies the default charset, which can vary based on platform and locale. By contrast, the Encode5 program uses a standard charset (UTF-8).

As mentioned earlier, the default encoding used in the Encode1 example discards the high byte of Java characters. Using the UTF-8 encoding solves this problem. If you change the output line in the Encode5 program from:

writer.write("testing");

to:

writer.write("testing\u1234");

it still works: \u1234 is encoded as three bytes instead of being replaced by a substitution byte. And UTF-8 has the advantage of handling 7-bit ASCII in a graceful way, because each ASCII character is still written as a single byte.

Here's another example. It shows how you can convert Java strings to byte vectors, specifying an encoding:
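The conversion can be sketched as follows. The class name Encode6 and the encodedLength helper are assumptions; the variable names bytevec1 and bytevec2 match the output discussed in this tip. String.getBytes with no argument applies the default charset, and String.getBytes(String) applies the named one:

```java
import java.io.UnsupportedEncodingException;

class Encode6 {
    // Hypothetical helper: number of bytes needed to encode a string
    // in the named charset.
    static int encodedLength(String str, String charsetName)
            throws UnsupportedEncodingException {
        return str.getBytes(charsetName).length;
    }

    public static void main(String[] args)
            throws UnsupportedEncodingException {
        String str = "testing";

        // Convert using the platform's default charset.
        byte[] bytevec1 = str.getBytes();
        System.out.println("bytevec1 length = " + bytevec1.length);

        // Convert using an explicitly named charset.
        byte[] bytevec2 = str.getBytes("UTF-16");
        System.out.println("bytevec2 length = " + bytevec2.length);
    }
}
```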
The output on your system should look something like this:
bytevec1 length = 7
bytevec2 length = 16
The first conversion applies the default charset. The second conversion uses the UTF-16 charset, which writes a 2-byte byte order mark followed by two bytes for each of the 7 characters.
There's one final thing to discuss about character encodings. You might wonder what a typical mapping or encoding algorithm really looks like. Here is some actual code taken from
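A minimal sketch of such a mapping, written to match the description that follows, is shown here. The names charr and bytearr come from that description; the class name Utf8Sketch and the encode method are hypothetical, and this is an illustration rather than the library code itself. The rules are those of the modified UTF-8 format documented for java.io.DataOutputStream.writeUTF, without the two-byte length prefix that method writes:

```java
class Utf8Sketch {
    // Encodes each 16-bit Java character as 1, 2, or 3 bytes.
    static byte[] encode(String str) {
        char[] charr = str.toCharArray();

        // First pass: work out how many bytes are needed.
        int utflen = 0;
        for (char c : charr) {
            if (c >= 0x0001 && c <= 0x007F) {
                utflen += 1;
            } else if (c > 0x07FF) {
                utflen += 3;
            } else {
                utflen += 2;
            }
        }

        // Second pass: write the bytes.
        byte[] bytearr = new byte[utflen];
        int count = 0;
        for (char c : charr) {
            if (c >= 0x0001 && c <= 0x007F) {
                // 7-bit ASCII maps to itself.
                bytearr[count++] = (byte) c;
            } else if (c > 0x07FF) {
                // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
                bytearr[count++] = (byte) (0xE0 | ((c >> 12) & 0x0F));
                bytearr[count++] = (byte) (0x80 | ((c >> 6) & 0x3F));
                bytearr[count++] = (byte) (0x80 | (c & 0x3F));
            } else {
                // Two bytes (0x0 and 0x80 - 0x7ff): 110xxxxx 10xxxxxx
                bytearr[count++] = (byte) (0xC0 | ((c >> 6) & 0x1F));
                bytearr[count++] = (byte) (0x80 | (c & 0x3F));
            }
        }
        return bytearr;
    }
}
```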
Characters are taken from charr, converted into 1-3 bytes, and written into bytearr. Characters in the range 0x1 - 0x7f (7-bit ASCII) are mapped into themselves. Characters with value 0x0 and in the range 0x80 - 0x7ff are mapped into two bytes. All other characters are mapped into three bytes.

For more information about charsets and encodings, see section 9.7.1, Character Encodings, in "The Java Programming Language Third Edition" by Arnold, Gosling, and Holmes. Also see the documentation for Supported Encodings and Charset. The document Unicode Transformation Formats: UTF-8 & Co. is another good place to learn about charsets and encodings.

USING REFLECTION TO CREATE CLASS INSTANCES

Imagine that you're doing some Java programming, and you need to create a new instance of the A class. You write some code like this:
A aref = new A();
Pretty obvious, right? Suppose, however, you take a step further and specify that the name of the class is found in a string made available at run time. It's still possible to proceed, like this:
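A sketch of this kind of code is shown below. The NameSwitch class, its create method, and the B class are hypothetical names used only for illustration. Each class the program can create needs its own branch:

```java
class NameSwitch {
    static class A { }
    static class B { }

    // Maps a class name to a new instance by testing each known name.
    static Object create(String classname) {
        if (classname.equals("A")) {
            return new A();
        } else if (classname.equals("B")) {
            return new B();
        } else {
            throw new IllegalArgumentException("unknown class: " + classname);
        }
    }

    public static void main(String[] args) {
        String classname = (args.length > 0) ? args[0] : "A";
        Object obj = create(classname);
        System.out.println("created " + obj.getClass().getName());
    }
}
```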
This code works, but it's cumbersome. Also, it can't be expanded much further without major effort.
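There's another approach that works much better in this kind of situation. The basic idea is that you use Class.forName to obtain a Class object for the named class, and then call newInstance on that object: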
Class cls = Class.forName(classname);
Object obj = cls.newInstance();
This sequence creates an object of the class whose string name is classname. The newInstance method uses the class's no-argument constructor.
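As a self-contained illustration, the sequence might be packaged like this. The ForNameDemo class and its create method are hypothetical. The sketch uses getDeclaredConstructor().newInstance(), because Class.newInstance has been deprecated since Java 9; on the 1.4 release this tip was written for, the two-line form shown above is the one to use:

```java
class ForNameDemo {
    // Creates an instance of the named class through its no-argument
    // constructor. The name must be fully qualified.
    static Object create(String classname)
            throws ReflectiveOperationException {
        Class<?> cls = Class.forName(classname);
        return cls.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args)
            throws ReflectiveOperationException {
        String classname =
            (args.length > 0) ? args[0] : "java.util.ArrayList";
        Object obj = create(classname);
        System.out.println("created " + obj.getClass().getName());
    }
}
```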
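After you have a Class object, you can also use it to look up the class's constructors and methods by name and parameter types.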
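Let's look at an example to make these ideas a little more concrete. The example uses a driver program, NewDemo, that takes class and method names from the command line. If you run the driver like this: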
$ java NewDemo A string1 string2 @ f2 string3 string4 string5
the driver creates an object of class A, using string1/string2 as string arguments to the A constructor. The driver then calls A.f2 for the created object, using string3/string4/string5 as arguments to the f2 method.
Note that the driver program doesn't know anything about the A class. It's written in a general way to work with any class. The driver looks up and manipulates class and method names using reflection. Here's what the code looks like:
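The sketch below shows one way such a driver might be written. Only the class name NewDemo and the command-line layout (class name, constructor arguments, "@", method name, method arguments) come from this tip; the run and stringTypes helpers and the message text are assumptions. All arguments are treated as String values:

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;
import java.util.Arrays;

class NewDemo {
    // Builds an array of n parameter types, all of them String.
    static Class<?>[] stringTypes(int n) {
        Class<?>[] types = new Class<?>[n];
        Arrays.fill(types, String.class);
        return types;
    }

    // Creates the object, calls the method, and returns the method's
    // return value (null for a void method).
    static Object run(String[] args) throws Exception {
        int at = Arrays.asList(args).indexOf("@");
        if (at < 1 || at + 1 >= args.length) {
            throw new IllegalArgumentException(
                "usage: java NewDemo class [args] @ method [args]");
        }

        // Everything between the class name and "@" goes to the constructor.
        Object[] ctorArgs = Arrays.copyOfRange(args, 1, at);
        Class<?> cls = Class.forName(args[0]);
        Constructor<?> ctor = cls.getConstructor(stringTypes(ctorArgs.length));
        Object obj = ctor.newInstance(ctorArgs);

        // Everything after the method name goes to the method.
        Object[] methArgs = Arrays.copyOfRange(args, at + 2, args.length);
        Method meth = cls.getMethod(args[at + 1], stringTypes(methArgs.length));
        return meth.invoke(obj, methArgs);
    }

    public static void main(String[] args) {
        try {
            System.out.println("return value: " + run(args));
        } catch (Exception e) {
            System.out.println("error: " + e);
        }
    }
}
```

Because the driver names no class of its own, it can be tried on library classes as well, for example: java NewDemo java.lang.String hello @ concat there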
Here is a test class you can use with the demo:
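A sketch of a suitable test class follows. The class name A and the method names f1 and f2 come from the sample runs in this tip; the method bodies, the value that f2 returns, and the exact message text are assumptions:

```java
class A {
    // No-argument constructor.
    public A() {
        System.out.println("call: A.A()");
    }

    // Constructor taking two strings.
    public A(String s1, String s2) {
        System.out.println("call: A.A(" + s1 + ", " + s2 + ")");
    }

    // A void method, so a reflective call to it returns null.
    public void f1() {
        System.out.println("call: A.f1()");
    }

    // A method taking three strings; returning their concatenation is
    // an arbitrary choice made for this sketch.
    public String f2(String s1, String s2, String s3) {
        System.out.println("call: A.f2(" + s1 + ", " + s2 + ", " + s3 + ")");
        return s1 + s2 + s3;
    }
}
```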
You need to compile this class in the usual way.
The constructor and method are found by creating an array of Class objects that describes the parameter types, and using that array in the lookup. If you run the driver, by saying:

$ java NewDemo A @ f1
The output is:
call: A.A()
call: A.f1()
return value: null
Here are additional driver runs:
And here are their respective results:
Some examples of driver runs with bad input are:
The results are:
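As a hedged illustration of the kinds of failures that bad input can lead to, the sketch below reports which reflection lookup failed. The BadInputDemo class, the tryLookup method, and the message text are hypothetical and are not the driver's actual output. Class.forName throws ClassNotFoundException for an unknown class name, and getMethod throws NoSuchMethodException for an unknown method:

```java
class BadInputDemo {
    // Tries to find a public no-argument method in the named class and
    // describes the outcome.
    static String tryLookup(String classname, String methodname) {
        try {
            Class<?> cls = Class.forName(classname);
            cls.getMethod(methodname);
            return "ok";
        } catch (ClassNotFoundException e) {
            return "no such class: " + classname;
        } catch (NoSuchMethodException e) {
            return "no such method: " + methodname;
        }
    }

    public static void main(String[] args) {
        System.out.println(tryLookup("NoSuchClassXyz", "f1"));
        System.out.println(tryLookup("java.lang.String", "nope"));
        System.out.println(tryLookup("java.lang.String", "length"));
    }
}
```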
The techniques illustrated here are extremely powerful, and allow you to manipulate types and methods by name at run time. These techniques are used by tools such as interpreters, debuggers, and object exercisers.

For more information about using reflection to create class instances see section 11.2.1, The Class class, and section 11.2.6, The Method Class, in "The Java Programming Language Third Edition" by Arnold, Gosling, and Holmes.

Copyright 2003 Sun Microsystems, Inc. All rights reserved. 901 San Antonio Road, Palo Alto, California 94303 USA. This document is protected by copyright. For more information, see: http://java.sun.com/jdc/copyright.html

Sun, Sun Microsystems, Java, Java Developer Connection, J2SE, J2EE, and J2ME are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.