In This Issue
Welcome to the Core Java Technologies Tech Tips for February 2007. Core Java Technologies Tech Tips provide tips and
hints for using core Java technologies and APIs provided in the Java Platform, Standard Edition (Java SE).
This issue provides tips for the following:
» Text Normalization
» Monitoring Image I/O Events
NOTE: This document's text is encoded as UTF-8. You may need to set this encoding preference in your document reader to view some of the characters correctly.
Text Normalization
by Sergey Groznyh
Text normalization is a text transformation that makes the text
consistent with some pre-defined rules. Examples include white-space
stripping, punctuation removal, uppercase/lowercase translation, and so
on. This tech tip discusses one important form of text normalization,
Unicode text normalization.
Unless otherwise specified, the term Unicode
in this article refers to Unicode 4.0
because this is the Unicode version supported by the Java SE 6
platform.
The Unicode standard defines two equivalences between characters
and sequences of characters. They are the canonical equivalence and the
compatibility equivalence. One example of canonical equivalence is a
precomposed character and its equivalent combining sequence. For
example, the Unicode character 'Ç' (LATIN CAPITAL
LETTER C WITH CEDILLA) has the Unicode character value U+00C7.
The Unicode character sequence U+0043 U+0327 also creates
the 'Ç' character. The sequence contains the character values
for LATIN CAPITAL LETTER C followed by the COMBINING CEDILLA.
The single character and the character sequence are canonically equivalent
because they are visually indistinguishable and mean exactly the same
for the purposes of text comparison and rendering.
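The equivalence can be checked with a short sketch that uses the java.text.Normalizer class described later in this tip. The class name CanonicalEquivalence and the helper equalAfterNfc are illustrative names chosen for this example, not part of any API.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class CanonicalEquivalence {
    // Hypothetical helper: compares two strings after normalizing both to NFC.
    static boolean equalAfterNfc(String a, String b) {
        return Normalizer.normalize(a, Form.NFC)
                .equals(Normalizer.normalize(b, Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00C7";  // LATIN CAPITAL LETTER C WITH CEDILLA
        String combining = "C\u0327";   // LATIN CAPITAL LETTER C + COMBINING CEDILLA

        // A plain comparison sees two different char sequences.
        System.out.println(precomposed.equals(combining));         // false
        // After normalization the two forms compare as equal.
        System.out.println(equalAfterNfc(precomposed, combining)); // true
    }
}
```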
Compatibility equivalence, on the other hand, deals mostly with
legacy character sets which define alternate visual representations of
the same character or character sequence. An example of a compatibility
equivalence is equivalence between the DIGIT TWO character
'2' (U+0032) and the SUPERSCRIPT TWO character '²'
(U+00B2). Both characters exist in the ISO/IEC
8859-1 (Latin-1) character set.
The DIGIT TWO character '2' and the SUPERSCRIPT TWO
character '²' are compatibility equivalent because they are
variants of the same basic character. Because the characters are
visually distinguishable and have additional semantic information,
the characters are not canonically equivalent.
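As a rough illustration of the difference, the following sketch normalizes SUPERSCRIPT TWO with a canonical form and with a compatibility form, using the java.text.Normalizer class described later in this tip. The class and method names are made up for the example.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class CompatibilityEquivalence {
    // Hypothetical helpers wrapping the two kinds of normalization.
    static String canonical(String s) {
        return Normalizer.normalize(s, Form.NFC);
    }

    static String compatibility(String s) {
        return Normalizer.normalize(s, Form.NFKC);
    }

    public static void main(String[] args) {
        String superscriptTwo = "\u00B2"; // SUPERSCRIPT TWO

        // The canonical form keeps the superscript, because it is not
        // canonically equivalent to DIGIT TWO.
        System.out.println(canonical(superscriptTwo).equals(superscriptTwo)); // true
        // The compatibility form replaces it with the plain digit.
        System.out.println(compatibility(superscriptTwo).equals("2"));        // true
    }
}
```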
Unicode text normalization is a process of translating characters
and character sequences from one equivalent form into another. Unicode
defines four normalization forms.
- NFC: Normalization Form Canonical Composition. Characters are
  decomposed and then recomposed by canonical equivalence. For example,
  sequences like "letter+combining marks" are composed to form a single
  character if possible.
- NFD: Normalization Form Canonical Decomposition. Characters are
  decomposed by canonical equivalence. For example, the precomposed
  character 'Ç' (U+00C7) transforms to a combining sequence
  containing a base character and a combining accent.
- NFKC: Normalization Form Compatibility Composition. Characters are
  decomposed by compatibility equivalence, then recomposed by canonical
  equivalence.
- NFKD: Normalization Form Compatibility Decomposition. Characters are
  decomposed by compatibility equivalence. For example, the fraction
  '½' (U+00BD) transforms into a sequence of three characters:
  '1' (U+0031), FRACTION SLASH (U+2044), and '2' (U+0032).
The Normalizer class
Java SE 6 supports Unicode text normalization by providing the now
public class
java.text.Normalizer.
This class defines both the normalize method that
transforms text and the Form enumeration that represents
the Unicode normalization forms NFC, NFD, NFKC, and NFKD.
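To see what each form does to a given string, it can help to print the resulting code points. The following is a minimal sketch; the class name ShowForms and the helper codePoints are assumptions made for this example.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class ShowForms {
    // Hypothetical helper: formats each code point of s as U+XXXX.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Precomposed C with cedilla, its combining sequence, and the
        // VULGAR FRACTION ONE HALF character.
        String[] samples = { "\u00C7", "C\u0327", "\u00BD" };
        for (String s : samples) {
            System.out.println("input " + codePoints(s));
            for (Form f : Form.values()) {
                String normalized = Normalizer.normalize(s, f);
                System.out.println("  " + f + " -> " + codePoints(normalized));
            }
        }
    }
}
```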
Possible applications of various Unicode normalization forms are
shown below:
Example: NFC
Suppose you want to publish a document on the Web. The Character Model for the
World Wide Web
specification recommends that in order to improve
indexing, searching and other text related functionality of the Web,
data should be normalized before publishing (early normalization). The
specification states that NFC is preferred because almost all legacy
data as well as data created by current software is already normalized
to NFC. The following code reads data from standard input and writes
NFC-normalized data to standard output. The UTF-8 encoding is used for
both input and output.
import java.io.*;
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class NFC {
    public static void main(String[] args) {
        final String INPUT_ENC = "UTF-8";
        final String OUTPUT_ENC = "UTF-8";
        try {
            BufferedReader r = new BufferedReader(
                new InputStreamReader(System.in, INPUT_ENC));
            PrintWriter w = new PrintWriter(
                new OutputStreamWriter(System.out, OUTPUT_ENC), true);
            String s;
            while ((s = r.readLine()) != null) {
                w.println(Normalizer.normalize(s, Form.NFC));
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
NFC normalization is also well suited for string equality
tests. For comparisons that determine sort order, however, use the
java.text.Collator class, initialized with the appropriate locale.
The reason for using the Collator class is
that sorting order for accented letters differs in different languages.
For sorting purposes, some languages place accented letters right after
the base letter, and some place accented letters after all base
letters.
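A locale-aware sort might look like the following sketch. The class name CollatorSort, the helper sorted, and the sample words are assumptions made for this example; the resulting order depends on the locale's collation rules, so no particular output is claimed here.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollatorSort {
    // Hypothetical helper: returns a copy of words sorted by the locale's rules.
    static String[] sorted(String[] words, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        // Decompose canonical variants so that equivalent sequences
        // are treated alike while collating.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        String[] copy = words.clone();
        Arrays.sort(copy, collator);
        return copy;
    }

    public static void main(String[] args) {
        // cote, côte, coté, côté written with Unicode escapes.
        String[] words = { "c\u00F4t\u00E9", "cote", "cot\u00E9", "c\u00F4te" };
        System.out.println(Arrays.toString(sorted(words, Locale.ENGLISH)));
        System.out.println(Arrays.toString(sorted(words, Locale.FRENCH)));
    }
}
```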
Example: NFD
Suppose you are developing a phone directory application. You store
the directory data in some database and have a search form to look up
the data. Because people's names around the world contain accented characters,
you have two problems: many databases handle accented characters poorly,
and many users of your application will not bother to enter, or simply
cannot enter, the correct (accented) names into the search form of your
application. So you must remove all accents from both the data stored
in the database and the data read from the search form.
The following code reads standard input line by line, strips
accented characters from each line and writes the result to standard
output. The UTF-8 encoding is used for both input and output.
import java.io.*;
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class NFD {
    public static void main(String[] args) {
        final String INPUT_ENC = "UTF-8";
        final String OUTPUT_ENC = "UTF-8";
        try {
            BufferedReader r = new BufferedReader(
                new InputStreamReader(System.in, INPUT_ENC));
            PrintWriter w = new PrintWriter(
                new OutputStreamWriter(System.out, OUTPUT_ENC), true);
            String s;
            while ((s = r.readLine()) != null) {
                // decompose and remove accents
                String decomposed = Normalizer.normalize(s, Form.NFD);
                String accentsGone =
                    decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
                w.println(accentsGone);
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
Example: NFKC
NFKC normalization affects characters with combining marks that
have a compatibility decomposition form. So, the character sequence
U+1E9B U+0323 (LATIN SMALL
LETTER LONG S WITH DOT ABOVE followed by the COMBINING DOT
BELOW) is transformed to the single character value U+1E69.
The normalized character is 'ṩ' (LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE).
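The transformation described above can be reproduced with a few lines of code. This is only a sketch; the class name LongS and the helper toNfkc are illustrative.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class LongS {
    // Hypothetical helper: normalizes the text to NFKC.
    static String toNfkc(String s) {
        return Normalizer.normalize(s, Form.NFKC);
    }

    public static void main(String[] args) {
        // LATIN SMALL LETTER LONG S WITH DOT ABOVE + COMBINING DOT BELOW
        String input = "\u1E9B\u0323";
        String output = toNfkc(input);
        // The two input characters collapse into one code point.
        System.out.println(output.length());                                // 1
        System.out.println(String.format("U+%04X", output.codePointAt(0))); // U+1E69
    }
}
```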
This normalization form is required to comply with the string
profile specification for Internationalized Domain Names (RFC 3491). If a
domain name contains non-ASCII characters, it must be normalized to the
NFKC form. So if you are building an application that registers
internationalized domain names, you must normalize the names to this form.
Note that the Internationalized Domain Names specification is based on Unicode
version 3.2, and there are differences in normalization forms for some
CJK ideographic characters between Unicode versions 3.2 and 4.0. If you
are not implementing RFC 3491 yourself and just want to get the normalized
domain name, you may use the facilities provided by the java.net.IDN
class.
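As a minimal sketch of the java.net.IDN route, the following code converts a domain name containing a non-ASCII character to its ASCII-compatible form and back. The class name IdnDemo, the helper toAscii, and the sample domain name are assumptions made for the example.

```java
import java.net.IDN;

public class IdnDemo {
    // Hypothetical helper: returns the ASCII-compatible form of a domain name.
    static String toAscii(String domainName) {
        return IDN.toASCII(domainName);
    }

    public static void main(String[] args) {
        // "bücher.example" written with a Unicode escape for the u-umlaut.
        String unicodeName = "b\u00FCcher.example";
        String asciiName = toAscii(unicodeName);
        System.out.println(asciiName); // xn--bcher-kva.example
        // Convert back to the Unicode form for display.
        System.out.println(IDN.toUnicode(asciiName).equals(unicodeName));
    }
}
```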
The normalization process is similar to that shown in the NFC example.
The only difference is that the Form.NFKC normalization form
should be used instead of Form.NFC.
Example: NFKD
This form of normalization is useful when legacy text data is
converted to XML format. The Unicode in XML and other
Markup Languages
specification defines several rules for dealing
with compatibility characters. For example, it recommends using <sup>
and <sub> markup for superscripts and subscripts,
using MathML markup for
expressing fractions, using list item marker styles instead of circled
digits, and so on. If you are building an application that converts
legacy data to XML, you should consider applying the appropriate markup and/or
styles to text data that has been normalized to NFKD.
In order to convert data to NFKD form, you should pass Form.NFKD
as the second parameter to the Normalizer.normalize
method:
Normalizer.normalize(s, Form.NFKD);
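One way to combine NFKD with markup is sketched below for superscript digits only. The class name SuperscriptMarkup and its methods are hypothetical, and a real converter would need to cover many more compatibility characters.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class SuperscriptMarkup {
    // Superscript digits: 1, 2, 3 in Latin-1, and 0, 4 through 9
    // at U+2070 and U+2074..U+2079.
    static boolean isSuperscriptDigit(int cp) {
        return cp == 0x00B9 || cp == 0x00B2 || cp == 0x00B3
                || cp == 0x2070 || (cp >= 0x2074 && cp <= 0x2079);
    }

    // Hypothetical helper: replaces each superscript digit with its
    // NFKD form (the plain digit) wrapped in <sup> markup.
    static String markUp(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (isSuperscriptDigit(cp)) {
                String single = new String(Character.toChars(cp));
                String digit = Normalizer.normalize(single, Form.NFKD);
                out.append("<sup>").append(digit).append("</sup>");
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(markUp("E = mc\u00B2")); // E = mc<sup>2</sup>
    }
}
```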
Normalization Testing
The java.text.Normalizer class defines the isNormalized
method, which checks whether a given character sequence is normalized
according to one of the four normalization forms. The following code
reads lines from standard input and reports whether the line is
normalized to any of the four forms. Input is UTF-8 encoded.
import java.io.*;
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class IsNormalized {
    public static void main(String[] args) {
        final String INPUT_ENC = "UTF-8";
        final Form[] forms = { Form.NFC, Form.NFD, Form.NFKC, Form.NFKD };
        try {
            BufferedReader r = new BufferedReader(
                new InputStreamReader(System.in, INPUT_ENC));
            String s;
            int line = 1;
            while ((s = r.readLine()) != null) {
                System.out.printf("%5d:", line++);
                for (Form f : forms) {
                    if (Normalizer.isNormalized(s, f)) {
                        System.out.print(" " + f.toString());
                    }
                }
                System.out.println();
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
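The isNormalized method can also serve as a cheap check before normalizing, so that text already in the desired form is returned unchanged. The following sketch assumes that this is worthwhile for your data; the class name NormalizeIfNeeded and the helper toNfc are illustrative.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class NormalizeIfNeeded {
    // Hypothetical helper: normalizes to NFC only when the text is not
    // already in that form.
    static String toNfc(CharSequence text) {
        if (Normalizer.isNormalized(text, Form.NFC)) {
            return text.toString();
        }
        return Normalizer.normalize(text, Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(toNfc("abc"));                       // abc
        System.out.println(toNfc("C\u0327").equals("\u00C7")); // true
    }
}
```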
For More Information
The following documents provide more information about normalization:
Monitoring Image I/O Events
by Brian Burkhalter
The Java Image I/O API provides a framework for working
with images on the Java platform. Within this framework, the
javax.imageio.event
package defines several interfaces for
monitoring synchronous events that are emitted while reading or writing
images.
You can use these interfaces for the following purposes:
- monitor the progress of image reading
- receive notifications as each region of an image is read
- trap warning messages about image reading
Each of these varieties of event monitoring is enabled by
registering a listener with the ImageReader
being used. You can obtain an ImageReader instance from
the ImageIO
class. The following code shows how to retrieve a reader for the TIFF
image format:
// Remains null if no reader for the format is registered.
ImageReader reader = null;
Iterator<ImageReader> readers =
    ImageIO.getImageReadersByMIMEType("image/tiff");
if (readers.hasNext()) {
    reader = readers.next();
}
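Whether a reader is found depends on which Image I/O plug-ins are installed, so it is worth handling the case where none is registered. The following sketch wraps the lookup in a helper; the class name FindReader and the method firstReader are assumptions made for the example, and PNG is used in addition to TIFF because a PNG reader ships with the platform.

```java
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import java.util.Iterator;

public class FindReader {
    // Hypothetical helper: returns the first registered reader for the
    // MIME type, or null if no plug-in handles it.
    static ImageReader firstReader(String mimeType) {
        Iterator<ImageReader> readers =
            ImageIO.getImageReadersByMIMEType(mimeType);
        return readers.hasNext() ? readers.next() : null;
    }

    public static void main(String[] args) {
        String[] types = { "image/png", "image/tiff" };
        for (String type : types) {
            ImageReader reader = firstReader(type);
            System.out.println(type + ": "
                + (reader != null ? reader.getClass().getName() : "no reader"));
        }
    }
}
```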
Monitoring reading progress
The
IIOReadProgressListener interface is used to monitor image reading
progress. An implementation of IIOReadProgressListener is
registered with an ImageReader using the addIIOReadProgressListener
method. For example, the following code uses a ProgressMonitor
to display image reading progress:
// These variables are assumed to be initialized elsewhere.
Component parentComponent;
String imageName;
ImageReader reader;

// Create a ProgressMonitor which displays percentage completed.
final ProgressMonitor pm =
    new ProgressMonitor(parentComponent, imageName, "0 %", 0, 100);

// Register an anonymous inner class implementing IIOReadProgressListener
reader.addIIOReadProgressListener(new IIOReadProgressListener() {
    // Close the ProgressMonitor if the read is aborted.
    public void readAborted(ImageReader source) {
        pm.close();
    }

    public void imageStarted(ImageReader source,
                             int imageIndex) {
        // Abort the read if "cancel" pressed.
        if (pm.isCanceled()) {
            source.abort();
        }
    }

    // Set image progress to 100% upon completion.
    public void imageComplete(ImageReader source) {
        imageProgress(source, 100.0F);
    }

    // Update the progress bar and its label each time the reader
    // notifies the IIOReadProgressListener of a new percentage.
    public void imageProgress(ImageReader source, float percentageDone) {
        // Abort the read if "cancel" pressed.
        if (pm.isCanceled()) {
            source.abort();
            return;
        }
        // Update the progress and label.
        final int nv = (int)percentageDone;
        SwingUtilities.invokeLater(new Runnable() {
            public void run() {
                pm.setProgress(nv);
                pm.setNote(nv + " %");
            }
        });
    }

    // The thumbnail and sequence events are not used in this example.
    public void thumbnailStarted(ImageReader source, int imageIndex,
                                 int thumbnailIndex) {}
    public void thumbnailProgress(ImageReader source, float percentageDone) {}
    public void thumbnailComplete(ImageReader source) {}
    public void sequenceStarted(ImageReader source, int minIndex) {}
    public void sequenceComplete(ImageReader source) {}
});
The ProgressMonitor displays a progress bar
that fills from left to right as image loading progresses. When
image reading begins, the reader invokes the imageStarted
method, and the
listener checks whether the ProgressMonitor's "cancel"
button has been pressed. If it has, the listener aborts the read.
The reader then calls the readAborted method, which closes the
ProgressMonitor.
During reading, the reader will periodically invoke the imageProgress
method with an updated progress value. After checking
whether reading has been canceled, the listener updates the ProgressMonitor
with the new progress value and percentage value string. Note that the
listener updates the ProgressMonitor
on the Swing event thread, not on the thread which invoked the imageProgress
method. When image reading is complete, the reader invokes the imageComplete
method to set the percentage completion value to 100.
The following image shows the ProgressMonitor as it
updates its progress bar:

You can use the thumbnail methods of the IIOReadProgressListener
interface to monitor the progress of thumbnail
loading when applicable.
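To observe the order of the callbacks without a user interface, a listener can simply record each event. The following sketch encodes a blank image as PNG in memory and reads it back; the class name ReadProgressLog and the helper readWithProgressLog are assumptions made for this example, and PNG is used only because a PNG plug-in ships with the platform.

```java
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.event.IIOReadProgressListener;
import javax.imageio.stream.ImageInputStream;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class ReadProgressLog {
    // Hypothetical helper: returns the names of the progress callbacks
    // in the order in which the reader invoked them.
    static List<String> readWithProgressLog(int width, int height) {
        final List<String> events = new ArrayList<String>();
        try {
            ImageIO.setUseCache(false); // keep all streams in memory
            BufferedImage blank =
                new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
            ByteArrayOutputStream png = new ByteArrayOutputStream();
            ImageIO.write(blank, "png", png);

            ImageReader reader = ImageIO.getImageReadersByFormatName("png").next();
            reader.addIIOReadProgressListener(new IIOReadProgressListener() {
                public void sequenceStarted(ImageReader s, int minIndex) { events.add("sequenceStarted"); }
                public void sequenceComplete(ImageReader s) { events.add("sequenceComplete"); }
                public void imageStarted(ImageReader s, int imageIndex) { events.add("imageStarted"); }
                public void imageProgress(ImageReader s, float percentageDone) { events.add("imageProgress"); }
                public void imageComplete(ImageReader s) { events.add("imageComplete"); }
                public void thumbnailStarted(ImageReader s, int imageIndex, int thumbnailIndex) {}
                public void thumbnailProgress(ImageReader s, float percentageDone) {}
                public void thumbnailComplete(ImageReader s) {}
                public void readAborted(ImageReader s) { events.add("readAborted"); }
            });

            ImageInputStream in = ImageIO.createImageInputStream(
                new ByteArrayInputStream(png.toByteArray()));
            try {
                reader.setInput(in);
                reader.read(0);
            } finally {
                reader.dispose();
                in.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(readWithProgressLog(64, 64));
    }
}
```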
Receiving region update notifications
The IIOReadUpdateListener interface is used to receive notifications
when each region of an image is read. An image region could consist of
a single row of pixels for scanline-oriented imagery or of a single tile
for tiled imagery. Use the addIIOReadUpdateListener method to
register an IIOReadUpdateListener implementation with an ImageReader
instance.
For example, an image display component might implement IIOReadUpdateListener
as follows:
public void imageUpdate(ImageReader reader, BufferedImage image,
                        int minX, int minY, int width, int height,
                        int periodX, int periodY, int[] bands) {
    // Set the displayed image to the parameter image.
    setImage(image);
    // Repaint the sections of the image just updated.
    repaint(0L, minX, minY, width, height);
}

// The remaining interface methods are not used by this component.
public void passStarted(ImageReader reader, BufferedImage image,
                        int pass, int minPass, int maxPass,
                        int minX, int minY,
                        int periodX, int periodY, int[] bands) {}
public void passComplete(ImageReader reader, BufferedImage image) {}
public void thumbnailPassStarted(ImageReader reader, BufferedImage image,
                                 int pass, int minPass, int maxPass,
                                 int minX, int minY,
                                 int periodX, int periodY, int[] bands) {}
public void thumbnailUpdate(ImageReader reader, BufferedImage image,
                            int minX, int minY, int width, int height,
                            int periodX, int periodY, int[] bands) {}
public void thumbnailPassComplete(ImageReader reader,
                                  BufferedImage image) {}
When an ImageReader invokes the imageUpdate
method, the method first calls the setImage method to
perform any initialization that might be required for the display
component:
/** Instance variable for the image being displayed. */
BufferedImage theImage = null;

/** Initialize the display component using the provided image. */
protected void setImage(BufferedImage image) {
    if (image != theImage) {
        // Save the reference.
        theImage = image;
        // Initialize the display component as needed.
        // --- CODE OMITTED ---
    }
}
The display component then calls the repaint method to
draw the
image region that has just been loaded.
Being notified when an image region is updated is especially
important when reading large images. The update notification permits
the display component to paint each region as it is read rather than
wait for the entire image to load before drawing any of the
image.
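The number of update notifications depends on the image format and the reader plug-in. As a rough way to observe them without a display component, the following sketch counts imageUpdate calls while reading a PNG image held in memory. The class name UpdateCounter and the helper countImageUpdates are assumptions made for the example.

```java
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.event.IIOReadUpdateListener;
import javax.imageio.stream.ImageInputStream;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class UpdateCounter {
    // Hypothetical helper: returns how many times the reader reported
    // that a region of the image had been decoded.
    static int countImageUpdates(int width, int height) {
        final int[] updates = { 0 };
        try {
            ImageIO.setUseCache(false); // keep all streams in memory
            BufferedImage blank =
                new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
            ByteArrayOutputStream png = new ByteArrayOutputStream();
            ImageIO.write(blank, "png", png);

            ImageReader reader = ImageIO.getImageReadersByFormatName("png").next();
            reader.addIIOReadUpdateListener(new IIOReadUpdateListener() {
                public void imageUpdate(ImageReader r, BufferedImage image,
                                        int minX, int minY, int w, int h,
                                        int periodX, int periodY, int[] bands) {
                    updates[0]++;
                }
                public void passStarted(ImageReader r, BufferedImage image,
                                        int pass, int minPass, int maxPass,
                                        int minX, int minY,
                                        int periodX, int periodY, int[] bands) {}
                public void passComplete(ImageReader r, BufferedImage image) {}
                public void thumbnailPassStarted(ImageReader r, BufferedImage image,
                                                 int pass, int minPass, int maxPass,
                                                 int minX, int minY,
                                                 int periodX, int periodY, int[] bands) {}
                public void thumbnailUpdate(ImageReader r, BufferedImage image,
                                            int minX, int minY, int w, int h,
                                            int periodX, int periodY, int[] bands) {}
                public void thumbnailPassComplete(ImageReader r, BufferedImage image) {}
            });

            ImageInputStream in = ImageIO.createImageInputStream(
                new ByteArrayInputStream(png.toByteArray()));
            try {
                reader.setInput(in);
                reader.read(0);
            } finally {
                reader.dispose();
                in.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return updates[0];
    }

    public static void main(String[] args) {
        System.out.println("imageUpdate calls: " + countImageUpdates(64, 64));
    }
}
```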
Trapping warning messages
The IIOReadWarningListener interface is used to trap warning messages
emitted by an ImageReader while reading an image. An
implementation of IIOReadWarningListener is registered
with an ImageReader using the addIIOReadWarningListener()
method. For example, this code uses a JDialog
to display the text of warning messages:
// These variables are assumed to be initialized elsewhere.
Component parentComponent;
String imageName;
ImageReader reader;

// Create a text area to contain the warning message(s).
final JTextArea text = new JTextArea();
text.setColumns(60);
text.setLineWrap(true);
text.setWrapStyleWord(true);

// Create a warning option pane to contain the text area.
JOptionPane opt = new JOptionPane(new JScrollPane(text),
    JOptionPane.WARNING_MESSAGE, JOptionPane.DEFAULT_OPTION);

// Create a non-modal dialog for the option pane.
final JDialog dialog =
    opt.createDialog(JOptionPane.getFrameForComponent(parentComponent),
                     "Warnings: " + imageName);
dialog.setModal(false);

// Register an anonymous inner class implementing IIOReadWarningListener
reader.addIIOReadWarningListener(new IIOReadWarningListener() {
    public void warningOccurred(ImageReader source,
                                final String warning) {
        // Append the current warning to the text area.
        SwingUtilities.invokeLater(new Runnable() {
            public void run() {
                text.append("[WARNING]: " + warning + "\n");
                dialog.pack();
                dialog.setVisible(true);
            }
        });
    }
});
When the ImageReader emits a warning message, it invokes
the warningOccurred
method, which appends the message to the text area and displays the
warning dialog. Note that the warning dialog is
updated on the Swing event thread, not on the thread which invoked the warningOccurred
method.
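In an application without a user interface, the same interface can be used to collect warnings for later logging. The following sketch is one possible approach; the class name WarningCollector and its getWarnings method are hypothetical names chosen for this example.

```java
import javax.imageio.ImageReader;
import javax.imageio.event.IIOReadWarningListener;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class WarningCollector implements IIOReadWarningListener {
    // Synchronized because the reader may run on a different thread
    // than the code that later inspects the warnings.
    private final List<String> warnings =
        Collections.synchronizedList(new ArrayList<String>());

    // Invoked by the ImageReader for each non-fatal problem it finds.
    public void warningOccurred(ImageReader source, String warning) {
        warnings.add(warning);
    }

    // Returns a snapshot of the warnings collected so far.
    public List<String> getWarnings() {
        return new ArrayList<String>(warnings);
    }
}
```

An instance would be registered with reader.addIIOReadWarningListener(collector) before calling the reader's read method, and getWarnings() inspected afterward.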
The following image shows the warning dialog:

Monitoring image writing events
Monitoring Image I/O events while writing images is very similar to
monitoring events while reading images. The IIOWriteProgressListener
and IIOWriteWarningListener interfaces are analogs of IIOReadProgressListener
and IIOReadWarningListener, respectively. Use them with
an ImageWriter in the same way that the read listeners are used
with an ImageReader, as described above.
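As a minimal sketch of the write side, the following code registers an IIOWriteProgressListener with a PNG ImageWriter and records the callbacks it receives while encoding an image to memory. The class name WriteProgressLog and the helper writeWithProgressLog are assumptions made for the example; which events fire, and how often, depends on the writer plug-in.

```java
import javax.imageio.ImageIO;
import javax.imageio.ImageWriter;
import javax.imageio.event.IIOWriteProgressListener;
import javax.imageio.stream.ImageOutputStream;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class WriteProgressLog {
    // Hypothetical helper: writes the image as PNG to memory and returns
    // the names of the progress callbacks in the order they were invoked.
    static List<String> writeWithProgressLog(BufferedImage image) {
        final List<String> events = new ArrayList<String>();
        try {
            ImageIO.setUseCache(false); // keep all streams in memory
            ImageWriter writer = ImageIO.getImageWritersByFormatName("png").next();
            writer.addIIOWriteProgressListener(new IIOWriteProgressListener() {
                public void imageStarted(ImageWriter w, int imageIndex) { events.add("imageStarted"); }
                public void imageProgress(ImageWriter w, float percentageDone) { events.add("imageProgress"); }
                public void imageComplete(ImageWriter w) { events.add("imageComplete"); }
                public void thumbnailStarted(ImageWriter w, int imageIndex, int thumbnailIndex) {}
                public void thumbnailProgress(ImageWriter w, float percentageDone) {}
                public void thumbnailComplete(ImageWriter w) {}
                public void writeAborted(ImageWriter w) { events.add("writeAborted"); }
            });

            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ImageOutputStream out = ImageIO.createImageOutputStream(bytes);
            try {
                writer.setOutput(out);
                writer.write(image);
            } finally {
                writer.dispose();
                out.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        BufferedImage blank = new BufferedImage(64, 64, BufferedImage.TYPE_INT_RGB);
        System.out.println(writeWithProgressLog(blank));
    }
}
```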
For More Information
The following documents provide additional information about the Image I/O API: