Core Java Technologies Tech Tips

Text Normalization and Monitoring Image I/O Events

 
In This Issue

Welcome to the Core Java Technologies Tech Tips for February 2007. Core Java Technologies Tech Tips provide tips and hints for using core Java technologies and APIs provided in the Java Platform, Standard Edition (Java SE).

This issue provides tips on the following:

» Text Normalization
» Monitoring Image I/O Events

NOTE: This document's text is encoded as UTF-8. You may need to set this encoding preference in your document reader to view some of the characters correctly.

Text Normalization

by Sergey Groznyh

Text normalization is a text transformation that makes the text consistent with some pre-defined rules. Examples include white-space stripping, punctuation removal, uppercase/lowercase translation, and so on. This tech tip discusses one important form of text normalization, Unicode text normalization. Unless otherwise specified, the term Unicode in this article refers to Unicode 4.0 because this is the Unicode version supported by the Java SE 6 platform.

The Unicode standard defines two equivalences between characters and sequences of characters. They are the canonical equivalence and the compatibility equivalence. One example of canonical equivalence is a precomposed character and its equivalent combining sequence. For example, the Unicode character 'Ç' (LATIN CAPITAL LETTER C WITH CEDILLA) has the Unicode character value U+00C7. The Unicode character sequence U+0043 U+0327 also creates the 'Ç' character. The sequence contains the character values for LATIN CAPITAL LETTER C followed by the COMBINING CEDILLA. The single character and the character sequence are canonically equivalent because they are visually indistinguishable and mean exactly the same for the purposes of text comparison and rendering.

Compatibility equivalence, on the other hand, deals mostly with legacy character sets which define alternate visual representations of the same character or character sequence. An example of a compatibility equivalence is the equivalence between the DIGIT TWO character '2' (U+0032) and the SUPERSCRIPT TWO character '²' (U+00B2). Both characters exist in the character set ISO/IEC 8859-1 (Latin-1). The DIGIT TWO character '2' and the SUPERSCRIPT TWO character '²' are compatibility equivalent because they are variants of the same basic character. Because the characters are visually distinguishable and carry additional semantic information, they are not canonically equivalent.
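The two equivalences can be checked in code. The following is a minimal sketch that uses the java.text.Normalizer class described later in this tip. The class and method names (EquivalenceCheck, canonicallyEquivalent, compatibilityEquivalent) are invented for illustration. It treats two strings as canonically equivalent if their canonical decompositions match, and as compatibility equivalent if their compatibility decompositions match.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

class EquivalenceCheck {
  // Canonically equivalent strings have identical canonical decompositions.
  static boolean canonicallyEquivalent(String a, String b) {
    return Normalizer.normalize(a, Form.NFD)
        .equals(Normalizer.normalize(b, Form.NFD));
  }

  // Compatibility-equivalent strings have identical compatibility
  // decompositions.
  static boolean compatibilityEquivalent(String a, String b) {
    return Normalizer.normalize(a, Form.NFKD)
        .equals(Normalizer.normalize(b, Form.NFKD));
  }

  public static void main(String[] args) {
    // Precomposed C WITH CEDILLA versus C followed by COMBINING CEDILLA
    System.out.println(canonicallyEquivalent("\u00C7", "C\u0327"));   // true
    // DIGIT TWO versus SUPERSCRIPT TWO
    System.out.println(canonicallyEquivalent("2", "\u00B2"));         // false
    System.out.println(compatibilityEquivalent("2", "\u00B2"));       // true
  }
}
```

The characters are written as Unicode escapes so that the result does not depend on the encoding used to compile the source file.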

Unicode text normalization is the process of translating characters and character sequences from one equivalent form into another. Unicode defines four normalization forms.

NFC
Normalization Form Canonical Composition. Characters are decomposed and then recomposed by canonical equivalence. For example, sequences like "letter+combining marks" are composed to form a single character if possible.
NFD
Normalization Form Canonical Decomposition. Characters are decomposed by canonical equivalence. For example, the precomposed character 'Ç' (U+00C7) transforms to a combining sequence containing a base character and a combining accent.
NFKC
Normalization Form Compatibility Composition. Characters are decomposed by compatibility equivalence then recomposed by canonical equivalence.
NFKD
Normalization Form Compatibility Decomposition. Characters are decomposed by compatibility equivalence. For example, the fraction '½' (U+00BD) transforms into a sequence of three characters: '1' (U+0031), '⁄' (FRACTION SLASH, U+2044), and '2' (U+0032).

The Normalizer class

Java SE 6 supports Unicode text normalization by providing the now public class java.text.Normalizer. This class defines both the normalize method that transforms text and the Form enumeration that represents the Unicode normalization forms NFC, NFD, NFKC, and NFKD.
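As a quick illustration of the four forms, the following sketch normalizes a two-character sample ('Ç' followed by '½') to each form and prints the resulting code points. The FormsDemo class and its helper methods are hypothetical names used only for this example.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

class FormsDemo {
  // Formats each code point of s as U+XXXX, separated by spaces.
  static String codePoints(String s) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); ) {
      int cp = s.codePointAt(i);
      if (sb.length() > 0) {
        sb.append(' ');
      }
      sb.append(String.format("U+%04X", cp));
      i += Character.charCount(cp);
    }
    return sb.toString();
  }

  // Normalizes s to the given form and lists the resulting code points.
  static String describe(String s, Form form) {
    return codePoints(Normalizer.normalize(s, form));
  }

  public static void main(String[] args) {
    String sample = "\u00C7\u00BD"; // C WITH CEDILLA, VULGAR FRACTION ONE HALF
    for (Form f : Form.values()) {
      System.out.println(f + ": " + describe(sample, f));
    }
  }
}
```

For this sample, NFC leaves both characters unchanged, NFD splits only the 'Ç', and the two compatibility forms additionally expand the fraction into U+0031 U+2044 U+0032.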

Possible applications of various Unicode normalization forms are shown below:

Example: NFC

Suppose you want to publish a document on the Web. The Character Model for the World Wide Web specification recommends that in order to improve indexing, searching and other text related functionality of the Web, data should be normalized before publishing (early normalization). The specification states that NFC is preferred because almost all legacy data as well as data created by current software is already normalized to NFC. The following code reads data from standard input and writes NFC-normalized data to standard output. The UTF-8 encoding is used for both input and output.

import java.io.*;
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class NFC {
  public static void main(String[] args) {
    final String INPUT_ENC = "UTF-8";
    final String OUTPUT_ENC = "UTF-8";
    try {
      BufferedReader r = new BufferedReader(
          new InputStreamReader(System.in, INPUT_ENC));
      PrintWriter w = new PrintWriter(
          new OutputStreamWriter(System.out, OUTPUT_ENC), true);
      String s;
      while ((s = r.readLine()) != null) {
        w.println(Normalizer.normalize(s, Form.NFC));
      }
    } 
    catch (Exception ex) {
      ex.printStackTrace();
    }
  }
}
 

NFC normalization is also well suited to string equality tests. For ordering comparisons such as sorting, however, you should use the java.text.Collator class, initialized with the appropriate locale, because the sorting order for accented letters differs between languages. Some languages place accented letters right after the base letter, and some place accented letters after all base letters.
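The difference between equality tests and ordering can be sketched as follows. The CompareDemo class and its methods are hypothetical, and the French locale is chosen only as an example. The sameText method compares NFC-normalized strings, while sortFor delegates ordering to a Collator.

```java
import java.text.Collator;
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.Arrays;
import java.util.Locale;

class CompareDemo {
  // Equality test that ignores the difference between a precomposed
  // character and its equivalent combining sequence.
  static boolean sameText(String a, String b) {
    return Normalizer.normalize(a, Form.NFC)
        .equals(Normalizer.normalize(b, Form.NFC));
  }

  // Returns a sorted copy of the words, ordered by the locale's collator.
  static String[] sortFor(Locale locale, String... words) {
    String[] copy = words.clone();
    Arrays.sort(copy, Collator.getInstance(locale));
    return copy;
  }

  public static void main(String[] args) {
    System.out.println(sameText("\u00C7a", "C\u0327a"));

    // \u00C9 is LATIN CAPITAL LETTER E WITH ACUTE.
    String[] names = { "Zebra", "\u00C9mile", "Eric" };
    System.out.println(Arrays.toString(sortFor(Locale.FRENCH, names)));

    // Plain code-unit ordering puts the accented name after "Zebra".
    String[] raw = names.clone();
    Arrays.sort(raw);
    System.out.println(Arrays.toString(raw));
  }
}
```

The collator places the accented name next to the other names that begin with E, whereas String.compareTo orders by UTF-16 code unit value and therefore sorts it last.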

Example: NFD

Suppose you are developing a phone directory application. You store the directory data in a database and provide a search form to look up the data. Because people's names around the world contain accented characters, you have two problems: many databases do not handle accented characters well, and many users of your application will not bother to enter, or simply cannot enter, the correct (accented) names into the search form. So you must remove all accents from both the data stored in the database and the data read from the search form.

The following code reads standard input line by line, strips accented characters from each line and writes the result to standard output. The UTF-8 encoding is used for both input and output.

import java.io.*;
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class NFD {
  public static void main(String[] args) {
    final String INPUT_ENC = "UTF-8";
    final String OUTPUT_ENC = "UTF-8";
    try {                
      BufferedReader r = new BufferedReader(
        new InputStreamReader(System.in, INPUT_ENC));
      PrintWriter w = new PrintWriter(
        new OutputStreamWriter(System.out, OUTPUT_ENC), true);
      String s;
      while ((s = r.readLine()) != null) {
        // decompose and remove accents
        String decomposed = Normalizer.normalize(s, Form.NFD);
        String accentsGone = 
            decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        w.println(accentsGone);
      }
    } catch (Exception ex) {
      ex.printStackTrace();
    }
  }
}
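For the directory lookup itself, the same stripping step can be wrapped in a reusable method. The following is a sketch only: the AccentFolding class, the searchKey and matches methods, and the decision to lowercase the key are assumptions made for this example, not part of the program above.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.Locale;

class AccentFolding {
  // Builds a lookup key: decompose, drop combining diacritical marks,
  // then lowercase (the lowercasing is an extra step added for this sketch).
  static String searchKey(String s) {
    String decomposed = Normalizer.normalize(s, Form.NFD);
    return decomposed
        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "")
        .toLowerCase(Locale.ROOT);
  }

  // Compares a stored name with user input by their lookup keys.
  static boolean matches(String storedName, String query) {
    return searchKey(storedName).equals(searchKey(query));
  }

  public static void main(String[] args) {
    System.out.println(searchKey("Jos\u00E9 M\u00FCller"));
    System.out.println(matches("Fran\u00E7ois", "francois"));
    // U+00F8 (o with stroke) has no decomposition and is left as is.
    System.out.println(searchKey("S\u00F8ren").equals("s\u00F8ren"));
  }
}
```

Note that this approach only removes marks that NFD separates from their base letter. Letters such as 'ø' (U+00F8) have no decomposition, so they need separate handling if they must be matched by plain ASCII input.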
 

Example: NFKC

NFKC normalization affects characters with combining marks that have a compatibility decomposition form. So, the character sequence U+1E9B U+0323 (LATIN SMALL LETTER LONG S WITH DOT ABOVE followed by the COMBINING DOT BELOW) is transformed to the single character value U+1E69. The normalized character is 'ṩ' (LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE).
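The transformation described above can be reproduced with a few lines of code. The NfkcDemo class name is hypothetical. For contrast, the sketch also applies NFC, which leaves this particular sequence unchanged because the long s is only compatibility equivalent to 's'.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

class NfkcDemo {
  static String nfkc(String s) {
    return Normalizer.normalize(s, Form.NFKC);
  }

  static String nfc(String s) {
    return Normalizer.normalize(s, Form.NFC);
  }

  public static void main(String[] args) {
    // LONG S WITH DOT ABOVE followed by COMBINING DOT BELOW
    String input = "\u1E9B\u0323";
    System.out.println(nfkc(input).equals("\u1E69")); // true
    System.out.println(nfc(input).equals(input));     // true
  }
}
```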

This normalization form is required to comply with the string profile specification for Internationalized Domain Names (RFC 3491). If a domain name contains non-ASCII characters, it must be normalized to the NFKC form. So if you are building an application that registers internationalized domain names, you must normalize the names to this form.

Note that the Internationalized Domain Names specification is based on Unicode version 3.2, and there are differences in normalization forms for some CJK ideographic characters between Unicode versions 3.2 and 4.0. If you are not implementing RFC 3491 and just want to get the normalized domain name, you may use the facilities provided by the java.net.IDN class.
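As a brief sketch of the java.net.IDN route, the following code converts a domain name to its ASCII-compatible form. The IdnDemo class and the bücher.example domain are made up for this example. The name is passed in both precomposed and decomposed spellings to show that the conversion normalizes its input.

```java
import java.net.IDN;

class IdnDemo {
  // Converts a Unicode domain name to its ASCII-compatible encoding.
  static String toAscii(String domain) {
    return IDN.toASCII(domain);
  }

  public static void main(String[] args) {
    // U+00FC (u with diaeresis), precomposed
    System.out.println(toAscii("b\u00FCcher.example"));
    // 'u' followed by COMBINING DIAERESIS (U+0308)
    System.out.println(toAscii("bu\u0308cher.example"));
  }
}
```

Both calls print xn--bcher-kva.example.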

The normalization process is similar to the one shown in the NFC example. The only difference is that the Form.NFKC normalization form should be used instead of Form.NFC.

Example: NFKD

This form of normalization is useful when legacy text data is converted to XML format. The Unicode in XML and other Markup Languages specification defines several rules for dealing with compatibility characters. For example, it recommends using <sup> and <sub> markup for superscripts and subscripts, using MathML markup for expressing fractions, using list item marker styles instead of circled digits, and so on. If you are building an application that converts legacy data to XML, you should consider applying the appropriate markup and/or styles to text data that has been normalized to NFKD.

In order to convert data to NFKD form, you should pass Form.NFKD as the second parameter to the Normalizer.normalize method:

Normalizer.normalize(s, Form.NFKD); 
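To illustrate how compatibility characters might be turned into markup, here is a deliberately small sketch. It handles only the three Latin-1 superscript digits and wraps their NFKD form in <sup> tags. The SuperscriptMarkup class and toMarkup method are hypothetical, and a real converter would need to cover many more characters and decide on markup per character category.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

class SuperscriptMarkup {
  // Replaces the Latin-1 superscript digits with <sup> markup around
  // their compatibility decomposition; other characters are copied as is.
  static String toMarkup(String text) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (c == '\u00B9' || c == '\u00B2' || c == '\u00B3') {
        String plain = Normalizer.normalize(String.valueOf(c), Form.NFKD);
        out.append("<sup>").append(plain).append("</sup>");
      } else {
        out.append(c);
      }
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // Prints x<sup>2</sup> + y<sup>3</sup>
    System.out.println(toMarkup("x\u00B2 + y\u00B3"));
  }
}
```

The markup decision is made per character because normalizing the whole string to NFKD first would discard the information that a digit was a superscript.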

Normalization Testing

The java.text.Normalizer class defines the isNormalized method, which checks whether a given character sequence is normalized according to one of the four normalization forms. The following code reads lines from standard input and reports whether the line is normalized to any of the four forms. Input is UTF-8 encoded.

import java.io.*;
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class IsNormalized {
  public static void main(String[] args) {
    final String INPUT_ENC = "UTF-8";
    final Form[] forms = { Form.NFC, Form.NFD, Form.NFKC, Form.NFKD };
    try {                
      BufferedReader r = new BufferedReader(
          new InputStreamReader(System.in, INPUT_ENC));
      String s;
      int line = 1;
      while ((s = r.readLine()) != null) {
        System.out.printf("%5d:", line++);
        for (Form f : forms) {
          if (Normalizer.isNormalized(s, f)) {
            System.out.print(" " + f.toString());
          }
        }
        System.out.println();
      }
    } catch (Exception ex) {
      ex.printStackTrace();
    }
  }
}
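A common use of isNormalized is to skip the normalization step when the text is already in the desired form. The following sketch shows that pattern for NFC. The EarlyNormalization class and the toNfc method are names invented for this example.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

class EarlyNormalization {
  // Returns s unchanged if it is already in NFC, otherwise a normalized copy.
  static String toNfc(String s) {
    return Normalizer.isNormalized(s, Form.NFC)
        ? s
        : Normalizer.normalize(s, Form.NFC);
  }

  public static void main(String[] args) {
    String ascii = "plain text";
    System.out.println(toNfc(ascii) == ascii);                // same object
    System.out.println(toNfc("C\u0327").equals("\u00C7"));    // composed
  }
}
```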
 

For More Information

The following documents provide more information about normalization:

Monitoring Image I/O Events

by Brian Burkhalter

The Java Image I/O API provides a framework for working with images on the Java platform. Within this framework, the javax.imageio.event package defines several interfaces for monitoring synchronous events that are emitted while reading or writing images.

You can use these interfaces for the following purposes:

  1. monitor the progress of image reading
  2. receive notifications as each region of an image is read
  3. trap warning messages about image reading

Each of these varieties of event monitoring is enabled by registering a listener with the ImageReader being used. You can obtain an ImageReader instance from the ImageIO class. The following code shows how to retrieve a reader for the TIFF image format:

ImageReader reader;
Iterator<ImageReader> readers = 
  ImageIO.getImageReadersByMIMEType("image/tiff");
if (readers.hasNext()) {
  reader = readers.next();
}
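For completeness, here is a self-contained sketch of the whole lookup-and-read sequence. It uses the PNG format rather than TIFF, because a PNG plug-in ships with the standard Image I/O implementation, and it creates its own sample image in memory. The ReaderLookup class and its methods are hypothetical. The call to ImageIO.setUseCache(false) is a choice made for this example so that no temporary cache files are created.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

class ReaderLookup {
  static {
    ImageIO.setUseCache(false); // keep all streams in memory
  }

  // Encodes a blank image as PNG so the example needs no external file.
  static byte[] samplePng(int width, int height) {
    try {
      BufferedImage img =
          new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      ImageIO.write(img, "png", out);
      return out.toByteArray();
    } catch (IOException ex) {
      throw new RuntimeException(ex);
    }
  }

  // Looks up a reader by MIME type and decodes the first image.
  static BufferedImage read(byte[] data, String mimeType) {
    Iterator<ImageReader> readers =
        ImageIO.getImageReadersByMIMEType(mimeType);
    if (!readers.hasNext()) {
      throw new IllegalArgumentException("No reader for " + mimeType);
    }
    ImageReader reader = readers.next();
    try {
      ImageInputStream in =
          ImageIO.createImageInputStream(new ByteArrayInputStream(data));
      try {
        reader.setInput(in);
        return reader.read(0);
      } finally {
        in.close();
      }
    } catch (IOException ex) {
      throw new RuntimeException(ex);
    } finally {
      reader.dispose();
    }
  }

  public static void main(String[] args) {
    BufferedImage img = read(samplePng(4, 3), "image/png");
    System.out.println(img.getWidth() + "x" + img.getHeight());
  }
}
```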
 

Monitoring reading progress

The IIOReadProgressListener interface is used to monitor image reading progress. An implementation of IIOReadProgressListener is registered with an ImageReader using the addIIOReadProgressListener method. For example, the following code uses a ProgressMonitor to display image reading progress:

Component parentComponent;
String imageName;
ImageReader reader;

// Create a ProgressMonitor which displays percentage completed.
final ProgressMonitor pm =
    new ProgressMonitor(parentComponent, imageName, "0 %", 0, 100);

// Register an anonymous inner class implementing IIOReadProgressListener
reader.addIIOReadProgressListener(new IIOReadProgressListener() {
  // Close the ProgressMonitor if the read is aborted.
  public void readAborted(ImageReader source) {
    pm.close();
  }

  public void imageStarted(ImageReader source,
      int imageIndex) {
    // Abort the read if "cancel" pressed.
    if(pm.isCanceled()) {
      source.abort();
    }
  }

  // Set image progress to 100% upon completion.
  public void imageComplete(ImageReader source) {
    imageProgress(source, 100.0F);
  }
  
  // Update the progress bar and its label each time the reader
  // notifies the IIOReadProgressListener of a new percentage.
  public void imageProgress(ImageReader source, float percentageDone) {
    // Abort the read if "cancel" pressed.
    if(pm.isCanceled()) {
      source.abort();
      return;
    }
  
    // Update the progress and label.
    final int nv = (int)percentageDone;
    SwingUtilities.invokeLater(new Runnable() {
      public void run() {
        pm.setProgress(nv);
        pm.setNote(nv+" %");
      }
    });
  }
  
  public void thumbnailStarted(ImageReader source, int imageIndex, 
      int thumbnailIndex) { 
  }
  
  public void thumbnailProgress(ImageReader source, float percentageDone) {}
  
  public void thumbnailComplete(ImageReader source) {}
  
  public void sequenceStarted(ImageReader source, int minIndex) {}
  
  public void sequenceComplete(ImageReader source) {}
});
 

The ProgressMonitor displays a progress bar which is filled from left to right as image loading progresses. When image reading begins, the reader invokes the imageStarted method, and the listener checks whether the ProgressMonitor's "cancel" button has been pressed. If it has, the listener aborts the read. The reader then calls the readAborted method, which closes the ProgressMonitor.

During reading, the reader will periodically invoke the imageProgress method with an updated progress value. After checking whether reading has been canceled, the listener updates the ProgressMonitor with the new progress value and percentage value string. Note that the listener updates the ProgressMonitor on the Swing event thread, not on the thread which invoked the imageProgress method. When image reading is complete, the reader invokes the imageComplete method to set the percentage completion value to 100.

[Image: the ProgressMonitor as it updates its progress bar]

You can use the thumbnail methods of the IIOReadProgressListener class to monitor the progress of thumbnail loading when applicable.
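The callback sequence can also be observed without any user interface. The following sketch records the progress events in a list while reading a small PNG image that it first creates in memory. The ProgressLog class and the readWithLog method are hypothetical, and the use of PNG and of ImageIO.setUseCache(false) are assumptions made to keep the example self-contained.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.event.IIOReadProgressListener;
import javax.imageio.stream.ImageInputStream;

class ProgressLog implements IIOReadProgressListener {
  final List<String> events = new ArrayList<String>();

  public void imageStarted(ImageReader source, int imageIndex) {
    events.add("started " + imageIndex);
  }
  public void imageProgress(ImageReader source, float percentageDone) {
    events.add("progress");
  }
  public void imageComplete(ImageReader source) {
    events.add("complete");
  }
  public void readAborted(ImageReader source) {
    events.add("aborted");
  }
  // Sequence and thumbnail notifications are not needed for this sketch.
  public void sequenceStarted(ImageReader source, int minIndex) {}
  public void sequenceComplete(ImageReader source) {}
  public void thumbnailStarted(ImageReader source, int imageIndex,
      int thumbnailIndex) {}
  public void thumbnailProgress(ImageReader source, float percentageDone) {}
  public void thumbnailComplete(ImageReader source) {}

  // Writes a blank PNG to memory, reads it back with a ProgressLog
  // registered, and returns the recorded events.
  static List<String> readWithLog(int width, int height) {
    ImageIO.setUseCache(false); // keep all streams in memory
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      ImageIO.write(
          new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB),
          "png", bytes);

      ImageReader reader = ImageIO.getImageReadersByFormatName("png").next();
      ProgressLog log = new ProgressLog();
      reader.addIIOReadProgressListener(log);
      ImageInputStream in = ImageIO.createImageInputStream(
          new ByteArrayInputStream(bytes.toByteArray()));
      try {
        reader.setInput(in);
        reader.read(0);
      } finally {
        in.close();
        reader.dispose();
      }
      return log.events;
    } catch (IOException ex) {
      throw new RuntimeException(ex);
    }
  }

  public static void main(String[] args) {
    System.out.println(readWithLog(16, 16));
  }
}
```

Because the listener methods run on the thread that calls read, a listener that touches Swing components should hand the update to the event thread, as the ProgressMonitor example does.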

Receiving region update notifications

The IIOReadUpdateListener interface is used to receive notifications when each region of an image is read. An image region could consist of a single row of pixels for scanline-oriented imagery or of a single tile for tiled imagery. Use the addIIOReadUpdateListener method to register an IIOReadUpdateListener implementation with an ImageReader instance.

For example, an image display component might implement IIOReadUpdateListener as follows:

public void imageUpdate(ImageReader reader, BufferedImage image,
    int minX, int minY, int width, int height,
    int periodX, int periodY, int[] bands) {
  // Set the displayed image to the parameter image.
  setImage(image);
  
  // Repaint the sections of the image just updated.
  repaint(0L, minX, minY, width, height);
}
  
public void passStarted(ImageReader reader, BufferedImage image,
    int pass, int minPass, int maxPass,
    int minX, int minY,
    int periodX, int periodY, int[] bands) {}
      
public void passComplete(ImageReader reader, BufferedImage image) {}
      
public void thumbnailPassStarted(ImageReader reader, BufferedImage image,
    int pass, int minPass, int maxPass,
    int minX, int minY,
    int periodX, int periodY, int[] bands) {}
          
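public void thumbnailUpdate(ImageReader reader, BufferedImage thumbnail,
    int minX, int minY, int width, int height,
    int periodX, int periodY, int[] bands) {}

public void thumbnailPassComplete(ImageReader reader, 
    BufferedImage image) {}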
 

When an ImageReader invokes the imageUpdate method, the method first calls the setImage method to perform any initialization that might be required for the display component:

/** Instance variable for the image being displayed. */
BufferedImage theImage = null;

/** Initialize the display component using the provided image. */
protected void setImage(BufferedImage image) {
  if(image != theImage) {
    // Save the reference.
    theImage = image;
    // Initialize the display component as needed.
    // --- CODE OMITTED ---
  }
}        
 

The display component then calls the repaint method to draw the image region that has just been loaded.

Being notified when an image region is updated is especially important when reading large images. The update notification permits the display component to paint each region as it is read rather than wait for the entire image to load before drawing any of the image.
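To see how often a reader reports updated regions, the following sketch counts the imageUpdate notifications received while reading a small PNG image created in memory. The RegionCounter class and readAndCount method are hypothetical. How many notifications arrive, and how large each region is, depends on the reader plug-in and the image layout.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.event.IIOReadUpdateListener;
import javax.imageio.stream.ImageInputStream;

class RegionCounter implements IIOReadUpdateListener {
  int updates = 0;      // number of imageUpdate notifications
  int rowsReported = 0; // sum of the reported region heights

  public void imageUpdate(ImageReader source, BufferedImage theImage,
      int minX, int minY, int width, int height,
      int periodX, int periodY, int[] bands) {
    updates++;
    rowsReported += height;
  }
  public void passStarted(ImageReader source, BufferedImage theImage,
      int pass, int minPass, int maxPass, int minX, int minY,
      int periodX, int periodY, int[] bands) {}
  public void passComplete(ImageReader source, BufferedImage theImage) {}
  public void thumbnailPassStarted(ImageReader source,
      BufferedImage theThumbnail, int pass, int minPass, int maxPass,
      int minX, int minY, int periodX, int periodY, int[] bands) {}
  public void thumbnailUpdate(ImageReader source, BufferedImage theThumbnail,
      int minX, int minY, int width, int height,
      int periodX, int periodY, int[] bands) {}
  public void thumbnailPassComplete(ImageReader source,
      BufferedImage theThumbnail) {}

  // Writes a blank PNG to memory and reads it back with a counter attached.
  static RegionCounter readAndCount(int width, int height) {
    ImageIO.setUseCache(false); // keep all streams in memory
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      ImageIO.write(
          new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB),
          "png", bytes);

      ImageReader reader = ImageIO.getImageReadersByFormatName("png").next();
      RegionCounter counter = new RegionCounter();
      reader.addIIOReadUpdateListener(counter);
      ImageInputStream in = ImageIO.createImageInputStream(
          new ByteArrayInputStream(bytes.toByteArray()));
      try {
        reader.setInput(in);
        reader.read(0);
      } finally {
        in.close();
        reader.dispose();
      }
      return counter;
    } catch (IOException ex) {
      throw new RuntimeException(ex);
    }
  }

  public static void main(String[] args) {
    RegionCounter c = readAndCount(32, 8);
    System.out.println(c.updates + " updates, " + c.rowsReported + " rows");
  }
}
```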

Trapping warning messages

The IIOReadWarningListener interface is used to trap warning messages emitted by an ImageReader while reading an image. An implementation of IIOReadWarningListener is registered with an ImageReader using the addIIOReadWarningListener() method. For example, this code uses a JDialog to display the text of warning messages:

Component parentComponent;
String imageName;
ImageReader reader;

// Create a text area to contain the warning message(s).
final JTextArea text = new JTextArea();
text.setColumns(60);
text.setLineWrap(true);
text.setWrapStyleWord(true);

// Create a warning option pane to contain the text area.
JOptionPane opt = new JOptionPane(new JScrollPane(text),
    JOptionPane.WARNING_MESSAGE, JOptionPane.DEFAULT_OPTION);

// Create a non-modal dialog for the option pane.
final JDialog dialog =
    opt.createDialog(JOptionPane.getFrameForComponent(parentComponent),
    "Warnings: "+imageName);
dialog.setModal(false);

// Register an anonymous inner class implementing IIOReadWarningListener
reader.addIIOReadWarningListener(new IIOReadWarningListener() {
  public void warningOccurred(ImageReader source,
      final String warning) {
    // Append the current warning to the text area.
    SwingUtilities.invokeLater(new Runnable() {
      public void run() {
        text.append("[WARNING]: "+ warning+"\n");
        dialog.pack();
        dialog.setVisible(true);
      }
    });
  }
});
 

When the ImageReader emits a warning message, it invokes the warningOccurred method, which appends the message to the text area and displays the warning dialog. Note that the warning dialog is updated on the Swing event thread, not on the thread which invoked the warningOccurred method.

[Image: the warning dialog]

Monitoring image writing events

Monitoring Image I/O events while writing images is very similar to monitoring events while reading images. The IIOWriteProgressListener and IIOWriteWarningListener interfaces are analogs of IIOReadProgressListener and IIOReadWarningListener, respectively. Use them with an ImageWriter in the same way that the read listeners are used with an ImageReader, as described above.
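As a sketch of the writing side, the following code registers an IIOWriteProgressListener with a PNG ImageWriter and records the events emitted while a blank image is written to memory. The WriteLog class and writeWithLog method are hypothetical names, and the choice of PNG and of an in-memory stream are assumptions made for this example.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriter;
import javax.imageio.event.IIOWriteProgressListener;
import javax.imageio.stream.ImageOutputStream;

class WriteLog implements IIOWriteProgressListener {
  final List<String> events = new ArrayList<String>();

  public void imageStarted(ImageWriter source, int imageIndex) {
    events.add("started " + imageIndex);
  }
  public void imageProgress(ImageWriter source, float percentageDone) {
    events.add("progress");
  }
  public void imageComplete(ImageWriter source) {
    events.add("complete");
  }
  public void writeAborted(ImageWriter source) {
    events.add("aborted");
  }
  // Thumbnail notifications are not needed for this sketch.
  public void thumbnailStarted(ImageWriter source, int imageIndex,
      int thumbnailIndex) {}
  public void thumbnailProgress(ImageWriter source, float percentageDone) {}
  public void thumbnailComplete(ImageWriter source) {}

  // Writes a blank image as PNG to memory and returns the recorded events.
  static List<String> writeWithLog(int width, int height) {
    ImageIO.setUseCache(false); // keep all streams in memory
    ImageWriter writer = ImageIO.getImageWritersByFormatName("png").next();
    WriteLog log = new WriteLog();
    writer.addIIOWriteProgressListener(log);
    try {
      ImageOutputStream out =
          ImageIO.createImageOutputStream(new ByteArrayOutputStream());
      try {
        writer.setOutput(out);
        writer.write(
            new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB));
      } finally {
        out.close();
      }
    } catch (IOException ex) {
      throw new RuntimeException(ex);
    } finally {
      writer.dispose();
    }
    return log.events;
  }

  public static void main(String[] args) {
    System.out.println(writeWithLog(16, 16));
  }
}
```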

For More Information

The following documents provide additional information about the Image I/O API:
