Java Technology and XML
Part 1 -- An Introduction to APIs for XML Processing
By Thierry Violleau
November 2001
Introduction
With the growing interest in web services and e-business platforms, XML is
joining Java in the developer's toolbox. As of today, no fewer than six
extensions to the Java Platform help the Java developer build XML-based
applications:
- Java API for XML Processing (JAXP)
- Java API for XML/Java Binding (JAXB)
- Long Term JavaBeans Persistence
- Java API for XML Messaging (JAXM)
- Java API for XML-based RPC (JAX-RPC)
- Java API for XML Registry (JAXR)
This article addresses the first one, the Java API for XML Processing (JAXP). It also covers the technologies that JAXP makes available to the Java developer, directly or indirectly, and the technologies that rely on it to process XML
documents, including:
- SAX, the Simple API for XML
- DOM, the Document Object Model API from W3C
- XSLT, the XSL Transformations language from W3C
- XPath, the XML Path Language from W3C
- JDOM, the "Java optimized" document object model API from jdom.org
This article -- the first in a series of three -- is intended
to give an overview of the different APIs available to the developer
by presenting some sample programs. The differences in performance will
be addressed in a second article, and a third article will attempt to
give tips on how to improve the performance of XML-based applications
from a programmatic and architectural point of view. To make the
sample programs comparable, each one solves the same problem and
produces the same output. The problem has been kept simple to
accommodate the different capabilities of those technologies.
The sample programs presented here may be considered to be micro-benchmarks.
The demonstration essentially focuses on understanding what technologies
can be applied to process input XML documents; it does not cover in
detail the generation of output XML documents.
This series of articles does not cover XML or XML processing in Java
in depth. Readers should be familiar with both XML and the Java
programming language, and will find the article even more valuable if they
already have experience with one of the APIs and want to compare it with
some of the others.
Overview of Different XML Processing Models
What does it mean for an application to be XML-based or, to some
extent, to be a Web Service? With respect to XML, it mainly means
that it may consume XML documents, apply its business logic on
retrieved information and generate resulting XML documents. Those
three phases can typically be described in more detail as:
1. XML input processing
   a. Parsing and validating the source document
   b. Recognizing/searching for relevant information based on its location or its tagging in the source document
   c. Extracting the relevant information once it has been located
   d. Optionally, mapping/binding the retrieved information to business objects
2. Business logic handling
   a. The actual processing of the input information, optionally resulting in the generation of output information
3. XML output processing
   a. Constructing a model of the document to generate with DOM, JDOM, and so on
   b. Applying XSLT style sheets or directly serializing to XML
SAX and DOM are the most common processing models. To process an
XML document with SAX, one has to code methods that
handle the events fired by the parser as it encounters the different
tokens of the markup language. With DOM, one has to write code to
walk through a tree-like data structure created by the parser from
the source document. Since a SAX parser generates a transient flow
of events, the four steps of XML input processing described above
(namely: parsing, recognizing, extracting and mapping) must be done
in one single cycle: each caught event is handled immediately and
the relevant information passed on with the event. When using DOM,
the XML input processing is done in at least two cycles: first, the
DOM parser creates a tree-like data structure that models the XML
source document (DOM tree), then the DOM tree is walked through,
searching for relevant information to extract and further process;
this last cycle can be repeated as many times as necessary since
the DOM tree persists in memory.
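As a rough illustration of the difference between the two models, the sketch below counts elements in a small in-memory document. With SAX the count is accumulated during the single parsing cycle; with DOM the document is parsed once and the resulting tree is then queried twice. The class name, method names, and sample document are hypothetical, and the SAX part uses the SAX 2.0 `DefaultHandler` rather than the SAX 1.0 classes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ProcessingModels {
    // Hypothetical sample input, modeled on the chessboard documents.
    static final String XML =
        "<CHESSBOARD><WHITEPIECES>"
        + "<KING><POSITION COLUMN=\"G\" ROW=\"1\"/></KING>"
        + "<PAWN><POSITION COLUMN=\"A\" ROW=\"4\"/></PAWN>"
        + "<PAWN><POSITION COLUMN=\"B\" ROW=\"3\"/></PAWN>"
        + "</WHITEPIECES><BLACKPIECES>"
        + "<KING><POSITION COLUMN=\"B\" ROW=\"6\"/></KING>"
        + "</BLACKPIECES></CHESSBOARD>";

    // SAX: a single cycle. The count is updated while the events flow by;
    // nothing of the document remains once parsing is over.
    static int countWithSax(String xml, final String tag) throws Exception {
        final int[] count = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(
            new InputSource(new StringReader(xml)),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attrs) {
                    if (qName.equals(tag)) {
                        count[0]++;
                    }
                }
            });
        return count[0];
    }

    // DOM, first cycle: build the in-memory tree.
    static Document load(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    // DOM, second cycle: query the tree; can be repeated without reparsing.
    static int countWithDom(Document document, String tag) {
        return document.getElementsByTagName(tag).getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("SAX pawns: " + countWithSax(XML, "PAWN"));
        Document document = load(XML);
        System.out.println("DOM pawns: " + countWithDom(document, "PAWN"));
        System.out.println("DOM kings: " + countWithDom(document, "KING"));
    }
}
```

Counting a second tag with SAX would require parsing the document again (or counting both tags in the same handler), whereas the DOM tree can simply be queried again.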
Table 1: SAX and DOM features
| SAX | DOM |
| --- | --- |
| Event-based model | Tree data structure |
| Serial access (flow of events) | Random access (in-memory data structure) |
| Low memory usage (only events are generated) | High memory usage (the document is loaded in memory) |
| To process parts of the document (catching relevant events) | To edit the document (processing the in-memory data structure) |
| To process the document only once (transient flow of events) | To process multiple times (document loaded in memory) |
XSLT is a much higher-level processing model than
SAX and DOM. XSLT mostly requires the developer to code rules
(templates) that are applied when specified patterns are
encountered in the source document. These patterns are specified
using the XPath language. XPath is used to locate and extract
information from the source document, and so specifically addresses
steps 1.b and 1.c (recognizing/searching and extracting) of the detailed
XML processing. While SAX and DOM
require the Java developer to write Java code, XSLT (apart from the
engine invocation itself) mostly requires writing style sheets, which
are themselves XML documents. Compared to DOM and SAX programming,
XSLT programming may be viewed as scripting.
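To give a feel for how an XPath expression covers both locating and extracting, here is a minimal sketch that evaluates a path from plain Java. It relies on the `javax.xml.xpath` package, which was added to JAXP after the 1.1 release discussed in this article, so treat it as an assumption about a more recent environment; the class name, method name, and sample document are hypothetical.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSketch {
    // Hypothetical sample input, modeled on the chessboard documents.
    static final String XML =
        "<CHESSBOARD><WHITEPIECES>"
        + "<KING><POSITION COLUMN=\"G\" ROW=\"1\"/></KING>"
        + "<PAWN><POSITION COLUMN=\"A\" ROW=\"4\"/></PAWN>"
        + "<PAWN><POSITION COLUMN=\"B\" ROW=\"3\"/></PAWN>"
        + "</WHITEPIECES><BLACKPIECES>"
        + "<KING><POSITION COLUMN=\"B\" ROW=\"6\"/></KING>"
        + "</BLACKPIECES></CHESSBOARD>";

    // Locates the positions of the white pawns with a location path,
    // then extracts the square of each one from its attributes.
    static String whitePawnSquares(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList positions = (NodeList) xpath.evaluate(
            "/CHESSBOARD/WHITEPIECES/PAWN/POSITION",
            new InputSource(new StringReader(xml)),
            XPathConstants.NODESET);
        StringBuilder squares = new StringBuilder();
        for (int i = 0; i < positions.getLength(); i++) {
            Element position = (Element) positions.item(i);
            if (i > 0) {
                squares.append(' ');
            }
            squares.append(position.getAttribute("COLUMN"))
                   .append(position.getAttribute("ROW"));
        }
        return squares.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(whitePawnSquares(XML));
    }
}
```

In an XSLT style sheet the same location path would typically appear in a `match` or `select` attribute instead of being evaluated from Java code.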
Table 2: SAX, DOM and XSLT processing phases
| Processing Phase | SAX | DOM | XSLT |
| --- | --- | --- | --- |
| XML input processing | | | |
| Parsing and validating | Built in | Built in or based on SAX | Based on SAX or DOM |
| Recognizing/searching | Catching events with event handlers | Searching the tree with tree walkers | XPath patterns |
| Extracting | Catching events | Getting attribute values, node content: API methods | Getting attribute values, node contents: XPath statements |
| Mapping/binding | Creating business objects from the extracted information | Creating business objects from the extracted information | If at all, through DOM or SAX (pipelining) |
| XML output processing | | | |
| Constructing | No default support, but can be done by generating a properly balanced sequence of method calls to event handlers | Implicitly part of the model: API factory methods | Implicitly part of the model: XSL statements |
| Serializing | No default support, but can be done with a custom event handler | Implementation-specific support, or through XSLT identity transformation | Implicitly part of the model: XSL output method statement |
Some of the APIs available to the Java developer for
processing XML (namely DOM and XSLT) may be built on top of
others, adding more levels of abstraction and hence more
power; but they also overlap, leaving the developer to choose
when to use one instead of another. The drawing below sketches the
dependencies between those technologies as well as some of the
possibilities (represented as uni- or bidirectional arrows) offered
to the developer who needs to implement an XML-based application.

Illustration 1: An application may process/generate XML documents into/from
internal business objects, using SAX, DOM, XPath or XSLT; an application may
even sometimes use a document model like JDOM to represent its core data
structures and apply the business logic.
For each of the different technologies, SAX, DOM,
XSLT, XPath and JDOM, we implemented sample programs intended to
compare the different APIs being used both from the programmatic
(capabilities, ease of use) and performance points of view. First, we
will present the sample application domain, then we will introduce
each API and describe the corresponding sample program.
Sample Application Domain
The sample programs that will be presented in this article are
based on different XML processing APIs applied to the same set of
documents and provide the exact same outputs. The XML documents
processed using those different techniques conform to the same
Document Type Definition or XML Schema. Those schemas specify the
representation of a chessboard configuration. To benchmark the sample
programs later, we increased the size of the processed documents (to
increase the workload); the schemas have been extended accordingly to
describe sets of such chessboard configurations (from 10 to 5000).
Illustration 2: A chessboard configuration
Each program processing a set of
chessboard configurations outputs the same configurations in a simple
human-readable text format. The different implementations based on
SAX, DOM, XSLT and XPath when applied to the same input documents
provide the same outputs.
Below are two representations of the same chessboard
configuration: one in XML, the other in plain text. The second
was generated from the first using one of the sample
programs.
Code Sample 1: An XML representation of a chessboard configuration
<CHESSBOARD>
<WHITEPIECES>
<KING><POSITION COLUMN="G" ROW="1"/></KING>
<BISHOP><POSITION COLUMN="D" ROW="6"/></BISHOP>
<ROOK><POSITION COLUMN="E" ROW="1"/></ROOK>
<PAWN><POSITION COLUMN="A" ROW="4"/></PAWN>
<PAWN><POSITION COLUMN="B" ROW="3"/></PAWN>
<PAWN><POSITION COLUMN="C" ROW="2"/></PAWN>
<PAWN><POSITION COLUMN="F" ROW="2"/></PAWN>
<PAWN><POSITION COLUMN="G" ROW="2"/></PAWN>
<PAWN><POSITION COLUMN="H" ROW="5"/></PAWN>
</WHITEPIECES>
<BLACKPIECES>
<KING><POSITION COLUMN="B" ROW="6"/></KING>
<QUEEN><POSITION COLUMN="A" ROW="7"/></QUEEN>
<PAWN><POSITION COLUMN="A" ROW="5"/></PAWN>
<PAWN><POSITION COLUMN="D" ROW="4"/></PAWN>
</BLACKPIECES>
</CHESSBOARD>
Code Sample 2: A simple human-readable text format representation of a
chessboard configuration
White king: G1
White bishop: D6
White rook: E1
White pawn: A4
White pawn: B3
White pawn: C2
White pawn: F2
White pawn: G2
White pawn: H5
Black king: B6
Black queen: A7
Black pawn: A5
Black pawn: D4
XML Documents Processed
Two sets of sample input XML documents have been produced:
- A set of XML documents conforming to a Document Type
Definition
- A set of XML documents conforming to an equivalent XML Schema
Because very few XML parsers support the XML Schema
specification, most of the sample programs use the first set of XML
documents, based on DTD. The second set of documents is only used for
a benchmark comparing the respective performance of DTD and XML Schema
validation.
DTD Based Documents
Two DTDs specify the format of the XML documents which are used as
input to the sample programs:
- A DTD specifying a single chessboard configuration
- A DTD relying on the previous one to specify a set of
chessboard configurations
Code Sample 3: The DTD for a chessboard configuration (dtd/Chessboard.dtd)
<!ELEMENT CHESSBOARD (WHITEPIECES, BLACKPIECES)>
<!ENTITY % pieces
"KING,
QUEEN?,
BISHOP?, BISHOP?,
ROOK?, ROOK?,
KNIGHT?, KNIGHT?,
PAWN?, PAWN?, PAWN?, PAWN?,
PAWN?, PAWN?, PAWN?, PAWN?"
>
<!ELEMENT WHITEPIECES (%pieces;)>
<!ELEMENT BLACKPIECES (%pieces;)>
<!ELEMENT POSITION EMPTY>
<!ATTLIST POSITION
COLUMN (A|B|C|D|E|F|G|H) #REQUIRED
ROW (1|2|3|4|5|6|7|8) #REQUIRED
>
<!ELEMENT KING (POSITION)>
<!ELEMENT QUEEN (POSITION)>
<!ELEMENT BISHOP (POSITION)>
<!ELEMENT ROOK (POSITION)>
<!ELEMENT KNIGHT (POSITION)>
<!ELEMENT PAWN (POSITION)>
This schema enforces several constraints from the application
domain:
- Exactly one king per color on the chessboard at any time
- Zero or one queen per color
- Zero to two bishops per color
- Zero to two rooks per color
- Zero to two knights per color
- Zero to eight pawns per color
- A piece must be on one of the columns A, B, C, D, E, F, G and H
- A piece must be on one of the rows 1, 2, 3, 4, 5, 6, 7 and 8
Nevertheless, this schema does not prevent two pieces from sharing
the exact same position.
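Since validation against the DTD cannot catch two pieces on the same square, such a check would have to be done by the application itself. The following is only a sketch of one possible approach, not part of the sample programs: `PositionCheck` and `sharedSquares` are hypothetical names, and the code assumes the input holds a single CHESSBOARD element (a document with several configurations would need the check applied per CHESSBOARD).

```java
import java.io.StringReader;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PositionCheck {
    // Returns the squares occupied by more than one piece, in document order.
    // Assumption: the XML string describes a single chessboard configuration.
    static Set<String> sharedSquares(String chessboardXml) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(chessboardXml)));
        NodeList positions = document.getElementsByTagName("POSITION");
        Set<String> seen = new HashSet<>();
        Set<String> shared = new LinkedHashSet<>();
        for (int i = 0; i < positions.getLength(); i++) {
            Element position = (Element) positions.item(i);
            String square = position.getAttribute("COLUMN")
                + position.getAttribute("ROW");
            // Set.add returns false when the square was already recorded.
            if (!seen.add(square)) {
                shared.add(square);
            }
        }
        return shared;
    }

    public static void main(String[] args) throws Exception {
        String xml =
            "<CHESSBOARD><WHITEPIECES>"
            + "<KING><POSITION COLUMN=\"G\" ROW=\"1\"/></KING>"
            + "</WHITEPIECES><BLACKPIECES>"
            + "<KING><POSITION COLUMN=\"G\" ROW=\"1\"/></KING>"
            + "</BLACKPIECES></CHESSBOARD>";
        System.out.println("Shared squares: " + sharedSquares(xml));
    }
}
```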
The following DTD uses the previous one to specify a set of
chessboard configurations which will be children of a single
CHESSBOARDS element.
Code Sample 4: The DTD for multiple chessboard
configurations (dtd/Chessboards.dtd)
<!ELEMENT CHESSBOARDS (CHESSBOARD*)>
<!ENTITY % chessboard SYSTEM "Chessboard.dtd">
%chessboard;
The number of chessboard configurations
is unbounded, allowing any number of chessboard configurations to be
defined in a single document. In order to benchmark the different
sample programs, documents with 10, 100, 200,
300, 400, 500, 1000, 2000, 3000, 4000 and 5000 chessboard
configurations have been created.
Code Sample 5: An XML document conforming to the DTD
(Chessboards-[10-5000].xml)
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE CHESSBOARDS SYSTEM "dtd/Chessboards.dtd">
<CHESSBOARDS>
<CHESSBOARD>
<WHITEPIECES>
<KING><POSITION COLUMN="G" ROW="1" /></KING>
<BISHOP><POSITION COLUMN="D" ROW="6" /></BISHOP>
<ROOK><POSITION COLUMN="E" ROW="1" /></ROOK>
<PAWN><POSITION COLUMN="A" ROW="4" /></PAWN>
<PAWN><POSITION COLUMN="B" ROW="3" /></PAWN>
<PAWN><POSITION COLUMN="C" ROW="2" /></PAWN>
<PAWN><POSITION COLUMN="F" ROW="2" /></PAWN>
<PAWN><POSITION COLUMN="G" ROW="2" /></PAWN>
<PAWN><POSITION COLUMN="H" ROW="5" /></PAWN>
</WHITEPIECES>
<BLACKPIECES>
<KING><POSITION COLUMN="B" ROW="6" /></KING>
<QUEEN><POSITION COLUMN="A" ROW="7" /></QUEEN>
<PAWN><POSITION COLUMN="A" ROW="5" /></PAWN>
<PAWN><POSITION COLUMN="D" ROW="4" /></PAWN>
</BLACKPIECES>
</CHESSBOARD>
<CHESSBOARD>
...
</CHESSBOARD>
</CHESSBOARDS>
XML Schema Based Documents
Similar to the DTDs presented above, two XML Schemas specify the
format of the XML documents that are used as input to the benchmark
comparing DTD and XML Schema validations:
- An XML Schema specifying a single chessboard configuration
- An XML Schema relying on the previous one to specify a set of
chessboard configurations
Code Sample 6: The XML Schema for a chessboard
configuration (xsd/Chessboard.xsd)
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://mde.sun.com/Chessboard"
xmlns="http://mde.sun.com/Chessboard"
elementFormDefault="qualified">
<xsd:element name="CHESSBOARD">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="WHITEPIECES" type="pieces" />
<xsd:element name="BLACKPIECES" type="pieces" />
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:complexType name="pieces">
<xsd:sequence>
<xsd:element name="KING" type="piece"
minOccurs='1' maxOccurs='1'/>
<xsd:element name="QUEEN" type="piece"
minOccurs='0' maxOccurs='1'/>
<xsd:element name="BISHOP" type="piece"
minOccurs='0' maxOccurs='2'/>
<xsd:element name="ROOK" type="piece"
minOccurs='0' maxOccurs='2'/>
<xsd:element name="KNIGHT" type="piece"
minOccurs='0' maxOccurs='2'/>
<xsd:element name="PAWN" type="piece"
minOccurs='0' maxOccurs='8'/>
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="piece">
<xsd:sequence>
<xsd:element name="POSITION"
minOccurs='1' maxOccurs='1'>
<xsd:complexType>
<xsd:attribute name="COLUMN" use='required'>
<xsd:simpleType>
<xsd:restriction base="xsd:string">
<xsd:pattern value="[A-H]"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:attribute>
<xsd:attribute name="ROW" use='required'>
<xsd:simpleType>
<xsd:restriction base="xsd:positiveInteger">
<xsd:minInclusive value="1"/>
<xsd:maxInclusive value="8"/>
</xsd:restriction>
</xsd:simpleType>
</xsd:attribute>
</xsd:complexType>
</xsd:element>
</xsd:sequence>
</xsd:complexType>
</xsd:schema>
This XML Schema enforces the same application domain constraints
as the DTD, but it does so using ranges and patterns such as
minOccurs='0' maxOccurs='8'
and value="[A-H]".
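To see the pattern and range facets at work, the sketch below validates a lone POSITION element against a cut-down, namespace-free schema that reuses the two attribute definitions from Code Sample 6. It is an illustration under assumptions rather than the benchmark setup: it uses the `javax.xml.validation` package, which was added to JAXP after the 1.1 release discussed here, and the class name, method name, and reduced schema are hypothetical.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

public class PositionSchemaCheck {
    // Reduced schema: only the POSITION element, with the COLUMN pattern
    // and the ROW range taken from the chessboard schema.
    static final String XSD =
        "<xsd:schema xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xsd:element name=\"POSITION\"><xsd:complexType>"
        + "<xsd:attribute name=\"COLUMN\" use=\"required\"><xsd:simpleType>"
        + "<xsd:restriction base=\"xsd:string\">"
        + "<xsd:pattern value=\"[A-H]\"/>"
        + "</xsd:restriction></xsd:simpleType></xsd:attribute>"
        + "<xsd:attribute name=\"ROW\" use=\"required\"><xsd:simpleType>"
        + "<xsd:restriction base=\"xsd:positiveInteger\">"
        + "<xsd:minInclusive value=\"1\"/><xsd:maxInclusive value=\"8\"/>"
        + "</xsd:restriction></xsd:simpleType></xsd:attribute>"
        + "</xsd:complexType></xsd:element></xsd:schema>";

    // Returns true if the document is valid against the reduced schema.
    static boolean isValid(String xml) throws Exception {
        Schema schema = SchemaFactory
            .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(XSD)));
        try {
            // With no error handler set, validation errors are thrown.
            schema.newValidator()
                  .validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isValid("<POSITION COLUMN=\"G\" ROW=\"1\"/>"));
        System.out.println(isValid("<POSITION COLUMN=\"Z\" ROW=\"1\"/>"));
        System.out.println(isValid("<POSITION COLUMN=\"G\" ROW=\"9\"/>"));
    }
}
```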
Code Sample 7: The XML Schema for multiple chessboard
configurations (xsd/Chessboards.xsd)
<?xml version="1.0" encoding="UTF-8"?>
<xsd:schema
xmlns:xsd="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://mde.sun.com/Chessboards"
xmlns="http://mde.sun.com/Chessboards"
xmlns:cb="http://mde.sun.com/Chessboard">
<xsd:import namespace="http://mde.sun.com/Chessboard"
schemaLocation='Chessboard.xsd' />
<xsd:element name="CHESSBOARDS">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="cb:CHESSBOARD"
maxOccurs='unbounded'/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:schema>
XML documents conforming to the Chessboards.dtd DTD
and to the Chessboards.xsd schema
have identical structures and contents; the only difference is that the
Document Type Declaration is replaced by a reference to the
XML Schema.
For the benchmarks involving XML
Schemas, a document with 100 chessboard configurations was used.
Code Sample 8: The XML document conforming to the XML
Schema (Chessboards-schema-100.xml)
<?xml version="1.0" encoding="UTF-8"?>
<cbs:CHESSBOARDS
xmlns:cbs="http://mde.sun.com/Chessboards"
xmlns="http://mde.sun.com/Chessboard"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation='http://mde.sun.com/Chessboard
xsd/Chessboard.xsd
http://mde.sun.com/Chessboards
xsd/Chessboards.xsd'>
<CHESSBOARD>
<WHITEPIECES>
<KING><POSITION COLUMN="G" ROW="1" /></KING>
<BISHOP><POSITION COLUMN="D" ROW="6" /></BISHOP>
<ROOK><POSITION COLUMN="E" ROW="1" /></ROOK>
<PAWN><POSITION COLUMN="A" ROW="4" /></PAWN>
<PAWN><POSITION COLUMN="B" ROW="3" /></PAWN>
<PAWN><POSITION COLUMN="C" ROW="2" /></PAWN>
<PAWN><POSITION COLUMN="F" ROW="2" /></PAWN>
<PAWN><POSITION COLUMN="G" ROW="2" /></PAWN>
<PAWN><POSITION COLUMN="H" ROW="5" /></PAWN>
</WHITEPIECES>
<BLACKPIECES>
<KING><POSITION COLUMN="B" ROW="6" /></KING>
<QUEEN><POSITION COLUMN="A" ROW="7" /></QUEEN>
<PAWN><POSITION COLUMN="A" ROW="5" /></PAWN>
<PAWN><POSITION COLUMN="D" ROW="4" /></PAWN>
</BLACKPIECES>
</CHESSBOARD>
<CHESSBOARD>
...
</CHESSBOARD>
</cbs:CHESSBOARDS>
Sample Programs
All of the programs presented in this paper use the Java API for
XML Processing (JAXP) to interface with different underlying
implementations of the SAX parser, DOM document builder and XSL
Transformation engine.
Java API for XML Processing (JAXP)
The Java API for XML Processing (JAXP) allows applications to
parse and transform XML documents using an API that is independent of
any particular XML processor implementation. Through a plug-in
scheme, developers may change the XML processor implementations
without impacting their applications. JAXP 1.1 supports the following
standards:
- SAX version 2.0
- DOM Level 2
- XSLT 1.0
JAXP was used to create and invoke the different XML parser and
XSLT engine implementations. For the benchmark of a particular API
(SAX, DOM, XSLT), the same program was used; only the underlying XML
parser or XSLT engine implementation was changed.
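One way the plug-in scheme can be exercised is through system properties named after the JAXP factory classes (`javax.xml.parsers.SAXParserFactory`, `javax.xml.parsers.DocumentBuilderFactory`, `javax.xml.transform.TransformerFactory`). The sketch below simply reports which implementations the factories resolve to in the current environment; the class and method names are hypothetical, and the reported names depend entirely on the runtime and classpath in use.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;

public class JaxpImplementations {
    // Returns the class names of the factory implementations that JAXP
    // selected, in the order: SAX, DOM, XSLT.
    static String[] implementationNames() {
        return new String[] {
            SAXParserFactory.newInstance().getClass().getName(),
            DocumentBuilderFactory.newInstance().getClass().getName(),
            TransformerFactory.newInstance().getClass().getName()
        };
    }

    public static void main(String[] args) {
        String[] names = implementationNames();
        System.out.println("SAX parser factory:       " + names[0]);
        System.out.println("Document builder factory: " + names[1]);
        System.out.println("Transformer factory:      " + names[2]);
    }
}
```

Running the same class with, for example, `-Djavax.xml.parsers.SAXParserFactory=` followed by the factory class name of another parser on the classpath should change the first line of the report without any change to the program.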
The API relies on the factory design pattern to create new SAX
parsers, DOM document builders or style sheet engines. Below are
typical examples of using the API to process an XML document with
SAX, DOM and XSLT, respectively.
When using the SAX API through JAXP, one must:
- Create a new SAX parser factory
- Configure the factory
- Create a new parser from the factory
- Set the parser's document handler, error handler, DTD handler
and entity resolver
- Parse the XML document(s)
Code Sample 9: Invoking a SAX parser using JAXP and
parsing the document as a stream of SAX events
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import javax.xml.parsers.*;
boolean validating;
String fileToProcess;
...
SAXParserFactory factory
= SAXParserFactory.newInstance();
factory.setValidating(validating);
SAXParser parser = factory.newSAXParser();
...
parser.parse(fileToProcess, new HandlerBase() {
... // Custom implementation of the HandlerBase
// to process the document as SAX events
});
...
The typical steps when using DOM through JAXP are:
- Create a new DOM document builder factory
- Configure the factory
- Create a new document builder from the factory
- Set the underlying parser's error handler and entity resolver
- Parse the XML document(s) to generate a DOM tree
Code Sample 10: Invoking a document builder using
JAXP, parsing and processing the document as a DOM tree
import org.w3c.dom.*;
import org.xml.sax.*; // for ErrorHandler
import javax.xml.parsers.*;
boolean validating;
String fileToProcess;
...
DocumentBuilderFactory factory
= DocumentBuilderFactory.newInstance();
factory.setValidating(validating);
DocumentBuilder builder
= factory.newDocumentBuilder();
builder.setErrorHandler(new ErrorHandler() {
...
});
...
Document document = builder.parse(fileToProcess);
... // Processing the document as a DOM tree
When using JAXP to perform XSL Transformations, the steps to
follow are not very different from SAX or DOM:
- Create a new transformer factory
- Configure the factory
- Create a new transformer from the factory with a particular
style sheet
- Set the error listener and URI resolver
- Apply the style sheet to the XML document(s) to generate DOM
trees, SAX events or write to an output stream
Code Sample 11: Invoking an XSLT engine using JAXP and
processing the document with an XSLT style sheet; the API allows
passing parameters to the style sheet engines.
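import java.io.OutputStream;
import java.util.Enumeration;
import java.util.Properties;
import javax.xml.transform.*;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
String styleSheetFile;
String fileToProcess;
OutputStream out;
Properties properties;
...
// Create a transformer bound to the given style sheet
TransformerFactory factory
    = TransformerFactory.newInstance();
Transformer transformer = factory.newTransformer(
    new SAXSource(new InputSource(styleSheetFile)));
// Pass each property to the style sheet as a quoted parameter
for (Enumeration i = properties.propertyNames();
     i.hasMoreElements();) {
  String name = (String) i.nextElement();
  transformer.setParameter(name,
      "\'" + properties.getProperty(name) + "\'");
}
// Apply the style sheet and write the result to the output stream
transformer.transform(
    new SAXSource(new InputSource(fileToProcess)),
    new StreamResult(out));
...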
Common Structure of the Sample Programs
All the sample programs share a common structure that will later
allow them to be benchmarked. They all follow the typical steps of
using JAXP presented above, and additionally include two loops to
process the same document multiple times and in multiple runs. For
each run, a factory is used to create a parser or a style sheet
processor, which is in turn used to process an XML source document
several times. Validation of the source document against its
declared DTDs or XML Schemas may be requested when invoking the
program; it is implemented by configuring the factory through its
setValidating method so that it creates a validating or
non-validating parser.
Code Sample 12: Example of the structure of the
sample programs (ChessboardSAXPrinter.java)
import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import javax.xml.parsers.*;
public class ChessboardSAXPrinter {
  private SAXParser parser;
  // Creates a validating or non-validating parser through the factory
  public ChessboardSAXPrinter(boolean validating)
      throws Exception {
    SAXParserFactory factory
        = SAXParserFactory.newInstance();
    factory.setValidating(validating);
    parser = factory.newSAXParser();
    ...
    return;
  }
  // Processes one document and prints the result to out
  public void print(String fileName, PrintStream out)
      throws SAXException, IOException {
    ...
    parser.parse(fileName, ...);
    return;
  }
  public static void main(String[] args) throws Exception {
    ...
    for (int k = 0; k < r; k++) {
      // r: number of runs
      ChessboardSAXPrinter saxPrinter
          = new ChessboardSAXPrinter(validating);
      long time = System.currentTimeMillis();
      for (int i = 0; i < n; i++) {
        // n: number of documents processed per run
        saxPrinter.print(args[0], out);
      }
      // print out the average time (s)
      // to process a document during the current run
      System.err.print(
          (((double) (System.currentTimeMillis()
              - time)) / 1000 / n) + "\t");
    }
    ...
  }
}
SAX Sample Program
The SAX API (Simple API for XML) uses an event-based model and
allows a source document to be processed as a stream of events. The
events are fired during parsing as a continuous flow of callback
method invocations. Because the events are nested in the same way as the
document elements, no intermediate document model needs to be
created. While the memory usage is low, the programming model can be
complex, especially if the document structure doesn't closely match
the application data structures. Since it generates a transient flow
of events, the SAX API cannot be used when a document model has to
be edited or processed several times.
The SAX API defines several interfaces (some of the interfaces
from SAX 1.0 were renamed in SAX 2.0):
- org.xml.sax.Parser (XMLReader in SAX 2.0): interface for SAX parsers
  - Parses an XML document
  - Allows an application to register:
    - A document event handler
    - An error handler
    - A DTD handler
    - An entity resolver
- org.xml.sax.DocumentHandler (ContentHandler in SAX 2.0): interface to receive document events, with notification of:
  - The start or end of a document
  - The start or end of an element
  - Character data
  - Ignorable whitespace in element content
  - A processing instruction
- org.xml.sax.ErrorHandler: interface to receive SAX error events, with notification of:
  - A recoverable error
  - A non-recoverable/fatal error
  - A warning
- org.xml.sax.DTDHandler: interface to receive basic DTD-related events, with notification of:
  - A notation declaration event
  - An unparsed entity declaration event
- org.xml.sax.EntityResolver: interface for resolving external entity references
- org.xml.sax.HandlerBase (DefaultHandler in SAX 2.0): default implementation of the four previous interfaces
An application must provide at least a document (or content) handler in
order to catch relevant events and process them.

Illustration 3: When using the SAX API, the bare minimum a developer
has to do is to implement a DocumentHandler (ContentHandler in SAX 2.0)
or subclass HandlerBase (DefaultHandler) so that relevant events can be
caught and, optionally, mapped to business objects or directly handled
by the business logic.
The sample program presented below extends the HandlerBase class
and overrides, in particular, the startElement callback method. A
SAXParserFactory is used to create a new SAXParser.
The custom subclass of HandlerBase
and the path of the XML source document to be processed are then
passed to the parser. While parsing, the startElement
method is called for every single start tag in the source document.
Code Sample 13: The XML document processing program
based on the SAX API (ChessboardSAXPrinter.java);
the startElement method prints
information based on the start tags that it catches.
import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import javax.xml.parsers.*;
public class ChessboardSAXPrinter {
  private SAXParser parser;
  private PrintStream out;
  public class ChessboardHandler extends HandlerBase {
    // Color of the pieces currently being parsed
    private boolean whitePiece = false;
    public void startElement(String name,
        AttributeList attrs) {
      if (name.equals("WHITEPIECES")) {
        whitePiece = true;
      } else if (name.equals("BLACKPIECES")) {
        whitePiece = false;
      } else if (name.equals("KING")
          || name.equals("QUEEN")
          || name.equals("BISHOP")
          || name.equals("ROOK")
          || name.equals("KNIGHT")
          || name.equals("PAWN")) {
        // Print the color and name; the position follows
        out.print((whitePiece ? "White" : "Black")
            + " " + name.toLowerCase() + ": ");
      } else if (name.equals("POSITION")) {
        if (attrs != null) {
          out.print(attrs.getValue("COLUMN"));
          out.println(attrs.getValue("ROW"));
        }
      }
      return;
    }
    ...
  }
  ...
}
While parsing, the startElement
method catches the following start tag events:
- BLACKPIECES and WHITEPIECES, to keep track of the color of the nested pieces.
- KING, QUEEN, BISHOP, ROOK, KNIGHT and PAWN, to print out the piece name and its color.
- POSITION, to print out the piece position, given by the element attributes COLUMN and ROW.
The other events are not caught, but a real-world application may
need to catch events such as endElement and
characters (which signals
textual content), and to keep a more complex context in order to map
the event stream into business objects to which it may later apply
business logic. Using SAX may be tedious, but it may also be the
fastest technique.
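As a hedged illustration of what such a mapping could look like, the sketch below keeps a small context (current color and current piece name) across events and creates one business object per POSITION element. It is not one of the benchmarked programs: the `Piece` class and all other names are hypothetical, and it is written against the SAX 2.0 `DefaultHandler` rather than the SAX 1.0 `HandlerBase` used in Code Sample 13.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ChessboardSaxMapper {
    // Hypothetical business object built from the event stream.
    static final class Piece {
        final String color;
        final String name;
        final String square;

        Piece(String color, String name, String square) {
            this.color = color;
            this.name = name;
            this.square = square;
        }

        @Override
        public String toString() {
            return color + " " + name + ": " + square;
        }
    }

    static final class PieceHandler extends DefaultHandler {
        private static final List<String> PIECE_TAGS = Arrays.asList(
            "KING", "QUEEN", "BISHOP", "ROOK", "KNIGHT", "PAWN");

        final List<Piece> pieces = new ArrayList<>();
        private String color; // context: color of the enclosing container
        private String name;  // context: name of the enclosing piece

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attrs) {
            if (qName.equals("WHITEPIECES")) {
                color = "White";
            } else if (qName.equals("BLACKPIECES")) {
                color = "Black";
            } else if (PIECE_TAGS.contains(qName)) {
                name = qName.toLowerCase();
            } else if (qName.equals("POSITION") && name != null) {
                pieces.add(new Piece(color, name,
                    attrs.getValue("COLUMN") + attrs.getValue("ROW")));
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Leave the piece context when its end tag is reached.
            if (PIECE_TAGS.contains(qName)) {
                name = null;
            }
        }
    }

    static List<Piece> parse(String xml) throws Exception {
        PieceHandler handler = new PieceHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.pieces;
    }

    public static void main(String[] args) throws Exception {
        String xml =
            "<CHESSBOARD><WHITEPIECES>"
            + "<KING><POSITION COLUMN=\"G\" ROW=\"1\"/></KING>"
            + "</WHITEPIECES><BLACKPIECES>"
            + "<QUEEN><POSITION COLUMN=\"A\" ROW=\"7\"/></QUEEN>"
            + "</BLACKPIECES></CHESSBOARD>";
        for (Piece piece : parse(xml)) {
            System.out.println(piece);
        }
    }
}
```

The business logic can then work on the list of `Piece` objects after parsing, instead of being driven directly by the callbacks.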
Although they are not required to, DOM document builders may use
the SAX API to construct a DOM tree from an XML source
document.
DOM Sample Program
DOM (Document Object Model) is a W3C specification presented as "a
platform- and language-neutral interface that will allow programs and
scripts to dynamically access and update the content, structure and
style of documents." It is essentially a tree data structure and
a set of methods to access and edit that structure. Since it's an in-memory data
structure, the memory usage is much higher than for SAX, but the document
model can be accessed randomly and processed multiple times.
The DOM API defines interfaces for each of the entities of an XML
document:
- org.w3c.dom.Node: a single node in the document tree
  - Defines methods to access, insert, remove and replace child nodes
  - Defines methods to access the parent node
  - Defines methods to access the document
- org.w3c.dom.Document: a Node that represents the entire XML document
- org.w3c.dom.Element: a Node that represents an XML element
- org.w3c.dom.Text: a Node that represents the textual content of an XML element
The application may apply its business logic directly to the DOM
tree, or go first through an additional stage of mapping relevant
information from the DOM tree to business objects.
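Because the tree stays in memory, it can be edited in place and then serialized again. The sketch below moves a piece by rewriting the attributes of its POSITION element and serializes the result through an identity transformation, one of the serialization options listed in Table 2. The class and method names are hypothetical, and the code assumes the requested piece exists in the document.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class DomEditSketch {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    // Moves the first piece with the given tag name to a new square.
    // Assumption: at least one such piece exists in the document.
    static void moveFirst(Document document, String pieceTag,
                          String column, String row) {
        Element piece
            = (Element) document.getElementsByTagName(pieceTag).item(0);
        Element position
            = (Element) piece.getElementsByTagName("POSITION").item(0);
        position.setAttribute("COLUMN", column);
        position.setAttribute("ROW", row);
    }

    // Serializes the tree with an identity transformation
    // (a Transformer created without any style sheet).
    static String serialize(Document document) throws Exception {
        Transformer identity
            = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter writer = new StringWriter();
        identity.transform(new DOMSource(document),
                           new StreamResult(writer));
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        Document document = parse(
            "<CHESSBOARD><WHITEPIECES>"
            + "<KING><POSITION COLUMN=\"G\" ROW=\"1\"/></KING>"
            + "</WHITEPIECES><BLACKPIECES>"
            + "<KING><POSITION COLUMN=\"B\" ROW=\"6\"/></KING>"
            + "</BLACKPIECES></CHESSBOARD>");
        moveFirst(document, "KING", "H", "2");
        System.out.println(serialize(document));
    }
}
```

Nothing comparable is possible with a plain SAX event stream, since the events are gone once they have been delivered.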

Illustration 4: When using the DOM API, the application has to
access or edit an in-memory representation of the source document.
The program presented below uses the DOM API to parse and load in
memory an XML document describing a set of chessboard configurations.
It then walks through the resulting DOM tree and outputs the same
chessboard configurations in text format.
Two different implementations are used to highlight the potential
differences in performance when using different methods of the DOM
API: either accessing the elements by their
names or relative to their parents.
Code Sample 14: The XML document processing program
based on the DOM API (ChessboardDOMPrinter.java);
the print method walks down the tree, accessing the elements
relative to their parent nodes.
import java.io.*;
import org.w3c.dom.*;
import org.xml.sax.*;
import javax.xml.parsers.*;
public class ChessboardDOMPrinter {
  private DocumentBuilder builder;
  public void print(String fileName, PrintStream out)
      throws SAXException, IOException {
    Document document = builder.parse(fileName);
    // Level 1: CHESSBOARD children of the root element
    NodeList nodes_i
        = document.getDocumentElement().getChildNodes();
    for (int i = 0; i < nodes_i.getLength(); i++) {
      Node node_i = nodes_i.item(i);
      if (node_i.getNodeType() == Node.ELEMENT_NODE
          && ((Element) node_i).getTagName()
              .equals("CHESSBOARD")) {
        Element chessboard = (Element) node_i;
        // Level 2: WHITEPIECES and BLACKPIECES
        NodeList nodes_j = chessboard.getChildNodes();
        for (int j = 0; j < nodes_j.getLength(); j++) {
          Node node_j = nodes_j.item(j);
          if (node_j.getNodeType() == Node.ELEMENT_NODE) {
            Element pieces = (Element) node_j;
            // Level 3: the individual pieces
            NodeList nodes_k = pieces.getChildNodes();
            for (int k = 0; k < nodes_k.getLength(); k++) {
              Node node_k = nodes_k.item(k);
              if (node_k.getNodeType() == Node.ELEMENT_NODE) {
                Element piece = (Element) node_k;
                // Assumes POSITION is the first child node,
                // with no whitespace before it
                Element position
                    = (Element) piece.getChildNodes().item(0);
                out.println((pieces.getTagName()
                        .equals("WHITEPIECES")
                    ? "White " : "Black ")
                    + piece.getTagName().toLowerCase()
                    + ": "
                    + position.getAttribute("COLUMN")
                    + position.getAttribute("ROW"));
              }
            }
          }
        }
      }
    }
    return;
  }
}
This program walks down the DOM tree generated by the document
builder from the input XML document and:
- Gets all the CHESSBOARD elements.
- For each of the CHESSBOARD elements, gets the BLACKPIECES and WHITEPIECES subelements.
- For each of these BLACKPIECES and WHITEPIECES elements, gets the KING, QUEEN, BISHOP, ROOK, KNIGHT and PAWN subelements.
- For each of these KING, QUEEN, BISHOP, ROOK, KNIGHT and PAWN elements, prints the color, the name and the position as specified by the ROW and COLUMN attributes.
An alternative implementation keeps the same algorithm but retrieves
the elements by name with getElementsByTagName. While it seems more
elegant, it may be less efficient, since this method looks up the elements
in the entire subtree, not just at the next level. Overall, in these sample
programs, DOM seems more complicated than SAX. However, the by-name
implementation can be made much simpler. Since the
getElementsByTagName method retrieves all the
descendants having a specified name, we can
collect all the POSITION elements starting from the
document root and then, for each element returned, retrieve the name
of the piece (the tag name of the parent) and the color (based on
the tag name of the grandparent).
Code Sample 15: A "naïve"
implementation of the print method based on the DOM API
and accessing the elements by name rather than by their
relative position in the tree (ChessboardDOMPrinter.java).
public void print(String fileName, PrintStream out)
throws SAXException, IOException {
Document document = builder.parse(fileName);
NodeList positions
= document.getElementsByTagName("POSITION");
for (int i = 0; i < positions.getLength(); i++) {
Element position = (Element) positions.item(i);
Element piece = (Element) position.getParentNode();
Element pieces = (Element) piece.getParentNode();
out.println(
(pieces.getTagName().equals("WHITEPIECES")
? "White " : "Black ")
+ piece.getTagName().toLowerCase() + ": "
+ position.getAttribute("COLUMN")
+ position.getAttribute("ROW"));
}
return;
}
XSLT Sample Program
XSL Transformations is a language for describing how to transform
an XML document (explicitly or implicitly represented as a tree) into
another. XSLT is a tree to tree transformation, from a source tree
to a result tree. It allows defining templates (rules) that will be
applied on elements from the source document and insert elements in
the result tree. The resulting document can be another well-formed
XML document (XML, WML), an HTML document, a text document, or any
other format provided that the proper output method is available.
XSLT uses XPath expressions to query elements from the source tree or
to evaluate document fragments to be inserted into the result tree.
Illustration 5: XSL Transformations transforms an XML source document
into another document, which can be of any format (XML, HTML, text, and
so on), by applying a style sheet.
An XSLT processor reads both a source XML document and an XSL
style sheet. The XSL style sheet is itself a well-formed XML
document. Depending on the implementation, an XSLT engine may be able
to read its input as SAX events or DOM trees and to generate
SAX events or DOM trees as output.
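With JAXP, the choice of input and output representation is expressed through the Source and Result interfaces. The following sketch is not part of the sample programs: the class name DomToDomSketch and the inline document are ours, and an identity transformer stands in for a real style sheet. It feeds a DOM tree to a Transformer and collects the result as another DOM tree.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class name, used for illustration only.
class DomToDomSketch {

    // Parses the XML text into a DOM tree, then copies it into a new
    // DOM tree through a Transformer (DOM tree in, DOM tree out).
    static Document copy(String xml) throws Exception {
        DocumentBuilder builder
            = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document source
            = builder.parse(new InputSource(new StringReader(xml)));

        // newTransformer() without a style sheet performs the
        // identity transformation.
        Transformer transformer
            = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult();
        transformer.transform(new DOMSource(source), result);
        return (Document) result.getNode();
    }

    public static void main(String[] args) throws Exception {
        Document copy = copy("<CHESSBOARDS><CHESSBOARD/></CHESSBOARDS>");
        System.out.println(copy.getDocumentElement().getTagName());
    }
}
```

Replacing DOMSource or DOMResult with their SAX or stream counterparts changes the representation without changing the rest of the code.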
The XSLT sample program uses an XSLT engine and a style sheet to
transform an XML document describing a set of chessboard
configurations into its corresponding text format. A
TransformerFactory is used to create a new Transformer
from the style sheet, and the Transformer
is then used to process the source document.
The XSLT engine applies the following style sheet to each input document:
Code Sample 16: The style sheet transforming the XML
documents representing chessboard configurations
(ChessboardPrinter.xsl)
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:strip-space elements="*" />
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:apply-templates />
</xsl:template>
<xsl:template match="WHITEPIECES/*">
<xsl:value-of select="concat('White ', name(),
': ', POSITION/@COLUMN, POSITION/@ROW)" />
<xsl:text>
</xsl:text>
</xsl:template>
<xsl:template match="BLACKPIECES/*">
<xsl:value-of select="concat('Black ', name(),
': ', POSITION/@COLUMN, POSITION/@ROW)" />
<xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
This style sheet is quite simple. It defines two templates, one
that matches the white pieces (pattern "WHITEPIECES/*")
and the other one that matches the black pieces (pattern
"BLACKPIECES/*").
When one of those templates applies, it inserts in the result tree a
text node containing a string describing the color, the name and the
position of the matched piece. When no template applies any more, the
result tree is printed using the text output method.
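The Java side of the XSLT program is not listed in this section. A minimal sketch of it, assuming only the standard JAXP transformation API, might look as follows. The class name XsltPrinterSketch, the transform method and the inline strings are ours; the style sheet is a condensed copy of Code Sample 16 with the line break written as a character reference.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical class name, used for illustration only.
class XsltPrinterSketch {

    // Condensed copy of the style sheet of Code Sample 16.
    static final String STYLE_SHEET =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " version='1.0'>"
        + "<xsl:strip-space elements='*'/>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'><xsl:apply-templates/></xsl:template>"
        + "<xsl:template match='WHITEPIECES/*'>"
        + "<xsl:value-of select=\"concat('White ', name(), ': ',"
        + " POSITION/@COLUMN, POSITION/@ROW)\"/>"
        + "<xsl:text>&#10;</xsl:text>"
        + "</xsl:template>"
        + "<xsl:template match='BLACKPIECES/*'>"
        + "<xsl:value-of select=\"concat('Black ', name(), ': ',"
        + " POSITION/@COLUMN, POSITION/@ROW)\"/>"
        + "<xsl:text>&#10;</xsl:text>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Small input document following the structure used in this article.
    static final String SAMPLE =
        "<CHESSBOARDS><CHESSBOARD>"
        + "<WHITEPIECES><KING><POSITION COLUMN='G' ROW='1'/></KING></WHITEPIECES>"
        + "<BLACKPIECES><QUEEN><POSITION COLUMN='D' ROW='8'/></QUEEN></BLACKPIECES>"
        + "</CHESSBOARD></CHESSBOARDS>";

    // Creates a Transformer from the style sheet and applies it to the
    // source document; the text output is returned as a String.
    static String transform(String xml, String styleSheet) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer = factory.newTransformer(
            new StreamSource(new StringReader(styleSheet)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(transform(SAMPLE, STYLE_SHEET));
    }
}
```

Note that name() returns the element name as written in the document, so the piece names come out in upper case, unlike in the DOM-based programs, which call toLowerCase().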
Illustration 6: XSLT can be used to preprocess or postprocess XML documents
before or after being handled by the application business logic
XPath Sample Program
Some XSLT engines (such as Xalan from Apache.org)
allow their XPath implementation to be invoked independently through
a specific API. An application may directly use XPath to query
information from an XML source document or to evaluate expressions
against the source document.
The following program uses the DOM API to parse and load in memory
an XML document describing a set of chessboard configurations. It
then evaluates XPath expressions to locate the elements to be
processed and outputs the chessboard configurations in text format.
The XPathAPI selectNodeList
and eval methods are invoked
to evaluate the XPath expressions.
Code Sample 17: The XML document processing program
based on the XPath API (ChessboardXPathPrinter.java)
import java.io.*;
import javax.xml.parsers.*;
import javax.xml.transform.*;
import org.apache.xpath.*;
import org.w3c.dom.*;
import org.xml.sax.*;
public class ChessboardXPathPrinter {
private DocumentBuilder builder;
public void print(String fileName, PrintStream out)
throws TransformerException, SAXException, ... {
Document document = builder.parse(fileName);
NodeList allPieces = XPathAPI.selectNodeList(
document.getDocumentElement(),
"//*[self::WHITEPIECES or self::BLACKPIECES]/*");
for (int i = 0; i < allPieces.getLength(); i++) {
Element piece = (Element) allPieces.item(i);
Element pieces = (Element) piece.getParentNode();
Element position
= (Element) piece.getChildNodes().item(0);
out.println((pieces.getTagName()
.equals("WHITEPIECES")
? "White " : "Black ")
+ piece.getTagName().toLowerCase()
+ ": "
+ position.getAttribute("COLUMN")
+ position.getAttribute("ROW"));
// out.println(XPathAPI.eval(piece,
// "concat(substring-before(name(..), \'PIECES\'),"
// + "\' \', name(),\': \',"
// + "POSITION/@COLUMN, POSITION/@ROW)"));
}
return;
}
...
}
The print method first uses an XPath expression to locate the piece
elements, then it prints using the DOM API; the second XPath expression
which has been commented out could have been used for the same purpose.
This program uses an XPath expression to retrieve the elements
that match both the white pieces and the black pieces (pattern
"//*[self::WHITEPIECES or self::BLACKPIECES]/*"). Then, for each of the
KING, QUEEN, BISHOP, ROOK, KNIGHT and PAWN elements retrieved, it gets
the POSITION subelement and prints the color (based on the tag name of
the parent element), the name and the position as specified by the ROW
and COLUMN attributes.
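The program above depends on the Xalan-specific XPathAPI class. As a point of comparison only, the same query can be sketched with the standard javax.xml.xpath package, which was added to JAXP after this article was written and is therefore not what the sample program uses. The class name StandardXPathSketch and the describePieces method are ours.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name, used for illustration only.
class StandardXPathSketch {

    // Returns one line of text per piece found in the document.
    static List<String> describePieces(String xml) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Same expression as in Code Sample 17.
        NodeList allPieces = (NodeList) xpath.evaluate(
            "//*[self::WHITEPIECES or self::BLACKPIECES]/*",
            document, XPathConstants.NODESET);

        List<String> lines = new ArrayList<>();
        for (int i = 0; i < allPieces.getLength(); i++) {
            Element piece = (Element) allPieces.item(i);
            Element pieces = (Element) piece.getParentNode();
            String color = pieces.getTagName().equals("WHITEPIECES")
                ? "White " : "Black ";
            // Relative expression evaluated with the piece as context node.
            String position = xpath.evaluate(
                "concat(POSITION/@COLUMN, POSITION/@ROW)", piece);
            lines.add(color + piece.getTagName().toLowerCase()
                + ": " + position);
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<CHESSBOARDS><CHESSBOARD>"
            + "<WHITEPIECES><KING><POSITION COLUMN='G' ROW='1'/></KING></WHITEPIECES>"
            + "<BLACKPIECES><QUEEN><POSITION COLUMN='D' ROW='8'/></QUEEN></BLACKPIECES>"
            + "</CHESSBOARD></CHESSBOARDS>";
        for (String line : describePieces(xml)) {
            System.out.println(line);
        }
    }
}
```

Because the position is read through a relative XPath expression rather than through getChildNodes().item(0), this variant does not depend on POSITION being the first child node of the piece.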
JDOM Sample Program
On its web site, JDOM
is presented in the following terms:
"JDOM is, quite simply, a Java
representation of an XML document. JDOM provides a way to represent
that document for easy and efficient reading, manipulation, and
writing. It has a straightforward API, is lightweight and fast, and
is optimized for the Java programmer. It's an alternative to DOM and
SAX, although it integrates well with both DOM and SAX."
The following program using the JDOM API parses and loads in
memory an XML document describing a set of chessboard configurations,
then walks through the resulting JDOM document and outputs the same
chessboard configurations in a text format. A SAXBuilder
is used to create a JDOM document from the XML source document. By
default, the JDOM SAXBuilder
relies on JAXP to internally create a SAX parser.
Code Sample 18: The XML document processing program
based on the JDOM API (ChessboardJDOMPrinter.java)
import java.io.*;
import java.util.*;
import org.jdom.*;
import org.jdom.input.*;
import org.xml.sax.*;
public class ChessboardJDOMPrinter {
private static boolean verbose = false;
private SAXBuilder builder;
public ChessboardJDOMPrinter(boolean validating)
throws Exception {
builder = new SAXBuilder();
builder.setValidation(validating);
...
return;
}
public void print(String fileName, PrintStream out)
throws JDOMException {
Document document = builder.build(fileName);
Element root = document.getRootElement();
List chessboards = root.getChildren("CHESSBOARD");
for (int i = 0; i < chessboards.size(); i++) {
Element chessboard = (Element) chessboards.get(i);
String[] pieceSetTags = { "WHITEPIECES",
"BLACKPIECES" };
for (int j = 0; j < pieceSetTags.length; j++) {
List pieceSets
= chessboard.getChildren(pieceSetTags[j]);
for (int k = 0; k < pieceSets.size(); k++) {
Element pieceSet = (Element) pieceSets.get(k);
String[] pieceTags = { "KING", "QUEEN", "BISHOP",
"ROOK", "KNIGHT", "PAWN" };
for (int l = 0; l < pieceTags.length; l++) {
List pieces = pieceSet.getChildren(pieceTags[l]);
for (int m = 0; m < pieces.size(); m++) {
Element piece = (Element) pieces.get(m);
Element position = piece.getChild("POSITION");
out.println(
(j == 0 ? "White " : "Black ")
+ pieceTags[l].toLowerCase() + ": "
+ position.getAttributeValue("COLUMN")
+ position.getAttributeValue("ROW"));
}
}
}
}
}
return;
}
...
}
This program is very similar to the DOM sample
program. It walks down the JDOM tree generated by the document
builder from the input XML document. It:
- Gets all the CHESSBOARD elements.
- For each of the CHESSBOARD elements, it gets the BLACKPIECES
and WHITEPIECES subelements.
- For each of those BLACKPIECES and WHITEPIECES elements, it
gets the KING, QUEEN, BISHOP, ROOK, KNIGHT and PAWN subelements.
- For each of these KING, QUEEN, BISHOP, ROOK, KNIGHT and PAWN
elements, it gets the POSITION subelement and prints the color,
the name and the position as specified by the ROW and COLUMN
attributes.
YAXA Sample Program
This program uses the YAXA API to parse XML documents describing a
set of chessboard configurations, and outputs the same configurations
in a simple human-readable text format. YAXA is an experimental API
which works on top of SAX and adds the classical Java
Event/Source/Listener paradigm. The program below implements the
elementStarted method of an
ElementEvent.Adapter. This
ElementEvent.Adapter is
registered with an XMLPathTracker
(which subclasses a SAX DefaultHandler)
and listens for ElementEvents that match a specific pattern.
Code Sample 19: The XML document processing program
based on the YAXA API (ChessboardYAXAPrinter.java)
import java.io.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import javax.xml.parsers.*;
import yaxa.xml.*;
import yaxa.xml.events.*;
public class ChessboardYAXAPrinter {
private SAXParser parser;
public class ChessboardHandler
extends XMLEventTracker {
public ChessboardHandler() {
addElementListener(new XMLEventTrack.Pattern(
"//(WHITEPIECES|BLACKPIECES)/*/POSITION"),
new ElementEvent.Adapter() {
public void elementStarted(ElementEvent event) {
XMLEventTracker tracker
= (XMLEventTracker) event.getSource();
XMLEventTrack.Builder builder
= tracker.getCurrentTrackBuilder();
String colorName
= (builder.getEventName(builder.getLength()-3)
.equals("WHITEPIECES")
? "White " : "Black ");
String pieceName
= builder.getEventName(builder.getLength()-2);
out.println(colorName + pieceName + ": "
+ event.getAttributes().getValue("COLUMN")
+ event.getAttributes().getValue("ROW"));
}
});
return;
}
}
}
This program catches all the element events that match a specific
pattern ("//(WHITEPIECES|BLACKPIECES)/*/POSITION").
Then, for each of the matching elements (that is, the POSITION
elements), it prints out the color, the name and the position of the
corresponding piece.
YAXA also implements an event stream editor based on the previous
model which directly edits the flow of events fired by a SAX parser.
Code Sample 20: The program using the YAXA XML stream
editor to process chessboard configurations
(ChessboardXSEPrinter.java)
import javax.xml.parsers.*;
import org.xml.sax.*;
import yaxa.xml.*;
import yaxa.editor.*;
public class ChessboardXSEPrinter {
public static void main(String[] args) {
...
XMLStreamEditor editor = new XMLStreamEditor(args[0]);
XMLEventWriter writer = new XMLEventWriter("UTF-8",
false,
editor.getOutputFormat()
.equals(XMLEventWriter.TEXT_FORMAT),
editor.isStrippingSpace());
writer.setWriter(out);
editor.setRedispatcher(writer);
SAXParser parser
= SAXParserFactory.newInstance().newSAXParser();
parser.parse(args[1], editor);
editor.clear();
writer.flush();
...
}
}
Code Sample 21: The stream editing script
transforming the XML documents representing chessboard configurations
(ChessboardPrinter.xse)
<?xml version="1.0"?>
<!DOCTYPE xse:edit SYSTEM "file:./xse.dtd" >
<xse:edit>
<xse:output xse:format="text" xse:strip-space="true" />
<xse:print
xse:at='CHESSBOARDS/CHESSBOARD/WHITEPIECES/*/POSITION'
xse:data="White $[track{-1}]: \
$[element{@COLUMN}]$[element{@ROW}]" />
<xse:print
xse:at='CHESSBOARDS/CHESSBOARD/BLACKPIECES/*/POSITION'
xse:data="Black $[track{-1}]: \
$[element{@COLUMN}]$[element{@ROW}]" />
</xse:edit>
This script defines two commands, one that is triggered for
each white piece (pattern
"CHESSBOARDS/CHESSBOARD/WHITEPIECES/*/POSITION")
and the other that is triggered for each black piece (pattern
"CHESSBOARDS/CHESSBOARD/BLACKPIECES/*/POSITION").
These editing commands are actually implemented as event listeners
that watch for those particular events. When one of those commands is
triggered, it inserts in the event output stream a new text event
containing a string describing the color, the name and the position
of the matched piece. The event output stream is written as plain
text.
Generating XML Documents
So far in this article, we have focused on parsing input XML
documents and processing the resulting internal data structures,
whether explicit or implicit. We have not yet covered the
generation of XML documents from an internal data structure -- the
serialization to XML.
To serialize a data structure to XML, two main solutions are
available:
- Generating the XML document directly "by hand".
- Constructing a corresponding DOM or a
JDOM tree and serializing it to generate the XML
document.
The second solution is preferable because it ensures a cleanly generated
document (no unclosed tags). Moreover, having the XML document
represented internally as a DOM tree, though it requires more memory
resources, allows for more effective post-processing, such as
applying XSLT style sheets.
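The second solution can be sketched with JAXP alone, using an identity Transformer as the serializer. This is only an illustration under our own assumptions: the class name DomSerializerSketch and the whiteKingAt method are hypothetical, and the element names follow the chessboard documents used throughout this article.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical class name, used for illustration only.
class DomSerializerSketch {

    // Builds a DOM tree describing a single white king and
    // serializes it to a String.
    static String whiteKingAt(String column, String row) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();

        Element chessboards = document.createElement("CHESSBOARDS");
        document.appendChild(chessboards);
        Element chessboard = document.createElement("CHESSBOARD");
        chessboards.appendChild(chessboard);
        Element whitePieces = document.createElement("WHITEPIECES");
        chessboard.appendChild(whitePieces);
        Element king = document.createElement("KING");
        whitePieces.appendChild(king);
        Element position = document.createElement("POSITION");
        position.setAttribute("COLUMN", column);
        position.setAttribute("ROW", row);
        king.appendChild(position);

        // An identity Transformer writes the tree out; every tag is
        // closed by the serializer, not by hand.
        Transformer serializer
            = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(document), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(whiteKingAt("G", "1"));
    }
}
```

The same DOMSource could be passed to a Transformer created from a style sheet, which is what makes the in-memory tree convenient for post-processing.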
The eMobile sample end-to-end application
demonstrated at the JavaOne 2000
Conference provided a good example of how servlets
can generate content targeted at different devices from value objects
returned by an EJB application using the following steps:
- Constructing the corresponding DOM tree.
- Generating the content from the DOM tree as follows:
- If an appropriate XSLT style sheet is available, transforming
the DOM tree to the targeted content type by applying the style
sheet.
- If no appropriate XSLT style sheet is available,
directly serializing the DOM tree to generate the XML stream.
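The second step above can be sketched with the JAXP transformation API as follows. This is not the eMobile code: the class name ContentGeneratorSketch, the generate and parse methods and the sample style sheet are hypothetical, and the sketch assumes that a missing style sheet is represented by null.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical class name, used for illustration only.
class ContentGeneratorSketch {

    // Hypothetical style sheet producing a plain-text summary.
    static final String STYLE_SHEET =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " version='1.0'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>Boards: "
        + "<xsl:value-of select='count(//CHESSBOARD)'/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Helper that stands in for "constructing the corresponding DOM tree".
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    // Applies the style sheet when there is one; otherwise serializes
    // the DOM tree directly with an identity Transformer.
    static String generate(Document tree, Source styleSheet) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer = (styleSheet != null)
            ? factory.newTransformer(styleSheet)
            : factory.newTransformer();
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(tree), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document tree = parse("<CHESSBOARDS><CHESSBOARD/></CHESSBOARDS>");
        System.out.println(generate(tree,
            new StreamSource(new StringReader(STYLE_SHEET))));
        System.out.println(generate(tree, null));
    }
}
```

Both branches go through the same Transformer interface, so the calling code does not need to know which XSLT engine is plugged in.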
At the time the eMobile application was demonstrated,
JAXP (version 1.0) only included support for SAX and DOM processing
and therefore it could not be used to apply XSLT style sheets.
Adapters had to be implemented to abstract the invocation of the
style sheet engines. With the latest release and its adoption by
many XSLT engines and XML parsers, applications such as eMobile can
completely rely on JAXP and benefit from its pluggability feature.
Conclusion
In this article, we demonstrated a sample of the
technologies available to developers in order to process XML
documents. Those technologies address different levels of abstraction
and provide different levels of usability for the Java programmer.
Some of those technologies like SAX, DOM, XPath and XSLT may rely on
each other. For example, a DOM document builder may use a SAX parser
to generate a DOM tree from an XML document, or an XPath processor
may apply expressions to a source DOM tree, or an XSLT engine may use
an XPath processor to match and evaluate style sheet templates. Since
these technologies stack up, an XML application developer can often
choose any of them to achieve the same result, but each choice
involves trade-offs between performance, memory usage and
flexibility.
XSLT and XPath may bring flexibility to your application due to
their scripting nature, and may be used to pre- and post-process XML
documents acquired and generated by the application. SAX may bring
efficiency when mapping XML data to Java business objects. DOM and
especially JDOM may bring simplicity and efficiency when intensively
editing in-memory document models. In some cases, a document object
model may be suitable as the core data structure of the application.
All those APIs need to be considered when implementing an XML
application. Moreover, with the SAX 2.0 API and the
interchangeability (as source and result) of SAX and DOM in some
APIs, complex XML processing pipelines can be built by combining
standard and custom implementations.
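As an illustration of such a pipeline, the following sketch connects a SAX parser to a JAXP TransformerHandler whose result is a DOM tree, so that SAX events go in and a DOM document comes out. The class name SaxToDomPipelineSketch and the build method are ours, and the sketch assumes that the installed TransformerFactory is a SAXTransformerFactory, which production code should verify with getFeature before casting.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical class name, used for illustration only.
class SaxToDomPipelineSketch {

    // Parses the XML text with a SAX parser and lets an identity
    // TransformerHandler turn the SAX events into a DOM tree.
    static Document build(String xml) throws Exception {
        SAXParserFactory parserFactory = SAXParserFactory.newInstance();
        parserFactory.setNamespaceAware(true);
        XMLReader reader = parserFactory.newSAXParser().getXMLReader();

        // Assumption: the factory supports the SAX extensions.
        SAXTransformerFactory transformerFactory
            = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler
            = transformerFactory.newTransformerHandler();
        DOMResult result = new DOMResult();
        handler.setResult(result);

        // The handler is a SAX ContentHandler, so a custom SAX filter
        // could be inserted between the reader and the handler.
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return (Document) result.getNode();
    }

    public static void main(String[] args) throws Exception {
        Document document = build("<CHESSBOARDS><CHESSBOARD/></CHESSBOARDS>");
        System.out.println(document.getDocumentElement().getNodeName());
    }
}
```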
In the second article, we will benchmark the different sample
programs presented in this document and analyze their respective
performance when run with different XML parser implementations.
Resources
Java Technology & XML
Java APIs for XML Processing (JAXP)
The Simple API for XML (SAX)
Document Object Model (DOM)
Extensible Stylesheet Language (XSL)
XML Path Language (XPath)
Xerces - Apache XML Parser for Java
Crimson - Apache XML Parser for Java
Xalan - Apache XSLT Style Sheet Engine
JDOM
YAXA
eMobile End-to-End Application using the Java 2, Enterprise
Edition - Part II
About the Author
Thierry Violleau is a staff engineer at Sun Microsystems, where he works
on the J2EE BluePrints program. Previously, he worked in the Market Development
Engineering - Enabling Technologies group, where he helped ISVs integrate Java and XML technologies in their products and solutions.