This article explains some of the new concepts and important
features introduced in the Java API for XML Processing (JAXP) 1.3.
JSR 206 was
developed with performance and ease of use in mind. The new
Validation Framework gives much more power to any application
dealing with XML schema and improves performance significantly.
XPath APIs provide access to the XPath evaluation environment. JAXP
1.3 brings richer XML Schema data type support to the Java platform
by defining new data types that map to data types defined in W3C XML
Schema: Datatypes specification.
Keeping pace with the evolution of XML standards, JAXP 1.3 also
adds complete support for the following standards: XML 1.1, Document Object Model (DOM) L3, XInclude,
and Simple API
for XML (SAX) 2.0.2. All this has already gone into the Java
platform in the latest release of the Java Platform,
Standard Edition (J2SE) 5.0, code-named Tiger. If you are using
J2SE 1.3 or 1.4, you can download a
stand-alone stable implementation of JAXP 1.3 from java.net.
This article mainly concentrates on the work done as part of the
JSR 206 effort and explains new Schema Validation Framework concepts,
along with providing working code and diagrams. All the samples are
available for download from here.
The major new features introduced are the following:
Schema Validation Framework
JAXP 1.3 introduces a new
schema-independent Validation Framework (called the Validation
APIs). This new framework gives much more power to the application
dealing with XML schema and can accomplish things that were not
possible before. The new approach makes a fundamental shift in the
way XML processing and validation are performed. Validation used to
be considered an integral part of XML parsing, and previous versions
of JAXP supported validation as a feature of an XML parser: a
SAXParser or DocumentBuilder instance.
The new Validation APIs decouple
validation of an instance document from parsing and treat it as an
independent process. This new approach has several advantages. Applications that
rely heavily on XML schema can greatly improve the performance of
schema validation. Perhaps more importantly, many previously
unsolvable problems can now be solved in an efficient, easy, and
secure way. Let's look at what you can do with the new Schema
Validation Framework.
Validate XML Against Any Schema
Though JAXP 1.3 requires
support only for W3C XML schema language, you can easily plug in
support for other schema languages, such as RELAX
NG. The Validation APIs provide a pluggability layer through
which applications can provide specialized validation libraries
supporting additional schema languages. This is achieved
using a SchemaFactory class that is capable of locating
implementations for the schema languages at runtime. The first step
is to specify the schema language to be used and obtain the concrete
factory implementation:
SchemaFactory sf = SchemaFactory.newInstance(<SCHEMA LANGUAGE>);
// <SCHEMA LANGUAGE> could be W3C XML Schema, RELAX NG, and so on.
If this method returns
successfully, it means that an implementation capable of supporting the
specified schema language is available. Getting the SchemaFactory
implementation is the entry point to the Validation APIs. This step
goes through the pluggability mechanism that has long been at the
core of JAXP. You can write the code in such a way that applications
can switch between W3C XML Schema and RELAX NG validation without
changing a single line of code.
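As a minimal sketch of this lookup, the following class probes whether a SchemaFactory can be located for a given schema language URI. The class name SchemaLanguageProbe and the method isSupported are hypothetical names chosen for illustration. Whether RELAX NG is reported as supported depends on which validation libraries are available at runtime.

```java
import javax.xml.XMLConstants;
import javax.xml.validation.SchemaFactory;

public class SchemaLanguageProbe {

    /** Returns true if a SchemaFactory for the given schema language can be located. */
    public static boolean isSupported(String schemaLanguage) {
        try {
            SchemaFactory.newInstance(schemaLanguage);
            return true;
        } catch (IllegalArgumentException e) {
            // Thrown when no implementation of the schema language is available.
            return false;
        }
    }

    public static void main(String[] args) {
        // The language URI could come from configuration instead of being hard-coded.
        String[] languages = {
            XMLConstants.W3C_XML_SCHEMA_NS_URI,
            XMLConstants.RELAXNG_NS_URI
        };
        for (String language : languages) {
            System.out.println(language + " supported: " + isSupported(language));
        }
    }
}
```

Because the schema language is just a string, an application can read it from a property file and leave the rest of the validation code unchanged.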
Compile
Schema
With the new Validation APIs,
an application has the option to parse only the schema, checking
schema syntax and semantics against the constraints that the
particular schema language imposes. This is quite useful when you are
writing a schema and want to make sure that the schema conforms to
the specification. The SchemaFactory class does this
job, loading the schemas and also preparing them in a special form
represented as a javax.xml.validation.Schema object that
can be used for validating instance documents against the schema. A
schema may include or import other schemas. In that case, those
schemas are also loaded.
When reading a schema, a
SchemaFactory may need to resolve resources and can
encounter errors. As Figure 1 indicates, LSResourceResolver
and an ErrorHandler can be registered on SchemaFactory.
The ErrorHandler is used to report any errors
encountered during schema compilation. The LSResourceResolver
is used to customize resolution of resources. This is a new interface
introduced as part of DOM L3. Functionally, it is the same as SAX
EntityResolver, except that it also provides the
information about the namespace of the resource being resolved -- for
example, the targetNamespace of the W3C XML schema.
Figure 1. Getting the Schema Object
Here is a code sample that
shows how SchemaFactory can be used to compile schema
and get a Schema object:
String language = XMLConstants.W3C_XML_SCHEMA_NS_URI;
SchemaFactory factory = SchemaFactory.newInstance(language);
factory.setErrorHandler(new MyErrorHandler());
factory.setResourceResolver(new MyLSResourceResolver());
StreamSource ss = new StreamSource(new File("mySchema.xsd"));
Schema schema = factory.newSchema(ss);
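The MyErrorHandler class used in the sample is not shown in this article. The following is a minimal sketch of what such a handler might look like, assuming the application wants to collect schema problems rather than stop at the first one. The names SchemaCompileCheck and CollectingErrorHandler and the inline schemas are hypothetical. Whether newSchema() also throws after reporting to the handler can vary, so the sketch handles both cases.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class SchemaCompileCheck {

    /** Hypothetical handler that records every reported problem instead of throwing. */
    static class CollectingErrorHandler implements ErrorHandler {
        final List<String> messages = new ArrayList<String>();
        public void warning(SAXParseException e) { messages.add("warning: " + e.getMessage()); }
        public void error(SAXParseException e) { messages.add("error: " + e.getMessage()); }
        public void fatalError(SAXParseException e) { messages.add("fatal: " + e.getMessage()); }
    }

    /** Compiles the schema text and returns the problems found (empty if none). */
    public static List<String> compile(String schemaText) {
        CollectingErrorHandler handler = new CollectingErrorHandler();
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        factory.setErrorHandler(handler);
        try {
            factory.newSchema(new StreamSource(new StringReader(schemaText)));
        } catch (SAXException e) {
            // Keep the exception message if the handler recorded nothing.
            if (handler.messages.isEmpty()) {
                handler.messages.add("exception: " + e.getMessage());
            }
        }
        return handler.messages;
    }

    public static void main(String[] args) {
        String prefix = "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>";
        String good = prefix + "<xs:element name='note' type='xs:string'/></xs:schema>";
        // The type below does not exist, so compilation should report a problem.
        String bad = prefix + "<xs:element name='note' type='xs:noSuchType'/></xs:schema>";
        System.out.println("good schema problems: " + compile(good));
        System.out.println("bad schema problems: " + compile(bad));
    }
}
```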
A Schema object
is an immutable memory representation of schema. A
Schema instance can be shared with many different
parser instances, even if they are running in different threads. You
can write applications so that the same set of schemas is parsed
only once and the same Schema instance is passed to
different instances of the parser.
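The sharing described above can be sketched as follows: the schema is compiled once, and each worker thread creates its own Validator from the shared Schema. This is an illustration under stated assumptions, not the article's downloadable sample. The class name SharedSchemaDemo, the inline schema, and the thread pool size are all hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

public class SharedSchemaDemo {

    // Illustrative schema: a single <note> element holding a string.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='note' type='xs:string'/></xs:schema>";

    /** Validates each document on a worker thread against one shared Schema. */
    public static int countValid(List<String> documents) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            // Compiled once; the Schema object itself is shared by all tasks.
            final Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            List<Future<Boolean>> results = new ArrayList<Future<Boolean>>();
            for (final String doc : documents) {
                Callable<Boolean> task = () -> {
                    // A Validator is not shared: each task creates its own.
                    Validator validator = schema.newValidator();
                    try {
                        validator.validate(new StreamSource(new StringReader(doc)));
                        return true;
                    } catch (SAXException e) {
                        return false; // invalid or malformed document
                    }
                };
                results.add(pool.submit(task));
            }
            int valid = 0;
            for (Future<Boolean> result : results) {
                if (result.get()) {
                    valid++;
                }
            }
            return valid;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("<note>a</note>", "<note>b</note>", "<wrong/>");
        System.out.println("valid documents: " + countValid(docs));
    }
}
```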
Validate XML Using Compiled Schema
Before we look at this approach,
let's look at how we have been doing schema validation using the
schema properties that were defined in JAXP 1.2:
http://java.sun.com/xml/jaxp/properties/schemaLanguage
http://java.sun.com/xml/jaxp/properties/schemaSource
Here is an example showing how these two properties are used in JAXP
1.3:
SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setNamespaceAware(true);
spf.setValidating(true);
SAXParser sp = spf.newSAXParser();
sp.setProperty("http://java.sun.com/xml/jaxp/properties/schemaLanguage",
"http://www.w3.org/2001/XMLSchema");
sp.setProperty("http://java.sun.com/xml/jaxp/properties/schemaSource",
"mySchema.xsd");
sp.parse(<XML Document>, <ContentHandler>);
The user sets the
schemaLanguage and/or the schemaSource
property on SAXParser and sets the validation to
true. Generally, a business application defines a set
of schemas containing the business rules against which XML documents
must be validated. To accomplish this, an application sets the
schema using the schemaSource property or relies on the
xsi:schemaLocation attribute in the
instance document to specify the schema location(s).
This approach works well, but there
is a tremendous performance penalty: The specified schemas are loaded
again and again for every XML document that needs to be validated!
However, with the new Validation APIs, an application needs to parse
a set of schemas only once. See Figure 2.
Figure 2. Set Compiled Schema on DocumentBuilder/SAXParserFactory
After the Compile
Schema step, do the following.
SAXParserFactory spf = SAXParserFactory.newInstance();
// W3C XML Schema validation needs a namespace-aware parser.
spf.setNamespaceAware(true);
spf.setSchema(schema);
SAXParser saxParser = spf.newSAXParser();
saxParser.parse(new File("instance.xml"), myHandler);
Just set the Schema
instance on the factory and you are done. There is no need to set the
validation to true and no need to set the schemaLanguage
or schemaSource property. Validation of XML documents is
done against the compiled schema set on the factory. You will be
amazed by the performance gain using this approach. Try it yourself.
Run the sample ComparePerformance.java,
which can be downloaded from here.
Performance gain largely depends on the ratio of the size of the XML
schema to the size of the XML document. Larger ratios lead to a
larger performance gain. Look at the Reusing
a Parser Instance section to further improve the performance.
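The same approach applies to DOM parsing. The following is a minimal sketch, assuming an inline schema and a hypothetical class name DomSchemaParse, that sets a compiled Schema on a DocumentBuilderFactory and collects validation errors through the ErrorHandler registered on the DocumentBuilder.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

public class DomSchemaParse {

    // Illustrative schema: a single <note> element holding a string.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='note' type='xs:string'/></xs:schema>";

    /** Parses the XML text into a DOM while validating; returns the validation errors. */
    public static List<String> parse(String xml) {
        final List<String> errors = new ArrayList<String>();
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            dbf.setSchema(schema); // no setValidating(true), no schema properties
            DocumentBuilder builder = dbf.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { errors.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("errors for <note>: " + parse("<note>hi</note>"));
        System.out.println("errors for <memo>: " + parse("<memo>hi</memo>"));
    }
}
```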
Note that it is an error to use either of the following properties
in conjunction with a non-null Schema object:
http://java.sun.com/xml/jaxp/properties/schemaLanguage
http://java.sun.com/xml/jaxp/properties/schemaSource
Such a configuration causes a SAXException when those
properties are set on a SAXParser. With a
DocumentBuilderFactory, it causes a ParserConfigurationException
when newDocumentBuilder() is called.
Validate a
SAXSource or DOMSource
As we mentioned earlier, there has been a fundamental shift in XML
parsing and validation. Now XML validation is considered a process
independent from XML parsing. Once you have the Schema
instance loaded into memory, you can do many things. You can create
a ValidatorHandler that can validate a SAX stream or create a
stand-alone Validator (see Figure 3). A stand-alone
Validator can validate a SAXSource, a
DOMSource, or an XML document against any schema. In
fact, a Validator can still work if the
SAX stream or DOM object comes from a
different implementation.
Figure 3. Validate a SAXSource or DOMSource Using a Validator
To receive any errors during the validation, an
ErrorHandler should be registered with the
Validator. Let's look at some working code. (Note: For
clarity, only a section of code is shown here. For the complete
source, look at the sample Validate.java,
which can be downloaded here.)
Validator validator = schema.newValidator();
validator.setErrorHandler( new ErrorHandlerImpl());
validator.validate(new StreamSource(<XML Document>));
Validator can also be used to validate the instance
document or DOM object in memory, with the augmented
result sent to DOMResult.
Document document = <DOM Document>; // an in-memory DOM object
validator.validate(new DOMSource(document), new DOMResult());
The Validation APIs can validate a SAX stream and work
in conjunction with Transformation APIs to achieve pipeline
processing, as we will see in the next section.
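As a hedged illustration of validating a DOM tree in memory, the sketch below parses a document into a DOM, validates it with a stand-alone Validator, and sends the result to a DOMResult backed by a new Document. The class name DomValidateDemo and the inline schema (which declares a default value for a currency attribute) are hypothetical. Whether the default attribute shows up in the result tree depends on the implementation's augmentation, so the code only prints it.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DomValidateDemo {

    // Illustrative schema: <item> with an optional currency attribute defaulting to USD.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='item'><xs:complexType><xs:simpleContent>"
        + "<xs:extension base='xs:string'>"
        + "<xs:attribute name='currency' type='xs:string' default='USD'/>"
        + "</xs:extension></xs:simpleContent></xs:complexType></xs:element>"
        + "</xs:schema>";

    /** Validates an in-memory DOM and returns the Document that received the result. */
    public static Document validateToDom(String xml) {
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // validation works on namespace-aware DOM nodes
            DocumentBuilder builder = dbf.newDocumentBuilder();
            Document source = builder.parse(new InputSource(new StringReader(xml)));
            Document target = builder.newDocument();
            Validator validator = schema.newValidator();
            // With no ErrorHandler set, a validation error is thrown as a SAXException.
            validator.validate(new DOMSource(source), new DOMResult(target));
            return target;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document result = validateToDom("<item>book</item>");
        System.out.println("root element: " + result.getDocumentElement().getTagName());
        System.out.println("currency attribute in result: '"
                + result.getDocumentElement().getAttribute("currency") + "'");
    }
}
```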
Validate XML
After Transformation
Transformation APIs are used to transform one XML document into
another by applying a style sheet. There are times when we need to
validate the transformed XML document against a schema. Should we
feed that XML document to a parser and then use the schema feature to
do the schema validation? No. The new Validation APIs give you the
power to validate the transformed XML document against a different
schema by allowing the application to create a pipeline and pass the
output of a transformer to the Validation APIs to validate against
the desired schema. It doesn't matter if the output of the
transformation is a SAX stream or a DOM in
memory.
Validate a SAX Stream
The following code snippet shows you how to use specially
designed javax.xml.validation.ValidatorHandler
to validate a SAX stream. In the downloadable
source, look at the sample ValidateSAXStream.java
for more detail. Also look at the sample
TransformerValidationHandler.java, which
shows how to chain the output of Transformer
to ValidatorHandler. Here
is a section of the code:
String language = XMLConstants.W3C_XML_SCHEMA_NS_URI;
SchemaFactory sf = SchemaFactory.newInstance(language);
Schema schema = sf.newSchema(new File(<SCHEMA>));
ValidatorHandler vh = schema.newValidatorHandler();
vh.setErrorHandler(new ErrorHandlerImpl());
vh.setContentHandler(new ApplicationContentHandler());
TransformerFactory tf = TransformerFactory.newInstance();
StreamSource ss = new StreamSource(<STYLESHEET>);
Transformer t = tf.newTransformer(ss);
StreamSource xml = new StreamSource(<XML DOCUMENT>);
// The transformer output is sent as SAX events to the ValidatorHandler.
t.transform(xml, new SAXResult(vh));
Figure 4 shows the whole flow, with an XML
document and a style sheet given as input to a
Transformer and a SAX stream as the
output. We take advantage of the modular approach of doing
validation independent from parsing. The
ValidatorHandler is a special handler that is capable
of working directly with a SAX stream. It validates the
stream and passes it to the application.
Figure 4. Validating a SAX Stream
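The pipeline in Figure 4 can be sketched end to end with in-memory inputs. This is an illustration under assumptions, not the TransformerValidationHandler.java sample itself. The class name TransformThenValidate, the style sheet (which renames item elements to entry elements), and the schema are all hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.ValidatorHandler;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

public class TransformThenValidate {

    // Schema for the transformed output: <entries> with one or more <entry> children.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='entries'><xs:complexType><xs:sequence>"
        + "<xs:element name='entry' type='xs:string' maxOccurs='unbounded'/>"
        + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    // Style sheet that turns <items><item/>...</items> into <entries><entry/>...</entries>.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:template match='/items'><entries>"
        + "<xsl:for-each select='item'><entry><xsl:value-of select='.'/></entry></xsl:for-each>"
        + "</entries></xsl:template></xsl:stylesheet>";

    /** Transforms the input and validates the transformer's SAX output; returns errors. */
    public static List<String> transformAndValidate(String xml) {
        final List<String> errors = new ArrayList<String>();
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
            ValidatorHandler vh = schema.newValidatorHandler();
            vh.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { errors.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            // No intermediate document is serialized or reparsed.
            t.transform(new StreamSource(new StringReader(xml)), new SAXResult(vh));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println(transformAndValidate("<items><item>a</item><item>b</item></items>"));
        // An empty <items/> yields <entries/>, which lacks the required <entry>.
        System.out.println(transformAndValidate("<items/>"));
    }
}
```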
Validate DOM in Memory
The Transformation APIs also allow a transformed result to be
obtained as a DOM object. The DOM object in
memory can be validated against a schema. This can be done as
follows:
DOMResult dr = new DOMResult();
t.transform(xml, dr);
// Wrap the transformed tree and validate it without reparsing.
DOMSource ds = new DOMSource(dr.getNode());
schema.newValidator().validate(ds);
So you see that the Validation APIs can be used with the
Transformation APIs to do complex things easily. This approach also
boosts performance because it avoids the step of parsing the XML
again when validating a transformed XML document.
Validate a JDOM Document
The ValidatorHandler can be used to validate various
object models such as JDOM against the schema(s). In
fact, any object model (XOM, DOM4J, and so
on) that can be built on top of a SAX stream or can emit
SAX events can be used with the Schema Validation
Framework to validate an XML document against a schema. This is
possible because ValidatorHandler can validate a SAX
stream.
Let's see how a JDOM document can be validated
against schema(s):
SAXOutputter so = new SAXOutputter(vh);
so.output(jdomDocument);
It is that simple. JDOM has a way to output a JDOM
document as a stream of SAX events. SAXOutputter
fires SAX events that are validated by ValidatorHandler.
Any error encountered is reported through ErrorHandler
set on ValidatorHandler.
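JDOM is a separate library, so the following sketch uses only the Java platform APIs to show the same idea: any code that can fire SAX events can feed a ValidatorHandler. Here the events are fired by hand, playing the role that SAXOutputter plays for JDOM. The class name SaxEventsValidation and the inline schema are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.ValidatorHandler;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.AttributesImpl;

public class SaxEventsValidation {

    // Illustrative schema: a single <greeting> element holding a string.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='greeting' type='xs:string'/></xs:schema>";

    /** Fires SAX events for <name>text</name> at a ValidatorHandler; returns errors. */
    public static List<String> fireAndValidate(String name, String text) {
        final List<String> errors = new ArrayList<String>();
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
            ValidatorHandler vh = schema.newValidatorHandler();
            vh.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) { errors.add(e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            char[] chars = text.toCharArray();
            // The element is in no namespace, so the namespace URI is the empty string.
            vh.startDocument();
            vh.startElement("", name, name, new AttributesImpl());
            vh.characters(chars, 0, chars.length);
            vh.endElement("", name, name);
            vh.endDocument();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println("greeting: " + fireAndValidate("greeting", "hello"));
        System.out.println("farewell: " + fireAndValidate("farewell", "bye"));
    }
}
```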
Obtain Schema Type
Information
ValidatorHandler can give access to TypeInfoProvider,
which can be queried to access the type information determined by the
validator. This object is dynamic in nature and returns the type
information of the current element or attribute assessed by the
ValidatorHandler during validation of the XML document.
This interface allows an application to know three things:
Whether the attribute is declared as an ID type
Whether the attribute was declared in the original XML document or was added by the Validator during validation
What type information, as declared in the schema, is associated with the current element or attribute
Type information is returned as an org.w3c.dom.TypeInfo
object, which is defined as part of DOM L3. The TypeInfo
object returned is immutable, and the caller can keep references to
the obtained TypeInfo object longer than the callback scope. The
methods of this interface may only be called by the startElement
event of the ContentHandler that the application sets on
the ValidatorHandler. For example, look at the section
of the code below. (Note: For clarity, only part of the code is
shown here. For the complete source, look at the sample
SchemaTypeInformation.java, which can be
downloaded from here.)
ValidatorHandler vh = schema.newValidatorHandler();
vh.setErrorHandler(eh);
vh.setContentHandler(new MyContentHandler(vh.getTypeInfoProvider()));
SAXParserFactory spf = SAXParserFactory.newInstance();
spf.setNamespaceAware(true);
XMLReader reader = spf.newSAXParser().getXMLReader();
reader.setContentHandler(vh);
reader.parse(<XML Document>);
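MyContentHandler is not shown in this article. A minimal sketch of such a handler, assuming it only wants the schema type name of each element, might query the TypeInfoProvider from startElement as follows. The class name TypeInfoDemo, the inline schema, and the output format are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.TypeInfoProvider;
import javax.xml.validation.ValidatorHandler;
import org.w3c.dom.TypeInfo;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class TypeInfoDemo {

    // Illustrative schema with a named complex type and two simple-typed children.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='order' type='OrderType'/>"
        + "<xs:complexType name='OrderType'><xs:sequence>"
        + "<xs:element name='quantity' type='xs:positiveInteger'/>"
        + "<xs:element name='price' type='xs:decimal'/>"
        + "</xs:sequence></xs:complexType></xs:schema>";

    /** Returns one "element -> type name" entry per element, in document order. */
    public static List<String> elementTypes(String xml) {
        final List<String> types = new ArrayList<String>();
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
            ValidatorHandler vh = schema.newValidatorHandler();
            final TypeInfoProvider provider = vh.getTypeInfoProvider();
            vh.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    // Queried inside the startElement callback, as the text requires.
                    TypeInfo info = provider.getElementTypeInfo();
                    types.add(qName + " -> " + (info == null ? "unknown" : info.getTypeName()));
                }
            });
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            XMLReader reader = spf.newSAXParser().getXMLReader();
            reader.setContentHandler(vh);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return types;
    }

    public static void main(String[] args) {
        String xml = "<order><quantity>2</quantity><price>9.99</price></order>";
        for (String entry : elementTypes(xml)) {
            System.out.println(entry);
        }
    }
}
```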
Ensure Data Security
Validating an XML document against an untrusted schema could have
serious consequences, as validation may modify the actual data by
adding default attributes and possibly corrupting the data.
Validation against an untrusted schema may also mean that an incoming
instance document might not conform to your business's constraints or
rules.
With the new Validation APIs, getting a Schema
instance is the first step before being able to validate an instance
document, and it is the application that determines how to create the
Schema instance. Validation using the Schema
instance makes sure that an incoming instance document is not
validated against any other (untrusted) schema(s) but only against
the schema(s) from which the instance is created. If the instance XML
document has elements or attributes that refer to schema(s) from a
different targetNamespace and are not part of
javax.xml.validation.Schema representation, an error
will be thrown. This approach protects you from accidental mistakes
and malicious documents.
Reusing a Parser Instance
Is it possible to use the same parser instance to parse multiple
XML documents? This was not clear, and the behavior was
implementation dependent. JAXP 1.3 has added the new function
reset() on SAXParser,
DocumentBuilder, and Transformer. This
guarantees that the same instance can be reused. The
reset function improves the overall performance by
saving resources, time associated with creating memory instances,
and garbage collection time. Let's see how the reset()
function can be used.
SAXParserFactory spf = SAXParserFactory.newInstance();
// W3C XML Schema validation needs a namespace-aware parser.
spf.setNamespaceAware(true);
spf.setSchema(schema);
SAXParser saxParser = spf.newSAXParser();
for (int i = 0; i < n; i++) {
    saxParser.parse(new File(args[i]), myHandler);
    saxParser.reset();
}
The same function has also been added to newly designed
javax.xml.validation.Validator, as well as to
javax.xml.xpath.XPath. Applications are encouraged to
reuse the parser, transformer,
validator and XPath instance by calling
reset() when processing multiple XML documents. Note
that reset() sets the instance back to factory
settings.
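The same pattern applies to a Validator. The following minimal sketch, with a hypothetical class name ValidatorReuse and an inline schema, reuses one Validator for several documents and calls reset() between them. Because reset() returns the Validator to the state it had when Schema.newValidator() created it, anything set afterward, such as an ErrorHandler, would need to be set again.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

public class ValidatorReuse {

    // Illustrative schema: a single <note> element holding a string.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='note' type='xs:string'/></xs:schema>";

    /** Validates all documents with a single Validator instance; returns the valid count. */
    public static int countValid(List<String> documents) {
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
            Validator validator = schema.newValidator(); // created once
            int valid = 0;
            for (String doc : documents) {
                try {
                    validator.validate(new StreamSource(new StringReader(doc)));
                    valid++;
                } catch (SAXException e) {
                    // Invalid or malformed document: not counted.
                }
                validator.reset(); // ready for the next document
            }
            return valid;
        } catch (SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("<note>a</note>", "<bad/>", "<note>c</note>");
        System.out.println("valid documents: " + countValid(docs));
    }
}
```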
Accessing XML is made simple using XPath: A single XPath
expression can be used to replace many lines of DOM API
code. JAXP 1.3 has defined XPath
APIs that conform to the XPath 1.0 specification and provide
object-model-neutral APIs for the evaluation of XPath expressions and
access to the evaluation environment. Though current APIs conform to
XPath 1.0, the APIs have been designed with future XPath 2.0 support
in mind.
To use JAXP 1.3 XPath APIs, the first step is to get the instance
of XPathFactory. Though the default model is W3C DOM,
it can be changed by specifying the object model URI:
// Default object model (W3C DOM):
XPathFactory factory = XPathFactory.newInstance();
// Or, with an explicit object model URI:
XPathFactory factory = XPathFactory.newInstance(<OBJECT MODEL URI>);
Evaluate the XPath Expression
XPathFactory is used to create XPath
objects. The XPath interface provides access to the
XPath evaluation environment and expressions. XPath has
overloaded the evaluate() function, which can return
the result by evaluating an XPath expression based on the return type
set by the application. For example, look at the following XML document:
<Books>
<Book>
<Author> Author1 </Author>
<Name> Name1 </Name>
<ISBN> ISBN1 </ISBN>
</Book>
<Book>
<Author> Author2 </Author>
<Name> Name2 </Name>
<ISBN> ISBN2 </ISBN>
</Book>
</Books>
Following is the working code to evaluate the XPath
expression and print the names of all the Book
elements in the XML document:
XPath xpath = XPathFactory.newInstance().newXPath();
String expression = "/Books/Book/Name/text()";
NodeList nameNodes = (NodeList) xpath.evaluate(expression, new
InputSource("Books.xml"), XPathConstants.NODESET);
// Print the names of all the books.
for (int i = 0; i < nameNodes.getLength(); i++) {
    System.out.println("Book name " + (i+1) + " is " +
        nameNodes.item(i).getNodeValue());
}
Evaluate With
Context Specified
XPath is also capable of evaluating an expression
based on the context set by the application. The following example
sets the Document node as the context for evaluation:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document d = db.parse(new File("Books.xml"));
XPath xpath = XPathFactory.newInstance().newXPath();
String exp = "/Books/Book";
NodeList books = (NodeList) xpath.evaluate(exp, d, XPathConstants.NODESET);
With a reference to a Book element, a relative
XPath expression can now be written to select the Name
element as follows:
String expression = "Name";
Node name = (Node) xpath.evaluate(expression, books.item(0), XPathConstants.NODE);
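Putting the two steps together, the following self-contained sketch first selects the Book elements from a parsed Document and then evaluates a relative expression against each one. The class name XPathContextDemo is hypothetical, and the document is held in a string (without the whitespace around the values) rather than read from Books.xml.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathContextDemo {

    static final String BOOKS =
        "<Books>"
        + "<Book><Author>Author1</Author><Name>Name1</Name><ISBN>ISBN1</ISBN></Book>"
        + "<Book><Author>Author2</Author><Name>Name2</Name><ISBN>ISBN2</ISBN></Book>"
        + "</Books>";

    /** Returns the Name of every Book, using each Book node as the evaluation context. */
    public static List<String> bookNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Absolute expression, evaluated with the Document as context.
            NodeList books = (NodeList) xpath.evaluate("/Books/Book", doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<String>();
            for (int i = 0; i < books.getLength(); i++) {
                // Relative expression, evaluated with one Book element as context.
                names.add(xpath.evaluate("Name", books.item(i)));
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(bookNames(BOOKS));
    }
}
```

The two-argument evaluate() call returns the string value of the selected node, which is convenient when only the text is needed.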
NamespaceContext
XPath Evaluation
What happens if the XML document is namespace aware? Look at the
following XML document, in which the first Book element
is in the publisher1 domain and the second in the
publisher2 domain:
<Books >
<Book xmlns="www.publisher1.com">
<Author>Author1</Author>
<Name>Name1</Name>
<ISBN>ISBN1</ISBN>
</Book>
<Book xmlns="www.publisher2.com">
<Author>Author2</Author>
<Name>Name2</Name>
<ISBN>ISBN2</ISBN>
<Cover>Hard</Cover>
</Book>
</Books>
In this case, the XPath expression
/Books/Book/Name/text() won't give any
result because the expression is not fully qualified. You can use an
expression such as /Books/p1:Book/p1:Name with a
p1 prefix. However, you should set
NamespaceContext on the XPath instance so
that the p1 prefix can be resolved. In the following
sample, the NamespaceContext capable of resolving
p1 is set on the XPath instance. Note that
the two Book elements are in different namespaces, so
the expression would result in only one node.
XPath xpath = XPathFactory.newInstance().newXPath();
String exp = "/Books/p1:Book/p1:Name";
xpath.setNamespaceContext(new MyNamespaceContext());
InputSource is = new InputSource("Books.xml");
NodeList nn = (NodeList) xpath.evaluate(exp, is, XPathConstants.NODESET);
// Print the count.
System.out.println("Node count = " + nn.getLength());
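MyNamespaceContext is not shown in this article. A minimal sketch of such a class, assuming it only needs to map the p1 prefix to www.publisher1.com, could look like the following. The names NamespaceXPathDemo and SimpleNamespaceContext are hypothetical, and the sketch parses the document with a namespace-aware DocumentBuilder instead of passing an InputSource.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NamespaceXPathDemo {

    static final String BOOKS =
        "<Books>"
        + "<Book xmlns='www.publisher1.com'><Author>Author1</Author>"
        + "<Name>Name1</Name><ISBN>ISBN1</ISBN></Book>"
        + "<Book xmlns='www.publisher2.com'><Author>Author2</Author>"
        + "<Name>Name2</Name><ISBN>ISBN2</ISBN><Cover>Hard</Cover></Book>"
        + "</Books>";

    /** Hypothetical context that resolves only the p1 prefix. */
    static class SimpleNamespaceContext implements NamespaceContext {
        public String getNamespaceURI(String prefix) {
            if (prefix == null) {
                throw new IllegalArgumentException("prefix must not be null");
            }
            return "p1".equals(prefix) ? "www.publisher1.com" : XMLConstants.NULL_NS_URI;
        }
        public String getPrefix(String uri) {
            return "www.publisher1.com".equals(uri) ? "p1" : null;
        }
        public Iterator<String> getPrefixes(String uri) {
            String prefix = getPrefix(uri);
            return prefix == null
                    ? Collections.<String>emptyIterator()
                    : Collections.singletonList(prefix).iterator();
        }
    }

    /** Counts the nodes selected by the expression in the sample document. */
    public static int countNodes(String expression) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for prefixed name tests to match
            Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(BOOKS)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new SimpleNamespaceContext());
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("qualified:   " + countNodes("/Books/p1:Book/p1:Name"));
        System.out.println("unqualified: " + countNodes("/Books/Book/Name"));
    }
}
```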
XPathVariableResolver
The XPath specification allows variables to be used in the XPath
expressions. XPathVariableResolver is defined to provide
access to the set of user-defined XPath variables. Here is an example
of an XPath expression using a variable:
String exp = "/Books/j:Book[j:Name=$bookName]";
xpath.setXPathVariableResolver(new SimpleXPathVariableResolver());
InputSource is = new InputSource("Books.xml");
Node n = (Node) xpath.evaluate(exp, is, XPathConstants.NODE);
System.out.println("Node name is " + n.getNodeName());
A SimpleXPathVariableResolver can implement the
resolveVariable() function as follows. (Note: For clarity,
only the relevant code is shown here.)
public Object resolveVariable(javax.xml.namespace.QName qName) {
    if (qName.getLocalPart().equals("bookName"))
        return "Name1";
    ....
}
}
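For a complete, runnable variant of this idea, the sketch below resolves the $bookName variable from a method argument and returns the ISBN of the matching book. It is an illustration only: the class name VariableXPathDemo is hypothetical, and it uses a document without namespaces so that no NamespaceContext is needed.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import javax.xml.xpath.XPathVariableResolver;
import org.xml.sax.InputSource;

public class VariableXPathDemo {

    static final String BOOKS =
        "<Books>"
        + "<Book><Author>Author1</Author><Name>Name1</Name><ISBN>ISBN1</ISBN></Book>"
        + "<Book><Author>Author2</Author><Name>Name2</Name><ISBN>ISBN2</ISBN></Book>"
        + "</Books>";

    /** Returns the ISBN of the book whose Name equals the given value. */
    public static String isbnFor(final String bookName) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // The resolver must be registered before the expression is evaluated.
        xpath.setXPathVariableResolver(new XPathVariableResolver() {
            public Object resolveVariable(QName name) {
                // Only $bookName is known; null signals an unresolved variable.
                return "bookName".equals(name.getLocalPart()) ? bookName : null;
            }
        });
        try {
            return xpath.evaluate("/Books/Book[Name=$bookName]/ISBN",
                    new InputSource(new StringReader(BOOKS)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("ISBN for Name2: " + isbnFor("Name2"));
    }
}
```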
JAXP 1.3 has introduced new data types in the Java platform,
in the javax.xml.datatype package, that directly map to some
of the XML schema data types, thus bringing XML schema data type
support directly into the Java platform.
The DatatypeFactory has functions to create
different types of data types -- for example, xs:date,
xs:dateTime, xs:duration, and so on. The
javax.xml.datatype.XMLGregorianCalendar takes care of
many W3C XML Schema 1.0 date and time data types, specifically,
dateTime, time, date,
gYearMonth, gMonthDay, gYear,
gMonth, and gDay defined in this XML
namespace:
http://www.w3.org/2001/XMLSchema
These data types are normatively defined in W3C XML Schema 1.0, Part 2, Section
3.2.7-14.
The data type javax.xml.datatype.Duration is an
immutable representation of a time span as defined in the W3C XML
Schema 1.0 specification. A Duration object represents
a period of Gregorian time, which consists of six fields (years,
months, days, hours, minutes, and seconds) as well as a sign field
(+ or -).
Table 1 shows the mapping of XML schema data types to Java
platform data types. Table 2 shows the mapping of XPath data types
and Java Platform data types.
Table 2. XPath and Java Platform Data Type Mapping
XPath Data Type | Java Platform Data Type
xdt:dayTimeDuration | Duration
xdt:yearMonthDuration | Duration
These data types have a rich set of functions introduced to
perform basic operations over data types, for example, addition,
subtraction, and multiplication.
Also, there are ways to get the
lexicalRepresentation of a particular data type that is
defined at XML
Schema 1.0, Part 2, Section 3.2.[7-14].1, Lexical
Representation. There is no need to understand the complexities
of XML schema data types such as what types of operations are
allowed on a data type, how to write a lexical representation, and
so on. The javax.xml.datatype APIs have defined a rich
set of functions to make it easy for you.
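As a small illustration of these functions, the sketch below creates an XMLGregorianCalendar and a Duration from their lexical representations, adds one to the other, and converts the result back to its lexical form. The class name DatatypeDemo and the sample values are hypothetical.

```java
import javax.xml.datatype.DatatypeConfigurationException;
import javax.xml.datatype.DatatypeFactory;
import javax.xml.datatype.Duration;
import javax.xml.datatype.XMLGregorianCalendar;

public class DatatypeDemo {

    /** Adds an xs:duration to a date or dateTime value; returns the lexical result. */
    public static String addDuration(String lexicalDate, String lexicalDuration) {
        try {
            DatatypeFactory factory = DatatypeFactory.newInstance();
            XMLGregorianCalendar calendar = factory.newXMLGregorianCalendar(lexicalDate);
            Duration duration = factory.newDuration(lexicalDuration);
            // XMLGregorianCalendar is mutable: add() changes the calendar in place.
            calendar.add(duration);
            return calendar.toXMLFormat();
        } catch (DatatypeConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // One month and ten days after 15 January 2005.
        System.out.println(addDuration("2005-01-15", "P1M10D"));
    }
}
```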
JAXP 1.3 has also defined the support for XInclude.
SAXParserFactory/DocumentBuilderFactory
should be configured to make it XInclude aware. Do this by calling
setXIncludeAware(true) on the factory.
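A minimal sketch of an XInclude-aware DOM parse follows. It writes two small files to a temporary directory so that the xi:include reference can be resolved relative to the including document. The class name XIncludeDemo, the file names, and the document content are hypothetical. Note that the factory is also made namespace aware, because the xi:include element is recognized by its namespace.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class XIncludeDemo {

    /** Parses a document that includes chapter.xml; returns the number of chapter elements. */
    public static int countIncludedChapters() {
        try {
            Path dir = Files.createTempDirectory("xinclude-demo");
            Path chapter = dir.resolve("chapter.xml");
            Path book = dir.resolve("book.xml");
            // Registered for cleanup when the JVM exits (directory first, files after).
            dir.toFile().deleteOnExit();
            chapter.toFile().deleteOnExit();
            book.toFile().deleteOnExit();

            Files.write(chapter,
                    "<chapter>Included text</chapter>".getBytes(StandardCharsets.UTF_8));
            Files.write(book,
                    ("<book xmlns:xi='http://www.w3.org/2001/XInclude'>"
                    + "<xi:include href='chapter.xml'/></book>").getBytes(StandardCharsets.UTF_8));

            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            dbf.setXIncludeAware(true);
            // Parsing from a File gives the parser a base URI for the relative href.
            Document doc = dbf.newDocumentBuilder().parse(book.toFile());
            return doc.getElementsByTagName("chapter").getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("chapter elements after inclusion: " + countIncludedChapters());
    }
}
```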
JAXP 1.3 has defined a security feature:
http://javax.xml.XMLConstants/feature/secure-processing
When set to true, this operates the parser in a secure
manner and instructs the implementation to process XML securely and
avoid conditions such as denial-of-service attacks. Examples include
restricting the number of entities that can be expanded, the number
of attributes an element can have, and the XML schema constructs
that would consume large amounts of resources, such as large values
for minOccurs and maxOccurs. If XML
processing is limited for security reasons, it will be reported by a
call to the registered ErrorHandler.fatalError().
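The feature can be set through the constant XMLConstants.FEATURE_SECURE_PROCESSING, which holds the URI shown above. The following is a minimal sketch for a DocumentBuilderFactory. The class name SecureParserConfig and its method names are hypothetical, and the exact limits that the feature enforces are left to the implementation.

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

public class SecureParserConfig {

    /** Creates a namespace-aware factory with secure processing switched on. */
    public static DocumentBuilderFactory newSecureDomFactory() {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        try {
            dbf.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        } catch (ParserConfigurationException e) {
            // The implementation could not honor the request.
            throw new IllegalStateException(e);
        }
        dbf.setNamespaceAware(true);
        return dbf;
    }

    /** Reports whether the factory has the secure processing feature enabled. */
    public static boolean isSecureProcessingEnabled(DocumentBuilderFactory dbf) {
        try {
            return dbf.getFeature(XMLConstants.FEATURE_SECURE_PROCESSING);
        } catch (ParserConfigurationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        DocumentBuilderFactory dbf = newSecureDomFactory();
        System.out.println("secure processing enabled: " + isSecureProcessingEnabled(dbf));
    }
}
```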
Summary
This article has introduced you to some of the new features in
JAXP 1.3. You have seen the benefits of the Schema Validation
Framework and seen how it can be used to improve the performance of
schema validation. Developers working with applications using JAXP
1.2 schema properties to validate XML documents against schemas
should upgrade to JAXP 1.3 and use this framework. Remember to reuse
the parser instance by calling the reset() method to
improve performance.
New object-model-neutral XPath APIs bring XPath support and can
work with different object models. XML schema data type support is
brought directly into the Java platform with the introduction of new
data types. Security features introduced in JAXP 1.3 can help
protect the application from denial-of-service attacks. Also, JAXP
1.3 provides complete support for the latest standards: XML 1.1, DOM
L3, XInclude, and SAX 2.0.2. These are enough reasons to upgrade to
JAXP 1.3, and the implementation is available for downloading
from java.net.
For More Information
W3C XML Schema: Datatypes
RELAX NG home page
XPath APIs
XML 1.1 specification
DOM L3 specification
XInclude specification
SAX 2.0.2 home page
Neeraj Bajaj is a member of the technical staff in the Web Technology
and Standards group at Sun Microsystems. He has been working in the
area of core XML processing-related technologies for more than four
years. He is the architect of the Sun Java Streaming XML Parser and
the co-specification lead of JAXP 1.4. He has contributed to the development
of Apache's open-source Xerces2-J project and to the implementation of
JSR 60 (JAXP 1.2), JSR 206 (JAXP 1.3), JSR 173 (StAX), and JAXP 1.4.