Neither Java nor XML technology needs an introduction, nor does the synergy between the two: "Portable Code and Portable Data." With the growing interest in web services and e-business platforms, XML is joining Java in the developer's toolbox. As of today, no fewer than six extensions to the Java Platform empower the developer when building XML-based applications.
The first of the three articles in this series gave an overview of the different APIs available to the developer by presenting some sample programs. The differences in performance were addressed in the second article. This third article gives tips on improving the performance of XML-based applications from a programmatic and architectural point of view.

XML processing is very intensive in CPU, memory, and I/O or network usage. XML documents are text documents that need to be parsed before any meaningful application processing can be performed. The parsing of an XML document may result either in a stream of events if the SAX API is used, or in an in-memory document model if the DOM API is used. During parsing, a validating parser may additionally check the validity of the document against a predefined schema (a Document Type Definition or an XML Schema).

Processing an XML document means recognizing, extracting, and directly processing the element contents and attribute values, or mapping them to other business objects that are processed further on. Before an application can apply any business logic, several processing steps must take place.
Parsing XML documents implies a lot of character encoding and decoding and string processing. Then, depending on the chosen API, recognition and extraction of content may correspond to walking through a tree data structure, or catching the events generated by the parser and processing them according to some context. If an application uses XSLT to preprocess an XML document, even more processing is added before the real business logic work can take place.

Using the DOM API implies the creation in memory of a representation of the document as a DOM tree. If the document is large, so are the DOM tree and the memory consumption.

The physical structure and the logical structure of an XML document may be different. An XML document may contain references to external entities which are substituted in the document content while parsing and prior to validating. Those external entities and the schema itself (such as a DTD) may be located on remote systems, especially if the document itself originates from another system. In order to proceed with the parsing and the validation, the external entities must first be loaded (downloaded). Documents with a complex physical structure may therefore be very I/O or network intensive.

In this article, we will give some tips for improving performance when processing XML documents, articulated around reducing CPU, memory, and I/O or network consumption.

Using the Most Appropriate API: Choosing Between SAX and DOM

Both DOM and SAX have features that make them more suitable for certain tasks than others:

Table 1: SAX and DOM features
Leaving aside the impact of memory consumption on overall system performance, processing using the DOM API is usually slower than processing using the SAX API, mainly because the DOM API may have to load the whole document into memory first in order to allow it to be edited or data to be easily retrieved, while the SAX API allows immediate processing as the document is being parsed. Therefore, DOM should be used when the source document is to be edited or processed multiple times. SAX is very convenient when you want to extract information from an XML document (an element content or an attribute value) regardless of its overall context (that is, its position in the XML document tree), or when the document structure maps exactly to the business object structure. Otherwise, keeping track of the element nesting may be very tedious and one may be better off using DOM. Nevertheless, when the source document is to be mapped to a business object which is not primarily represented as a DOM tree, it's recommended to use SAX to map directly to the business object, avoiding an intermediate resource-consuming representation. Of course, if the business object has a direct representation in Java, technologies like XML Data Binding (JAXB) can be used.
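As an illustration of mapping SAX events directly to business objects, here is a minimal sketch. The `Piece` class and the flat `<PIECE name=... column=... row=.../>` document layout are assumptions made for this example only; they are not the format used by the benchmark documents of this series.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class PieceMapper {
    /** Hypothetical business object: one chess piece and its position. */
    static final class Piece {
        final String name;
        final String column;
        final int row;

        Piece(String name, String column, int row) {
            this.name = name;
            this.column = column;
            this.row = row;
        }
    }

    /** Builds Piece objects straight from SAX events; no DOM tree is created. */
    static final class PieceHandler extends DefaultHandler {
        final List<Piece> pieces = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Assumed layout: every PIECE element carries its data as attributes.
            if ("PIECE".equals(qName)) {
                pieces.add(new Piece(
                        atts.getValue("name"),
                        atts.getValue("column"),
                        Integer.parseInt(atts.getValue("row"))));
            }
        }
    }

    /** Parses the document and returns the business objects it describes. */
    static List<Piece> parse(String xml) throws Exception {
        PieceHandler handler = new PieceHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.pieces;
    }
}
```

Because each element maps to one object independently of its position in the tree, the handler needs no nesting state, which is the situation where SAX is most convenient.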
Since high-level technologies like XSLT rely on lower-level technologies like SAX and DOM, their performance may be affected by the way they use SAX or DOM. JAXP provides support for XSLT engine implementations that accept source input and result output in the form of SAX events, which is useful when building complex XML processing pipelines.

Considering Alternative APIs

JDOM is not a wrapper around DOM, although it shares the same purpose as DOM with regard to XML. It has been made generic enough to address any document model. JDOM has been optimized for Java and, through its use of the Java Collection API, it has been made straightforward for the Java developer. JDOM documents can be built directly from, and converted to, SAX events and DOM trees, allowing JDOM to be seamlessly integrated in XML processing pipelines, and in particular to serve as the source or result of XSLT transformations.
dom4j is another alternative API, very similar to JDOM. It additionally comes with tight XPath integration.
If a document model fits the core data structure of an application, JDOM and dom4j should be seriously considered. Additionally, as opposed to DOM [1], JDOM or dom4j documents are serializable, which gives even more options when architecting complex inter-communicating applications. Using alternative APIs like JDOM and dom4j, a developer may avoid some performance pitfalls, like the one described in the second article when accessing elements by their tag names, since the API is more straightforward thanks to its support of the Java Collection API. Since these APIs are lightweight and optimized for Java, you may often expect a noticeable gain in performance.

Be Aware of the Differences in the Implementations

As we highlighted in the second part of this series, implementations differ. Some emphasize functionality, others performance. The pluggability feature of JAXP allows the developer to swap between implementations and select the most appropriate one to meet the application requirements. As an example, when using DOM, a common complaint is the lack of support in the API itself for serialization (that is, transformation of a DOM tree to an XML document). Therefore, it's tempting to step out of the standard API and call implementation-dependent serialization features at the cost of losing JAXP's pluggability benefits. Below are code samples for serializing a DOM tree to an XML stream with both Xerces and Crimson.

Code Sample 1: Serialization with Xerces relies on a separate API which is packaged along with the DOM implementation
Code Sample 2: Serialization with Crimson relies on methods specific to the DOM implementation
JAXP addresses the serialization of a DOM tree through the use of the identity transformer, as presented in the example below. The identity transformer just copies the source tree to the result tree and applies the specified output method. To output XML, only the output method needs to be set.

Code Sample 3: Implementation-independent serialization with the identity transformer (no argument passed to the factory method TransformerFactory.newTransformer)
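The following is a minimal sketch of this technique, not the article's original sample. It relies only on standard JAXP calls; the class name `DomSerializer` and the `parse` helper are assumptions introduced so that the example is self-contained.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class DomSerializer {
    /** Helper for the example: builds a DOM tree from an XML string. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Serializes a DOM tree without calling any parser-specific API. */
    static String serialize(Document doc) throws Exception {
        // newTransformer() without a style sheet returns the identity transformer.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.METHOD, "xml");
        // Omitted here only to keep the example output short.
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

Since nothing in this code names Xerces or Crimson, the underlying parser and transformer can be swapped through the usual JAXP lookup mechanism.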
JAXP, with its support by many parsers and style sheet engines, is a strong asset for your application. It's worth capitalizing on so that later on, the underlying parser implementations can be swapped easily without requiring any application code changes.

Tuning the Underlying Implementations
The JAXP API defines methods to set and get features and properties in order to configure the underlying implementations. Apart from the standard properties and features, an implementation may support specific ones of its own.

Setting specific features and properties should be done with care to preserve the interchangeability of the underlying implementation. When a feature or a property is not supported or not recognized by the underlying implementation, an exception is thrown.

Reusing and Pooling Parsers

An XML application may have to process different types of documents (such as documents conforming to different schemas), and these documents can be accessed from different sources. A single parser may be used (per thread of execution) to handle documents of different types successively just by reassigning the handlers according to the source documents to be processed. Since they are complex objects, parsers may be pooled so that they can be reused by other threads of execution, reducing the burden on memory allocation and garbage collection. Additionally, if the number of different document types is large and if the handlers are expensive to create, handlers may be pooled as well. The same considerations apply to style sheets and transformers.

Partial Parsing with SAX

If you can use SAX, and the information you want to extract from the document is located at the beginning or at least not at the very end, you may get better performance by interrupting the parsing as soon as all the information has been extracted. You can achieve this by throwing a SAX exception. This may be especially useful when a document is wrapped inside another document (the envelope) and you need to get some information, like the recipient, to be able to route it. You may only want to extract information from the envelope without parsing the contained document, which may be much bigger.

Code Sample 4: When the first occurrence of the targeted element (variable target) has been extracted, an EndOfProcessingException is thrown to stop the parsing
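A minimal sketch of this interruption technique follows. It is not the benchmark program itself: the class `FirstElementFinder`, its handler, and the envelope layout used in the usage note are assumptions for illustration. Only the exception name is taken from the caption above.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class FirstElementFinder {
    /** Signals that the wanted data has been extracted; it is not an error. */
    static final class EndOfProcessingException extends SAXException {
        EndOfProcessingException(String message) {
            super(message);
        }
    }

    /** Collects the text of the first element named target, then aborts the parse. */
    static final class TargetHandler extends DefaultHandler {
        private final String target;
        private final StringBuilder text = new StringBuilder();
        private boolean inTarget;
        String result; // stays null if the element never occurs

        TargetHandler(String target) {
            this.target = target;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (target.equals(qName)) {
                inTarget = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inTarget) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (inTarget && target.equals(qName)) {
                result = text.toString();
                throw new EndOfProcessingException("found " + target);
            }
        }
    }

    /** Returns the content of the first target element, or null if there is none. */
    static String findFirst(String xml, String target) throws Exception {
        TargetHandler handler = new TargetHandler(target);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // Swallow the exception only when it marks our deliberate stop.
            if (handler.result == null) {
                throw e;
            }
        }
        return handler.result;
    }
}
```

For example, `findFirst("<ENVELOPE><TO>bob</TO><BODY>...</BODY></ENVELOPE>", "TO")` stops at the closing `TO` tag and never delivers events for the body.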
When run against the XML source documents used for the benchmark to search for the first occurrence of the KING element, such a test program executes in the same time regardless of the document size.

Reducing Validation Cost

Validation is important and may be required to guarantee the reliability of an XML application. An application may legitimately rely on validation by the parser to avoid double-checking the validity of element nesting and attribute values. A valid XML document may still be invalid in the application domain. The capabilities of Document Type Definitions are limited. For example, in the Chessboard application domain, nothing could prevent two pieces from having the same row and column attribute values. XML validation doesn't relieve the application of validating other constraints that the schema does not cover and that may be violated without invalidating a document. Not relying on XML validation may put more burden on the application; on the other hand, validation affects performance. In the following discussion, we mainly refer to DTDs, but the principles discussed can be extended to other XML schema languages.
Code Sample 5: A valid invalid document. XML valid, but application domain invalid: two pawns are at the same position; XML validation doesn't discharge the application from enforcing some domain specific constraints
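A domain-specific check of this kind has to live in application code. The sketch below assumes, purely for illustration, that each piece's position has already been extracted as a string such as "a2"; the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class BoardRules {
    /**
     * Returns true when no square is occupied by more than one piece.
     * Positions are encoded as column plus row, e.g. "a2" (assumed encoding).
     */
    static boolean positionsAreUnique(List<String> positions) {
        Set<String> seen = new HashSet<>();
        for (String position : positions) {
            // add() returns false when the square was already recorded.
            if (!seen.add(position)) {
                return false;
            }
        }
        return true;
    }
}
```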
In a system [2] with components exchanging documents, the cost of validation can be efficiently reduced by taking into account the following observations (see Figure 1):
For example, a multitier e-business application exchanging documents with trading partners through a front-end will enforce validity of any incoming document at the web tier (front-end). It will not only check the validity of the document against its schema, but also ensure that the document type is one (or the one) it can accept. The documents may then be rerouted to other servers to be handled by the proper services. Since the documents have already been validated, they do not require further validation. In other words, when you own both the producer and the consumer of XML documents, you may use validation only for debugging and turn it off in production.
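In JAXP, this choice comes down to one factory setting. The sketch below shows one possible way to make it depend on whether the document source is trusted; the class name and the `trustedSource` parameter are assumptions for illustration.

```java
import javax.xml.parsers.SAXParserFactory;

class TrustAwareParserFactory {
    /**
     * Returns a factory whose parsers validate only documents coming from
     * outside the system boundary (for example, at the front-end).
     */
    static SAXParserFactory newFactory(boolean trustedSource) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Validation is skipped for documents already checked at the boundary.
        factory.setValidating(!trustedSource);
        return factory;
    }
}
```

The same flag could be driven by a configuration setting so that internal components validate while debugging and skip validation in production.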
Figure 1: Validation is required when the source cannot be trusted. Once in the system, validation may be considered optional.

Still, even without validating, the DTDs and entities referenced in the documents need to be loaded and parsed, allowing entities to be substituted and attribute values to be normalized or replaced by their proper default values. At the extreme, documents without DTDs neither require nor support validation. Since they don't refer to any DTD or external entity, none is loaded or parsed and no validation can be done. Performance is therefore better.

This extreme solution, while not viable as such for exchanges between XML applications, can be used between the components of an XML application. In this particular case, the document type declaration may be inserted during debugging to enable validation, and omitted in production.

Still, a document conforming to a DTD can, after an optional validation, be converted to an equivalent document which will not require validation or external entity substitution by using the XML canonicalization process. This process, which is described below, was not originally intended to improve performance, but one may benefit from it in certain situations and with certain limitations.

Any document can be converted into an equivalent (with some limitations) DTD-less document through a process named XML Canonicalization. The generated document is called a Canonical XML document.

"Any XML document is part of a set of XML documents that are logically equivalent within an application context, but which vary in physical representation based on syntactic changes permitted by XML 1.0 and Namespaces in XML. This specification describes a method for generating a physical representation, the canonical form, of an XML document that accounts for the permissible changes. Except for limitations regarding a few unusual cases, if two documents have the same canonical form, then the two documents are logically equivalent within the given application context. Note that two documents may have differing canonical forms yet still be equivalent in a given context based on application-specific equivalence rules for which no generalized XML specification could account." - Extract from Canonical XML Version 1.0 - W3C Recommendation 15 March 2001

The XML canonicalization process results in some changes from the original document, among others:
Although canonicalization is not primarily meant for this purpose, we can use it to improve performance. The front-end of the e-business application from our previous example could be improved to validate the incoming documents and generate a canonical form of the documents that are routed to the proper backend services. The backend services are able to parse the document much faster since no validation is required and the document doesn't refer to any external entity. Generated canonical documents, while having the same logical structure, don't share the same physical structure as the original documents. The application may therefore require that the original version be archived early on in the processing pipeline.
Unfortunately, so far, there is no standard XSL output method which could be used with the identity transformer (presented above) to generate a canonical form of a source document. To generate the canonical form of a document, you may have to write custom SAX2-based code.
The code sample below shows the canonical form of an XML document. Note the absence of the XML and DTD declarations and the replacement of every line break by a character reference.

Code Sample 6: The canonical form of one of the Chessboards-[10-5000].xml documents (line breaks have been reintroduced after some of the line-break character references for readability purposes)
The chart below shows the relative time to process an XML document in its original form and in its canonicalized form. Depending on the complexity of the schema and the number of referred external entities, the difference in performance can be even bigger.
Figure 2: Time to process an XML document (containing 1 chessboard configuration/processed 1000 times) and its canonicalized form with SAX, using Xerces without validation (JDK 1.2.2_06)

Like validation, canonicalization may be switched on in production only, when looking for the best performance. Canonicalization can only be used if the canonical XML documents and the original documents are equivalent. If the application relies, for example, on comments or any other lexical events generated by the parser, canonicalization cannot be used. Any variant of this process can be applied as long as both forms of the XML document -- the original one and the refined one -- are equivalent for the application.
Figure 3: Since the document (post-validation) is canonical, it does not include a Document Type Declaration and does not refer to any external entities

Reducing the Cost of Referencing External Entities

External entities, including external DTD subsets, must be loaded and parsed even when not validating, so that the same information is delivered to the application regardless of validation. Standalone documents don't reference any external entity but may still use internal DTD subsets. Therefore, by avoiding loading any external entity, performance may be increased, especially compared to the cases where the DTD or the other external entities reside on a non-local repository. Nevertheless, standalone documents may not be the solution of choice, especially in the case of e-business document exchanges which rely on public XML schemas being published on a common registry or repository.

Code Sample 7: A standalone XML document; the DTD has been embedded as an internal DTD subset. Performance may be improved, especially compared to a situation where the DTD is an external DTD subset located on a remote repository.
Caching External Entities

Caching Using a Proxy Cache

Access to external entities located on a remote repository may be sped up by setting up a proxy that caches any document retrieved, and especially external entities -- provided the references to the external entities are URLs whose protocols are handled by the proxy.

Figure 4: Caching architecture. Entities still have to be resolved.

Caching With a Custom EntityResolver

SAX parsers allow XML applications to handle external entities in a customized way. Such applications have to register their own implementation of the EntityResolver interface. This feature can be used to implement two mechanisms.
Both mechanisms can be used jointly to ensure even better performance. The first one may be used for static entities which have a lifetime greater than the application's. This is especially the case for public DTDs, which usually evolve through successive versions and which include the version in their public or system identifier. The second mechanism may first map public identifiers into system identifiers and then apply the same techniques as a regular cache proxy when dealing with system identifiers in the form of URLs, especially checking for updates and avoiding caching dynamic content.

Code Sample 8: A simple cache implementation for external entities (this implementation is incomplete: it doesn't free unused entries in the entities hash map)

Code Sample 9: A sample program using the SAX API and implementing the entity resolver to look up entities in an in-memory cache.
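A possible shape for such a resolver is sketched below. It is not the article's original code: the class name, the `put` pre-loading method, and the choice to key the cache on the system identifier are assumptions. Like the sample described above, it never evicts entries and does not check for updates.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

class CachingEntityResolver implements EntityResolver {
    // Entity content keyed by system identifier; entries are never freed.
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();

    /** Pre-loads an entity, e.g. a versioned public DTD shipped with the application. */
    void put(String systemId, byte[] content) {
        cache.put(systemId, content.clone());
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws IOException {
        if (systemId == null) {
            return null; // let the parser fall back to its default resolution
        }
        byte[] content = cache.get(systemId);
        if (content == null) {
            content = download(systemId);
            cache.put(systemId, content);
        }
        InputSource source = new InputSource(new ByteArrayInputStream(content));
        source.setPublicId(publicId);
        source.setSystemId(systemId); // keeps relative references resolvable
        return source;
    }

    /** Fetches the entity once; assumes the system identifier is an absolute URL. */
    private static byte[] download(String systemId) throws IOException {
        try (InputStream in = URI.create(systemId).toURL().openStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            return out.toByteArray();
        }
    }
}
```

The resolver is registered with `XMLReader.setEntityResolver` (or `DocumentBuilder.setEntityResolver` for DOM) before parsing.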
The improvement in performance is quite significant, especially when external entities are located on the network:
Figure 5: Time to process an XML document (containing 1 chessboard configuration and referencing its DTD across a LAN) with SAX, using Xerces with validation (JDK 1.2.2_06)

When combining both suggested architectures for reducing the validation cost and reducing the cost of referencing external entities, the resulting architecture may look as follows:
Figure 6: An architecture to reduce the costs of validation and referencing external entities; caching only occurs on the front-end since the documents processed by the services don't refer to any external entities.

Caching Generated Content and Style Sheets

The eMobile sample end-to-end application demonstrated at the JavaOne Conference 2000 provided an example of how servlets can generate content targeted at different devices from value objects returned by an EJB application. Style sheets were applied to DOM trees built from the value objects in order to transform them to the targeted content type. There are two places where the performance of the web tier of the eMobile sample application has been improved:
Figure 7: The two places where performance was improved in the eMobile application

When all the generated content could not fit on the device (WML deck or HTML page), the result was divided among several decks or pages to allow the user to browse the overall result. The decks or pages were generated one at a time from the DOM tree upon the user's request. To avoid invoking the EJB application again, the DOM tree was cached in the user's session. When loaded, the style sheets were also cached to avoid reloading them for every content generation.

Caching the result of a user request to serve subsequent related requests more quickly consumes memory. It must not be done to the detriment of the other users: the application must not fail because of a memory shortage due to the cached results. Soft references, introduced with Java 2, make it possible to implement caches that cooperate with the garbage collector.
In the context of a distributed web container, the reference to the DOM tree stored in the session may have to be declared as
The following code sample shows how a query and its result are cached in the client's associated session, and how the result of a previously executed query may be retrieved from the session. This code sample from the eMobile application has been updated to take into account distributed web containers.

Code Sample 10: Caching the query and its result in the client's associated session
Caching the style sheets relied on the same principle as caching the result DOM trees. When loaded, the style sheets were cached in a hashtable using soft references which were shared among all the servlets.
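A cache along these lines might be sketched as follows. This is not the eMobile code: the class name is hypothetical, and holding compiled style sheets as JAXP `Templates` objects is a choice made for this example.

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamSource;

class StylesheetCache {
    private final TransformerFactory factory = TransformerFactory.newInstance();
    // Soft references let the garbage collector reclaim entries under memory pressure.
    private final Map<String, SoftReference<Templates>> cache = new HashMap<>();

    /**
     * Returns the compiled style sheet for the given URL, compiling it only
     * when it is not cached or when its cached entry has been reclaimed.
     */
    synchronized Templates get(String systemId) throws TransformerConfigurationException {
        SoftReference<Templates> reference = cache.get(systemId);
        Templates templates = (reference == null) ? null : reference.get();
        if (templates == null) {
            templates = factory.newTemplates(new StreamSource(systemId));
            cache.put(systemId, new SoftReference<>(templates));
        }
        return templates;
    }
}
```

Each request then calls `get(...).newTransformer()` to obtain its own `Transformer`, since a `Templates` object may be shared between threads while a `Transformer` may not be used concurrently.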
Along this line, JAXP 1.1 defines the

Using Java 2 SE v 1.3 (and Higher)
XML processing is very CPU- and memory-intensive. For a server-side application, better performance is obtained by using the HotSpot server system, which can be activated by passing the appropriate command-line option.

Using XML With Parsimony

XML documents are text documents. Therefore they can easily be exchanged between heterogeneous systems. But they require a parsing phase that, as we mentioned earlier, is very expensive. It is the price to pay for allowing loosely-coupled systems to work -- loosely-coupled not only technically, but also enterprise-wise. When system components are tightly-coupled, "regular" non-document-oriented techniques (using RMI, for example) are far more efficient, not only in terms of performance but also in terms of coding complexity. With technologies like JAXB, the two worlds can be efficiently combined to develop systems that are internally tightly-coupled and object-oriented, and which interact with one another in a loosely-coupled, document-oriented way.
Figure 8: A mixed architecture: Loosely-coupled document-oriented on the outside, and tightly-coupled object-oriented on the inside

To illustrate this statement, let's compare the cost of serializing/deserializing to/from:
We designed Java classes that implement all the pieces (

The code fragment below shows the

Code Sample 11: Implementation of the Chessboards methods to serialize/deserialize to/from XML and DOM; the serialization/deserialization to/from a Java serialized object is simply enabled by implementing the Serializable interface
To measure performance, we implemented three structurally equivalent test programs. They first loaded the original XML document describing a set of chessboard configurations and, in a loop, wrote it into a file and read it back either as an XML document or as Java serialized objects. We ran these test programs on a set of 1000 chessboard configurations, which was processed 10 times for each of the 10 runs. The measured time was the sum of the user and system times.
Code Sample 12: The test program to serialize/deserialize to/from XML, through an intermediate DOM tree
Code Sample 13: The test program to serialize/deserialize to/from a serialized Java object
Code Sample 14: The test program to serialize/deserialize to/from a Java serialized DOM tree
The results show that not only is the direct Java serialization of the "business objects" faster than the XML serialization or the Java serialization of the DOM tree, but also that the resulting serialized object form is smaller than the serialized XML document or the Java serialized DOM tree form. The Java serialization of the DOM tree is the most expensive in processing time as well as in memory footprint; therefore it should be used with extreme care, especially in the context of Enterprise JavaBeans (EJB), where serialization occurs when accessing remote EJBs. When accessing local EJBs, DOM trees or DOM tree fragments can be passed along without incurring the same cost.
Figure 9: Average time to serialize/deserialize a set of 1000 chessboard configurations in its XML document form (through an intermediate DOM tree) , in its "Business Object" Java serialized form and in its DOM tree Java serialized form; Crimson's DOM implementation does not support Java serialization
Figure 10: Size of the serialized XML document, the Java serialized "business objects" form and the Java serialized DOM tree form

Applications which are internally document-oriented may be designed so that only the most relevant and most accessed information is extracted from the document to be processed and mapped to business objects. These business objects may keep a reference to the original document (in its original text form or in a cached DOM representation) so that more information can be queried when needed from the original document using XPath expressions or XQuery, for example.

Conclusion

In this article, we presented different performance improvement tips. The first question to ask when developing an XML-based application is "Should it be XML based?" If the answer is yes, then a sound and balanced architecture has to be designed, an architecture which only relies on XML for what it is good at: open inter-application communications, configuration descriptions, information sharing, or any domain for which a public XML schema may exist. It may not be the solution of choice for unexposed interfaces or for exchanges between components which should otherwise be tightly coupled. Whether XML processing is just a pre- or post-processing stage of the business logic, or whether it makes sense for the application to have its core data structure represented as documents, the developer will have to choose between the different APIs and implementations, considering not only their functionality and their ease of use, but also their performance. Ultimately, Java XML-based applications are developed in Java; therefore any Java performance improvement rule applies as well, especially those regarding string processing and object creation.

Resources
Java Technology & XML Part 1 -- An Introduction to APIs for XML Processing

[1] Depending on the implementation, a DOM tree may or may not be "Java" serializable: it's not a requirement from the specification.

[2] A system here is understood as any set of hardware and software that composes your solution and which defines a boundary within which any exchange between components is considered secure and reliable.

[3] As used on this web site, the terms Java virtual machine or Java VM mean a virtual machine for the Java platform.