Sun Java Solaris Communities My SDN Account Join SDN
 
Article

Redundancy Elimination

 

We'll describe the algorithm used here by refining a naive reflective "walk" of the object graph. 

An Algorithm for Persisting Beans as Text

[For illustrative purposes we will output text a in Java-like language though the same techniques apply to other formats such as XML].

The simplest way to do this is as follows (we omit the termination conditions which include a hash table containing the visited nodes):

writeObject(Object node) {
    nodeName = <generate a symbol>
    write(nodeName + " = new " + node.class + "()\n");
    for each property p in node.class.properties do {
        write(nodeName + "." + p.name + " = " + writeObject(p.value(node)) + "\n");
    }
    return name;
}

There are a number of special cases which need to be handled to make this work on all classes involved in the faithful storage of a Swing user interface. These include:

  1. Classes that do not have a nullary constructors, e.g. Color, Rectangle, Dimension, Integer, String, etc...
  2. Classes whose public constructor has parameters that refer to private state, e.g. EmptyBorder, JScrollPane$ScrollBar, etc...
  3. Classes that have no public constructors: Method, Class, array.
  4. Classes that do not expose all of their useful state as standard properties: Vector, Hashtable, Container, array.
The ability to be able fix these issues without being able to change the implementations of the classes involved is an essential part of any practical scheme that relies purely on public APIs. We have produced an API to allow users of this form of persistence to manipulate a property schema defined by a class.

Here are our solutions:

  1. Meta data specifies the parameters of a constructor as named properties (e.g. Color("red", "green", "blue")).
  2. A comprehensive defaults mechanism that obviates the need for these instantiations. (see: Elimination of Redundancy).
  3. Meta data describes the method by which these instances should be created.
  4. Meta data describes the method by which these instances should be modified.
We will not go into this further here other than to report that these anomalies exist and that we have produced some APIs to allow the manipulation of the table of meta data which is required to resolve them. To simplify the rest of this document we will describe the rest of this algorithm as if each object in the graph is a regular bean with a nullary constructor and an ordered list of public properties.

The biggest problem with this technique is the size of the files that are created. The algorithm walks deeply into the internal structures of most Swing components and generates large archives which are mostly copies of the default values of the objects they contain. The rest of this document explains the techniques we use to eliminate this default information and make the archives much smaller than their .ser equivalents.

Elimination of Redundancy

The algorithm outlined above has the problem that all properties will be recorded for a Bean even if its constructor initializes some of them by default. It is possible to ensure that this additional information, defining the default value of each property, is available but only by adding (a significant amount of) extra meta data for each class. This is fundamentally error prone since it leaves the default information two places: the default value from the meta data and the default value provided by the constructor. Nevertheless, this is a workable solution and has been used in a number of other UI kits and IDEs. Inprise notably, minimized this inconsistency by extending their language to support the notion of defaults.

Typically (Delphi, VB, ...) this technique is only implemented for primitive types, leaving default non-primitives types in the position that they must always be instantiated. If the constructor had already instantiated a value for a non-primitive property and this value was of the same class as the desired value, it would be much better to use that value as that would avoid the creation of a new object. Also, the constructor may have already configured this child object after it created it. There is little point in writing into each archive, property setting code which effectively duplicates the efforts of such a constructor.

[Alternatively, one could try to handle defaults for non-primitives in the same way that they are handled for primitives. To do this we could define a default value for all non-primitive properties along with a means for comparing them - perhaps by recursively examining their properties. This definition of `equals'  - effectively as `property equals' - implies an algorithm which passes over the graph of objects more than once; this is undesirable for a number of reasons.]

Again, we'll not go into the details other than to say that we believe the following algorithm for handling defaults achieves all of these goals and can do so in a single pass over the object graph.

Recursive differencing

The idea here is essentially to run the unarchiving process at the same time as the archiving process, thereby creating a second copy of the object graph as it it being written to the stream. This second, in memory, copy of the object graph can then be used to determine whether each property assignment that we are contemplating will have any effect.

Example

Suppose we are currently examining the "state" property of an object called "button0". Let us call the environment (hash table) in which the original object and its children will be placed: "env1", and the environment in which the clones are placed: "env2".

If env1.button0.state is true we will be considering writing the following assignment to the output:

    write("button0.state = true;\n");

We know that the destination environment will have a property env2.button0 as well. If the value of env2.button0.state is also true we can skip writing out this statement since we know that when the file is finally read, this statement will have no effect. This handles defaults for primitive values nicely. The surprising part is that, because we only ever compare objects from the same environment, this exact technique is all that we need to do to handle non-primitive values as well.

Preservation of Identity

Throughout our implementation we use "==" to test for equality instead of equals. Internally the hashtable that is used to detect circularities in the object graph is a special hashtable which uses "==" instead of "equals" and "System.identityHashcode" instead of "hashCode".

The result is just what we need, that all objects have the right properties assigned to them and objects that were shared in the source environment end up being shared in the destination environment.
 

Example

Consider the following pseudo-Java program fragment:

INPUT:

    l1 = new JList();
    l2 = new JList();
    writeObject(new Object[]{l1, l2});

This will produce an archive equivalent to the following:

OUTPUT:

    l1 = new JList();
    l2 = new JList();
    result = new Object[]{l1, l2};

So, two lists with models that have exactly the same properties are instantiated by their constructors producing two different lists containing two different models.
 

Example

INPUT:

    model = new DefaultListModel();
    l1 = new JList(model);
    l2 = new JList(model);
    writeObject(new Object[]{l1, l2});

OUTPUT:

    l1 = new JList();
    l2 = new JList();
    l2.model = l1.model;
    result = new Object[]{l1, l2};

 Here the two lists have the same model, and the file reflects this. Moreover, the algorithm has realized that since the first list already has a model of the right type there is no point in throwing away the model inside the list, l1, only to replace it with another fresh instance of the same model. If  there are any public properties, they can be set on the model which was created by the constructor. If not (as here), the default model contained in the Jlist is indistinguishable from one that we would instantiate afresh.
 

Efficiency

At first sight this technique seems expensive. In fact, its performance characteristics are extraordinarily well suited to the problem domain. First of all, extra time is spent at the writing stage so as to minimize the size of the file. For Applet use, downloading time is critical and minimizing the size of resources that will be vended by a server is paramount. Secondly, any successful Application will be read hundreds of times more than it is written. It is therefore a good trade to spend more time in writing the archive if this can reduce the time taken to read it.

It is also worth highlighting that the cost of performing incremental cloning during the write operation is, in both time and heap allocation, equal to the cost of reading it. If this operation is unduly costly - for any reason - the resulting file will be too costly to load anyway and is probably of little value.
 

Error Handling

To prevent archives form becoming useless to future versions of a Java package, comprehensive error recovery is required.  Some Beans may contain properties that cannot be set to certain values when the Bean is in a certain state. During the writing phase, we will be setting these values in our cloning environment and can therefore catch any exceptions and skip the emission of the offending assignment. This guarantees the correctness of an archive to the point that the reading process is sure to proceed in another (identically configured) VM without error.

Of course, a VM containing a different version of the library defining the classes may throw errors in different places, so the input stream must also catch errors encountered in the evaluation of its statements. When an error is encountered during the reading process the statement is discarded and evaluation continues with the next statement. If the statement that threw the error was an instantiation that was intended to bind a name to a value in the evaluating environment, this variable is left unbound. Any subsequent reference to that variable will also throw an error and skip the statement that referenced it.
 

Special Cases

Objects could, in theory, have undesirable side effects precluding the use of this technique - forcing the default values of a class to be stated explicitly. If an object only calls instance methods it cannot change state anywhere outside the `cloning' environment in which it is created. This is the normal case and exceptions to it seem to be very rare. Static methods uniquely provide a means for the clones to change state in areas outside their environment.

So far, we have only found one such example of this sort of side effect, and that is in the awt's Window class.
 

INPUT:

    window = new Window();
    writeObject(window);

OUTPUT:

    window = new Window();
    window.name = "window0";
    result = window;

Here, a static variable is being incremented as each window is created, the defaults system notices this and sets the property so that it is identical to the original.

Note also that the Window has one other undesirable side effect: if the "visible" property is true, for example, a second window would appear on the screen every time a window object was archived! For this reason the default values of a Window are handled as a special case.

Other special cases are possible, perhaps a GUI might log into a database on startup. If the database only accepted one entry at a time, such a GUI could not be written out without manually removing the offending properties. Again, the APIs that we have defined for the definition of properties allow these potential problems to be circumvented at development time.
 

Advantages

We have mentioned how this technique reduces the size of the archives produced. Smaller files, with less redundant information, are not just good because they take less time and space to use, they are also less contingent on the stability of an API.

Interestingly, the reason we invested so much time in this defaults system is that the list of properties provided by the Introspector caused our serialization code to walk very deeply into the internal structures of the objects that were being serialized. Taking each of Swing's top level components we found that roughly half of them contained at least one object that could not be instantiated naively. Invariably, the problem classes did not have nullary constructors and did not expose enough of their state to produce the values required in the constructors they did define.

More than 99% of the "problem cases" in swing were fixed by using the defaults scheme above.

Side Effects

In Swing's top level components there are many 'cover methods' that expose, for example, the properties of the models they contain. Since these models also expose their properties using the Beans conventions the properties will be detected by the introspector. This brings up the possibility of setting properties twice, unless the list of properties is manually configured to remove the duplicates.

Example

Suppose we have an instance of Swing's JList class: "list0" with properties "model" and "count" - where the list's model also has the property "count". The two possible assignments are:

    list0.count = 10;
    list0.model.count = 10;

An interesting and desirable consequence of using the incremental cloning technique is that this does not matter which order the "model" and "count" properties are defined in the JList class. The reason for this is that the side effect of setting the "count" property on the list is precisely as we would expect - it sets the "count" property of the model it contains. Since this happens to the cloned version of the input, by the time we come to compare the values in the second assignment they will be "equal" and the assignment will be skipped.
 

Ordering of Properties

In all the Swing classes the accommodation of side effects reduces the task of providing a good serialization strategy for each Swing component to that of providing a list of properties to record and an order in which they should be recorded. In general, it is most efficient to vend the properties in the order where the properties with the most side effects appear first. That way, the majority of the state in the properties which follow can be skipped and the file will be shorter.
 

Conclusion

We have spent the last twelve months or so refining a technique for persisting user interfaces that use the Swing toolkit by restricting all our archives strictly to the use of public APIs. In practice this step appears to make the archives "version resilient" in the same way that old Java programs will compile and run with backward compatible APIs of newer libraries.

With such a large library, general techniques turned out to be far more effective than Swing specific ones. So much so that our final code base is entirely independent of Swing (the class defining the meta data is the only class which imports the Swing classes). We have built a simple BeanBox style builder to demonstrate the construction and serialization of interfaces and, so far, all Swing components can be written in a compact form and read without error.

According to our tests, time spent parsing the files and making the reflective calls is insignificant compared to the raw cost of performing the operations specified in them. We think the resulting file format (at least in structural terms) is as small as it is practical to make it - normally being significantly smaller than the normal .java, .class and .ser representations that contain the same information. If the files are minimally small (which has only been demonstrated empirically) this implies this is the fastest of all techniques that read archives written in textual (human readable) formats.
 

Back to Persistence Article