The Swing HTML Parser
Parsing a Netscape Navigator Bookmarks File
By Scott Violet
The high-level Swing component, JEditorPane,
is responsible for displaying, among other things, HTML text. However,
this article shows how you can use the HTML parser outside of JEditorPane.
An example provided shows how to use the standard HTML parser (also
shipped with HotJava) to parse
the bookmarks file created by Netscape Navigator. Previous Swing
Connection articles have featured the custom component, JTreeTable.
This article also demonstrates an enhanced editable JTreeTable.
The parser provided by Swing is DTD driven, and therefore is capable
of parsing much more than HTML. This article uses the parser with
its standard DTD for parsing HTML documents. Future articles will
address how to configure the parser with a custom DTD, and will
discuss the binary DTD format used by the parser.
HTMLEditorKit.ParserCallback
The main entry point into the HTML parser is the class ParserDelegator.
ParserDelegator parses an HTML document passed in as a Reader and
notifies the passed-in ParserCallback object of each event encountered
while parsing. ParserCallback defines the following methods:
public void flush() throws BadLocationException
public void handleText(char[] data, int pos)
public void handleComment(char[] data, int pos)
public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos)
public void handleEndTag(HTML.Tag t, int pos)
public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos)
public void handleError(String errorMsg, int pos)
public void handleEndOfLineString(String eol)
Here is a simple example of creating your own ParserCallback
subclass to output all the text from an HTML document:
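HTMLEditorKit.ParserCallback callback =
    new HTMLEditorKit.ParserCallback() {
        public void handleText(char[] data, int pos) {
            System.out.println(data);
        }
    };
// Both the FileReader constructor and parse can throw IOException.
Reader reader = new FileReader("myFile.html");
// false: do not ignore a character set specified in the document.
new ParserDelegator().parse(reader, callback, false);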
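For readers who want something that runs as-is, here is a minimal, self-contained sketch of the same idea. It parses from a String rather than a file and collects the text instead of printing it directly. The class name TextExtractor and the method extractText are made up for illustration, and passing true as the last argument (ignore charset declarations) is an assumption that suits input that is already decoded.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of the article's sources.
class TextExtractor {
    // Returns every text chunk reported by the parser, one per line.
    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append('\n');
            }
        };
        try {
            // true: ignore any charset declared in the document (an
            // assumption made here because the input is already a String).
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.print(extractText(
            "<html><body><p>Hello</p><p>World</p></body></html>"));
    }
}
```

Wrapping the IOException keeps the helper easy to call from code that is not prepared to handle checked exceptions; whether that is appropriate depends on the application.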
When the parser encounters a tag, it invokes either handleStartTag
or handleSimpleTag, based on the tag. The method parameters
specify the tag, any attributes on the tag, and the position in
the reader where the element was encountered.
handleSimpleTag is invoked for empty tags. Empty
tags are tags that are defined not to have an end tag, and can thus
have no content or child tags. BR and IMG
are examples of empty tags, whereas P is not an empty
tag. (While the end tag for P is optional, P
is not an empty tag.) The set of empty tags currently supported
by the Swing HTML DTD is:
BASEFONT
BR
AREA
LINK
IMG
PARAM
HR
INPUT
ISINDEX
BASE
META
FRAME
The handleSimpleTag method is also invoked for tags not
defined in the DTD. For example, <foo> is not a
valid HTML tag, and thus handleSimpleTag is invoked when
the tag foo is encountered. On the other hand, handleStartTag
is invoked for the valid non-empty tags -- the normal tags that are
defined in the DTD.
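To see which callback a given tag triggers, a small sketch like the following can help. It records every tag reported through handleStartTag and handleSimpleTag. The class name TagKinds and the "start:" and "simple:" prefixes are invented for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of the article's sources.
class TagKinds {
    // Returns entries such as "start:p" or "simple:br", in callback order.
    static List<String> classify(String html) {
        final List<String> seen = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                seen.add("start:" + t);   // non-empty tags defined in the DTD
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                seen.add("simple:" + t);  // empty tags and tags unknown to the DTD
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        // P is a normal tag, BR is an empty tag, foo is not in the DTD.
        for (String entry : classify("<p>a<br>b<foo>c")) {
            System.out.println(entry);
        }
    }
}
```

The output also includes start tags the parser implies, such as html, head, and body.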
Both handleStartTag and handleSimpleTag
are passed a MutableAttributeSet
containing the attributes of the tag. The MutableAttributeSet
argument is reused by the caller. If you need to keep a reference
to the AttributeSet,
you must make a copy, perhaps using AttributeSet.copyAttributes.
If an attribute is defined, the MutableAttributeSet
key is an instance of HTML.Attribute,
otherwise (with a few exceptions) it is a String containing
the name of the attribute. For normal attributes, the attribute
values in the AttributeSet are Strings.
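The following sketch illustrates both points: copying the attribute set so it survives after the callback returns, and looking up values by an HTML.Attribute key or by a String key. The class AttributeCopies, the method anchorAttributes, and the attribute name bar are hypothetical; the lowercase String key assumes the parser lowercases unknown attribute names, as the bookmarks code later in this article also relies on.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.AttributeSet;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of the article's sources.
class AttributeCopies {
    // Returns a copy of the attributes of the first A tag, or null if none.
    static AttributeSet anchorAttributes(String html) {
        final AttributeSet[] found = new AttributeSet[1];
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.A && found[0] == null) {
                    // The parser reuses 'a', so keep a copy, not the reference.
                    found[0] = a.copyAttributes();
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found[0];
    }

    public static void main(String[] args) {
        AttributeSet attrs = anchorAttributes(
            "<html><body><a href=\"x\" bar=\"y\">link</a></body></html>");
        // Attributes known to HTML.Attribute use that constant as the key.
        System.out.println(attrs.getAttribute(HTML.Attribute.HREF));
        // Other attributes are keyed by their name as a String.
        System.out.println(attrs.getAttribute("bar"));
    }
}
```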
Two special keys and one value worth noting are:
ParserCallback.IMPLIED
Indicates the DTD implied a particular tag, but it was not
present in the content. For example, <html><body><table><td>
is not legal HTML, as TR is missing. The parser
generates the TR and notes that it was implied by adding
ParserCallback.IMPLIED as a key in the AttributeSet passed
into handleStartTag.
HTML.Attribute.ENDTAG
Indicates that the end of an element not defined in the DTD was
encountered. Remember that handleSimpleTag is invoked
for elements not defined in the DTD. handleSimpleTag
is also invoked for the end tags of such elements
(such as </foo>). The callback method can
check for this by checking for the key HTML.Attribute.ENDTAG
in the passed-in AttributeSet.
HTML.NULL_ATTRIBUTE_VALUE
Indicates an attribute of an element did not have an explicit
value, and the DTD did not have a default value. For example,
<tr rowspan width=10% foo> illustrates the
three possible types of attribute values. The width
attribute has an explicit value of 10%. The rowspan
attribute has an implicit value of 1 (implicit values are defined
in the DTD; the attribute rowspan of a TR
element has a default value of 1). The foo attribute
has neither an explicit nor a default value, which is indicated
by NULL_ATTRIBUTE_VALUE.
The callback method can identify attributes that don't have
a defined value by checking for the value HTML.NULL_ATTRIBUTE_VALUE
in the AttributeSet as the value for the attribute
name.
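As a rough sketch of how a callback might check for these three markers, the class below records a note each time one of them shows up. The class name SpecialKeys and the note strings ("implied:", "endtag:", "novalue:") are invented for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import javax.swing.text.AttributeSet;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of the article's sources.
class SpecialKeys {
    static List<String> scan(String html) {
        final List<String> notes = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                // Tag generated by the parser rather than found in the input.
                if (a.isDefined(HTMLEditorKit.ParserCallback.IMPLIED)) {
                    notes.add("implied:" + t);
                }
                noteMissingValues(t, a);
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                // End tag of an element that is not defined in the DTD.
                if (a.isDefined(HTML.Attribute.ENDTAG)) {
                    notes.add("endtag:" + t);
                }
                noteMissingValues(t, a);
            }

            // Records attributes with neither an explicit nor a default value.
            private void noteMissingValues(HTML.Tag t, AttributeSet a) {
                Enumeration<?> names = a.getAttributeNames();
                while (names.hasMoreElements()) {
                    Object name = names.nextElement();
                    if (HTML.NULL_ATTRIBUTE_VALUE.equals(a.getAttribute(name))) {
                        notes.add("novalue:" + t + "." + name);
                    }
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return notes;
    }

    public static void main(String[] args) {
        for (String note : scan("<html><p>A <foo>xx</foo>")) {
            System.out.println(note);
        }
    }
}
```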
handleEndTag is invoked for closing tags that are
known to the DTD, such as </html>.
handleText, as the name implies, is invoked when
any content is encountered in the document. The text and location
(as an integer into the document) are passed in. Each occurrence
of white space (any newlines, tabs, carriage returns, or multiple
spaces) is coalesced into a single space character.
Any errors encountered are notified via the handleError
method. The default implementation of the callback method ignores
any errors, as many pages on the web do not contain valid HTML.
flush is actually not invoked by the parser, but
by HTMLEditorKit,
to indicate that parsing has successfully finished.
Since white space is stripped when parsing, handleEndOfLineString
is invoked after parsing with the best guess for the end of line
string. The end of line string will be \n, \r, or \r\n, whichever
is encountered the most in the document.
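A callback that wants to report problems rather than ignore them can override handleError, and can pick up the end of line guess from handleEndOfLineString. The sketch below gathers both along with the text chunks. ParseReport and its fields are hypothetical names used only for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of the article's sources.
class ParseReport {
    final List<String> text = new ArrayList<>();
    final List<String> errors = new ArrayList<>();
    String endOfLine;  // set once parsing has finished

    static ParseReport run(String html) {
        final ParseReport report = new ParseReport();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                report.text.add(new String(data));
            }

            @Override
            public void handleError(String errorMsg, int pos) {
                // The default implementation does nothing; here we keep a record.
                report.errors.add(pos + ": " + errorMsg);
            }

            @Override
            public void handleEndOfLineString(String eol) {
                report.endOfLine = eol;
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return report;
    }

    public static void main(String[] args) {
        ParseReport report = run("<html><p>A <foo>xx</foo>");
        System.out.println("text:   " + report.text);
        System.out.println("errors: " + report.errors);
    }
}
```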
Sample Document
Let's take a look at what happens when a small HTML file is parsed.
Consider the following HTML:
<html><p>A <foo>xx</foo><a href=test>link</a>
The following shows the sequence of invocations on the callback
as well as some of the values:
| Method          | Position | Tag  | Attributes   | Text                   |
| handleStartTag  | 0        | html |              |                        |
| handleStartTag  | 6        | head | IMPLIED=true |                        |
| handleEndTag    | 6        | head |              |                        |
| handleStartTag  | 6        | body | IMPLIED=true |                        |
| handleStartTag  | 6        | p    |              |                        |
| handleError     | 16       |      |              | tag.unrecognized foo?? |
| handleText      | 9        |      |              | A                      |
| handleSimpleTag | 11       | foo  |              |                        |
| handleError     | 25       |      |              | end.unrecognized foo?? |
| handleText      | 16       |      |              | xx                     |
| handleSimpleTag | 18       | foo  | ENDTAG=true  |                        |
| handleStartTag  | 24       | a    | HREF=test    |                        |
| handleText      | 37       |      |              | link                   |
| handleEndTag    | 41       | a    |              |                        |
| handleEndTag    | 44       | p    |              |                        |
| handleEndTag    | 44       | body |              |                        |
| handleEndTag    | 44       | html |              |                        |
A couple of things are worth noting. Because the parser looks
ahead to determine state, it is often possible to get error notification
out of order. Notice that the HTML has A <foo>,
but the parser reports that foo is invalid before it
reports the text A. Also notice that the parser automatically
generates start tags for the HEAD and BODY,
even though they are not specified in the document. Callback methods
can detect tags implied by the DTD by checking for the ParserCallback.IMPLIED
key in the MutableAttributeSet.
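A trace like the one in the table can be produced with a callback that logs every invocation. The following is a sketch of one way to do that; CallbackTrace and its line format are made up here, and the exact positions and ordering may differ between releases of the parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of the article's sources.
class CallbackTrace {
    // Returns one line per callback invocation, in the order they occur.
    static List<String> trace(String html) {
        final List<String> lines = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                lines.add("handleStartTag " + t + " @" + pos + " " + a);
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                lines.add("handleSimpleTag " + t + " @" + pos + " " + a);
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                lines.add("handleEndTag " + t + " @" + pos);
            }

            @Override
            public void handleText(char[] data, int pos) {
                lines.add("handleText " + new String(data) + " @" + pos);
            }

            @Override
            public void handleError(String errorMsg, int pos) {
                lines.add("handleError " + errorMsg + " @" + pos);
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : trace("<html><p>A <foo>xx</foo><a href=test>link</a>")) {
            System.out.println(line);
        }
    }
}
```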
Netscape Bookmarks File
In recent versions of Netscape Navigator, bookmarks are saved
in a file format that closely resembles HTML. Here is a sample bookmarks.html
file (bookmark.htm on Windows):
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<!-- This is an example file! -->
<TITLE>Bookmarks for TreeTableExample 3</TITLE>
<H1>Bookmarks for TreeTableExample 3</H1>
<DD>Toolbar Folder<<BR>
<A
<DL><p>
<DT><H3 ADD_DATE="871524103">Games</H3>
<DL><p>
<DT><A HREF="http://www.activision.com" ADD_DATE="917293502"
LAST_VISIT="920521850"
LAST_MODIFIED="920521850">Activision</A>
</DL>
</DL>
Notice that each bookmark directory is represented with DL,
and each bookmark entry is represented with DT.
To better illustrate these concepts, we will create a JTreeTable
that displays bookmarks from a Navigator bookmarks file, similar
to the Edit Bookmarks... feature of Navigator. As the
format of the bookmarks file is almost valid HTML, we can use the
Swing HTML parser with the default DTD. A custom implementation
of ParserCallback (called Bookmarks) is
used to create the internal objects used to represent the bookmarks.
[Figure: snapshot of the GUI, a JTreeTable displaying the bookmarks]
The name of the directory and the name of the bookmark entry are
both represented as text in the file. As such, when handleText
is invoked on Bookmarks, a new object is created to
represent the directory or bookmark. Bookmarks sets
an instance variable to indicate which object should be created
when the handleText method is invoked: either a representation
of a bookmark directory (BookmarkDirectory) or a representation
of a bookmark entry (BookmarkEntry). This instance
variable, state, is set in the handleStartTag
method based on the passed-in parameters. A DT followed
by the anchor tag A identifies the accompanying text
as a bookmark entry name. A DT followed by an H3
tag identifies the accompanying text as the name of a bookmark directory.
In addition to setting the state variable, handleStartTag
extracts the ADD_DATE attribute for bookmark directories,
and the HREF, ADD_DATE, LAST_VISIT,
and LAST_MODIFIED attributes for bookmark entries.
Here is the relevant code for handleStartTag:
// url, createDate, lastVisited, state, and lastTag are instance
// variables, so the values gathered here are available to handleText.
public void handleStartTag(HTML.Tag tag,
                           MutableAttributeSet attrSet, int pos) {
    if (tag == HTML.Tag.A && lastTag == HTML.Tag.DT) {
        try {
            url = new URL(
                (String)attrSet.getAttribute(HTML.Attribute.HREF));
        } catch (MalformedURLException murle) {
            url = null;
        }
        createDate = convertNetscapeDateToDate
            ((String)attrSet.getAttribute("add_date"));
        lastVisited = convertNetscapeDateToDate
            ((String)attrSet.getAttribute("last_visit"));
        state = BOOKMARK_ENTRY;
    }
    else if (tag == HTML.Tag.H3 && lastTag == HTML.Tag.DT) {
        createDate = convertNetscapeDateToDate
            ((String)attrSet.getAttribute("add_date"));
        state = BOOKMARK_DIRECTORY;
    }
    lastTag = tag;
}

// The file stores dates in seconds; Date expects milliseconds.
private Date convertNetscapeDateToDate(String nsDate) {
    return new Date(1000L * Long.parseLong(nsDate));
}
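The date conversion above assumes the attribute is always present and numeric. A more defensive variant might look like the sketch below. The class NetscapeDates and the choice to return null for a missing or malformed value are assumptions made for illustration, not the behavior of the article's sources.

```java
import java.util.Date;

// Hypothetical helper, not part of the article's sources.
class NetscapeDates {
    // Converts a value such as "917293502" (seconds) to a Date,
    // returning null when the attribute is absent or not a number.
    static Date toDate(String nsDate) {
        if (nsDate == null) {
            return null;
        }
        try {
            return new Date(1000L * Long.parseLong(nsDate.trim()));
        } catch (NumberFormatException nfe) {
            return null;
        }
    }

    public static void main(String[] args) {
        // ADD_DATE value taken from the sample bookmarks file.
        System.out.println(toDate("917293502"));
    }
}
```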
state is used to indicate what should happen when
the next block of text is encountered and is used in handleText
to create the correct representation of the data. Here is the relevant
code for handleText:
public void handleText(char[] data, int pos) {
    switch (state) {
    case BOOKMARK_ENTRY:
        createBookmark(new String(data), url, createDate,
                       lastVisited);
        break;
    case BOOKMARK_DIRECTORY:
        createBookmarkDirectory(new String(data), createDate);
        break;
    default:
        break;
    }
    state = NO_ENTRY;
}
The last information we need to track is which directory new bookmark
entries are added to. When a new bookmark directory is created via
createBookmarkDirectory, it becomes the directory new
entries are added to. Similarly, when an end DL tag
is encountered, the directory new entries are added to should become
the current directory's parent directory. handleEndTag
is overridden to handle this case, and looks like:
public void handleEndTag(HTML.Tag t, int pos) {
    if (t == HTML.Tag.DL && parent != null) {
        parent = (BookmarkDirectory)parent.getParent();
    }
}
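Putting the three callbacks together, the state machine can be sketched in a compact, self-contained form. This is a simplified stand-in, not the Bookmarks class from the article's sources: the BookmarkSketch and Node classes are hypothetical, dates are omitted, and the root folder is never left even if an extra end DL tag arrives.

```java
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical, simplified stand-in for the article's Bookmarks class.
class BookmarkSketch extends HTMLEditorKit.ParserCallback {
    // Hypothetical node type: a folder when href is null, otherwise an entry.
    static class Node {
        final String name;
        final String href;
        final Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String name, String href, Node parent) {
            this.name = name;
            this.href = href;
            this.parent = parent;
        }
    }

    private static final int NO_ENTRY = 0;
    private static final int BOOKMARK_ENTRY = 1;
    private static final int BOOKMARK_DIRECTORY = 2;

    final Node root = new Node("root", null, null);
    private Node parent = root;   // folder that receives new nodes
    private HTML.Tag lastTag;
    private int state = NO_ENTRY;
    private String href;

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.A && lastTag == HTML.Tag.DT) {
            href = (String) attrs.getAttribute(HTML.Attribute.HREF);
            state = BOOKMARK_ENTRY;
        } else if (tag == HTML.Tag.H3 && lastTag == HTML.Tag.DT) {
            state = BOOKMARK_DIRECTORY;
        }
        lastTag = tag;
    }

    @Override
    public void handleText(char[] data, int pos) {
        String name = new String(data);
        if (state == BOOKMARK_ENTRY) {
            parent.children.add(new Node(name, href, parent));
        } else if (state == BOOKMARK_DIRECTORY) {
            Node dir = new Node(name, null, parent);
            parent.children.add(dir);
            parent = dir;         // new entries now go into this folder
        }
        state = NO_ENTRY;
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.DL && parent.parent != null) {
            parent = parent.parent;   // leaving the current folder
        }
    }
}
```

An instance would be handed to ParserDelegator.parse together with a Reader on the bookmarks file, after which root holds the resulting tree.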
Editable JTreeTable
Previous articles on JTreeTable (Creating
TreeTables in Swing and Creating
TreeTables: Part 2) have not touched on how to make the
JTree column editable. We have received numerous
requests asking how to do this, so, for this example, the JTree
column of the JTreeTable is editable. There are different
approaches that can be taken to make the JTree column
editable, and we describe the most straightforward approach here.
The obvious approach is to make the JTree itself
editable. This does not work. One subtle point of renderers and
editors is that the same
Component cannot be both the renderer and the editor.
Remember that the renderer is used as a rubber stamp, that is, the
renderer Component is continually added and removed
from the containment hierarchy and asked to paint at each step,
similar to rubber stamping a document. On the other hand, the editor
Component is much longer lived. The editor Component
exists in the JTable or JTree for as long as the JTable
or JTree is being edited. The two problems with using the
same Component for both the renderer and editor then
become:
- The renderer could be asked to paint when editing, resulting
in the editor's current value getting lost.
- After rendering a value, the editor may no longer be in the
containment hierarchy, making subsequent editing fail.
To give the illusion the JTree is editable, a custom
TableCellEditor
is used by the JTable for the JTree column.
In this way, the JTree is never really editable, rather,
the JTable is responsible for editing the JTree
column. The only other trickery involved is that JTree
editors do not usually take up all the horizontal space allocated
to them while JTable editors do. JTree
editors instead are horizontally positioned based on the depth of
the current node. (This behavior actually depends on the current
look and feel, but all the look and feels currently defined for
JTree base the horizontal position on the depth.) To
solve this problem, the custom JTextField
used for the actual editing Component locks its horizontal
location based on the current node's depth. To do this, reshape
is overridden to look like this:
public void reshape(int x, int y, int w, int h) {
    int newX = Math.max(x, offset);
    super.reshape(newX, y, w - (newX - x), h);
}
In the above code, offset is set before the editor
Component is returned from the editor. That is, offset
is computed in the getTableCellEditorComponent method of
JTreeTable's cell editor by looking at JTree's getRowBounds
and the width of DefaultTreeCellRenderer's icon.
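The offset computation and the bounds adjustment can be sketched as follows. This is an approximation under stated assumptions, not the code from JTreeTable.java: the class TreeTableEditorSketch and its members are hypothetical, and adding the renderer's icon-text gap to the icon width is a guess at a reasonable layout.

```java
import java.awt.Rectangle;
import javax.swing.Icon;
import javax.swing.JTextField;
import javax.swing.JTree;
import javax.swing.tree.DefaultTreeCellRenderer;
import javax.swing.tree.TreeCellRenderer;
import javax.swing.tree.TreePath;

// Hypothetical sketch, not the code from JTreeTable.java.
class TreeTableEditorSketch {
    // Returns {x, width} after locking the left edge at 'offset'.
    static int[] lockHorizontal(int x, int w, int offset) {
        int newX = Math.max(x, offset);
        return new int[] { newX, w - (newX - x) };
    }

    // Text field whose left edge never moves left of 'offset'.
    static class OffsetTextField extends JTextField {
        int offset;

        @Override
        @SuppressWarnings("deprecation")
        public void reshape(int x, int y, int w, int h) {
            int[] xw = lockHorizontal(x, w, offset);
            super.reshape(xw[0], y, xw[1], h);
        }
    }

    // Where the text of the given tree row starts: the row's x position
    // plus room for the node icon (the gap is an assumption, see above).
    static int computeOffset(JTree tree, int row) {
        Rectangle bounds = tree.getRowBounds(row);
        int offset = (bounds == null) ? 0 : bounds.x;
        TreeCellRenderer renderer = tree.getCellRenderer();
        if (renderer instanceof DefaultTreeCellRenderer) {
            DefaultTreeCellRenderer r = (DefaultTreeCellRenderer) renderer;
            TreePath path = tree.getPathForRow(row);
            Object node = (path == null) ? null : path.getLastPathComponent();
            Icon icon;
            if (node != null && tree.getModel().isLeaf(node)) {
                icon = r.getLeafIcon();
            } else if (tree.isExpanded(row)) {
                icon = r.getOpenIcon();
            } else {
                icon = r.getClosedIcon();
            }
            if (icon != null) {
                offset += r.getIconTextGap() + icon.getIconWidth();
            }
        }
        return offset;
    }
}
```

A cell editor would call computeOffset for the row being edited, store the result in the text field's offset, and then return the text field as the editing Component.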
The Source
The following files are new or have changed since the last JTreeTable
article:
- Bookmarks.java - Responsible for parsing the Netscape bookmarks file.
- BookmarksModel.java - An implementation of TreeTableModel based on
  the values from an instance of Bookmarks.
- DynamicTreeTableModel.java - An implementation of TreeTableModel that
  uses reflection to look up values.
- JTreeTable.java - An implementation of JTable with one column
  containing a JTree. The new feature is editing support for the
  JTree column.
- TreeTableExample3.java - Builds the GUI containing all the necessary
  components.
The following files have not changed since the last JTreeTable
article:
A sample Navigator bookmarks file can be found here: bookmarks.html.
All of the sources can be downloaded at once from the zip
file bookmarks.zip.
The main method is in TreeTableExample3.java.
By default, TreeTableExample3 looks for the file bookmarks.html
in the ~/.netscape directory. If this file cannot be
found, the bookmarks.html file in the current directory
is used. Or you can specify an alternate file at the command line:
% java TreeTableExample3 myBookmarksFile.html
Conclusion
We have only lightly touched on the capabilities of the HTML parsing
support in Swing. Future articles will more fully explore this powerful
and flexible feature.