Appeared in Object Magazine, February 1998
Document Objects with Style
Technology trends come in at least two flavors: hype and revolution. Separating the two, and deciding where to apply your limited resources, can often feel like a roller coaster ride, or a trip through the haunted house, depending on your perspective! I hitched my cart to the Java train back in August 1995, convinced that the train would lead to revolution. The Extensible Markup Language (XML) is now at a similar stage of evolution, and I'm equally convinced that it will yield a revolution in hybrid Web-Object systems. I've hitched my cart to the XML train and will describe some of the scenery that I've seen on my initial journey.
Although XML depends heavily on its relationship to, and inheritance from, the Standard Generalized Markup Language (SGML), I will attempt to describe XML's benefits and merits based on its own specifications. SGML (an ISO standard since 1986) contributes a lot of capabilities and prior thought, but it also brings a bit of baggage from its history. I do not mean to criticize SGML; it will continue to be a viable choice for complex document management. But for XML to be successful in the much larger and more diverse Web community, it needs to stand on its own feet and be understandable without history lessons. So far, however, I've found it necessary to dig deeply into SGML's roots to understand XML's potential. Hopefully, this will change with maturity and with new XML guidelines and books.
First, you need to expand your notion of what constitutes a "document." Simply stated, an XML (or SGML) document is a composite structure of node objects, each having optional attributes. The principal sub-nodes are typed "elements" and blocks of uninterpreted text. From these basic roots, you can construct schemas defining valid node structures, and document instances that adhere to the schema. In my previous column (see Object Magazine, November 1997) I discussed several XML draft standards and W3C working documents for defining XML document schemas. This month, I'll focus on two draft standards for processing XML document instances: the Document Object Model and the Extensible Style Language.
Document Object Models
From the developer's perspective, XML usage can be roughly separated into parsing and application processing. There are several good XML parsers written in Java, available for free download (see, for example, www.microsoft.com/standards/xml/xmlparse.htm). However, if you limit parsing to the generation of tokens according to some predefined grammar, then you still need to think about useful object models for representing the "source tree" that is produced by the parser. One potential standard for representing this structure is the Document Object Model (DOM) as defined by a W3C-sponsored working group (see www.w3.org/DOM). The DOM goes significantly beyond representation of the parse tree and proposes an interface for manipulating document objects and for constructing documents within your application program.
I'll briefly summarize the core objects in the DOM, but keep in mind that this is a draft specification subject to change. The Node class defines an abstract interface for getting, inserting, and removing child nodes within a recursive structure. NodeList and NodeEnumerator classes are defined for traversing sets of Node objects. Several specialized subclasses of Node are defined for Document, Element, Attribute, Text, Comment, PI (processing instruction), and Reference. The Document object contains a pointer to the root Element in the document tree, and a DocumentContext object contains a pointer to the Document and adds metadata about that document. Instances of the Element object would be created for each markup tag in the document, and uninterpreted text is stored in a data attribute of the Text object.
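As a rough illustration of the node structure summarized above, the sketch below uses the org.w3c.dom interfaces that ship with current Java releases. That API descends from the DOM effort but is not the draft described in this column (it has no NodeEnumerator or DocumentContext, for instance), so treat it as an approximation. The class name DomSketch and the sample document are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class DomSketch {
    // Hypothetical sample document, for illustration only.
    static final String SAMPLE =
        "<article id=\"a1\"><title>Document Objects</title></article>";

    // Parse an XML string into a DOM Document (the parser's "source tree").
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    // Recursively walk the composite structure, one output line per node.
    static void outline(Node node, int depth, StringBuilder out) {
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            indent.append("  ");
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.append(indent).append("element ").append(node.getNodeName());
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                out.append(' ').append(attr.getNodeName())
                   .append('=').append(attr.getNodeValue());
            }
            out.append('\n');
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(indent).append("text \"")
               .append(node.getNodeValue()).append("\"\n");
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            outline(children.item(i), depth + 1, out);
        }
    }

    // Parse the sample, add a new element from application code, then outline it.
    static String demo() {
        Document doc = parse(SAMPLE);
        Element para = doc.createElement("para");
        para.appendChild(doc.createTextNode("Added by the application"));
        doc.getDocumentElement().appendChild(para);

        StringBuilder out = new StringBuilder();
        outline(doc.getDocumentElement(), 0, out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

The demo method covers both roles the column attributes to the DOM: reading the tree a parser produced, and constructing new document nodes from within the application.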
It's interesting (and appropriate) that the DOM specification defines the object model interface using the CORBA interface definition language (IDL). The authors are careful to point out that use of IDL does not require the use of CORBA, but enables a language-independent definition that can be easily translated to implementation languages. The specification also provides an equivalent Java interface definition. I have not yet seen any implementations of the DOM interface specifications, but I expect that will change by the time you read this column.
At the time of this writing, only the Core Document Structure and Navigation specification draft has been published. Future specifications will specialize this object model to HTML and XML structures, and to object models for document schemas and stylesheets.
Flow Objects and Transformations
There is a second specification draft targeted at document stylesheet definition and document formatting. The Extensible Style Language (XSL) is itself based on XML: the stylesheets are XML document instances, specifying how other XML documents should be transformed and/or formatted for presentation. (See www.w3.org/TR/NOTE-XSL.html for the specification draft.)
The term "stylesheet" is somewhat misleading, because it includes a general capability for transforming the document's source tree into an output tree, based on a set of construction rules. The output tree can be another document object model defined by a different schema, and the construction rules will map target elements from the source tree into corresponding elements in the output tree. The term "tree" is used to signify the composite structure of the document object model hierarchy of nodes. Each output element is called a "flow object," and the XSL specification includes the definition of a standard set of flow objects, analogous to a standard class library in Java. Initially, a set of flow objects will be defined that allow XML documents to be transformed into HTML documents, which can then be viewed in existing Web browsers.
Whereas only one construction rule can be applied to each element in the source tree, any number of style rules can be applied. Style rules do not create new flow objects, but modify the characteristics of flow objects produced by construction rules. If you are familiar with rule-based expert systems, these stylesheets look like a knowledge base for document transformation. Each construction rule contains a pattern that identifies the source element to which the rule applies, and an action that specifies the flow object to be created. There is even a conflict resolution algorithm for choosing from among multiple rules that might be applied to a particular element.
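The rule mechanism described above can be sketched in plain Java. This is a toy model, not the XSL draft's syntax or its actual conflict resolution algorithm: the Elem and Rule classes, the pattern form (an element name plus an optional parent name), and the "most specific pattern wins" policy are all simplifying assumptions made for illustration. Style rules are left out.

```java
import java.util.ArrayList;
import java.util.List;

class RuleSketch {
    // A source-tree element: name, text content, parent link, child elements.
    static final class Elem {
        final String name;
        final String text;
        Elem parent;
        final List<Elem> children = new ArrayList<>();

        Elem(String name, String text) {
            this.name = name;
            this.text = text;
        }

        Elem add(Elem child) {
            child.parent = this;
            children.add(child);
            return this;
        }
    }

    // Hypothetical construction rule: a pattern (element name, optional
    // parent name) and an action (the output element to create).
    static final class Rule {
        final String elementName;
        final String parentName; // null means "any parent"
        final String outputTag;

        Rule(String elementName, String parentName, String outputTag) {
            this.elementName = elementName;
            this.parentName = parentName;
            this.outputTag = outputTag;
        }

        boolean matches(Elem e) {
            if (!elementName.equals(e.name)) {
                return false;
            }
            return parentName == null
                || (e.parent != null && parentName.equals(e.parent.name));
        }

        // Assumed policy: a pattern that also names the parent is more specific.
        int specificity() {
            return parentName == null ? 1 : 2;
        }
    }

    // Conflict resolution: choose exactly one matching rule per element.
    static Rule select(List<Rule> rules, Elem e) {
        Rule best = null;
        for (Rule r : rules) {
            if (r.matches(e) && (best == null || r.specificity() > best.specificity())) {
                best = r;
            }
        }
        return best;
    }

    // Build the output tree as markup text (no escaping; sketch only).
    static String transform(List<Rule> rules, Elem e) {
        Rule rule = select(rules, e);
        StringBuilder out = new StringBuilder();
        if (rule != null) {
            out.append('<').append(rule.outputTag).append('>');
        }
        out.append(e.text);
        for (Elem child : e.children) {
            out.append(transform(rules, child));
        }
        if (rule != null) {
            out.append("</").append(rule.outputTag).append('>');
        }
        return out.toString();
    }

    static String demo() {
        Elem doc = new Elem("article", "")
            .add(new Elem("title", "Document Objects"))
            .add(new Elem("section", "").add(new Elem("title", "Flow Objects")));

        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule("article", null, "body"));
        rules.add(new Rule("title", null, "h1"));
        rules.add(new Rule("title", "section", "h2"));
        rules.add(new Rule("section", null, "div"));
        return transform(rules, doc);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Both title rules match the title inside the section, and the rule that also names the parent wins, so that title becomes an h2 while the article's own title becomes an h1.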
Although XSL is only in its first draft, I've already found two implementations available on the net. First, xslj is, according to its developer, a "virtually complete implementation of XSL by way of translation into extended DSSSL." Xslj is a frontend for processing XSL stylesheets and XML documents with existing SGML tools; DSSSL is a Scheme/Lisp-based stylesheet language used with SGML documents. This approach has a useful advantage: you can use these existing tools to transform any XML document into other presentation formats, including SGML, HTML, RTF, and TeX. To download a copy of xslj, including its C source code, see www.ltg.ed.ac.uk/~ht/xslj.html.
If you're not afraid of the bleeding edge, check out the "docproc" tool available at http://jersey.uoregon.edu/ser/software/docproc_2/docs/. This XSL processor is written entirely in Java and runs as a servlet, allowing any XML document to be filtered and formatted for presentation in existing HTML Web browsers. It is in the midst of development, so current features may vary, but it looks like a very interesting testbed!
An XML document, by definition, only specifies the logical structure of document elements, and makes no statement about the document's formatting or presentation. An XSL stylesheet and processor would transform the document into its viewable form. Looking into the future, one can envision Web browsers that receive XML documents, parse them into DOM representations, and use a built-in XSL processor to apply the stylesheet specified by the document. Shared, standardized stylesheets can be available from centralized Web servers, and an XML document simply refers to the URL for its preferred presentation style. Alternatively, a customized XML document structure can refer to its private stylesheet, or include the style rules directly in the document.
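The parse-then-style pipeline envisioned above can be approximated with the javax.xml.transform API in current Java releases. One loud caveat: that API processes XSLT 1.0, the W3C Recommendation that later came out of the XSL work, not the draft syntax discussed in this column. The class name, sample document, and stylesheet are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class StylesheetSketch {
    // Hypothetical document: logical structure only, no presentation.
    static final String DOCUMENT =
        "<note><to>Reader</to><body>Hello</body></note>";

    // Hypothetical XSLT 1.0 stylesheet mapping source elements to HTML
    // elements. The xml output method is used only to keep the result easy
    // to predict; a real page would more likely use method="html".
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
      + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"note\">"
      + "<html><body><xsl:apply-templates/></body></html>"
      + "</xsl:template>"
      + "<xsl:template match=\"to\">"
      + "<h1><xsl:value-of select=\".\"/></h1>"
      + "</xsl:template>"
      + "<xsl:template match=\"body\">"
      + "<p><xsl:value-of select=\".\"/></p>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Apply a stylesheet to a document and return the result as text.
    static String render(String xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)),
                new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(render(DOCUMENT, STYLESHEET));
    }
}
```

Because the stylesheet is just another input to render, the same document could be paired with a shared stylesheet fetched from a server or with a private one, as the paragraph above suggests.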