The Case of the Missing Text

In the June issue, I presented an XSL style sheet that used patterns to locate objects within the document tree, and showed how you can specify template rules to format these objects. The example was interesting in that it demonstrated the process of transforming XML into HTML and showed how you can combine CSS style rules to format the HTML output.

Originally, I had written and tested the example using IBM's LotusXSL style sheet engine. In that arrangement, the style sheet was to be served along with its accompanying XML document from a Java servlet. However, at the end of that column I suggested that, while it hadn't been tested, the style sheet should run in most XSL processors, including Internet Explorer 5's.

Apparently, many of you tested that assumption and things didn't run as advertised. The symptom in all cases was that the entire document, save one word ("by"), was missing. In tracking down the problem, I discovered a few differences between the XSL draft specification and Internet Explorer's implementation. In fairness to Microsoft, the differences reflect the quickly changing specification. Nevertheless, an understanding of how IE processes style sheets will save you countless hours of head scratching.

The Problem

The June issue included an XML document, news.xml, which presented a typical online news story. The document contained elements that described the various parts of the article including its title, dek (subtitle), and byline, as well as formatting information within paragraphs. For example, the first character in the first paragraph was a drop-cap letter that would be larger than the rest of the characters in the document. (You can view news.xml and the style sheet mentioned below in the June source-code section at webtechniques.com.)

The style sheet used to transform the XML document went something like this: A root template was used to create a boilerplate of the HTML output. Wherever we wanted to place text from the XML document, a pattern was used to select that portion of the document. Then <xsl:apply-templates> was called to process child nodes. For example, we called <xsl:apply-templates select="Story/SectionTitle"/> to include the title of the document at the top of the story.

When <xsl:apply-templates> is called, two things happen. First, the processor grabs the content specified in the pattern; in this case, the content of the Story/SectionTitle element is News&Views. Next, <apply-templates> looks in the style sheet to see if there are any templates that apply to this node. In the original example, the story title was processed directly in the root template and no other template rules were specified. Thus, it was simply included with the appropriate formatting information.

However, later in the style sheet when <xsl:apply-templates select="Story//BodyText"/> is called, a chain of events occurs (see Listing One). First, the content from the BodyText element is retrieved. This time, a template for BodyText exists, so it is instantiated and processed. The only thing the BodyText template does is call <apply-templates> to process its children. And herein lies the problem.
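The chain of events can be sketched roughly as follows. This is a minimal illustration only, not the original listing: it is written in XSLT 1.0 syntax rather than the working-draft dialect that IE5 implemented, and the H1 wrapper around the title is an assumption.

```xml
<?xml version="1.0"?>
<!-- Sketch only: XSLT 1.0 syntax, not the draft dialect shipped with IE5. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Root template: lays out the HTML boilerplate. -->
  <xsl:template match="/">
    <HTML>
      <BODY>
        <!-- No template in this style sheet matches SectionTitle, so the
             processor falls back on its default handling of that node.
             The H1 element is a hypothetical choice of formatting. -->
        <H1><xsl:apply-templates select="Story/SectionTitle"/></H1>

        <!-- A BodyText template exists below, so it is instantiated. -->
        <xsl:apply-templates select="Story//BodyText"/>
      </BODY>
    </HTML>
  </xsl:template>

  <!-- The BodyText template only hands processing on to its children. -->
  <xsl:template match="BodyText">
    <xsl:apply-templates/>
  </xsl:template>

</xsl:stylesheet>
```

A processor that follows XSLT 1.0 supplies built-in rules that continue the traversal and copy text through, so this sketch produces output there. The question is what a processor does when the style sheet supplies no rule for the children of BodyText.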

At the time this style sheet was written, it was not clear whether <apply-templates> should process all descendants of the current node, or just its immediate children. LotusXSL assumes that all descendants will be processed. From an author's perspective, this is better, because it means that you don't have to write a separate template rule for every element type in your document. Microsoft, on the other hand, assumed that only immediate child nodes should be processed. That means you must write a separate template for each and every element type in your document.

Solving the Mystery

As you might guess, to get the style sheet from the June issue to work properly in Internet Explorer 5, you must write template rules for all of your document's element types. Essentially, what you must do is get your templates to cascade down through the tree in order to touch all of the elements. To do this, most rules will simply call <apply-templates> to traverse to the next level of child nodes. If any element type requires special formatting, you can simply add it to that template.

Listing Two contains an excerpt from the revised style sheet. The complete example -- including the new DTD, XML source document, and XSL style sheet -- is available online at www.webtechniques.com/cgi-bin/sourcecode. The first thing to note in the new example is that I have renamed the elements in the XML document for readability. In particular, the BodyText element has been renamed to aBody, and the Story element has been renamed to article.

Browsing through Listing Two, you'll notice that the root template includes a rule to process the aBody element. The pattern article//aBody says "start at the document element article and select any descendants that are aBody elements." This allows aBody elements that are nested within other elements to be processed. When this <apply-templates> is instantiated, the processor looks for any templates that match aBody. Since there is an aBody template, it becomes instantiated. The only statement in this template is an <apply-templates>, which says "process all immediate child nodes."

Within the tree structure, child nodes of aBody include para1, para, and para2. Once again, the processor sets out to find templates for these element types and locates a template for each. The template for para1 simply inserts an HTML paragraph element (<P>) and calls <apply-templates> to process its child nodes. Children of the para1 element include the dropCap, bold, and italic elements. Note that there's also a text child node that represents para1's content. Again, templates exist for each of these element types. In the case of the dropCap template, an HTML <DIV CLASS="dropCap"> element is inserted. Don't mistake this for a reference to an XML element. The CLASS attribute for this <DIV> actually references a CSS <STYLE> rule of the same name, which is located in the root template (not shown).

Next, the dropCap template also calls <apply-templates> to process its child nodes. This time, the only child node is a text node representing the element's content -- in this case, the "W" character. This character is inserted into the <DIV> element and the processor moves on to process the other templates.
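The cascade described above might look something like the following sketch. It is not the published listing: it uses XSLT 1.0 syntax rather than IE5's working-draft dialect, the CSS declaration for dropCap is invented for illustration, and the HTML chosen for para, para2, bold, and italic is an assumption.

```xml
<?xml version="1.0"?>
<!-- Sketch only: XSLT 1.0 syntax, not the draft dialect shipped with IE5. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Root template: HTML boilerplate plus the CSS rule that the
       dropCap DIV refers to. The declaration itself is hypothetical. -->
  <xsl:template match="/">
    <HTML>
      <HEAD>
        <STYLE>
          .dropCap { float: left; font-size: 300%; }
        </STYLE>
      </HEAD>
      <BODY>
        <!-- Select every aBody descendant of the article element. -->
        <xsl:apply-templates select="article//aBody"/>
      </BODY>
    </HTML>
  </xsl:template>

  <!-- aBody does nothing except pass control to its children. -->
  <xsl:template match="aBody">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Each paragraph type becomes an HTML paragraph (assumed to be the
       same for para and para2 as for para1), then cascades downward. -->
  <xsl:template match="para1 | para | para2">
    <P><xsl:apply-templates/></P>
  </xsl:template>

  <!-- The CLASS value names the CSS rule above, not an XML element. -->
  <xsl:template match="dropCap">
    <DIV CLASS="dropCap"><xsl:apply-templates/></DIV>
  </xsl:template>

  <!-- Inline formatting; the B and I mappings are assumptions. -->
  <xsl:template match="bold">
    <B><xsl:apply-templates/></B>
  </xsl:template>

  <xsl:template match="italic">
    <I><xsl:apply-templates/></I>
  </xsl:template>

</xsl:stylesheet>
```

Every element type between the root and the text gets a rule of its own, and each rule calls <apply-templates> so that processing continues one level further down the tree.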

I fully expected this approach to solve our mystery. The title, dek, byline, and document text all should have appeared in the browser. But I had the same problem: Only the solitary word "by" appeared in the window.

It turns out that Microsoft requires that you create a template rule to process every node type other than elements. This means you must create templates to process attributes, comments, processing instructions, and yes, text nodes. That seems peculiar since text is so common that a template for handling it is built into XSL. Nevertheless, IE requires that you include the template rule found in Example 1 to display an element's content.
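A rule of this kind is only a few lines long. The sketch below is written in XSLT 1.0 syntax as an illustration; the working-draft dialect in IE5 uses a different namespace, and its exact pattern syntax for text nodes may differ, so treat the details as assumptions rather than as the published example.

```xml
<?xml version="1.0"?>
<!-- Sketch only: XSLT 1.0 syntax, not the draft dialect shipped with IE5. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Match any text node and copy its value to the result tree. -->
  <xsl:template match="text()">
    <xsl:value-of select="."/>
  </xsl:template>

</xsl:stylesheet>
```

In an XSLT 1.0 processor this rule merely restates the built-in behavior for text nodes, which is why it looks redundant there.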

With this tiny bit of code, the mystery was solved. The result is shown in Figure 1. Presumably, these implementation details were not made clear when Microsoft wrote its processor. In any case, this template rule should be included in any style sheet you design for use in IE.

More Q & A

The techniques described in this month's column were prompted by a question I received from a reader. While I'm at it, I thought I'd share a couple of other questions with you. One recent inquiry came from Ted Baer, who wanted to dynamically build a hypertext link using XSL, where the target filename is an attribute of an XML element. In his words:

I need to create something like
<A HREF="target.xml">
where target comes from an XML element
<GOHERE ref="target">

Assuming that you are outputting HTML from XSL (that is, transforming the XML to HTML), you could simply use a pattern to access the attribute, then generate an HTML anchor tag using the attribute for the HREF. In fact, this month's example uses this approach to create a link to a biography in the author's byline. The code is shown in Example 2.
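One way to express this idea is sketched below, using Ted's GOHERE element and ref attribute. It is an illustration in XSLT 1.0 syntax, not the byline code from this month's example, and it assumes that the link text is the content of the GOHERE element.

```xml
<?xml version="1.0"?>
<!-- Sketch only: XSLT 1.0 syntax, not the draft dialect shipped with IE5. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="GOHERE">
    <A>
      <!-- Build HREF from the ref attribute plus a literal ".xml" suffix.
           Kept on one line so no stray whitespace ends up in the value. -->
      <xsl:attribute name="HREF"><xsl:value-of select="@ref"/>.xml</xsl:attribute>
      <!-- Assumption: the content of GOHERE supplies the link text. -->
      <xsl:apply-templates/>
    </A>
  </xsl:template>

</xsl:stylesheet>
```

Given a source element such as <GOHERE ref="target">, the HREF attribute of the generated anchor would carry the value target.xml.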

Ted was actually full of questions. He later wrote:

How does one determine whether he or she is outputting to HTML or XML? I indeed tried what you said, and the resulting text string was a stunningly perfect

<A HREF="target.xml">xxx</A>

displayed on the screen when I launched the XML file. But it wasn't a link that IE5 should have interpreted. I thought that there was no distinction between XML and HTML in your output tree type. Is there something I need to declare, like a processing instruction (PI)? If so, put it in your book -- I haven't found a decent explanation yet.

As it turns out, a complete answer involves a lengthy explanation. First, the interpretation process depends on how you load the XML stream. For example, you could simply launch the XML file in the browser and rely on XSL to process the document, or you could load it and process it via the DOM. Also, I am not privy to all of the inner workings of Internet Explorer; however, the important point to keep in mind is that conceptually there are two trees -- the source tree and the result tree. The source tree is constructed by parsing the original XML document and placing all elements, attributes, comments, processing instructions, and so on into a tree structure.

The result tree is constructed from what is specified in the XSL style sheet (in this case, transformed HTML). At this point, the tree nodes represent well-formed XML. If the output is to go to a file, then the output will look like HTML and can be processed by any HTML browser. Presumably, IE shortcuts this process. That is, when you launch the XML file directly in IE, it parses the document into the source tree, constructs the result tree, reads it, and processes the output as HTML.

Conclusion

One thing you may have learned from all of this is that despite vendors' best efforts to comply with existing standards, various XSL engines still exhibit peculiar differences. Part of the problem is that implementation depends on which version of the standard was used, and what state it was in at the time. We, as Web developers, must still grapple with these differences.

Thanks to you, the reader, we've shined the flashlight in corners that might otherwise have been overlooked. In the process, you've gained some insight that I hope will save you time in the future. In the meantime, keep those questions coming.



Michael publishes BeyondHTML.com, and serves as Web Techniques' editor at large. His upcoming book Building Web Sites with XML is due out from Prentice Hall later this year. Michael can be reached at mfloyd@lifestylesSantaCruz.com.




Copyright © 2003 CMP Media LLC