Extreme Markup

In the good ol' days, there were frames, tables, IMG tags, and anchors. HTML was, and is, great at marking up documents for publishing. And after all, HTML is the language of the Web. So why is there all this interest in the eXtensible Markup Language (XML), and where did it come from?

The short answer to the second question is the Standard Generalized Markup Language (SGML). SGML is not a markup language, per se. Rather, it's a metalanguage that lets you create your own customized markup language. This is, in fact, what HTML is: an instance of specific tags defined in SGML. More accurately, HTML can be fully specified by a particular Document Type Definition (DTD) in SGML. So, at the next party when you hear that SGML is the grandfather of HTML, you can respond that HTML is actually the most widely deployed SGML application in the world.

The answer to the first question is that HTML's original draw -- its simplicity -- is now its greatest drawback. Users want HTML to do things it was never designed to do, and as a result the HTML 4 specification has added new complexities to a once elegantly simple language. The bigger problem is that we'll have to wait for the next generation of browsers to fully support the new specification. And if we, as developers, choose to use these new features, we risk excluding older, noncompliant browsers.

Recognizing these and other problems, the W3C went back to the SGML drawing board and decided that if HTML represents the proverbial fish, then SGML is the pole. The problem with SGML is that it's bloated with esoteric features, so Jon Bosak, Tim Bray, Dan Connolly, and others working as part of a generic SGML Editorial Review Board (with significant help from the SGML Working Group) came up with a leaner, meaner SGML for the Web: XML. The initial XML draft specification was presented at the SGML '96 Conference in Boston. But it wasn't until Microsoft released its Channel Definition Format (CDF) that the XML shot was heard round the world. That's because CDF relies on XML for its declarative syntax.

Touring XML

XML has been characterized as many things, including SGML for the Web, SGML Lite, HTML on steroids, and HTML++. These characterizations reflect the two distinct communities of developers interested in XML. As a restricted form of SGML, XML makes life much simpler for SGML authors and tool developers, while making existing SGML documents accessible from the Web. Consider that SGML documents still outnumber HTML documents many times over.

The idea behind XML is that it decouples markup syntax from presentation semantics. Combining XML with a style-sheet language like XSL allows you to overcome HTML's primitive page-layout features. If you've ever tried wrapping text around an image, creating a drop cap at the opening of a paragraph, or flowing text into a two-column format, you know what I mean. Web designers have come up with clever methods to overcome many of these problems, but they're often kludges that slow download time and clutter the document with unnecessary markup. This is also the reason we haven't seen good Web page-layout tools -- there's no easy or reliable way to reproduce these effects in different browsers. By separating document markup from visual presentation, XML tools can now be developed that overcome HTML's inadequate page layout.

XML extends HTML in many ways. The most noticeable feature for HTML hackers is that XML allows you to create your own custom tags. You use these tags to create rich documents that are more structured than HTML documents. In many cases, you'll also find that XML-tagged documents are less complex. Custom tags can be used to tailor the data for a back-end database, to drive searches, or to describe a structure for a navigation system. For instance, you can create your own tag, <JavaCode>, and use it to mark up text just like any HTML tag, as shown in Example 1(a).
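To give a feel for how an application might pick out custom-tagged text, here is a minimal sketch. It is not the article's Example 1(a): the <Para> wrapper, the sample sentence, and the method name are made up for illustration, and Python's standard xml.etree.ElementTree module stands in for an XML-aware application.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: <Para> and the method name are illustrative only.
doc = "<Para>Call <JavaCode>paint()</JavaCode> to redraw the applet.</Para>"

root = ET.fromstring(doc)

# Collect the text of every <JavaCode> element, e.g. to feed a keyword index.
methods = [el.text for el in root.iter("JavaCode")]
print(methods)
```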

If your only concern is to display the Java method using a different font type, then stick with HTML and use a Cascading Style Sheet definition -- XML is not required. On the other hand, if you're creating an index of keywords for a search engine and want to include the name of the Java method, then you would create an XML declaration, called a "document type declaration," and pass this information to an application that can process it. A declaration for <JavaCode> might look like Example 1(b). The declaration is placed in a file along with the other declarations for this document. Collectively, this file is "loosely" called a Document Type Definition (DTD).
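As a sketch of the idea, assuming the declarations are embedded in the document itself rather than kept in a separate file, the following uses Python's standard expat bindings to show an application seeing both the declaration and the tagged text. The document, the <Para> element, and the variable names are hypothetical, and this is not the article's Example 1(b).

```python
from xml.parsers import expat

# Hypothetical document with an internal DTD subset declaring <JavaCode>.
doc = """<!DOCTYPE Para [
  <!ELEMENT Para (#PCDATA | JavaCode)*>
  <!ELEMENT JavaCode (#PCDATA)>
]>
<Para>Call <JavaCode>paint()</JavaCode> to redraw.</Para>"""

declared = []   # element names the parser reports as declared
keywords = []   # text found inside <JavaCode>, e.g. for a search index
inside = []     # stack of currently open element names

def on_text(data):
    # Keep only character data whose innermost open element is <JavaCode>.
    if inside and inside[-1] == "JavaCode":
        keywords.append(data)

parser = expat.ParserCreate()
parser.ElementDeclHandler = lambda name, model: declared.append(name)
parser.StartElementHandler = lambda name, attrs: inside.append(name)
parser.EndElementHandler = lambda name: inside.pop()
parser.CharacterDataHandler = on_text
parser.Parse(doc, True)

print(declared, "".join(keywords))
```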

Language Elements

There are six types of markup you can use in an XML document: comments, elements, attributes, entities, processing instructions, and CDATA. Comments, which begin with <!-- and end with -->, can be used to annotate your document.

We've already seen how to declare and use an element. Elements are used to create custom tags and generally describe the content they enclose. Unlike in HTML, every XML element must include an end tag, as seen with </JavaCode> in Example 1(a). XML also supports the notion of empty elements. An empty element can be written as <JavaCode></JavaCode>, or more conveniently with just a single tag containing a trailing slash (<JavaCode/>).
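A quick way to convince yourself of both rules is to hand a few fragments to a parser. This sketch uses Python's standard xml.etree.ElementTree; the <Para> wrapper and sample text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# The two spellings of an empty element should parse to the same thing.
long_form = ET.fromstring("<JavaCode></JavaCode>")
short_form = ET.fromstring("<JavaCode/>")
same = (long_form.tag == short_form.tag
        and (long_form.text or "") == (short_form.text or "")
        and len(long_form) == len(short_form) == 0)

# Leaving out the </JavaCode> end tag is an error, not something to guess at.
try:
    ET.fromstring("<Para>Call <JavaCode>paint()</Para>")
    missing_end_tag_rejected = False
except ET.ParseError:
    missing_end_tag_rejected = True

print(same, missing_end_tag_rejected)
```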

You use attributes to associate a name with a value. For example, I could create a tag to identify captions for figures, tables, listings, and examples within an article by creating a Caption tag and associating an attribute as: <Caption ID="Figure">.
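Reading the attribute back is equally simple. In this small sketch the caption text is a placeholder, and Python's standard xml.etree.ElementTree plays the part of the XML application.

```python
import xml.etree.ElementTree as ET

# The caption text is a placeholder; only the ID attribute comes from the article.
caption = ET.fromstring('<Caption ID="Figure">Article tree structure</Caption>')

# Attributes arrive as simple name/value pairs.
kind = caption.get("ID")
print(kind, caption.attrib)
```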

XML reserves certain characters, such as < and &, that ordinarily can't be used as-is in your documents. You use entities to get around this problem. In addition, entities are used to add characters that are hard to type directly, such as Unicode characters, to a document. You can also use entities to refer to repeated text, or to include the content from an external file.
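For instance, the predefined entities &lt;, &gt;, and &amp; stand in for reserved characters, and a numeric character reference such as &#169; names a character by its code. The sketch below, with a made-up <Code> element, shows a parser turning them back into ordinary text.

```python
import xml.etree.ElementTree as ET

# <Code> is a hypothetical element; the entities are the point here.
src = "<Code>if (a &lt; b &amp;&amp; b &gt; c) // &#169; 1998</Code>"

# The parser replaces each reference with the character it stands for.
text = ET.fromstring(src).text
print(text)
```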

Processing instructions (PIs) are used to supply an application with data. The parser does not treat the characters within a PI as part of the document's content; it hands them to the application instead. The syntax designates a "PI target" that can be identified by an application. According to the specification, applications should not attempt to process targets they don't recognize. A CDATA section is another way to tell the XML parser to treat reserved characters as plain text rather than markup. For example, mathematical equations often contain less than (<) and greater than (>) signs, which could be confused with markup for an element.
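Here is a small sketch of both constructs, using Python's standard xml.dom.minidom. The PI target page-break and the <Equation> element are hypothetical names chosen for illustration.

```python
from xml.dom import minidom

# "page-break" is a made-up PI target; <Equation> is a made-up element.
src = ('<?page-break before="yes"?>'
       '<Equation><![CDATA[a < b && b > c]]></Equation>')
dom = minidom.parseString(src)

# The PI is kept out of the element content and exposed as target plus data.
pi = dom.firstChild
pi_target, pi_data = pi.target, pi.data

# Inside the CDATA section, < and & are taken literally.
equation = "".join(node.data for node in dom.documentElement.childNodes)
print(pi_target, pi_data, equation)
```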

Document Type Declarations

The power in XML can be traced to document type declarations. You use these declarations to give your custom tags meaning and to communicate information about the content in your document to the XML application. People often refer to document type declarations as DTDs because they're the equivalent of SGML's DTDs, or Document Type Definitions. In most programming languages there's a semantic difference between declarations and definitions. In XML, however, declarations can be contained in either the XML document or an external DTD file. Unlike SGML, XML does not require DTDs (as we'll conveniently call them).

There are four types of declarations: element, attribute list, entity, and notation. Element declarations associate a named element with its "content model," the content an element can contain. For example, you could declare ArticleText to contain plain text, text for subheads, and specially formatted characters to designate programming code as shown in Example 2(a). In this example, the #PCDATA refers to parsed character data. The pipe characters (|) mean "or", so ArticleText can contain PCDATA, a subhead, or CODE characters. The plus sign after Subhead means that ArticleText can contain one or more subheads; the question mark after CODE means that it is optional.
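To see how a parser reports content models, here is a sketch using Python's standard expat bindings. The declarations are my own stand-ins, not the article's Example 2(a). Because XML 1.0 limits content models that contain #PCDATA to the form (#PCDATA | A | B)*, the plus sign and question mark are shown on a hypothetical element-only model for Article, and the describe() helper is likewise an invention for this example.

```python
from xml.parsers import expat

doc = """<!DOCTYPE Article [
  <!ELEMENT Article (Title, Subhead+, CODE?)>
  <!ELEMENT ArticleText (#PCDATA | Subhead | CODE)*>
]>
<Article><Title>t</Title><Subhead>s</Subhead></Article>"""

M = expat.model
QUANT = {M.XML_CQUANT_NONE: "", M.XML_CQUANT_OPT: "?",
         M.XML_CQUANT_REP: "*", M.XML_CQUANT_PLUS: "+"}

def describe(model):
    """Turn expat's (type, quantifier, name, children) tuple back into DTD syntax.

    Handles names, mixed content, sequences, and choices; EMPTY and ANY
    are not covered by this sketch.
    """
    ctype, quant, name, children = model
    if ctype == M.XML_CTYPE_NAME:
        return name + QUANT[quant]
    if ctype == M.XML_CTYPE_MIXED:
        parts = ["#PCDATA"] + [describe(c) for c in children]
        return "(" + " | ".join(parts) + ")" + QUANT[quant]
    sep = ", " if ctype == M.XML_CTYPE_SEQ else " | "
    return "(" + sep.join(describe(c) for c in children) + ")" + QUANT[quant]

models = {}
parser = expat.ParserCreate()
parser.ElementDeclHandler = lambda name, model: models.update({name: describe(model)})
parser.Parse(doc, True)

for name, text in models.items():
    print(name, text)
```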

Attribute declarations define the attribute list for an element. In Example 2(b) I've created an attribute list for the Subhead element. In this example, Level is the name of the attribute, and it may take on one of three values: HEAD1, HEAD2, or HEAD3. The second occurrence of HEAD1 outside the closing parenthesis indicates that it is the default value.
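The effect of a default value is easy to observe. In the sketch below, the attribute-list declaration follows the description above, but the surrounding document and subhead text are hypothetical, and the declarations sit in the document's internal subset so that Python's standard parser can see them.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE ArticleText [
  <!ELEMENT ArticleText (#PCDATA | Subhead)*>
  <!ELEMENT Subhead (#PCDATA)>
  <!ATTLIST Subhead Level (HEAD1 | HEAD2 | HEAD3) "HEAD1">
]>
<ArticleText>
  <Subhead>Touring XML</Subhead>
  <Subhead Level="HEAD2">Language Elements</Subhead>
</ArticleText>"""

# The first Subhead omits Level, so the parser supplies the declared default.
levels = [(s.text, s.get("Level")) for s in ET.fromstring(doc).iter("Subhead")]
print(levels)
```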

Entity declarations let you associate a name with some other content fragment in the document -- a copyright notice, for example. The content fragment can be text within the document, part of a document type declaration, or a reference to an external file. The specification also distinguishes among internal, external, and parameter entities.
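An internal entity is the simplest case. In this sketch the entity name notice, the <Footer> element, and the notice text are all made up; the point is only that the parser substitutes the declared text wherever the entity is referenced.

```python
import xml.etree.ElementTree as ET

# Hypothetical entity name and replacement text.
doc = """<!DOCTYPE Footer [
  <!ENTITY notice "Copyright 1998, Example Publishing">
]>
<Footer>&notice;</Footer>"""

# The reference &notice; is replaced by the declared text during parsing.
footer_text = ET.fromstring(doc).text
print(footer_text)
```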

The final type of declaration is the notation declaration, which is used to identify external binary data -- say a GIF89 image -- so that the XML application can process it. The language of XML is necessarily complex, and space constraints prevent a thorough discussion here. But document type declarations are fertile ground for future columns, so stay tuned.
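To make the notation declaration a little more concrete, here is a sketch using Python's standard expat bindings. The file name figure1.gif, the entity name tree, the <Figure> element, and the use of "image/gif" as the system identifier are all assumptions for illustration. The parser does not touch the binary data; it only reports the declarations so the application can decide what to do.

```python
from xml.parsers import expat

doc = """<!DOCTYPE Figure [
  <!NOTATION GIF89 SYSTEM "image/gif">
  <!ENTITY tree SYSTEM "figure1.gif" NDATA GIF89>
  <!ELEMENT Figure EMPTY>
  <!ATTLIST Figure Source ENTITY #REQUIRED>
]>
<Figure Source="tree"/>"""

notations = {}   # notation name -> system identifier
unparsed = {}    # entity name -> (system identifier, notation name)

def on_notation(name, base, system_id, public_id):
    notations[name] = system_id

def on_unparsed_entity(name, base, system_id, public_id, notation_name):
    unparsed[name] = (system_id, notation_name)

parser = expat.ParserCreate()
parser.NotationDeclHandler = on_notation
parser.UnparsedEntityDeclHandler = on_unparsed_entity
parser.Parse(doc, True)

print(notations, unparsed)
```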

Adding Structure

Now that you can create your own tags at will, what new tags should you create and how should they be used? One reason to create new tags is that you can better describe your documents with additional markup, thus adding value. More importantly, you can give your document structure. That structure can be read by XML parsers and processed in interesting and novel ways. For example, you can create a navigational system that "knows" your document's structure and can assist users in maneuvering through it. Help documents are a perfect application.
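As a rough sketch of the navigation idea, the following builds an indented table of contents from subhead elements. The sample document, its titles, and the outline() helper are hypothetical; only the Subhead element and its Level values echo names used in this column.

```python
import xml.etree.ElementTree as ET

# Hypothetical article; the titles and text are placeholders.
doc = """<Article>
  <Title>Sample Column</Title>
  <Subhead Level="HEAD1">Getting Started</Subhead>
  <ArticleText>...</ArticleText>
  <Subhead Level="HEAD2">Installing</Subhead>
  <ArticleText>...</ArticleText>
  <Subhead Level="HEAD1">Next Steps</Subhead>
</Article>"""

INDENT = {"HEAD1": 0, "HEAD2": 1, "HEAD3": 2}

def outline(xml_text):
    """Return an indented table of contents built from Subhead elements."""
    root = ET.fromstring(xml_text)
    lines = [root.findtext("Title", default="(untitled)")]
    for sub in root.iter("Subhead"):
        depth = INDENT.get(sub.get("Level", "HEAD1"), 0)
        lines.append("  " * (depth + 1) + sub.text)
    return lines

toc = outline(doc)
print("\n".join(toc))
```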

To demonstrate how structuring works, I've excerpted a portion of Ken North's March 1998 "Database Developer" column and marked up the document with some XML tags that I've defined. The feature stories and columns all contain some common elements -- Title (or headline), Subtitle (called a deck), text, subheads, and so on. I've defined these elements in terms of a structured hierarchy; see Figure 1. Notice that "Article" is at the root of the tree structure, and all other elements are descendants of "Article." The other groupings include elements that are part of the magazine's structure (such as Footers), and the largest group contains elements specific to the article. Listing One contains the marked-up document based on this tree structure.

Aside from the fact that the tags are custom, the markup looks very much like HTML. The first line tells the XML parser which version of XML the document conforms to. The second line points to the DTD that defines our custom tags. Strictly speaking, this is not required since XML doesn't require a DTD. Note that tags in XML must be properly nested, unlike HTML, where tag hierarchies are not strictly enforced. The only other item of note is the <Subhead> element, which includes an attribute to identify specific levels of subheads.
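The nesting rule is easy to test for yourself. The sketch below is not Listing One: the file name article.dtd, the <Emphasis> element, and the sample text are assumptions, and Python's standard parser (which checks well-formedness but does not fetch or validate against the external DTD) stands in for the parser.

```python
import xml.etree.ElementTree as ET

# Version line, then a pointer to a (hypothetical) external DTD, then the markup.
well_formed = """<?xml version="1.0"?>
<!DOCTYPE Article SYSTEM "article.dtd">
<Article><Title>Sample <Emphasis>Column</Emphasis></Title></Article>"""

# Same content, but </Title> closes before </Emphasis>: overlapping tags.
badly_nested = ("<Article><Title>Sample <Emphasis>Column"
                "</Title></Emphasis></Article>")

def is_well_formed(text):
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed(well_formed), is_well_formed(badly_nested))
```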

When Listing One is passed through an XML parser, a tree structure similar to Figure 1 is constructed. The question is, where can you find an XML parser? As XML-enabled tools become available, the parsers will be embedded into applications. In the meantime, both Internet Explorer 4 and Mozilla include one, so you can experiment with XML files now. I haven't looked at Mozilla's parser yet, but Internet Explorer's includes an API that's accessible from both Java and JScript. You can use either parser to validate the correctness of your XML documents. Unfortunately, you must write some code to query the parser for the results. Next month, I'll show you how you can send an XML document to the parser and return its structure using JScript. From there, you can build a navigation interface that allows users to browse your document.
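If you'd like a feel for the tree a parser builds before then, here is a minimal sketch. It uses Python's standard library rather than the Internet Explorer or Mozilla parsers, and the sample document is a made-up skeleton, not Listing One.

```python
import xml.etree.ElementTree as ET

def tree_lines(element, depth=0):
    """Yield one indented line per element, depth-first, mirroring the parse tree."""
    yield "  " * depth + element.tag
    for child in element:
        yield from tree_lines(child, depth + 1)

# Hypothetical skeleton of an article; element names are illustrative.
doc = ("<Article><Title/><Deck/>"
       "<ArticleText><Subhead/><CODE/></ArticleText>"
       "<Footer/></Article>")

lines = list(tree_lines(ET.fromstring(doc)))
print("\n".join(lines))
```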

Tools of the Trade

Numerous commercial XML tools are beginning to surface. One you may want to check out is Vervet Logic's XML editor, XML <PRO>, which features an Element Wizard for creating elements, and an Attribute Wizard for adding attributes to elements. The product also includes support for entities, CDATA, comments, XML validation, and DTDs. Complementing your XML development is ArborText's XML Styler, a tool for creating and modifying XSL style sheets. I haven't talked much about style sheets yet, but they're required reading if you plan to use XML. The product includes a GUI for developing XSL style sheets, and hides many of the syntactic and structural details of XSL. Another product, Frontier, is a full-blown content-management system that specifically focuses on XML's ability to separate content from display. This allows larger teams to separate the site management, build, and design functions. And this one is free. Over the coming months, I'll look at these and other tools to help you on your XML way. In the meantime, check out the " Online" included with this article for a list of other freely and commercially available tools.



Michael Floyd is a consultant, freelance writer, and Web Techniques' editor at large. He can be reached via email at mfloyd@web-techniques.com.




Copyright © 2003 CMP Media LLC