Death of a DTD
As cool as XML is, the W3C hasn't exactly been perfect when it comes to defining supporting standards. Case in point? The document type definition, or DTD. As you likely know, a DTD is a collection of rules that define the type of content your elements may contain, the number of times a subelement may occur, and whether subelements and attributes are required or optional. DTDs also let you specify default values for attributes. And without a DTD, you couldn't declare entities of your own for replacement; only the handful of predefined ones would be available.
When I look at the DTD, however, I think, Gee, the W3C really rushed that one out the door! You see, one of the requirements in defining XML was that it be SGML compliant. So when it came to specifying the XML DTD, the XML Working Group apparently decided that it had to use the same syntax as the SGML DTD. The problem with this syntax is that it's different from XML and relatively arcane. Even worse, the syntax varies (slightly) depending on whether it's an internal DTD contained within the XML document, or an external DTD maintained in a separate file. That means learning yet another syntax, and remembering nuances that depend on the physical location of the definitions. In creating its new markup language, the XML Working Group initially missed the opportunity to redesign the DTD.
"Ah!" you say, "what about the requirement that XML remain SGML compliant?" Well, that's the beauty of XML (and SGML, for that matter). The first word in XML is "extensible." And there's no reason you can't rewrite the DTD directly in XML. Of course, this is only possible because the DTD is an optional component of an XML document. (Hmmm. Maybe Jon Bosak and the others in the XML Working Group knew what they were doing after all.) That, in part, is what the XML Schema standard proposes -- the ability to write document type definitions using XML syntax.
Before I proclaim the death of the DTD, let me first say that although I believe DTDs will eventually go away, they're still very important. For starters, they're currently the only consistent vehicle for defining XML schemas. Additionally, DTDs take time to develop, and their number is already growing. That means there will be scores of DTDs around for some time to come.
With that in mind, reading and writing DTDs are still valuable skills for XML developers. So this month I'd like to examine the XML DTD syntax. Despite its arcane appearance, it's actually quite simple to write. And understanding DTD syntax will give you an appreciation for other proposed standards like XML Schema.
DTD Basics
First, let's get some terminology out of the way. Document type definitions are the rules that define the elements, attributes, entities, and so on, within a document. That is, each rule is called a "document type definition." What often causes confusion is that the entire collection of rules is called the document type definition, or DTD.
In providing feedback on my latest book, Charles Goldfarb was quick to point out that a DTD is implied whenever you mark up a document, even if a formal DTD is not present. (As I already mentioned, DTDs are not required by XML documents.) However, the DTD lets you formalize the rules for using elements and attributes while imposing additional constraints.
Why is this important? The answer touted by most is that you can compare these rules to specific documents in order to validate your XML data. However, DTDs can be important for other reasons. For example, you may want to relax the restrictions imposed by a DTD, yet use it for documentation. This lets you share your documents in a less formal way. That is, you can ensure that authors know that the <contact> element takes <name>, <address>, and <phone> elements, or that the <address> element accepts street, city, and state attributes.
Internal vs. External DTDs
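For orientation, here is a rough sketch of the two forms this section contrasts, using the contact rules mentioned above. It is not the article's Example 1 or Example 3: the sample data, the choice of EMPTY for <address>, and the file name contact.dtd are all assumptions made for illustration.

```xml
<?xml version="1.0"?>
<!-- Internal form: the declarations sit between the square brackets of the DOCTYPE. -->
<!DOCTYPE contact [
  <!ELEMENT contact (name, address, phone)>
  <!ELEMENT name    (#PCDATA)>
  <!ELEMENT address EMPTY>
  <!ELEMENT phone   (#PCDATA)>
  <!ATTLIST address
            street CDATA #REQUIRED
            city   CDATA #REQUIRED
            state  CDATA #REQUIRED>
]>
<contact>
  <name>Jane Doe</name>
  <address street="1 Main St." city="Santa Cruz" state="CA"/>
  <phone>555-0100</phone>
</contact>
```

```xml
<?xml version="1.0" standalone="no"?>
<!-- External form: contact.dtd (a hypothetical file) would hold the same four
     ELEMENT declarations and the ATTLIST, with no DOCTYPE wrapper and no brackets. -->
<!DOCTYPE contact SYSTEM "contact.dtd">
<contact>
  <name>Jane Doe</name>
  <address street="1 Main St." city="Santa Cruz" state="CA"/>
  <phone>555-0100</phone>
</contact>
```

The document content is identical in both cases; only the location of the declarations changes.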
When a DTD appears directly within the XML document, it's called an "internal DTD." Internal DTDs are placed in the document type declaration (the <!DOCTYPE>) of your XML document. The DTD begins with the name of the document element followed by a left-hand square bracket ([), and ends with a right-hand square bracket (]). For instance, the code fragment in Example 1 shows a portion of the internal DTD for a contact-database document.
The problem with internal DTDs is that they can't be reused. More often, you'll be creating DTDs in separate files that can be referenced by multiple documents. These are called "external DTDs." Example 2 shows an external DTD. In comparing the two examples, you'll notice that the external DTD does not use the square-bracket syntax to delimit the DTD. In fact, the brackets are a specialized syntax allowed only in the XML document's <!DOCTYPE> declaration.
When you move the DTD to an external file, you need some method for referencing the file. You do this using a combination of statements in both the XML declaration and the document type declaration. For instance, consider the code in Example 3. Instead of containing the DTD, the <!DOCTYPE> declaration references a URL pointing to the DTD. You'll notice that the first item in the <!DOCTYPE> declaration is the name article. This names the document type and is also the document's root (or document) element. You'll also notice the SYSTEM keyword. This basically tells the XML processor that it should fetch the DTD from the URL that follows (if it's required). You can instead specify the PUBLIC keyword. This is reserved for common DTDs. For example, it would be inefficient to send everyone to the W3C Web site to fetch the HTML DTD every time someone requested a Web document. Declaring the document type as PUBLIC informs the XML processor that the application software may have internal knowledge of the DTD. (This is why you see the PUBLIC keyword in most HTML documents.)
The rest is handled in the XML declaration. In Example 3, you'll notice the standalone attribute in the <?xml...?> declaration. When an XML document is standalone, it doesn't require an external DTD to process the document properly. You make the document standalone by setting its attribute value to yes. You can indicate that the external DTD is needed for processing by setting this value to no.
Essential Elements
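As a reference point for this section, here is a rough sketch of the kind of element type declarations it discusses. It is not the article's Example 2. The content model for <article> follows the description in the text, while the content given to each child element is an assumption, chosen only so that the fragment is complete.

```xml
<!-- Sketch of an external DTD fragment for an article document. -->
<!-- Comma = sequence ("and"); pipe = choice ("or").              -->
<!-- Occurrence indicators: ? zero or one, * zero or more, + one or more. -->
<!ELEMENT article ((headline, dek, byline, text+),
                   (listing | figure | example | table)*)>

<!-- Child elements; these content models are assumptions. -->
<!ELEMENT headline (#PCDATA)>
<!ELEMENT dek      (#PCDATA)>
<!ELEMENT byline   (#PCDATA)>
<!ELEMENT text     (#PCDATA)>
<!ELEMENT listing  (#PCDATA)>
<!ELEMENT figure   EMPTY>
<!ELEMENT example  (#PCDATA)>
<!ELEMENT table    (#PCDATA)>
```

In this sketch the comma-separated group enforces order, text+ requires at least one text element, and the trailing * makes the whole choice group optional and repeatable.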
When it comes to creating elements, the element type declaration is like a glob of clay. Using <!ELEMENT>, you can mold an individual element, then add it to other element type declarations to build up more complex element structures. For example, the first element type declaration in Example 2 defines an <article> element. According to this definition, the <article> must contain a headline followed by a dek, a byline, and text. In addition, this element may optionally contain one or more listings, figures, examples, or tables. In essence, the <article> is the finished product, constructed from many subelements. And as you can see from Example 2, each of these elements, likewise, is created using the element type declaration.
So, how does it work? The element type declaration begins with <!ELEMENT followed by the name of the element you want to create, and ends with a closing >. This name is called the element's generic identifier. The part contained in parentheses is called the content specification. The content specification is where you define which elements are allowed within this element, how often they can occur, and so on. For instance, the first portion of the content specification for the <article> element (contained in the first set of parentheses) separates the names of other elements with commas. The comma means "and" and implies that sequence or order is important. Therefore, the headline must come before the dek.
The other separator, the pipe character (|), is used in the second part of the content specification. This character means "or" and implies a choice. Thus, the <article> may contain a figure, or a table, and so on. You can also control the number of occurrences for a given element or the entire content specification. For example, the plus sign following the text element indicates that text is required and that it may occur one or more times. There are only three occurrence indicators (?, *, and +), which are summarized in Table 1.
You can also specify a content type for your specification. For example, you can specify an empty element using the EMPTY keyword, or allow any content using the ANY keyword. An element whose content specification lists only other elements is restricted to contain just subelements. The possible content types are summarized in Table 2.
Finally, you'll notice that many of the elements in Example 2 use the #PCDATA keyword. This is the blob of clay you start with. #PCDATA refers to parsed character data, otherwise known as XML text. According to the XML specification, this can be just about any combination of text-based characters and white space. (There are some notable exceptions, but space limitations don't permit a discussion of those here.)
Attributes
The next task you'll want to master is creating attributes. Well, guess what? Once again there's a new syntax to learn. Fortunately it's straightforward. The idea is that you declare your attributes in an attribute list. This lets you declare one or more attributes at the same time. By declaring multiple attributes in a single attribute list, you effectively create a group that can be assigned to an element in one fell swoop.
As an example, consider the attribute-list declaration in Example 4. The declaration begins with <!ATTLIST and is followed by the name of the element with which you want to associate your attribute. In this case, we're adding an attribute to the <contact> element. Next is the name you want to give to your attribute, followed by a type identifier. In this example, the email attribute is defined to be of type CDATA.
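A declaration of the kind just described might look roughly like the following. This is a sketch, not the article's Example 4: the country attribute and both default declarations are assumptions, added to show that one attribute list can declare several attributes at once.

```xml
<!-- Attribute list for the contact element.                        -->
<!-- Each entry reads: attribute name, type, default declaration.   -->
<!ATTLIST contact
          email   CDATA #IMPLIED
          country CDATA "USA">
<!-- email:   any string; #IMPLIED makes it optional with no default.      -->
<!-- country: assumed attribute; "USA" is used when the author omits it.   -->
```

With this list in place, a <contact> element written without a country attribute is treated by a processor that reads the DTD as if it had country="USA".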
CDATA refers to character data. That is, a CDATA attribute value can be any string of characters. Don't confuse this with a CDATA section in document content, where the XML processor stops scanning for special characters, like a left-hand angle bracket (which indicates an XML element is coming), until it reaches the CDEnd string (]]>). An attribute value is still parsed: entity references in it are expanded, and a literal left-hand angle bracket or ampersand must be escaped. CDATA is simply the least restrictive of the attribute types.
There are, in fact, many types you can assign for your attributes. Your attribute can be a name token (NMTOKEN), enumerated list, notation, ID, IDREF, or an entity. Space constraints do not permit a complete explanation of attribute types here. Briefly, however, a name token is like a CDATA attribute, except that it's restricted to characters representing letters and numbers, and the period, underscore, hyphen, and colon characters. For example, name tokens let you include a decimal or a minus sign for a numeric value. As the name implies, enumerated lists let you create a list of name values that can be used as attribute values. IDs let you give an element a name that is unique within the document; notation attributes (actually called "data content notations") are useful in mapping an object to a resource; and entity attributes can be used to refer to external objects. As you can see, attribute types cover a lot of territory. Hopefully, we will be able to explore them fully in a future column. IBM has a good DTD tutorial at www.ibm.com/software/developer/library/buildappl/writedtd.html.
Returning to Example 4, the <!ATTLIST> syntax allows you to define attributes and their values. You can specify a default value, in this case "USA", by supplying it as a quoted string. The #IMPLIED keyword instead marks the attribute as optional with no default. You could also use the #REQUIRED keyword to indicate that the author must always supply a value, or the #FIXED keyword to specify a value that the author cannot override.
Running on <EMPTY/>
Before I sign off this month, I want to announce that after much ballyhoo with operating systems and Web-hosting companies, my Web site BeyondHTML.com is finally running completely on XML. All documents on the site are marked up in XML. (You can verify this by clicking View/Source in your browser.) In addition, the XML Tools Database makes use of XML to format tables for different browsers. If you're running Internet Explorer 5, XML is streamed down to your browser and processed there. If you're running any other browser, the documents are processed on the server and transformed to HTML for presentation in your browser.
I'd like to offer you a chance to take part in BeyondHTML.com's growth. As time permits, I'll be adding new features to the site and writing about them here. However, one writer can do only so much. If you have a cool style sheet or want to share a technique, pass it along. If it passes muster, I'll feature it on the site and -- who knows -- maybe cover it here. Of course, all contributions will be credited. And so you know, BeyondHTML is a "commercial-free" zone that is unsupported by commercial interests. Even if you're simply interested in a "live" presentation of XML, stop on by.
(Get the source code for this article here.)
Michael is the author of Building Web Sites with XML from Prentice Hall. He provides XML training to large companies and publishes BeyondHTML.com. Michael also serves as Web Techniques' editor at large. He can be reached at mfloyd@lifestylesSantaCruz.com.