Separating Body from Soul

XML Makes Changing Easy

By Michael Floyd

One of my colleagues once said that gathering information from the Web is like trying to drink from a fire hose. Indeed, even search engines that scour countless Web sites are able to categorize only a fraction of the available pages. For those of us who use information to make business decisions, that's a frightening prospect. On the other hand, those of us who own lots of information have a lot at stake. We're constantly looking for better ways to create, manage, retrieve, exchange, and reuse that information. The trick is to filter it and present it in useful chunks that we can absorb, and that's where XML comes in. XML is, in part, a data representation language that lets us describe data and create vocabularies to exchange information. XML separates data from presentation, making it reusable. As for creation, management, and retrieval, that's where content delivery systems come in. (See "Online" for sites to help you employ XML in your content delivery system.)

If you've been looking at the prospect of building a system that takes advantage of XML, or researching ways to incorporate XML into your existing system, then you'll want to read on. Sorting through all the components and high-end tools on the market can be confusing.

Considerations

There are many types of content publishing applications, and your solution depends on the problem you're trying to solve. Designing an Enterprise Information Portal (EIP) is very different from creating a content delivery system for a site like Motley Fool. In general terms, you must also be mindful of network infrastructure, platforms, servers, and the software tools currently in use. This is where XML shines. No single type of Web site is best suited to the use of XML. It's platform independent: you can use XML with UNIX, Linux, or NT. It works equally well with the IIS and Apache Web servers, and XML documents can coexist with all other data formats, including HTML and PDF files. You can access XML documents via the Document Object Model (DOM) using almost any programming language. In short, XML is easily adaptable to virtually any existing system.

So, what pieces make up an XML delivery system? At a minimum, you'll want to author documents, validate them, publish them on the server, and allow them to be retrieved. In helping you build such an XML delivery system, we'll make a few assumptions. The first is that you want to use the existing infrastructure. In most cases, this means the Internet is the carrier and the Web server is the delivery system. The second assumption is that some information will be stored in a database, while other information will be stored in XML documents. Lastly, let's assume that you want to deliver that information to any Web browser.

Virtually all XML servers run in an environment that includes an XML parser, an XSL processor, a document repository, a collection of document schemas, and a collection of XSL style sheets. The schemas could be document type definitions (DTDs), XML Schemas, or a combination of both. They are used to validate new documents as they're input into the publishing system. The style sheets use XSL Transformations (XSLT) to transform classes of XML documents into HTML.

The Process

In this arrangement, there are two types of interaction. The first is document authoring and submission. A robust system will likely provide an XML authoring tool that lets authors mark up documents. In lieu of a proprietary editor, you could use a tool like SoftQuad's XMetaL to edit documents. (Note that XMetaL has some drawbacks: it requires that DTDs be specified, which isn't an XML requirement; DTDs are compiled into a proprietary format; and XML Schema is not supported.)

When the author submits his or her content to the server, a gateway program takes the XML document and uses DOM methods to load it into memory. If the document contains a <!DOCTYPE> declaration referencing a DTD in the system, it's validated against the DTD. If the document passes validation, it's stored in the system. If the document isn't valid, the author receives a message indicating the problem.
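
As a rough illustration, the submission step might be sketched in Python as follows. The function name, the KNOWN_DTDS set, and the messages are assumptions made for this example, not part of any system described here. Python's standard library has no validating parser, so the sketch only checks that the document is well formed and that its <!DOCTYPE> names a DTD the system knows; a real gateway would hand the document to a validating parser at the marked point.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical set of DTDs installed in the system (an assumption for this sketch).
KNOWN_DTDS = {"article.dtd"}


def check_submission(xml_text):
    """Return (accepted, message) for a submitted document.

    Checks well-formedness and whether the DOCTYPE names a known DTD.
    It does NOT validate the content against the DTD.
    """
    try:
        doc = minidom.parseString(xml_text)  # load the document into a DOM
    except ExpatError as err:
        return False, "Not well formed: %s" % err

    doctype = doc.doctype
    if doctype is None or doctype.systemId is None:
        return True, "No DTD referenced; stored without validation"
    if doctype.systemId not in KNOWN_DTDS:
        return False, "Unknown DTD: %s" % doctype.systemId
    # A validating parser would check the document against the DTD here.
    return True, "DTD recognized: %s" % doctype.systemId
```

The (accepted, message) pair mirrors the two outcomes described above: the document is either stored or returned to the author with an explanation.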

The other interaction is document retrieval. In this case, the client's browser requests a document. When the gateway program receives the request, it dynamically assembles the document for display. The gateway program does this by first selecting an XML document from the document storage area (the "store"). In practice, the store could be file based, or a document repository based on a database model. Once the gateway program selects the document, the script uses DOM methods to load the document and attach an XSL style sheet to it.
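
The retrieval step can be sketched the same way. In this minimal Python example, the in-memory STORE dictionary, the document name, and the style sheet file name are all hypothetical stand-ins. The sketch loads the document into a DOM and attaches a style sheet by inserting an xml-stylesheet processing instruction ahead of the root element, which is one common way to associate a style sheet with a document; a server could equally apply the transformation itself.

```python
from xml.dom import minidom

# Hypothetical in-memory "store"; a real one could be files or a database.
STORE = {
    "welcome": "<article><title>Welcome</title></article>",
}


def retrieve(name, stylesheet_href):
    """Load a stored document into a DOM and attach a style sheet reference."""
    doc = minidom.parseString(STORE[name])
    pi = doc.createProcessingInstruction(
        "xml-stylesheet", 'type="text/xsl" href="%s"' % stylesheet_href
    )
    # The processing instruction must appear before the root element.
    doc.insertBefore(pi, doc.documentElement)
    return doc.toxml()
```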

CGI, Java, or ASP?

As it turns out, you can choose from three basic methods to achieve the standard server arrangement: CGI, Java servlets, and Active Server Pages (ASP). If you prefer to work with Perl and CGI, you can use Perl's XML::Parser module to parse XML documents; modules such as XML::DOM build on it to provide a DOM interface. XML::Parser was developed by Larry Wall, Perl's inventor, and later modified by Clark Cooper. It provides an interface to James Clark's Expat parser, which is highly regarded in the XML community.

Another approach to serving XML dynamically is to use Java servlets. As with CGI, you can use servlets to generate XML documents on the fly. Servlets also offer built-in support for network sockets, database connectivity, and string manipulation. And they're easily portable to any Java-enabled Web server. Assuming your Web server supports servlets, you'll also need to install a Java XML parser and an XSL processor. I've had success with this approach using IBM's XML Parser for Java and the LotusXSL processor (see "Building an XML Workbench," Web Techniques, May 1999).

ASP and the Rocket XML Framework

Another option is to use ASP to serve XML. Currently, this is the approach I'm taking with the Rocket XML framework. However, whether you use CGI, Java, or ASP, the processes are the same. (In fact, a Java version of Rocket is in the works.) Briefly, Rocket uses ASP scripts to perform browser detection and to set session and application-level variables. When a browser requests a page, Rocket uses that information to select an XSL style sheet, which it then applies to the XML document. The style sheet transforms the XML document into HTML, adds navigation and user interface components, and sends the resulting page back to the browser. Because Rocket knows the browser's type and version, the style sheet is tuned to generate optimal HTML for the browser making the request. Rocket also has an option that lets you send XML documents directly to clients that are XML aware. This last feature lets the Webmaster offload some of the processing onto clients.
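
To make the selection step concrete, here is a small Python sketch of browser-based style sheet selection. It is not Rocket's code, which is written in ASP; the rule table, the style sheet file names, and the function name are assumptions chosen for illustration. The rules are checked in order, because many browsers of the period identify themselves with a "Mozilla" prefix.

```python
# Hypothetical rules: (substring to look for in the User-Agent, style sheet).
# Order matters: Internet Explorer also announces itself as "Mozilla/4.0".
STYLESHEET_RULES = [
    ("msie", "ie.xsl"),
    ("mozilla/4", "navigator4.xsl"),
]
DEFAULT_STYLESHEET = "generic.xsl"


def select_stylesheet(user_agent):
    """Pick a style sheet name based on the browser's User-Agent string."""
    ua = user_agent.lower()
    for marker, stylesheet in STYLESHEET_RULES:
        if marker in ua:
            return stylesheet
    return DEFAULT_STYLESHEET
```

Keeping the rules in a table means that supporting a new browser is a matter of adding one entry and one style sheet, not touching the pages themselves.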

From an architecture perspective, this is exciting stuff. First, I've separated my data from the processing logic and from HTML presentation. That means I can now reuse that same data in other ways. For example, I can stream select material to other sites for syndication. I can target my content for WebTV or other Internet devices. I can even take the same material off the Web and target it for CD-ROM distribution. The bottom line is that I can generate additional revenue from the same data with minimal cost and effort.

To develop content, all I have to do is enter text for my document, add some minor markup, and I'm done. I've written a Web interface for Rocket that takes an author's document, compares it to the DTD referenced in the document type declaration (<!DOCTYPE>), and validates the data. This ensures that the style sheet renders the document properly. Not only does this save the author a huge amount of time, but it also means a document author no longer has to be a content specialist. Bottom line: cost savings.

From a designer's perspective, you actually have more freedom to create. Because style sheets can be tuned to a specific browser, you can take advantage of any browser feature you want. Sure, you'll have to create a style sheet for each browser, but once a style sheet is written you don't have the redundant task of rewriting that code for every new page you create.

Designers can also do new and cool things. For instance, you can use style-sheet selection to offer alternative look-and-feel properties. Rocket includes a style-sheet selection mechanism that lets users select from one of four predefined themes. When the user selects "Change Themes" from the site's navigation bar, she or he is presented with a list of themes for the site. Clicking on one of the other themes automatically changes the appearance for the duration of that user's session. As a designer, you can add your own themes simply by creating new style sheets with your favorite look.
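
Theme switching of this kind amounts to storing a choice in the user's session and consulting it when the style sheet is picked. The Python sketch below assumes a plain dictionary as the session object and invents four theme names; neither reflects Rocket's actual themes or session handling.

```python
# Hypothetical theme names and style sheet files, assumed for this sketch.
THEMES = {
    "classic": "classic.xsl",
    "midnight": "midnight.xsl",
    "print": "print.xsl",
    "large-type": "large-type.xsl",
}
DEFAULT_THEME = "classic"


def change_theme(session, theme):
    """Record the user's choice in a per-session dictionary."""
    if theme not in THEMES:
        raise ValueError("Unknown theme: %s" % theme)
    session["theme"] = theme


def stylesheet_for(session):
    """Return the style sheet for this session, falling back to the default."""
    return THEMES[session.get("theme", DEFAULT_THEME)]
```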

With XML, Webmasters will have a much easier time with site maintenance. First, you'll find that XML and HTML coexist just fine. That means you can begin adding XML pages to your site without affecting your existing configuration. In the case of Rocket, I created a configuration variable that sets the installation directory for my XML pages. By default, Rocket is installed in wwwroot/rocket. However, you can run XML from the root directory, or any subdirectories under it.

Secondly, XML documents, DTDs, and style sheets can be maintained in their own directory trees. This lets you set permissions that prevent, say, an author from altering a style sheet, or a designer from changing the DTD.

As a site producer, I've also found the DOM helpful in maintaining the navigation system. In Rocket, navigation entries are maintained in a separate XML document, nav.xml. When the client makes a request, Rocket opens and parses nav.xml and appends it in memory to the requested document. When the style sheet is applied, it simply views the navigation information as part of the document and renders it. In fact, the same information is rendered twice: once in the main navigation bar, and again in text mode at the bottom of the page. The benefit is that the Webmaster can change the structure of the site in minutes simply by editing nav.xml, and the changes propagate through the site without any further intervention.
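
The in-memory merge of nav.xml into the requested page can be sketched with standard DOM calls. In this Python example the contents of the navigation document and the element names (nav, entry) are invented for illustration; only the file's role comes from the description above.

```python
from xml.dom import minidom

# Hypothetical contents of nav.xml; the element names are assumptions.
NAV_XML = """<nav>
  <entry href="/index.xml">Home</entry>
  <entry href="/articles.xml">Articles</entry>
</nav>"""


def append_navigation(page_xml, nav_xml=NAV_XML):
    """Parse both documents and append the navigation tree to the page."""
    page = minidom.parseString(page_xml)
    nav = minidom.parseString(nav_xml)
    # importNode makes a deep copy owned by the page document.
    nav_copy = page.importNode(nav.documentElement, True)
    page.documentElement.appendChild(nav_copy)
    return page
```

Because the navigation arrives as ordinary child elements, a style sheet can render it as many times as it likes without any special handling.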

Packaged Solutions

When it comes to content publishing, we're also seeing large-scale systems that address broad categories of content delivery. DataChannel's XML framework, for example, addresses the EIP. Enterprise publishing is one of the more interesting categories because the problem varies widely depending on the organization and its information needs. Inside the company, users of enterprise information include the business decision-makers: the executive management, marketing, and advertising teams. Also within the corporate structure are administrative support, human resources, and finance. And then there are the business-to-consumer departments like customer service and technical support.

DataChannel addresses the EIP, in part, with DataChannel Server 4.0 (formerly known as Rio), a collection of tools that includes its XJParser, and the DataChannel XSL Processor that helps transform structured data (such as that from a database) into an XML representation. DataChannel Server 4.0 also includes tools for authoring, delivering, and managing that information.

However, DataChannel has recognized that tools alone are not enough. Much of the data within the enterprise is linked directly to the applications that deliver it. To reuse enterprise data, systems have to be rearchitected from the ground up. So DataChannel integrates a set of services with its product offerings that help the company define the problem and create a blueprint for a solution. DataChannel also offers XML training for employees. I've attended its public training seminars and they're first rate. Finally, the company provides several options for post-sale technical support and professional services that place experts on site to help orchestrate your company's efforts.

Many vendors of high-end content management systems have enabled their tools with XML. Vignette, for example, was one of the first companies to integrate support for XML in its StoryServer. In addition, StoryServer supports the XML-based Information and Content Exchange (ICE) protocol for electronic asset exchange. Similarly, Gauss Interprise's VIP'ContentManager is a complete Web content management system for large sites. Gauss has an add-on product called VIP'XML-Gateway that uses XML to output content stored in its manager product in any format, including HTML.

When content publishing became mainstream, object-oriented databases found new life in the Web content management business. So it's no surprise that companies like Poet have adapted their product lines and strategies to the XML market. Poet produces Content Management Suite (CMS), which is but one weapon in its application development arsenal.

According to Poet, XML content is "hierarchical and irregular." By this, I presume that they mean that structure doesn't always occur in a predictable manner. Consider that database data is highly structured information. That is, there are well-defined fields with names and associated data types. The kind of information that typically gets posted on the corporate portal may start as fairly structured information: email, files, memos, documents, spreadsheets, and so on. While we may know the overall structure, we don't necessarily know the ordering of individual elements or how often they may occur. That makes handling these structures difficult for relational databases. So, Poet CMS stores XML as objects and uses XML internally to dynamically assemble content on the server.
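
A small example may make "irregular" more tangible. The Python sketch below uses an invented memo vocabulary (not anything from Poet) in which the same element type appears with its children in a different order and in different numbers, which is awkward to map onto fixed table columns.

```python
import xml.etree.ElementTree as ET

# Hypothetical memo vocabulary: child order and repetition vary per memo.
MEMOS_XML = """<memos>
  <memo><to>Ann</to><to>Raj</to><subject>Budget</subject><body>See attached.</body></memo>
  <memo><subject>Lunch</subject><to>Lee</to></memo>
</memos>"""


def child_sequences(xml_text):
    """Return the sequence of child element names for each memo."""
    root = ET.fromstring(xml_text)
    return [[child.tag for child in memo] for memo in root]
```

Here child_sequences(MEMOS_XML) yields one list per memo, and the two lists differ in both length and order even though both memos follow the same overall structure.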

Although space does not permit a discussion of all the publishing options on the market, you'll also want to look at other tools such as Chrystal Software's Astoria. And though they're targeted at markets other than content management and publishing, you may also want to look at some of the tools used in e-commerce and business-to-business applications. These include Bluestone's XML-Server, eXcelon 2.0 from eXcelon (formerly Object Design), and even the ExeterXML Server from XMLSolutions.

The XML Void

XML was originally designed as an HTML replacement that would tame the tag wars. For content publishers, XML offers a structured approach for managing applets, scripts, rich media, and other content types while separating out the data. This in turn lets Web publishers retarget their content for new media and new devices. Now that you can keep your data independent from your devices and platforms, my question to the community is, where are all the XML-based Web sites?

One factor that may be affecting XML's use in real-world Web applications at the moment is the perception that writing XML is a lot more work than writing HTML. To be sure, XML requires that you plan your strategies before jumping into content creation. For example, once you realize your data can be retargeted, you must create element types that make sense in these retargeted applications: A <table> element may have little meaning in a voice application. The key is to describe your data sufficiently and let other mechanisms (either via the DOM or XSL) mold that data into any form you want.
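
As a minimal sketch of that idea, the Python example below describes some data with element and attribute names that say what the data is, then molds it into two forms: an HTML table and a sentence suitable for a voice application. The quotes vocabulary, the sample values, and the function names are all invented for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical data: the markup describes the data, not its presentation.
QUOTES_XML = """<quotes>
  <quote symbol="ABC" price="12.50"/>
  <quote symbol="XYZ" price="7.25"/>
</quotes>"""


def to_html(root):
    """Render the quotes as an HTML table for a browser."""
    rows = "".join(
        "<tr><td>%s</td><td>%s</td></tr>"
        % (escape(q.get("symbol")), escape(q.get("price")))
        for q in root.iter("quote")
    )
    return "<table>%s</table>" % rows


def to_speech(root):
    """Render the same quotes as sentences for a voice application."""
    return " ".join(
        "%s is trading at %s." % (q.get("symbol"), q.get("price"))
        for q in root.iter("quote")
    )
```

Neither output format appears in the source document itself, which is what allows the same data to serve both targets.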

There are also other technologies to learn. Once you've decided which element types are appropriate, you'll likely have to formalize those decisions in a DTD. Of course, DTDs use an entirely different syntax. Or you may decide to employ an XML Schema that uses familiar XML syntax but contains features not included in DTD syntax.

Also, rendering documents using XSL Transformations (XSLT) requires programming-like skills, albeit in a declarative language. No matter how you slice it, displaying an element by traversing a tree in a style sheet just isn't as easy as writing the element in HTML. And if you plan to process the data anyway, you'll likely have to learn the DOM API.

Despite the high price of admission, XML offers a great deal in return. XML lets you separate, manage, display, and repurpose content at will. And with new tools on the market, creating XML content is getting easier. As such, I hope to see more XML sites in the near future.


Michael is the author of Building Web Sites with XML from Prentice Hall, and teaches XML BootCamp. He is also the architect of the Rocket XML framework, and carries the honorary title of editor at large at Web Techniques. Please contact him if you have an interesting site that uses XML. He can be reached at mfloyd@lifestylesSantaCruz.com.




Copyright © 2003 CMP Media LLC