Parsing XML in IE4 with JScript
One bit of fallout from the Artificial Intelligence craze of the mid 1980s was accelerated research into natural-language processing. During that time period, I wrote several natural-language parsers that would take English syntax and convert it into commands that computers could understand. Most of the parsers were used in database applications and accepted English queries like "Give me all the dish on the 1995 Microsoft consent decree." The system would dutifully return all records related to the landmark case between Microsoft and the Department of Justice.
While it may seem like smoke and mirrors to some, such parsers are actually quite simple to write. They take a string (a sentence, in this case) and break it up into a list of tokens. The tokens are placed into a tree structure, which can be traversed by an application. This process is simplified in database applications by the fact that many tokens -- words like "and," "the," "me," and "dish" -- are unnecessary for the query and can be tossed out. Parsers are used for many types of applications such as compilers, interpreters, and other language processors, including browsers.
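The tokenizing step described above can be sketched in a few lines of plain JavaScript. This is only an illustration, not code from one of the original parsers: the function names and the contents of the stop-word list are assumptions (the list extends the "and," "the," "me," and "dish" examples with a few more filler words), and the tree-building step is left out.

```javascript
// Words that carry no meaning for a database query (a hypothetical stop list).
var STOP_WORDS = { "give": true, "me": true, "all": true, "the": true,
                   "on": true, "and": true, "dish": true };

// Break a sentence into lowercase word tokens, dropping punctuation.
function tokenize(sentence) {
  var words = sentence.toLowerCase().split(/[^a-z0-9]+/);
  var tokens = [];
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") tokens.push(words[i]);
  }
  return tokens;
}

// Keep only the tokens that matter to the query.
function significantTokens(sentence) {
  var tokens = tokenize(sentence);
  var kept = [];
  for (var i = 0; i < tokens.length; i++) {
    if (!STOP_WORDS.hasOwnProperty(tokens[i])) kept.push(tokens[i]);
  }
  return kept;
}

var query = "Give me all the dish on the 1995 Microsoft consent decree.";
console.log(significantTokens(query).join(" "));
// prints "1995 microsoft consent decree"
```

A real query parser would go on to arrange the surviving tokens into a tree; the sketch stops at the token list because that is where most of the simplification happens.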
Last month I mentioned that several XML parsers are beginning to appear around the Net. Most of these parsers are written in C++ or Java, and will likely be used by programmers to create the next generation of XML tools and applications. They often incorporate a command-line interface, and in most cases are poorly documented. The good news is that you don't have to be a hard-core C++ or Java developer to use XML on your Web site. That's because Microsoft Internet Explorer 4.0 includes an XML parser that's accessible through scripting -- if you have IE 4, you can use JScript to experiment with XML now. This month I'll look at the API for the MSXML parser, and in the process show how you can pass an XML document to the parser and get back the document's structure.
Document Objects
As you likely know, the Document Object Model (DOM) is a W3C specification that describes a platform-independent and language-neutral interface for accessing structured documents including those in HTML, CSS, and XML formats. The core DOM specification describes a set of object definitions that let you represent the objects within a document. The core specification has also been extended for XML so that document type definitions, entities, and CDATA sections can also be represented. The objects and interfaces, sometimes called APIs, are referred to as object models.
Microsoft defines its object model in terms of a "document" object, an "element" object, and a collection class. Each of these classes has a set of properties and methods you can use to manipulate XML documents. That is, when you take an XML document and run it through Internet Explorer's parser, you'll get back a tree structure that you can then traverse. In this first example, I'll show you how to access the document object and retrieve information about the document. First, I'll create an HTML page that contains a form, our document object, and the parser script. The idea is to put up an edit control that can be used to enter the name of a valid XML file. The script takes that file and passes it to Internet Explorer, which will parse the file and return the results in our document object. I'll then query the document object's properties and report on this document.
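To make the three pieces of the object model concrete, here is a rough stand-in built from plain JavaScript objects. It does not call the MSXML parser: the helper names makeCollection() and makeElement(), the version value, and the sample tag names are assumptions chosen for illustration, and the real objects expose more than is shown here.

```javascript
// Stand-in for the collection class: an item() method plus a length property.
function makeCollection(items) {
  return {
    length: items.length,
    item: function (i) { return items[i]; }
  };
}

// Stand-in for an element object (hypothetical helper, not part of the parser).
function makeElement(tagName, childElements) {
  return { tagName: tagName, children: makeCollection(childElements || []) };
}

// Stand-in for the document object a parser would hand back.
var xmlDocument = {
  version: "1.0", // assumed sample value
  root: makeElement("Article", [
    makeElement("ArticleText", [ makeElement("SubHead") ])
  ])
};

// Report on the document the way a script might after parsing.
function describeDocument(doc) {
  return "XML version " + doc.version + ", root <" + doc.root.tagName +
         "> with " + doc.root.children.length + " child element(s)";
}

console.log(describeDocument(xmlDocument));
// prints "XML version 1.0, root <Article> with 1 child element(s)"
```

The point of the sketch is the shape: a document holds a root element, and every element reaches its children through a collection rather than a bare array.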
Listing One starts off by creating an object using the new HTML 4 Object element. This new element is a generalized mechanism for inserting things like multimedia objects, plug-ins, Java applets, and COM objects into an HTML document. The syntax for the Object tag is shown in Example 1. The key attributes for our purposes are ClassID, ID, and Name. In general, ClassID contains a URL that identifies the implementation of an object. In Internet Explorer, ClassID acts as an identifier for the object type. The long string of characters assigned to the ClassID attribute identifies the object implementation to Internet Explorer. The clsid: portion of the string tells Internet Explorer that the rest of the string refers to an ActiveX control. The ID attribute is a unique identifier that will be used to reference this object from within our script.
According to the HTML 4 specification, the browser must pass the contents of the Name attribute along with any data from the object if there's no accompanying DECLARE attribute. So, I use the Name attribute to submit, or pass, the object in a form. The next step is to create the form, which will be used to input the name of the XML file to be parsed. Listing One uses a simple Input element to place an edit field on the page. Because no Type attribute is specified, a text field is assumed. A second Input element is used to create the "Parse", or submit, button. The Name attribute acts as an identifier and allows us to reference the button from the script. When the onClick event occurs, the value in the Filename edit field is retrieved and passed to the Parse() function.
Parsing the Document
Microsoft's Internet Client SDK (see Online) describes ten API calls that let you set and retrieve various properties for an XML document. These include the ability to retrieve the version of XML the document supports, the character set supported, contents of the !DOCTYPE element, and the document's root element. There are also a number of properties that are documented, but were not implemented at the time of this writing. I stumbled into one other property, READYSTATE, which appears to be undocumented for the parser (although it is covered in the DHTML documentation). The complete list of properties and methods is detailed in Table 1.
My Parse() method in Listing One uses most of the methods in Table 1 to report on the document object. (Examples for the unsupported method calls are included for completeness, but are commented out.) Parse() begins by opening a browser window and writing out the preliminary HTML tags needed to display the results. Next, the root property, which stores the document's root element, is retrieved and assigned to the DocumentRoot variable. This is the starting element we'll use to traverse the tree structure of elements; see Listing Two. The remaining code queries the other properties and reports the results in the browser window.
When you load Listing One into Internet Explorer, the Edit control and Parse button appear in the window. To test the code, I've used the XML file we created in last month's column (see "Beyond HTML," July 1998), which is an excerpt from Ken North's March 1998 "Database Developer" column. When you enter a filename, a second browser window pops up and displays the XML version as defined in the <?XMLVersion?> prolog. The results of running north.xml through the parser are shown in Example 2.
Traversing the Tree
Listing Two presents a second method, displayTree(), which reports on elements within the document tree. The displayTree() method outputs the element details in a visual manner that mimics the structure of the document. If you were to run this code you would see that ArticleText is a child node of Article and a parent node of SubHead.
The displayTree() method is designed to be called from Parse() and takes a document object and an integer value (N) as its parameters. You get the object by querying the document's root property, as described previously. The purpose of N is to keep track of our level in the hierarchy and to indent the child nodes appropriately. displayTree() begins by ensuring that there is, indeed, a document object. If not, the routine issues an error and bails out. Assuming we have a valid object, displayTree() creates the indent string used to indent child nodes from their parent.
You can get to the child nodes through a collection class called children. This class provides an item() method to retrieve elements from the collection and a length property, which lets you determine the number of items in the collection. Collections are a very powerful feature of Microsoft's Dynamic HTML, and a discussion is beyond the scope of this month's column. For our purposes, displayTree() uses the length property simply to set the value for N when indenting the output and to display the number of child elements in the output for element detail.
The next series of statements prints out the element detail in the browser. The first step is to check the current object type using the type property. This property contains an integer value that represents one of five types: element (return value is equal to 0), text (value = 1), comment (2), document (3), or DTD (4). For completeness, I've mapped these values to their string equivalents; see the GetTypeStr() method in Listing Two. In our case, we're just interested in elements. So, displayTree() checks for an element type equal to zero. If found, it writes out the element type, its tag name, and any attributes contained within the element. Also note that the indent string is incorporated into the detail output.
The final step is to check to see if there are any child nodes. If there are, we must perform the entire process again. I do this by iterating through each element in the collection and calling displayTree() recursively. This has the effect of performing a "depth-first search," where each branch of the tree is fully explored before moving on to the next branch. Note here that N is incremented only if its value is different from the current branch level within the tree.
Since this is a depth-first search and we are traversing the tree from top to bottom, I need to increment N only when a child branch is encountered. And because of the nature of recursion, I don't have to worry about restoring (or decrementing) the value of N after searching the branch. The reason is that as the recursion "unwinds" to the previous level, the value of N is restored automatically.
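The traversal just described can be approximated in plain JavaScript against a mock tree. This is a sketch under stated assumptions, not Listing Two: the node() helper and the sample tree are hypothetical, attributes are omitted, and the output is collected into an array instead of being written to a browser window so that the logic can be run anywhere.

```javascript
// Build a mock node whose children collection offers item() and length
// (a hypothetical stand-in, not the real parser objects).
function node(type, tagName, kids) {
  kids = kids || [];
  return {
    type: type,
    tagName: tagName,
    children: { length: kids.length, item: function (i) { return kids[i]; } }
  };
}

// Map the integer type codes to readable names.
function GetTypeStr(type) {
  var names = ["element", "text", "comment", "document", "DTD"];
  return names[type] || "unknown";
}

// Depth-first walk: collect one indented line per element node.
function displayTree(elem, n, lines) {
  if (!elem) {
    lines.push("Error: no element to display");
    return lines;
  }
  var indent = "";
  for (var i = 0; i < n; i++) indent += "  ";
  if (elem.type === 0) {
    lines.push(indent + GetTypeStr(elem.type) + " " + elem.tagName +
               " (" + elem.children.length + " children)");
  }
  // n + 1 is passed by value, so n is unchanged when the call returns.
  for (var j = 0; j < elem.children.length; j++) {
    displayTree(elem.children.item(j), n + 1, lines);
  }
  return lines;
}

var root = node(0, "Article", [
  node(0, "ArticleText", [ node(0, "SubHead"), node(1, "") ])
]);
console.log(displayTree(root, 0, []).join("\n"));
```

Because each recursive call receives its own copy of the level, nothing has to be decremented on the way back up, which is the "unwinding" behavior noted above. The text node in the sample tree is visited but produces no output, since only type 0 is reported.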
I've separated the methods in Listing Two from Listing One for purposes of clarity. In practice, you'll want to drop these methods into Listing One, then add a call
displayTree(xmlDocument.root, 0)
to the Parse() function (just before the writeln statements that add the closing </BODY> and </HTML> tags). Finally, the complete source code and the XML file used for this example are available electronically; see "Source-Code Availability" on page 5.
Tool of the Month
Many of the tools available for editing, publishing, and viewing XML files are actually SGML tools. Such is the case with SoftQuad's Panorama Viewer, a browser plug-in that lets you view SGML documents. The company has announced a new version supporting XML, which should be available by the time this reaches print. In the meantime, since XML is a subset of SGML, you can also view XML files as long as they're both "well formed" (they conform to XML syntax rules) and valid (they include a DTD and conform to it). The Panorama Viewer is currently available for the Windows 95/NT/3.x, Macintosh, and UNIX platforms.
The Panorama Viewer includes a number of features: You can use annotations to add notes and comments to SGML documents. Panorama Viewer supports multiple style sheets to control the display of documents. Panorama also provides navigators -- multipane windows that let you navigate through documents. The Panorama Viewer carries a retail price of $49, but a 60-day evaluation copy is available for download from the SoftQuad Web site; see "Online." A complementary product, Panorama Publisher, is also available for $195.
Putting It to Work
Now that you can parse and view XML files, you may be wondering how XML can be used on your site. You can use XML to make your site more accessible to the disabled, create channels using Microsoft's Channel Definition Format, support incremental downloads using Marimba's Open Software Description (OSD) format, or let your visitors view molecular structures with the Chemical Markup Language.
One intriguing idea is to create an XML-aware search engine that lets you search for text more efficiently and at a finer granularity. For example, I could tag this column in every place where a word or phrase is defined. Then I could search on <definition> = "object" and get the definition of "object model" without getting every occurrence of object that appears in the column. The ability to search on tags, and text within tags, is very powerful. If you come up with other ideas for using XML, drop me a line at the address below.
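As a rough illustration of searching on tags, the sketch below walks a small hand-built tree and returns only the text found inside a given tag. Everything in it is assumed for the example: the searchByTag() name, the node shape (tag, text, children), and the sample sentences, none of which come from an actual search engine or from this column's markup.

```javascript
// A tiny hand-built tree standing in for a tagged column (illustrative data only).
var column = {
  tag: "column", text: "", children: [
    { tag: "para", text: "Each object has properties and methods.", children: [] },
    { tag: "definition", text: "An object model is a set of objects and interfaces.", children: [] },
    { tag: "definition", text: "A parser breaks a string into tokens.", children: [] }
  ]
};

// Collect the text of every element with the given tag whose text contains the phrase.
function searchByTag(node, tag, phrase, hits) {
  hits = hits || [];
  if (node.tag === tag &&
      node.text.toLowerCase().indexOf(phrase.toLowerCase()) !== -1) {
    hits.push(node.text);
  }
  for (var i = 0; i < node.children.length; i++) {
    searchByTag(node.children[i], tag, phrase, hits);
  }
  return hits;
}

console.log(searchByTag(column, "definition", "object"));
```

The ordinary paragraph that happens to mention "object" is skipped because it is not tagged as a definition, which is the finer granularity the idea is after.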
(Get the source code for this article here.)
Michael Floyd is a consultant, freelance writer, and Web Techniques' editor at large. He can be reached via email at mfloyd@web-techniques.com.