XML Metalanguage Opens Up New Worlds

by Nigel Hutchison, Software AG

The XML markup language is a new standard for describing the content of documents and can be adapted to suit different applications. Its possible areas of application range from Web pages through electronic commerce to complex database solutions.

The unbroken expansion of the Web has given rise to an information jungle in which it is increasingly difficult to get your bearings. Overloaded search engines often no longer provide the required results; and advertising messages, outdated pages and junk information are a frequent hindrance when using the Web as an information medium. On the one hand, this is down to the ever greater numbers of documents stored on the Web (even according to conservative estimates there are hundreds of millions of pages); on the other, it also has to do with the way the information is stored on the Web. Essentially, it is contained in text documents with HTML (Hypertext Markup Language) formatting. The Web designer employs HTML tags to define how the document will appear in the browser, include links to other Web pages, incorporate applets, and so on. This bears little resemblance to ‘real’ programming work; HTML is a page description language, comparable to the PostScript printer language.

Limitations of HTML

HTML owes its fame to the Web, and today there are many Web tools that relieve the designer of the burden of working with HTML code, though the limitations of HTML do, of course, remain. Essentially, these are:

  • HTML is not extensible; you cannot define your own tags for specific requirements.
  • It is not possible to represent the specifications of data structures, as required in databases or object hierarchies, for example.
  • HTML does not support data validation.

HTML thus does not address content at all; you cannot use it, for example, to classify Web pages by what they contain. HTML only controls the presentation of the information. Anything beyond that requires considerable programming effort in the form of either applets or applications. A further danger is that vendors respond to these limitations with proprietary enhancements, which would restrict the universal availability of information formatted in HTML.

SGML – flexible but complex

Most users of HTML today are unaware of the fact that HTML is based on SGML (Standard Generalized Markup Language), a metalanguage for defining rules governing the handling of documents of different types. HTML is an application of SGML that is concerned only with the presentational aspects mentioned above, thus providing as broad a basis as possible for the interchange of documents. And SGML promises to provide everything that cannot be achieved with HTML. Every application of SGML – including HTML, for example – contains one or more document type definitions (DTD). These are formal descriptions of the syntax of the documents assigned to a document type. A DTD is required in order to interpret and check an SGML document.

SGML has some decisive advantages. Firstly, it is a non-proprietary standard supported by a large number of software vendors. Because of this, a document database complying with the SGML standard will have a much longer life than one based on proprietary standards. Secondly, the documents can both be read by people and parsed (have their code dissected and analyzed) by programs. Thirdly, SGML documents describe the structure of the data, not just how it is to be presented.

But SGML also has its weaknesses. It is very general and complex, with a specification extending to some 500 pages, most of which is of no practical relevance to the Web. Since there are so many options available, the actual interoperability between companies is minimal. Consequently, SGML is used, above all, for large-scale technical documentation projects, particularly in the military and intelligence sectors, aircraft manufacturing, publishing and archiving.

HTML, a derivative of SGML, has of course achieved great fame, but in contrast to SGML it makes no attempt at all to describe the semantic content or structure of data. As already mentioned, given the immense growth of the Web this is not without its problems. Because HTML departed significantly from the spirit and purpose behind SGML, many in the SGML community gave the standard a hostile reception when it was introduced in 1994. HTML nevertheless became very popular, and it was the very fact that it had such limitations that made it so useful to the expanding Web. Consequently, HTML is now the most widespread application of SGML.

XML solves the dilemma

Thus, for today’s requirements SGML is too complex and HTML too simple. The solution to this dilemma actually already lies within the logic of SGML: This is, after all, a metalanguage with which you can create a system of ‘grammar’ to suit your own requirements. So what could be better than to define a new standard on the basis of SGML, combining the simplicity and universality of HTML with the flexibility of SGML? The result is XML (eXtensible Markup Language). If HTML can be thought of as SGML's superficial child, then XML is its pragmatic child. First and foremost, XML is much less complex than SGML: Its specification, created by the World Wide Web Consortium (W3C), runs to a still manageable 26 pages.

XML’s simplicity considerably facilitates its implementation. For example, XML parsers, which are applications that analyze XML syntax, do not need DTDs in order to dissect a document into its component parts. In addition, XML does not permit any deviations from the standard syntax. This ensures that all XML-coded documents can be edited, saved and delivered regardless of the recipient’s XML parser.
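The point about DTD-free parsing and strict syntax can be illustrated with a minimal Python sketch. It uses the standard library's xml.etree.ElementTree parser, which checks well-formedness only and does not validate against a DTD. The element names (QUOTE, SYMBOL, PRICE) and the helper is_well_formed are invented for this illustration and do not come from any real XML application.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if text parses as XML. No DTD is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical markup: no DTD is declared or needed to take it apart.
good = "<QUOTE><SYMBOL>XYZ</SYMBOL><PRICE>12.50</PRICE></QUOTE>"

# The same content with the PRICE end tag missing: a syntax deviation
# that an XML parser must reject rather than silently repair.
bad = "<QUOTE><SYMBOL>XYZ</SYMBOL><PRICE>12.50</QUOTE>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

The first document is accepted and the second rejected, which is the behavior that lets every conforming parser treat the same document in the same way.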

XML documents can provide style sheets that enable browsers to convert documents in order to display them in HTML. XML applications can be created using standard parser interfaces inside browsers. XML is structurally backward compatible, which means that XML browsers can also parse, or interpret, well-formed HTML documents. It is even possible to write XML-compliant HTML documents. This backward compatibility means that the transition from HTML can be implemented in stages, which will doubtless contribute to XML's acceptance.
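What such a conversion for display amounts to can be sketched in a few lines of Python. This is not a style sheet language; it is a hand-written stand-in that maps one content-oriented element to presentation-oriented HTML. The SPEECH, SPEAKER and LINE element names follow the Shakespeare markup in Box 2; the function speech_to_html is an assumption made for this sketch.

```python
import xml.etree.ElementTree as ET
from html import escape


def speech_to_html(speech: ET.Element) -> str:
    """Render one SPEECH element as an HTML paragraph."""
    speaker = speech.findtext("SPEAKER", default="")
    # Join all LINE children of the speech into one run of text.
    text = " ".join(line.text or "" for line in speech.findall("LINE"))
    return f"<P><B>{escape(speaker, quote=False)}:</B> {escape(text, quote=False)}</P>"


source = "<SPEECH><SPEAKER>BERNARDO</SPEAKER><LINE>Long live the king!</LINE></SPEECH>"
print(speech_to_html(ET.fromstring(source)))
# prints: <P><B>BERNARDO:</B> Long live the king!</P>
```

The conversion only works in this direction: the HTML output no longer records which part was the speaker and which the line.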

 

XML – the simple metalanguage

One of the central strengths of XML, however, is that, like SGML, it is itself a metalanguage, not an independent markup language (seen in this light, the name XML is in fact extremely inappropriate). It is a set of rules that allows you to define specific ‘grammars’, which then only apply within certain application scenarios. Consequently, XML can be adapted very flexibly to suit different purposes. For example, you can use XML to do such things as define prices, authors’ names, times and dates, keywords or share prices. To describe content like this, syntax must be defined for the relevant application areas (for realtors, stock market services or publishers, say) in the form of a document type definition (DTD).

This makes it possible to specify precise queries on the basis of the defined content-based criteria. The results of an analysis of documents or Web pages corresponding to an XML standard can then be edited directly in application programs. An application might, for example, independently read product or share prices from Web pages and then use them. But the Web is about more than just Wall Street; it is also possible, for example, to conceive of applications that could use the contents of XML-based CAD files.
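As a rough illustration of an application reading prices from a page, the following Python sketch parses a small document and filters it by a content-based criterion. The QUOTES, QUOTE, SYMBOL and PRICE grammar, the ticker symbols and the function names are all hypothetical; a real application area would first have to agree on its own DTD.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical page content; the grammar is invented for this sketch.
PAGE = """
<QUOTES>
  <QUOTE><SYMBOL>AAA</SYMBOL><PRICE currency="USD">101.25</PRICE></QUOTE>
  <QUOTE><SYMBOL>BBB</SYMBOL><PRICE currency="USD">47.80</PRICE></QUOTE>
  <QUOTE><SYMBOL>CCC</SYMBOL><PRICE currency="USD">12.10</PRICE></QUOTE>
</QUOTES>
"""


def read_prices(xml_text: str) -> Dict[str, float]:
    """Map each SYMBOL to its PRICE as a number."""
    root = ET.fromstring(xml_text.strip())
    return {
        quote.findtext("SYMBOL"): float(quote.findtext("PRICE"))
        for quote in root.findall("QUOTE")
    }


def symbols_above(prices: Dict[str, float], threshold: float) -> List[str]:
    """Return the symbols whose price exceeds threshold, sorted by name."""
    return sorted(symbol for symbol, price in prices.items() if price > threshold)


prices = read_prices(PAGE)
print(symbols_above(prices, 40.0))
# prints: ['AAA', 'BBB']
```

Because the markup says which text is a price, the program can compare values directly instead of guessing from layout.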

XML – universal standard for data interchange

XML is also a very versatile basis for data interchange between different systems. In this regard it leaves the formal structure of relational data trailing in its wake, since it is much more flexible than a system based on a rigid field concept. It also enhances the performance capability of interface standards such as CORBA and DCOM. And not the least of XML’s merits is that it can also describe more sophisticated graphical user interfaces than the rather basic forms possible with HTML. So although XML technology is mainly discussed in connection with the Web, its range actually extends far beyond it.

For instance, once you have agreed on an XML format for the representation of molecular structures, it is, of course, possible to carry out complex searches for compounds on the Web, but equally you can save such information in, and query it from, a database in the same way. It is virtually impossible to represent such structures in the fields of a relational database in such a way as to allow individual components or compounds to be searched for; the relational approach simply cannot handle the complexity of the information. The data can generally only be stored in an unstructured form as text or graphics, in which case you can only view it, not manipulate it. Roughly 60 percent of all the information available in data processing exists only in this unstructured form, as documents, images, graphics, spreadsheets and the like.

However, if chemical compounds' structures can be described using an XML DTD, then it should be possible to locate chemical compounds in a database using structural information (two adjacent methyl groups on a benzene ring, for example). This opens up a new and highly interesting field of application for databases. The major vendors are already working on appropriate strategies. Software AG, for example, has announced support for XML on a broad front in databases, development systems and middleware. It is especially XML’s capacity for flexible adaptation to very different purposes that makes it suitable as a universal interchange format for a wide range of purposes, from electronic business to document management.
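A structural query of this kind can be sketched in Python under strong simplifying assumptions. The markup below is invented for the illustration and is not CML or any agreed chemical format: each MOLECULE lists the substituents on a six-membered RING by position, and two positions count as adjacent when they are neighbors on the ring (position 6 is next to position 1).

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical compound database; element and attribute names are assumptions.
DATABASE = """
<COMPOUNDS>
  <MOLECULE name="o-xylene">
    <RING type="benzene">
      <SUBSTITUENT position="1" group="methyl"/>
      <SUBSTITUENT position="2" group="methyl"/>
    </RING>
  </MOLECULE>
  <MOLECULE name="p-xylene">
    <RING type="benzene">
      <SUBSTITUENT position="1" group="methyl"/>
      <SUBSTITUENT position="4" group="methyl"/>
    </RING>
  </MOLECULE>
  <MOLECULE name="toluene">
    <RING type="benzene">
      <SUBSTITUENT position="1" group="methyl"/>
    </RING>
  </MOLECULE>
</COMPOUNDS>
"""


def has_adjacent_methyls(ring: ET.Element) -> bool:
    """True if two methyl groups sit on neighboring ring positions."""
    positions = {
        int(sub.get("position"))
        for sub in ring.findall("SUBSTITUENT")
        if sub.get("group") == "methyl"
    }
    # Positions run 1..6; (p % 6) + 1 is the next position around the ring.
    return any((p % 6) + 1 in positions for p in positions)


def find_compounds(xml_text: str) -> List[str]:
    """Names of molecules with adjacent methyl groups on a benzene ring."""
    root = ET.fromstring(xml_text.strip())
    return [
        molecule.get("name")
        for molecule in root.findall("MOLECULE")
        if any(has_adjacent_methyls(r) for r in molecule.findall("RING[@type='benzene']"))
    ]


print(find_compounds(DATABASE))
# prints: ['o-xylene']
```

A real chemical grammar would describe atoms and bonds in far more detail, but the principle is the same: the search runs over the described structure, not over a picture or a block of text.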

A large number of definitions have already been drawn up in compliance with the XML standard, in the healthcare sector or for the representation of complex chemical structures, for example. The Shakespeare example (see Box 2) shows how simple but powerful such an application can be. Once all Shakespeare users have agreed on the rules, and on the meaning and use of keywords such as ACT, SCENE or SPEECH, a corresponding application can be programmed without difficulty to print out the dramas or transfer them to a voice output algorithm; enhancements can be added seamlessly. Interest in a common, cross-platform data interchange format is particularly high in the financial services sector. And, in the long term, even the never-ending story of EDI could have a happy ending with XML: EDI specifications can be represented in XML and thus implemented in concrete projects without the high cost and low acceptance that have dogged EDI, which could lead to a boom in electronic business.

Heavy investments are currently being made in the development of XML standards, interfaces and tools. Microsoft is already providing XML technology in Version 4 of its Web browser, Internet Explorer, and is expected to support XML soon in Web tools such as FrontPage. Microsoft has also selected XML for the implementation of its Channel Data Format, a protocol for the standardization of information processing on the Web, for push technology, for example. Few people realize that Quicken, which is a very widespread application, already uses the XML format. Companies whose orientation is more toward mainframe or Unix environments are also working on XML projects. A number of Software AG’s products, such as Natural@Web, Adabas Web Retrieval and Adabas Web Gateway, already offer basic XML functionality. Companies such as Sun, Oracle, Sybase and Corel are also concentrating hard on the subject of XML. But as with so much in the computing industry, the involvement of Microsoft will probably be decisive for the future of XML, and the industry leader has demonstrated a clear commitment to it. Adaptable standards have always been enthusiastically embraced by Bill Gates’s company.

 

Box 1

Here are some of the many industry initiatives which exist to establish XML-based interchange standards:

  • The XML/EDI group, an international consortium working on guidelines for integrating XML into the electronic business of different sectors
  • HL7, a group of healthcare organizations developing standards for the electronic interchange of hospital, financial and administrative data between independent computer systems in the healthcare sector
  • Chemical Markup Language (CML), developed in Great Britain to enable chemists to exchange descriptions of molecules, formulas and other chemical data
  • Open Financial Exchange (OFX), the format used by Intuit Quicken and Microsoft Money to communicate with banks (already in use)
  • Open Software Distribution (OSD) from Marimba and Microsoft

 

Box 2

Example 1

Act 1

SCENE 1. Elsinore. A platform before the castle.

FRANCISCO at his post. Enter to him BERNARDO.

BERNARDO: Who's there?

FRANCISCO: Nay, answer me: stand, and unfold yourself.

BERNARDO: Long live the king!

To read this script of Shakespeare's Hamlet, humans do not require text with markup. They are able to recognize or infer the meaning of individual parts based on their experience.


Example 2

<ACT><TITLE>ACT 1</TITLE>
<SCENE><TITLE>SCENE 1. Elsinore. A platform before the castle.</TITLE>
<STAGEDIR>FRANCISCO at his post. Enter to him BERNARDO.</STAGEDIR>
<SPEECH>
<SPEAKER>BERNARDO</SPEAKER>
<LINE>Who's there?</LINE>
</SPEECH>
<SPEECH>
<SPEAKER>FRANCISCO</SPEAKER>
<LINE>Nay, answer me: stand, and unfold yourself.</LINE>
</SPEECH>
<SPEECH>
<SPEAKER>BERNARDO</SPEAKER>
<LINE>Long live the king!</LINE>
</SPEECH>
</SCENE>
</ACT>

XML markup has been added to the play script from the first example. Computers cannot infer the meaning from the context; they require text markup in order to add information about the content.
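How a program might use this markup can be sketched in Python with the standard library's xml.etree.ElementTree. The sketch embeds an abridged copy of the scene, with the ACT and SCENE elements closed so that the fragment is well-formed on its own; the function name lines_by_speaker is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Abridged copy of the marked-up scene, closed so that it parses by itself.
SCRIPT = """
<ACT><TITLE>ACT 1</TITLE>
<SCENE><TITLE>SCENE 1. Elsinore. A platform before the castle.</TITLE>
<SPEECH><SPEAKER>BERNARDO</SPEAKER><LINE>Who's there?</LINE></SPEECH>
<SPEECH><SPEAKER>FRANCISCO</SPEAKER><LINE>Nay, answer me: stand, and unfold yourself.</LINE></SPEECH>
<SPEECH><SPEAKER>BERNARDO</SPEAKER><LINE>Long live the king!</LINE></SPEECH>
</SCENE>
</ACT>
"""


def lines_by_speaker(xml_text: str) -> Dict[str, List[str]]:
    """Collect every LINE under the SPEAKER who delivers it."""
    root = ET.fromstring(xml_text.strip())
    result: Dict[str, List[str]] = {}
    for speech in root.iter("SPEECH"):
        speaker = speech.findtext("SPEAKER")
        lines = [line.text or "" for line in speech.findall("LINE")]
        result.setdefault(speaker, []).extend(lines)
    return result


cast = lines_by_speaker(SCRIPT)
for line in cast["BERNARDO"]:
    print(line)
```

The same question, "which lines does Bernardo speak?", cannot be answered reliably from the plain text of Example 1 without guessing at conventions such as the colon after a name.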


Example 3

The HTML version of the play script would look something like this. The markup is only for display purposes in a browser; it adds no semantic value to the text and therefore cannot be evaluated as such.

<H1>ACT 1</H1>
<P><I>SCENE 1. Elsinore. A platform before the castle.</I></P>
<P><I>FRANCISCO at his post. Enter to him BERNARDO.</I></P>
<P><B>BERNARDO:</B> Who's there?</P>
<P><B>FRANCISCO:</B> Nay, answer me: stand, and unfold yourself.</P>
<P><B>BERNARDO:</B> Long live the king!</P>