XML Metalanguage Opens Up New Worlds
by Nigel Hutchison, Software AG
The XML markup language is a new standard for describing the content of documents and
can be adapted to suit different applications. Its possible areas of application range
from Web pages through electronic commerce to complex database solutions.
The unbroken expansion of the Web has given rise to an information jungle in which it
is increasingly difficult to get your bearings. Overloaded search engines often no longer
provide the required results; and advertising messages, outdated pages and junk
information are a frequent hindrance when using the Web as an information medium. On the
one hand, this is down to the ever greater numbers of documents stored on the Web (even
according to conservative estimates there are hundreds of millions of pages); on the
other, it also has to do with the way the information is stored on the Web. Essentially,
it is contained in text documents with HTML (Hypertext Markup Language) formatting. The
Web designer employs HTML tags to define how the document will appear in the browser,
include links to other Web pages, incorporate applets, and so on. This bears little
resemblance to real programming work; HTML is a page description language,
comparable to the PostScript printer language.
Limitations of HTML
HTML owes its fame to the Web, and today there are many Web tools that relieve the
designer of the burden of working with HTML code, though the limitations of HTML do, of
course, remain. Essentially, these are:
- HTML is not extensible; you cannot define your own tags for specific requirements.
- It is not possible to represent the specifications of data structures, as required in
databases or object hierarchies, for example.
- HTML does not support data validation.
HTML thus does not deal with aspects of content, and you cannot use it to qualify Web
pages, for example. HTML only controls the presentation of the information. Anything
beyond that requires considerable programming effort in the form of either applets or
applications. Not the least of the problems posed by the limitations of HTML is the danger
of proprietary enhancements being developed, which would restrict the universal
availability of information formatted in HTML.
SGML: flexible but complex
Most users of HTML today are unaware of the fact that HTML is based on SGML (Standard
Generalized Markup Language), a metalanguage for defining rules governing the handling of
documents of different types. HTML is an application of SGML that is concerned only with
the presentational aspects mentioned above, thus providing as broad a basis as possible
for the interchange of documents. And SGML promises to provide everything that cannot be
achieved with HTML. Every application of SGML (including HTML, for example)
contains one or more document type definitions (DTD). These are formal descriptions of the
syntax of the documents assigned to a document type. A DTD is required in order to
interpret and check an SGML document.
SGML has some decisive advantages. Firstly, it is a non-proprietary standard supported
by a large number of software vendors. Because of this, a document database complying with
the SGML standard will have a much longer life than one based on proprietary standards.
Secondly, the documents can both be read by people and parsed (have their code dissected
and analyzed) by programs. Thirdly, SGML documents describe the structure of the data, not
just how it is to be presented.
But SGML also has its weaknesses. It is very general and complex, with a specification
extending to some 500 pages, most of which is of no practical relevance to the Web. Since
there are so many options available, the actual interoperability between companies is
minimal. Consequently, SGML is used, above all, for large-scale technical documentation
projects, particularly in the military and intelligence sectors, aircraft manufacturing,
publishing and archiving.
HTML, a derivative of SGML, has of course achieved great fame, but in contrast to SGML
it makes no attempt at all to describe the semantic content or structure of data. As
already mentioned, given the immense growth of the Web this is not without its problems.
Because HTML departed significantly from the spirit and purpose behind SGML, many in the
SGML community gave the standard a hostile reception when it was introduced in 1994. HTML
nevertheless became very popular, and it was the very fact that it had such limitations
that made it so useful to the expanding Web. Consequently, HTML is now the most widespread
application of SGML.
XML solves the dilemma
Thus, for today's requirements SGML is too complex and HTML too simple. The
solution to this dilemma actually already lies within the logic of SGML: This is, after
all, a metalanguage with which you can create a system of grammar to suit your
own requirements. So what could be better than to define a new standard on the basis of
SGML, combining the simplicity and universality of HTML with the flexibility of SGML? The
result is XML (eXtensible Markup Language). If HTML can be thought of as SGML's
superficial child, then XML is its pragmatic child. First and foremost, XML is much less
complex than SGML: Its specification, created by the World Wide Web Consortium (W3C), runs
to a still manageable 26 pages.
XML's simplicity considerably facilitates its implementation. For example, XML
parsers, which are applications that analyze XML syntax, do not need DTDs in order to
dissect a document into its component parts. In addition, XML does not permit any
deviations from the standard syntax. This ensures that all XML-coded documents can be
edited, saved and delivered regardless of the recipient's XML parser.
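The two points above (parsing without a DTD, and no tolerance for syntax deviations) can be sketched in a few lines of Python using the standard library's xml.etree.ElementTree parser. This is only an illustration: the quote markup, the symbol "ACME" and the helper names describe and is_well_formed are invented for the example and do not come from any published DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: no DOCTYPE declaration and no DTD anywhere.
WELL_FORMED = "<quote><symbol>ACME</symbol><price currency='USD'>42.50</price></quote>"

# Same content, but the <symbol> element is never closed.
MALFORMED = "<quote><symbol>ACME<price currency='USD'>42.50</price></quote>"


def describe(text):
    """Return (tag, attributes, text) for every element, in document order."""
    root = ET.fromstring(text)
    return [(el.tag, dict(el.attrib), (el.text or "").strip()) for el in root.iter()]


def is_well_formed(text):
    """True if the parser accepts the text, False if it reports a syntax error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


for item in describe(WELL_FORMED):
    print(item)
print(is_well_formed(MALFORMED))
```

The parser dissects the first document without ever having seen a grammar for it, and rejects the second one outright instead of guessing what the author meant, as HTML browsers tend to do.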
XML documents can provide style sheets that enable browsers to convert documents in
order to display them in HTML. XML applications can be created using standard parser
interfaces inside browsers. XML is structurally downwards compatible, which means that XML
browsers can also parse, or interpret, well-formed HTML documents. It is even possible to
write XML-compliant HTML documents. XML's downward compatibility means that the
transition from HTML can be implemented in stages, which will doubtless contribute to its
acceptance.
XML: the simple metalanguage
One of the central strengths of XML, however, is that, like SGML, it is itself a
metalanguage, not an independent markup language (seen in this light, the name XML is in
fact extremely inappropriate). It is a set of rules that allows you to define specific
grammars, which then only apply within certain application scenarios.
Consequently, XML can be adapted very flexibly to suit different purposes. For example,
you can use XML to do such things as define prices, authors' names, times and dates,
keywords or share prices. To describe content like this, syntax must be defined for the
relevant application areas (for realtors, stock market services or publishers, say) in the
form of a document type definition (DTD).
This makes it possible to specify precise queries on the basis of the defined
content-based criteria. The results of an analysis of documents or Web pages corresponding
to an XML standard can then be edited directly in application programs. An application
might, for example, independently read product or share prices from Web pages and then use
them. But the Web is about more than just Wall Street; it is also possible, for example,
to conceive of applications that could use the contents of XML-based CAD files.
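As a rough sketch of such an application, the following Python fragment reads share prices from an XML document and then works with them. The quotes/quote/symbol/price vocabulary, the sample values and the function names are assumptions made up for this illustration; a real service would have to publish its own agreed document type.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element and attribute names are invented for this sketch.
PAGE = """
<quotes>
  <quote><symbol>AAA</symbol><price currency="USD">10.50</price></quote>
  <quote><symbol>BBB</symbol><price currency="USD">99.00</price></quote>
  <quote><symbol>CCC</symbol><price currency="USD">42.25</price></quote>
</quotes>
"""


def read_prices(text):
    """Map each ticker symbol to its price, taken from the markup itself."""
    root = ET.fromstring(text.strip())
    prices = {}
    for quote in root.findall("quote"):
        prices[quote.findtext("symbol")] = float(quote.findtext("price"))
    return prices


def above(prices, threshold):
    """Content-based query: symbols whose price exceeds the threshold."""
    return sorted(symbol for symbol, price in prices.items() if price > threshold)


prices = read_prices(PAGE)
print(above(prices, 40.0))
```

Because the price is marked up as a price, the program does not have to guess which number on the page is meant; the same query against an HTML page would require fragile text matching.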
XML: universal standard for data interchange
XML is also a very versatile basis for data interchange between different systems. In
this regard it leaves the formal structure of relational data trailing in its wake, since
it is much more flexible than a system based on a rigid field concept. It also enhances
the performance capability of interface standards such as CORBA and DCOM. And not the
least of XML's merits is that it can also describe more sophisticated graphical user
interfaces than the rather basic forms possible with HTML. So although XML technology is
mainly discussed in connection with the Web, its range actually extends far beyond it.
For instance, once you have agreed on an XML format for the representation of molecular
structures, it is, of course, possible to carry out complex searches for compounds on the
Web, but equally you can save such information in, and query it from, a database in the
same way. It is virtually impossible to represent such structures in the fields of a
relational database in such a way as to allow individual components or compounds to be
searched for; the relational approach simply cannot handle the complexity of the
information. The data can generally only be stored in an unstructured form as text or
graphics, in which case you can only view it, not manipulate it. Roughly 60 percent of all
the information available in data processing is available in this unstructured form; the
rest is in the form of documents, images, graphics, spreadsheets, etc.
However, if chemical compounds' structures can be described using an XML DTD, then it
should be possible to locate chemical compounds in a database using structural information
(two adjacent methyl groups on a benzene ring, for example). This opens up a new and
highly interesting field of application for databases. The major vendors are already
working on appropriate strategies. Software AG, for example, has announced support for XML
on a broad front in databases, development systems and middleware. It is above all
XML's capacity for flexible adaptation that makes it suitable as a universal interchange
format for a wide range of purposes, from electronic business to document management.
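To make the idea of a structural query concrete, here is a deliberately simplified Python sketch. The compounds/ring/substituent format is purely hypothetical (it is not CML or any other published vocabulary), and the function names are likewise invented; the point is only that once ring positions are marked up, "two adjacent methyl groups on a benzene ring" becomes a question a program can answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical format: each <ring> lists its substituents by ring position (1..size).
COMPOUNDS = """
<compounds>
  <compound name="o-xylene">
    <ring type="benzene" size="6">
      <substituent position="1" group="methyl"/>
      <substituent position="2" group="methyl"/>
    </ring>
  </compound>
  <compound name="p-xylene">
    <ring type="benzene" size="6">
      <substituent position="1" group="methyl"/>
      <substituent position="4" group="methyl"/>
    </ring>
  </compound>
  <compound name="toluene">
    <ring type="benzene" size="6">
      <substituent position="1" group="methyl"/>
    </ring>
  </compound>
</compounds>
"""


def has_adjacent(ring, group):
    """True if two substituents of the given group sit on neighbouring ring positions."""
    size = int(ring.get("size"))
    positions = {
        int(s.get("position"))
        for s in ring.findall("substituent")
        if s.get("group") == group
    }
    # The neighbour of position p is p + 1, wrapping from the last position back to 1.
    return any((p % size) + 1 in positions for p in positions)


def find_compounds(text, ring_type, group):
    """Names of compounds with a ring of the given type carrying adjacent groups."""
    root = ET.fromstring(text.strip())
    return [
        compound.get("name")
        for compound in root.findall("compound")
        if any(
            ring.get("type") == ring_type and has_adjacent(ring, group)
            for ring in compound.findall("ring")
        )
    ]


print(find_compounds(COMPOUNDS, "benzene", "methyl"))
```

A database that stores such documents natively could evaluate the same condition as part of a query instead of in application code.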
A large number of definitions have already been drawn up in compliance with the XML
standard, in the healthcare sector or for the representation of complex chemical
structures, for example. The Shakespeare example (see box 2) shows how simple but powerful
such an application can be. Once all Shakespeare users have agreed on the rules, and on
the meaning and use of keywords such as ACT, SCENE or SPEECH, a corresponding application
can be programmed without difficulty to print out the dramas or transfer them to a voice
output algorithm; enhancements can be added seamlessly. Interest in a common,
cross-platform data interchange format is particularly high in the financial services
sector. And, in the long term, even the never-ending story of EDI could eventually have a
happy ending with XML: EDI specifications can be represented in XML and thus implemented in
concrete projects without those tiresome problems of high cost and low acceptance
associated with EDI, which could lead to a big boom in electronic business.
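Returning to the Shakespeare markup mentioned above, a printing application really can be very small. The following Python sketch assumes the ACT, SCENE, TITLE, STAGEDIR, SPEECH, SPEAKER and LINE elements are nested as in box 2; the render function and its output layout are choices made for this illustration, not part of any agreed standard.

```python
import xml.etree.ElementTree as ET

# Excerpt in the style of box 2, closed off so that it is well-formed.
ACT = """
<ACT><TITLE>ACT 1</TITLE>
<SCENE><TITLE>SCENE 1. Elsinore. A platform before the castle.</TITLE>
<STAGEDIR>FRANCISCO at his post. Enter to him BERNARDO</STAGEDIR>
<SPEECH>
<SPEAKER>BERNARDO</SPEAKER>
<LINE>Who's there?</LINE>
</SPEECH>
<SPEECH>
<SPEAKER>FRANCISCO</SPEAKER>
<LINE>Nay, answer me: stand, and unfold yourself.</LINE>
</SPEECH>
</SCENE>
</ACT>
"""


def render(text):
    """Turn the marked-up act into plain script lines, one string per line."""
    act = ET.fromstring(text.strip())
    lines = [act.findtext("TITLE")]
    for scene in act.findall("SCENE"):
        for child in scene:
            if child.tag == "TITLE":
                lines.append(child.text)
            elif child.tag == "STAGEDIR":
                lines.append("[" + child.text + "]")
            elif child.tag == "SPEECH":
                speaker = child.findtext("SPEAKER")
                for line in child.findall("LINE"):
                    lines.append(speaker + ": " + line.text)
    return lines


for line in render(ACT):
    print(line)
```

Feeding the same element tree to a voice output routine, or listing only one character's speeches, would need a different loop body but no change to the documents themselves.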
Heavy investments are currently being made in the development of XML standards,
interfaces and tools. Microsoft is already providing XML technology in Version 4 of its
Web browser, Internet Explorer, and is expected to support XML soon in Web tools such as
FrontPage. Microsoft has also selected XML for the implementation of its Channel Definition
Format, a protocol for the standardization of information processing on the Web, for push
technology, for example. Few people realize that Quicken, which is a very widespread
application, already uses the XML format. Companies whose orientation is more toward
mainframe or Unix environments are also working on XML projects. A number of Software
AG's products, such as Natural@Web, Adabas Web Retrieval and Adabas Web Gateway,
already offer basic XML functionality. Companies such as Sun, Oracle, Sybase and Corel are
also concentrating hard on the subject of XML. But like so much in the computing industry,
the involvement of Microsoft will probably be decisive for the future of XML, and
the industry leader has demonstrated a clear commitment to it. Adaptable standards have
always been enthusiastically embraced by Bill Gates's company.
Box 1
Here are some of the many industry initiatives that exist to establish XML-based
interchange standards:
- The XML/EDI group, an international consortium working on guidelines for integrating XML
into the electronic business of different sectors
- HL7, a group of healthcare organizations developing standards for the electronic
interchange of hospital, financial and administrative data between independent computer
systems in the healthcare sector
- Chemical Markup Language (CML), developed in Great Britain to enable chemists to
exchange descriptions of molecules, formulas and other chemical data
- Open Financial Exchange (OFX), the format used by Intuit Quicken and Microsoft Money to
communicate with banks (already in use)
- Open Software Distribution (OSD) from Marimba and Microsoft
Box 2
Example 1
Act 1
SCENE 1. Elsinore. A platform before the castle.
FRANCISCO at his post. Enter to him BERNARDO.
BERNARDO: Who's there?
FRANCISCO: Nay, answer me: stand, and unfold yourself.
BERNARDO: Long live the king!
To read this script of Shakespeare's Hamlet, humans do not require text with markup.
They are able to recognize or infer the meaning of individual parts based on their
experience.
Example 2
<ACT><TITLE>ACT 1</TITLE>
<SCENE><TITLE>SCENE 1. Elsinore. A platform before the castle.</TITLE>
<STAGEDIR>FRANCISCO at his post. Enter to him
BERNARDO</STAGEDIR>
<SPEECH>
<SPEAKER>BERNARDO</SPEAKER>
<LINE>Who's there?</LINE>
</SPEECH>
<SPEECH>
<SPEAKER>FRANCISCO</SPEAKER>
<LINE>Nay, answer me: stand, and unfold yourself.</LINE>
</SPEECH>
<SPEECH>
<SPEAKER>BERNARDO</SPEAKER>
<LINE>Long live the king!</LINE>
</SPEECH>
</SCENE>
</ACT>
XML markup has been added to the play script from the first example. Computers cannot
infer the meaning from the context; they require text markup in order to add information
about the content.
Example 3
The HTML version of the play script would look something like this. The markup is only
for display purposes in a browser; it adds no semantic value to the text and therefore
cannot be evaluated as such.
<H1>ACT 1</H1>
<P><I>SCENE 1. Elsinore. A platform before the castle.</I></P>
<P><I>FRANCISCO at his post. Enter to him BERNARDO.</I></P>
<P><B>BERNARDO:</B> Who's there?</P>
<P><B>FRANCISCO:</B> Nay, answer me: stand, and unfold
yourself.</P>
<P><B>BERNARDO:</B> Long live the king!</P>