XML at work
by Jürgen Harbarth, Software AG
XML is a metalanguage which can be used to describe content-related structures for
data of all kinds. Given this universal and flexible approach, XML's uses, from text
processing to electronic business, are almost limitless.
If the indications are right and the experts' assessments correct, the IT world will
soon see a new standard which could match ASCII or HTML in importance: XML
(eXtensible Markup Language). The main feature of XML is its exceptionally high
flexibility. It is already possible to identify a wide range of potential applications,
covering the whole world of IT: text processing and document management, databases and
database queries and, in particular, the Web.
What is XML? XML is a metalanguage for the description of documents and
data. It is thus a universal convention which can be used as a basis for defining
task-related structures. The principle itself is nothing new; SGML (Standard Generalized
Markup Language) functions this way and has already been around for some time. This
standard, however, is so complicated and unwieldy (the specification alone is 500
pages long) that in practice it is only used to describe technical documentation on
a large scale. And as to HTML, the format language of the World Wide Web, here is
something many people do not know: HTML is a specific application of SGML. The tags used in HTML
to set up and present Web pages are derived from the SGML standard.
From HTML to XML
Through a series of 'tags,' HTML determines how a document should appear
to the browser, what links exist to other Web pages, whether applets are integrated, etc.
HTML is a page description language comparable to the printer language Postscript; it has
become the language of the Web not least of all because of its limited number of
functions.
The limitations of HTML are widely known by now. HTML does not support
any validations, and it is not possible to reflect the specifications of the data
structures, as would be required for databases or object hierarchies. HTML is limited
solely to formal aspects; it handles the presentation alone and does not take content
structures into account. This means, for instance, that it cannot and will not distinguish
between a shoe size, a date of birth and a house number, but it can present the number in
bold print if you like.
The greater the sophistication that users demand from the Web, the
more urgent the need becomes to use content-related structures as well, for example to be
able to search the Web for people with the same shoe size and the same date of birth. The
sheer existence and popularity of various search engines that sift through general Web
pages to filter out specific content more or less effectively shows that there is a great
need in this area. The Web, using HTML throughout, recognizes only pages, not content, and
this logic is already clearly reaching its limits given the overwhelming mass of Web
pages.
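A minimal sketch in Python of what such a content-based search could look like once the data carries content-related tags. The element names (people, person, name, shoe_size, date_of_birth) and the sample records are assumptions made up for this illustration; they are not part of any standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical records: every value is wrapped in a tag that says what it means.
people_xml = """
<people>
  <person><name>Smith</name><shoe_size>42</shoe_size><date_of_birth>1967-10-25</date_of_birth></person>
  <person><name>Jones</name><shoe_size>42</shoe_size><date_of_birth>1967-10-25</date_of_birth></person>
  <person><name>Brown</name><shoe_size>38</shoe_size><date_of_birth>1978-11-25</date_of_birth></person>
</people>
"""

def find_matches(root, shoe_size, date_of_birth):
    """Return the names of all people whose tagged fields match both criteria."""
    return [
        person.findtext("name")
        for person in root.iter("person")
        if person.findtext("shoe_size") == shoe_size
        and person.findtext("date_of_birth") == date_of_birth
    ]

root = ET.fromstring(people_xml)
matches = find_matches(root, "42", "1967-10-25")
print(matches)
```

The query addresses the shoe size and the date of birth by name, which a page that only marks numbers as bold or italic cannot support.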
The need for content-based structures is not confined to the Web,
however. In the area of text processing and document management as well, parallel
developments are taking place: searches by key word, by storage and modification date,
by author's name, by headings which can be used to generate a table of contents, etc.
Like the Web search engines, all these approaches represent attempts to handle the
data in question not only as a string, but in terms of its content. However, most of these
attempts fail, on the one hand because they are proprietary, and on the other because they
are rigidly defined. Users cannot determine what is important to them within their own
documents; for example, they cannot search for shoe sizes listed within documents,
and no supplier would program a special browser just for orthopedists.
The more IT penetrates all areas of business life, the more urgent the
need becomes for universal exchangeability of data. The issue of EDI (Electronic Data
Interchange) shows how great the need is, and also how great the difficulties are. The Web
provides only the necessary infrastructure; exchanging data, however, calls for data
structuring conventions as well. If each participant were to use a different convention,
electronic business would never grow beyond simple e-mails.
Attempts to solve this problem by expanding the HTML standard are
dangerous, since consistency could be lost, with users ultimately defining their own HTML;
it could even mean the end of the Web. On the other hand, it is not possible to set up a
formal specification in HTML for every individual application; HTML would get completely
out of hand.
Only a metalanguage can offer the solution to these difficulties. XML is
a standard which is also practical to implement. XML is much simpler than SGML; the
official specification from the World Wide Web Consortium (W3C) comprises only 26 pages.
The simplicity of XML makes its implementation considerably easier.
XML and HTML
For XML to succeed in practice, it is also important that it be
compatible with HTML from the start. HTML is an application of SGML, but XML is designed so that
HTML can also work in XML applications; i.e. HTML tags integrate seamlessly with the XML
meta-logic. Thus, linkup to the great wide world is an integral part of the XML universe
from the very start, because HTML can be processed via the standard parser interfaces. XML
is backward compatible; i.e., XML browsers can also 'parse,' or interpret, HTML. It is
even possible to write HTML documents which are XML-compliant. XML's backward
compatibility allows a gradual transition from HTML, which will doubtless increase XML's
acceptance. Currently, only the XML browsers are lacking, but both Microsoft and
Netscape have already announced that the next versions of their browsers will support XML.
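As a rough illustration of an XML-compliant HTML document, the following Python sketch feeds a small HTML page to a generic XML parser. The page content and the URL are invented for the example, and the sketch assumes the page closes every element and quotes every attribute value; a page that does not do so would be rejected by the parser used here.

```python
import xml.etree.ElementTree as ET

# A small HTML page written so that it is also well-formed XML:
# every element is closed (note <br/>) and attribute values are quoted.
page = """<html>
  <head><title>Weather</title></head>
  <body>
    <h1>Today</h1>
    <p>Sunny, with a light <b>breeze</b>.<br/></p>
    <a href="http://example.com/forecast">Forecast</a>
  </body>
</html>"""

root = ET.fromstring(page)

# The generic XML parser exposes the HTML elements like any other tags.
title = root.findtext("head/title")
links = [a.get("href") for a in root.iter("a")]
print(title, links)
```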
The major difference remains: In contrast to HTML, XML can structure data
not only according to formal criteria (such as headings, running text, etc.), but also
according to aspects of content. To be more precise: XML allows content-based structuring,
since it functions on a different level than does HTML; it also allows a layout
description that goes beyond the scope of HTML.
XML: a metalanguage that's simple
Since XML is a metalanguage like SGML, it is not really the independent
markup language that its name would indicate. XML becomes whatever the users choose to
create on the basis of XML. Every user can define new, individual tags according to his
needs, such as "date of birth," "shoe size," or even "cooking
time," should XML be used to exchange recipes. XML does not define these tags itself,
but rather sets out how the tags are to be defined. Naturally, there are also XML tags
which apply to all documents, and others which are needed to file definitions in
documents.
XML thus offers a kind of grammar, e.g. <tag> content </tag>, which the user can then
fill in with the desired content, for instance:
<date_of_birth>11/25/78</date_of_birth>. (Tag names may not contain spaces, so
multi-word names are joined, here with underscores.) It would also be possible, for
example, for meteorologists to exchange weather data using their own tags, such as
<temperature>, <air_pressure>, <wind_force>, etc. and to file these
definitions in corresponding templates. XML-compatible applications could then process
such Web pages directly, e.g. automatically evaluating weather data via the Web. It is
then no longer absolutely necessary for the various user groups to agree on a syntax in
the form of a Document Type Definition (DTD).
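The weather scenario might look as follows in practice. This is only a sketch in Python: the wrapper element weather_report, the station element, the unit attribute and all values are assumptions introduced for the illustration, and the tag names are written with underscores because XML names cannot contain spaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical weather report using the meteorologists' own tags.
report = """
<weather_report>
  <station>Darmstadt</station>
  <temperature unit="C">18.5</temperature>
  <air_pressure unit="hPa">1013</air_pressure>
  <wind_force unit="Beaufort">4</wind_force>
</weather_report>
"""

def read_measurements(xml_text):
    """Map each measurement tag to a (value, unit) pair."""
    root = ET.fromstring(xml_text)
    result = {}
    for tag in ("temperature", "air_pressure", "wind_force"):
        element = root.find(tag)
        result[tag] = (float(element.text), element.get("unit"))
    return result

measurements = read_measurements(report)
print(measurements)
```

An application that knows these tags can evaluate the values directly instead of scraping them out of a formatted page.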
Although the application-specific tags only make sense when all users
know them and have parsers set up accordingly, XML also accepts unknown tags: They are
automatically recognized as such, and are returned uninterpreted, but with no false
interpretations. Similarly, although XML also uses DTDs, a type of document mask, an XML
parser can also process a document without a DTD if necessary. The architecture is thus
very flexible and already designed with the idea in mind that even users processing very
general information must be served.
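The behavior described here can be sketched in Python with a non-validating parser. The document below has no DTD and contains one tag, pollen_count, that the hypothetical application does not know; the tag names and the KNOWN_TAGS set are assumptions for the illustration.

```python
import xml.etree.ElementTree as ET

# No DTD is supplied, and <pollen_count> is unknown to this application.
document = """
<weather_report>
  <temperature>18.5</temperature>
  <pollen_count>high</pollen_count>
  <wind_force>4</wind_force>
</weather_report>
"""

KNOWN_TAGS = {"temperature", "wind_force"}

root = ET.fromstring(document)
understood = {}      # tags the application can interpret as numbers
passed_through = {}  # unknown tags, kept as plain text without interpretation

for child in root:
    if child.tag in KNOWN_TAGS:
        understood[child.tag] = float(child.text)
    else:
        passed_through[child.tag] = child.text

print(understood, passed_through)
```

The unknown element is neither dropped nor misread; it is simply handed on as uninterpreted text.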
Since the definitions are also entered in normal text, rather than in
cryptic code symbols, everyone can read them; regardless of which XML parser was used by
the person creating the document, all XML-encoded documents can be processed, stored and
delivered. Thus, an orthopedist could also read the weather data. He just has to do
without the specific functions of the given XML implementation, and thus cannot generate a
weather report; however, he can be sure that his XML browser will not mistake
the wind speed for a shoe size, for example.
XML can thus be adapted flexibly to all conceivable applications. There
are no limits to the imagination: Prices, author names, time or date information, key
words, share prices, etc. can all be defined. For content of this type, it makes sense to
set a syntax in the form of a document type definition (DTD) in relation to certain
application areas (for example real estate agents, stock exchange services, publishers).
It is then possible, for example, to execute targeted queries on the basis of the defined
content-related criteria. The results of the evaluation of documents or Web pages which
correspond to an XML standard can be processed directly in the application programs as
well; they could independently extract price information or share prices and then process
the information, for example. Definitions for CAD data or for x-rays can be generated in
the same way.
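A sketch of how an agreed syntax and a targeted query could fit together, using real estate listings as the application area. The DTD, the element names and the prices are invented for this example. Note that the Python standard-library parser used here reads the DTD but does not validate against it; checking a document against its DTD would need a validating parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical listings document with its DTD in the internal subset.
listings = """<?xml version="1.0"?>
<!DOCTYPE listings [
  <!ELEMENT listings (property*)>
  <!ELEMENT property (city, price)>
  <!ELEMENT city (#PCDATA)>
  <!ELEMENT price (#PCDATA)>
  <!ATTLIST price currency CDATA #REQUIRED>
]>
<listings>
  <property><city>Silver Spring</city><price currency="USD">185000</price></property>
  <property><city>Bethesda</city><price currency="USD">320000</price></property>
  <property><city>Rockville</city><price currency="USD">210000</price></property>
</listings>"""

def cities_below(xml_text, max_price):
    """Targeted query: cities of all properties priced at or below max_price."""
    root = ET.fromstring(xml_text)
    return [
        prop.findtext("city")
        for prop in root.iter("property")
        if int(prop.findtext("price")) <= max_price
    ]

affordable = cities_below(listings, 250000)
print(affordable)
```

Because the price is tagged as a price, the application can extract and compare it without guessing which number on the page is meant.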
XML: universal standard for the exchange of data
XML has many uses in data exchange between different systems. It is more
flexible than the rigid field concept of relational data, and extends the performance of
interface standards such as CORBA and DCOM. Not least of all, XML can also describe more
sophisticated user interfaces than can the very simple HTML forms.
Although the use of XML technology is discussed primarily in connection
with the Web, the possible applications of XML extend far beyond it. For example, once an
XML format for depicting molecular structures has been agreed upon, it becomes possible
not only to run a targeted search in the Web for certain compounds, but also to store such
information in a similar form in a database and to call it up from there. It is nearly
impossible, however, to reflect such structures in the fields of a relational database in
such a way that it is then possible to conduct a targeted search for individual components
or compounds; the relational approach fails in light of the complexity of the information.
It is generally possible to store it in an unstructured manner as text or graphical
information, but the information can then only be viewed. An estimated 80 percent of all
information is available to IT only in such an unstructured form; the rest can be accessed
in databases in an elementary form. If a data structure is described in an XML format,
however, it also becomes possible to conduct a targeted search in the database for
specific chemical compounds. For databases, this represents a new, highly interesting area
of application. The major suppliers are already working on the corresponding concepts.
Mainly because of this ability to adapt flexibly to widely differing uses, XML is suitable
for use as a universal format for the exchange of data in numerous areas, from electronic
business to document management.
Given XML's universal nature, its possible applications are limitless.
XML is truly a framework for all data, regardless of where it is stored, whether in
enterprise databases or on the Web. How widespread XML actually becomes naturally depends
to a great extent on the commitment shown by bodies of the IT industry, by trade
associations and also by leading enterprises, which will create subject-related
implementations for specific scenarios. There are already numerous initiatives in this
area, and many definitions have already been generated using the XML standard, for example
in healthcare or for depicting complex chemical structures: Chemical Markup Language (CML)
enables the exchange of descriptions of molecules, formulas and other chemical data. The
Open eBook standard, which a few US publishers have introduced for electronic publishing,
is also based on XML. The XML/EDI group is also working on integrating XML into the EDI
concept. Open Financial Exchange (OFX), the format used by Intuit Quicken and Microsoft
Money to communicate with banks, is already in use.
It is an interesting new project from Germany, however, that is
demonstrating just how great XML's impact on information technology could be: Software
AG's new database system will use the XML standard to define internal data
structures. It does not just feature an interface to XML; rather, it uses XML as the basic
definition language for documents and data.
It can be assumed that in the future, all essential electronic business
applications on the Net will be based on both HTTP and XML. Every technology already
integrating these standards today is one that will prove viable for the future.
The example below, a description of patient information created using
XML, shows how effectively complex structures can be depicted. The information consists
of business data, text and a reference to the x-ray image. Additional data can
easily be added; the parser can analyze the syntax just as before.
<Patient>
  <Name>Smith</Name>
  <First_name>Kevin</First_name>
  <Occupation>Forest Ranger</Occupation>
  <Shoe_size>42</Shoe_size>
  <Date_of_birth>10/25/1967</Date_of_birth>
  <Address>
    <Street>Ivy Way</Street>
    <House_number>8</House_number>
    <City>Silver Spring</City>
    <State>Maryland</State>
    <Zip_code>20904</Zip_code>
  </Address>
  <Insurance>
    <Insurance_co>State Farm</Insurance_co>
    <Insurance_no>999888777666</Insurance_no>
    <Patient_no>1234566</Patient_no>
  </Insurance>
  <Diagnosis>
    <Illness>Splayfoot</Illness>
    <x-ray image="http://picture23.gif"/>
  </Diagnosis>
</Patient>
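A record of this kind could be read by an application roughly as follows. This Python sketch uses a shortened copy of the patient record with element names written without spaces (XML names cannot contain them); the summarize function and the added Blood_type element are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the patient record, with space-free element names.
patient_xml = """
<Patient>
  <Name>Smith</Name>
  <First_name>Kevin</First_name>
  <Shoe_size>42</Shoe_size>
  <Address>
    <City>Silver Spring</City>
    <Zip_code>20904</Zip_code>
  </Address>
  <Diagnosis>
    <Illness>Splayfoot</Illness>
    <x-ray image="http://picture23.gif"/>
  </Diagnosis>
</Patient>
"""

def summarize(xml_text):
    """Pull business data, text and the x-ray reference out of one record."""
    patient = ET.fromstring(xml_text)
    return {
        "name": f'{patient.findtext("First_name")} {patient.findtext("Name")}',
        "city": patient.findtext("Address/City"),
        "illness": patient.findtext("Diagnosis/Illness"),
        "xray": patient.find("Diagnosis/x-ray").get("image"),
    }

summary = summarize(patient_xml)

# Adding a new element (hypothetical Blood_type) does not disturb existing readers.
extended = patient_xml.replace("</Patient>", "<Blood_type>A</Blood_type></Patient>")
extended_summary = summarize(extended)
print(summary)
```

The second call illustrates the point about extensibility: code written against the original structure keeps working when further data is added to the record.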
XML in Practice
Document Management and Text Processing
Electronic Data Interchange (EDI)
Web and Web Search Engines
Electronic Business and Electronic Commerce
Database Structures and Queries
Electronic Publishing