XML at work

by Jürgen Harbarth, Software AG


XML is a metalanguage which can be used to describe content-related structures for data of all kinds. Given this universal and flexible approach, XML's uses, from text processing to electronic business, are almost limitless.

If the indications are right and the experts' assessments are correct, the IT world will soon see a new standard that could match ASCII or HTML in importance: XML (eXtensible Markup Language). The main feature of XML is its exceptional flexibility. It is already possible to identify a wide range of potential applications covering the whole world of IT: text processing and document management, databases and database queries, and, in particular, the Web.

What is XML? XML is a metalanguage for describing documents and data. It is thus a universal convention that can serve as a basis for defining task-related structures. The principle itself is nothing new; SGML (Standard Generalized Markup Language) works this way and has been around for some time. That standard, however, is so complicated and unwieldy (the specification alone is 500 pages long) that in practice it is used only to describe technical documentation on a large scale. As for HTML, the formatting language of the World Wide Web, here is something many people do not know: HTML is a specific application of SGML. The tags used in HTML to set up and present Web pages are defined according to the SGML standard.

From HTML to XML

Through a series of 'tags,' HTML determines how a document should appear in the browser, which links exist to other Web pages, whether applets are integrated, etc. HTML is a page description language comparable to the printer language PostScript; it has become the language of the Web not least because of its limited number of functions.

The limitations of HTML are widely known by now. HTML does not support any validations, and it is not possible to reflect the specifications of the data structures, as would be required for databases or object hierarchies. HTML is limited solely to formal aspects; it handles the presentation alone and does not take content structures into account. This means, for instance, that it cannot and will not distinguish between a shoe size, a date of birth and a house number, but it can present the number in bold print if you like.

The greater the sophistication that users demand from the Web, the more urgent the need becomes to use content-related structures as well, for example to be able to search the Web for people with the same shoe size and the same date of birth. The sheer existence and popularity of various search engines, which sift through general Web pages to filter out specific content more or less effectively, shows that there is a great need in this area. The Web, using HTML throughout, recognizes only pages, not content, and this logic is already clearly reaching its limits given the overwhelming mass of Web pages.

The need for content-based structures is not confined to the Web, however. Parallel developments are taking place in text processing and document management as well: searches for key words, for storage and modification dates, for authors' names, for headings that can be used to generate a table of contents, etc. Like the Web search engines, all these approaches are attempts to handle the data in question not only as a string, but in terms of its content. However, most of these attempts fail, on the one hand because they are proprietary, and on the other because they are rigidly defined. Users cannot determine what is important to them within their own documents; for example, they cannot search for shoe sizes listed within documents, and no supplier would program a special browser just for orthopedists.

The more IT penetrates all areas of business life, the more urgent the need becomes for universal exchangeability of data. The issue of EDI (Electronic Data Interchange) shows how great the need is, and also how great the difficulties are. The Web provides only the necessary infrastructure; exchanging data, however, calls for data structuring conventions as well. If each participant were to use a different convention, electronic business would never grow beyond simple e-mails.

Attempts to solve this problem by expanding the HTML standard are dangerous, since consistency could be lost, with users ultimately defining their own HTML; it could even mean the end of the Web. On the other hand, it is not possible to set up a formal specification in HTML for every individual application; HTML would get completely out of hand.

Only a metalanguage can offer the solution to these difficulties. XML is a standard which is also practical to implement. XML is much simpler than SGML; the official specification from the World Wide Web Consortium (W3C) comprises only 26 pages. The simplicity of XML makes its implementation considerably easier.

XML and HTML

For XML to succeed in practice, it is also important that it be compatible with HTML from the start. HTML is an application of SGML, but XML is designed so that HTML can also work in XML applications; i.e., HTML tags integrate seamlessly with the XML meta-logic. Thus, the link to the wider world is an integral part of the XML universe from the very start, because HTML can be processed via the standard parser interfaces. XML is backward compatible; i.e., XML browsers can also 'parse,' or interpret, HTML. It is even possible to write HTML documents that are XML-compliant. XML's backward compatibility allows a gradual transition from HTML, which will doubtless increase XML's acceptance. Currently, "only" XML browsers are lacking, but both Microsoft and Netscape have already announced that the next versions of their browsers will support XML.
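The idea of an XML-compliant HTML document can be illustrated with a short sketch. It assumes Python's standard xml.etree.ElementTree module as the XML parser; the two sample pages and the helper name is_well_formed are made up for illustration. Note that a strict XML parser of this kind accepts only markup in which every tag is properly closed and nested.

```python
import xml.etree.ElementTree as ET

# Hypothetical HTML page written so that it is also well-formed XML:
# every element is closed, including the empty <br/>.
XML_COMPLIANT_HTML = """<html>
  <head><title>Shoe sizes</title></head>
  <body>
    <h1>Shoe sizes</h1>
    <p>Size <b>42</b> is in stock.<br/></p>
  </body>
</html>"""

# Typical loosely written HTML: the <br> is never closed.
SLOPPY_HTML = "<p>Size <b>42</b> is in stock.<br></p>"


def is_well_formed(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The XML-compliant page can be navigated like any other XML document.
title = ET.fromstring(XML_COMPLIANT_HTML).findtext("head/title")
print(title, is_well_formed(XML_COMPLIANT_HTML), is_well_formed(SLOPPY_HTML))
```

Pages written in the first style can be handled by the same tools as any other XML data, which is what makes a gradual transition plausible.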

The major difference remains: In contrast to HTML, XML can structure data not only according to formal criteria (such as headings, running text, etc.), but also according to aspects of content. To be more precise: XML allows content-based structuring, since it functions on a different level than does HTML; it also allows a layout description that goes beyond the scope of HTML.

XML – a metalanguage that's simple

Since XML is a metalanguage like SGML, it is not really the independent markup language that its name would indicate. XML becomes whatever the users choose to create on the basis of XML. Every user can define new, individual tags according to his needs, such as "date of birth," "shoe size," or even "cooking time," should XML be used to exchange recipes. XML does not define these tags itself, but rather sets out how the tags are to be defined. Naturally, there are also XML tags which apply to all documents, and others which are needed to file definitions in documents.

XML thus offers a kind of grammar, namely the pattern of start tag, content and end tag, which the user can then fill in with the desired content, for instance: <date-of-birth>11/25/78</date-of-birth>. (Tag names may not contain spaces, so multi-word names are joined, here with hyphens.) It would also be possible, for example, for meteorologists to exchange weather data using their own tags, such as <temperature>, <air-pressure>, <wind-force>, etc., and to file these definitions in corresponding templates. XML-compatible applications could then process such Web pages directly, e.g. automatically evaluating weather data via the Web. It is then no longer absolutely necessary for the various user groups to agree on a syntax in the form of a Document Type Definition (DTD).
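How an application might evaluate such weather data automatically can be sketched as follows. This is only an illustration under assumptions: the parser is Python's standard xml.etree.ElementTree, and the element names, the unit attribute, the station value and the function read_weather are all hypothetical rather than part of any agreed meteorological format.

```python
import xml.etree.ElementTree as ET

# Hypothetical weather report; tag and attribute names are illustrative only.
WEATHER_XML = """<weather-report station="example">
    <temperature unit="C">18.5</temperature>
    <air-pressure unit="hPa">1013</air-pressure>
    <wind-force unit="Beaufort">4</wind-force>
</weather-report>"""


def read_weather(xml_text):
    """Return the measurements as a dict: tag name -> (value, unit)."""
    root = ET.fromstring(xml_text)
    return {
        child.tag: (float(child.text), child.get("unit"))
        for child in root
    }


readings = read_weather(WEATHER_XML)
print(readings["temperature"])
```

Because each value is labeled by its tag, the program does not have to guess which number is the temperature and which is the wind force.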

Although the application-specific tags only make sense when all users know them and have parsers set up accordingly, XML also accepts unknown tags: they are recognized as unknown and passed through uninterpreted rather than misinterpreted. Similarly, although XML also uses DTDs, a kind of document template, an XML parser can also process a document without a DTD if necessary. The architecture is thus very flexible and already designed with the idea in mind that even users processing very general information must be served.
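The handling of unknown tags can be sketched in a few lines. The example assumes Python's standard xml.etree.ElementTree module, which parses without consulting any DTD; the document, the KNOWN_TAGS vocabulary and the function split_known_unknown are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document mixing a tag this application knows with tags it does not.
DOC = """<person>
    <shoe-size>42</shoe-size>
    <wind-force>4</wind-force>
    <favorite-color>green</favorite-color>
</person>"""

KNOWN_TAGS = {"shoe-size"}  # assumed vocabulary of this particular application


def split_known_unknown(xml_text, known):
    """Parse without a DTD and separate known elements from unknown ones."""
    root = ET.fromstring(xml_text)
    known_values = {}
    unknown_tags = []
    for child in root:
        if child.tag in known:
            known_values[child.tag] = child.text
        else:
            # Unknown elements are reported as such, never reinterpreted.
            unknown_tags.append(child.tag)
    return known_values, unknown_tags


known_values, unknown_tags = split_known_unknown(DOC, KNOWN_TAGS)
print(known_values, unknown_tags)
```

The parser reads the whole document successfully; the application simply uses what it understands and leaves the rest untouched.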

Since the definitions are also entered in normal text, rather than in cryptic code symbols, everyone can read them; regardless of which XML parser was used by the person creating the document, all XML-encoded documents can be processed, stored and delivered. Thus, an orthopedist could also read the weather data. He just has to do without the specific functions of the given XML implementation, and thus cannot generate a weather report – however, he can also be sure that his XML browser will not mistake the wind speed for a shoe size, for example.

XML can thus be adapted flexibly to all conceivable applications. There are no limits to the imagination: Prices, author names, time or date information, key words, share prices, etc. can all be defined. For content of this type, it makes sense to set a syntax in the form of a document type definition (DTD) in relation to certain application areas (for example real estate agents, stock exchange services, publishers). It is then possible, for example, to execute targeted queries on the basis of the defined content-related criteria. The results of the evaluation of documents or Web pages which correspond to an XML standard can be processed directly in the application programs as well; they could independently extract price information or share prices and then process the information, for example. Definitions for CAD data or for x-rays can be generated in the same way.
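A targeted query on content-related criteria might look like the following sketch. It assumes Python's standard xml.etree.ElementTree module; the listing format, the cities and prices, and the function cities_under are invented for illustration and do not correspond to any real estate standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical real estate listings; the vocabulary and data are made up.
LISTINGS = """<listings>
    <property><city>Silver Spring</city><price currency="USD">250000</price></property>
    <property><city>Bethesda</city><price currency="USD">480000</price></property>
    <property><city>Rockville</city><price currency="USD">295000</price></property>
</listings>"""


def cities_under(xml_text, max_price):
    """Return the cities of all properties priced at or below max_price."""
    root = ET.fromstring(xml_text)
    return [
        prop.findtext("city")
        for prop in root.findall("property")
        if int(prop.findtext("price")) <= max_price
    ]


affordable = cities_under(LISTINGS, 300000)
print(affordable)
```

The query works on the price element as a price, not as an arbitrary number in running text, which is what a purely presentational format cannot offer.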

XML – universal standard for the exchange of data

XML has many uses in data exchange between different systems. It is more flexible than the rigid field concept of relational data, and extends the capabilities of interface standards such as CORBA and DCOM. Not least of all, XML can also describe more sophisticated user interfaces than the very simple HTML forms can.

Although the use of XML technology is discussed primarily in connection with the Web, the possible applications of XML extend far beyond it. For example, once an XML format for depicting molecular structures has been agreed upon, it becomes possible not only to run a targeted search on the Web for certain compounds, but also to store such information in a similar form in a database and to retrieve it from there. It is nearly impossible, however, to map such structures onto the fields of a relational database in a way that still permits a targeted search for individual components or compounds; the relational approach fails in the face of the complexity of the information. It is generally possible to store the information in an unstructured manner as text or as a graphic, but it can then only be viewed. An estimated 80 percent of all information is available to IT only in such an unstructured form; the rest can be accessed in databases in an elementary form. If a data structure is described in an XML format, however, it also becomes possible to conduct a targeted search in the database for specific chemical compounds. For databases, this represents a new, highly interesting area of application. The major suppliers are already working on the corresponding concepts. Mainly because of this ability to adapt flexibly to widely differing uses, XML is suitable as a universal format for the exchange of data in numerous areas, from electronic business to document management.

Given XML's universal nature, its possible applications are limitless. XML is truly a framework for all data, regardless of where it is stored, whether in enterprise databases or on the Web. How widespread XML actually becomes naturally depends to a great extent on the commitment shown by bodies of the IT industry, by trade associations and also by leading enterprises, which will create subject-related implementations for specific scenarios. There are already numerous initiatives in this area, and many definitions have already been generated using the XML standard, for example in healthcare or for depicting complex chemical structures: Chemical Markup Language (CML) enables the exchange of descriptions of molecules, formulas and other chemical data. The Open eBook standard, which a few US publishers have introduced for electronic publishing, is also based on XML. The XML/EDI group is also working on integrating XML into the EDI concept. Open Financial Exchange (OFX), the format used by Intuit Quicken and Microsoft Money to communicate with banks, is already in use.

It is an interesting new project from Germany, however, that is demonstrating just how great XML's impact on information technology could be: Software AG’s new database system will use the XML standard to define internal data structures. It does not just feature an interface to XML; rather, it uses XML as the basic definition language for documents and data.

It can be assumed that in the future, all essential electronic business applications on the Net will be based on both HTTP and XML. Every technology that already integrates these standards today is one that will prove viable for the future.

The example below, a description of patient information created using XML, shows how effectively XML can depict complex structures. The information consists of business data, text, and a reference to the x-ray image. Additional data can easily be added; the parser can analyze the syntax just as before.

  <Patient>
    <Name>Smith</Name>
    <First-name>Kevin</First-name>
    <Occupation>Forest Ranger</Occupation>
    <Shoe-size>42</Shoe-size>
    <Date-of-birth>10/25/1967</Date-of-birth>
    <Address>
        <Street>Ivy Way</Street>
        <House-number>8</House-number>
        <City>Silver Spring</City>
        <State>Maryland</State>
        <Zip-code>20904</Zip-code>
    </Address>
    <Insurance>
        <Insurance-co>State Farm</Insurance-co>
        <Insurance-no>999888777666</Insurance-no>
        <Patient-no>1234566</Patient-no>
    </Insurance>
    <Diagnosis>
        <Illness>Splayfoot</Illness>
        <x-ray image="http://picture23.gif"/>
    </Diagnosis>
  </Patient>
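
An application could read such a patient record as sketched below. The sketch assumes Python's standard xml.etree.ElementTree module and uses an abbreviated copy of the record with hyphenated element names, since XML names may not contain spaces; the function summarize and the keys of its result are hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the patient record, with hyphenated element names.
PATIENT_XML = """<Patient>
    <Name>Smith</Name>
    <First-name>Kevin</First-name>
    <Shoe-size>42</Shoe-size>
    <Address>
        <City>Silver Spring</City>
        <Zip-code>20904</Zip-code>
    </Address>
    <Diagnosis>
        <Illness>Splayfoot</Illness>
        <x-ray image="http://picture23.gif"/>
    </Diagnosis>
</Patient>"""


def summarize(xml_text):
    """Extract a few fields, including nested ones and an attribute."""
    root = ET.fromstring(xml_text)
    return {
        "name": f'{root.findtext("First-name")} {root.findtext("Name")}',
        "shoe_size": int(root.findtext("Shoe-size")),
        "city": root.findtext("Address/City"),
        "illness": root.findtext("Diagnosis/Illness"),
        "x_ray": root.find("Diagnosis/x-ray").get("image"),
    }


summary = summarize(PATIENT_XML)
print(summary)
```

Fields omitted from the abbreviated copy, or added later, do not affect the code: it addresses each value by name and nesting rather than by position.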

XML in Practice

  • Document Management and Text Processing

  • Electronic Data Interchange (EDI)

  • Web and Web Search Engines

  • Electronic Business and Electronic Commerce

  • Database Structures and Queries

  • Electronic Publishing