Preparing for the XML Wave
Michael Champion, Senior R&D Advisor,
Software AG
About the Author: Michael
Champion is a member of the W3C's Document Object Model Working
Group and co-editor of the core XML portion of the DOM Level 1
recommendation. Champion is currently a senior R&D advisor for new
technologies at Software AG.
What is XML and Why is the Wave Coming?
The current wave of worldwide interest in the Extensible Markup Language (XML) has come because it is widely touted as a solution
for a range of problems that plague those developing enterprise information and e-business
applications. For example:
- Microsoft Word is almost universally used in organizations (at least in the United
States!) to exchange documents, but the Word application is only available on the Windows
and Macintosh platforms, thus limiting the ability for information in Word documents to be
exchanged and re-used.
- HTML is very useful as a universally understood format for displaying text, but its
format is rigidly defined by the World Wide Web Consortium (W3C)
and the set of tags it defines cannot be extended without destroying the interoperability
of the data across applications and platforms.
- For "data" (as opposed to "documents", although there is a vast
fuzzy area between the two) stored in existing enterprise-level applications, there are
numerous problems exchanging the (often binary) information across platforms and between
applications. EDI solutions are cumbersome and expensive.
- Much data is stored in RDBMS systems that mostly use SQL as a common access mechanism,
but exchanging data across systems is much less standardized and far more problematic.
XML is a standard, simple, self-describing way of encoding both
text and data so that content can be processed with relatively little human intervention
and exchanged across diverse hardware, operating systems, and applications.
- The biggest advantage is simply that it is a standard. The standard syntax means
that you don't need to write your own parser, formatter, or transformer in order to load,
display, or manipulate data. Tools, techniques, knowledge, etc. acquired for one XML
application or vendor are generally applicable to other vendors. You're not locked in to a
single vendor's data formats ... or to a lesser extent, APIs and supporting tools.
- Since XML is a relatively simple standard to understand, use, and implement, it
has become widespread quite quickly. It is pervasive -- virtually every language,
computing platform, and even large-scale application has XML tools available for it. It is
a classic "80:20" solution, i.e. it supplies about 80% of the functionality of
competing technologies (such as MS Word for documents, EDI for data exchange) with perhaps
20% of the effort required to build enterprise-level solutions.
- The tags make XML data self-describing in a well-designed application. The tags
label the meaning of a piece of data, greatly reducing the difficulty of extracting useful
information out of a data stream or of using data from one application in another.
- While it is true that any data schema can be "normalized" to fit into a
relational data model (and most can be mangled somehow to work with EDI formats and
protocols), this is a job best tackled by experts when data become at all complex. XML
allows a much more direct and natural way of representing many common types of data that
have an intrinsically hierarchical structure but not rigidly fixed format.
So in brief, XML offers:
- A widely adopted standard way of representing text and data ...
- In a format that can be processed without much human or machine intelligence
...
- Exchanged across platforms, languages, and applications ...
- and used with a wide range of development tools and utilities.
Example - Comparing HTML and XML
XML is similar enough to HTML in its actual format (both are closely related to the
SGML markup definition language that has been an ISO standard since 1986) so that those
familiar with HTML can fairly easily pick up basic XML knowledge. There are two
fundamental differences:
- Separation of form and content -- HTML mostly consists of tags defining the appearance
of text; in XML the tags generally define the structure and content of the data, with
actual appearance specified by a specific application or an associated stylesheet.
- XML is extensible -- tags can be defined by individuals or organizations for some
specific application, whereas the HTML standard tagset is defined by the World Wide Web
Consortium (W3C).
Let's consider a simple example, the title and author information at the top of this
document:
<H1 ALIGN="CENTER">Preparing for the XML Wave </H1>
<H3 ALIGN="RIGHT"> Whitepaper to Accompany Presentations in Asia<BR>
October 1999 <BR>
Michael Champion<BR> Software AG<BR> <A
HREF="mailto:mike.champion@sagus.com">mike.champion@sagus.com</A><BR><A
HREF="http://www.softwareag.com"> http://www.softwareag.com </A>
</H3>
Note how the tags refer to the visual appearance of the information. When rendered by a
browser, a human familiar with conventions for papers such as this will recognize this as
the header of a formal paper, including the title, subtitle, author name, author's
affiliation and contact information. A computer program -- at least one not specifically
designed to recognize this format or one having enough "artificial intelligence"
to figure this out -- will not be able to process this information further, e.g. to put it
in an address book or bibliography database.
Let's see a possible XML encoding of the same data, with tags indicating what each
piece of data is rather than how it looks:
<headerinfo><title>Preparing for the XML Wave </title>
<subtitle> Whitepaper to Accompany Presentations in Asia</subtitle>
<date>October 1999 </date>
<author><firstname>Michael</firstname><surname>Champion</surname>
<affiliation>Software AG</affiliation> <emailaddr>
mike.champion@sagus.com</emailaddr>
<homeurl>http://www.softwareag.com </homeurl>
</author></headerinfo>
The basic approach of using tags to mark up the content remains the same, and the actual
text inside the tags is almost identical, but the fact that the tags themselves carry
meaningful semantic and structural information means that ordinary computer programs can
"understand" the data far more easily than is possible with unformatted ASCII
text, proprietary wordprocessor formats, or even HTML.
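As a minimal sketch of what "understanding" the data means in practice, the following Python fragment uses the standard library's ElementTree parser to turn the XML header above into an address book entry. The function name address_book_entry and the dictionary keys are illustrative assumptions, not part of any standard.

```python
import xml.etree.ElementTree as ET

# The XML encoding of the header shown above, whitespace and all.
HEADER_XML = """
<headerinfo><title>Preparing for the XML Wave </title>
<subtitle> Whitepaper to Accompany Presentations in Asia</subtitle>
<date>October 1999 </date>
<author><firstname>Michael</firstname><surname>Champion</surname>
<affiliation>Software AG</affiliation> <emailaddr>
mike.champion@sagus.com</emailaddr>
<homeurl>http://www.softwareag.com </homeurl>
</author></headerinfo>
"""


def address_book_entry(xml_text):
    """Pull the author's contact details out of a <headerinfo> document."""
    root = ET.fromstring(xml_text.strip())
    author = root.find("author")

    def field(name):
        # findtext() returns the element's text; strip() drops stray whitespace.
        return author.findtext(name, default="").strip()

    return {
        "name": f"{field('firstname')} {field('surname')}",
        "organization": field("affiliation"),
        "email": field("emailaddr"),
        "url": field("homeurl"),
    }


entry = address_book_entry(HEADER_XML)
print(entry)
```

No knowledge of page layout is needed: the program asks for each field by name, which is exactly what the HTML version cannot offer.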
How is XML Being Used Today?
Document Content Management
Content management systems typically store individual document components, provide
access control and revision control tools to ensure that the information in each component
is correct and up-to-date, then provide tools to assemble and publish the components into
complete documents. Likewise, content management systems often allow live data from
ordinary databases and graphics repositories to be incorporated into the text documents,
and often simplify the complexities of storing XML text in an underlying relational
database. SGML, and more recently XML, provides ways of structuring textual and graphic
document components, ensuring that a particular document maintains a pre-defined
structure, and of re-using bits of "boilerplate". Not surprisingly, many early
applications of XML have been in content management systems such as Chrystal Software's
Astoria, Inso's Dynabase, Texcel Information Manager, and Progressive's SIM.
SGML/XML has proved to be of great value in allowing authors to write a single source
document (or set of components) that can be variously published on paper, on CD-ROM, and
on the Internet. This is generally done by defining a different "stylesheet", or
set of rules defining how various XML elements are to be displayed, for each output
format. More recently, XML transformation systems such as XSLT have allowed this to be
taken another step, so that the same content may be displayed on ordinary HTML browsers,
XML browsers such as Netscape Mozilla and Microsoft IE 5.x, and even lightweight devices
such as cellphones and PDAs that support the "wireless markup language" (WML).
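The idea of one source with a separate set of display rules per output format can be sketched in a few lines of Python. This is not XSLT; the two rule tables below are hypothetical stand-ins for stylesheets, and the render function is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

# A single source document (a shortened version of the header example).
DOC = (
    "<headerinfo>"
    "<title>Preparing for the XML Wave</title>"
    "<subtitle>Whitepaper to Accompany Presentations in Asia</subtitle>"
    "<date>October 1999</date>"
    "</headerinfo>"
)

# Each "stylesheet" maps an element name to a display template.
# Both rule sets are illustrative only.
HTML_RULES = {
    "title": "<h1>{}</h1>",
    "subtitle": "<h3>{}</h3>",
    "date": "<p>{}</p>",
}
TEXT_RULES = {
    "title": "{}",
    "subtitle": "  {}",
    "date": "  ({})",
}


def render(xml_text, rules, escape_text=False):
    """Apply one rule table to the top-level elements of a document."""
    root = ET.fromstring(xml_text)
    lines = []
    for child in root:
        template = rules.get(child.tag)
        if template is None:
            continue  # no rule for this element: leave it out of this view
        text = (child.text or "").strip()
        lines.append(template.format(escape(text) if escape_text else text))
    return "\n".join(lines)


html_view = render(DOC, HTML_RULES, escape_text=True)
text_view = render(DOC, TEXT_RULES)
print(html_view)
print(text_view)
```

Supporting another output medium means adding another rule table; the source document and the rendering code stay unchanged.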
Glue, Wire and Duct Tape
As a recent article in
InfoWorld put it quite succinctly, "(XML) is emerging as a kind of
industrial-strength duct tape to fix cracks and fissures throughout an enterprise's
application foundation." Much of the current excitement surrounding XML involves its
potential in linking existing enterprise systems -- often ERP systems from SAP, Baan,
Peoplesoft, etc. -- with:
- Each other -- often organizational changes, mergers, and the general evolution of
technology and business force enterprise-level systems that were designed to stand alone
to exchange data with other "monolithic" systems. The ERP vendors themselves,
and such enterprise integration vendors as WebMethods
have used XML as the cornerstone for these integration efforts.
- Suppliers and Customers -- This has been traditionally the domain of
"Electronic Data Interchange", but EDI usage has been relatively limited
outside of very large businesses because of its cost and complexity. XML-based EDI has
come into its own in the last year or so because the universal transport mechanism
provided by the Internet has become linked with the universal data format provided by XML.
See the xmledi.org Web site for much more
information.
- Consumers -- The explosion of interest in e-commerce has made it imperative for
many enterprises to offer their goods or services for sale from Web sites accessible to
anyone with a browser. There are many products allowing one to build an online storefront,
but few offer links to underlying ERP systems, and the ERP vendors have not been quick to
offer sophisticated storefront software of their own. This has led to a paradoxical
situation where orders entered electronically by consumers over the internet often must be
re-entered by hand into the ERP systems that track inventory, schedule production, etc.!
Vendors such as Intershop have used XML as the
basis for data presented to and received from the user and to link with back-end
EDI and ERP systems.
One additional way in which XML can serve as "glue" deserves mention. Various
initiatives, especially XML-RPC and SOAP, have
defined means of performing remote procedure calls between a client application and a
server application using HTTP as the transport mechanism and XML to encode the details of
the function names, arguments, returned values, etc.
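To make the encoding concrete, here is a small sketch using Python's standard xmlrpc.client module, which can build and parse XML-RPC messages without any network connection. The procedure name get_stock_level and its arguments are hypothetical; in a real deployment the request text would be sent as the body of an HTTP POST.

```python
import xmlrpc.client

# Client side: encode a call to a hypothetical remote procedure
# get_stock_level(part_number, warehouse) as an XML-RPC request body.
request_xml = xmlrpc.client.dumps(
    ("A-1001", "Singapore"), methodname="get_stock_level"
)
print(request_xml)

# Server side: decode the same text back into a method name and arguments.
params, method = xmlrpc.client.loads(request_xml)

# Server side: encode the return value as an XML-RPC response body.
response_xml = xmlrpc.client.dumps((42,), methodresponse=True)

# Client side: decode the response; a response carries no method name.
(result,), _ = xmlrpc.client.loads(response_xml)
print(method, params, result)
```

The function name, the arguments, and the returned value all travel as ordinary XML text, so the client and server need not share a platform or programming language.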
Portals
"Enterprise Information Portals" provide coherent, personalized views of
disparate data in an enterprise, delivered over an intranet or the internet to anyone with
a browser. Portals can help classify and focus information to support specific internal
business objectives (such as workgroup productivity) or to provide information services to
customers. XML is central to many portal products (for example DataChannel RIO) because it
is rich enough to represent data from a number of sources yet flexible enough to be
formatted for display in browsers. This allows the portal itself to be a relatively thin
application that relies on underlying databases, ERP systems, search engines, and the Web
itself, for the underlying content and functionality.
Some Case Studies
XML is a fairly new technology, and to be perfectly frank most large projects with
which I'm familiar are in the conceptualization or development phase. There are a number
of vendors who offer pointers to case studies involving their own customers.
Some of the more interesting "state of the art" projects that are too early
in their evolution to fairly evaluate, but illustrate the kinds of things that large
organizations are trying to do with XML, include:
- mySAP.com - An ambitious collaboration between SAP
and WebMethods to expose ERP data merged with news, etc. in a portal-like way.
- The HL7 initiative - Healthcare data management may be
the "killer app" for XML; health records are partly structured but highly
variable, made up of both text and data, and issues such as privacy, security,
accessibility, and integrity make this an extremely challenging arena.
- The WAP/WML consortium - a collaboration among the major cellphone vendors and
enterprise-level system providers to define lightweight counterparts to HTTP and HTML for
extremely "thin" clients such as PDAs and cellphones.
There is one pioneering XML project that has been very successful, but has received
relatively little attention from the computer trade press -- the Wall Street Journal Interactive Edition. (The description
here is adapted from an Inter@ctive
Week article describing the project, and an online transcript of a conference
presentation by the manager of the project.)
The WSJ developers began by making a critical distinction, central to the XML
"dogma", but usually ignored in the rush to produce attractive Web pages under
great time pressure: There is a great difference between the content that one authors and
the content that one delivers.
"On the document that you author, the important thing to remember is that
authoring side documents should be optimized for the editor. The editor should be free of
the details of navigation, advertising, and other resources. Their view of the document is
often different from the final view. .... One is for the author's benefit, and one for the
customer's benefit."
Microsoft Word is used for the basic text entry and editing. Macros and small utility
programs written in Microsoft's Basic scripting language for Word help ensure that the
text matches the "DJML" DTD. Documents created in Word are saved in the Rich
Text Format (RTF). The Konstructor Suite from OmniMark Technologies Corp. converts the RTF files
into XML text for storage.
XML also allows intelligent searching within the Interactive Edition archive.
Searching based on XML tags provides better information by treating the tagged data
separately from other data. For example, searching for the company name tag of
"IBM" will return stories about IBM, but not stories in which IBM is mentioned
in passing.
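The difference between a tag-based search and a plain text search can be sketched as follows. The archive structure and the element names story, headline, company, and body are invented for illustration; they are not taken from the actual DJML DTD, and the stories are made up.

```python
import xml.etree.ElementTree as ET

# A hypothetical two-story archive (element names are assumptions).
ARCHIVE = """
<archive>
  <story id="1">
    <headline>Mainframe sales rise</headline>
    <company>IBM</company>
    <body>Quarterly results beat expectations.</body>
  </story>
  <story id="2">
    <headline>Startup ships new database</headline>
    <company>Example Corp</company>
    <body>The founders previously worked at IBM.</body>
  </story>
</archive>
"""

root = ET.fromstring(ARCHIVE)


def stories_about(company):
    """Tag-based search: match only the <company> element."""
    return [
        story.get("id")
        for story in root.iter("story")
        if any((c.text or "").strip() == company for c in story.iter("company"))
    ]


def stories_mentioning(word):
    """Plain text search: match the word anywhere in the story."""
    return [
        story.get("id")
        for story in root.iter("story")
        if word in "".join(story.itertext())
    ]


print(stories_about("IBM"))
print(stories_mentioning("IBM"))
```

The tag-based search returns only the story that is about the company, while the text search also returns the story that mentions it in passing.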
The power of XML was demonstrated when The Wall Street Journal subsequently
arranged to have its news presented to PalmPilot users via the AvantGo format. Subscribers
download information formatted for the limited Web browser abilities in these handheld
devices to their PCs, then sync them to their palm computers for later reading.
In short, the WSJ Interactive Edition successfully uses XML technologies to separate
form from content so that the content can be more efficiently processed by content
management tools, uses XML as both text and data, and "repurposes" the XML data
for the printed paper, the online HTML edition, and stripped-down HTML delivered to PDAs.
Criteria for Using XML in a Project Today
To summarize, with the standards and tools available now, XML can be a very appropriate
technology to consider for a project when some of the following conditions apply:
- Data must be exchanged across diverse systems ...
- The imported data must be processed with minimal human involvement ...
- Data must be presented in multiple formats or on a variety of media ...
- One can leverage existing products or specialized tools to ease the development.
Challenges Facing XML Developers
Hype Overkill
Ironically, one of the XML community's biggest challenges is in appropriately
addressing the inflated expectations caused by the rapid growth in its popularity. One
sees (in forums such as the XML
Developers Mailing List ) a certain amount of disillusionment when people realize that
adopting XML is no substitute for careful analysis and design in ensuring the success of a
project. For example, it's easy to mistake the excitement caused by XML making it feasible
to translate data from various existing systems running on different platforms into a form
that is accessible to all via browsers ... for an assertion that it is easy. If the
various underlying systems have radically different semantics for the data, or the data
schemas were poorly designed in the first place, or the applications were so poorly
written that it is nearly impossible to translate the outputs into XML, then it's not
likely that XML will be the "silver bullet" that turns a monstrosity into a
valuable system. XML can be valuable in presenting a common interface to a number of
disparate but rationally designed systems, but it will not in and of itself impose order
on chaos.
Maintaining Interoperability
Another way in which XML is a victim of its own success is the growing number of
XML-related standards and standards organizations addressing very similar problems with
rather different solutions. For example, BizTalk, xml.org, cXML, and RosettaNet (along with others) are involved with
defining XML schemas for common business documents such as catalogs, invoices and purchase
orders. Their membership overlaps, their missions are somewhat distinct, the types of
products vary ... but it is not clear at this time whether this diversity is good or bad
for the average e-business developer. Furthermore, it is easy to be cynical and interpret
much of this as attempts by different vendors and organizations to "own the
architecture" by being the first to define formats and protocols that become de-facto
standards. (The more established standards organizations such as ISO, the W3C, and the
IETF tend to not get involved in defining such industry-specific standards that are built
on top of XML and/or HTTP).
As with everything, the challenge is for each developer to determine -- for the
application in question -- which "standard" provides the best base to build from
and the best support for interoperating with the others. Building on top of a commonly
used format such as the BizTalk schemas does open up the possibility of using a range of
emerging tools that have bought into the "standard". But the whole purpose
of XML is defeated if people adopt the mindset (all too common among ERP system customers)
that if the tool does not fit the business need, change the business need to fit the tool!
In other words, XML is designed to be extensible and there are a large number of
transformation tools becoming available to facilitate conversion from one XML data format
to another, so don't get locked in to something that does not fit your needs.
Persistent Storage of XML
E-business applications today typically use XML that is stored in either
filesystems or relational database systems. Filesystems have obvious limitations for
serious applications -- the same ones that led to the widespread acceptance of RDBMS
systems 10-15 years ago, including the lack of good search tools, data integrity
maintenance, transaction processing, etc. Relational databases have these, but have
disadvantages when it comes to storing XML.
While relational databases create the context to the data through tables, columns,
joins etc., they work best with data that fits this structure. As soon as the data has
left the database, its meaning relies totally on the applications that process it further. In
complex environments this often leads to problems that are hard to fix, such as unexpected
application behavior and a lack of scalability and maintainability.
It takes a considerable amount of effort for a developer to work around the problems of
storing XML in a relational database in a way that is suitable for a real e-business. The
basic schema describing the hierarchical XML structure must be normalized into a form
storable in tables, but not "excessively" normalized to the point where
performance becomes unacceptable. Searchable data must be defined and (probably) stored
separately as metadata in the RDBMS. Queries on the XML structure must be translated into
SQL statements that access the underlying tables or full-text search engines. This is
generally a task requiring at least the skills of professional programmers, and often
those with specialized expertise in the underlying DBMS.
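The work described above can be illustrated with a deliberately small sketch using Python's built-in sqlite3 module. The order document, the two-table layout, and the function store_order are all assumptions chosen for brevity; a real schema would be considerably more involved.

```python
import sqlite3
import xml.etree.ElementTree as ET

# A hypothetical purchase order with a repeating <item> element.
ORDER_XML = """
<order id="PO-17">
  <customer>Example Corp</customer>
  <item sku="A-1001" qty="2"/>
  <item sku="B-2002" qty="5"/>
</order>
"""

conn = sqlite3.connect(":memory:")
# Step 1: normalize the hierarchy into tables. The repeating <item>
# element needs its own table, linked back to its parent by order_id.
conn.executescript("""
    CREATE TABLE orders (order_id TEXT PRIMARY KEY, customer TEXT);
    CREATE TABLE items  (order_id TEXT, sku TEXT, qty INTEGER);
""")


def store_order(xml_text):
    """Step 2: take the document apart and insert one row per element."""
    order = ET.fromstring(xml_text)
    order_id = order.get("id")
    conn.execute(
        "INSERT INTO orders VALUES (?, ?)",
        (order_id, order.findtext("customer")),
    )
    conn.executemany(
        "INSERT INTO items VALUES (?, ?, ?)",
        [(order_id, item.get("sku"), int(item.get("qty")))
         for item in order.findall("item")],
    )


store_order(ORDER_XML)

# Step 3: a question about the XML structure ("which orders contain an
# item with sku A-1001?") must be rewritten as SQL over the tables.
rows = conn.execute(
    """
    SELECT o.order_id, o.customer, i.qty
    FROM orders o JOIN items i ON i.order_id = o.order_id
    WHERE i.sku = ?
    """,
    ("A-1001",),
).fetchall()
print(rows)
```

Even for this trivial document the developer has to design the tables, write the code that takes the document apart, and hand-translate each structural query into a join. Reassembling the original XML from the rows is a further step not shown here.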
How Does Software AG Help XML Developers Overcome These Challenges?
Software AG has responded to the limitations of RDBMS storage of XML data with the
concept of an information server: an efficient and scalable architecture for
information staging, integration and exchange. An information server is not intended to
replace existing data storage concepts within the enterprise. It acts as an information
hub which administers all kinds of data in the enterprise. Information can either be
stored directly or just processed through the information server. In the latter case the
information server stores the remote location of this information or how to access it. The
second important task of an information server is to control the information flow to and
from the company. Due to the wide acceptance of XML as the proposed information standard
in the Internet, the Intranet and between applications within the enterprise, information
servers have to be able to act as gateways that centralize the control over the access
and the flow of information between those architectures.
Software AG has recently released its XML Information Server, known as Tamino.
Based on a small and extremely fast kernel technology that is able to process XML
natively, the so-called X-Machine technology, Tamino is the first database that allows
direct storage, integration and exchange of XML data. This guarantees high performance and
scalability since no extra layer for data conversion to and from XML is needed. In other
words, there is no mapping layer between the XML you see and the underlying
database structures. This eliminates having to do an analysis of which XML elements are to
be stored in an efficiently searchable manner and which are to be stored as something like
BLOBs.
Furthermore, unexpected changes in the format of a data stream, which is a key feature
of XML, can be processed based on the embedded meta data. Tamino accepts XML objects as
input and offers XML objects as output. Central interactive administration of multiple
local and remote databases is provided and can be carried out from multiple locations via
a GUI that runs in standard Web-browsers.
To protect investments made in the past, Tamino also provides integrated access to
existing legacy data residing in external data sources (e.g. Relational DBMS or data
created by Office applications). Tamino is an integration platform for describing,
managing and integrating information within and from outside the enterprise without loss
in security, availability or performance. Tamino is totally based on standards for
seamless integration into existing IT environments.
Finally, Software AG has been developing industrial strength database systems for
something like 30 years now. Tamino has been built upon the knowledge acquired from
developing the Adabas system, which is widely used in environments where absolute
reliability, near-infinite scalability, TRUE 24:7 availability, etc. are requirements.
Based on Software AG's proven technology, Tamino will be the fastest and most reliable
XML-database for Electronic Business.
XML and the Future of E-Business - From Duct-tape to Foundation Stone
Perhaps a majority of interesting business applications will be distributed across
platforms and servers and accessible via the internet. These will be accessible via
different clients, ranging from heavy specialized apps to Web browsers to
light PDAs and cellphones.
XML has the potential to become the leading software development platform for such
applications.
For example, Forrester Research has written that "The introduction of Visual Basic
1.0 catapulted Windows into the mainstream... A tool that allows simple business
connections to be crafted through scripting and document authoring ... will win big."
PC Week notes that "Using an application server, some scripts and XML, a company with
capable in-house developers could build a system that approximated the functionality of
even the highest-end [Web content management] systems."
Using XML in the System Architecture
XML will offer developers in the near future some very significant advantages when it
is deeply embedded in the core of an application, not just used as "a kind of
industrial-strength duct tape to fix cracks and fissures". These include:
Allowing naturally hierarchical data to be represented directly
Many "naturally occurring" data fit much more easily into a hierarchical XML
schema than a set of RDBMS tables.
Providing flexibility for handling free-format data
Data that don't match a predefined schema do not store easily, at least if their values
are to be queryable, in RDBMS and EDI systems; this is much easier with XML.
Supporting XML-aware tools directly
Tools that support the exchange of native XML data (or API standards such as DOM and
SAX) need relatively little customization to be made to work in a new application, or with
other native tools.
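As a brief illustration of the two API standards just mentioned, the sketch below reads the same value from a small document first through DOM and then through SAX, using the implementations in Python's standard library (xml.dom.minidom and xml.sax). The handler class name SurnameHandler is an assumption for this example.

```python
import xml.sax
from xml.dom import minidom

DOC = "<author><firstname>Michael</firstname><surname>Champion</surname></author>"

# DOM: the whole document is loaded into a tree that can be navigated.
dom = minidom.parseString(DOC)
surname_dom = dom.getElementsByTagName("surname")[0].firstChild.data


# SAX: the parser streams events to a handler; no tree is built.
class SurnameHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_surname = False
        self.chunks = []

    def startElement(self, name, attrs):
        if name == "surname":
            self.in_surname = True

    def characters(self, content):
        # Text may arrive in several pieces, so collect them all.
        if self.in_surname:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "surname":
            self.in_surname = False


handler = SurnameHandler()
xml.sax.parseString(DOC.encode("utf-8"), handler)
surname_sax = "".join(handler.chunks)

print(surname_dom, surname_sax)
```

Because both interfaces are standardized, code written this way needs little change when the underlying parser or the source of the XML is swapped out.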
Native XML Tools Will Make XML Development More Accessible to non-Experts
Text authoring
Even though XML data interchange gets more attention these days, it is still important
to easily and accurately enter XML text. Some XML developers have had success using
Microsoft Word as an XML text entry tool, and some vendors such as Interleaf and Arbortext
support exchange of XML content with Word reasonably well. There are also a large number
of free or cheap XML editing tools available, but only three native XML authoring tools
that can credibly claim to approach the ease of use of a wordprocessing program:
- SoftQuad XMetaL - The "mindshare"
leader and least expensive of the three here.
- Arbortext Adept - This was the dominant high-end
SGML editor
- Excosoft Documentor - another high-end tool from
a Swedish company that has recently gotten more name recognition outside of its European
base.
Business forms
What authoring tools are to text, forms definition and processing tools are to XML
data. The majority of the effort of building a traditional business application -- be it
in COBOL or Visual Basic -- tends to involve displaying forms, getting data entered into
the forms, validating the data, and storing it in a form suitable for later processing.
XML tools that make this easy have great potential to play the same role in the
development of XML e-business applications that the Dialog Editor has in Visual Basic.
- UWI.com - Has a suite of XML software that allows
organizations to conduct secure, verifiable business-to-business e-commerce transactions
on the Internet.
- JetForm XML Forms Architecture - A form definition
language that provides a definition of those elements critical to forms and the processing
of forms.
- Keyfile Keyflow Commerce - Has an XML engine and
forms designer that automates work processes over the Web.
Schema and stylesheet wizards
Just as forms technologies assist in the entry of XML data, schemas have the potential
to assist in the validation and transformation of XML data. There are currently a number
of schema "standards" in common use, ranging from the limited "DTD" construct
in the XML 1.0 specification, through a number of proposed formats that have more
functionality in various respects, to the draft W3C schema specification. Similarly, the
evolving XSL stylesheet and data transformation specification offers much useful
functionality, but authoring stylesheets is not easy and converting stylesheets based on
interim proposals to the ultimate standard will be a challenge. Tools that make it
feasible to use XML now but to migrate to whatever the eventual standards become obviously
can be of great practical value.
- Various IBM alphaWorks XML tools - Programmer
tools to parse, transform, analyze XML
- Extensibility XML Authority - Schema editor
and translator
- Infoteria XSL Style Wizard - Assist production of XSL stylesheets
Scripting / Information flow applications
Finally, there needs to be some way of building actual information processing dataflows
that use data in XML messages and documents to make decisions about how it should be
routed or processed. Of course this can be done with procedural programming languages such
as C++ or Java, but widespread adoption will require user-level scripting and visual
programming tools.
- Many tools support ECMAScript/JavaScript or VBScript
- WebMethods B2B has a Visual "flow"
development tool to build business integration processes.
- Bluestone Visual XML is a visual tool to build
DTDs, documents, and Java classes that store XML in any database
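A content-based routing decision of the kind described at the start of this section might look like the following sketch in a scripting language. The message formats, the approval threshold, and the queue names are all hypothetical; none of them come from the products listed above.

```python
import xml.etree.ElementTree as ET

# Hypothetical threshold above which an order needs human approval.
APPROVAL_LIMIT = 10000.0


def route(message_xml):
    """Decide where a message goes by looking inside the XML."""
    doc = ET.fromstring(message_xml)
    if doc.tag == "invoice":
        return "accounts-payable"
    if doc.tag == "order":
        total = float(doc.findtext("total", default="0"))
        return "manual-approval" if total > APPROVAL_LIMIT else "auto-fulfillment"
    # Anything unrecognized is set aside rather than silently dropped.
    return "dead-letter"


print(route("<order><total>250.00</total></order>"))
print(route("<order><total>50000</total></order>"))
print(route("<invoice><total>99</total></invoice>"))
```

The routing logic reads the element names and values directly, so adding a new message type means adding a branch rather than writing a new parser.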
Getting There From Here
Here are a few concluding suggestions:
Stick with the standards
It will often be necessary to work with interim standards (such as one of the schema
proposals or the draft XSL stylesheet language today) in order to build applications in a
timely manner. Likewise certain vendors encourage customers to use proprietary extensions
to their standard tools that offer more functionality, but tend to lock in customers to
the current supplier. While compromises are unavoidable, in the long run one is almost
certainly better off minimizing dependency on non-standard XML formats and tools in order to
maximize interoperability with other applications and maintain the flexibility to
migrate to better tools as they become available.
Use XML for outward facing data
One of the most compelling reasons for moving quickly toward using XML to
describe an application's content then transforming it into some display format (rather
than either storing HTML or converting directly from enterprise database data to HTML) is
the flexibility it offers in a world where display formats are in such flux. For example,
it's not at all clear when the browser on a typical user's desk will support native XML,
or whether it will support CSS and/or XSL stylesheets; cellphone browsers will probably
support WML, but it's not clear whether PDA browsers (and WinCE-based cellphones, if they
become a significant factor) will support WML or some stripped-down HTML dialect. In any
event, keeping data on the server side in XML allows any or all of these options to be
supported simply by adding transformation stylesheets.
Stage XML data in a persistent cache
Not all enterprise data will be converted to XML in the foreseeable future, no matter
how successful the XML standards groups and vendors are at implementing and promoting the
technology. Thus there will always be non-XML data in the e-business "soup" for
virtually all developers. It seems clear from this overview that much can be gained by
transforming such data to XML for interchange and display purposes. It also makes much
sense to store such transformed XML -- as well as XML received from outside or loaded from
native XML applications -- in a native XML data cache so that it is instantly available
when a user or business partner request comes in.