XML White Papers

Introduction to XML

By STEP Stürtz Electronic Publishing GmbH,
A Sponsor Member of OASIS

Content

The interest in XML comes mainly from two directions: the Web, which is reaching the limits of HTML, and SGML publishing, which is looking for a way onto the Web. This document addresses both groups. It gives a short description of XML, its origin, its concepts and its purpose. The following questions will be answered:

Background:

What is modern data processing supposed to achieve?

Current approach:

How does SGML meet these needs?

Next step:

Why a new language? Why is HTML not the right solution? What is the difference between SGML and XML? Which language should be used for which purpose?

All about XML:

What else is out there? Which XML languages are coming up? Which languages support XML solutions?

Conclusions and prospects:

What is the consequence? How is it going to continue?

Background: Demands on Information Processing

The tasks of Web data processing are not new: similar requirements have existed for a long time in the information processing of various industries:

Reuse instead of double data storage: in publishing houses where parts of information are used in several works or works are published in several formats (book, CD ROM); in technical documentation where parts of information are taken out of the same data pool and are reassembled again, e.g. for manuals or Web pages.

Hardware and software independence instead of laborious changes of formats: in industries with long-lived data, e.g., in the pharmaceuticals industry, where the documentation of a drug has to be available from the first laboratory test to the moment it is approved and launched on the market, independent of changes in the application software; in the aviation industry, which has to store and maintain its documentation of aircraft types for decades; in Web publication, a medium of presentation which has to function independently of the browser and hardware environment used.

Data exchange instead of multiple data entry: across department and company borders and between cooperating companies; on an intranet or the Internet.

Targeted retrieval instead of heterogeneous search results: in company-internal knowledge pools or in Internet retrieval, where information of all kinds has to be accessible to various users for various purposes.

The demands on electronic data processing on the Web as well as in other publishing and processing media can be reduced to a common denominator: the access to information as well as information processing has to be simple, targeted, flexible and independent:

simple, that is, largely automated and thus faster, less expensive and of consistently high quality

targeted and flexible, according to the information and its purpose

independent from the medium, i.e., compatible with any kind of hard- and software

In order to fulfil these requirements, SGML, a well-established standard, was developed. SGML is the origin of XML, which was created to enable information-oriented data processing on the Web. Both languages share this basic aim. The experience gained with the SGML standard was used in the development of XML and is of great value for its use.

Current Approach: SGML, a Silent Revolution

SGML (Standard Generalized Markup Language), an international standard since 1986 (ISO 8879), is a meta-language: It can be used to create new languages in order to describe any kind of information.

A document contains two kinds of information: content and structure. A bibliographical entry may have the following content:

Charles F. Goldfarb / Steve Pepper / Chet Ensign: SGML's Buyer's Guide

A unique guide to determining your requirements and choosing the right SGML and XML products and services

ISBN 0-13-681511-1

This content consists of the structural units (elements) "author, title, subtitle, ISBN", which follow one another in this order. In the markup, each author consists of the last name followed by the first name.

Using SGML, content and structure can be described in a standardized way:

<biblio-entry id="12">

<author><last-name>Goldfarb</last-name><first-name>Charles F.</first-name></author>

<author><last-name>Pepper</last-name><first-name>Steve</first-name></author>

<author><last-name>Ensign</last-name><first-name>Chet</first-name></author>

<title>SGML's Buyer's Guide</title>

<subtitle>A unique guide to determining your requirements and choosing the right SGML and XML products and services</subtitle>

<isbn>0-13-681511-1</isbn>

</biblio-entry>

The general definition of this structure is determined in the Document Type Definition (DTD):

<!DOCTYPE biblio-entry [
<!ELEMENT biblio-entry - - (author+, title, subtitle?, isbn)>
<!ATTLIST biblio-entry id CDATA #REQUIRED>
<!ELEMENT author - - (last-name, first-name)>
<!ELEMENT (last-name | first-name | title | subtitle | isbn) - - (#PCDATA)>
]>

It represents the structure rules, the building plan for all documents or document parts of the same type (e.g. bibliography, report, memo). A document is only valid SGML if it complies with the structure rules of the DTD.

The advantages compared with layout-oriented formats are considerable. They include:

Access to information: Since the documents are structured with regard to content (in the example author, title, ISBN etc.), the parts of the information and their relationships are available electronically and usable for various purposes and in different ways, e.g., for retrieval. In the example it is possible to formulate a search request which does not look for "Pepper" via full-text retrieval but as the last name of an author within a bibliographical entry. This is different from "Pepper" as a dictionary entry, which could be structured as follows:

<diction-entry>

<lemma id="1076">Pepper</lemma>

<definition no="1">Pepper is a hot-tasting spice which is used to flavour food</definition>

<definition no="2">A Pepper is a hollow green, red, or yellow vegetable with seeds. See pictures headed <link ref="11983">vegetables</link></definition>

</diction-entry>

In full-text retrieval, both occurrences of "Pepper" are counted as hits; a structured search makes it possible to distinguish between the different kinds of information. (What could be more different than authors and spices!)
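The difference between the two kinds of search can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample entries are shortened, the <pool> root element is hypothetical (it is added so that both entries form one document), and the function names are invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Shortened versions of the two sample entries, wrapped in a hypothetical
# <pool> root element so that they form a single document.
POOL = """
<pool>
  <biblio-entry id="12">
    <author><last-name>Pepper</last-name><first-name>Steve</first-name></author>
    <title>SGML's Buyer's Guide</title>
  </biblio-entry>
  <diction-entry>
    <lemma id="1076">Pepper</lemma>
  </diction-entry>
</pool>
"""

def full_text_hits(root, word):
    """Return the tags of all elements whose own text contains the word."""
    return [el.tag for el in root.iter() if el.text and word in el.text]

def author_hits(root, last_name):
    """Return the ids of bibliographic entries with an author of that last name."""
    return [
        entry.get("id")
        for entry in root.iter("biblio-entry")
        if any(ln.text == last_name for ln in entry.iterfind("author/last-name"))
    ]

root = ET.fromstring(POOL)
print(full_text_hits(root, "Pepper"))  # hits in both the author and the lemma
print(author_hits(root, "Pepper"))     # only the bibliographic entry
```

The full-text search reports the author and the dictionary lemma alike, whereas the structured search returns only the bibliographic entry.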

Single source, multiple outputs: A single data source can be used in different ways for different media. Since the source has a clear structure, the conversion can be automated. Links, for example, can be marked in print merely by formatting (e.g. italics), whereas they can become hyperlinks on a CD-ROM. Thus, independence from hard- and software is ensured: if, for example, the software used for printing changes, only the conversion into this output format is adapted and the data source remains unchanged.
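As a minimal sketch of this idea, the dictionary definition from the example above can be converted into two outputs from the same source. The function names, the asterisks standing in for italics and the "#entry-..." link target are assumptions made for this illustration only.

```python
import xml.etree.ElementTree as ET
from html import escape

SOURCE = (
    '<definition no="2">A Pepper is a hollow green, red, or yellow vegetable '
    'with seeds. See pictures headed <link ref="11983">vegetables</link>'
    '</definition>'
)

def render(definition, text, link):
    """Walk the mixed content once; the output format decides how to show it."""
    out = [text(definition.text or "")]
    for child in definition:
        out.append(link(child) if child.tag == "link" else text(child.text or ""))
        out.append(text(child.tail or ""))
    return "".join(out)

def to_print(definition):
    # Print output: the link is only emphasized (asterisks stand in for italics).
    return render(definition, text=str, link=lambda el: f"*{el.text}*")

def to_html(definition):
    # Electronic output: the same link becomes a hyperlink.
    return render(
        definition,
        text=escape,
        link=lambda el: f'<a href="#entry-{el.get("ref")}">{escape(el.text)}</a>',
    )

definition = ET.fromstring(SOURCE)
print(to_print(definition))
print(to_html(definition))
```

If an output format changes, only the corresponding function has to be adapted; SOURCE stays untouched.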

Quality control: SGML documents are parsed, which means they are checked for compliance with the rules of the DTD. Deviations result in error messages. In the example above, the sequence "first name, last name"

<first-name>Charles F.</first-name><last-name>Goldfarb</last-name>

is not allowed. In documents with strictly fixed structures, such as encyclopedias, the DTD can be used for automated quality control. In texts with relatively free structure, the DTD is formulated accordingly.
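The kind of check a validating parser performs can be imitated for this single rule. The sketch below is not a DTD validator (Python's xml.etree.ElementTree does not validate against DTDs); it hard-codes the content model of <author> from the sample DTD, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-coded stand-in for the rule
#   <!ELEMENT author - - (last-name, first-name)>
AUTHOR_MODEL = ["last-name", "first-name"]

def check_authors(entry):
    """Return one error message per <author> whose children violate the model."""
    errors = []
    for pos, author in enumerate(entry.iter("author"), start=1):
        children = [child.tag for child in author]
        if children != AUTHOR_MODEL:
            errors.append(f"author {pos}: expected {AUTHOR_MODEL}, found {children}")
    return errors

good = ET.fromstring(
    "<biblio-entry><author><last-name>Goldfarb</last-name>"
    "<first-name>Charles F.</first-name></author></biblio-entry>"
)
bad = ET.fromstring(
    "<biblio-entry><author><first-name>Charles F.</first-name>"
    "<last-name>Goldfarb</last-name></author></biblio-entry>"
)

print(check_authors(good))  # no messages
print(check_authors(bad))   # one message for the wrong sequence
```

A real parser derives such checks from the DTD itself instead of from hand-written rules.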

In many sectors of information processing, these advantages were recognized a long time ago and have been used ever since. Company- and industry-wide DTDs have been established for particular applications in order to optimize the interchange of homogeneous information, for example in the automotive (J2008) and aviation industries (ATA 100, AECMA), in publishing (ISO 12083, Majour) or in software documentation (DocBook). HTML is one of those DTDs. It was developed for presenting and processing information on the Web.

Next Step: XML, the Web Revolution

XML (eXtensible Markup Language), a recommendation of the World Wide Web Consortium (W3C) (www.w3.org), is a subset of SGML. XML is a meta-language that makes information structured according to its content generally available and interchangeable: with any kind of application, in various presentations, for different target groups and for different purposes.

Why is HTML not the right solution?

XML is not an improvement of HTML, it is a change in concept.

HTML is a language. XML is a meta-language: a language for generating languages, any kind of language, perfectly suited to the respective purpose, for marking up information of any kind.

Only this change in concepts makes it possible to surmount the limits.

Flexibility instead of fixed structure: Whereas HTML provides only a fixed set of elements for heterogeneous information, XML is able to generate elements which are tailor-made for particular information types. One good example is an SGML document which is structured according to an industry-specific DTD. Usually, essential information is lost in Web publishing since HTML does not provide sufficient modes of expression. XML makes this information "Web-capable."

Access to information instead of only to layout: HTML offers limited possibilities for structuring documents according to their content. It is mainly a layout-orientated language for the display of documents. Targeted retrieval, for example, is not possible with HTML, but it is with XML and SGML data (e.g., search for the last name of an author).

Control: Just as in SGML, it is possible to parse documents. The XML DTD can be formulated in such a way that the basic requirements on the structure of the documents (e.g., sequence, existence of particular elements) are automatically verifiable (via an XML parser).

Differences between SGML and XML

The differences between SGML and XML arose from the aim to develop a meta-language especially for the needs of the Web and to promote a fast establishment of this language on the Web.

Simple implementation: The language capacity of XML is limited, and therefore the development of applications for XML is less complex than for SGML. The dissemination of XML for Web publishing is thus favored.

DTD not necessary: XML documents can be used without a DTD (Document Type Definition). Thus, XML can be used both for structuring content and as a pure presentation tool, depending on the purpose. For presenting and downloading data, the document can be used on its own in order to facilitate and accelerate processing. If the structure of the document is relevant and has to be verifiable, the document is transported together with its DTD.

Even if the DTD is missing, XML documents have to be well formed, that is, their structure has to fulfill specific preconditions so that they can be interpreted and processed correctly in all applications. The most important criteria of well-formedness are:

There is exactly one root element.

All elements that are not empty have to be marked with start and end tags.

The order of the elements is hierarchical: an element A that starts within an element B also ends within B.

An attribute must not occur twice in one element.

All entities that are used have to be declared.
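These criteria can be tried out with any non-validating XML parser. The following sketch uses the parser from Python's standard library; the helper name is_well_formed and the sample strings are made up for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as a well-formed document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

# One sample per criterion; only the first one is well formed.
SAMPLES = {
    "one root, properly nested": "<a><b>text</b></a>",
    "two root elements": "<a/><b/>",
    "missing end tag": "<a><b>text</a>",
    "overlapping elements": "<a><b></a></b>",
    "attribute used twice": '<a id="1" id="2"/>',
    "undeclared entity": "<a>&unknown;</a>",
}

for label, sample in SAMPLES.items():
    print(f"{label}: {is_well_formed(sample)}")
```

No DTD is involved in any of these checks: well-formedness is a property of the document alone.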

In order to fulfil the criterion of simplicity (which means simple implementation), some constructs are regulated more strictly than in SGML and many possibilities of expression are missing. Regarding documents, here are some examples:

Unicode is intended as a standard character set which has to be processable by all XML processors.

Attribute values have to be in quotation marks. They must not contain External Entity References, and the character "<" is not allowed in them.

For elements of the type EMPTY there are two possible notations. Either they are marked up with their own Tag

<example/>

or with start and end tag

<example></example>

For elements that are empty only in the current context, both notation possibilities exist.

Processing Instructions are marked differently:

<? Processing Instruction ?>

and start with the Target Name indicating the application which evaluates the Processing Instruction.
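Two of these points, the two notations for empty elements and the target name of a Processing Instruction, can be observed with a DOM parser. The sketch below uses Python's xml.dom.minidom; the file name style.xsl and the surrounding <doc> element are hypothetical and serve only as sample data.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<?xml-stylesheet href="style.xsl" type="text/xsl"?>'
    "<doc><example/><example></example></doc>"
)

# The Processing Instruction is the first node of the document. Its target
# names the application; the rest is passed on as uninterpreted data.
pi = doc.firstChild
print(pi.target)  # xml-stylesheet
print(pi.data)    # href="style.xsl" type="text/xsl"

# Both notations for an empty element lead to the same result in the parser.
first, second = doc.documentElement.childNodes
print(first.toxml() == second.toxml())
print(first.hasChildNodes(), second.hasChildNodes())
```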

The difference is even greater regarding DTDs. Some of these differences concern only the notation of the DTD and therefore do not restrict data modeling:

Comments are not allowed within declarations, only in isolation (<!-- comment -->).

Name Groups are not allowed within attribute and element declarations. The construction

<!ELEMENT (last-name | first-name | isbn) - - (#PCDATA)>

is changed into three separate element declarations in XML:

<!ELEMENT last-name (#PCDATA)>

<!ELEMENT first-name (#PCDATA)>

<!ELEMENT isbn (#PCDATA)>
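Because this change is purely mechanical, it lends itself to automation. The following is a rough sketch, not a complete SGML-to-XML converter: the regular expression handles only simple declarations like the one above, the function name is hypothetical, and the content model is copied unchanged, so it must already be acceptable in XML.

```python
import re

# Matches declarations of the form
#   <!ELEMENT (name | name | ...) - - (content model)>
# The two tag minimization parameters ("-" or "O") are dropped, because
# XML has no tag minimization.
NAME_GROUP_DECL = re.compile(
    r"<!ELEMENT\s+\(([^)]*)\)\s+[-oO]\s+[-oO]\s+(\(.*\))\s*>", re.DOTALL
)

def split_name_group(sgml_decl):
    """Turn one SGML declaration with a name group into separate XML declarations."""
    match = NAME_GROUP_DECL.fullmatch(sgml_decl.strip())
    if match is None:
        raise ValueError("not an element declaration with a name group")
    names = [n.strip() for n in re.split(r"[|,&]", match.group(1)) if n.strip()]
    model = match.group(2)
    return [f"<!ELEMENT {name} {model}>" for name in names]

for decl in split_name_group(
    "<!ELEMENT (last-name | first-name | isbn) - - (#PCDATA)>"
):
    print(decl)
```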

There are also many possibilities of data modeling in SGML which are not available in XML. The most important differences:

There is no support for DATATAG, OMITTAG (that is, no Minimization Parameters: start and end tags have to be set), RANK, LINK, CONCUR, SUBDOC, FORMAL, SHORTREF (that is, no Short Reference Delimiters are possible), USEMAP.

Inclusions and exclusions are not possible.

The And-operator is missing: The construction

<!ELEMENT name - - (first-name & last-name)>

(<name> consists of <first-name> and <last-name> in any sequence) is not possible in XML.

The type RCDATA in element declarations does not exist.

Mixed Content within one element declaration is only possible in the following way:

(#PCDATA | sup | sub | emphasis)*

#PCDATA has to be in first place, "*" has to be used as an indicator of frequency.

The attribute types CURRENT, NUTOKEN(S), NUMBER(S), NAME(S) are missing.

SDATA Entities do not exist.

External Entities have to be declared with a System Identifier, as an option additionally with a Public Identifier. The System Identifier is a URI (e.g. the URL or the path name of a file).

Some of those constructions which XML is missing simplify data modeling considerably, such as exclusions within elements (Example: A paragraph appearing within a footnote must not contain footnotes). In order to develop deeply structured and verifiable DTDs for information, SGML should be preferred as a source format.

Conversion

Regarding documents, the conversion from SGML to XML can be accomplished without any problems. SGML is a perfect source format for the output of XML data for Web publishing. The conversion is easy and thus easily automated.

Generating an XML DTD that conforms to an already existing SGML DTD is more complicated and cannot be automated completely. For example, when resolving inclusions, the automatic transformation (the element is allowed everywhere) often does not make sense. Tools and applications that facilitate the conversion are already available.
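One construct that such a conversion has to resolve is the And-operator described above. A common rewrite, shown here only as a sketch with a hypothetical function name, replaces the And-group by a choice between the possible sequences. The alternatives are nested so that each one starts with a different element, which keeps the content model unambiguous.

```python
def expand_and_group(names):
    """Rewrite an SGML And-group (a & b & ...) as an XML content model.

    Each alternative begins with a different element and continues with the
    expansion of the remaining ones.
    """
    if len(names) == 1:
        return names[0]
    alternatives = []
    for i, name in enumerate(names):
        rest = names[:i] + names[i + 1:]
        alternatives.append(f"({name}, {expand_and_group(rest)})")
    return "(" + " | ".join(alternatives) + ")"

model = expand_and_group(["first-name", "last-name"])
print(f"<!ELEMENT name {model}>")
```

The size of the result grows factorially with the number of elements, so this rewrite is practical only for small groups; for larger ones the data model usually has to be reconsidered by hand.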

Which language should be used for which purpose?

The question whether XML or SGML is the right source format is not easily answered for all cases.

There is no question about existing SGML environments where highly functional applications are already available and the processing of SGML data is carried out effectively and is automated to a great extent. XML can be used as a further output format for Web publishing or data interchange. The conversion of existing SGML data into the output format XML is generally simple.

XML is also suitable as a source format for simply structured documents and data that are only used for Web publishing and that cannot be encoded optimally with HTML (e.g., because they have to be retrievable according to content criteria). A detailed analysis of the data structure and purpose of use is required, however, to be able to take advantage of the entire range of XML's functionalities. Although it is not a requirement within XML to develop DTDs, it is absolutely recommended for professional data processing and maintenance.

It could be said in general that SGML offers the perfect basis for the entry and editing of deeply structured information as the entire language capacity can be used for data modeling. XML is the perfect language for data interchange and presentation in the Web. SGML and XML can also be optimally used in combination with SGML serving as the editing format and XML as the output and exchange format. XML documents as well as DTDs (if they are required) can be derived from SGML.

All about XML: Related Languages and Derivatives

Being a meta-language, XML is more than a tool for semantic structuring of information in documents. With XML it is also possible to create languages which standardize the treatment of any kind of information, such as the presentation of data (style), data transfer, access of applications to the structure, description of the relationship between data and data modules (hyperlinks, addresses) or the communication between applications.

It is no wonder, then, that many languages supporting the processing of XML data are derived from XML itself.

Intelligent links: XLink, XPointer

XLink (XML Link Language) and XPointer (XML Pointer Language) are XML languages for links and addresses: They are used to describe the relationships between information units. They combine the functionalities of the link concepts of TEI (Text Encoding Initiative) and HyTime.

With its Simple Links, XLink provides a link structure that exceeds the capabilities of HTML-links: It enables a more specified classification of links and link targets as well as the determination of presentation and conversion with the applications.

With XLink's Extended Links in combination with XPointers it is possible to keep links and link targets outside the documents concerned. Thus it is possible to set links and link targets also in write-protected documents (the usual case on the Web) and to define different link networks for documents, opening up knowledge pools from different points of view and for different objectives.

XLink and XPointer have been Working Drafts of the W3C (World Wide Web Consortium) since March 98.
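As an illustration of a Simple Link carrying more information than an HTML link, the sketch below reads XLink attributes with Python's standard library. It rests on an assumption that goes beyond this paper: the attribute names and the namespace are those of the later XLink 1.0 Recommendation, not of the Working Draft mentioned above, and the element names and the target vegetables.xml are made up.

```python
import xml.etree.ElementTree as ET

# Namespace of the XLink 1.0 Recommendation (an assumption, see the lead-in).
XLINK = "http://www.w3.org/1999/xlink"

DOC = f"""
<definition xmlns:xlink="{XLINK}">See pictures headed
  <link xlink:type="simple" xlink:href="vegetables.xml" xlink:show="new">vegetables</link>
</definition>
"""

def simple_links(root):
    """Return (text, target, presentation) for every simple link in the tree."""
    links = []
    for el in root.iter():
        if el.get(f"{{{XLINK}}}type") == "simple":
            links.append(
                (el.text, el.get(f"{{{XLINK}}}href"), el.get(f"{{{XLINK}}}show"))
            )
    return links

print(simple_links(ET.fromstring(DOC)))
```

Any element can act as a link in this way; the "show" attribute tells the application how the target is to be presented.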

More than just layout: XSL

With XSL (eXtensible Style Language), the presentation of an XML-document in a Web browser can be determined. The potential of XSL exceeds the pure layout definition by far. It is, for example, possible to hide particular elements or to change the sequence of elements in the display. Furthermore, other transformations can be executed.

XSL is a Working Draft of the W3C as of August 98. It is planned to pass XSL as a Recommendation in 1999.
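What hiding and reordering elements means for the display can be sketched without XSL itself. The following Python fragment is not an XSL style sheet and does not use an XSL engine; it only imitates the effect on a shortened version of the bibliographical entry, and the function name and parameters are hypothetical.

```python
import xml.etree.ElementTree as ET

ENTRY = """<biblio-entry id="12">
<author><last-name>Ensign</last-name><first-name>Chet</first-name></author>
<title>SGML's Buyer's Guide</title>
<isbn>0-13-681511-1</isbn>
</biblio-entry>"""

def display_lines(entry, order, hidden):
    """Return the text of the child elements in display order, skipping hidden ones."""
    lines = []
    for tag in order:
        if tag in hidden:
            continue
        for el in entry.findall(tag):
            # Collect the text of the element and its children, normalizing spaces.
            lines.append(" ".join(" ".join(el.itertext()).split()))
    return lines

entry = ET.fromstring(ENTRY)
# The title is shown before the author, and the ISBN is not shown at all.
for line in display_lines(entry, order=["title", "author", "isbn"], hidden={"isbn"}):
    print(line)
```

The source document keeps its original order and its ISBN; only the presentation changes.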

New language for DTDs: XML-Schemas

With XML-Schemas it is possible to define document structures in the way DTDs do. Unlike DTDs, however, they are not formulated in a language of their own: they are XML documents themselves. The document and its definition of structure thus use the same language.

The definitions executed in schemas exceed the capabilities of the classical DTD-language by far. That way data types and relations between data can also be described with schemas. An XML-schema can, for example, define database models or the relation between data models and document types. Due to these and many other abilities XML schemas have become an important tool for describing information structures of any kind.

The W3C has to assess many schemas for different fields of application. DCD (Document Content Description for XML) is one schema for DTDs. The W3C received the specification as a Submission in August 98. XML-Data is a similar approach which describes also other data structures and relations. It has been a Note of the W3C since January 98.

New language for meta-data: RDF

With RDF (Resource Description Framework), meta-data structures ("data about data") are defined. A Web site, for example, could have the meta-data "author," "date of creation," "topic" and so on, which enable classification and retrieval on the Web site. The concept of RDF is based on a simple technique and is therefore suitable to describe meta-data structures of information no matter how complex they are.

The specification in the RDF-syntax was submitted to the W3C in August 98 as a Working Draft. Also the specification of RDF schemas has been a Working Draft as of August 98.

Standards for interfaces: DOM, SAX etc.

DOM (Document Object Model) is an abstract API (Application Program Interface) describing the access of applications and programs to Web documents in a standardized way, e.g., the navigation in an HTML or XML document. DOM (Level 1) has been a Proposed Recommendation of the W3C since August 98.

SAX (Simple API for XML) is a simple, event-based API for XML parsers. It provides a standardized interface and thus makes it possible to combine an application, once programmed, with many different XML tools.

Many existing XML programs (XML parsers, XSL engines etc.) support SAX 1.0.
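The difference between the two interfaces can be shown on the sample entry. The sketch below uses the DOM and SAX implementations from Python's standard library (its xml.sax module follows the later SAX 2 interface rather than SAX 1.0); the class name LastNameHandler is made up for this illustration.

```python
import xml.sax
from xml.dom import minidom

ENTRY = (
    '<biblio-entry id="12">'
    "<author><last-name>Goldfarb</last-name><first-name>Charles F.</first-name></author>"
    "<author><last-name>Pepper</last-name><first-name>Steve</first-name></author>"
    "</biblio-entry>"
)

# DOM: the whole document is built as a tree, which is then navigated.
dom = minidom.parseString(ENTRY)
dom_names = [node.firstChild.data for node in dom.getElementsByTagName("last-name")]

# SAX: the parser reports events, and the application reacts to them.
class LastNameHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.names = []
        self._inside = False

    def startElement(self, name, attrs):
        if name == "last-name":
            self._inside = True
            self.names.append("")

    def characters(self, content):
        if self._inside:
            self.names[-1] += content

    def endElement(self, name):
        if name == "last-name":
            self._inside = False

handler = LastNameHandler()
xml.sax.parseString(ENTRY.encode("utf-8"), handler)

print(dom_names)
print(handler.names)
```

Both approaches deliver the same last names; SAX does so without keeping the whole document in memory.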

Standards for documents: MathML, CML etc.

In order to be able to define recurring structures (e.g., formulas, mathematical expressions, tables) in a standardized way which is processable by all applications, standard modules are developed in XML. Here are some examples:

MathML (Mathematical Markup Language): XML-DTD for defining mathematical expressions and formulas. MathML 1.0 has been a Recommendation of the W3C since April 98.

CML (Chemical Markup Language): SGML/XML-DTD for defining chemical formulas.

XML as a general data format: PGML, WIDL, SMIL, EDI, CDF etc.

XML is an abstract meta-language. With this characteristic it is not only suitable as a markup language for documents but also as a data format for various non-text information, e.g., graphics or the data transfer between applications. Here are some examples:

• PGML (Precision Graphics Markup Language): XML-based description of vector graphics. Note of the W3C since April 98.

• SMIL (Synchronized Multimedia Integration Language): DTD which enables the production of TV-like applications for the Web. Recommendation of the W3C, June 98.

• CDF (Channel Description Format): XML-based language which describes the automatic data delivery from the Web server to client programs (push channels). Recommendation of the W3C since October 97.

• EDI (Electronic Data Interchange): Language for describing the data transfer for which XML can be used as a universal format.

• WIDL (Web Interface Definition Language): Language that defines the interactions (Requests, Responses) between two applications in a standardized way via HTTP. Note of the W3C of September 97.

Conclusion and Prospects

XML is the future of the Web. The decisive factor is that XML can be tailored completely to the needs of users, the information they want to exploit and finally the application purpose.

The enormous possibilities are at the same time a great danger to the success of XML. XML is not a solution but rather a tool to develop solutions. They are as good or as bad as the concepts behind them. With XML, basically everything is possible: from a highly effective, information-orientated language to a layout-orientated HTML extension which basically does not exceed the possibilities of HTML and misses the potential of XML.

In order to be able to exploit the capabilities of XML, a detailed analysis of the documents and their purpose is a precondition. On this basis, concepts are developed which are geared exactly to the specific requirements. The success of XML depends on how professionally these tasks will be solved in the near future.

XML is the perfect platform for SGML-based systems in order to provide and exchange deeply structured information also on the Web.

XML is no "pie in the sky." The development is in full swing and gathering pace. For many sectors, user-specific formats (DTDs, schemas) will be established which enable and optimize the access to and interchange of homogeneous information. Soon there will be a large selection of tools, applications and programs that support XML. The differences in quality are already enormous. In this respect, a detailed analysis will be necessary to determine which applications stand up to the quality requirements and fulfil the individual conditions.

The other way around, applications will have to go along with the requirements of the market. For example, structural information (XML-Markup) is expected to not be shown for some application types but only for data entry and editing of deeply structured information.

The potential of XML does not only include documentation. There are already signs that XML will not merely revolutionize the possibilities of data modeling on the Web but will also contribute substantially to the standardization of data interchange formats and the communication between applications.

The future has already started!

"Introduction to XML" was written by STEP Stürtz Electronic Publishing GmbH (www.step.de). STEP is a sponsor member of OASIS, the Organization for the Advancement of Structured Information Standards (www.oasis-open.org).

OASIS is a nonprofit, international consortium dedicated to accelerating the adoption of product-independent formats based on public standards. These standards include XML, SGML and HTML as well as others that are related to structured information processing. Members of OASIS are providers, users and specialists of the technologies that make these standards work in practice.

© 1998 STEP Stürtz Electronic Publishing GmbH. All rights reserved.

The information in this document is subject to change without notice and does not represent a commitment on the part of STEP or OASIS. No part of this document may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording, or for any purpose without the express written consent of STEP.

STEP Stürtz Electronic Publishing GmbH
Department Consulting
Pavillon 7
Technologiepark Würzburg-Rimpar
D-97222 Rimpar
Germany

Tel: +49.(0)9365.8062.0
Fax: +49.(0)9365.8062.66
Email: consulting@step.de