XML: Chance and Challenge for Online Information Providers
By Hans Holger Rath, Ph.D.,
STEP Stürtz Electronic Publishing GmbH,
A Sponsor Member of OASIS
Introduction
Online information providers know that the sheer quantity of data is of little consequence. What the market needs and wants is the real value, isolated from an exponentially rising tide of information. Information in database tables is structured and easy to retrieve. But much more information is stored in documents, where full-text search returns too many and too imprecise hits. Metadata helps, but not sufficiently.
SGML (Standard Generalized Markup Language, ISO 8879) was - and still is - the solution that covers both content and semantic structure. It is the perfect document format for information retrieval. Together with the WWW (World Wide Web), HTML began its success story as an easy-to-use online document format for the Internet and intranets. Even though it is an SGML application, HTML unfortunately consists of neutral (= layout oriented) and not of semantic (= content oriented) elements. Full-text search hits are therefore again not very helpful.
This restriction was recognised, and the World Wide Web Consortium (W3C) developed XML (eXtensible Markup Language). XML is a true subset of SGML, limited to the features needed for online documents. The most important inherited feature is the ability to define application-specific semantic structure elements. This makes precise retrieval possible.
XML opens up new possibilities for the online information provider. Efficient creation, maintenance and storage of information in documents can be restricted to the content - layout will be added during publication. Document parts can be accessed, re-used, and re-composed for various needs. The "virtual document" generated on-the-fly from existing material opens up new products for the online market. Hyperlinks can lead to dedicated document parts instead of to the whole document.
The availability of XML is a big opportunity for the online market. Both providers and users will benefit when the advantages are recognised and implemented in online products.
The History of XML
The history of XML is closely intertwined with the history of electronic text processing, document processing, information processing, and publishing.
The grandparents of XML were born in the 1960s. IBM developed the Generalized Markup Language (GML) as a solution for its huge volume of internal publications. GML came with a batch processor and allowed reuse of the same source file for different outputs (paper and electronic). The syntax of GML was simple and allowed extensive markup minimisation, which reduced the data-capture effort. Another development in the 1960s was GenCode from the Graphic Communications Association (GCA). GenCode supported generic typesetting codes and was already aware of document types.
In the early 1980s both groups came together, took the best of their developments (syntax from GML, semantics from GenCode), and started to standardise the way to specify, define, and use markup in documents. The Standard Generalized Markup Language (SGML) was published by the International Organization for Standardization (ISO) as ISO 8879 in 1986 [6]. SGML was generic (providing a formalism for the definition of document structures), formal (allowing validation of a document's structure against its definition), structured (supporting complex document structures), semantic (separating content, structure, and formatting), and scalable (managing documents and document repositories of any size). The key idea of SGML is the Document Type Definition (DTD), a formalism to define application-specific document structures consisting of elements and attributes. Each DTD defines an application of SGML. Starting with only a few users, the SGML community grew steadily in the 1990s.
Some people at CERN - European Laboratory for Particle Physics - in Switzerland adopted the SGML syntax for their hypertext application in the late 1980s, which resulted in the HyperText Markup Language (HTML). Improvements to HTML and the definition of an HTML DTD made HTML 1.0 a real SGML application in 1993. The success of HTML came with the success of the World-Wide Web (WWW). Both were (and are) simple to use, but powerful. The growth rate of the Web was (and still is) exponential. No technology in the history of mankind has been adopted so fast by so many people. But the simplicity of HTML has its price. With its small, fixed tag set it cannot mark up all the information in the world in an appropriate way. HTML supports layout-oriented markup but not content-oriented markup. Therefore the coded information is presentable to humans but not accessible to machines. Searching the Web works on the full text with intelligent ranking algorithms, but it is a fact that the quality of query hits decreases as the number of available and found documents increases. Besides, there are other problems with HTML: many versions are in use on the Web, vendors have implemented browser-specific extensions, and more and more layout elements have been added to satisfy the needs of Web page designers.
SGML is too complex for the Web and HTML is too simple for complex Web sites. What has to be done? In 1996 a group of people started to simplify SGML for Web requirements and developed the eXtensible Markup Language (XML) [11, 2]. In February 1998 the World-Wide Web Consortium (W3C) published the XML specification as a W3C Recommendation. In March 1998 the first software tools offered XML support. On April 1st 1998 Netscape released the source code of its browser to the public under the name Mozilla; 4 days later someone integrated an XML parser into the software. In May 1998 Microsoft announced that XML would be a "Save as" format for Word, Excel, and PowerPoint.
Why Is XML Different?
XML inherits most of its differences from other formats and paradigms from SGML.
Text processing and DTP
Text processing systems (e.g., MS-Word and Corel WordPerfect) as well as DTP tools (e.g., Quark XPress, Adobe PageMaker, Corel Ventura) store documents in their proprietary binary formats. These formats are not open and change from software version to software version - sometimes without even remaining compatible. You can say that the documents are "owned" by the software vendors and not by the writers / publishing houses / information providers.
In addition, these binary formats combine content and layout so closely that the information itself cannot be accessed by a program; it is only readable by humans.
Publishing with text processing systems and DTP tools is media oriented, with paper still the preferred medium. Publication of the same source document on different media is possible, but is restricted to a "simulation" of the given layout, because layout is the only information available to the program. This is how "Save as HTML" and PDF generators work.
Macro-Controlled Batch Formatting
Batch formatting systems (e.g., DCF, nroff/troff, TeX) use macros to define formatting commands. These macros, just like the formatting language itself, are based on ASCII and well documented. The macros may be layout oriented or content oriented, although content-oriented macros are used less often.
In theory the macro abstraction level could be used for multiple media publishing, but in practice it is not. Macros support paper-oriented publishing.
HTML
As already mentioned, HTML is restricted to its predefined tag set. It is neither generic, nor is it well structured, semantic, or scalable. For example, information cannot be marked up as needed for information retrieval in large repositories.
A search in AltaVista for an SGML book written by Mr Pepper brings more than 200 hits, and none of the first twenty hits is useful. You find Usenet articles written by Mr Pepper and cooking recipes using pepper as seasoning. A search at amazon.com lists the book immediately, but there the search is done in a database with metadata (see below) and not in HTML pages. The same precise result is possible with XML because the query could look like "product category = book", "author = Pepper", and "title = SGML". Such a query is not possible today, but XML-aware search engines and common tag names will appear in the near future.
Furthermore, multiple media publishing is not supported very well because the markup is (screen) layout oriented and therefore not useful for paper.
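The structured query described above can be sketched in a few lines. This is only an illustration under stated assumptions: the catalogue, its element and attribute names (product, category, author, title), and the book title are invented for the example and do not come from any existing search engine or tag set. The sketch uses Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue; all element names and titles are invented.
CATALOGUE = """
<catalogue>
  <product category="book">
    <author>Pepper</author>
    <title>A Guide to SGML</title>
  </product>
  <product category="recipe">
    <author>Smith</author>
    <title>Steak with pepper sauce</title>
  </product>
  <product category="article">
    <author>Pepper</author>
    <title>Re: SGML question</title>
  </product>
</catalogue>
"""

def find_products(root, category, author, title_word):
    """Return titles of products matching category, author, and a title word."""
    hits = []
    for product in root.findall(f"product[@category='{category}']"):
        if product.findtext("author") != author:
            continue
        title = product.findtext("title", default="")
        if title_word in title.split():
            hits.append(title)
    return hits

root = ET.fromstring(CATALOGUE)
hits = find_products(root, "book", "Pepper", "SGML")
print(hits)
```

Because the word "pepper" in a recipe title and the name "Pepper" in an author element sit in different containers, the query no longer confuses them, which is the point of the comparison with full-text search.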
SGML
The basic concepts of SGML are great. Define a DTD with your elements, mark up your documents with your tag set, implement specific processing for your application, be independent of software and hardware, and have a reliable format that is human and machine readable for decades [1, 3]. Why is SGML too complex for the Web?
There are several technical reasons. Let us concentrate on the most important ones: the number of features and the validation of documents.
SGML is hard to implement in tools because of its large number of defined features, most of which are rather academic and very rarely used in real-world applications. XML concentrates on those features that bring the most benefit without being difficult to use and implement. Just to give an impression of the complexity of the two definitions: the SGML standard has about 500 pages, the XML recommendation about 35 pages.
SGML was not designed for a network environment. XML takes network bandwidth and client-server communication into account. XML makes the existence of a DTD optional: a document can be well-formed without one. Tools like browsers, databases, and full-text search engines do not need a DTD, because well-formed documents contain all the structural information that is necessary for processing. Only guided editing requires the DTD.
But keep in mind: XML is a true subset of SGML. This became possible with an official ISO Technical Corrigendum to the SGML standard in 1997.
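A minimal sketch of what "well-formed without a DTD" means for a tool: the parser below (Python's standard xml.etree.ElementTree) checks only that the tags nest correctly and then recovers the element structure from the markup alone. The memo document and its element names are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

def outline(element, depth=0):
    """List element names indented by nesting depth, using the markup only."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

good = "<memo><to>Sales</to><body><para>Hello</para></body></memo>"
bad = "<memo><to>Sales</body></memo>"  # end tag does not match start tag

print(is_well_formed(good), is_well_formed(bad))
print("\n".join(outline(ET.fromstring(good))))
```

Note that well-formedness says nothing about whether a memo is allowed to contain a body; that kind of rule is what a DTD adds.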
Databases
Information stored in relational databases is highly structured and open to queries of any kind. Queries over different database tables and the combination of keys, foreign keys, and values offer a large number of views on the same data.
But not all types of information "fit" into a database table. Such information can instead be tagged with semantic XML markup that allows direct access to the data portions. Relations in XML are either hierarchical (like parent-child relations) or are expressed by links. Queries over XML have to be performed by XML-aware repositories; these can be full-text search engines or relational or object-oriented databases. The key issue is the mapping of the XML document structure to retrievable components of the database.
In short, XML is not a competitor to databases. It can complement databases for information types that follow the document paradigm.
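One possible way to map document structure to retrievable database components is sketched below. It is an assumption-laden illustration, not a description of any particular product: each element becomes one row holding its path, its name, and its text, stored in an in-memory SQLite table. The table layout, the path notation, and the sample article are all invented for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

DOC = ("<article><title>XML</title><section>"
       "<para>First.</para><para>Second.</para></section></article>")

def shred(element, path, rows):
    """Collect one (path, tag, text) row per element, depth-first."""
    rows.append((path, element.tag, (element.text or "").strip()))
    seen = {}  # counts siblings with the same name to build unique paths
    for child in element:
        seen[child.tag] = seen.get(child.tag, 0) + 1
        shred(child, f"{path}/{child.tag}[{seen[child.tag]}]", rows)
    return rows

rows = shred(ET.fromstring(DOC), "/article[1]", [])

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE node (path TEXT PRIMARY KEY, tag TEXT, content TEXT)")
db.executemany("INSERT INTO node VALUES (?, ?, ?)", rows)

# Retrieve every paragraph in document order with an ordinary SQL query.
query = "SELECT content FROM node WHERE tag = 'para' ORDER BY path"
paras = [row[0] for row in db.execute(query)]
print(paras)
```

The hierarchy survives in the path column, so a relational query can still address a single paragraph or a whole section.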
Making XML Work
XML is not a tool you can buy off the shelf. It is a technology that needs careful set-up steps. Setting up an XML working environment can be compared with setting up a database. Without defining your relational table model and without implementing access routines you cannot load and retrieve data. And nothing is as important and crucial as the table model - when something is wrong with the model, everything else may be useless - even your data.
It is the same with XML. You have to analyze your problem, specify your requirements and goals, analyze your document structure and your workflow, define the DTD and workflow, select, configure, and integrate the tools, test your application, train your users, and maintain the application over years. And consider that the DTD is important and crucial for the whole application and all of your data.
Most of the listed items are known from any kind of project. But document analysis and DTD definition might be unknown and need some explanation.
An XML DTD is - like an SGML DTD - a formalism for the definition of document structures. These structures consist of nested hierarchies (a book contains chapters which contain sub-chapters), sequences (the front page of a book contains the title followed by the author followed by the name of the publisher), and alternatives (the heading of a chapter is followed by a paragraph or list or citation). The structural elements can be required (the heading of a chapter must be there), optional (the front page may have a sub-title), repeatable (there might be more than one author but at least one), and optional-repeatable (a chapter may have zero, one, or more sub-chapters). Furthermore each element may carry some attributes (the type of the list, the language of the document, the id of a link target). Before you can develop a DTD you have to analyze your data to determine the appropriate structures and their relations. Because the DTD is so important you should contact an expert before spending a lot of money on trial-and-error projects [9].
When you are part of a bigger user community it might be reasonable to define a DTD for this user group. It reduces the development costs for all users and simplifies data exchange.
What types of DTDs increase the quality of the data and the service of an online information provider? In general the answer is: "Use semantic elements to exploit the knowledge inside the information". The right selection of the semantic elements depends on your application and the goals you are aiming at.
Once you have your DTD you can start implementing it in the tools. Right now - this article is written in June 1998 - only a very small number of tools support XML, but this will change dramatically in the next year. When you want to start working with XML, each tool in your process chain has to support it: an authoring tool (e.g., editor) to capture and maintain the data, a data management tool (e.g., database) to store and manage the data, retrieval software (e.g., full-text search engine) to find the requested data, a carrier (e.g., CD-ROM, Internet) to distribute the data, and a browser (e.g., Internet browser) to view the data.
NOTE: Remember, XML is a subset of SGML. Some interesting features of SGML are missing which do not make much sense in a Web environment but which make a lot of sense in the authoring and data management environment. Therefore you have to analyze carefully - probably with the help of a consultant - whether it would be better to use SGML in the production environment and XML only in the online publication environment. Going from SGML to XML is trivial and cheap.
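The kinds of structure listed above (hierarchy, sequence, alternative, required, optional, repeatable) can be illustrated with a small sketch. It is not a DTD processor: the DTD fragment in the comment is hypothetical, and each content model is rewritten by hand as a regular expression over the names of an element's children, which is then checked with Python's standard re and xml.etree.ElementTree modules. Attributes are ignored.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical DTD fragment, shown for orientation only:
#   <!ELEMENT book    (front, chapter+)>
#   <!ELEMENT front   (title, subtitle?, author+, publisher)>
#   <!ELEMENT chapter (heading, (para | list | citation)+, chapter*)>
# Each model becomes a regular expression over space-terminated child names.
CONTENT_MODELS = {
    "book": r"front (chapter )+",
    "front": r"title (subtitle )?(author )+publisher ",
    "chapter": r"heading ((para|list|citation) )+(chapter )*",
}

def validate(element):
    """Return error messages for element and all of its descendants."""
    errors = []
    model = CONTENT_MODELS.get(element.tag)
    if model is not None:
        children = "".join(child.tag + " " for child in element)
        if re.fullmatch(model, children) is None:
            errors.append(f"<{element.tag}> has invalid content: "
                          f"{children.strip()}")
    for child in element:
        errors.extend(validate(child))
    return errors

GOOD = ("<book><front><title>T</title><author>A</author>"
        "<publisher>P</publisher></front>"
        "<chapter><heading>H</heading><para>x</para>"
        "<chapter><heading>H2</heading><list/></chapter></chapter></book>")
# BAD lacks the required author and the required chapter heading.
BAD = ("<book><front><title>T</title><publisher>P</publisher></front>"
       "<chapter><para>x</para></chapter></book>")

print(validate(ET.fromstring(GOOD)))
print(validate(ET.fromstring(BAD)))
```

The operators map directly: a comma becomes concatenation, ? marks an optional element, + a repeatable one, * an optional-repeatable one, and | separates alternatives.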
XML in the Online Information Market
"XML Inside" - A great marketing slogan for an online information provider! Really?
Success in the market requires more than buzzword-enabled products. XML itself is not a market, it is a technology, but you should use this technology to improve things and to innovate.
Improve the quality of your data and of the query hits of your search engine.
Identify the market you are in and communicate the business benefits to your customers.
Invent new views on the existing data.
Combine selected portions of information to answer questions nobody thought of before.
Figure out what your customers want to have - today, next year, next decade.
Make use of the technology - XML allows a smooth migration from conventional information processing to knowledge management.
Start today to be prepared for tomorrow.
There are a lot of well-known phrases that have to be filled with real-life applications. Some are listed below. Most of them are enabled or enhanced by using XML.
Content management:
The semantic markup in the information gives detailed access to the content. Each semantic content container (= element) provides a handle for that particular information unit and says something about the content - its semantics. All this information allows more flexible and more precise handling of the data. Content management with a large amount of data has to be supported by an editorial system or document management system.
Metadata management:
Metadata is information about information. Metadata can be used for retrieval purposes and for management of the information. Since metadata is "only" data, XML is an appropriate format to model and maintain it (see RDF in the Metadata section below). One paradigm is applicable to both, and synergy effects are possible.
Database publishing:
Well-structured documents in a database open up a wide range of publishing approaches out of the database. All of them can be fully automated. Publish only the latest changes as updates. Produce personalized editions depending on user profiles or user queries. Extract portions of existing data and compose them into new publications. Allow direct access to your "information base". You or other service providers can add value to your information by assigning metadata or link structures. Your "information base" becomes a "knowledge base". That is the major reason why the Web service Yahoo! is so successful - they added metadata to their information. But be prepared for your information to become "comparable" with the information of other providers.
Multiple media publishing:
Publish the same data on different media. Support paper, Web, and CD-ROM without changing the data - simply by applying different filters or styles. However, not all content can be published on multiple media without editorial changes. An electronic version of a paper document on the Web makes no sense without added value. PDF is not a real online format; it is a carrier to reproduce printouts at the user's site.
Single source and reuse of information in various publications:
Write once and publish the same portion of data in several publications many times. This goes hand in hand with database publishing.
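The idea of extracting portions of existing data and composing them into a new publication can be sketched as follows. The encyclopaedia source, the subject attribute, and the edition element are assumptions invented for this example; a real system would select on whatever semantic markup or user profile the application defines. The sketch uses Python's standard copy and xml.etree.ElementTree modules.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical single source; element and attribute names are invented.
SOURCE = """
<encyclopaedia>
  <entry id="e1" subject="physics"><title>Atom</title></entry>
  <entry id="e2" subject="music"><title>Fugue</title></entry>
  <entry id="e3" subject="physics"><title>Quark</title></entry>
</encyclopaedia>
"""

def compose(source_root, subjects, title):
    """Build a new document from copies of the entries on the given subjects."""
    edition = ET.Element("edition", {"title": title})
    for entry in source_root.iter("entry"):
        if entry.get("subject") in subjects:
            # Copy instead of move, so the single source stays untouched.
            edition.append(copy.deepcopy(entry))
    return edition

root = ET.fromstring(SOURCE)
edition = compose(root, {"physics"}, "Physics reader")
print(ET.tostring(edition, encoding="unicode"))
```

The same source can feed any number of such editions, which is what makes "write once, publish many times" practical.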
What Is Beyond XML?
XML comes with a bunch of further solutions for the online market. Some of them were/are developed in parallel, others are based on XML and work is still in progress. Most of them are so "hot" that it is hard to follow the specification updates. Keep this in mind when reading the following sections.
Styles and Representation
With HTML a fixed tag set is available and browsers "know" how to render it on screen or paper. With XML the tag set is up to the user and the browser does not "know" how to render all these tags. XML needs a style language that allows the description of various styles for all the existing/upcoming XML DTDs and well-formed documents.
The W3C is developing the eXtensible Style Language (XSL) [12] for this purpose. It is a more powerful alternative to Cascading Style Sheets (CSS). The roots of XSL are in the ISO standard ISO/IEC 10179 Document Style Semantics and Specification Language (DSSSL) [8]. The work on XSL should be finished at the end of 1998. A lot of SGML and XML software vendors have promised to support XSL as soon as its specification is stable. This will be the first time that semantic data and its (screen) representation become independent of hardware and software!
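The separation of content and style can be illustrated without any real style language. The sketch below is neither XSL nor CSS: the two "style sheets" are plain Python dictionaries invented for this example, each mapping an element name to an output template, and the recipe document is likewise an assumption.

```python
import xml.etree.ElementTree as ET
from html import escape

DOC = ("<recipe><name>Pepper steak</name>"
       "<step>Season the meat.</step><step>Fry it.</step></recipe>")

# Two hypothetical style tables: element name -> output template.
HTML_STYLE = {"name": "<h1>{}</h1>", "step": "<p>{}</p>"}
TEXT_STYLE = {"name": "== {} ==", "step": "- {}"}

def render(root, style, escape_text=False):
    """Render the children of root with the given style table."""
    lines = []
    for element in root:
        template = style.get(element.tag)
        if template is None:
            continue  # elements without a style rule are skipped
        text = element.text or ""
        lines.append(template.format(escape(text) if escape_text else text))
    return "\n".join(lines)

root = ET.fromstring(DOC)
as_html = render(root, HTML_STYLE, escape_text=True)
as_text = render(root, TEXT_STYLE)
print(as_html)
print(as_text)
```

The document itself never changes; only the style table does. A real style language adds selection by context, page layout, and much more.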
Hyperlinks and Addressing
An online format like XML is useless without hyperlinks. XML inherits simple cross references inside one document from SGML, but that is not sufficient. Links between documents are needed. And the new linking mechanism should be more powerful than the <a> link in HTML.
What are the problems with <a>? There are billions of links out there on the Web and it works. It works, but to be honest, managing these links is a maintenance nightmare. Why? All link sources - the <a> element with href attribute - are part of the data, and link targets inside a document - the <a> element with name attribute - are part of the document too. Every change to a link therefore means editing the documents themselves.
XLink [15] and XPointer [17] - both are developments of W3C - address these shortcomings. XLink is responsible for the link elements and their attributes (like <a> and attributes href, name). XPointer is responsible for the addressing of link anchors. Both borrow concepts from the ISO standard ISO/IEC 10744 Hypermedia/Time-based Structuring Language (HyTime) [7] and are still under development.
What are the differences from HTML links? Because the declaration of XML elements is up to the user, XLink provides a mechanism to declare which elements are link source elements. An XLink anchor element need not be part of the data - not even a link source - it can be stored in a separate file or in a database. This becomes possible because both source and target anchor can be addressed with the powerful addressing concepts of XPointer. These concepts allow absolute addressing with ids (like the name attribute of <a>) and relative addressing using "tree navigation" inside the XML instance (like 2nd child of 3rd child of root element).
Storing link information separately from the data opens up a large number of new applications and services for the online market. Good examples are reference works publishers and their co-operation with (one or more) "information broker(s)": the publisher provides all its information units (= entries in an encyclopaedia), and the information brokers add link structures "over" the units. These so-called Topic Navigation Maps are added value, and they exploit the information in a way which is much more interesting for the reader/customer than an alphabetical order in a printed version or some search hits on a CD-ROM.
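The two addressing styles, and a link kept outside the document, can be sketched as follows. This does not use XPointer or XLink syntax, which are still drafts; the functions by_id, by_steps, and resolve, the link record format, and the sample book are assumptions made up for illustration, implemented with Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

DOC = """
<book>
  <front><title>Reference Work</title></front>
  <chapter id="c1"><heading>One</heading><para>Alpha</para></chapter>
  <chapter id="c2"><heading>Two</heading><para>Beta</para></chapter>
</book>
"""

def by_id(root, ident):
    """Absolute addressing: the element whose id attribute equals ident."""
    return root.find(f".//*[@id='{ident}']")

def by_steps(root, steps):
    """Relative addressing: follow 1-based child positions from the root."""
    node = root
    for position in steps:
        node = node[position - 1]
    return node

def resolve(root, address):
    """Resolve a hypothetical (kind, value) address to an element."""
    kind, value = address
    return by_id(root, value) if kind == "id" else by_steps(root, value)

# A link stored outside the document: neither anchor is marked in the data.
LINKS = [{"source": ("id", "c1"), "target": ("steps", [3, 2])}]

root = ET.fromstring(DOC)
link = LINKS[0]
source = resolve(root, link["source"])
target = resolve(root, link["target"])  # 2nd child of 3rd child of root
print(source.findtext("heading"), "->", target.text)
```

Since the link table lives apart from the book, a third party could supply its own table for the same unchanged document.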
Data Modeling and Namespaces
XML DTDs can be defined using the well-known SGML syntax. But this syntax has its limitations because it cannot express more information than SGML, which was designed in the late 1970s and early 1980s. Therefore no object-oriented concepts and no database approaches are part of SGML DTDs.
Nowadays a document model should contain more information and not only a list of elements, their hierarchical relations and their attributes. XML-Data [16] combines this classical DTD information with relational table modeling (relations, keys, foreign keys, data types) and object oriented concepts like inheritance and supertyping. This is not called a DTD any longer, it is called an XML-Data schema.
XML will "produce" a lot of DTDs/schemas, and some XML documents will be composed from different source documents belonging to different DTDs. Therefore a mechanism is needed that takes care of element names and possible name conflicts. Namespaces in XML [13] are the solution for this requirement. They describe a way to distinguish elements that have the same name but come from different DTDs.
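A name conflict and its resolution can be sketched like this. Two assumptions apply: the namespace URIs and element names are invented, and the syntax shown is the prefix-and-URI form that Python's standard xml.etree.ElementTree module understands, which may differ from the working draft cited above.

```python
import xml.etree.ElementTree as ET

# Two vocabularies both define "title"; the URIs are hypothetical.
DOC = """
<order xmlns:bk="http://example.org/ns/book"
       xmlns:pr="http://example.org/ns/person">
  <bk:title>Industrial Markup</bk:title>
  <pr:title>Dr.</pr:title>
</order>
"""
NS = {"bk": "http://example.org/ns/book",
      "pr": "http://example.org/ns/person"}

root = ET.fromstring(DOC)
book_title = root.findtext("bk:title", namespaces=NS)
person_title = root.findtext("pr:title", namespaces=NS)

# Internally each name is qualified by its namespace URI, so the two
# "title" elements are different names to the parser.
tags = [child.tag for child in root]
print(book_title, "/", person_title)
print(tags)
```

The prefixes are only local shorthand; what distinguishes the elements is the URI each prefix stands for.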
Metadata
XML is doing a good job of structuring data. Each relevant part of the data can be searched, accessed and processed. But what about the information objects - the objects containing the data? They are stored in a database (e.g., in a Document Management System) or are part of the World Wide Web. Both kinds of "repositories" contain a tremendously large number of documents. All these information objects have to be created, maintained, managed, retrieved and delivered as well as published. The larger this number of objects becomes the more difficult it becomes to manage and search them.
Metadata is the technology that makes faster, more focused search and retrieval of information objects possible. Metadata is information about information. It supports not only searching but also the management of information objects and administration tasks. Metadata is added value to the information content itself, because it gives easier access to the requested information and brings information objects into new relations.
As already indicated, the usage of metadata in Document Management Systems and Editorial Systems is partly different from its usage on the Web. The major focus in a DMS/ES is the management of info objects and the administration of production processes. The major focus on the Web is searching. But both applications of metadata follow the same approach (attaching information to information) and have the same essential requirements: the definition is independent of data and application, the metadata is interchangeable, the concept is scalable over the number of info objects and the number of metadata fields, and the concept is implementable with existing database technologies. With the globalisation of the market, non-Latin languages are becoming more important. But existing full-text search engines do not fulfil the requirements. Searching in metadata could overcome this shortcoming.
The Resource Description Framework (RDF) [14] offers some interesting features: operation in low-memory environments, a protocol for metadata interchange, a tuple-based model that is therefore mappable to relational databases, association with multiple info objects, and trusted third-party descriptions of data with signatures. RDF is independent of XML: it uses XML as the expression language for the metadata and the relations (= links), but it is also applicable to any other data format. RDF is one of the "hottest" topics besides XML at the moment.
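Why a tuple-based model maps easily to a relational database can be sketched with a single table. This is a simplification and not RDF syntax: the (resource, property, value) statements, the property names, and the resource names are invented for the example, and Python's standard sqlite3 module stands in for the database.

```python
import sqlite3

# Hypothetical metadata statements: (resource, property, value).
STATEMENTS = [
    ("doc/42.xml", "author", "Pepper"),
    ("doc/42.xml", "language", "en"),
    ("doc/43.xml", "author", "Pepper"),
    ("doc/43.xml", "language", "de"),
    ("doc/44.xml", "author", "Smith"),
]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE statement (resource TEXT, property TEXT, value TEXT)")
db.executemany("INSERT INTO statement VALUES (?, ?, ?)", STATEMENTS)

def resources_with(db, **criteria):
    """Return the resources that carry every given property/value pair."""
    result = None
    for prop, value in criteria.items():
        rows = db.execute(
            "SELECT resource FROM statement WHERE property = ? AND value = ?",
            (prop, value))
        found = {row[0] for row in rows}
        result = found if result is None else result & found
    return sorted(result or [])

german_by_pepper = resources_with(db, author="Pepper", language="de")
print(german_by_pepper)
```

New metadata fields need no schema change, only new rows, which is one reading of the scalability requirement over the number of metadata fields.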
Some more ...
The Document Object Model (DOM) [10] provides standardized data structures and access functions for XML document APIs. DOM gives application programmers a fixed set of data and functionality to implement extensions to XML tools.
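As an illustration of such an API, the sketch below uses xml.dom.minidom from the Python standard library, one implementation of the DOM interfaces. It is offered as an assumption about what DOM-style programming looks like, since the specification cited above may differ in detail; the book document is invented.

```python
from xml.dom.minidom import parseString

doc = parseString("<book><chapter><heading>One</heading></chapter></book>")

# Read access: find elements by name and walk to the text node.
headings = doc.getElementsByTagName("heading")
first_text = headings[0].firstChild.data

# Write access: build a new chapter through the same standard interface.
new_chapter = doc.createElement("chapter")
new_heading = doc.createElement("heading")
new_heading.appendChild(doc.createTextNode("Two"))
new_chapter.appendChild(new_heading)
doc.documentElement.appendChild(new_chapter)

result = doc.documentElement.toxml()
print(first_text)
print(result)
```

Because the method names are standardized, the same sequence of calls should carry over to DOM implementations in other languages with little change.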
The XML/EDI Group is developing an XML application for commercial electronic data interchange [18]. XML/EDI provides a standard framework/format to describe different types of data - for example, a healthcare claim or a project status - so that the information, whether it is in a transaction, a catalogue, or a document in a workflow, can be searched, decoded, manipulated, and displayed consistently and correctly by implementing EDI dictionaries.
The ICE Ad-hoc Working Group works on Information & Content Exchange (ICE) [5], a proposed protocol designed to significantly reduce the cost of doing business online and increase the value of business relationships by facilitating the controlled exchange and management of electronic assets between networked partners and affiliates. This sounds rather interesting but until now it is "only" an initiative of a few companies.
Conclusions
Document-based online information has to be coded with semantic markup to allow efficient and precise searching, as well as advanced data management, extraction and composition. HTML is layout-oriented markup with a limited tag set, and, therefore, it is not a solution for these requirements. SGML fulfils the requirements but is too "heavy" for the Web. XML, as a simplified SGML, meets the needs and is designed for the Web.
Setting up a working XML environment requires several steps and tools. The steps have been covered; the tools will come during the next year(s). The most important part of an XML application is the DTD. The DTD is the "heart" of the application and controls data, applications, and spin-offs generated from the coded data. Whether XML or SGML is the appropriate format depends on the requirements of each application. SGML gives more control over editing and managing the data.
XML brings a lot of benefits to the online information market. Together with all the escorting new and hot formats, it is an enabling technology to do things better and to do new things. Be prepared for a revolution from a global information space into a universal knowledge network.
References
[1] Alschuler, L. (1995) ABCD ... SGML, International Thomson Computer Press, ISBN 1-850-32197-3.
[2] Bradley, N. (1998) The XML companion, Addison Wesley, ISBN 0-201-342855.
[3] Donovan, T. (1997) Industrial-Strength SGML, Prentice Hall PTR, ISBN 0-13-216243-1.
[4] Goldfarb, C.F. (1990) The SGML Handbook, Oxford University Press, ISBN 0-19-853737-9.
[5] ICE Ad-hoc Working Group (1998) Information & Content Exchange (ICE), http://www.vignette.com/Products/ice/0,1668,0,00.html.
[6] International Organization for Standardization (1986) Information processing - Text and office systems - Standard Generalized Markup Language (SGML), ISO 8879:1986.
[7] International Organization for Standardization (1992) Information technology - Hypermedia/Time-based Structuring Language (HyTime), ISO/IEC 10744:1992.
[8] International Organization for Standardization (1996) Information technology - Processing languages - Document Style Semantics and Specification Language (DSSSL), ISO/IEC 10179:1996.
[9] Maler, E.; Andaloussi, J.E. (1996) Developing SGML DTDs, Prentice Hall PTR, ISBN 0-13-309881-8.
[10] World-Wide Web Consortium (1998) Document Object Model (DOM), http://www.w3c.org/DOM/.
[11] World-Wide Web Consortium (1998) Extensible Markup Language (XML), http://www.w3c.org/TR/1998/REC-xml-19980210.
[12] World-Wide Web Consortium (1998) Extensible Style Language (XSL), http://www.w3.org/Style/XSL/.
[13] World-Wide Web Consortium (1998) Namespaces in XML, http://www.w3.org/TR/WD-xml-names.
[14] World-Wide Web Consortium (1998) Resource Description Framework (RDF), http://www.w3.org/TR/WD-rdf-syntax.
[15] World-Wide Web Consortium (1998) XLink, http://www.w3.org/TR/WD-xlink.
[16] World-Wide Web Consortium (1998) XML-Data, http://www.w3.org/TR/1998/NOTE-XML-data.
[17] World-Wide Web Consortium (1998) XPointer, http://www.w3.org/TR/WD-xptr.
[18] XML/EDI Group (1998) XML/EDI, http://www.xmledi.net/
"XML: Chance and Challenge for Online Information Providers" was written by Dr. Hans Holger Rath, STEP Stürtz Electronic Publishing GmbH (www.step.de).
STEP is a sponsor member of OASIS, the Organization for the Advancement of Structured Information Standards (www.oasis-open.org).
OASIS is a nonprofit, international consortium dedicated to accelerating the adoption of product-independent formats based on public standards. These standards include XML, SGML and HTML as well as others that are related to structured information processing. Members of OASIS are providers, users and specialists of the technologies that make these standards work in practice.
Copyright 1998 STEP Stürtz Electronic Publishing GmbH. All rights reserved.
The information in this document is subject to change without notice and does not represent a commitment on the part of STEP or OASIS. No part of this document may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording, or for any purpose without the express written consent of STEP.
STEP Stürtz Electronic Publishing GmbH
Department Consulting
Pavillon 7
Technologiepark Würzburg-Rimpar
D-97222 Rimpar Germany
Tel: +49.(0)9365.8062.0
Fax: +49.(0)9365.8062.66
Email: consulting@step.de