Related Topics: XML Magazine

Maximizing the Usefulness of XML

Is XML documents or data? Should XML be managed with a database or a document management system? Should XML be managed at all, or is it simply a data interchange standard? These are among the most common questions people ask about XML when they're trying to get a grip on what XML is.

These questions get at what we do with XML, not what it is. Another question: Is a computer a word processor or a video game? Neither? Both? Actually, a computer is more. Word processors are computers, as are video games; but it doesn't work the other way around. We cannot define a computer in terms of a single application, or even a multitude of applications for that matter. Similarly, we cannot define XML solely in terms of its uses.

XML is a standard for expressing information in a complete form. It contains not only data, but also context and attributes. XML is, in other words, informationally complete. That's why it's useful for so many different things. XML plays an important role in unifying information from disparate applications. It can likewise play a role in unifying disparate functions within applications. Does this mean that we can use XML for every aspect of what an application does - gather, present, store, manage, transport, transform, and process information? In a word - yes. There are a number of important advantages in doing this, among them:

  • XML is intuitive: It's easy to understand for technical and nontechnical users alike.
  • XML is simple and extensible: We can build applications faster, and change them much faster.
  • XML is standardized and ubiquitous: Applications can be built to be compatible with almost everything.
  • XML is flexible and heterogeneous: We can eliminate the rather excruciating exercise of trying to fit information in predefined containers.
  • XML expresses complete information: We can build unified and universal information models behind applications, instead of a series of redundant but different data models serving individual functions.
  • XML encompasses standards: Companion standards such as XML Schema provide robust and flexible controls for maintaining information integrity.

    As long as XML was used as a container for data it was sufficient to consider only syntax when building documents. To do more, we must consider grammar and style as well. Obviously, proper syntax is necessary for XML to be usable at all. Good grammar ensures that once XML information has been created, it can be subsequently interpreted without an inordinate need for specific (and redundant) domain knowledge on the part of application program components. Good style ensures good application performance, particularly when it comes to storing, retrieving, and managing information.

    Most application programs encompass the same basic functions - input, presentation, communication, processing, and information management. Although the same information underlies each of these functions, different data models have traditionally been employed to accommodate application components that accomplish these tasks. Even with XML, remodeling the same information in different ways for each component of an application is inefficient and yields programs that are buggy and difficult to maintain and change. A better way is to create a central, unified model for information that can adequately accommodate all functions of an application (see Figure 1).

    To create XML information models that can do more than accommodate single functions, we have to look at XML in a different way - as an information domain rather than simply a transport standard. For simple data transport applications we can use XML any way we like, as long as it is syntactically correct. To do more, complete, provable, and unambiguous XML models must be built. Creating robust universal information models in XML is easier than it may seem at first. The first step is to understand the basic patterns inherent to information expressed in XML.

    XML consists primarily of tags, attributes, and data elements. Tags provide context; in other words, tags describe what the data elements in their scope are. Attributes provide information about the data elements in their scope or indicate how to interpret them. Data elements represent data in the traditional sense. The structure of XML also provides information about hierarchy, groupings, relationships, and so on. It is possible to create meaningless XML. For example, you could take the entire text of a telephone book and simply put "<Telephone_Book>" at the beginning and "</Telephone_Book>" at the end. The result would be perfectly correct XML, but not useful XML. In this case XML is being used solely as a container for a block of data, and provides no context for the information contained therein. At the other extreme we have XML where all information is expressed in a semantically meaningful way. For example, consider the following XML fragment:

    <Telephone_Book_Listing>
      <Name>
        <Last> Smith </Last>
        <First> Tracy </First>
      </Name>
      <Telephone>
        <Area_Code> 719 </Area_Code>
        <Number> 555-1234 </Number>
      </Telephone>
    </Telephone_Book_Listing>

    The explicit patterns in the example above are:

  • This is a telephone book listing.
  • The listing's last name is "Smith".
  • The listing's first name is "Tracy".
  • The telephone number's area code is "719".
  • The telephone number is "555-1234".

    Patterns implicit in the example are:

  • "Smith" and "Tracy" belong to the instance group "Name" and are the last and first name, respectively.
  • "719" and "555-1234" belong to the instance group "Telephone" and are the area code and number, respectively.
  • Everything belongs to the instance group "Telephone_Book_Listing".
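
The two kinds of pattern can be made concrete with a short sketch. It is illustrative only: it uses Python's standard xml.etree.ElementTree, nests the four items in the Name, Telephone, and Telephone_Book_Listing groups named above, and the helper names explicit_patterns and implicit_groups are hypothetical, not part of any product mentioned in this article.

```python
import xml.etree.ElementTree as ET

# The telephone listing, nested in the groups named in the text.
LISTING = """
<Telephone_Book_Listing>
  <Name>
    <Last> Smith </Last>
    <First> Tracy </First>
  </Name>
  <Telephone>
    <Area_Code> 719 </Area_Code>
    <Number> 555-1234 </Number>
  </Telephone>
</Telephone_Book_Listing>
"""

def explicit_patterns(root):
    """Map each full tag path to the data element (leaf text) it describes."""
    patterns = {}

    def walk(elem, path):
        here = path + "/" + elem.tag
        if len(elem) == 0:  # no child elements: this tag holds a data element
            patterns[here] = (elem.text or "").strip()
        for child in elem:
            walk(child, here)

    walk(root, "")
    return patterns

def implicit_groups(root):
    """Map each grouping tag to the leaf tags found anywhere in its scope."""
    groups = {}
    for elem in root.iter():
        if len(elem) > 0:  # a tag with children acts as a group
            groups[elem.tag] = [d.tag for d in elem.iter() if len(d) == 0]
    return groups

root = ET.fromstring(LISTING)
patterns = explicit_patterns(root)
groups = implicit_groups(root)
```

Here patterns plays the role of the explicit patterns (each tag path is a candidate query term), while groups records which items belong together, which is the information a query engine would need to intersect result sets correctly.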

    Explicit patterns are typically converted to database columns with some or all of them being available as query terms. At least one XML information management system (NeoCore XMS) automatically organizes itself around these natural XML patterns and indexes them all without the need for any database design. Implicit patterns are used to determine groupings, relationships, and sometimes convergence points for query set intersections.

    Much like a Web browser can determine how to display HTML information based on presentation metadata embedded in the HTML, application components can determine how to treat and interpret XML information based on semantic metadata embedded in it. XML can contain any kind of metadata. This is where XML differs from HTML. Where HTML was targeted to a single function - the presentation of information - XML fulfills a more universal purpose - the complete characterization of information.

    The key to creating useful XML is to create semantically meaningful XML first. The easiest way to do this is to simply create XML representations that are easy for humans to read and understand, which is what we did in the sample XML fragment. If you were to create a manual entry form for the XML fragment it would probably look something like Figure 2.

    Note that the XML fragment is a direct and obvious analog of how you would represent the information on a manual entry form. All we have done is represent the preprinted parts as tags and the filled-in parts as data elements. The hierarchy we created is likewise obvious. This method of creating grammatically valid XML may seem so simple as to hardly be worth mentioning, but the fact is that grammatically valid XML is relatively rare. To illustrate why, we will examine some common mistakes programmers make when creating XML.

    These examples will use an application dealing with colorimeter readings. A colorimeter is a device that measures color using tristimulus readings utilizing a number of color models. For example, a colorimeter can be used to measure computer monitor colors using red-green-blue components of light (this being the "RGB" color model). This particular colorimeter can provide readings in multiple resolutions. A manual form for transcribing colorimeter readings might look something like Figure 3 after being filled in.

    A typical way to characterize this information in XML follows:

    <device> DigiColor Reader </device>
    <patch> pure cyan </patch>
    <RGB resolution="8" red="0" green="255" blue="255" />

    Here, the information modeler has made a couple of optimizations in the interest of saving space. First, the "Color Model: RGB" field has been collapsed into the single tag, "<RGB>". This is a reasonable thing to do as the color model information arguably (and unambiguously) indicates what the readings are for, and therefore qualifies as true context. Second, the readings themselves have been collapsed into three attributes: "red=0", "green=255", and "blue=255". Expressing data elements as attributes is a common practice. Although syntactically correct, this approach brings a number of problems:

  • Attributes provide information about or information on how to interpret data elements in their scope. The readings are not attributes; they represent data from the colorimeter.
  • The first attribute "resolution=8" is needed to correctly interpret the readings. Attributes apply to items in their scope, not each other. We have a mixture of attributes and data expressed as attributes with no indication of how they relate.
  • All four attributes apply to nothing because there is nothing in their scope.
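
A quick check makes the last point concrete. The sketch below is only an illustration: it assumes Python's standard xml.etree.ElementTree and quotes the attribute values so that the fragment parses. It shows that the <RGB> element carries four attributes but has neither a data element nor a child element in its scope.

```python
import xml.etree.ElementTree as ET

# Attribute-only model from the text, with values quoted so it is well-formed.
rgb = ET.fromstring('<RGB resolution="8" red="0" green="255" blue="255" />')

attribute_names = sorted(rgb.attrib)      # everything arrives as an attribute
children_in_scope = len(rgb)              # child elements inside <RGB>
data_in_scope = (rgb.text or "").strip()  # character data inside <RGB>

# Nothing in the parsed result distinguishes the interpretation hint
# (resolution) from the three readings, and nothing is left in scope.
print(attribute_names, children_in_scope, repr(data_in_scope))
```

To a parser, resolution is just one more entry in the same attribute map as the readings, so any rule relating them has to live in application code rather than in the XML.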

    Another common way this form might be modeled is shown in Listing 1. Here, the entire contents of the form have been flattened into one level of hierarchy. There are several problems with this model:

  • Two of the data elements, "RGB" and "8", provide context that is necessary to interpret the readings correctly. They are not cast in a way that puts the readings in their scope.
  • Which reading values belong to which bands is ambiguous. Even though you could argue that the ordering implies grouping, a database may not know the difference and may return false query results. For example, suppose you wanted to return any colorimeter reading with a band of "green" and a value of "0". In this model there is no unambiguous indication of which band belongs to which reading, and this XML fragment contains both terms; so a database could falsely return this fragment even though it actually has a green band reading of 255.
  • This model is not extensible. If we wanted to include readings in another color model (CIE xyY, for example), there would not be a practical way to do so.
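
The false query result described above is easy to reproduce. The sketch below is a hedged illustration, not a reproduction of Listing 1: the flattened element names (colorimeter_reading, color_model, resolution, band, value) are assumptions chosen to match the description, and naive_match is a hypothetical stand-in for a term-based query that only checks whether both terms occur somewhere in the fragment.

```python
import xml.etree.ElementTree as ET

# Hypothetical flattened model: every item sits at one level of hierarchy.
FLAT = """
<colorimeter_reading>
  <device>DigiColor Reader</device>
  <patch>pure cyan</patch>
  <color_model>RGB</color_model>
  <resolution>8</resolution>
  <band>red</band><value>0</value>
  <band>green</band><value>255</value>
  <band>blue</band><value>255</value>
</colorimeter_reading>
"""

def naive_match(root, band, value):
    """Term-based query: true if both terms occur anywhere under the root."""
    bands = {e.text.strip() for e in root.iter("band")}
    values = {e.text.strip() for e in root.iter("value")}
    return band in bands and value in values

flat = ET.fromstring(FLAT)

# "green" and "0" both occur, so the fragment matches even though the
# green reading is actually 255.
false_hit = naive_match(flat, "green", "0")
```

The query cannot do better without relying on sibling order, which is exactly the kind of implied grouping the text warns against.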

    Listing 2 shows the most literal XML representation of the colorimeter readings form.

    Although this listing is grammatically correct, it is not optimal - nor is it particularly good style. This model represents another common practice: expressing context as data elements in name/value pairs rather than as tags. The weaknesses of this model include:

  • Two of the information items, "color_model" and "resolution", do not mean anything by themselves. They actually represent metadata necessary to correctly interpret and understand the meaning of the readings. Moreover, they are stand-alone (they have no scope), so their applicability is unspecified. The "patch" information item, on the other hand, is correctly cast as a separate item because it is not needed to determine how to interpret the readings; rather it specifies what the readings are for.
  • The "component" information items are not actually data. They represent the color bands for the readings, so they are really context.

    The major drawback to breaking information down into name/value pairs is that it can adversely affect performance, because simple queries against the information item result in join operations. This is often done to emulate the ability to support heterogeneous information in relational database management systems. Information modelers who find themselves doing a lot of this sort of thing may want to consider using an XML information management system that natively supports heterogeneity; performance and scalability will be much better.

    There's no one "perfect" XML information model for this application. By applying a few simple techniques, however, a model can be built that is unambiguous, performs well, and will serve all components of the application program. First we examine each information item in order to determine what it really is - data, context, or attribute. Going down the list of items in the original colorimeter reading form, the following interpretations and actions would certainly be reasonable:

  • "Colorimeter Reading" is the name of the form, so using this item as the root name is appropriate.
  • "Device: DigiColor Reader" represents a stand-alone information item about the specific equipment used to gather the readings, so it will be represented as a stand-alone tag/data element pair.
  • "Patch: pure cyan" represents a stand-alone information item about the entire dataset, so it will be represented as a stand-alone tag/data element pair.
  • "Color Model: RGB" specifies a context for all the readings. It could be cast as an attribute or as a tag. Because it specifies what the readings are in the aggregate, rather than how to interpret the readings individually, it will be included as a tag at the root of the readings.
  • "Resolution: 8-bit" directly applies to the interpretation of the readings below it, so it will be cast as an attribute encompassing all three readings.
  • The "Component" and "Reading" items are arguably redundant. You could cast these as tags, but in the interest of efficiency we will eliminate them.
  • The readings each consist of two parts: a color band component and the reading. We will make the color band components tags, as they are context for the readings. The readings will be cast as data elements.

    Converting all this into an XML fragment yields the following:

    <device> DigiColor Reader </device>
    <patch> pure cyan </patch>
    <color_model_RGB resolution="8">
      <red> 0 </red>
      <green> 255 </green>
      <blue> 255 </blue>
    </color_model_RGB>

    This model fulfills all the requirements of good grammar and good style, yet remains relatively terse:

  • All tags truly represent context and have the proper scope.
  • There are no meaningless or redundant tags.
  • All data elements truly represent data. No data elements are necessary for the proper interpretation of other data elements.
  • Attributes are cast such that they directly apply to the interpretation of all data elements in their scope and no others.
  • There are no tag/data element pairs masquerading as attributes.
  • The model is extensible. If we wanted to add readings in the CIE xyY model, for example, we could do so by simply adding a group called "<color_model_CIE_xyY>" to the XML fragment.
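
A short sketch shows how an application component might read this model. It rests on several assumptions that are labeled in the code: the root tag colorimeter_reading is taken from the form name, the attribute value is quoted and the group is closed so the fragment parses, the helper band_reading is hypothetical, and treating an 8-bit resolution as a full-scale value of 2^8 - 1 is one plausible interpretation rather than something the form specifies.

```python
import xml.etree.ElementTree as ET

# Assumed root name (from the form title); attribute quoted, group closed.
READING = """
<colorimeter_reading>
  <device> DigiColor Reader </device>
  <patch> pure cyan </patch>
  <color_model_RGB resolution="8">
    <red> 0 </red>
    <green> 255 </green>
    <blue> 255 </blue>
  </color_model_RGB>
</colorimeter_reading>
"""

def band_reading(root, model, band):
    """Return (raw value, full-scale value) for one band of one color model."""
    group = root.find(model)
    bits = int(group.get("resolution"))      # attribute: how to interpret
    raw = int(group.findtext(band).strip())  # data element: the reading itself
    return raw, (1 << bits) - 1              # assumption: full scale is 2^bits - 1

root = ET.fromstring(READING)

# The tag path identifies the band, so "green with a value of 0" cannot match.
green_raw, full_scale = band_reading(root, "color_model_RGB", "green")
green_is_zero = green_raw == 0

# Extending the model: a new, still empty group sits beside the RGB group
# without disturbing existing queries.
ET.SubElement(root, "color_model_CIE_xyY")
red_after_extension = band_reading(root, "color_model_RGB", "red")
```

Because each reading sits under its band tag and the resolution attribute encloses all three, the component needs no ordering conventions or extra domain knowledge to pair a band with its value.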

    So far, we have discussed creating semantically valid XML and how it applies to how application components interpret information. The next step involves how applications can be architected to leverage XML as a central information model in an unambiguous and provable environment. To accomplish this, XML must be used in a somewhat more stringent way than many application developers are accustomed to. From the discussions above we can establish a series of rules for modeling information in XML:

  • Data elements should be used to express only data, not metadata.
  • Tags should indicate what the data elements in their scope are, not how to interpret them.
  • Tags should apply directly to all data elements in their scope.
  • Tag hierarchies should clearly group information items belonging together.
  • Tag hierarchies should clearly separate information items not directly related to each other.
  • Attributes should indicate how to interpret or provide information about information items in their scope.
  • Attributes should not be critical to interpreting what the information items in their scope are.
  • Attributes should apply directly to all information items in their scope, and no others.
  • Attributes should be understood by all application components that will encounter them.

    The last rule listed is very important and bears some explanation. XML is very flexible; in fact, it is too flexible in the way it is often used. The one practice most responsible for making XML information models difficult to prove and control is the abuse of attributes. Developers regularly use tag/data element pairs and attributes interchangeably in the interest of avoiding data bloat (an unnecessary practice, because even trivial compression techniques can eliminate that problem). This brings an unfortunate ambiguity to information expressed in XML.

    Attributes are intended to provide information about items in their scope or to indicate how to interpret them. If an application program encounters a tag that it does not understand, it will ignore everything in that tag's scope, including all attributes. If an application program recognizes all the tags leading up to a data element, it knows what the data element is; but if it encounters an attribute that it does not understand, it may or may not know how to interpret the data element, and the situation is ambiguous.

    In order to enforce information integrity controls we need a provable, unambiguous mechanism. Attributes and XML Schema definitions provide suitable mechanisms, as long as an exception is thrown whenever an attribute that is not understood is encountered. To control XML information we need to use a combination of attribute interpretation and schema validation. This gives us an arbitrary and fine-grained degree of control unachievable with traditional databases. This should be done once, with a centrally controlled mechanism.

    To accomplish this we need to architect enterprise applications in a new way. Traditionally, application programs serve as user interfaces and database systems manage and control data. In the XML world, a three-component architecture should be employed: the user-facing application, the XML information management system, and an information integrity enforcer (schema validator and attribute interpreter) (see Figure 4).

    The information integrity enforcer can be implemented in a number of ways: as a server-side extension, as a standard application program component, or as a Web service. The trick is to have a common information integrity enforcer for each category of application. In order to have applications interact with XML information in a consistent and provable manner, the following model can be adopted:

  • If all tags leading up to a data element are understood, the application can assume it can correctly determine what that data element represents.
  • If a tag leading up to a data element is not understood, the application must assume it does not have a complete context for that data element.
  • An application component may ignore information containing a tag that is not understood.
  • If all attributes having a data element in their scope are understood, the application component can assume it can correctly interpret that data element.
  • An application must not assume it can correctly interpret any data elements that are in the scope of an attribute that is not understood, and it must not violate such an attribute. It is up to the information integrity enforcer module (whether a separate module or an integral part of the application) to indicate that such an attribute has been encountered and to which information items it applies. Such exceptions should be treated as catastrophic insofar as the information items in question are concerned.
  • An application must be able to interpret attributes and their contents unless there is a specific rule stipulating that an attribute can be ignored (a font attribute, for example, may be ignored by application components not responsible for presentation).
  • Schema validation violations should be treated as potentially catastrophic information integrity failures.
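
The tag and attribute rules above can be sketched as a small attribute interpreter. This is a minimal illustration under stated assumptions, not a definitive enforcer: the vocabularies KNOWN_TAGS, KNOWN_ATTRIBUTES, and IGNORABLE_ATTRIBUTES, the function readable_data, the exception UnknownAttributeError, and the sample "gamma" and "notes" names are all hypothetical. Schema validation is left out because Python's standard library does not provide an XML Schema validator, and for simplicity the exception aborts the whole traversal rather than only the affected items.

```python
import xml.etree.ElementTree as ET

class UnknownAttributeError(Exception):
    """An attribute is neither understood nor explicitly ignorable."""

# Hypothetical vocabularies for one application component.
KNOWN_TAGS = {"colorimeter_reading", "device", "patch",
              "color_model_RGB", "red", "green", "blue"}
KNOWN_ATTRIBUTES = {"resolution"}
IGNORABLE_ATTRIBUTES = {"font"}  # presentation only; safe to skip

def readable_data(elem, path=()):
    """Yield (tag path, text) for data elements that may be safely interpreted."""
    if elem.tag not in KNOWN_TAGS:
        return  # unknown tag: ignore everything in its scope
    for name in elem.attrib:
        if name not in KNOWN_ATTRIBUTES and name not in IGNORABLE_ATTRIBUTES:
            # unknown attribute: items in its scope cannot be interpreted
            raise UnknownAttributeError(f"{name!r} on <{elem.tag}>")
    here = path + (elem.tag,)
    if len(elem) == 0:
        yield here, (elem.text or "").strip()
    for child in elem:
        yield from readable_data(child, here)

# An ignorable attribute and an unknown tag are both tolerated.
ok = ET.fromstring(
    '<colorimeter_reading><patch font="bold"> pure cyan </patch>'
    '<notes><red>9</red></notes></colorimeter_reading>')
accepted = list(readable_data(ok))

# An attribute that is not understood stops interpretation with an exception.
bad = ET.fromstring('<color_model_RGB gamma="2.2"><red>0</red></color_model_RGB>')
try:
    list(readable_data(bad))
    raised = False
except UnknownAttributeError:
    raised = True
```

Keeping these checks in one shared module, rather than repeating them in each component, matches the idea of a common information integrity enforcer for each category of application.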

    Following these XML information modeling guidelines is not substantially more difficult than building semantically weak XML, and the payback can be considerable. Not only can XML be used to integrate disparate data sources, but it can be used as a unified means to express the information underlying functional components of application programs as well. This results in better application programs that can be built faster and maintained with less effort. One thing is inevitable: as information technology progresses into things like the semantic Web and more and more computer systems need to be integrated, information expressed in XML will have to become increasingly semantically valid in order for us to be able to keep up. XML already contains all the elements necessary to do this - we just have to be a little more thoughtful about the ways we use it.
