HTML, XHTML, semantics and the future of the web

Abstract

History: Presented by John at Open Publish 29th July, 2004.

Westciv's John Allsopp clarifies exactly what XHTML is, explains why you need to be learning about it from today, and steps through the process of transitioning to the standards based way of marking up for the web, and beyond.

Introduction

I'll begin with a simple question. If you answer right we might be able to all go home early.

Who here is developing content that is valid XHTML 1 or 1.1? Who here is thinking about it? Who here thinks it is a waste of time and energy? Who here is thinking WTF is XHTML?

For more than a decade, content publishing on the web has largely been an ad hoc process. HTML has grown organically, anarchically, largely focussed on presentation and human readers. This has benefits, but also significant downsides.

The HTML of the end of the 1990s had become large, cumbersome to process, presentation oriented. This had dramatic impacts on accessibility and the web's device independence. Rather than being "access to information by anyone, anywhere, regardless of disabilities" the web had become "access to anyone with a browser on a PC with reasonable eyesight and few if any disabilities."

Recognizing this, the World Wide Web Consortium, founded by now Sir Timothy Berners-Lee (inventor of the Web, HTML and HTTP), developed XHTML.

The underlying aim of this project is to build a solid foundation for the future of content on the World Wide Web. It is expressly designed to overcome many of the issues with device dependence and accessibility that developed along with the ad hoc development of HTML.

What is XHTML?

Like HTML, XHTML is a markup language. It has the same expressive capabilities as HTML, but a stricter syntax. This is because XHTML is an application of XML, while HTML is an application of the much older, pre-web SGML. XML is an international standard for defining other markup languages. It was designed as a simplified, newer-generation subset of SGML, specifically for markup languages in a networked world.

What is the point of XHTML?

XHTML's purpose is first and foremost to be HTML as an application of XML. With this come the benefits associated with all languages which are applications of XML.

XHTML is a much simpler, cleaner language than HTML. This makes learning, coding, and maintaining XHTML code much more straightforward. XHTML was designed explicitly as a device independent language. HTML 4 is a large, complex language. For many web enabled devices, such as mobile phones, with limited computing power, the complexity of HTML presents significant obstacles. XHTML was explicitly designed with accessibility in mind. It is not an afterthought.

One benefit that is not widely understood is that, as an application of XML, XHTML is XSL ready. XSL, the Extensible Stylesheet Language, not only allows for the styling of XML documents; more importantly, it allows for the transformation of documents.
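To make the idea of transformation concrete, here is a minimal sketch. It is not XSLT itself: Python's standard library has no XSLT processor, so ElementTree stands in for one. The sample page and the choice of output (a plain-text outline of the headings) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# A small, hypothetical XHTML document.
page = f"""<html xmlns="{XHTML_NS}">
  <head><title>Example</title></head>
  <body>
    <h1>XHTML</h1>
    <p>Some text.</p>
    <h2>Syntax</h2>
    <p>More text.</p>
    <h2>Semantics</h2>
  </body>
</html>"""

root = ET.fromstring(page)

# Transform the document into a plain-text outline of its headings.
outline = []
for element in root.iter():
    # Strip the namespace prefix ElementTree adds to each tag name.
    name = element.tag.replace(f"{{{XHTML_NS}}}", "")
    if name in ("h1", "h2", "h3"):
        indent = "  " * (int(name[1]) - 1)
        outline.append(f"{indent}{element.text}")

print("\n".join(outline))
```

Because the page is well-formed XML, any XML tool can read it this way. Tag-soup HTML would first need a forgiving, HTML-specific parser.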

But above all, XHTML is the future of the World Wide Web, at least for many years to come. There will be no new versions of HTML. XHTML works on any reasonably modern web browser today (that would mean just about every one in use). And it will work on all web browsers for the foreseeable future.

XHTML versus HTML

The similarities

XHTML is designed to be as much like HTML as possible. It has the same "semantics" as HTML, meaning that it allows the use of the same elements (such as headings, paragraphs and lists), and its syntax is very similar to HTML's. Without looking closely you'd never see the difference in syntax.

The differences

So how do the languages differ? In essence, syntactically. While the syntax is very similar, there are some important differences.

HTML is not case sensitive. XHTML is case sensitive, and all element and attribute names must be in lower case.
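A quick way to see case sensitivity at work is to hand markup to an XML parser. The sketch below uses Python's standard ElementTree module. The helper name is_well_formed is made up for this example, and it checks XML well-formedness only, not validity against an XHTML document type.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# An XML parser treats <P> and <p> as different names, so a start tag
# and an end tag that differ only in case do not match.
mixed_case = "<P>Hello</p>"
lower_case = "<p>Hello</p>"

print(is_well_formed(mixed_case))  # False
print(is_well_formed(lower_case))  # True
```

Note that `<P>Hello</P>` would also pass this check, since it is well-formed XML. It is the XHTML document type, which defines its element names in lower case, that rules it out.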

In HTML, some elements, such as paragraphs and list items can be implicitly closed. That means you may leave off the closing tag and you still have valid HTML. In XHTML, all elements must be explicitly closed.
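As a rough illustration, the same list can be written both ways and given to an XML parser. The sample markup is invented for this sketch. The parser here checks well-formedness only, which is enough to show why the HTML shorthand cannot survive in XHTML.

```python
import xml.etree.ElementTree as ET

# Valid HTML 4: the closing </li> tags are optional.
html_style = "<ul><li>First<li>Second</ul>"

# The same list in XHTML: every element is explicitly closed.
xhtml_style = "<ul><li>First</li><li>Second</li></ul>"

try:
    ET.fromstring(html_style)
    html_error = None
except ET.ParseError as err:
    # The parser reaches </ul> while an <li> is still open.
    html_error = str(err)

items = [li.text for li in ET.fromstring(xhtml_style).findall("li")]

print(html_error)
print(items)  # ['First', 'Second']
```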

Empty elements have a slightly different syntax. In HTML, an empty element looks like the start tag for any other element. In XHTML, empty elements are self-closing: they include a trailing slash before the closing angle bracket. Cunningly, HTML browsers still accept this form, although a space before the slash is recommended for compatibility with older browsers.
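The self-closing form is what XML tools produce on their own. In this small sketch, using Python's ElementTree with invented sample text, a paragraph containing a line break is built and serialized, and the HTML-style form is then tried for comparison.

```python
import xml.etree.ElementTree as ET

# Build a paragraph containing a line break, which is an empty element.
para = ET.Element("p")
para.text = "Line one"
br = ET.SubElement(para, "br")
br.tail = "Line two"

# The serializer writes the empty element in self-closing form,
# including the space before the slash.
serialized = ET.tostring(para, encoding="unicode")
print(serialized)  # <p>Line one<br />Line two</p>

# The HTML form, with a bare <br>, is rejected by the same parser.
try:
    ET.fromstring("<p>Line one<br>Line two</p>")
    html_form_ok = True
except ET.ParseError:
    html_form_ok = False

print(html_form_ok)  # False
```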

There are other small but important syntax differences as well. In XHTML, all attribute values must be quoted; in HTML, attribute values without spaces do not have to be. Also, and this trips up many developers, HTML browsers tolerate unescaped special characters, such as the ampersand, in attribute values. In XHTML, such characters must be escaped in attribute values, just as they must be in the content of an element.
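Quoting and escaping are easy to get wrong by hand, so it can help to let a library do it. The sketch below uses the escape and quoteattr helpers from Python's standard xml.sax.saxutils module. The URL and link text are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

url = "search?q=xhtml&lang=en"  # hypothetical link target

# quoteattr adds the surrounding quotes and escapes the ampersand.
href = quoteattr(url)
print(href)  # "search?q=xhtml&amp;lang=en"

# escape does the same job for element content.
label = escape("Fish & chips")
print(label)  # Fish &amp; chips

# The escaped markup parses cleanly, and the parser hands back the
# original, unescaped value.
link = ET.fromstring(f"<a href={href}>{label}</a>")
print(link.get("href"))  # search?q=xhtml&lang=en
```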

Transitioning to XHTML

One of the major steps in transitioning to XHTML is the need for a change in outlook. HTML is seen by most developers and content creators as purely a human-oriented technology. But increasingly, machine processing, whether by search engines, accessibility devices such as screen readers, or browsers themselves, is becoming as important for online information as the human readers of content. HTML developers still largely conceive of their content in terms of how it will appear as rendered in a browser to human readers. XHTML is much more oriented toward the information architecture of content, to the structure and semantics of the information.

Consequently, most HTML content is, quite frankly, a mish-mash of invalid pseudo-HTML, which "works" in a couple of major browsers (or probably only in IE6 for Windows). This is a disaster in terms of device independence and accessibility, as well as forward compatibility of content. Such sites are largely inaccessible to machine processing, which, with the increasing sophistication of search engines such as Google and similar information processing services, is popularity suicide.

What should we be doing to transition to XHTML? There are a number of steps in the process of transforming your HTML content to XHTML. These can be done one at a time, or simultaneously.

Step 1: understand document types

First, we have to recognize that there are in fact several flavors, or Document Types, of XHTML: Transitional, Frameset and Strict. These match the three HTML Document Types. For now let's concentrate on two, Transitional (or "loose") and Strict. A document type defines the rules of a version of XHTML or HTML.

The transitional document type contains all elements and attributes of HTML, including those which are "deprecated". Deprecated elements are those which will be dropped from future versions of the language. For maximum forward compatibility it is sensible to avoid deprecated elements, and so the transitional document type. The most significant of the deprecated aspects of HTML are its presentational elements and attributes. XHTML documents should not contain information about their presentation. This is left to Cascading Style Sheets.

The Strict document type, as you might have guessed, does not contain these deprecated elements.

Step 2: use a document type

First, you choose a document type. Then declare which type you are using in your document. Then you can check the validity of the document against its document type.
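For reference, this is roughly what a declared document type looks like in practice. The sketch shows a minimal XHTML 1.0 Strict page, with invented content, and reads the declaration back using minidom from Python's standard library. The public and system identifiers are those published for XHTML 1.0 Strict.

```python
from xml.dom import minidom

# A minimal XHTML 1.0 Strict document. The DOCTYPE at the top declares
# which document type the page claims to conform to.
document = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head><title>Example page</title></head>
  <body><p>Hello, XHTML.</p></body>
</html>"""

# Parsing reads the declaration but does not validate against the DTD.
doctype = minidom.parseString(document).doctype
print(doctype.name)      # html
print(doctype.publicId)  # -//W3C//DTD XHTML 1.0 Strict//EN
print(doctype.systemId)  # http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd
```

For the Transitional document type, the two identifiers become "-//W3C//DTD XHTML 1.0 Transitional//EN" and "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd".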

This way you can ensure valid documents. Validators will warn you when the document does not conform to the document type, making it much simpler to debug your XHTML content.

Step 3: transform to XHTML syntax

Having chosen and declared the document type, we need to get our HTML content to conform to the XHTML syntax. This involves closing any optionally closed elements, and self-closing empty elements. This can be done by hand, but the W3C has a tool, HTML Tidy, now available on many platforms, for transforming HTML to XHTML syntax.

The next step is to remove presentational elements and attributes. This particularly includes elements such as font, b (bold) and i (italic), and attributes such as bgcolor and other color attributes. HTML Tidy will also help in this process.
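Before removing presentational markup, it helps to know where it is. The following is a minimal sketch of a checker that lists such markup in a fragment. The function name and the two sets are assumptions based only on the examples named above, not a complete list of deprecated markup, and the sample fragment is invented. It also assumes the fragment is already well-formed and has no XML namespace declared.

```python
import xml.etree.ElementTree as ET

# Illustrative sets based on the examples named in the text; this is
# not a complete list of deprecated or presentational markup.
PRESENTATIONAL_ELEMENTS = {"font", "b", "i"}
PRESENTATIONAL_ATTRIBUTES = {"bgcolor", "color"}


def find_presentational_markup(markup: str) -> list[str]:
    """List presentational elements and attributes found in the markup."""
    findings = []
    for element in ET.fromstring(markup).iter():
        if element.tag in PRESENTATIONAL_ELEMENTS:
            findings.append(f"element <{element.tag}>")
        for name in element.attrib:
            if name in PRESENTATIONAL_ATTRIBUTES:
                findings.append(f"attribute {name} on <{element.tag}>")
    return findings


sample = (
    '<body bgcolor="#ffffff">'
    '<p><font color="red"><b>Sale!</b></font> Ends soon.</p>'
    "</body>"
)
for finding in find_presentational_markup(sample):
    print(finding)
```

Each finding is a candidate for replacement by a CSS rule.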

To add presentation to your XHTML documents, you'll need to use CSS. That's a little outside the scope of this presentation, but I've included some references in the presentation materials.

Step 4: validating against the doc type

The last significant step in the transition is to validate your document against the DTD. The W3C provides an XHTML validator for this, as do other organisations.
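A cheap check to run before submitting a page to a validator is whether the page is well-formed XML at all. The sketch below does only that. It is not DTD validation, which needs a validating parser or a service such as the W3C's. The function name first_syntax_error and the sample markup are made up for this example.

```python
import xml.etree.ElementTree as ET


def first_syntax_error(markup: str):
    """Return (line, message) for the first well-formedness error, or None."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as err:
        line, _column = err.position
        return line, str(err)
    return None


# The paragraph on line 3 is never closed.
broken = "<html>\n<body>\n<p>unclosed\n</body>\n</html>"
fixed = "<html>\n<body>\n<p>closed</p>\n</body>\n</html>"

print(first_syntax_error(broken))
print(first_syntax_error(fixed))  # None
```

The parser stops at the first problem it meets, so a long page may need several passes before it parses cleanly.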

When you first validate an XHTML document, particularly one that might have been updated several times and transformed from HTML, the initial results can be alarming. Those of us who remember word processing before interactive spell checking (where we batch spell checked our documents at the end of a draft, only to find dozens of mistakes) may know that sinking feeling when we have dozens or even hundreds of mistakes. This, however, is a learning process, where developers quickly identify mistakes they may have been making for a long time.

The validation process is a form of quality assurance. Developers usually quickly learn the common pitfalls and adopt much better coding practices.

The benefits

The process of future proofing your web content by transitioning to XHTML may initially appear daunting. The fact is that developers and organizations usually find it less difficult and more rewarding than expected.

With the increasing use of a wide range of devices, and the growing emphasis on accessibility, as well as the increasing sophistication of web processing by search engines and other services, can you afford to continue with outdated content development and management practices?

John Allsopp is a director at westciv and the lead developer of Style Master CSS editor. He writes widely on web standards and software development issues and maintains the blog dog or higher.
